
Test precisely for specific compile-time errors and their source positions #36766


Closed
3 tasks
whesse opened this issue Apr 26, 2019 · 17 comments
Labels
area-infrastructure Use area-infrastructure for SDK infrastructure issues, like continuous integration bot changes. area-test Cross-cutting test issues (use area- labels for specific failures; not used for package:test). type-enhancement A request for a change that isn't a bug

Comments

@whesse
Contributor

whesse commented Apr 26, 2019

Meta-issue to discuss and implement more precise testing for compile-time errors.

Testing for compile-time errors with test.py mainly uses multitests, since they have a way of specifying that the expected result of running a test is a compile-time failure, and they allow testing for multiple cases of compile failure in a single test file.

What is missing in test.py's evaluation of the result of running a (multi)test is a check that the compile-time error is the expected error and that it occurs at the right source position in the (multi)test.

If we have a standard for specifying the expected error and its position in the test source, and standards for how our various compilers and parsers report these errors, then we can put the logic that checks whether a test's result matches the expectation, including the error type and position, into the test.py (tools/testing/dart) test framework.

We could also consider putting it into other testing frameworks, like package:test and package:testing.

I will start a checklist of the topics we need to consider, and we can create related issues for the subtopics as we flesh them out. General comments on the topic should be posted as replies or written in a document.

  • Standard way of marking compile-time errors with their type and position
  • Standardizing the way that our tools report these errors, so that test.py can have simpler code to check for them
  • Creating new multitests for static errors using this new syntax, while remaining compatible with existing tests that are unchanged.

@munificent @jcollins-g @sigmundch @whesse

@whesse whesse added area-test Cross-cutting test issues (use area- labels for specific failures; not used for package:test). area-infrastructure Use area-infrastructure for SDK infrastructure issues, like continuous integration bot changes. type-enhancement A request for a change that isn't a bug labels Apr 26, 2019
@whesse whesse self-assigned this Apr 26, 2019
@whesse
Contributor Author

whesse commented Apr 26, 2019

An example of a request for this capability is #31883

@bwilkerson
Member

This would be great! I have often been concerned that analyzer might be passing (or failing) some tests for the wrong reason.

It isn't clear what you're including as part of "Standardizing the way that our tools report these errors". It sounds kind of like you're asking for a standard output format so that it's easier for test.py to parse and use the information in the output. If so, I agree that that would be useful.

But I think there is another topic to address. You might also be including it as part of that line item, but if so I think it might be worth calling it out separately. And if not, then I think you need to add it.

Specifically, we need to get to a point where analyzer and the front end are producing the same errors (with the same identifiers) in the same places. Aside from the errors produced by the parser (which are unified by definition, though the identifiers might currently differ), there is, unfortunately, a significant amount of work to get to the point where the two tools agree. I really want to see them producing consistent, high quality diagnostics, and I think it's worth the effort it will take, but it isn't likely to happen without a concerted effort.

@sigmundch
Member

... analyzer and the front end are producing the same errors (with the same identifiers) in the same places.

That would be awesome long term! I personally would like us to not only be consistent, but also to have a single source of truth for all messages (e.g., generalize front_end/messages.yaml to eventually contain all messages from all tools; maybe we could also generate C++ code so that the VM can use it to report errors as well!)

However, I hope we can roll out the compile-time error checks more gradually. For instance, first note the exact offset where an error is reported, and later add support for checking that a specific error message is given?


Standardizing the way that our tools report these errors

Some ideas:

  • add to each tool a flag to produce errors in a standard machine-readable format (analyzer already does this); see the sketch after this list
  • provide an error-message parser for each tool, so we can convert its output to a standard format
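
To make the first idea concrete, here is a minimal, hypothetical sketch of what such a normalized record might look like; the class and field names are just for illustration and do not correspond to any existing tool's output format:

import 'dart:convert';

// Hypothetical normalized error record; the field names are illustrative only.
class ReportedError {
  final String code;    // e.g. "AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH"
  final String uri;     // file in which the error was reported
  final int offset;     // zero-based character offset of the error range
  final int length;     // length of the error range
  final String message; // human-readable message text

  ReportedError(this.code, this.uri, this.offset, this.length, this.message);

  Map<String, Object> toJson() =>
      {'code': code, 'uri': uri, 'offset': offset, 'length': length, 'message': message};
}

void main() {
  // Each tool (or a small per-tool adapter) would emit one such record per
  // error, which the test framework could compare against test annotations.
  final error = ReportedError('AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH',
      'spread_test.dart', 52, 16, 'The literal is ambiguous.');
  print(jsonEncode(error.toJson()));
}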

Standard way of marking compile-time errors with their type and position

Today analyzer, front_end, and dart2js each have tests that look for specific data at specific locations in a test file. The logic is different in each tool, but my hope is that we can leverage some of the existing work to solve this task.

Some pointers:

  • From analyzer: a good example is inferred_type_test.dart; it has many example snippets where errors are specified in comments of the form /*error:ERROR_ID*/ in the code. After parsing, the location of those errors is extracted and checked against what analyzer produces. The extraction uses the token stream, so error locations are associated with the location of the following token (a rough sketch of this style of extraction appears after this list).

  • From front_end: good examples are the strong-mode inference tests. The extraction logic here is more configurable and was used for testing multiple features. /cc @stereotype441 who authored this.

  • From dart2js: good examples are the stack-trace/source-map tests (see the deep_inlining.dart test). Unlike the other two, the extraction logic here is entirely text-based, which requires being a bit more careful about the effect of the comment on the offset information, but it might be more flexible if we are testing syntax errors. I feel that using something like what we have in analyzer or front_end is probably a better route to take, though.
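
To make the analyzer-style markers concrete, a minimal, purely text-based sketch of that style of extraction might look like this (the real extraction walks the token stream; the function name and regex here are just for illustration):

// Hypothetical, purely text-based extraction: find /*error:ERROR_ID*/ markers
// and associate each error id with the offset just past the comment, which is
// roughly the offset of the following token.
Map<int, String> findErrorMarkers(String source) {
  final marker = RegExp(r'/\*error:(\w+)\*/');
  final expectations = <int, String>{};
  for (final match in marker.allMatches(source)) {
    expectations[match.end] = match.group(1)!;
  }
  return expectations;
}

void main() {
  const source = "int x = /*error:INVALID_ASSIGNMENT*/'not an int';";
  findErrorMarkers(source)
      .forEach((offset, id) => print('expect $id at offset $offset'));
}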

@bwilkerson
Member

From analyzer: a good example is inferred_type_test.dart ...

Actually, that's an uncommon way for the analyzer tests to be written. Most of them follow, or are being converted to follow, the pattern in ambiguous_set_or_map_literal_test.dart. The problem with the one you picked is that you can only specify the offsets of the diagnostics, and I want our tests (not necessarily the shared tests) to also test the length of the range.

@munificent
Member

Yes, I love everything about this issue. I should have a good chunk of time after IO to work on this, and nothing would make me happier than to harvest the best of what the existing tools have and consolidate onto a single solution.

Most of them follow, or are being converted to follow, the pattern in ambiguous_set_or_map_literal_test.dart.

Tracking length is useful, but having to hand-author those numbers looks tricky. As a total bikeshed, I recently hacked together a little error test framework for a hobby project that lets you author tests like this:

Map<int, int> map;
Set<int> set;
var c = {...set, ...map};
//      ^^^^^^^^^^^^^^^^ AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH

Extending it to multi-line errors might be a little harder, but something like this is probably easier to maintain than having to hand-count characters.
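
For what it's worth, here is a rough sketch of how those caret annotations could be extracted, assuming a single-line comment of the form // ^^^^ ERROR_ID marks a range at the same column on the preceding source line (all names here are just illustrative, not my actual hobby-project code):

// Hypothetical expectation extracted from a caret comment.
class ErrorExpectation {
  final int offset;
  final int length;
  final String id;
  ErrorExpectation(this.offset, this.length, this.id);
  @override
  String toString() => '$id at offset $offset, length $length';
}

// Parse lines of the form "//   ^^^^ ERROR_ID"; the carets mark a range at
// the same column on the preceding source line.
List<ErrorExpectation> parseCaretAnnotations(String source) {
  final lines = source.split('\n');
  final lineStarts = <int>[];
  var offset = 0;
  for (final line in lines) {
    lineStarts.add(offset);
    offset += line.length + 1; // +1 for the newline
  }
  final pattern = RegExp(r'^(\s*//\s*)(\^+)\s+(\w+)\s*$');
  final expectations = <ErrorExpectation>[];
  for (var i = 1; i < lines.length; i++) {
    final match = pattern.firstMatch(lines[i]);
    if (match == null) continue;
    // The carets start right after the comment prefix; that same column on
    // the previous line is where the error range begins.
    final column = match.group(1)!.length;
    expectations.add(ErrorExpectation(
        lineStarts[i - 1] + column, match.group(2)!.length, match.group(3)!));
  }
  return expectations;
}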

@bwilkerson
Member

Tracking length is useful, but having to hand-author those numbers looks tricky.

It is. Which is why the failure message prints the expected values as source code so that we can copy/paste the new values in if the offset or length changes for valid reasons.

@munificent
Member

Which is why the failure message prints the expected values as source code so that we can copy/paste the new values in if the offset or length changes for valid reasons.

Ah, smart. 👍

@sigmundch
Member

Which is why the failure message prints the expected values as source code so that we can copy/paste the new values in if the offset or length changes for valid reasons.

We should consider having a mode where the test runner updates the test expectations in place. This might save you the copy/paste trouble (we don't have this for some of our dart2js tests and I keep wishing we did). When I have used something similar (e.g. with CFE .expect files), the diff in a code review is then perfect for verifying that the change was intentional.

@kmillikin

I have some questions.

Where is the specification of errors and line and column positions? Without a specification we don't have anything to test conformance to. We don't have a way to determine which, if any, of conflicting implementations is correct. We can't use a contractor to write tests for us.

What is the semantics of static errors? I guess a program that contains a single static error must signal that one statically for a suitable definition of "statically". What about a program with multiple static errors? Do they all need to be signaled? Is it enough to signal at least one of them? We should test programs with multiple errors, because these definitely occur in the real world.

When the language team designs new features will they also specify static error messages and positions? Will they please go back and add these to the existing static errors in the spec? If not, then the test suite (or one of the implementations) is our only spec.

Do we expect intentional violations of the error specification on the part of the implementations? How serious are these when they occur?

What is the process of changing the error specification (say, to improve an error message)? What parties have to sign off on it? Will it be considered a breaking change to start failing in a different way?

@bwilkerson
Member

Good questions!

Where is the specification of errors and line and column positions?

There isn't one, at least not yet. We did get a good start at coming up with some guidelines to help answer some of these questions, and it might be useful to finish that work for this purpose.

What is the semantics of static errors? I guess a program that contains a single static error must signal that one statically for a suitable definition of "statically". What about a program with multiple static errors? Do they all need to be signaled? Is it enough to signal at least one of them? We should test programs with multiple errors, because these definitely occur in the real world.

Based on conversations that I've had with the language team, my understanding is that when the specification defines an error the only requirement it places on implementations is that some diagnostic be generated. The definition of an error does not dictate which diagnostic should be produced, or how many diagnostics should be produced, or what code fragment they should reference. (From that perspective, what we have today is actually a complete test framework.)

We do need to consider code with multiple errors. Do we want to support the testing of tools that ignore some portion of the remaining code when they encounter an error? If so, perhaps we can only specify the first diagnostic we expect to have produced. If not, then I would guess that we would annotate all of the expected errors.

But that raises another interesting question. The number of diagnostics is strongly dependent on the way a tool recovers in the face of an error. Will the test framework allow (ignore) diagnostics that are generated that are not annotated? While I'm completely in favor of eliminating follow-on errors, there's a fair bit of work involved in doing so (at least for analyzer) and I don't expect that we'll have the resources for that in the near future.

When the language team designs new features will they also specify static error messages and positions?

Based on conversations I've had with them I don't think they want to do that.

If not, then the test suite (or one of the implementations) is our only spec.

Well, technically, we could create a "diagnostics" team whose responsibility is to create such a specification. It would, presumably, be a specification that we hold ourselves to without requiring that other tooling necessarily follow it.

Do we expect intentional violations of the error specification on the part of the implementations? How serious are these when they occur?

What is the process of changing the error specification (say, to improve an error message)? What parties have to sign off on it? Will it be considered a breaking change to start failing in a different way?

I think one of the questions here is how we balance the desire for consistency between the tools with the needs of the implementation teams to change the implementation in ways that might impact the way diagnostics are produced. I don't have an answer for that, but personally I tend to place more importance on providing a good UX, and I think that inconsistent error messaging hurts the UX.

I think the bigger question is what it is that we're trying to accomplish (either short term or long term). Is this an effort to ensure consistency in messaging between our tools? (I'm personally in favor of consistency, but that doesn't mean that this is necessarily the right way to achieve that goal.) Are we trying to reduce or eliminate the possibility that tests are passing for the wrong reason? (Also a good goal, in my opinion.)

@eernstg
Member

eernstg commented May 1, 2019

@whesse wrote:

check that the compile-time error is the expected error,
and that it occurs at the right source code position

We might be happy with a smaller step: just standardize the reported errors enough that tools like test.py can confirm that the expected error is raised.

We would then ignore the position of the error, and we'd ignore additional errors that might arise because of recovery efforts. This would allow us to avoid a significant amount of complexity.

With the current multi-test approach, it is reasonable to expect each subtest with a compile-time error expectation to process a program that contains exactly one error (in the sense that the none variant has no error, and the relevant subtest adds a few lines of code to the none variant, and we can strongly recommend that this causes the program to have exactly one error, insofar as that is well-defined). This basically allows us to ignore how tools deal with multiple errors, and that's another way to avoid complexity.

Wouldn't that be worth aiming for at first?

@munificent
Member

Without a specification we don't have anything to test conformance to. We don't have a way to determine which, if any, of conflicting implementations is correct.

All of this is equally true of the core libraries, and yet we have tests for those which are shared across the team and implemented by contractors, so this doesn't seem like a problem to me.

You can think of this as less of a conformance test than a regression test. Each implementer of the errors ships something that end users actually do experience and care about. Presumably, we would want to know if that experience spontaneously changed or degraded in some way. If, for example, every type error accidentally got reported as "Syntax error", we'd probably want to know before we shipped that to users.

Likewise, our users don't know or care which front end generates the errors in the tools they use. From their perspective that is an implementation detail. If, say, analyzer moves from its own implementation of type checking to CFE's, users should neither know nor care. That again implies that we should test what that behavior is precisely and ensure our tools are consistent with each other, even if there is no independent specification for them to be consistent towards.

What is the semantics of static errors? I guess a program that contains a single static error must signal that one statically for a suitable definition of "statically".

The language spec has a minimal definition of this that includes only what is necessary for the language spec itself to be self-consistent. It defines that some constructs are compile-time errors and that any program containing at least one of those constructs has no runtime semantics. Otherwise, it would be obliged to define those runtime semantics, which we don't want to do.

Anything beyond that is outside the purview of the spec, but it is still relevant to our users. Our users are not invoking our tools purely to see how well they implement the specs; they are actually trying to write applications and find and fix bugs in their code.

What about a program with multiple static errors? Do they all need to be signaled? Is it enough to signal at least one of them?

Any tool that reports static errors and wants to have a good user experience needs to do error recovery and report as many distinct errors as it can while not reporting false positives that are cascaded from earlier errors. I think we want our tools to have a good user experience, so that applies to us too.

Likewise, I presume we do not want to spontaneously degrade that user experience without realizing it, which implies that we should have tests of that behavior. If you accept that, from the user's perspective, it is an implementation detail which of our front ends reports these errors, that implies we should share those tests across our front ends so that we ensure they provide the same errors.

We should test programs with multiple errors, because these definitely occur in the real world.

+1!

When the language team designs new features will they also specify static error messages and positions?

No, I don't think we plan to do that. We don't specify the behavior of List.add(), either, but that has tests which are shared across our implementations.

Do we expect intentional violations of the error specification on the part of the implementations? How serious are these when they occur?

Given our team's history, I think we'd all be surprised if every implementation behaved exactly the same. :) But I think that should be a reasonable goal for us to aim for. I don't think it provides much end user value to have different tools report different errors, and I'm not aware of technical reasons that would force them to.

What is the process of changing the error specification (say, to improve an error message)? What parties have to sign off on it?

I wouldn't expect this to be a big deal. We all want a great user experience, so if the various front end folks — there aren't that many of them — agree to a change, then I don't see why they couldn't land it.

Will it be considered a breaking change to start failing in a different way?

Turning a warning into an error is a breaking change. But turning an error of one kind into another or changing its location isn't "breaking", as far as I know. We have acted like we are free to change the diagnostic output of our tools even though in theory a user could have a script that relies on that output.

I think the bigger question is what it is that we're trying to accomplish (either short term or long term). Is this an effort to ensure consistency in messaging between our tools?

Personally, yes, that's a goal, but not a primary one.

Are we trying to reduce or eliminate the possibility that tests are passing for the wrong reason?

That is a high priority for me. This has always been a serious problem. During the Dart 2 test migration, I found many multitests that had been untouched for years and were failing for bogus reasons. That reduces the coverage and stability of our tools, and I know there are more in there that we missed. During the UI-as-code work, it was a challenge to gauge progress because there were many false positives where type error tests "passed" only because the parsing support wasn't there yet, so a syntax error was counted as a pass.

Another priority is just making it easier to write and maintain correct tests of static error behavior. Multitests are not easy to write or read.

@whesse
Contributor Author

whesse commented May 10, 2019

If we want a somewhat working system as soon as possible, we should match the errors and locations given by the analyzer in our first set of test annotations. They seem to have a stable set of static error names, and good location reporting.

We can start seeing how bad the conflicts are with the front-end parser, in terms of reference names and positions. There are open questions about compile-time errors that report a conflict between two lines: should we demand that one or both lines be reported?

@bwilkerson
Member

... should we demand that one or both lines be reported?

We might want to only demand one line as a first step, but longer term I think we want to check both.

I recently added the ability to report both locations to analyzer, and we now have one concrete example of doing so, but we don't yet have all the cases covered where we'd want to do so.

@munificent
Member

I have a simple rule I use to answer questions like "Should we have tests for X?":

If the next release of the SDK changes X, would we want to find out before some user does?

If the answer is yes, we better test it.

@athomas athomas assigned munificent and unassigned whesse Jul 2, 2019
@athomas
Member

athomas commented Aug 12, 2019

@munificent Are we ready to close this issue? Or is there something left to do?

@munificent
Member

Oh, yes! I forgot about this tracking issue. We have static error tests now. They are documented here.
