Test precisely for specific compile-time errors and their source positions #36766
An example of a request for this capability is #31883
This would be great! I have often been concerned that analyzer might be passing (or failing) some tests for the wrong reason.

It isn't clear what you're including as part of "Standardizing the way that our tools report these errors". It sounds kind of like you're asking for a standard output format so that it's easier for test.py to parse and use the information in the output. If so, I agree that that would be useful.

But I think there is another topic to address. You might also be including it as part of that line item, but if so I think it might be worth calling it out separately. And if not, then I think you need to add it. Specifically, we need to get to a point where analyzer and the front end are producing the same errors (with the same identifiers) in the same places. Aside from the errors produced by the parser (which are unified by definition, though the identifiers might currently differ), there is, unfortunately, a significant amount of work to get to the point where the two tools agree. I really want to see them producing consistent, high quality diagnostics, and I think it's worth the effort it will take, but it isn't likely to happen without a concerted effort.
That would be awesome long term! I personally would like to not only be consistent, but to have a single source of truth for all messages (e.g. generalize front_end/messages.yaml to eventually contain all messages from all tools. Maybe we could also generate C++ code so that the VM can use it to report errors as well!) However, I hope we can make the compile-time error tests available more gradually. For instance, start by noting the exact offset where an error is reported, and later add support for checking that a specific error message is given?
Some ideas:
Today analyzer, front_end, and dart2js each have tests that look for specific data at specific locations in a test file. The logic is different in each tool, but my hope is that we can leverage some of the existing work to solve this task. Some pointers:
Actually, that's an uncommon way for the analyzer tests to be written. Most of them follow, or are being converted to follow, the pattern in ambiguous_set_or_map_literal_test.dart. The problem with the one you picked is that you can only specify the offsets of the diagnostics, and I want our tests (not necessarily the shared tests) to also test the length of the range.
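For context, that analyzer pattern looks roughly like the sketch below. The helper names (`assertErrorsInCode`, `error`) are recalled from the analyzer's internal test support, so this is an approximation of the pattern rather than the actual test file:

```dart
// Approximate sketch of an analyzer diagnostics test; relies on the
// analyzer's internal test support rather than being a standalone program.
Future<void> test_setAndMapSpreads() async {
  await assertErrorsInCode('''
void f(Set<int> set, Map<int, int> map) {
  var c = {...set, ...map};
  print(c);
}
''', [
    // Expected diagnostic: error code, offset of the literal, and its length.
    error(CompileTimeErrorCode.AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH, 52, 16),
  ]);
}
```

The offset/length pair (52 and 16 here) is exactly the hand-counted data that the next comments worry about maintaining.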
Yes, I love everything about this issue. I should have a good chunk of time after IO to work on this, and nothing would make me happier than to harvest the best of what the existing tools have and consolidate onto a single solution.
Tracking length is useful, but having to hand-author those numbers looks tricky. As a total bikeshed, I recently hacked together a little error test framework for a hobby project that lets you author tests like this:

```dart
Map<int, int> map;
Set<int> set;
var c = {...set, ...map};
//      ^^^^^^^^^^^^^^^^ AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH
```

Extending it to multi-line errors might be a little harder, but something like this is probably easier to maintain than having to hand-count characters.
It is. Which is why the failure message prints the expected values as source code so that we can copy/paste the new values in if the offset or length changes for valid reasons.
Ah, smart. 👍
We should consider having a mode where the test runner updates the test expectations in place. This might save you the copy/paste trouble (we don't have it for some of our dart2js tests and I keep wishing we did). When I have used something similar (e.g. with CFE .expect files), the diff in a code review is then perfect for verifying that the change was intentional.
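As a rough illustration of what "updating expectations in place" could look like for a caret-style annotation such as the one above, here is a small, hypothetical Dart helper (not taken from any existing tool) that regenerates the annotation comment from a reported error's column, length, and name:

```dart
/// Hypothetical helper: rebuilds the `// ^^^ ERROR_NAME` comment line for an
/// error reported at 1-based [column] with [length] characters on the
/// preceding source line.
String caretLine(int column, int length, String errorName) {
  // The leading `//` occupies columns 1-2, so pad out to the error's column.
  var padding = ' ' * (column > 3 ? column - 3 : 1);
  var carets = '^' * length;
  return '//$padding$carets $errorName';
}

void main() {
  // For an error starting at column 9 and spanning 16 characters:
  print(caretLine(9, 16, 'AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH'));
  // Prints: //      ^^^^^^^^^^^^^^^^ AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH
}
```

A runner in update mode could then replace any stale annotation line with the regenerated one and leave the diff for review.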
I have some questions.

Where is the specification of errors and line and column positions? Without a specification we don't have anything to test conformance to. We don't have a way to determine which, if any, of conflicting implementations is correct. We can't use a contractor to write tests for us.

What are the semantics of static errors? I guess a program that contains a single static error must signal that one statically, for a suitable definition of "statically". What about a program with multiple static errors? Do they all need to be signaled? Is it enough to signal at least one of them? We should test programs with multiple errors, because these definitely occur in the real world.

When the language team designs new features, will they also specify static error messages and positions? Will they please go back and add these to the existing static errors in the spec? If not, then the test suite (or one of the implementations) is our only spec.

Do we expect intentional violations of the error specification on the part of the implementations? How serious are these when they occur?

What is the process of changing the error specification (say, to improve an error message)? What parties have to sign off on it? Will it be considered a breaking change to start failing in a different way?
Good questions!
There isn't one, at least not yet. We did get a good start at coming up with some guidelines to help answer some of these questions, and it might be useful to finish that work for this purpose.
Based on conversations that I've had with the language team, my understanding is that when the specification defines an error the only requirement it places on implementations is that some diagnostic be generated. The definition of an error does not dictate which diagnostic should be produced, or how many diagnostics should be produced, or what code fragment they should reference. (From that perspective, what we have today is actually a complete test framework.) We do need to consider code with multiple errors. Do we want to support the testing of tools that ignore some portion of the remaining code when they encounter an error? If so, perhaps we can only specify the first diagnostic we expect to have produced. If not, then I would guess that we would annotate all of the expected errors. But that raises another interesting question. The number of diagnostics is strongly dependent on the way a tool recovers in the face of an error. Will the test framework allow (ignore) diagnostics that are generated that are not annotated? While I'm completely in favor of eliminating follow-on errors, there's a fair bit of work involved in doing so (at least for analyzer) and I don't expect that we'll have the resources for that in the near future.
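To make the multiple-errors question concrete, a file with two expected errors might be annotated like the sketch below, using the caret style floated earlier in this thread. The format and the analyzer-style error names are assumptions for discussion, not an existing syntax:

```dart
// Hypothetical annotations: both errors are expected; the open question is
// whether additional, unannotated diagnostics should fail the test or be
// ignored as recovery artifacts.
void f(Set<int> set, Map<int, int> map) {
  var c = {...set, ...map};
  //      ^^^^^^^^^^^^^^^^ AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH
  int x = "not an int";
  //      ^^^^^^^^^^^^ INVALID_ASSIGNMENT
  print(c);
  print(x);
}
```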
Based on conversations I've had with them I don't think they want to do that.
Well, technically, we could create a "diagnostics" team whose responsibility is to create such a specification. It would, presumably, be a specification that we hold ourselves to without requiring that other tooling necessarily follow it.
I think one of the questions here is how we balance the desire for consistency between the tools with the needs of the implementation teams to change the implementation in ways that might impact the way diagnostics are produced. I don't have an answer for that, but personally I tend to place more importance on providing a good UX, and I think that inconsistent error messaging hurts the UX.

I think the bigger question is what it is that we're trying to accomplish (either short term or long term). Is this an effort to ensure consistency in messaging between our tools? (I'm personally in favor of consistency, but that doesn't mean that this is necessarily the right way to achieve that goal.) Are we trying to reduce or eliminate the possibility that tests are passing for the wrong reason? (Also a good goal, in my opinion.)
@whesse wrote:
We might be happy about a smaller step: just standardize the reported errors enough that tools like test.py can recognize which error was reported. We would then ignore the position of the error, and we'd ignore additional errors that might arise because of recovery efforts. This would allow us to avoid a significant amount of complexity. With the current multi-test approach, it is reasonable to expect each subtest with a compile-time error outcome to be associated with one specific expected error. Wouldn't that be worth aiming for at first?
All of this is equally true of the core libraries, and yet we have tests for those which are shared across the team and implemented by contractors, so this doesn't seem like a problem to me. You can think of this as less of a conformance test than a regression test. Each implementer of the errors ships something that end users actually do experience and care about. Presumably, we would want to know if that experience spontaneously changed or degraded in some way. If, for example, every type error accidentally got reported as "Syntax error", we'd probably want to know before we shipped that to users. Likewise, our users don't know or care which front end generates the errors in the tools they use. From their perspective that is an implementation detail. If, say, analyzer moves from its own implementation of type checking to CFE's, users should neither know nor care. That again implies that we should test what that behavior is precisely and ensure our tools are consistent with each other, even if there is no independent specification for them to be consistent with.
The language spec has a minimal definition of this that includes only what is necessary for the language spec itself to be self-consistent. It defines that some constructs are compile-time errors and that any program containing at least one of those constructs has no runtime semantics. Otherwise, it would be obliged to define those runtime semantics, which we don't want to do. Anything beyond that is beyond the purview of the spec, but is still relevant to our users. Our users are not invoking our tools purely to see how well they implement the specs. They are actually trying to write applications and find and fix bugs in their code.
Any tool that reports static errors and wants to have a good user experience needs to do error recovery and report as many distinct errors as it can while not reporting false positives that are cascaded from earlier errors. I think we want our tools to have a good user experience, so that applies to us too. Likewise, I presume we do not want to spontaneously degrade that user experience without realizing it, which implies that we should have tests of that behavior. If you accept that, from the user's perspective, it is an implementation detail which of our front ends reports these errors, that implies we should share those tests across our front ends so that we ensure they provide the same errors.
+1!
No, I don't think we plan to do that. We don't specify the behavior of
Given our team's history, I think we'd all be surprised if every implementation behaved exactly the same. :) But I think that should be a reasonable goal for us to aim for. I don't think it provides much end user value to have different tools report different errors, and I'm not aware of technical reasons that would force them to.
I wouldn't expect this to be a big deal. We all want a great user experience, so if the various front end folks — there aren't that many of them — agree to a change, then I don't see why they couldn't land it.
Turning a warning into an error is a breaking change. But turning an error of one kind into another or changing its location isn't "breaking", as far as I know. We have acted like we are free to change the diagnostic output of our tools even though in theory a user could have a script that relies on that output.
Personally, yes, that's a goal, but not a primary one.
That is a high priority for me. This has always been a serious problem. During the Dart 2 test migration, I found many multitests that had been untouched for years that were failing for bogus reasons. That reduces the coverage and stability of our tools. I know there are more in there that we missed. During the UI-as-code stuff, it was a challenge to gauge progress because there were many false positives: type error tests "passed" because the parsing support wasn't even there, so a syntax error was considered a pass. Another priority is just making it easier to write and maintain correct tests of static error behavior. Multitests are not easy to write or read.
If we want a somewhat working system as soon as possible, we should match the errors and locations given by the analyzer in our first set of test annotations. They seem to have a stable set of static error names, and good location reporting. We can start seeing how bad the conflicts are with the front-end parser, in terms of reference names and positions. There are open questions about compile-time errors that report a conflict between two lines: should we demand that one or both lines be reported?
We might want to only demand one line as a first step, but longer term I think we want to check both. I recently added the ability to report both locations to analyzer, and we now have one concrete example of doing so, but we don't yet have all the cases covered where we'd want to do so.
I have a simple rule I use to answer questions like "Should we have tests for X?": If the next release of the SDK changes X, would we want to find out before some user does? If the answer is yes, we better test it.
@munificent Are we ready to close this issue? Or is there something left to do?
Oh, yes! I forgot about this tracking issue. We have static error tests now. They are documented here. |
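For readers landing here later: as far as I recall, the documented format combines a caret line with per-tool expectation lines, roughly as in the sketch below. Treat this as a paraphrase and check the linked documentation for the authoritative syntax.

```dart
int x = "not an int";
//      ^^^^^^^^^^^^
// [analyzer] COMPILE_TIME_ERROR.INVALID_ASSIGNMENT
// [cfe] A value of type 'String' can't be assigned to a variable of type 'int'.
```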
Meta-issue to discuss and implement more precise testing for compile-time errors.
Testing for compile-time errors using test.py mainly uses multitests, since they have a way of specifying that the expected result of running a test will be a compile-time failure, and they allow testing for multiple cases of compile failures in a single test file.

What is missing in test.py's evaluation of the result of running a (multi)test is a check that the compile-time error is the expected error, and that it occurs at the right source code position in the (multi)test.
If there is a standard for specifying the expected error, and its position, in the test source, and standards for how our various compilers and parsers report these errors, then we can put the logic for checking that the result of a test matches the expectation, including the error type and position, into the test.py (tools/testing/dart) test framework.
We could also consider putting it into other testing frameworks, like package:test and package:testing.
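To sketch what "the logic for checking that the result of a test matches the expectation" could look like, here is an illustrative Dart version of the comparison step. The `StaticError` class and all names are invented for this sketch; the real implementation would live in the test runner:

```dart
/// Illustrative model of one expected or reported error (invented names).
class StaticError {
  final int line;
  final int column;
  final String code; // e.g. an analyzer error name

  StaticError(this.line, this.column, this.code);

  @override
  String toString() => '$code at $line:$column';
}

/// Returns a description of every mismatch between what a test expects and
/// what a tool reported; an empty list means the test passes.
List<String> compareErrors(
    List<StaticError> expected, List<StaticError> actual) {
  var problems = <String>[];
  var unmatched = List<StaticError>.from(actual);
  for (var want in expected) {
    var index = unmatched.indexWhere((got) =>
        got.line == want.line &&
        got.column == want.column &&
        got.code == want.code);
    if (index == -1) {
      problems.add('Missing expected error: $want');
    } else {
      unmatched.removeAt(index);
    }
  }
  for (var extra in unmatched) {
    problems.add('Unexpected error: $extra');
  }
  return problems;
}

void main() {
  var expected = [StaticError(3, 9, 'AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH')];
  var actual = [StaticError(3, 9, 'AMBIGUOUS_SET_OR_MAP_LITERAL_BOTH')];
  print(compareErrors(expected, actual)); // [] means the test passes.
}
```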
I will start a checklist of the topics we need to consider, and we can create related issues for the subtopics as we flesh them out. General comments on the topic should be posted as replies, or written in a document.
@munificent @jcollins-g @sigmundch @whesse