-
-
Notifications
You must be signed in to change notification settings - Fork 2.9k
errors: speedup for large error counts #12631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We have a legacy codebase with many errors across many files ``` Found 7995 errors in 2218 files (checked 21364 source files) ``` For historical reasons, it hasn't been practical to fix all of these yet, and we've been slowly chipping at them over time. Profiling shows that `is_blockers` is the biggest single hotspot, taking roughly 1min, and `total_errors` account for another 11s. Instead of computing those values on read by iterating over all errors, update auxiliary variables appropriately every time a new error is recorded.
This comment has been minimized.
This comment has been minimized.
Have you considered excluding files from being checked so that you get to a clean run, then iteratively enabling files? 8000 is a lot of errors! |
Have you considered excluding files from being checked so that you get to
a clean run, then iteratively enabling files? 8000 is a lot of errors!
I expect such an exclusion would result in more errors accumulating into
those files over time and would make our path towards 0 errors even longer.
… —
Reply to this email directly, view it on GitHub
<#12631 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABX3CSM6YHVPOU5W6YQ5BYLVGCBVRANCNFSM5T5MP74Q>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think the use case is reasonable; we should make sure it's pleasant to get a large codebase mypy-clean over time.
I do have some concerns; adding additional state like this diff does makes it more likely that bugs will creep in as different pieces of state aren't updated together. I am on board with new state for has_blockers
but the other two seem avoidable.
mypy/errors.py
Outdated
@@ -235,7 +240,7 @@ def copy(self) -> 'Errors': | |||
return new | |||
|
|||
def total_errors(self) -> int: | |||
return sum(len(errs) for errs in self.error_info_map.values()) | |||
return self.total_error_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used in one place, where we check that the error count before and after some operation is the same. I wonder if we can do something simpler there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried something a little more complex but which doesn't require maintaining a state that might diverge
mypy/errors.py
Outdated
|
||
def is_blockers(self) -> bool: | ||
"""Are the any errors that are blockers?""" | ||
return any(err for errs in self.error_info_map.values() for err in errs if err.blocker) | ||
return self.has_blockers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one makes sense to me. We check for blockers after processing every single file, so with a lot of errors that naturally gets quadratic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add to change from a single boolean to a path-indexed Set to accommodate removal of errors
de001c9
to
425b9f1
Compare
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
sub_result, method_type = self.check_op(method, left_type, right, e, | ||
allow_reverse=True) | ||
|
||
with ErrorWatcher(self.msg.errors) as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a similar mechanism being used in the infer_overload_return_type
method in this file (grep for .clean_copy
). Can we use that here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It seems like infer_overload_return_type
creates a new empty MessageBuilder
/ Errors
and uses that to check if any new errors where triggered, which could be achieved by using the new ErrorWatcher
. However it also looks like any such errors would then be ignored (not actually make it into the original Errors
), which the ErrorWatcher
approach alone cannot do, and if we're creating a new empty Errors
then there's not much benefit to the ErrorWatcher
approach. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use the infer_overload_return_type
approach here, so we don't have to introduce the new ErrorWatcher
concept? Seems like we could do that here by getting a clean copy, then unconditionally doing self.msg.add_errors
like on line 2293 in visit_comparison_expr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think the current approach there needs to be rethought.
Consider on line 1757:
self.msg = overload_messages
self.chk.msg = overload_messages
try:
# Passing `overload_messages` as the `arg_messages` parameter doesn't
# seem to reliably catch all possible errors.
# TODO: Figure out why
which seems obviously pretty related to the way the MessageBuilder
in TypeChecker
is created:
self.msg = MessageBuilder(errors, modules)
self.plugin = plugin
self.expr_checker = mypy.checkexpr.ExpressionChecker(self, self.msg, self.plugin)
self.pattern_checker = PatternChecker(self, self.msg, self.plugin)
The explicit patching of the MessageBuilder
instance is incomplete (expr_checker and pattern_checker are not patched), but more importantly it is very fragile (there might be other places that need patching, the places that need patching are likely to change over time in ways that hard hard to keep track off), and rather ugly to boot. I think instead of using this pattern in more places we should replace it with some variant of ErrorWatcher
that can optionally intercept/discard errors in the original MessageBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I suspect bug #12665, which we found yesterday, is related to the problem you identify.
I think I will take a stab at replacing usage of MessageBuilder.copy with
an improved ErrorWatcher
…On Sun, Apr 24, 2022, 2:26 PM Jelle Zijlstra ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In mypy/checkexpr.py
<#12631 (comment)>:
> @@ -2293,12 +2293,15 @@ def visit_comparison_expr(self, e: ComparisonExpr) -> Type:
self.msg.add_errors(local_errors)
elif operator in operators.op_methods:
method = self.get_operator_method(operator)
- err_count = self.msg.errors.total_errors()
- sub_result, method_type = self.check_op(method, left_type, right, e,
- allow_reverse=True)
+
+ with ErrorWatcher(self.msg.errors) as w:
That's a good point. I suspect bug #12665
<#12665>, which we found yesterday,
is related to the problem you identify.
—
Reply to this email directly, view it on GitHub
<#12631 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABX3CSOH6ZXP7JUZYZBBU3TVGW37FANCNFSM5T5MP74Q>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Instead of maintaining two separate mechanism to filter out errors (boolean flag in MessageBuilder and explicit copy of MessageBuilder/Errors) expand ErrorWatcher to support all relevant usage patterns and update all usages accordingly. This is both cleaner and more robust than the previous approach, and should also offer a slight performance improvement by reducing allocations.
That was a somewhat more demanding refactor than I expected but it worked out. A follow-up to cleanup all those methods threading through a (sometimes optional) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1 similar comment
According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉 |
Thanks, this is much cleaner! |
We have a legacy codebase with many errors across many files
For historical reasons, it hasn't been practical to fix all of
these yet, and we've been slowly chipping at them over time.
Profiling shows that
is_blockers
is the biggest single hotspot,taking roughly 1min, and
total_errors
account for another 11s.Instead of computing those values on read by iterating over all
errors, update auxiliary variables appropriately every time a
new error is recorded.
As part of this change, refactor the existing mechanisms to filter
out errors.
Instead of maintaining two separate mechanism to filter out errors
boolean flag in
MessageBuilder
and explicit copy ofMessageBuilder
or
Errors
), expandErrorWatcher
to support all relevant usagepatterns and update all usages accordingly.
This is both cleaner and more robust than the previous approach,
and should also offer a slight performance improvement by reducing
allocations.