Skip to content

errors: speedup for large error counts #12631

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 29, 2022

Conversation

huguesb
Copy link
Contributor

@huguesb huguesb commented Apr 20, 2022

We have a legacy codebase with many errors across many files

Found 7995 errors in 2218 files (checked 21364 source files)

For historical reasons, it hasn't been practical to fix all of
these yet, and we've been slowly chipping at them over time.

Profiling shows that is_blockers is the biggest single hotspot,
taking roughly 1min, and total_errors account for another 11s.

Instead of computing those values on read by iterating over all
errors, update auxiliary variables appropriately every time a
new error is recorded.

As part of this change, refactor the existing mechanisms to filter
out errors.

Instead of maintaining two separate mechanism to filter out errors
boolean flag in MessageBuilder and explicit copy of MessageBuilder
or Errors), expand ErrorWatcher to support all relevant usage
patterns and update all usages accordingly.

This is both cleaner and more robust than the previous approach,
and should also offer a slight performance improvement by reducing
allocations.

We have a legacy codebase with many errors across many files
```
Found 7995 errors in 2218 files (checked 21364 source files)
```

For historical reasons, it hasn't been practical to fix all of
these yet, and we've been slowly chipping at them over time.

Profiling shows that `is_blockers` is the biggest single hotspot,
taking roughly 1min, and `total_errors` account for another 11s.

Instead of computing those values on read by iterating over all
errors, update auxiliary variables appropriately every time a
new error is recorded.
@huguesb huguesb mentioned this pull request Apr 20, 2022
@github-actions

This comment has been minimized.

@emmatyping
Copy link
Member

Have you considered excluding files from being checked so that you get to a clean run, then iteratively enabling files? 8000 is a lot of errors!

@huguesb
Copy link
Contributor Author

huguesb commented Apr 20, 2022 via email

Copy link
Member

@JelleZijlstra JelleZijlstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think the use case is reasonable; we should make sure it's pleasant to get a large codebase mypy-clean over time.

I do have some concerns; adding additional state like this diff does makes it more likely that bugs will creep in as different pieces of state aren't updated together. I am on board with new state for has_blockers but the other two seem avoidable.

mypy/errors.py Outdated
@@ -235,7 +240,7 @@ def copy(self) -> 'Errors':
return new

def total_errors(self) -> int:
return sum(len(errs) for errs in self.error_info_map.values())
return self.total_error_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in one place, where we check that the error count before and after some operation is the same. I wonder if we can do something simpler there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried something a little more complex but which doesn't require maintaining a state that might diverge

mypy/errors.py Outdated

def is_blockers(self) -> bool:
"""Are the any errors that are blockers?"""
return any(err for errs in self.error_info_map.values() for err in errs if err.blocker)
return self.has_blockers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one makes sense to me. We check for blockers after processing every single file, so with a lot of errors that naturally gets quadratic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add to change from a single boolean to a path-indexed Set to accommodate removal of errors

@github-actions

This comment has been minimized.

1 similar comment
@github-actions

This comment has been minimized.

@JelleZijlstra JelleZijlstra self-requested a review April 21, 2022 18:11
sub_result, method_type = self.check_op(method, left_type, right, e,
allow_reverse=True)

with ErrorWatcher(self.msg.errors) as w:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar mechanism being used in the infer_overload_return_type method in this file (grep for .clean_copy). Can we use that here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It seems like infer_overload_return_type creates a new empty MessageBuilder / Errors and uses that to check if any new errors where triggered, which could be achieved by using the new ErrorWatcher. However it also looks like any such errors would then be ignored (not actually make it into the original Errors), which the ErrorWatcher approach alone cannot do, and if we're creating a new empty Errors then there's not much benefit to the ErrorWatcher approach. Am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the infer_overload_return_type approach here, so we don't have to introduce the new ErrorWatcher concept? Seems like we could do that here by getting a clean copy, then unconditionally doing self.msg.add_errors like on line 2293 in visit_comparison_expr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think the current approach there needs to be rethought.

Consider on line 1757:

            self.msg = overload_messages
            self.chk.msg = overload_messages
            try:
                # Passing `overload_messages` as the `arg_messages` parameter doesn't
                # seem to reliably catch all possible errors.
                # TODO: Figure out why

which seems obviously pretty related to the way the MessageBuilder in TypeChecker is created:

        self.msg = MessageBuilder(errors, modules)
        self.plugin = plugin
        self.expr_checker = mypy.checkexpr.ExpressionChecker(self, self.msg, self.plugin)
        self.pattern_checker = PatternChecker(self, self.msg, self.plugin)

The explicit patching of the MessageBuilder instance is incomplete (expr_checker and pattern_checker are not patched), but more importantly it is very fragile (there might be other places that need patching, the places that need patching are likely to change over time in ways that hard hard to keep track off), and rather ugly to boot. I think instead of using this pattern in more places we should replace it with some variant of ErrorWatcher that can optionally intercept/discard errors in the original MessageBuilder

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I suspect bug #12665, which we found yesterday, is related to the problem you identify.

@huguesb
Copy link
Contributor Author

huguesb commented Apr 24, 2022 via email

Instead of maintaining two separate mechanism to filter out errors (boolean flag
in MessageBuilder and explicit copy of MessageBuilder/Errors) expand ErrorWatcher
to support all relevant usage patterns and update all usages accordingly.

This is both cleaner and more robust than the previous approach, and should also
offer a slight performance improvement by reducing allocations.
@huguesb
Copy link
Contributor Author

huguesb commented Apr 26, 2022

I think I will take a stab at replacing usage of MessageBuilder.copy with an improved ErrorWatcher

On Sun, Apr 24, 2022, 2:26 PM Jelle Zijlstra @.> wrote: @.* commented on this pull request. ------------------------------ In mypy/checkexpr.py <#12631 (comment)>: > @@ -2293,12 +2293,15 @@ def visit_comparison_expr(self, e: ComparisonExpr) -> Type: self.msg.add_errors(local_errors) elif operator in operators.op_methods: method = self.get_operator_method(operator) - err_count = self.msg.errors.total_errors() - sub_result, method_type = self.check_op(method, left_type, right, e, - allow_reverse=True) + + with ErrorWatcher(self.msg.errors) as w: That's a good point. I suspect bug #12665 <#12665>, which we found yesterday, is related to the problem you identify. — Reply to this email directly, view it on GitHub <#12631 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABX3CSOH6ZXP7JUZYZBBU3TVGW37FANCNFSM5T5MP74Q . You are receiving this because you authored the thread.Message ID: @.***>

That was a somewhat more demanding refactor than I expected but it worked out. A follow-up to cleanup all those methods threading through a (sometimes optional) MessageBuilder seems worthwhile but I'd rather put that in a separate PR to make review and testing easier.

@JelleZijlstra JelleZijlstra self-requested a review April 26, 2022 04:02
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

1 similar comment
@github-actions
Copy link
Contributor

According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉

@JelleZijlstra JelleZijlstra merged commit a3abd36 into python:master Apr 29, 2022
@JelleZijlstra
Copy link
Member

Thanks, this is much cleaner!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants