errors: speedup for large error counts #12631

huguesb · 2022-04-20T22:06:32Z

We have a legacy codebase with many errors across many files

Found 7995 errors in 2218 files (checked 21364 source files)

For historical reasons, it hasn't been practical to fix all of
these yet, and we've been slowly chipping away at them over time.

Profiling shows that is_blockers is the biggest single hotspot,
taking roughly 1min, and total_errors accounts for another 11s.

Instead of computing those values on read by iterating over all
errors, update auxiliary variables appropriately every time a
new error is recorded.

As part of this change, refactor the existing mechanisms to filter
out errors.

Instead of maintaining two separate mechanisms to filter out errors
(boolean flag in MessageBuilder and explicit copy of MessageBuilder
or Errors), expand ErrorWatcher to support all relevant usage
patterns and update all usages accordingly.

This is both cleaner and more robust than the previous approach,
and should also offer a slight performance improvement by reducing
allocations.

emmatyping · 2022-04-20T22:40:13Z

Have you considered excluding files from being checked so that you get to a clean run, then iteratively enabling files? 8000 is a lot of errors!

huguesb · 2022-04-20T23:00:18Z

> Have you considered excluding files from being checked so that you get to a clean run, then iteratively enabling files? 8000 is a lot of errors!

I expect such an exclusion would result in more errors accumulating into those files over time and would make our path towards 0 errors even longer.

JelleZijlstra

Thanks, I think the use case is reasonable; we should make sure it's pleasant to get a large codebase mypy-clean over time.

I do have some concerns; adding additional state like this diff does makes it more likely that bugs will creep in as different pieces of state aren't updated together. I am on board with new state for has_blockers but the other two seem avoidable.

JelleZijlstra · 2022-04-20T23:24:09Z

mypy/errors.py

@@ -235,7 +240,7 @@ def copy(self) -> 'Errors':
        return new

    def total_errors(self) -> int:
-        return sum(len(errs) for errs in self.error_info_map.values())
+        return self.total_error_count


This is only used in one place, where we check that the error count before and after some operation is the same. I wonder if we can do something simpler there.

huguesb · 2022-04-21

I tried something a little more complex that doesn't require maintaining state that might diverge.

JelleZijlstra · 2022-04-20T23:24:46Z

mypy/errors.py


    def is_blockers(self) -> bool:
        """Are the any errors that are blockers?"""
-        return any(err for errs in self.error_info_map.values() for err in errs if err.blocker)
+        return self.has_blockers


This one makes sense to me. We check for blockers after processing every single file, so with a lot of errors that naturally gets quadratic.

huguesb · 2022-04-21

I had to change this from a single boolean to a path-indexed set to accommodate removal of errors.
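To illustrate the path-indexed bookkeeping discussed in this thread, here is a minimal sketch. It is not the code from this PR: `ErrorStore`, `ErrorInfo`, `add` and `clear_file` are hypothetical names chosen for the example, and only `error_info_map`, `has_blockers` and `is_blockers` echo identifiers from the diff. It shows how keeping a set of file paths lets the blocker check stay cheap while still allowing a file's errors to be removed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ErrorInfo:
    # Hypothetical, stripped-down stand-in for an error record.
    file: str
    message: str
    blocker: bool = False


class ErrorStore:
    """Toy error container that tracks blockers incrementally (illustrative only)."""

    def __init__(self) -> None:
        self.error_info_map: Dict[str, List[ErrorInfo]] = defaultdict(list)
        # Paths of files that currently hold at least one blocking error.
        self.has_blockers: Set[str] = set()

    def add(self, info: ErrorInfo) -> None:
        # Update the auxiliary set at write time instead of scanning on read.
        self.error_info_map[info.file].append(info)
        if info.blocker:
            self.has_blockers.add(info.file)

    def clear_file(self, path: str) -> None:
        # Dropping a file's errors also drops its blocker entry, which a
        # single boolean flag could not express.
        self.error_info_map.pop(path, None)
        self.has_blockers.discard(path)

    def is_blockers(self) -> bool:
        # Constant-time check, independent of how many errors are recorded.
        return bool(self.has_blockers)
```

With a plain boolean, clearing one file's errors would leave no way to know whether another file still holds a blocker without rescanning everything; the set keeps that answer available directly.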

JelleZijlstra · 2022-04-23T21:09:31Z

mypy/checkexpr.py

-                sub_result, method_type = self.check_op(method, left_type, right, e,
-                                                        allow_reverse=True)
+
+                with ErrorWatcher(self.msg.errors) as w:


There is a similar mechanism being used in the infer_overload_return_type method in this file (grep for .clean_copy). Can we use that here too?

huguesb · 2022-04-23

Hmm. It seems like infer_overload_return_type creates a new empty MessageBuilder / Errors and uses that to check if any new errors were triggered, which could be achieved by using the new ErrorWatcher. However, it also looks like any such errors would then be ignored (not actually make it into the original Errors), which the ErrorWatcher approach alone cannot do, and if we're creating a new empty Errors then there's not much benefit to the ErrorWatcher approach. Am I missing something?

JelleZijlstra · 2022-04-24

Could we use the infer_overload_return_type approach here, so we don't have to introduce the new ErrorWatcher concept? Seems like we could do that here by getting a clean copy, then unconditionally doing self.msg.add_errors like on line 2293 in visit_comparison_expr.

huguesb · 2022-04-24

Hmm, I think the current approach there needs to be rethought.

Consider line 1757:

    self.msg = overload_messages
    self.chk.msg = overload_messages
    try:
        # Passing `overload_messages` as the `arg_messages` parameter doesn't
        # seem to reliably catch all possible errors.
        # TODO: Figure out why

which seems obviously related to the way the MessageBuilder in TypeChecker is created:

    self.msg = MessageBuilder(errors, modules)
    self.plugin = plugin
    self.expr_checker = mypy.checkexpr.ExpressionChecker(self, self.msg, self.plugin)
    self.pattern_checker = PatternChecker(self, self.msg, self.plugin)

The explicit patching of the MessageBuilder instance is incomplete (expr_checker and pattern_checker are not patched), but more importantly it is very fragile (there might be other places that need patching, and the places that need patching are likely to change over time in ways that are hard to keep track of), and rather ugly to boot. I think instead of using this pattern in more places we should replace it with some variant of ErrorWatcher that can optionally intercept/discard errors in the original MessageBuilder.

JelleZijlstra · 2022-04-24

That's a good point. I suspect bug #12665, which we found yesterday, is related to the problem you identify.
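The watcher idea discussed in this review thread can be sketched roughly as follows. This is a toy model under stated assumptions, not mypy's actual `Errors` or `ErrorWatcher`: the `report` and `on_error` methods, the `filter_errors` parameter, the `has_new_errors` helper and the internal `_watchers` stack are all made up for illustration. The point is that a context manager registered on the shared error sink can both detect new errors and optionally discard them, without swapping out `self.msg` on every object that holds a reference to it.

```python
from typing import List


class Errors:
    """Toy error sink with a stack of active watchers (not mypy's class)."""

    def __init__(self) -> None:
        self.messages: List[str] = []
        self._watchers: List["ErrorWatcher"] = []

    def report(self, message: str) -> None:
        # The innermost watcher sees the error first; a filtering watcher
        # swallows it so it never reaches outer watchers or the main list.
        for watcher in reversed(self._watchers):
            if watcher.on_error(message):
                return
        self.messages.append(message)


class ErrorWatcher:
    """Context manager that observes, and optionally discards, new errors."""

    def __init__(self, errors: Errors, filter_errors: bool = False) -> None:
        self.errors = errors
        self.filter_errors = filter_errors
        self.seen: List[str] = []

    def __enter__(self) -> "ErrorWatcher":
        self.errors._watchers.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Always unregister, even if the body raised.
        self.errors._watchers.pop()

    def on_error(self, message: str) -> bool:
        # Record the error; the return value tells the sink whether to drop it.
        self.seen.append(message)
        return self.filter_errors

    def has_new_errors(self) -> bool:
        return bool(self.seen)
```

Because every caller reports through the same sink object, nothing has to be patched or restored when a speculative check starts or ends, which is the fragility the comment above points at.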

huguesb · 2022-04-24T22:09:24Z

I think I will take a stab at replacing usage of MessageBuilder.copy with an improved ErrorWatcher

huguesb · 2022-04-26T04:00:41Z

> I think I will take a stab at replacing usage of MessageBuilder.copy with an improved ErrorWatcher

That was a somewhat more demanding refactor than I expected, but it worked out. A follow-up to clean up all those methods threading through a (sometimes optional) MessageBuilder seems worthwhile, but I'd rather put that in a separate PR to make review and testing easier.

github-actions · 2022-04-29T13:34:07Z

According to mypy_primer, this change has no effect on the checked open source code. 🤖🎉

JelleZijlstra · 2022-04-29T14:21:34Z

Thanks, this is much cleaner!

huguesb mentioned this pull request Apr 20, 2022

Release 0.950 planning #12579

Closed

JelleZijlstra reviewed Apr 20, 2022

address comments

425b9f1

huguesb force-pushed the pr-errors-faster branch from de001c9 to 425b9f1 Compare April 21, 2022 05:41

JelleZijlstra self-requested a review April 21, 2022 18:11

JelleZijlstra reviewed Apr 23, 2022

JelleZijlstra self-requested a review April 26, 2022 04:02

JelleZijlstra approved these changes Apr 29, 2022

Update mypy/errors.py

bab5a3a

JelleZijlstra reviewed Apr 29, 2022

Update mypy/errors.py

f40d0eb

JelleZijlstra merged commit a3abd36 into python:master Apr 29, 2022

huguesb deleted the pr-errors-faster branch April 29, 2022 18:47

huguesb mentioned this pull request Apr 30, 2022

cleanups after recent refactor of error filtering #12699

Merged

JelleZijlstra mentioned this pull request May 1, 2022

Report unreachable except blocks #12086

Closed

emosenkis mentioned this pull request Jun 29, 2022

Crash with mypy plugin on mypy 0.961 dry-python/returns#1433

Open

hauntsaninja mentioned this pull request Aug 22, 2022

sum(List[Decimal]) + sum(List[Decimal]) incompatible type error #8814

Closed

AlexWaygood mentioned this pull request Nov 22, 2022

Bytes formatting error gets produced during failed overload match #12665

Closed

hauntsaninja mentioned this pull request Jun 1, 2023

Incorrect type inference in yoda-style comparison leads to spurious error #9333

Closed
