Skip to content

GH-115060: Speed up pathlib.Path.glob() by removing redundant regex matching #115061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Feb 10, 2024

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Feb 6, 2024

When expanding and filtering paths for a ** wildcard segment, build an re.Pattern object from the subsequent pattern parts, rather than the entire pattern, and match against the os.DirEntry object prior to instantiating a path object.

Also skip compiling a pattern when expanding a * wildcard segment.

… regex matching

When expanding and filtering paths for a `**` wildcard segment, build an
`re.Pattern` object from the subsequent pattern parts, rather than the
entire pattern.

Also skip compiling a pattern when expanding a `*` wildcard segment.
@barneygale
Copy link
Contributor Author

barneygale commented Feb 6, 2024

Notable improvements:

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('*', follow_symlinks=False))"
2000 loops, best of 5: 180 usec per loop  # before
2000 loops, best of 5: 159 usec per loop  # after
# --> 1.13x faster

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*.py', follow_symlinks=False))"
5 loops, best of 5: 54   msec per loop  # before
5 loops, best of 5: 40.9 msec per loop  # after
# --> 1.32x faster

Everything else is about the same.

@zooba
Copy link
Member

zooba commented Feb 8, 2024

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

Since it doesn't require changing any test cases, and I know the tests cases are pretty thorough for this area, I don't think there's any reason to not sign off. Maybe trigger a buildbot run with the tag to make sure it doesn't behave strangely on any of those setups - they can occasionally be a bit unusual and find some edge cases.

@barneygale
Copy link
Contributor Author

barneygale commented Feb 8, 2024

Thanks Steve.

For whatever reason, every time I try to review this, I struggle to figure out what the change is doing :D

The algorithm might be worthy of a blog post at this point!

The main change is that we now filter partial paths through a regex corresponding to a partial pattern in _select_recursive, rather than complete paths through a regex corresponding to a complete pattern in PathBase.glob(). We can do this because previous parts have already been filtered by _select_children(), and so there's no need to re-filter them.

The secondary change (which includes the addition of _entry_str()) is to match against os.DirEntry.path directly, which allows us to skip construction of path objects for files that don't match.

@zooba
Copy link
Member

zooba commented Feb 8, 2024

Okay, today it made sense :) Guess I'm more awake right now. Reading the changes from the bottom up might have helped as well.

Personally, I don't think you can have too many comments in an algorithm like this, particularly when it's recursive and split between a couple of functions. I'll suggest a few comments that would've helped me, but I don't think there are any code changes needed.

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just comments that may help make it more understandable. No changes required

@barneygale barneygale merged commit 6f93b4d into python:main Feb 10, 2024
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this pull request Feb 14, 2024
… regex matching (python#115061)

When expanding and filtering paths for a `**` wildcard segment, build an `re.Pattern` object from the subsequent pattern parts, rather than the entire pattern, and match against the `os.DirEntry` object prior to instantiating a path object. Also skip compiling a pattern when expanding a `*` wildcard segment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-pathlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants