[WIP] Track visited files and directories when collecting #4203

Closed
wants to merge 2 commits into from

@wjt wjt commented Oct 19, 2018

There are two parts to the fix:

  • Don't visit directories which have already been visited. This fixes the exponential aspect of the bug.
  • Don't visit files which have already been visited. This would fix the more minor problem that, without this additional change, test_noop is run twice, once as test_noop.py and once as symlink-0/test_noop.py. I decided against including this part because, without more refactoring, it broke --keep-duplicates.
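
For illustration, the first part can be sketched roughly as below. This is a minimal standalone sketch, not the code in this PR: `walk_once` is a hypothetical name, and it assumes a platform where `os.stat()` reports meaningful `st_dev` and `st_ino` values.

```python
import os


def walk_once(root):
    """Yield each directory under root at most once, following symlinks.

    Hypothetical helper: directories are identified by (st_dev, st_ino),
    so a symlink pointing back into the tree is recognised as visited.
    """
    visited = set()
    stack = [root]
    while stack:
        path = stack.pop()
        st = os.stat(path)  # follows symlinks
        key = (st.st_dev, st.st_ino)
        if key in visited:
            continue  # already walked via another path
        visited.add(key)
        yield path
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir():  # follows symlinks by default
                    stack.append(entry.path)
```

With two symlinks in a directory that both point back at that directory, this yields the directory once instead of descending through every combination of links.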

Fixes #624.

I also included a fix for tox.ini that makes the instructions in CONTRIBUTING.rst for how to run just a single test work; happy to split that to a separate branch if preferred.

@wjt
Author

wjt commented Oct 19, 2018

This may seem a bit far-fetched, but I did actually hit this in practice. (Admittedly, it was while writing a test for exactly the same bug in another project that is tested with pytest, gcovr/gcovr#284, but I did hit that bug in practice!)

I see that test_cmdline_python_package_symlink has some special handling to test that symlinks are supported before running the test – I guess I need the same here, for Windows' benefit. (I don't have a Windows development system at the moment, so I'm being lazy and waiting for AppVeyor to test this.)

wjt added 2 commits October 19, 2018 13:10
CONTRIBUTING.rst claims the following:

   Or to only run tests in a particular test module on Python 3.6::

    $ tox -e py36 -- testing/test_config.py

But without this patch, this doesn't work: the arguments after -- are
ignored and all tests are run.

This fixes trying to traverse exponentially many paths in the presence
of symlink loops, and trying to run any tests discovered in that tree
exponentially many times if collecting ever finishes.

I wanted to also prevent visiting files more than once, but my first
attempt broke --keep-duplicates.

Fixes pytest-dev#624
@wjt
Author

wjt commented Oct 19, 2018

I'm told that, on Python 2.7 on Windows, st_dev and st_ino are always 0 which probably explains why this works so poorly! I guess I'll need to make the cycle-checking conditional somehow.
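
One way to make the check conditional might look like the sketch below. The helper names `identity_key` and `already_visited` are hypothetical, and the choice to skip cycle-checking when `st_ino` is 0 is an assumption for illustration, not what this PR does.

```python
import os


def identity_key(path):
    """Return a hashable identity for path, or None if none is usable.

    Hypothetical helper: when st_ino is 0 the (st_dev, st_ino) pair cannot
    tell files apart, so the caller has to skip cycle-checking.
    """
    st = os.stat(path)
    if st.st_ino == 0:
        return None
    return (st.st_dev, st.st_ino)


def already_visited(path, visited):
    """Record path in visited; return True if it was seen before."""
    key = identity_key(path)
    if key is None:
        return False  # cannot tell, so never report a repeat
    if key in visited:
        return True
    visited.add(key)
    return False
```

The trade-off is that, where no identity is available, symlink loops are not detected at all; that is no worse than the behaviour before the change.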

@@ -558,7 +560,17 @@ def _collectfile(self, path):
                return ()
        return ihook.pytest_collect_file(path=path, parent=self)

    def _check_visited(self, path):
        st = path.stat()
Contributor

@boxed will not like that.. ;)

You might be interested in #4237 and the discussion at #2206.

There is also _recurse in the python plugin.

I've also added seen_dirs in #4237, and I wonder whether a combination of realpath and this would be better.

Author

If you think stat() (a single syscall) is costly, why would realpath() be any better? (Its implementation is, roughly, split the path on the path separator, and call os.lstat() on each component.)
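
To make the comparison concrete, here are the two ways of identifying a directory side by side. This is only a sketch; `stat_key` and `realpath_key` are hypothetical names, not functions in pytest.

```python
import os


def stat_key(path):
    """Identify path by device and inode, using one os.stat() call."""
    st = os.stat(path)  # follows symlinks
    return (st.st_dev, st.st_ino)


def realpath_key(path):
    """Identify path by its canonical name with symlinks resolved."""
    return os.path.realpath(path)
```

For a symlink and its target, both functions return equal keys, so either can back a "seen" set; the difference is in how much filesystem work each one does per path.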

Author

Rather annoyingly, os.DirEntry (yielded by os.scandir()) includes the inode number but not the device number.
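
A small sketch of what that means in practice (`scan_identities` is a hypothetical name): the inode comes from `DirEntry.inode()`, but the device number still needs a `stat()` on the entry.

```python
import os


def scan_identities(directory):
    """Map each entry name to (st_dev, st_ino) of the entry itself.

    Hypothetical helper. Symlinks are not followed here, so a symlink
    entry reports its own identity, not that of its target.
    """
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            ino = entry.inode()  # provided by the DirEntry
            dev = entry.stat(follow_symlinks=False).st_dev  # needs a stat
            result[entry.name] = (dev, ino)
    return result
```

For cycle detection the identity of the symlink's target is what matters, so a following `stat()` would still be needed for symlink entries in any case.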

Contributor

It was a joke, I think. I have been trying to cut down on stat() calls because pytest has made millions of them in some rather simple test scenarios. A single stat() call here and there won't be a big deal. The problem is that there have been very many "just a single" things in pytest from 3.4 to 3.9, and some of those weren't really "single" because they were used in a loop (in a loop!).

So I don't know about this case, but you could try running my test script #2206 (comment) and see if performance is impacted significantly or not.

Contributor

Yeah, I was joking.
But I also think that since we're doing realpath already for symlink-resolving it might not be necessary to do any stat anymore on top.

I think this should be rebased on features, and then maybe only small changes are required to make the test pass.

@nicoddemus
Member

I think this should target features for the same reasons as mentioned in #4237 (comment). 👍

@nicoddemus changed the title from "Track visited files and directories when collecting" to "[WIP] Track visited files and directories when collecting" on Nov 7, 2018
@nicoddemus
Member

Hi @wjt,

It has been a long time since this PR last saw activity, and we have made sweeping changes on master to drop Python 2.7 and 3.4 support, so this PR has some conflicts which require attention.

In order to clear up our queue and let us focus on the active PRs, I'm closing this PR for now.

Please don't consider this a rejection of your PR; we just want to get this out of sight until you have the time to tackle it again. If you get around to working on this in the future, please don't hesitate to re-open it!

Thanks for your work, the team definitely appreciates it!

@nicoddemus closed this on Jun 6, 2019
Successfully merging this pull request may close these issues.

recursive symlink significantly slows collection
4 participants