Improved time counter used to compute test durations. #6939


Merged
merged 42 commits into pytest-dev:master on Mar 29, 2020

Conversation

@smarie (Contributor) commented Mar 19, 2020

CallInfo.start is now retro-computed as CallInfo.stop - duration, where duration is counted using time.perf_counter(). Fixes #4391
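
A rough sketch of the idea (simplified and illustrative; the helper name is made up and this is not the actual pytest implementation):

import time

def timed_call(func):
    # the duration is measured with the monotonic, high-resolution counter
    counter_start = time.perf_counter()
    outcome = func()
    duration = time.perf_counter() - counter_start

    # stop is read from the system clock; start is retro-computed from it
    stop = time.time()
    start = stop - duration
    return outcome, start, stop, duration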

  • Include documentation when adding new features.
  • Include new tests or update existing tests when applicable.
  • Allow maintainers to push and squash when merging my commits. Please uncheck this if you prefer to squash the commits yourself.


Sylvain MARIE added 2 commits March 19, 2020 15:08
…` is now retro-computed as `CallInfo.stop - duration`, where `duration` is counted using `time.perf_counter()`. Fixes pytest-dev#4391
@smarie (Contributor, Author) commented Mar 19, 2020

As of now I rely on time.perf_counter() as suggested by @Zac-HD in #4391 (comment), and not on timeit.default_timer as used in pytest-benchmark.
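
(Side note, not part of the PR: on CPython 3.3 and later, timeit.default_timer is bound to time.perf_counter, so both suggestions end up using the same clock:)

import time
import timeit

# On CPython 3.3+ the timeit default timer is perf_counter itself
assert timeit.default_timer is time.perf_counter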

@smarie (Contributor, Author) commented Mar 19, 2020

Failing tests seem to be caused by more precise duration counting. I'll check this and fix.

@bluetech (Member)

The retro-computing trick is clever but I'm not completely sure about it. From what I understand:

  1. If start and stop are not defined to have a reference point, then simply using a monotonic clock (time.perf_counter) directly is more appropriate, since any hybrid solution will inevitably lead to misunderstanding and incorrect assumptions.
  2. If start and stop are defined to be real/system time, then this is a breaking change since start is no longer that.

Since I've learned that Hyrum's Law definitely holds for pytest, I don't think (1) is an option.

One possible solution is to add a new duration attribute to CallInfo, which is calculated using perf_counter, and use that in the duration reports. And then document that start and stop use system time (clock can jump), but duration is precise.
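
What that could look like, roughly (a sketch assuming CallInfo remains an attrs class; the helper name is invented and this is not the implementation that was eventually merged):

import time
import attr

@attr.s
class CallInfo:
    start = attr.ib()     # system time (time.time()); can jump if the clock is adjusted
    stop = attr.ib()      # system time (time.time())
    duration = attr.ib()  # measured with time.perf_counter(): monotonic and precise

def call_and_time(func):
    counter_start = time.perf_counter()
    start = time.time()
    func()
    duration = time.perf_counter() - counter_start
    return CallInfo(start=start, stop=time.time(), duration=duration)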

@smarie (Contributor, Author) commented Mar 19, 2020

Thanks for the feedback @bluetech!

Well it seems that this suggestion from @Zac-HD (#4391 (comment)) was made with that concern in mind: we preserve the start/stop API because users may rely on it, and we just make sure that their difference is more accurate than before.

I do not see any reason why this would be considered a breaking change to the API, do you?

Sorry, I was not familiar with the "law" you mention, and in general I am not very fond of developers naming laws after themselves :). But I admit it makes a good reference point when users need to be convinced that an API should remain stable.

@smarie (Contributor, Author) commented Mar 19, 2020

Something very strange is happening: all Windows and macOS targets of the continuous integration fail, while the Linux ones are OK.

On macOS targets the issue is not related to my modifications of the pytest code, but to my improvement of the test itself. The test acceptance_tests.py::TestDurations.test_calls fails because the three tests always appear in the terminal, while test 1 should not (its duration should be too small, so it should be hidden). If I revert the test to a less precise match, it succeeds as before, but should we be satisfied with these durations? I do not think so:

0.03s call test_calls.py::test_2
0.03s call test_calls.py::test_3
0.02s call test_calls.py::test_1

(As a reminder: test 1 does sleep(0.005), test 2 does sleep(0.01), and test 3 does sleep(0.02). This is clearly not reflected in the numbers above.)

On Windows targets things behave much more randomly, with one or several of the tests below failing:

  • acceptance_tests.py::TestDurations.test_calls
  • acceptance_tests.py::TestDurations.test_calls_showall
  • acceptance_tests.py::TestDurations.test_calls
  • acceptance_tests.py::TestDurations.test_setup_function
  • acceptance_tests.py::TestDurationWithFixture.test_setup_function

I cannot reproduce this on my machine. Could this be related to parallelization with pytest-xdist?

Another possibility is that perf_counter is less precise than the documentation states?

@bluetech (Member)

I do not see any reason why this would be considered a breaking change to the API, do you?

Suppose a test runs for 10 minutes, and in the middle of that the system time is adjusted (correctly, or not). Before this change, start would hold the (system) time as it was when the test started. After it, start will be the (system) time when the test ended minus the duration, which is not the same.

I know it seems unlikely that someone relies on this, but that's why I cited the "law" and offered an alternative.

@@ -915,7 +915,8 @@ def test_calls(self, testdir):
["*durations*", "*call*test_3*", "*call*test_2*"]
)
result.stdout.fnmatch_lines(
["(0.00 durations hidden. Use -vv to show these durations.)"]
["(8 items with duration < 0.005s hidden. Use -vv to show these "
Member:

How come it shows 8 items, I wonder? I count 4 items in total, and of those only 2 are supposed to count.

Member:

That's because it also includes setup/teardown durations; here's the run with --durations=999 -vv:

===================== slowest 99 test durations =====================
0.02s call     test_durations.py::test_3
0.01s call     test_durations.py::test_2
0.00s call     test_durations.py::test_1
0.00s setup    test_durations.py::test_3
0.00s setup    test_durations.py::test_something
0.00s teardown test_durations.py::test_2
0.00s setup    test_durations.py::test_1
0.00s teardown test_durations.py::test_something
0.00s teardown test_durations.py::test_3
0.00s call     test_durations.py::test_something
0.00s teardown test_durations.py::test_1
0.00s setup    test_durations.py::test_2
========================= 4 passed in 0.08s =========================

With --durations=10 we get:

===================== slowest 10 test durations =====================
0.02s call     test_durations.py::test_3
0.01s call     test_durations.py::test_2

(8 items with duration < 0.005s hidden.  Use -vv to show these durations.)
========================= 4 passed in 0.06s =========================

Perhaps "items" is a bit misleading? How about "entries" instead?

Member:

i propose the word reports since i believe it may match better

@smarie (Contributor, Author), Mar 20, 2020:

@RonnyPfannschmidt, I also prefer "entries" over "reports": even if the latter more accurately reflects the inner objects (TestReport), it is a bit misleading for the average user. But obviously that is not my call here; I'll let the team decide.

Shall I also modify the "slowest xx test durations" header to remove the word "test"?

For example:

===================== slowest 10 durations =====================
0.02s call     test_durations.py::test_3
0.01s call     test_durations.py::test_2

(8 entries with duration < 0.005s hidden.  Use -vv to show these durations.)
========================= 4 passed in 0.06s =========================

Member:

@smarie of course, i'm sure that with more discussion and input we will reach a conclusion that will serve the users best,

i do have to admit that my bias is structurally technical (aka i like the ux and the technicalities to match, so that there is no dissonance between what happens on the technical level and what happens on the ux level), but it may be better to change the technical aspect to match the comprehensible ux rather than the other way around

Contributor (Author):

I eventually modified it this way:

===================== slowest 10 durations =====================
0.02s call     test_durations.py::test_3
0.01s call     test_durations.py::test_2

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
========================= 4 passed in 0.06s =========================

Let me know if that works for everyone

Member:

I like it 👍

@bluetech (Member)

Something very strange is happening: all Windows and macOS targets of the continuous integration fail, while the Linux ones are OK.

I don't think that test is a good idea if it relies on actual system behavior. The system can be slow or switch threads, etc., and the time of a single test can be unbounded, so using an exact count would be flaky. Maybe just monkeypatch the time functions for this test to get consistent results?
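
A minimal illustration of that idea (a hypothetical, self-contained test, not the one used in this PR): monkeypatching time.perf_counter with a fake counter makes the measured duration deterministic regardless of machine speed:

import time
import pytest

def test_duration_is_deterministic(monkeypatch):
    # every call to the fake counter advances by exactly 0.1 "seconds"
    ticks = (i * 0.1 for i in range(1000))
    monkeypatch.setattr(time, "perf_counter", lambda: next(ticks))

    start = time.perf_counter()
    stop = time.perf_counter()
    assert stop - start == pytest.approx(0.1)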


@smarie (Contributor, Author) commented Mar 20, 2020

For information, I did some tests (gist available here) with the Python timing tools, and I have to say that I am quite puzzled by the results: there is no apparent stability on Windows 10. So I will not try to reinvent the wheel and will assume that the official Python documentation is the source to trust: therefore we should use perf_counter or perf_counter_ns for the best precision.

Now,

  • I agree with @bluetech: in the long run it will probably be much clearer / more readable to have a duration attribute on CallInfo, and to use it in the TestReport. I also suggest adding a duration_ns attribute when Python >= 3.7 (see the sketch after this list). EDIT: I implemented this in the PR now, you can check it out. Note that I could not find any test in the acceptance tests checking whether "duration" is available as an attribute on the report object, so I did not know where I should test for the existence of the new duration_ns and its relation to duration (duration_ns is 1e9 times the other).

  • I am OK with monkeypatching the time module for the tests, but... what can we possibly use as a replacement so that the results are OK? An alternative is to have more permissive tests on Windows and macOS targets.
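
Regarding the first point, here is a sketch of the duration / duration_ns relation mentioned above (illustrative only; as discussed further down, duration_ns was eventually dropped from the PR):

import time

def measure(func):
    # Python >= 3.7: perf_counter_ns() returns integer nanoseconds,
    # avoiding any float rounding in the counter itself
    start_ns = time.perf_counter_ns()
    func()
    duration_ns = time.perf_counter_ns() - start_ns
    duration = duration_ns / 1e9  # seconds; duration_ns is 1e9 times duration
    return duration, duration_ns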

""" Result/Exception info a function invocation. """
""" Result/Exception info a function invocation.

:param excinfo: `ExceptionInfo` describing the captured exception if any
Member:

Suggested change
:param excinfo: `ExceptionInfo` describing the captured exception if any
:param excinfo: (Optional[ExceptionInfo]) The captured exception, if any.

Contributor (Author):

I'm not fond of using typing symbols in docstrings; they are not for humans. Especially as the real official meaning of Optional is that the type is optional (hence the attribute can be None).

Also, Sphinx might like the use of backticks, no?

Member:

The sphinx docs already display type annotations for annotated functions, for example: https://docs.pytest.org/en/latest/reference.html#pytest-exit. Normally with annotations we don't need to write types in comments, but attrs is a special case.

I would not say type annotations are not for humans -- programmers are humans too, at least for now :) I expect a lot of Python programmers will eventually learn what they mean.

I like to have the Optional, it tells me immediately it might be None, no need to read the text.

I'm not sure about the backticks; I suggested removing them because you didn't add them for the others. I'm fine with whatever works...

Contributor (Author):

Oh thanks, I forgot to include them for the others.

@bluetech (Member)

Seems like another rebase is needed to fix CI, now due to #6976 -- bad luck :)

@smarie (Contributor, Author) commented Mar 27, 2020

Regarding duration_ns, I can think of two reasons to have it, but looking closer I'm not sure it's very beneficial:

You are right, this is probably overkill. I had a look at PEP 564 and it seems to mostly state that the issue is keeping nanosecond resolution when measuring very long durations (> 104 days).

So completely out of scope for pytest. I'll remove this.

@smarie (Contributor, Author) commented Mar 27, 2020

Everything should now be OK. Let me know!

@@ -226,7 +226,7 @@ def call_runtest_hook(item, when: "Literal['setup', 'call', 'teardown']", **kwds
class CallInfo:
""" Result/Exception info a function invocation.

:param excinfo: (`ExceptionInfo`, optional) describing the captured exception if any
:param excinfo: (`ExceptionInfo`, optional) describing the captured exception if any.
:param start: (`float`) The system time when the call started, in seconds since the epoch.
:param stop: (`float`) The system time when the call ended, in seconds since the epoch.
:param duration: (`float`) The call duration, in seconds.
@blueyed (Contributor), Mar 27, 2020:

The format is :param type name:, e.g. :param float start:.
You could also add it below the attribs:

start = attr.ib()
"""The system time when the call started, in seconds since the epoch.
:type: float"""

Check with make -C doc/en html / tox -e docs though, of course.

Contributor (Author):

Thanks for the feedback. I'll simply update the class docstring then, if it does not make much difference.

Note that I can't check the doc generation, as tox -e docs does not seem to work on Windows:

docs run-test: commands[0] | sh -c 'towncrier --draft > doc/en/_changelog_towncrier_draft.rst'
ERROR: InvocationError for command could not find executable sh

And with make I seem to get an error:

(tools_py37) C:\_dev\python_ws\_Libs_OpenSource\pytest>make -C doc/en html
make: Entering directory `C:/_dev/python_ws/_Libs_OpenSource/pytest/doc/en'
Running Sphinx v2.4.4

Configuration error:
There is a programmable error in your configuration file:

Traceback (most recent call last):
  File "c:\miniconda3\envs\tools_py37\lib\site-packages\sphinx\config.py", line 348, in eval_config_file
    execfile_(filename, namespace)
  File "c:\miniconda3\envs\tools_py37\lib\site-packages\sphinx\util\pycompat.py", line 81, in execfile_
    exec(code, _globals)
  File "C:\_dev\python_ws\_Libs_OpenSource\pytest\doc\en\conf.py", line 22, in <module>
    from _pytest.compat import TYPE_CHECKING
ImportError: cannot import name 'TYPE_CHECKING' from '_pytest.compat' (c:\miniconda3\envs\tools_py37\lib\site-packages\_pytest\compat.py)

make: *** [html] Error 2
make: Leaving directory `C:/_dev/python_ws/_Libs_OpenSource/pytest/doc/en'

So for the sake of completing the PR I chose the easy way, as you seem to say that either way works OK :)

Member:

I'm also on Windows, and I use tox -e docs to check/generate documentation. 😁

Contributor (Author):

@nicoddemus which installer do you use for the missing sh required by tox -e docs? (see error message above)
Git Bash, Cygwin, or some conda package?

Member:

Ahh sorry missed that one. My bash comes from git:

λ where bash
C:\Program Files\Git\usr\bin\bash.exe

It is a bit annoying to require bash on Windows just to generate the docs...

@nicoddemus (Member) left a review comment:

Thanks a lot for the great work and patience @smarie!

@bluetech bluetech merged commit 95fadd5 into pytest-dev:master Mar 29, 2020
@bluetech (Member)

Squashed and merged. Thanks @smarie, and @nicoddemus and @blueyed for review.

Successfully merging this pull request may close these issues:

  • Get the best available time counter for node call duration calculation