Description
Context: I write Hypothesis, a randomized testing library for Python. It works "well" under py.test, but only in the sense that it ignores py.test almost completely, aside from doing its best to expose functions in a way that py.test fixtures can understand.
A major problem with using Hypothesis under py.test is that function-level fixtures get evaluated once per top-level test function, not once per example. When these fixtures are mutable and the test mutates them, this is really bad: you end up running the test body many times against the same fixture instance, changing it each time.
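To make the failure mode concrete, here is a minimal sketch of a reproduction (the fixture and test names are hypothetical, not from Hypothesis itself): the function-scoped fixture is built once, every generated example then mutates the same list, and the test can fail for reasons unrelated to the property being tested.

```python
import pytest
from hypothesis import given
from hypothesis.strategies import integers

@pytest.fixture
def seen():
    # Function-scoped: created once per test *function*, not once per example.
    return []

@given(x=integers())
def test_distinct(seen, x):
    # Every example mutates the same list, so state leaks between examples
    # and the assertion can fail spuriously on a repeated value.
    assert x not in seen
    seen.append(x)
```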
People keep running into this issue, but currently it seems to be impossible to fix without significant changes to py.test. @RonnyPfannschmidt asked me to write a ticket about this as an example use case for subtests, so here I am.
So what's the problem?
A test using Hypothesis looks something like:
```python
from hypothesis import given
from hypothesis.strategies import integers

@given(b=integers())
def test_some_stuff(a, b):  # a is an ordinary py.test fixture
    ...
```
This translates into something approximately like:
```python
def test_some_stuff(a, b=special_default):
    if b == special_default:
        # Invoked by py.test: run the body once per generated example.
        for b in examples():
            ...
    else:
        # Invoked with an explicit value for b: run the body once.
        ...
```
The key problem here is that `examples()` cannot be evaluated at collect time because it depends on the results of previous test execution.
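For contrast, py.test's native mechanism for running a body many times is collect-time parametrization, which requires the complete list of values before any test runs (a sketch using the standard `pytest.mark.parametrize` API):

```python
import pytest

# This works because the value list is fully known at collect time, so
# py.test can collect one test item per value. Hypothesis has no such
# list: which example comes next depends on how earlier examples fared.
@pytest.mark.parametrize("b", [1, 2, 3])
def test_some_stuff(b):
    ...
```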
The reasons for this, in order of decreasing "this seems to be impossible" (i.e. with the current feature set of py.test I have no idea how to solve the first, and neither does anyone else; I could maybe solve the second; I could definitely do something about the third):
- The fundamental blocker is that this is a two-phase process. You've got an initial generate phase, but then if a failure is found you have a "simplify" phase, which runs multiple simplify passes over the failing example. The space of possible examples to explore here is essentially infinite and depends intimately on the structure of the failing test (a rough sketch of this two-phase loop follows the list).
- The number of examples run depends both on timing (Hypothesis stops running examples after a configurable timeout) and on what the test does. In particular, tests can raise an UnsatisfiedAssumption exception, which causes the example not to count towards the maximum number of examples to run (there is an additional, larger cap which does count these).
- Some examples may be skipped if they come from the same batch as something that produced an UnsatisfiedAssumption error.
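To make the shape of that control flow concrete, here is a minimal sketch of a generate-then-simplify loop. This is not Hypothesis's actual engine: `generate`, `shrink_candidates`, `max_examples`, and `timeout` are hypothetical stand-ins, and the local `UnsatisfiedAssumption` class merely stands in for Hypothesis's real exception.

```python
import time

class UnsatisfiedAssumption(Exception):
    """Stand-in for the exception Hypothesis raises on a failed assume()."""

def run_hypothesis_style(test, generate, shrink_candidates,
                         max_examples=200, timeout=60.0):
    # Phase 1: generate. How many examples actually run depends on
    # wall-clock time and on how many raise UnsatisfiedAssumption,
    # neither of which is knowable at collect time.
    deadline = time.monotonic() + timeout
    valid_runs = 0
    failing = None
    while valid_runs < max_examples and time.monotonic() < deadline:
        example = generate()
        try:
            test(example)
        except UnsatisfiedAssumption:
            continue  # rejected example: does not count towards max_examples
        except AssertionError:
            failing = example
            break
        else:
            valid_runs += 1
    if failing is None:
        return None  # no counterexample found

    # Phase 2: simplify. Candidates are derived from the current failing
    # example, so the space explored depends on what actually failed.
    improved = True
    while improved:
        improved = False
        for candidate in shrink_candidates(failing):
            try:
                test(candidate)
            except UnsatisfiedAssumption:
                continue
            except AssertionError:
                failing = candidate  # a simpler example that still fails
                improved = True
                break
    return failing
```

Every iteration of both loops consults the outcome of the previous test call, which is exactly what a collect-everything-then-run model cannot express.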