Break down tests having multiple checks and xfail decorated #4497
Conversation
Force-pushed from cb187be to 40bbaef
I checked it now. Great work disentangling things!
I left a couple of comments that need to be addressed.
Importantly, we have to make sure that all the removed xfails around `check_logp` and `check_logcdf` are warranted. For this we have to temporarily set `n_samples=-1` in the `check_logp` and `check_logcdf` methods to force pytest to test all combinations of parameter values. After confirming the tests do not fail on float32, we can revert back to the original `n_samples`.
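For readers not familiar with these helpers, here is a rough standalone sketch of the idea behind `n_samples=-1` (names and internals paraphrased, not the actual pymc3 code): by default only a random subset of parameter combinations is checked on each run, and `-1` forces the exhaustive check.

```python
import itertools
import random

def param_combinations(paramdomains, n_samples=100):
    """Yield one dict of parameter values per combination, optionally sub-sampled."""
    names = list(paramdomains)
    all_combos = [
        dict(zip(names, values))
        for values in itertools.product(*(paramdomains[name] for name in names))
    ]
    if n_samples == -1 or n_samples >= len(all_combos):
        return all_combos  # exhaustive: what we want while auditing removed xfails
    return random.sample(all_combos, n_samples)  # default: a random subset per run

# param_combinations({"mu": (-1, 0, 1), "sigma": (0.5, 1, 2)}, n_samples=-1)
# returns all 9 combinations instead of a random sample.
```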
```diff
@@ -51,6 +50,8 @@ def test_dict_input(self):
         m, l = utils.any_to_tensor_and_labels(self.data.to_dict("list"))
         self.assertMatrixLabels(m, l, mt=self.data[l].values, lt=l)

     @pytest.mark.xfail
```
Can we dig into the reason for this one?
I think that `pymc3.glm.utils.any_to_tensor_and_labels` doesn't support converting data structures containing TensorType (mostly because `theano.tensor.as_tensor_variable`, called on line 132, doesn't). So, in this case `inp` will cause an error to be raised.
I haven't been able to understand what conditions made this test pass. I guess that if we need `pymc3.glm.utils.any_to_tensor_and_labels` to be able to handle data structures containing tensor objects, we should probably change that.
I tracked it down to this commit: 4f67d63
@rpgoldman could you chime in? What is the source of this failure? It would be helpful to have a clearer reason specified in the xfail.
The xfail was there before, so this shouldn't block us from moving forward with this PR.
Let's keep an eye on this one: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551266 It will show us whether the removed xfails were safe, and whether the added one for the normal logcdf was needed or not. After this we can go ahead and test the removed xfails that are still not getting `n_samples=-1`.
This one on float64 is new: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551515
Looks like a scipy numerical issue.
I went ahead and checked what Passed/Failed as expected. Everywhere I wrote "Passed/XPassed as expected" you can revert the `n_samples=-1` and mark the conversation as resolved.
Almost there!!!!
Great work. Do you want to remove the unnecessary `n_samples` changes and comments, and check those that we skipped on the first run?
Force-pushed from 1dbcf79 to e6c1c4c
Xfail is going to care only about the first failed assert statement and ignore all the other ones in the same test. Sometimes xfail is used to decorate a test or class function calling several other sub-functions. This makes it hard to monitor what is happening for the sub-functions that are not failing (given that the failure will take priority and mark the test function as xfailed). By breaking down each function into individual sub-functions it's easier to monitor individual test behavior.
Also add a comment regarding a test with an unpredictable outcome.
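A generic illustration of the problem described above (hypothetical test names, not taken from the code base): a single xfail-decorated test stops reporting at its first failing assert, while splitting the checks gives each one its own status.

```python
import pytest

@pytest.mark.xfail(reason="only the first failing check is ever reported")
def test_bundled_checks():
    assert 1 + 1 == 3  # fails here; the xfail swallows everything below
    assert 2 + 2 == 4  # never executed, so we cannot tell whether it still passes

# Broken down, the passing check is no longer hidden behind the xfail:
@pytest.mark.xfail(reason="known failure")
def test_first_check():
    assert 1 + 1 == 3

def test_second_check():
    assert 2 + 2 == 4
```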
Force-pushed from a7cc69b to 32de1a4
Okay, this should be it. If nobody chimes in about the unclear xfail, I say we let it stay as a fossil. Your changes recovered 3/4 of those tests, so that's an improvement :) Anything missing?
It seems good to me. Thanks for assisting me while working on the PR and reviewing it! Out of curiosity, how did you realize that moving the domain from R to the narrower custom one would fix the scipy issue?
I just tweaked the edge values of the R domain (the -inf, +inf are excluded). Since the issue was on the scipy side it was easy to test what was causing it to fail and whether the new values would fix it:

```python
import numpy as np
import scipy.stats as st

R = (-2.1, -1, -.01, 0, 0.01, 1, 2.1)
Rplusbig = (0.5, 0.9, 0.99, 1, 1.5, 2, 20)

for mu in R:
    for sigma in Rplusbig:
        for value in R:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f'{mu=}, {sigma=}, {value=}')
# mu=2.1, sigma=0.5, value=-2.1
```

```python
R = (-2.1, -1, -.01, 0, 0.01, 1, 2.1)
Rplusbig = (0.5, 0.9, 0.99, 1, 1.5, 2, 20)
Custom = (-1.5, -1, -.01, 0, .01, 1, 1.5)

for mu in R:
    for sigma in Rplusbig:
        for value in Custom:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f'{mu=}, {sigma=}, {value=}')
# Nothing is printed
```
I see that you two already had a thorough discussion.
Just one thing: instead of testing with a precision of 1 or 2, shouldn't we `xfail` these tests at the tighter precision on float32? When the `xfail` stops failing we'll then know that the underlying numerics were improved and we can tighten the test.
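A hypothetical sketch of that suggestion (not the actual pymc3 test): keep the tight 6-decimal tolerance and mark only the float32 run as an expected failure, so a future XPASS would tell us the numerics improved.

```python
import numpy as np
import pytest

FLOATX = "float32"  # stand-in for theano/aesara config.floatX in the real suite

@pytest.mark.xfail(
    FLOATX == "float32",
    reason="Normal logcdf only accurate to ~2 decimals on float32",
)
def test_normal_logcdf_tight():
    # dummy numbers standing in for the pymc3 vs scipy comparison
    ours, reference = -1.843, -1.8410216  # pretend float32 result drifts in the 3rd decimal
    np.testing.assert_almost_equal(ours, reference, decimal=6)
```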
```diff
@@ -902,6 +902,7 @@ def test_normal(self):
             R,
             {"mu": R, "sigma": Rplus},
             lambda value, mu, sigma: sp.norm.logcdf(value, mu, sigma),
+            decimal=select_by_precision(float64=6, float32=2),
```
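For context, `select_by_precision` is, if I remember correctly, just a switch on the configured float type; a paraphrased sketch rather than a quote of the file:

```python
import theano  # `aesara` in later versions

def select_by_precision(float64, float32):
    """Pick the decimal tolerance according to the precision theano is running at."""
    return float64 if theano.config.floatX == "float64" else float32

# e.g. decimal = select_by_precision(float64=6, float32=2)
```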
float32 precision of 2 digits sounds awfully bad. Are you sure about this?
It's actually better than the normal::logp test (this one is the normal::logcdf), which uses a precision of 1! (That predates this PR.)
This used to pass before because these tests were all seeded until very recently; the low-precision conditions have only been appearing and failing since then.
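A small generic illustration of why the seeding mattered (not the actual test code): with a fixed seed the same subset of parameter combinations is drawn on every run, so a numerically unstable combination may simply never be selected.

```python
import random

combos = [(mu, sigma) for mu in (-2.1, -1, 0, 1, 2.1) for sigma in (0.5, 1, 20)]

random.seed(42)
print(random.sample(combos, 5))  # identical on every run: the same points get tested

random.seed()                    # unseeded: a different subset each run, which
print(random.sample(combos, 5))  # eventually hits the numerically unstable points
```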
```python
            Rplus,
            {"mu": Rplus, "alpha": Rplus},
            lambda value, mu, alpha: sp.invgauss.logpdf(value, mu=mu, loc=alpha),
            decimal=select_by_precision(float64=6, float32=1),
```
Here too - 1 digit precision?!
This preceded this PR.
By the way, these 1-digit precision comparisons are not necessarily as bad as they may sound. It's usually when the log becomes very negative, close to underflowing to -inf: things like -213.161 vs -213.172, which mean basically nothing.
Also, many times we are dealing with loss of precision on the scipy side, not on our side.
I doubt we will ever notice xfails becoming xpasses, especially because only a subset of combinations is tested in each run and xpasses already happen now and then by chance alone. I feel that low precision is better than no checking at all (i.e. marking as xfail), provided this only happens for the 32-bit version. If changes are done on our side that improve the numerics, I think it's natural to try to increase the test precision when that happens.
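To put numbers on that, using the example values above: at `decimal=1` the check only demands an absolute difference below 0.15, yet for log-probabilities of that magnitude the relative error is already tiny.

```python
import numpy as np

a, b = -213.161, -213.172
print(abs(a - b))            # 0.011 absolute difference
print(abs(a - b) / abs(b))   # ~5e-5 relative difference: negligible for a log-prob
np.testing.assert_almost_equal(a, b, decimal=1)  # passes: tolerance is 1.5 * 10**-1
```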
The case could be made for selecting testing points more carefully so that we don't test so often values that are numerically unstable in either our version or scipy's. That would deserve its own PR, I think.
Also, it's worth mentioning that it's really difficult to test float32 locally on a 64-bit installation. For aesara we can change `config.floatX`, but for the scipy counterpart I have not found a way to reliably emulate a float32 computation. Wrapping the output in a float32 does not suffice (in this PR I sometimes tried it locally and had the tests succeed with some precision, just to find they would still fail on the actual float32 CI runs here).
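A quick demonstration of that last point (assumed example, not from the PR): casting scipy's output to float32 only rounds an internal float64 computation after the fact, which is not the same as an end-to-end float32 pipeline.

```python
import numpy as np
import scipy.stats as st

x = np.float32(-2.1)
full64 = st.norm.logcdf(np.float64(x))   # scipy computes in float64 either way
cast32 = np.float32(st.norm.logcdf(x))   # same float64 math, merely rounded at the end
print(full64, cast32)  # they agree up to float32 rounding, which says nothing about
                       # what a genuine float32 (theano floatX) computation would give
```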
For completeness here is a list of tests in the test_distributions.py module which are being evaluated with a precision of less than 3 decimals on float32:
Additionally, we have 14 tests marked as xfail on float32. Total number of tests: 201.
You're right that the precision doesn't deal with the digits before the decimal point. Nevertheless, we should make sure that we notice when one of these xfails starts passing; pytest's strict xfail mode could help with that.
I didn't know about the strict mode of xfail. That looks useful. I am afraid it would require some extra refactoring, because the tests (at least those we were looking at in this PR) do not fail deterministically. Should we open a new issue to address this possibility?
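For reference, a minimal generic example of the strict mode (not tied to any pymc3 test): with `strict=True` an unexpected pass turns into a suite failure, so an xfail that starts passing cannot go unnoticed.

```python
import pytest

@pytest.mark.xfail(strict=True, reason="expected failure; an unexpected pass errors out")
def test_strict_xfail_demo():
    assert 1 + 1 == 3  # if this ever starts passing, pytest reports XPASS(strict) and fails
```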
Looking at the PR title, I'd say it would be in scope. But the PR has a lot of discussion & history already, so @DRabbit17 @ricardoV94 you should decide how to proceed.
As I understood it, the goal of this PR was simply to "unshadow" tests that were unnecessarily bundled inside a shared xfail. This is done, I think. What you are proposing is to now look at the xfails that remain relevant and see if we can make them strict, so that xpasses would trigger a failure. There are some tests that look deterministic, so in theory they could, but I would guess the majority cannot be made strict unless we change the actual tests, as they fail randomly. This would require looking at all the xfails, not just the ones that were changed in this PR due to "bundling". What do you think @DRabbit17?
PS: I agree this would be valuable!
Thanks for suggesting introducing the strict mode for the tests. I also think that using that option would be valuable. I think having that change in another PR would make the process tidier; as mentioned, we already have a lot of discussion in this PR. Also, I think that in the original scope of this PR we were only meant to extract the non-failing sub-checks from the xfail-decorated ones and not change the behavior of any individual test/check, although we ended up doing so by modifying the support of some tests and the check precision (and possibly some other things I cannot think of). Adding the strict xfail would add further changes to how the tests behave. Having said that, I feel that these are mostly philosophical considerations :-). In practice, I don't have a strong opinion on whether to add the strict change in this PR or in another one. Happy to do it however you guys feel is better (or to flip a coin to decide). I also agree that it's a change worth adding.
Let's merge
Thanks @DRabbit17! Looking forward to your next PR
Closes #4425
Xfail is going to care only about the first failed assert statement and ignore all the other ones in the same test.
Sometimes xfail is used to decorate a test or class function calling several other sub-functions. This makes it hard to monitor what is happening for the sub-functions that are not failing (given that the failure will take priority and mark the test function as xfailed). By breaking down each function into individual sub-functions it's easier to monitor individual test behavior.