Break down tests having multiple checks and xfail decorated #4497

Conversation

Contributor

@matteo-pallini matteo-pallini commented Mar 2, 2021

Closes #4425

xfail only reports the first failed assert statement and ignores all the other ones in the same test.

Sometimes xfail is used to decorate a test or class method that calls several sub-functions. This makes it hard to monitor what is happening in the sub-functions that are not failing (since the first failure takes priority and marks the whole test as xfailed). Breaking each such test down into individual sub-tests makes it easier to monitor the behavior of each check.
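To make the motivation concrete, here is a minimal sketch of the kind of split this PR performs (the check_case_* helpers are hypothetical, invented purely for illustration):

import pytest

def check_case_a(): ...
def check_case_b(): raise AssertionError("known issue")
def check_case_c(): ...

# Before: a single xfail hides everything after the first failing check.
@pytest.mark.xfail(reason="known issue in case B")
def test_cases_bundled():
    check_case_a()
    check_case_b()  # fails here; case C never gets a reported result
    check_case_c()

# After: only the failing check keeps the xfail; the others report normally.
def test_case_a():
    check_case_a()

@pytest.mark.xfail(reason="known issue in case B")
def test_case_b():
    check_case_b()

def test_case_c():
    check_case_c()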


@matteo-pallini matteo-pallini changed the title Break down tests having multiple checks and xfail decorated [WIP] Break down tests having multiple checks and xfail decorated Mar 2, 2021
@matteo-pallini matteo-pallini changed the title [WIP] Break down tests having multiple checks and xfail decorated WIP: Break down tests having multiple checks and xfail decorated Mar 2, 2021
@matteo-pallini matteo-pallini changed the title WIP: Break down tests having multiple checks and xfail decorated [WIP] Break down tests having multiple checks and xfail decorated Mar 2, 2021
@matteo-pallini matteo-pallini force-pushed the fix-single-pytest-xfail-calls-being-applied-to-multiple-test-subfunctions branch 4 times, most recently from cb187be to 40bbaef Compare March 2, 2021 20:33
Member

@ricardoV94 ricardoV94 left a comment

I checked it now. Great work disentangling things!

I left a couple of comments that need to be addressed.

Importantly, we have to make sure that all the xfail removals around check_logp and check_logcdf are warranted. For this we have to temporarily set n_samples=-1 in the check_logp and check_logcdf methods to force pytest to test all combinations of parameter values. After confirming the tests do not fail on float32, we can revert back to the original n_samples.
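For reference, a sketch of what this temporary change looks like, assuming the check_logp signature used in pymc3/tests/test_distributions.py at the time (shown as an excerpt of TestMatchesScipy rather than standalone code; R, Rplus and check_logp are the module's own helpers):

self.check_logp(
    Normal,
    R,
    {"mu": R, "sigma": Rplus},
    lambda value, mu, sigma: sp.norm.logpdf(value, mu, sigma),
    n_samples=-1,  # temporarily test every combination of parameter values
)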

@@ -51,6 +50,8 @@ def test_dict_input(self):
m, l = utils.any_to_tensor_and_labels(self.data.to_dict("list"))
self.assertMatrixLabels(m, l, mt=self.data[l].values, lt=l)

@pytest.mark.xfail
Member

Can we dig up the reason for this one?

Contributor Author

@matteo-pallini matteo-pallini Mar 3, 2021

I think that pymc3.glm.utils.any_to_tensor_and_labels doesn't support converting data structures containing TensorType (mostly because theano.tensor.as_tensor_variable, called on line 132, doesn't). So in this case inp will cause an error to be raised.

I haven't been able to work out what conditions made this test pass. I guess that if we need pymc3.glm.utils.any_to_tensor_and_labels to handle data structures containing tensor objects, we should probably change that.

Member

I tracked it down to this commit: 4f67d63

@rpgoldman could you chime in? What is the source of this failure? It would be helpful to have a clearer reason specified in the xfail.

Member

The xfail was there before, so this shouldn't block us from moving forward with this PR.

@ricardoV94
Member

Let's keep an eye on this one: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551266

It will show us whether the removed xfails were safe, and whether the added one for the normal logcdf was needed or not. After this we can go ahead and test the removed xfails that are still not getting n_samples=-1.

@ricardoV94
Member

This one on float64 is new: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551515

FAILED pymc3/tests/test_distributions.py::TestMatchesScipy::test_moyal_logp
E           AssertionError: 
E           Arrays are not almost equal to 6 decimals
E           {'mu': array(2.1), 'sigma': array(0.5), 'value': array(-2.1)}
E           x and y -inf location mismatch:
E            x: array(-2219.559165)
E            y: array(-inf)

Looks like a scipy numerical issue:

import scipy.stats as st

st.moyal(2.1, 0.5).logpdf(-2.1)
# -inf

Member

@ricardoV94 ricardoV94 left a comment

I went ahead and checked what passed/failed as expected. Everywhere I wrote "Passed/XPassed as expected" you can revert the n_samples=-1 and mark the conversation as resolved.

Almost there!!!!

@ricardoV94
Member

Great work. Do you want to remove the unnecessary n_samples changes and comments, and check those that we skipped on the first run?

@matteo-pallini matteo-pallini force-pushed the fix-single-pytest-xfail-calls-being-applied-to-multiple-test-subfunctions branch from 1dbcf79 to e6c1c4c Compare March 3, 2021 15:57
@matteo-pallini matteo-pallini changed the title [WIP] Break down tests having multiple checks and xfail decorated Break down tests having multiple checks and xfail decorated Mar 3, 2021
Also add comment regarding test with unpredictable outcome
@ricardoV94 ricardoV94 force-pushed the fix-single-pytest-xfail-calls-being-applied-to-multiple-test-subfunctions branch from a7cc69b to 32de1a4 Compare March 4, 2021 10:34
@ricardoV94
Member

ricardoV94 commented Mar 4, 2021

Okay, this should be it. If nobody chimes in about the unclear xfail, I say we let it stay as a fossil. Your changes recovered 3/4 of those tests, so that's an improvement :)

Anything missing?

@matteo-pallini
Contributor Author

It seems good to me. Thanks for assisting me while working on the PR and for reviewing it!

Out of curiosity, how did you realize that moving the domain from R (which I assume stands for real) to a custom one would fix the issue? Specifically, how did you pick the custom one Domain([-inf, -1.5, -1, -0.01, 0.0, 0.01, 1, 1.5, inf])?

@ricardoV94
Member


I just tweaked the edge values of the R domain (the -inf, +inf are excluded). Since the issue was on the scipy side it was easy to test what was causing it to fail and whether the new values would fix it:

import numpy as np
import scipy.stats as st

R = (-2.1, -1, -0.01, 0, 0.01, 1, 2.1)
Rplusbig = (0.5, 0.9, 0.99, 1, 1.5, 2, 20)

for mu in R:
    for sigma in Rplusbig:
        for value in R:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f"{mu=}, {sigma=}, {value=}")

# mu=2.1, sigma=0.5, value=-2.1

Custom = (-1.5, -1, -0.01, 0, 0.01, 1, 1.5)

for mu in R:
    for sigma in Rplusbig:
        for value in Custom:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f"{mu=}, {sigma=}, {value=}")

# Nothing is printed

Member

@michaelosthege michaelosthege left a comment

I see that you two already had a thorough discussion.

Just one thing: instead of loosening the precision to 1 or 2 decimals, shouldn't we keep the tighter precision and xfail these tests on float32?
When the xfail stops failing we'll know that the underlying numerics have improved and can tighten the test again.

@@ -902,6 +902,7 @@ def test_normal(self):
R,
{"mu": R, "sigma": Rplus},
lambda value, mu, sigma: sp.norm.logcdf(value, mu, sigma),
decimal=select_by_precision(float64=6, float32=2),
Member

float32 precision of 2 digits sounds awfully bad. Are you sure about this?

Member

@ricardoV94 ricardoV94 Mar 4, 2021

It's actually better than the normal::logp test (this one is normal::logcdf), which uses a precision of 1! (That precision predates this PR.)

Member

This used to pass because, until very recently, these tests were all seeded, so the low-precision conditions were simply not appearing; they have only started failing since then.

Rplus,
{"mu": Rplus, "alpha": Rplus},
lambda value, mu, alpha: sp.invgauss.logpdf(value, mu=mu, loc=alpha),
decimal=select_by_precision(float64=6, float32=1),
Member

Here too - 1 digit precision?!

Member

@ricardoV94 ricardoV94 Mar 4, 2021

This preceded this PR.

Member

@ricardoV94 ricardoV94 Mar 4, 2021

By the way, these precision-1 comparisons are not necessarily as bad as they may sound. They usually arise when the log becomes very negative, close to underflowing to -inf: things like -213.161 vs -213.172, a difference that means basically nothing.

Also, many times we are dealing with loss of precision on the scipy side, not on our side.
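To put those numbers in context, a small illustration, assuming the comparison ultimately goes through numpy's assert_almost_equal, which treats two values as equal when abs(actual - desired) < 1.5 * 10**(-decimal):

import numpy as np

# -213.161 vs -213.172: an absolute difference of 0.011 on a large negative log.
np.testing.assert_almost_equal(-213.161, -213.172, decimal=1)  # passes: 0.011 < 0.15
np.testing.assert_almost_equal(-213.161, -213.172, decimal=2)  # passes: 0.011 < 0.015

try:
    np.testing.assert_almost_equal(-213.161, -213.172, decimal=3)  # 0.011 >= 0.0015
except AssertionError:
    print("not equal to 3 decimals")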

@ricardoV94
Member

ricardoV94 commented Mar 4, 2021

I doubt we will ever notice xfails becoming xpasses, especially because only a subset of parameter combinations is tested in each run and xpasses already happen now and then by chance alone.

I feel that low precision is better than no checking at all (i.e. marking as xfail), provided this only happens for the 32-bit version. If changes are made on our side that improve the numerics, I think it's natural to try to increase the precision in the tests at that point.

@ricardoV94
Member

The case could be made for selecting testing points more carefully, so that we don't test so often on values that are numerically unstable in either our version or scipy's.

That would deserve its own PR, I think.

@ricardoV94
Member

Also, it's worth mentioning that it's really difficult to test float32 locally on a 64-bit installation. For aesara we can change config.floatX, but for the scipy counterpart I have not found a way to reliably emulate a float32 computation. Wrapping the output in a float32 does not suffice (in this PR I sometimes tried it locally and had the tests succeed at some precision, only to find they would still fail on the actual float32 CI runs here).
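A sketch of the attempt described above (assuming an aesara/theano-pymc installation; the scipy half is exactly the part that cannot be reliably emulated):

import numpy as np
import aesara  # or theano-pymc, depending on the pymc3 version in use
import scipy.stats as st

# The aesara/pymc3 side of the computation can be forced to float32:
aesara.config.floatX = "float32"

# But scipy still computes in float64 internally; casting the result only
# rounds the final value, so it does not reproduce the precision loss seen
# on a genuine float32 CI run.
reference = np.float32(st.norm.logcdf(-10.0, loc=0.0, scale=1.0))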

@ricardoV94
Member

ricardoV94 commented Mar 5, 2021

For completeness here is a list of tests in the test_distributions.py module which are being evaluated with a precision of less than 3 decimals on float32:

  • test_bound_normal::logp (float32=-1)
  • test_truncated_normal::logp (float32=1)
  • test_normal::logp (float32=-1)
  • test_normal::logcdf (float32=2) # Changed in this PR
  • test_truncated_normal::logp (float32=1)
  • test_half_normal::logp (float32=1)
  • test_wald_custom_points::logp (float32=1)
  • test_wald::logp (float32=1)
  • test_mvnormal::logp (float32=-1 and float32=0 twice)
  • test_matrixnormal::logp (float32=-1 and float32=0)
  • test_ex_gaussian::logp (float32=2)
  • test_ex_gaussian::logcdf (float32=2)
  • test_logistic::logp (float32=1)
  • test_logistic::logcdf (float32=1)
  • test_logitnormal::logp (float32=1)

Additionally, we have 14 tests marked as xfail on float32.

Total number of tests: 201
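For context, select_by_precision in test_distributions.py presumably just picks the decimal tolerance matching the configured floatX; a minimal sketch, assuming roughly this behavior:

import theano  # theano-pymc / aesara, depending on the pymc3 version

def select_by_precision(float64, float32):
    """Pick the decimal tolerance for the active float precision."""
    return float64 if theano.config.floatX == "float64" else float32

# e.g. decimal=select_by_precision(float64=6, float32=2) compares values to
# 6 decimal places on float64 runs and only 2 on float32 runs.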

@michaelosthege
Member

You're right that the precision doesn't deal with the digits before the decimal point.

Nevertheless we should make sure that XPASS tests show up as failures: https://docs.pytest.org/en/reorganize-docs/new-docs/user/xfail.html
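For reference, a sketch of pytest's strict mode, which turns an unexpected pass (XPASS) into a hard failure; the same can be enabled globally via xfail_strict = true in pytest.ini or setup.cfg:

import pytest

@pytest.mark.xfail(reason="known float32 precision issue", strict=True)
def test_example():
    # With strict=True, if this assertion ever starts passing, the test run
    # fails, flagging that the xfail should be revisited.
    assert abs(0.1 + 0.2 - 0.3) == 0.0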

@ricardoV94
Member

ricardoV94 commented Mar 5, 2021

I didn't know about the strict mode of xfail. That looks useful.

I am afraid it would require some extra refactoring, because the tests (at least those we were looking at in test_distributions.py) don't always xfail, depending on which combination of parameters is tested.

Should we open a new issue to address this possibility?

@michaelosthege
Member

Should we open a new issue to address this possibility?

Looking at the PR title I'd say it would be in scope. But the PR already has a lot of discussion and history, so @DRabbit17 @ricardoV94 you should decide how to proceed.
On their own, the current changes are also fine to merge; they just don't fix the problem entirely.

@ricardoV94
Member

ricardoV94 commented Mar 5, 2021

As I understood it, the goal of this PR was simply to "unshadow" tests that were unnecessarily bundled inside a shared xfail. This is done, I think.

What you are proposing is to now look at the xfails that remain relevant and see if we can make them strict, so that XPASSes would trigger failures. Some tests look deterministic, so in theory they could be, but I would guess the majority cannot be made strict unless we change the actual tests, since they fail randomly. This would require looking at all the xfails, not just the ones that were changed in this PR due to "bundling".

What do you think @DRabbit17?

PS: I agree this would be valuable!

@matteo-pallini
Contributor Author

matteo-pallini commented Mar 5, 2021

Thanks for suggesting strict mode for the tests. I also think that using that option would be valuable.

I think having that change in another PR would make the process tidier; as mentioned, we already have a lot of discussion in this PR.

Also, in the original scope of this PR we were only meant to extract the non-failing sub-checks from the xfail-decorated tests, not change the behavior of any individual test/check. We did end up changing some things anyway, by modifying the support of some tests and the check precision (and possibly other things I cannot think of), but adding strict xfail would add further changes to how the tests behave.

Having said that, I feel these are mostly philosophical considerations :-). In practice, I don't have a strong opinion on whether to add the strict change in this PR or in another one. Happy to do it however you feel is better (or to flip a coin to decide). I also agree that it's a change worth adding.

Member

@michaelosthege michaelosthege left a comment

Let's merge

@ricardoV94 ricardoV94 merged commit d248a0e into pymc-devs:master Mar 9, 2021
@ricardoV94
Member

Thanks @DRabbit17! Looking forward to your next PR

@matteo-pallini matteo-pallini deleted the fix-single-pytest-xfail-calls-being-applied-to-multiple-test-subfunctions branch March 13, 2021 15:52
Successfully merging this pull request may close these issues:

@pytest.mark.xfail is being applied too broadly