Break down tests having multiple checks and xfail decorated #4497
Conversation
Force-pushed from cb187be to 40bbaef
I checked it now. Great work disentangling things!
I left a couple of comments that need to be addressed.
Importantly, we have to make sure that all the removed xfails around `check_logp` and `check_logcdf` are warranted. For this we have to temporarily set `n_samples=-1` in the `check_logp` and `check_logcdf` methods to force pytest to test all combinations of parameter values. After confirming the tests do not fail on float32, we can revert back to the original `n_samples`.
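For readers not familiar with these helpers, here is a rough standalone sketch of the idea behind `n_samples=-1` (names and internals paraphrased, not the actual pymc3 code): by default only a random subset of parameter combinations is checked on each run, and `-1` forces the exhaustive check.

```python
import itertools
import random

def param_combinations(paramdomains, n_samples=100):
    """Yield one dict of parameter values per combination, optionally sub-sampled."""
    names = list(paramdomains)
    all_combos = [
        dict(zip(names, values))
        for values in itertools.product(*(paramdomains[name] for name in names))
    ]
    if n_samples == -1 or n_samples >= len(all_combos):
        return all_combos  # exhaustive: what we want while auditing removed xfails
    return random.sample(all_combos, n_samples)  # default: a random subset per run

# param_combinations({"mu": (-1, 0, 1), "sigma": (0.5, 1, 2)}, n_samples=-1)
# returns all 9 combinations instead of a random sample.
```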
```diff
@@ -51,6 +50,8 @@ def test_dict_input(self):
         m, l = utils.any_to_tensor_and_labels(self.data.to_dict("list"))
         self.assertMatrixLabels(m, l, mt=self.data[l].values, lt=l)

     @pytest.mark.xfail
```
Can we dig into the reason for this one?
I think that `pymc3.glm.utils.any_to_tensor_and_labels` doesn't support converting data structures containing TensorType (mostly because `theano.tensor.as_tensor_variable`, called on line 132, doesn't). So, in this case `inp` will cause an error to be raised.
I haven't been able to understand what conditions made this test pass. I guess that if we need `pymc3.glm.utils.any_to_tensor_and_labels` to be able to handle data structures containing tensor objects, we should probably change that.
I tracked it down to this commit: 4f67d63
@rpgoldman could you chime in? What is the source of this failure? It would be helpful to have a clearer reason specified in the xfail.
The xfail was there before, so this shouldn't block us from moving forward with this PR.
Let's keep an eye on this one: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551266 It will show us whether the removed xfails were safe, and whether the added one for the normal logcdf was needed or not. After this we can go ahead and test the removed xfails that are still not getting `n_samples=-1`.
This one on float64 is new: https://github.com/pymc-devs/pymc3/pull/4497/checks?check_run_id=2021551515
Looks like a scipy numerical issue.
I went ahead and checked what Passed/Failed as expected. Everywhere I wrote "Passed/XPassed as expected" you can revert the `n_samples=-1` and mark the conversation as resolved.
Almost there!!!!
Great work. Do you want to remove the unnecessary `n_samples` changes and comments, and check those that we skipped on the first run?
Force-pushed from 1dbcf79 to e6c1c4c
Xfail is going to care only about the first failed assert statement and ignore all the other ones in the same test. Sometimes xfail is used to decorate a test or class function calling several other sub-functions. This makes it hard to monitor what is happening for the sub-functions that are not failing (given that the failure will take priority and mark the test function as xfailed). By breaking down each function into individual sub-functions it's easier to monitor individual test behavior.
Also add a comment regarding a test with an unpredictable outcome.
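A generic illustration of the problem described above (hypothetical test names, not taken from the code base): a single xfail-decorated test stops reporting at its first failing assert, while splitting the checks gives each one its own status.

```python
import pytest

@pytest.mark.xfail(reason="only the first failing check is ever reported")
def test_bundled_checks():
    assert 1 + 1 == 3  # fails here; the xfail swallows everything below
    assert 2 + 2 == 4  # never executed, so we cannot tell whether it still passes

# Broken down, the passing check is no longer hidden behind the xfail:
@pytest.mark.xfail(reason="known failure")
def test_first_check():
    assert 1 + 1 == 3

def test_second_check():
    assert 2 + 2 == 4
```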
Force-pushed from a7cc69b to 32de1a4
Okay, this should be it. If nobody chimes in about the unclear xfail, I say we let it stay as a fossil. Your changes recovered 3/4 of those tests, so that's an improvement :) Anything missing?
It seems good to me. Thanks for assisting me while working on the PR and reviewing it! Out of curiosity, how did you realize that moving the domain from R to the narrower custom one would fix the scipy issue?
I just tweaked the edge values of the R domain (the -inf, +inf are excluded). Since the issue was on the scipy side it was easy to test what was causing it to fail and whether the new values would fix it:

```python
import numpy as np
import scipy.stats as st

R = (-2.1, -1, -.01, 0, 0.01, 1, 2.1)
Rplusbig = (0.5, 0.9, 0.99, 1, 1.5, 2, 20)

for mu in R:
    for sigma in Rplusbig:
        for value in R:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f'{mu=}, {sigma=}, {value=}')
# mu=2.1, sigma=0.5, value=-2.1
```

```python
R = (-2.1, -1, -.01, 0, 0.01, 1, 2.1)
Rplusbig = (0.5, 0.9, 0.99, 1, 1.5, 2, 20)
Custom = (-1.5, -1, -.01, 0, .01, 1, 1.5)

for mu in R:
    for sigma in Rplusbig:
        for value in Custom:
            if st.moyal(mu, sigma).logcdf(value) == -np.inf:
                print(f'{mu=}, {sigma=}, {value=}')
# Nothing is printed
```
I see that you two already had a thorough discussion.
Just one thing: instead of testing with a precision of 1 or 2, shouldn't we `xfail` these tests at the tighter precision on float32? When the `xfail` stops failing we'll then know that the underlying numerics were improved and we can tighten the test.
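A hypothetical sketch of that suggestion (not the actual pymc3 test): keep the tight 6-decimal tolerance and mark only the float32 run as an expected failure, so a future XPASS would tell us the numerics improved.

```python
import numpy as np
import pytest

FLOATX = "float32"  # stand-in for theano/aesara config.floatX in the real suite

@pytest.mark.xfail(
    FLOATX == "float32",
    reason="Normal logcdf only accurate to ~2 decimals on float32",
)
def test_normal_logcdf_tight():
    # dummy numbers standing in for the pymc3 vs scipy comparison
    ours, reference = -1.843, -1.8410216  # pretend float32 result drifts in the 3rd decimal
    np.testing.assert_almost_equal(ours, reference, decimal=6)
```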
```diff
@@ -902,6 +902,7 @@ def test_normal(self):
             R,
             {"mu": R, "sigma": Rplus},
             lambda value, mu, sigma: sp.norm.logcdf(value, mu, sigma),
+            decimal=select_by_precision(float64=6, float32=2),
```
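For context, `select_by_precision` is, if I remember correctly, just a switch on the configured float type; a paraphrased sketch rather than a quote of the file:

```python
import theano  # `aesara` in later versions

def select_by_precision(float64, float32):
    """Pick the decimal tolerance according to the precision theano is running at."""
    return float64 if theano.config.floatX == "float64" else float32

# e.g. decimal = select_by_precision(float64=6, float32=2)
```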
float32 precision of 2 digits sounds awfully bad. Are you sure about this?
It's actually better than the normal::logp test (this one is the normal::logcdf), which uses a precision of 1! (That predates this PR.)
This used to pass before because these tests were all seeded until very recently; the low-precision conditions have only been appearing and failing since then.
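A small generic illustration of why the seeding mattered (not the actual test code): with a fixed seed the same subset of parameter combinations is drawn on every run, so a numerically unstable combination may simply never be selected.

```python
import random

combos = [(mu, sigma) for mu in (-2.1, -1, 0, 1, 2.1) for sigma in (0.5, 1, 20)]

random.seed(42)
print(random.sample(combos, 5))  # identical on every run: the same points get tested

random.seed()                    # unseeded: a different subset each run, which
print(random.sample(combos, 5))  # eventually hits the numerically unstable points
```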
```python
            Rplus,
            {"mu": Rplus, "alpha": Rplus},
            lambda value, mu, alpha: sp.invgauss.logpdf(value, mu=mu, loc=alpha),
            decimal=select_by_precision(float64=6, float32=1),
```
Here too - 1 digit precision?!
This preceded this PR.
By the way, these 1-digit precision comparisons are not necessarily as bad as they may sound. It's usually when the log becomes very negative, close to underflowing to -inf: things like -213.161 vs -213.172, which mean basically nothing.
Also, many times we are dealing with loss of precision on the scipy side, not on our side.
I doubt we will ever notice xfails becoming xpasses, especially because only a subset of combinations is tested in each run and xpasses already happen now and then by chance alone. I feel that low precision is better than no checking at all (i.e. marking as xfail), provided this only happens for the 32-bit version. If changes are done on our side that improve the numerics, I think it's natural to try to increase the test precision when that happens.
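To put numbers on that, using the example values above: at `decimal=1` the check only demands an absolute difference below 0.15, yet for log-probabilities of that magnitude the relative error is already tiny.

```python
import numpy as np

a, b = -213.161, -213.172
print(abs(a - b))            # 0.011 absolute difference
print(abs(a - b) / abs(b))   # ~5e-5 relative difference: negligible for a log-prob
np.testing.assert_almost_equal(a, b, decimal=1)  # passes: tolerance is 1.5 * 10**-1
```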
The case could be made for selecting testing points more carefully so that we don't test so often values that are numerically unstable in either our version or scipy's. That would deserve its own PR, I think.
Also, it's worth mentioning that it's really difficult to test float32 locally on a 64-bit installation. For aesara we can change `config.floatX`, but for the scipy counterpart I have not found a way to reliably emulate a float32 computation. Wrapping the output in a float32 does not suffice (in this PR I sometimes tried it locally and had the tests succeed with some precision, just to find they would still fail on the actual float32 CI runs here).
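A quick demonstration of that last point (assumed example, not from the PR): casting scipy's output to float32 only rounds an internal float64 computation after the fact, which is not the same as an end-to-end float32 pipeline.

```python
import numpy as np
import scipy.stats as st

x = np.float32(-2.1)
full64 = st.norm.logcdf(np.float64(x))   # scipy computes in float64 either way
cast32 = np.float32(st.norm.logcdf(x))   # same float64 math, merely rounded at the end
print(full64, cast32)  # they agree up to float32 rounding, which says nothing about
                       # what a genuine float32 (theano floatX) computation would give
```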
For completeness here is a list of tests in the test_distributions.py module which are being evaluated with a precision of less than 3 decimals on float32:
Additionally, we have 14 tests marked as xfail on float32. Total number of tests: 201.
You're right that the precision doesn't deal with the digits before the decimal point. Nevertheless, we should make sure that we notice when one of these xfails starts passing; pytest's strict xfail mode could help with that.
I didn't know about the strict mode of xfail. That looks useful. I am afraid it would require some extra refactoring, because the tests (at least those we were looking at in this PR) do not fail deterministically. Should we open a new issue to address this possibility?
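For reference, a minimal generic example of the strict mode (not tied to any pymc3 test): with `strict=True` an unexpected pass turns into a suite failure, so an xfail that starts passing cannot go unnoticed.

```python
import pytest

@pytest.mark.xfail(strict=True, reason="expected failure; an unexpected pass errors out")
def test_strict_xfail_demo():
    assert 1 + 1 == 3  # if this ever starts passing, pytest reports XPASS(strict) and fails
```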
Looking at the PR title, I'd say it would be in scope. But the PR has a lot of discussion & history already, so @DRabbit17 @ricardoV94 you should decide how to proceed.
As I understood it, the goal of this PR was simply to "unshadow" tests that were unnecessarily bundled inside a shared xfail. This is done, I think. What you are proposing is to now look at the xfails that remain relevant and see if we can make them strict, so that xpasses would trigger a failure. There are some tests that look deterministic, so in theory they could, but I would guess the majority cannot be made strict unless we change the actual tests, as they fail randomly. This would require looking at all the xfails, not just the ones that were changed in this PR due to "bundling". What do you think @DRabbit17?
PS: I agree this would be valuable!
Thanks for suggesting introducing the strict mode for the tests. I also think that using that option would be valuable. I think having that change in another PR would make the process tidier; as mentioned, we already have a lot of discussion in this PR. Also, I think that in the original scope of this PR we were only meant to extract the non-failing sub-checks from the xfail-decorated ones and not change the behavior of any individual test/check, although we ended up doing so by modifying the support of some tests and the check precision (and possibly some other things I cannot think of). Adding the strict xfail would add further changes to how the tests behave. Having said that, I feel that these are mostly philosophical considerations :-). In practice, I don't have a strong opinion on whether to add the strict change in this PR or in another one. Happy to do it however you guys feel is better (or to flip a coin to decide). I also agree that it's a change worth adding.
Let's merge
Thanks @DRabbit17! Looking forward to your next PR
Closes #4425
Xfail is going to care only about the first failed assert statement and ignore all the other ones in the same test.
Sometimes xfail is used to decorate a test or class function calling several other sub-functions. This makes it hard to monitor what is happening for the sub-functions that are not failing (given that the failure will take priority and mark the test function as xfailed). By breaking down each function into individual sub-functions it's easier to monitor individual test behavior.