fix prototype transforms tests with set agg_method #6934
pmeier merged 10 commits into pytorch:main
Conversation
```diff
  agg_abs_diff = float(self.agg_method(abs_diff.to(torch.float64)))
  if agg_abs_diff > self.atol:
-     self._make_error_meta(AssertionError, "aggregated mismatch")
+     raise self._make_error_meta(AssertionError, "aggregated mismatch")
```
This was the source of the actual bug. Without the `raise` we just created an exception, but never did anything with it. Thus, all tests that set `agg_method` in their `closeness_kwargs` passed without a value check.
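For illustration, here is a minimal, self-contained sketch of the bug pattern (the `_Comparer` class is a hypothetical stand-in, not the actual `torch.testing` code): constructing the exception without `raise` is a silent no-op, so the aggregated check never fails.

```python
import torch

class _Comparer:
    """Hypothetical stand-in for the pair class used by assert_close."""

    def __init__(self, atol, agg_method):
        self.atol = atol
        self.agg_method = agg_method

    def _make_error_meta(self, exc_type, msg):
        # Builds and *returns* the exception; the caller must raise it.
        return exc_type(msg)

    def compare(self, actual, expected):
        abs_diff = (actual - expected).abs()
        agg_abs_diff = float(self.agg_method(abs_diff.to(torch.float64)))
        if agg_abs_diff > self.atol:
            # Buggy version: the exception object was created and discarded,
            # so the test passed regardless of the mismatch:
            #     self._make_error_meta(AssertionError, "aggregated mismatch")
            # Fixed version:
            raise self._make_error_meta(AssertionError, "aggregated mismatch")

# The buggy variant would let this wildly different pair pass silently;
# the fixed one raises as expected.
comparer = _Comparer(atol=1e-3, agg_method=torch.mean)
try:
    comparer.compare(torch.ones(4), torch.zeros(4))
except AssertionError as e:
    print(e)  # aggregated mismatch
```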
```diff
- class PILImagePair(TensorLikePair):
+ class ImagePair(TensorLikePair):
```
I've refactored this to not handle mixed PIL / tensor image pairs. It was a problem when the tolerance was set for floating point images, i.e. in the range [0.0, 1.0], but the comparison converted to uint8, which needs higher tolerances.
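To make the problem concrete, here is a rough, self-contained illustration (not the removed code): a pair that is comfortably within a float tolerance ends up a full step apart on the 0-255 scale after conversion to uint8, so the float tolerance no longer means what it was tuned for.

```python
import torch

atol_float = 1e-2  # tolerance tuned for images in [0.0, 1.0]

actual = torch.tensor([0.500])
expected = torch.tensor([0.505])
assert (actual - expected).abs().max() <= atol_float  # passes as float

# After conversion to uint8 the same pair differs by a full step on the
# 0-255 scale, far beyond the float tolerance:
actual_u8 = (actual * 255).to(torch.uint8)      # 127
expected_u8 = (expected * 255).to(torch.uint8)  # 128
print((actual_u8.int() - expected_u8.int()).abs())  # tensor([1])
```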
These are crazy tolerances and render the test effectively useless. This PR just lays the foundation to fix these in the near future. I'll open an issue ASAP with a game plan to re-review all the operators that need these tolerances for some reason.
These are crazy tolerances indeed. From your earlier comment, I understand that the tests were effectively not throwing exceptions, so here you adjust the values to make them pass and then revisit all kernels that fail. Is my understanding correct?
```python
# 2D mask shenanigans
if output_tensor.ndim == 2 and input_tensor.ndim == 3:
    output_tensor = output_tensor.unsqueeze(0)
elif output_tensor.ndim == 3 and input_tensor.ndim == 2:
    output_tensor = output_tensor.squeeze(0)
```
For all other shape mismatches we can let the comparison logic handle the error.
```python
yield ArgsKwargs(video_loader, size=size)
```

```python
def multi_crop_pil_reference_wrapper(pil_kernel):
```
Small helper, since the regular `pil_reference_wrapper` cannot handle conversion of these tuple or list outputs.
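A plausible shape for the helper, sketched under the assumption that it mirrors `pil_reference_wrapper` but maps the PIL-to-tensor conversion over each crop (the actual body in the PR may differ):

```python
import functools
import torchvision.transforms.functional as legacy_F

def multi_crop_pil_reference_wrapper(pil_kernel):
    # Like pil_reference_wrapper, but converts every PIL image in the
    # tuple/list output that five_crop / ten_crop style kernels return.
    @functools.wraps(pil_kernel)
    def wrapper(image_pil, *args, **kwargs):
        outputs = pil_kernel(image_pil, *args, **kwargs)
        return type(outputs)(legacy_F.pil_to_tensor(output) for output in outputs)

    return wrapper
```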
```python
expected_image, expected_mask = t_ref(*dp_ref)
if isinstance(actual_image, torch.Tensor) and not isinstance(expected_image, torch.Tensor):
    expected_image = legacy_F.pil_to_tensor(expected_image)
    expected_mask = legacy_F.pil_to_tensor(expected_mask).squeeze(0)
expected = (expected_image, expected_mask)
```
This was relying on the mixed PIL / tensor image comparison that was removed above. Thus, we do it manually here.
```python
assert_close(
    actual,
    expected,
    check_dtype=False,
```
We can be strict here now, since we perform the conversion correctly.
Even with the crazy tolerances, there are still failures in CI. I can't reproduce them locally, so this is probably flakiness. Given that I will look into this ASAP, I wouldn't block over it. The only other safe option is to disable the reference tests for these kernels for now. Your choice.
@pmeier I think it's worth fixing the issues in this PR and merging later. What's the purpose of making a bunch of test corrections and then disabling the tests? No strong opinion though.
I don't know the extent of the changes needed yet. This PR might get out of hand. But I'm not the one that needs to review it, so I'll leave it up to you.
datumbox left a comment
Offering a stamp to unblock the work. Up to you @pmeier if you want to make the changes here so that all tests pass, or do it gradually in follow-ups. My only concern is that turning off all reference tests might be dangerous. Perhaps we could skip only those that fail?
pmeier left a comment
With the latest commit I removed the `DEFAULT_PIL_REFERENCE_CLOSENESS_KWARGS` in favor of setting the tolerances for each kernel separately. Some of them are pretty wild and I'm going to highlight them below. Given that everything works in our E2E tests, IMO this is either caused by parameter combinations that we don't use or by a problematic test setup.
We need to investigate them in any case. Let's use #6937 to coordinate. I'll update it soon to check everything that was already done in this PR.
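For context, here is a stripped-down sketch of what moving from one shared constant to per-kernel tolerances looks like. The `KernelInfo` below is a hypothetical stand-in and the kernels and values are made up; the real definitions live in `test/prototype_transforms_kernel_infos.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

@dataclass
class KernelInfo:
    # Hypothetical, simplified stand-in for the test suite's KernelInfo.
    kernel: Callable
    reference_fn: Optional[Callable] = None
    closeness_kwargs: Dict[str, Any] = field(default_factory=dict)

# Before: one constant shared by every PIL reference test.
DEFAULT_PIL_REFERENCE_CLOSENESS_KWARGS = dict(rtol=0, atol=1e-5)

# After: each kernel declares the tolerances it actually needs
# (dummy kernels and made-up values for illustration only).
infos = [
    KernelInfo(lambda img: img, closeness_kwargs=dict(rtol=0, atol=1, agg_method="mean")),
    KernelInfo(lambda img: img, closeness_kwargs=dict(rtol=0, atol=30, agg_method="mean")),
]
```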
datumbox left a comment
Still LGTM.
A few comments:
- We do know that accuracy is not affected, but it might be the case that a kernel is broken/buggy on setups we haven't tested (floats, for instance), or that the specific augmentation doesn't have much effect on accuracy. So, as you highlighted, we must check things very thoroughly to understand whether this is a test problem or an actual bug.
- I wonder if it's worth focusing on single-pixel differences rather than averages. Since large per-pixel errors can hide in an average, it might be worth examining whether max single-pixel tolerances can be stricter (see the sketch after this list).
- Some of the tolerances look scary. We should bisect and revert any speed optimizations that caused them. It might also be worth checking our tensor kernels against PIL. The stable tests do that now, and this is how we ensure that stable and PIL stay aligned. V2 doesn't have such tests (of course it compares itself with V1), but due to the increased tolerances we might be missing something important.
Let's merge on green CI and take this as a top-priority task to resolve.
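To illustrate the mean-vs-max point from the second bullet above with toy numbers: with averaging, one badly wrong pixel hides among many correct ones, while a max-based check flags it immediately.

```python
import torch

actual = torch.zeros(100, dtype=torch.float64)
expected = actual.clone()
actual[0] = 50.0  # a single pixel is wildly off

abs_diff = (actual - expected).abs()
print(abs_diff.mean())  # 0.5  -> slips under a mean-based atol of, say, 1.0
print(abs_diff.max())   # 50.0 -> a max single-pixel tolerance catches it
```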
In the end I've refactored this quite a bit so we get a clearer picture. Here are the main changes this PR brings:
With these changes, there are only a couple of operators with wild tolerances left to check:
They are tracked in #6937, which will be used after this PR. Since these tests are pretty flaky right now, I cranked up the tolerances to avoid problems with the CI. Let's see if I need more.
Summary:

* fix prototype transforms tests with set agg_method
* use individual tolerances
* refactor PIL reference test
* increase tolerance for elastic_mask
* fix autocontrast tolerances
* increase tolerance for RandomAutocontrast

Reviewed By: NicolasHug
Differential Revision: D41265197
fbshipit-source-id: 57eba523c4e6672d8a1d0cae8b7b95f1d52f13bf
`assert_{close, equal}` was buggy for prototype tests if one set the `agg_method`, as we do for a few tests through `vision/test/prototype_transforms_kernel_infos.py` (lines 64 to 67 in f32600b).
I only wanted to fix that, but realized that fixing it entails a lot more. A lot of the reference tests that used these tolerances failed. Thus, this PR only lays the foundation to fix them in the near future.
cc @vfdev-5 @datumbox @bjuncek