On both our V100 (Intel Cascade Lake) and A100 (AMD Milan) systems (both RHEL 8.4 currently), I'm seeing too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0.
On both systems, the build fails with "Too many failed tests (437), maximum allowed is 400", reporting:
WARNING: 285 test failures, 152 test errors (out of 86678):
distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)
distributed/fsdp/test_distributed_checkpoint (2 total tests, failures=2)
distributed/fsdp/test_fsdp_apply (3 total tests, failures=3)
distributed/fsdp/test_fsdp_input (2 total tests, failures=2)
distributed/fsdp/test_fsdp_meta (14 total tests, failures=14)
distributed/fsdp/test_fsdp_misc (9 total tests, failures=9)
distributed/fsdp/test_fsdp_mixed_precision (90 total tests, failures=88)
distributed/fsdp/test_fsdp_state_dict (61 total tests, failures=61)
distributed/fsdp/test_fsdp_summon_full_params (73 total tests, failures=65)
distributions/test_distributions (219 total tests, failures=1)
test_autograd (484 total tests, failures=1, skipped=16, expected failures=2)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_jit_cuda_fuser (147 total tests, errors=1, skipped=19)
test_jit_legacy (2661 total tests, failures=12, errors=8, skipped=133, expected failures=7)
test_jit_profiling (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_ops_gradients (6968 total tests, errors=1, skipped=3597, expected failures=85)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=3, errors=40, skipped=51)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=131)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
That seems to be significantly more than what @casparvl and @smoors observed in #15924, so I'm a bit puzzled here... (I guess not all test reports there were using the enhanced PyTorch easyblock from easybuilders/easybuild-easyblocks#2803, which counts failing tests correctly.)
@Flamefire Do some of these failing tests happen to ring a bell for you?
In #15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get #16453 and #16484 merged first?).
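For triage, the per-suite counts in the summary above can be tallied with a short script to see which suites contribute most of the failures. The following is only a minimal sketch: it is not the logic used by the PyTorch easyblock, and the function name `tally` and the regular expressions are assumptions made for illustration, based purely on the two line formats that appear in the summary.

```python
import re

# Assumed patterns, derived only from the summary lines shown in this report.
# The negative lookbehind keeps "expected failures=7" from being counted.
FAILURES = re.compile(r"(?<!expected )failures=(\d+)")
ERRORS = re.compile(r"errors=(\d+)")
PYTEST_FAILED = re.compile(r"(\d+) failed")  # pytest-style "2 failed, 128 passed"


def tally(lines):
    """Return {suite_name: (failures, errors)} for test summary lines."""
    result = {}
    for line in lines:
        name, _, details = line.partition(" (")
        failures = sum(int(n) for n in FAILURES.findall(details))
        failures += sum(int(n) for n in PYTEST_FAILED.findall(details))
        errors = sum(int(n) for n in ERRORS.findall(details))
        result[name.strip()] = (failures, errors)
    return result


# A few lines copied from the summary above.
sample = [
    "distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)",
    "distributed/fsdp/test_fsdp_apply (3 total tests, failures=3)",
    "test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)",
    "test_package (131 total tests, errors=46, skipped=23)",
]

counts = tally(sample)
total_failures = sum(f for f, _ in counts.values())
total_errors = sum(e for _, e in counts.values())

# List suites with the most problems first.
for suite, (f, e) in sorted(counts.items(), key=lambda kv: -sum(kv[1])):
    print(f"{suite}: failures={f}, errors={e}")
print(f"total: failures={total_failures}, errors={total_errors}")
```

Feeding the full summary into `tally` and grouping by the `distributed/fsdp/` prefix would show how much of the total comes from the FSDP suites alone.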