@Flamefire
Contributor

@Flamefire Flamefire commented Sep 14, 2023

(created using eb --new-pr)

When PyTorch runs tests in a subprocess (added in 1.2.0), the output looks like this:

Traceback (most recent call last):
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/distributed/test_c10d_nccl.py", line 2894, in <module>
    run_tests()
  File "/tmp/easybuild-tmp/eb-CNzkIQ/tmp9hARar/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 726, in run_tests
    assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
AssertionError: 2 unit test(s) failed:
        DistributedDataParallelTest.test_find_unused_parameters_kwarg_debug_detail
        DistributedDataParallelTest.test_find_unused_parameters_kwarg_grad_is_view_debug_detail

FINISHED PRINTING LOG FILE of distributed/test_c10d_nccl (/dev/shm/s3248973-EasyBuild/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test-reports/distributed-test_c10d_nccl_mmfy71m4)

distributed/test_c10d_nccl failed!

This adds an error-counting regexp for this output format, for use in e.g. easybuilders/easybuild-easyconfigs#18424
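For illustration, a minimal sketch of what such an error-counting regexp could look like. The pattern, the names `SUBPROCESS_FAILURE_RE` and `count_subprocess_failures`, and the abbreviated sample log are assumptions made for this example, not the code added to the EasyBlock:

```python
import re

# Hypothetical pattern (not the one from the EasyBlock): capture the failure
# count from the AssertionError line, skip the indented test names and any
# lines in between, then capture the suite from the "<suite> failed!" line.
SUBPROCESS_FAILURE_RE = re.compile(
    r"^AssertionError: (?P<count>\d+) unit test\(s\) failed:\n"
    r"(?:[ \t]+\S+\n)+"
    r"(?:.*\n)*?"
    r"(?P<suite>\S+) failed!$",
    re.MULTILINE,
)


def count_subprocess_failures(log_text):
    """Return a dict mapping suite name to the number of failed unit tests."""
    return {
        match.group("suite"): int(match.group("count"))
        for match in SUBPROCESS_FAILURE_RE.finditer(log_text)
    }


# Abbreviated version of the log above; the report path is a placeholder.
SAMPLE_LOG = "\n".join([
    "AssertionError: 2 unit test(s) failed:",
    "\tDistributedDataParallelTest.test_find_unused_parameters_kwarg_debug_detail",
    "\tDistributedDataParallelTest.test_find_unused_parameters_kwarg_grad_is_view_debug_detail",
    "",
    "FINISHED PRINTING LOG FILE of distributed/test_c10d_nccl (<report path>)",
    "",
    "distributed/test_c10d_nccl failed!",
])

failures = count_subprocess_failures(SAMPLE_LOG)
```

The lazy `(?:.*\n)*?` keeps the match from running past the first `failed!` line that follows the assertion, but a real log with interleaved output may need a stricter pattern.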

I also improved the extra log output for the case where the failing test suites found with the generic regexp don't match those found with the specific regexps, which also count individual failures.

Previous output was like:

Failed tests (suites/files):
* asuite1
* distributed/test_c10d_nccl
* suite2
distributed/test_c10d_nccl (2 unit test(s) failed)

New output should be:

Failed tests (suites/files):
distributed/test_c10d_nccl (2 unit test(s) failed)
+ asuite1
+ suite2

Note the de-duplication.
In addition, I added a list of suites at the bottom of this list for which it is the other way round: not counted by the generic regexp but by the individual ones.
This should make it easier to identify issues with the EasyBlock.
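As a rough sketch of the de-duplication described above (the function name `format_failed_suites`, its arguments, and the `?` marker for suites seen only by the specific regexps are made up for this example; the real EasyBlock code and its exact output format may differ):

```python
def format_failed_suites(generic_suites, counted_failures):
    """Build the failed-suites listing without duplicate entries.

    generic_suites: suite names matched by the generic regexp
    counted_failures: dict of suite name -> description from the
                      specific (error-counting) regexps
    """
    generic = set(generic_suites)
    counted = set(counted_failures)
    lines = ["Failed tests (suites/files):"]
    # Suites found by both: show them once, with their failure details
    for suite in sorted(generic & counted):
        lines.append("%s (%s)" % (suite, counted_failures[suite]))
    # Suites found only by the generic regexp
    for suite in sorted(generic - counted):
        lines.append("+ " + suite)
    # Suites found only by the specific regexps (hypothetical '?' marker)
    for suite in sorted(counted - generic):
        lines.append("? %s (%s)" % (suite, counted_failures[suite]))
    return "\n".join(lines)


listing = format_failed_suites(
    ["asuite1", "distributed/test_c10d_nccl", "suite2"],
    {"distributed/test_c10d_nccl": "2 unit test(s) failed"},
)
```

With the inputs from the example above, `listing` reproduces the new output shown earlier.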

@branfosj
Member

Test report by @branfosj

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.13.1-foss-2022a.eb
  • SUCCESS PyTorch-1.12.1-foss-2021a.eb
  • SUCCESS PyTorch-1.10.0-foss-2021a.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/97dba61c5642a84f75140a23b2dbb494 for a full test report.

@branfosj
Member

Going in, thanks @Flamefire!
