PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb build failure on A100 GPU within Singularity container #14665

@sassy-crick

I am having problems building recent PyTorch versions using either PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb or PyTorch-1.9.0-fosscuda-2020b.eb. The builds are done in an interactive session requested from SLURM with srun -p 4gpu --nodes=1 --ntasks-per-node=36 --mem=480G --gres=gpu:1 --time=7-0:0 --pty /bin/bash. The Singularity container is started with the --nv option, all SLURM-related environment variables are removed inside the container, and the actual build is run with eb -d --cuda-compute-capabilities=8.0 PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb.
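For anyone trying to reproduce the setup, here is a minimal sketch of what "removing SLURM-related environment variables" could look like. The prefix list and the helper name strip_slurm_vars are assumptions made for illustration; the exact mechanism used in the container is not stated above.

```python
import os


def strip_slurm_vars(env):
    """Return a copy of env without scheduler-related variables.

    Assumption: "SLURM-related" means variables whose names start with
    one of the prefixes below. Adjust the tuple to match your site.
    """
    prefixes = ("SLURM_", "SLURMD_", "SBATCH_", "SRUN_")
    return {key: value for key, value in env.items()
            if not key.startswith(prefixes)}


if __name__ == "__main__":
    clean_env = strip_slurm_vars(dict(os.environ))
    removed = sorted(set(os.environ) - set(clean_env))
    print(f"removed {len(removed)} scheduler variables")
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching the build, so the calling shell's environment is left untouched.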

With Kenneth's help, I tracked down the following error messages:

======================================================================
ERROR: test_ddp_comm_hook_register_just_once (__main__.DistributedDataParallelTest)
DDP communication hook can only be registered once. This test validates whether
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 610.0295519828796 seconds

======================================================================
ERROR: test_ddp_invalid_comm_hook_init (__main__.DistributedDataParallelTest)
This unit test makes sure that register_comm_hook properly checks the format
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 610.0408778190613 seconds

======================================================================
ERROR: test_round_robin (__main__.ProcessGroupGlooTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 605.0606961250305 seconds

======================================================================
ERROR: test_round_robin_create_destroy (__main__.ProcessGroupGlooTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
    getattr(self, test_name)()
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
    fn()
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2850, in wrapper
    return func(*args, **kwargs)
  File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2850, in wrapper
    return func(*args, **kwargs)
  File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1438, in test_round_robin_create_destroy
    pg = create(num=num_process_groups, prefix=i)
  File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1424, in create
    [
  File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1425, in <listcomp>
    c10d.ProcessGroupGloo(
RuntimeError: Wait timeout

----------------------------------------------------------------------
Ran 85 tests in 2073.886s

FAILED (errors=4)
distributed/test_c10d_gloo failed!

and

======================================================================
FAIL: test_thread_shutdown (__main__.TestAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_autograd.py", line 4305, in test_thread_shutdown
    self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")
AssertionError: Regex didn't match: 'PYTORCH_API_USAGE torch.autograd.thread_shutdown' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE tensor.create\n'

----------------------------------------------------------------------
Ran 476 tests in 19.531s

FAILED (failures=1, skipped=65, expected failures=1)
test_autograd failed!

At the end of the logfile I get:

distributed/test_c10d_gloo failed!
test_autograd failed!
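
Since the log is very long, a small script can pull out the failing suites and individual tests. This is only a sketch: it assumes the log contains summary lines of the form "<suite> failed!" and unittest headers of the form "ERROR: <test> (<class>)" or "FAIL: <test> (<class>)", as in the excerpts above. The function name summarize_failures is made up for this example.

```python
import re

# Summary lines printed per suite, e.g. "test_autograd failed!"
FAILED_SUITE = re.compile(r"^(\S+) failed!$", re.MULTILINE)
# unittest headers, e.g. "ERROR: test_round_robin (__main__.ProcessGroupGlooTest)"
FAILED_TEST = re.compile(r"^(ERROR|FAIL): (\w+) \(([\w.]+)\)$", re.MULTILINE)


def summarize_failures(log_text):
    """Return (sorted unique failing suites, list of (kind, qualified test name))."""
    suites = sorted(set(FAILED_SUITE.findall(log_text)))
    tests = [(kind, f"{cls}.{name}")
             for kind, name, cls in FAILED_TEST.findall(log_text)]
    return suites, tests


if __name__ == "__main__":
    # Short excerpt in the same shape as the log lines quoted in this issue.
    sample = """
ERROR: test_round_robin (__main__.ProcessGroupGlooTest)
RuntimeError: Process 0 terminated or timed out after 605.0606961250305 seconds
distributed/test_c10d_gloo failed!
FAIL: test_thread_shutdown (__main__.TestAutograd)
test_autograd failed!
"""
    suites, tests = summarize_failures(sample)
    print(suites)
    for kind, name in tests:
        print(kind, name)
```

To run it on a real log, read the file's contents into a string and pass that to summarize_failures instead of the sample text.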

The build itself succeeds; the failure occurs at the testing stage:

== building...
== ... (took 7 hours 30 mins 26 secs)
== testing...
== ... (took 3 hours 50 mins 23 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1): build failed (first 300 chars): cmd "export PYTHONPATH=/dev/shm/easybuild/eb-5y1vttp7/tmpl5jqz6uw/lib/python3.9/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /apps/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test di (took 11 hours 20 mins 59 secs)
== Results of the build can be found in the log file(s) /dev/shm/easybuild/eb-5y1vttp7/easybuild-PyTorch-1.10.0-20220104.113117.sbDKU.log

ERROR: Build of /apps/easybuild/software/EasyBuild/4.5.1/easybuild/easyconfigs/p/PyTorch/PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/dev/shm/easybuild/eb-5y1vttp7/tmpl5jqz6uw/lib/python3.9/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /apps/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test di')

Kenneth suggested that the following thread could be related, though he was not sure: https://discuss.pytorch.org/t/gloo-nccl-connection-issues-build-from-source/137264/3

I am stuck. I am happy to provide more information if requested.
Thanks!
