Description
I am having trouble building recent PyTorch versions using either PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb or PyTorch-1.9.0-fosscuda-2020b.eb. The builds are done in an interactive SLURM session requested like this: srun -p 4gpu --nodes=1 --ntasks-per-node=36 --mem=480G --gres=gpu:1 --time=7-0:0 --pty /bin/bash. The Singularity container is started with the --nv option, any SLURM-related environment variables are removed inside the container, and the actual build is run like this: eb -d --cuda-compute-capabilities=8.0 PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb
With Kenneth's help, I found the following error messages in the log:
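To make the "SLURM variables removed" step checkable, here is a minimal sketch of how such a cleanup could be verified from Python. It is only an illustration: the helper name strip_slurm_env and the rule "drop every variable whose name starts with SLURM" are assumptions, not the exact procedure used for the builds above.

```python
import os


def strip_slurm_env(env):
    """Return a copy of env without variables whose names start with 'SLURM'.

    Hypothetical helper: the prefix rule is an assumption about what
    'SLURM related' means; adjust it if your site sets other variables.
    """
    return {key: value for key, value in env.items() if not key.startswith("SLURM")}


if __name__ == "__main__":
    cleaned = strip_slurm_env(os.environ)
    removed = sorted(set(os.environ) - set(cleaned))
    # An empty list here means no SLURM* variables were visible in this shell.
    print("SLURM variables found in the current environment:", removed)
```

Running this inside the container before starting eb shows whether any SLURM* variables are still visible to the build.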
======================================================================
ERROR: test_ddp_comm_hook_register_just_once (__main__.DistributedDataParallelTest)
DDP communication hook can only be registered once. This test validates whether
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
self._join_processes(fn)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
self._check_return_codes(elapsed_time)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 610.0295519828796 seconds
======================================================================
ERROR: test_ddp_invalid_comm_hook_init (__main__.DistributedDataParallelTest)
This unit test makes sure that register_comm_hook properly checks the format
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
self._join_processes(fn)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
self._check_return_codes(elapsed_time)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 610.0408778190613 seconds
======================================================================
ERROR: test_round_robin (__main__.ProcessGroupGlooTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
self._join_processes(fn)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
self._check_return_codes(elapsed_time)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 687, in _check_return_codes
raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 605.0606961250305 seconds
======================================================================
ERROR: test_round_robin_create_destroy (__main__.ProcessGroupGlooTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
self._join_processes(fn)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
self._check_return_codes(elapsed_time)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
getattr(self, test_name)()
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
fn()
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2850, in wrapper
return func(*args, **kwargs)
File "/dev/shm/easybuild/eb-5azo6i3u/tmpxe0ioidc/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2850, in wrapper
return func(*args, **kwargs)
File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1438, in test_round_robin_create_destroy
pg = create(num=num_process_groups, prefix=i)
File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1424, in create
[
File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_gloo.py", line 1425, in <listcomp>
c10d.ProcessGroupGloo(
RuntimeError: Wait timeout
----------------------------------------------------------------------
Ran 85 tests in 2073.886s
FAILED (errors=4)
distributed/test_c10d_gloo failed!
and
======================================================================
FAIL: test_thread_shutdown (__main__.TestAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_autograd.py", line 4305, in test_thread_shutdown
self.assertRegex(s, "PYTORCH_API_USAGE torch.autograd.thread_shutdown")
AssertionError: Regex didn't match: 'PYTORCH_API_USAGE torch.autograd.thread_shutdown' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE tensor.create\n'
----------------------------------------------------------------------
Ran 476 tests in 19.531s
FAILED (failures=1, skipped=65, expected failures=1)
test_autograd failed!
At the end of the logfile I get:
distributed/test_c10d_gloo failed!
test_autograd failed!
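Since the log is very long, a small script can collect the failing tests and suites in one pass. The following is a rough sketch that assumes the log uses the same line shapes as the excerpts above ("ERROR: name (class)", "FAIL: name (class)" and "suite failed!"); the function name summarize_failures is made up for illustration.

```python
import re

# Line shapes taken from the excerpts above; other log formats are not handled.
TEST_RE = re.compile(r"^(ERROR|FAIL): (\S+) \((\S+)\)")
SUITE_RE = re.compile(r"^(\S+) failed!$")


def summarize_failures(lines):
    """Return (tests, suites) found in an iterable of log lines.

    tests  -- list of (kind, "Class.test_name") tuples, in order of appearance
    suites -- list of suite names reported as "<suite> failed!", de-duplicated
    """
    tests, suites = [], []
    for raw in lines:
        line = raw.strip()
        match = TEST_RE.match(line)
        if match:
            kind, name, owner = match.groups()
            tests.append((kind, owner + "." + name))
            continue
        match = SUITE_RE.match(line)
        if match and match.group(1) not in suites:
            suites.append(match.group(1))
    return tests, suites
```

It can be called on an open file, for example summarize_failures(open(path_to_log)), where path_to_log is the EasyBuild log file named at the end of the build output.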
The build itself succeeds; the failure occurs at the testing stage:
== building...
== ... (took 7 hours 30 mins 26 secs)
== testing...
== ... (took 3 hours 50 mins 23 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/easybuild/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1): build failed (first 300 chars): cmd "export PYTHONPATH=/dev/shm/easybuild/eb-5y1vttp7/tmpl5jqz6uw/lib/python3.9/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /apps/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test di (took 11 hours 20 mins 59 secs)
== Results of the build can be found in the log file(s) /dev/shm/easybuild/eb-5y1vttp7/easybuild-PyTorch-1.10.0-20220104.113117.sbDKU.log
ERROR: Build of /apps/easybuild/software/EasyBuild/4.5.1/easybuild/easyconfigs/p/PyTorch/PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb failed (err: 'build failed (first 300 chars): cmd "export PYTHONPATH=/dev/shm/easybuild/eb-5y1vttp7/tmpl5jqz6uw/lib/python3.9/site-packages:$PYTHONPATH && cd test && PYTHONUNBUFFERED=1 /apps/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test di')
Kenneth suggested that the following could be related, though he was not sure: https://discuss.pytorch.org/t/gloo-nccl-connection-issues-build-from-source/137264/3
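Connection timeouts like the Gloo "Wait timeout" above are sometimes caused by the node's hostname not resolving to a usable address inside the container. That is only an assumption here; nothing in the logs confirms it. A minimal, standard-library-only sketch to rule it out could look like this (check_hostname_bindable is a hypothetical helper name, and it does not use PyTorch or Gloo at all):

```python
import socket


def check_hostname_bindable(hostname=None):
    """Resolve hostname and try to bind a TCP socket to the resulting address.

    Returns (address, ok). address is None when name resolution fails;
    ok is True only when a socket could be bound to the resolved address.
    """
    hostname = hostname or socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None, False
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((address, 0))  # port 0 lets the OS pick any free port
        return address, True
    except OSError:
        return address, False
    finally:
        sock.close()


if __name__ == "__main__":
    print(socket.gethostname(), check_hostname_bindable())
```

If this reports a failure inside the container but not on the host, the container's hostname or /etc/hosts setup would be worth a closer look; if it succeeds, the cause is probably elsewhere.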
I am stuck. I am happy to provide more information if requested.
Thanks!