[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/sdxl_example.py FAILED
Failures:
[1]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 10454)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10454
Root Cause (first observed failure):
[0]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 10453)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10453