
Error when running torchrun --nproc_per_node=2 scripts/sdxl_example.py: NCCL Error 5 (invalid usage). Environment: torch 2.2.1, CUDA 11.8, Python 3.10. #13

@CharvinMei


[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 10454)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10454

Root Cause (first observed failure):
[0]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 10453)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10453
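
The error message itself suggests re-running with NCCL_DEBUG=WARN to get more detail. A minimal sketch of how that re-run could be set up is below. It only assumes the same command and script path as in the title. The helper name build_debug_launch is hypothetical and not part of torch or this repository, and the actual launch is left as a comment so nothing runs by accident.

```python
import os
import shlex


def build_debug_launch(script="scripts/sdxl_example.py", nproc=2, level="WARN"):
    """Return (command, environment) for re-running the failing launch
    with NCCL logging enabled. Hypothetical helper, for illustration only."""
    env = dict(os.environ)          # copy, so the current process is untouched
    env["NCCL_DEBUG"] = level       # variable named in the NCCL error message
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]
    return cmd, env


cmd, env = build_debug_launch()

# Print the equivalent shell command line for copy/paste.
print(f"NCCL_DEBUG={env['NCCL_DEBUG']}", shlex.join(cmd))

# To actually launch from Python (requires torchrun on PATH):
#   import subprocess
#   subprocess.run(cmd, env=env, check=False)
```

Setting the variable in a copied environment keeps the change local to the child processes, which may be convenient when comparing a normal run against a debug run from the same shell.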
