Multi-node CUDA error: invalid device ordinal

### System Info


Two nodes, each with 8 H100 GPUs

Accelerate commit: 8b49352

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
  fsdp_version: 2
machine_rank: 0
main_process_ip: i201
main_process_port: 5000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
parallelism_config:
  parallelism_config_cp_size: 1
  parallelism_config_dp_replicate_size: 2
  parallelism_config_dp_shard_size: 8
  parallelism_config_tp_size: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [x] My own task or dataset (give details below)

### Reproduction

`accelerate launch --config_file config_file.yaml ft_trl.py`

where `ft_trl.py` is a basic script that uses `trl.SFTTrainer`:

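```python
from trl import SFTTrainer

# Excerpt: my_args, dataset, trl_args and resume_from_checkpoint
# are defined elsewhere in ft_trl.py (not shown).
trainer = SFTTrainer(
    model=my_args.model,
    train_dataset=dataset,
    args=trl_args,
)
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
```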

The run errors out on the second node:
```
[Gloo] Rank 8 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 9 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 10 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 11 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 12 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 13 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 14 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 15 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards:   0%|                                                                                                                                                     | 0/2 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 50.11it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 243.98it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 241.14it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 123.88it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 250.70it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 239.44it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 177.79it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 251.85it/s]
[rank8]: Traceback (most recent call last):

[rank9]: Traceback (most recent call last): 
[rank9]: File "/lustre/home/rolmedo/swe-tune/ft_trl.py", line 102, in <module> 
[rank9]: trainer = SFTTrainer( 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 830, in __init__ 
[rank9]: train_dataset = self._prepare_dataset( 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 927, in _prepare_dataset
[rank9]: with PartialState().main_process_first(): 
[rank9]: File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__ 
[rank9]: return next(self.gen) 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 526, in main_process_first 
[rank9]: yield from self._goes_first(self.is_main_process) 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 409, in _goes_first 
[rank9]: self.wait_for_everyone() 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 403, in wait_for_everyone 
[rank9]: torch.distributed.barrier(device_ids=[self.process_index]) 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper 
[rank9]: return func(*args, **kwargs) 
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4811, in barrier 
[rank9]: work = group.barrier(opts=opts)
[rank9]: torch.AcceleratorError: CUDA error: invalid device ordinal
```

### Expected behavior

Training should proceed normally. The issue disappears when changing [state.py L#403](https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L403) to

```
torch.distributed.barrier(device_ids=[self.local_process_index])
```
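
To illustrate why the device id passed to the barrier matters, here is a minimal sketch in plain Python (no GPU required). In the traceback, `wait_for_everyone` calls `torch.distributed.barrier(device_ids=[self.process_index])`, and `process_index` is the global rank (0 to 15 in this setup), whereas CUDA device ordinals on each node only run from 0 to 7. The sketch assumes ranks are assigned contiguously per node, which matches the log above where ranks 8 to 15 run on the second node. The helper names `device_ordinal_is_valid` and `local_rank` are hypothetical and are not part of Accelerate or PyTorch.

```python
GPUS_PER_NODE = 8    # from the report: two nodes with 8 H100s each
NUM_PROCESSES = 16   # num_processes in the config above


def device_ordinal_is_valid(device_id: int, gpus_per_node: int) -> bool:
    """A CUDA device ordinal is valid only if 0 <= device_id < visible GPU count."""
    return 0 <= device_id < gpus_per_node


def local_rank(global_rank: int, gpus_per_node: int) -> int:
    """Assumed mapping: ranks are laid out contiguously, node by node."""
    return global_rank % gpus_per_node


# Using the global rank as the device id fails for every rank on the second node.
invalid_with_global = [
    rank for rank in range(NUM_PROCESSES)
    if not device_ordinal_is_valid(rank, GPUS_PER_NODE)
]

# Using the node-local rank as the device id is valid for every rank.
invalid_with_local = [
    rank for rank in range(NUM_PROCESSES)
    if not device_ordinal_is_valid(local_rank(rank, GPUS_PER_NODE), GPUS_PER_NODE)
]

print("invalid when using global rank:", invalid_with_global)
print("invalid when using local rank:", invalid_with_local)
```

Under these assumptions, the ranks flagged as invalid with the global index are exactly the ones hosted on the second node, which is consistent with the error appearing only there.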
