Multi-node CUDA error: invalid device ordinal #3775

@RicardoDominguez

System Info

Two nodes with 8 H100 GPUs each

Accelerate commit: 8b49352

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
  fsdp_version: 2
machine_rank: 0
main_process_ip: i201
main_process_port: 5000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
parallelism_config:
  parallelism_config_cp_size: 1
  parallelism_config_dp_replicate_size: 2
  parallelism_config_dp_shard_size: 8
  parallelism_config_tp_size: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
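
As a sanity check on the config above, the parallelism sizes multiply out to the total process count (1 x 2 x 8 x 1 = 16 = num_processes), so the mesh itself looks consistent. A minimal sketch of that check, assuming the product of the sizes is expected to equal num_processes; `check_parallelism` is a hypothetical helper written for illustration, not an Accelerate API:

```python
from math import prod


def check_parallelism(num_processes, **sizes):
    """Return True if the product of all parallelism sizes equals num_processes."""
    return prod(sizes.values()) == num_processes


# Values copied from parallelism_config in the YAML above.
config_sizes = {
    "cp_size": 1,
    "dp_replicate_size": 2,
    "dp_shard_size": 8,
    "tp_size": 1,
}

print(check_parallelism(16, **config_sizes))
```

With these values the check passes, which suggests the error is not caused by a mismatched mesh shape.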

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

accelerate launch --config_file config_file.yaml ft_trl.py

where ft_trl.py is a basic script that uses trl.SFTTrainer:

trainer = SFTTrainer(
    model=my_args.model,
    train_dataset=dataset,
    args=trl_args,
)
trainer.train(
    resume_from_checkpoint=resume_from_checkpoint
)

This errors out on the second node:

[Gloo] Rank 8 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 9 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 10 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 11 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 12 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 13 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 14 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 15 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards:   0%|                                                                                                                                                     | 0/2 [00:00<?, ?it/s]
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 50.11it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 243.98it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 241.14it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 123.88it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 250.70it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 239.44it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 177.79it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 251.85it/s]
[rank8]: Traceback (most recent call last):

[rank9]: Traceback (most recent call last):
[rank9]:   File "/lustre/home/rolmedo/swe-tune/ft_trl.py", line 102, in <module>
[rank9]:     trainer = SFTTrainer(
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 830, in __init__
[rank9]:     train_dataset = self._prepare_dataset(
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 927, in _prepare_dataset
[rank9]:     with PartialState().main_process_first():
[rank9]:   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
[rank9]:     return next(self.gen)
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 526, in main_process_first
[rank9]:     yield from self._goes_first(self.is_main_process)
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 409, in _goes_first
[rank9]:     self.wait_for_everyone()
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 403, in wait_for_everyone
[rank9]:     torch.distributed.barrier(device_ids=[self.process_index])
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank9]:     return func(*args, **kwargs)
[rank9]:   File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4811, in barrier
[rank9]:     work = group.barrier(opts=opts)
[rank9]: torch.AcceleratorError: CUDA error: invalid device ordinal

Expected behavior

The expected behavior is that training proceeds normally. The issue disappears when line 403 of accelerate/state.py is changed to:

torch.distributed.barrier(device_ids=[self.local_process_index])
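
The failure is consistent with the global rank being used as a CUDA device ordinal: on the second node the ranks are 8 to 15, but each node only exposes ordinals 0 to 7. A minimal sketch of the mismatch, assuming ranks are assigned contiguously per node and every node has the same number of GPUs; `is_valid_ordinal` and `local_device_index` are hypothetical helpers for illustration, not part of Accelerate or PyTorch:

```python
GPUS_PER_NODE = 8  # two nodes of 8 GPUs each, as in this report


def is_valid_ordinal(index: int, gpus_per_node: int) -> bool:
    """A device ordinal is only valid if it addresses a GPU on the local node."""
    return 0 <= index < gpus_per_node


def local_device_index(global_rank: int, gpus_per_node: int) -> int:
    """Map a global rank to a per-node device index (assumes contiguous rank layout)."""
    return global_rank % gpus_per_node


second_node_ranks = range(8, 16)

# Using the global rank directly: every rank on the second node is out of range.
invalid = [r for r in second_node_ranks if not is_valid_ordinal(r, GPUS_PER_NODE)]

# Using a per-node index instead: all ranks map to ordinals 0..7.
remapped = [local_device_index(r, GPUS_PER_NODE) for r in second_node_ranks]

print(invalid)
print(remapped)
```

This would also explain why the first node (ranks 0 to 7) does not hit the error: there the global rank and the per-node index coincide.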
