Closed
Description
System Info
Two nodes of 8 H100s
Accelerate commit: 8b49352
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
  fsdp_version: 2
machine_rank: 0
main_process_ip: i201
main_process_port: 5000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
parallelism_config:
  parallelism_config_cp_size: 1
  parallelism_config_dp_replicate_size: 2
  parallelism_config_dp_shard_size: 8
  parallelism_config_tp_size: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
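As a sanity check on the config above, the parallelism sizes multiply out to the launched process count (a quick sketch; the variable names mirror the `parallelism_config_*` keys):

```python
# World size implied by the parallelism config:
# dp_replicate=2 (one data-parallel replica per node) x dp_shard=8 (FSDP
# shards within each replica), with cp=1 and tp=1 contributing no factor.
dp_replicate_size = 2
dp_shard_size = 8
cp_size = 1
tp_size = 1

world_size = dp_replicate_size * dp_shard_size * cp_size * tp_size
print(world_size)  # 16, matching num_processes (2 nodes x 8 H100s)
```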
Reproduction
accelerate launch --config_file config_file.yaml ft_trl.py
where `ft_trl.py` is a basic script that uses `trl.SFTTrainer`:
trainer = SFTTrainer(
    model=my_args.model,
    train_dataset=dataset,
    args=trl_args,
)
trainer.train(
    resume_from_checkpoint=resume_from_checkpoint
)
It errors out on the second node:
[Gloo] Rank 8 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 9 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 10 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 11 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 12 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 13 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 14 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
[Gloo] Rank 15 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
Loading 250 datasets from /fast/rolmedo/swesmith/datasets/qwen3-tokenized-30b-rep8-trunc151-packed/
Checking for valid checkpoints in /fast/rolmedo/swe-models//4b-rep8-trunc151-6e-5/...
Size of the raw dataset: 84555
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 50.11it/s]
[rank8]: Traceback (most recent call last):
[rank9]: Traceback (most recent call last):
[rank9]: File "/lustre/home/rolmedo/swe-tune/ft_trl.py", line 102, in <module>
[rank9]: trainer = SFTTrainer(
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 830, in __init__
[rank9]: train_dataset = self._prepare_dataset(
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 927, in _prepare_dataset
[rank9]: with PartialState().main_process_first():
[rank9]: File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
[rank9]: return next(self.gen)
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 526, in main_process_first
[rank9]: yield from self._goes_first(self.is_main_process)
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 409, in _goes_first
[rank9]: self.wait_for_everyone()
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/accelerate/state.py", line 403, in wait_for_everyone
[rank9]: torch.distributed.barrier(device_ids=[self.process_index])
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank9]: return func(*args, **kwargs)
[rank9]: File "/lustre/home/rolmedo/tflatest/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4811, in barrier
[rank9]: work = group.barrier(opts=opts)
[rank9]: torch.AcceleratorError: CUDA error: invalid device ordinal
Expected behavior
The expected behavior is that training proceeds normally. The issue disappears when changing state.py L#403 to
torch.distributed.barrier(device_ids=[self.local_process_index])
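The error is consistent with a global rank being passed where a per-node CUDA device ordinal is expected: on the second node the global ranks are 8-15, but each node only exposes device ordinals 0-7. A minimal sketch of the mapping (plain Python, no torch; `GPUS_PER_NODE = 8` is an assumption matching this 2-node H100 setup):

```python
GPUS_PER_NODE = 8  # assumption: 8 H100s per node, as in the config above

def local_device_ordinal(global_rank: int) -> int:
    """Map a global process rank to the CUDA device ordinal on its own node."""
    return global_rank % GPUS_PER_NODE

# Ranks 0-7 live on node 0, ranks 8-15 on node 1; both nodes address GPUs 0-7.
print(local_device_ordinal(9))   # 1  (rank 9 -> second GPU of node 1)
print(local_device_ordinal(15))  # 7  (rank 15 -> last GPU of node 1)
# Passing 9 or 15 directly as a device ordinal is out of range on an 8-GPU
# node, which is what `CUDA error: invalid device ordinal` reports.
```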