System Info
transformers==4.51.3
Python version: 3.11
Who can help?
@ArthurZucker @amyeroberts @qubvel
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Load a Llama4ForCausalLM model and pass it to the trainer with the FSDP auto-wrap policy enabled, e.g.:

from transformers import Llama4ForCausalLM
from trl import SFTTrainer

model = Llama4ForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct", torch_dtype="auto"
)

# Training
trainer = SFTTrainer(
    # model=model_args.model_name_or_path,
    model=model,
    args=training_args,
    ...
)
This produces the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/tmp/tmp.RYY4AI2EBM/ephemeral_script.py", line 137, in <module>
[rank0]: main({'model_name_or_path': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'model_revision': 'main', 'torch_dtype': 'bfloat16', 'attn_implementation': 'flex_attention', 'use_liger': False, 'use_peft': False, 'lora_r': 16, 'lora_alpha': 8, 'lora_dropout': 0.05, 'lora_target_modules': ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], 'lora_modules_to_save': ['lm_head', 'embed_tokens'], 'load_in_4bit': False, 'load_in_8bit': False, 'dataset_name': 'gsm8k', 'dataset_config': 'main', 'dataset_train_split': 'train', 'dataset_test_split': 'test', 'dataset_text_field': 'text', 'dataset_kwargs': {'add_special_tokens': False, 'append_concat_token': False}, 'max_seq_length': 8192, 'dataset_batch_size': 1000, 'packing': False, 'padding_free': False, 'num_train_epochs': 10, 'per_device_train_batch_size': 64, 'per_device_eval_batch_size': 64, 'auto_find_batch_size': False, 'eval_strategy': 'epoch', 'bf16': True, 'tf32': False, 'learning_rate': 0.0002, 'warmup_steps': 10, 'lr_scheduler_type': 'inverse_sqrt', 'optim': 'adamw_torch_fused', 'max_grad_norm': 1.0, 'seed': 42, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'gradient_checkpointing_kwargs': {'use_reentrant': False}, 'fsdp': 'full_shard auto_wrap', 'fsdp_config': {'activation_checkpointing': True, 'cpu_ram_efficient_loading': False, 'sync_module_states': True, 'use_orig_params': True, 'limit_all_gathers': False}, 'save_strategy': 'no', 'save_total_limit': 1, 'resume_from_checkpoint': False, 'log_level': 'info', 'logging_strategy': 'steps', 'logging_steps': 1, 'report_to': ['tensorboard'], 'output_dir': '/mnt/shared/Llama-4-Scout-17B-16E-Instruct'})
[rank0]: File "/tmp/tmp.RYY4AI2EBM/ephemeral_script.py", line 130, in main
[rank0]: trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer.py", line 2238, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
[rank0]: self.model = self.accelerator.prepare(self.model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1446, in prepare
[rank0]: result = tuple(
[rank0]: ^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1630, in prepare_model
[rank0]: self.state.fsdp_plugin.set_auto_wrap_policy(model)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/accelerate/utils/dataclasses.py", line 1903, in set_auto_wrap_policy
[rank0]: raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
[rank0]: ValueError: Could not find the transformer layer class Llama4VisionEncoderLayer in the model.
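For context: the failure is raised in accelerate's FullyShardedDataParallelPlugin.set_auto_wrap_policy. With fsdp set to "full_shard auto_wrap" and no transformer_layer_cls_to_wrap configured, the plugin appears to take the class names to wrap from model._no_split_modules, which for Llama 4 seemingly lists Llama4VisionEncoderLayer even though the text-only Llama4ForCausalLM contains no vision layers. A simplified sketch of that lookup (not the exact accelerate source, only an illustration of the mechanism):

def get_module_class_from_name(module, name):
    # Simplified stand-in for the helper accelerate uses: walk the module tree
    # and return the class whose __name__ matches `name`, or None if absent.
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None

def resolve_layer_classes(model, transformer_cls_names_to_wrap=None):
    # Sketch of the lookup in set_auto_wrap_policy: without an explicit
    # transformer_layer_cls_to_wrap, the names come from model._no_split_modules,
    # and any name that never occurs in the model raises the ValueError above.
    names = transformer_cls_names_to_wrap or getattr(model, "_no_split_modules", [])
    resolved = set()
    for layer_class in names:
        cls = get_module_class_from_name(model, layer_class)
        if cls is None:
            # For Llama4ForCausalLM, "Llama4VisionEncoderLayer" never resolves,
            # which is the error shown in the traceback.
            raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
        resolved.add(cls)
    return resolved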
Expected behavior
The model should load and be wrapped by FSDP successfully; the auto-wrap policy should not fail on Llama4VisionEncoderLayer, which does not exist in the text-only Llama4ForCausalLM.
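A possible workaround (untested here): name the layer classes to wrap explicitly in fsdp_config so the auto-wrap policy does not fall back to _no_split_modules. This is a minimal sketch; the class name Llama4TextDecoderLayer is an assumption and should be checked against the installed transformers version.

from trl import SFTConfig

# Minimal sketch: tell FSDP auto-wrap exactly which layer class to look for.
# "Llama4TextDecoderLayer" is an assumed class name; verify it against
# transformers' Llama 4 modeling code before relying on this.
training_args = SFTConfig(
    output_dir="/mnt/shared/Llama-4-Scout-17B-16E-Instruct",
    bf16=True,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "transformer_layer_cls_to_wrap": ["Llama4TextDecoderLayer"],
        "activation_checkpointing": True,
        "use_orig_params": True,
    },
)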