save_only_model with FSDP throws FileNotFoundError #36626

@kmehant

System Info

  • Transformers (4.50.0.dev0) main branch at commit 94ae1ba
  • (also tried) transformers==4.49
  • python==3.12
  • accelerate==1.0.1

Who can help?

@muellerzr @SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run a simple FSDP training job with the state dict type set to SHARDED_STATE_DICT and the save_only_model option enabled. The first checkpoint save fails with:

  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2639, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time)
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial)
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3211, in _save_checkpoint
    self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer_callback.py", line 144, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './train_output/checkpoint-1/trainer_state.json'

Expected behavior

Report the incompatibility early in the training lifecycle rather than erroring out at the first checkpoint save event.
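
To illustrate the kind of early check requested here, below is a minimal sketch of a fail-fast validation that runs at setup time instead of at the first save. It is not the Trainer's actual code: `SaveConfig`, its fields, and `validate_save_config` are hypothetical names introduced only for this example, and the assumption that the combination should be rejected outright (rather than handled) comes from the expected behavior described above.

```python
from dataclasses import dataclass


@dataclass
class SaveConfig:
    """Hypothetical stand-in for the relevant training options."""

    save_only_model: bool = False
    fsdp_enabled: bool = False
    fsdp_state_dict_type: str = "FULL_STATE_DICT"


def validate_save_config(cfg: SaveConfig) -> None:
    """Raise at setup time if the options are assumed to conflict.

    Assumption: save_only_model combined with FSDP and
    SHARDED_STATE_DICT is the unsupported combination.
    """
    if (
        cfg.save_only_model
        and cfg.fsdp_enabled
        and cfg.fsdp_state_dict_type == "SHARDED_STATE_DICT"
    ):
        raise ValueError(
            "save_only_model is not compatible with FSDP state dict type "
            "SHARDED_STATE_DICT; use FULL_STATE_DICT or disable "
            "save_only_model."
        )


# Example: this configuration is rejected before any training step runs.
try:
    validate_save_config(
        SaveConfig(
            save_only_model=True,
            fsdp_enabled=True,
            fsdp_state_dict_type="SHARDED_STATE_DICT",
        )
    )
except ValueError as err:
    print(f"Configuration rejected: {err}")
```

A check of this shape would surface the problem when the arguments are parsed, so the user sees a clear message instead of a FileNotFoundError after the first training steps have already run.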
