Closed
Description
System Info
- Transformers (4.50.0.dev0) main branch at commit 94ae1ba
- (also tried) transformers==4.49
- python==3.12
- accelerate==1.0.1
Who can help?
@muellerzr @SunMarc @ArthurZucker
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
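Run a simple FSDP training job with the state dict type set to SHARDED_STATE_DICT and the save_only_model option enabled. Training fails at the first checkpoint save with the following traceback: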
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2639, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time)
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3092, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial)
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3211, in _save_checkpoint
    self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer_callback.py", line 144, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './train_output/checkpoint-1/trainer_state.json'
Expected behavior
Report the incompatibility between SHARDED_STATE_DICT and save_only_model early in the training lifecycle, rather than erroring out at the first checkpoint save event.
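To illustrate the kind of early check being requested, here is a minimal sketch. It is not the Trainer's actual code: CheckpointOptions and validate_checkpoint_options are hypothetical names made up for this example, and the sketch assumes the conflict can be detected from three values alone (whether FSDP is enabled, the state dict type, and save_only_model).

```python
from dataclasses import dataclass


@dataclass
class CheckpointOptions:
    """Hypothetical stand-in for the relevant options (not a transformers class)."""

    fsdp_enabled: bool
    fsdp_state_dict_type: str  # e.g. "FULL_STATE_DICT" or "SHARDED_STATE_DICT"
    save_only_model: bool


def validate_checkpoint_options(opts: CheckpointOptions) -> None:
    """Raise ValueError up front if the options match the failing combination."""
    conflict = (
        opts.fsdp_enabled
        and opts.save_only_model
        and opts.fsdp_state_dict_type.upper() == "SHARDED_STATE_DICT"
    )
    if conflict:
        raise ValueError(
            "save_only_model is not compatible with FSDP state dict type "
            "SHARDED_STATE_DICT; use FULL_STATE_DICT or disable save_only_model."
        )


if __name__ == "__main__":
    # The combination from this report is rejected before any training step runs.
    try:
        validate_checkpoint_options(
            CheckpointOptions(
                fsdp_enabled=True,
                fsdp_state_dict_type="SHARDED_STATE_DICT",
                save_only_model=True,
            )
        )
    except ValueError as err:
        print(f"Rejected early: {err}")
```

Running a check like this while the training arguments are being set up would surface the problem before the first step, instead of as a FileNotFoundError at checkpoint-1.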