Description
I get this error when loading the latest full-state checkpoint:
[rank9]: Traceback (most recent call last):
[rank9]: File "/workspace/home/lab/osilkin/training/src/instructlab/training/main_ds.py", line 821, in <module>
[rank9]: main(args)
[rank9]: File "/workspace/home/lab/osilkin/training/src/instructlab/training/main_ds.py", line 432, in main
[rank9]: load_latest_full_state(args=args, accelerator=accelerator)
[rank9]: File "/workspace/home/lab/osilkin/training/src/instructlab/training/utils.py", line 891, in load_latest_full_state
[rank9]: accelerator.load_state(latest)
[rank9]: File "/workspace/home/lab/osilkin/training_hub/.venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 3685, in load_state
[rank9]: load_fsdp_model(self.state.fsdp_plugin, self, model, input_dir, i)
[rank9]: File "/workspace/home/lab/osilkin/training_hub/.venv/lib/python3.12/site-packages/accelerate/utils/fsdp_utils.py", line 197, in load_fsdp_model
[rank9]: state_dict = torch.load(input_model_file, weights_only=True)
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/workspace/home/lab/osilkin/training_hub/.venv/lib/python3.12/site-packages/torch/serialization.py", line 1484, in load
[rank9]: with _open_file_like(f, "rb") as opened_file:
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/workspace/home/lab/osilkin/training_hub/.venv/lib/python3.12/site-packages/torch/serialization.py", line 759, in _open_file_like
[rank9]: return _open_file(name_or_buffer, mode)
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/workspace/home/lab/osilkin/training_hub/.venv/lib/python3.12/site-packages/torch/serialization.py", line 740, in __init__
[rank9]: super().__init__(open(name, mode))
[rank9]: ^^^^^^^^^^^^^^^^
[rank9]: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/nvme0n1/output-checkpoints/sft/full_state/epoch_0/pytorch_model_fsdp.bin'
This happens when training on 2 nodes.
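To narrow this down, it may help to check from every rank whether the checkpoint file is actually visible on that rank's host. The sketch below is only a diagnostic under some assumptions: the helper name `describe_checkpoint_visibility` is hypothetical (it is not part of instructlab or accelerate), and it assumes the launcher sets the `RANK` environment variable, as `torchrun` does. One unconfirmed possibility is that `/mnt/nvme0n1` is node-local storage, so a file written on the first node would not exist on the second.

```python
import os
import socket
from pathlib import Path


def describe_checkpoint_visibility(checkpoint_dir, filename="pytorch_model_fsdp.bin"):
    """Report whether the checkpoint file is visible from this process.

    Hypothetical helper for debugging only. The default filename is taken
    from the FileNotFoundError message in the traceback.
    """
    path = Path(checkpoint_dir) / filename
    return {
        # Assumption: RANK is set by the launcher (torchrun does this);
        # falls back to 0 when run as a single process.
        "rank": int(os.environ.get("RANK", "0")),
        "host": socket.gethostname(),
        "path": str(path),
        "exists": path.is_file(),
    }
```

Calling `print(describe_checkpoint_visibility("/mnt/nvme0n1/output-checkpoints/sft/full_state/epoch_0"))` on each rank just before `accelerator.load_state(...)` would show which hosts can see the file. If only the ranks on one node report `exists: True`, the checkpoint directory is probably not shared between the nodes.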