Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` #3540
SunMarc
left a comment
Thanks! I shared another solution in the issue.
src/accelerate/checkpointing.py
Outdated
optimizer_load_kwargs (`dict`, *optional*):
    Additional arguments that can be passed to the optimizer's `load` function.
scheduler_load_kwargs (`dict`, *optional*):
    Additional arguments that can be passed to the scheduler's `load` function.
dataloader_load_kwargs (`dict`, *optional*):
    Additional arguments that can be passed to the dataloader's `load` function.
Instead of creating separate kwargs for each of them, let's just have a single load_kwargs that we use for every load call in this function. Also, it is better to default load_kwargs to None and set it to dict() inside the function, because a dict default is mutable.
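To illustrate the mutable-default point, here is a minimal sketch in plain Python. The function names `load_with_default_dict` and `load_with_none_default` are hypothetical and exist only for this illustration; they are not part of accelerate.

```python
def load_with_default_dict(path, load_kwargs={}):
    # The default dict is created once, when the function is defined,
    # so every call that omits load_kwargs shares the same object.
    load_kwargs.setdefault("calls", 0)
    load_kwargs["calls"] += 1
    return load_kwargs


def load_with_none_default(path, load_kwargs=None):
    # A fresh dict is created on each call when nothing is passed,
    # so state cannot leak from one call to the next.
    if load_kwargs is None:
        load_kwargs = {}
    load_kwargs.setdefault("calls", 0)
    load_kwargs["calls"] += 1
    return load_kwargs
```

With the first variant, the counter keeps growing across calls because the default dict is shared; with the second, each call starts from an empty dict.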
I am not sure I understood your suggestion correctly. You want to use load_kwargs for all load functions? I implemented that, but I am not sure this is the best way to go. In my case the error only happened with the optimizer load. Anyway, let me know what you think.
state_dict = load(input_dataloader_state_dict_file, map_location=None, **dataloader_load_kwargs)
dataloader.load_state_dict(state_dict)
logger.info("All dataloader sampler states loaded successfully")
Let's also include it for the scaler and the states.
@luiz0992, let me know if you are planning to finish this PR or not.
src/accelerate/accelerator.py
Outdated
optimizer_load_kwargs: dict[str, Any] = {},
scheduler_load_kwargs: dict[str, Any] = {},
dataloader_load_kwargs: dict[str, Any] = {},
**load_model_func_kwargs,
Please keep load_model_func_kwargs, since load_model has different kwargs compared to load.
The load_model does not accept kwargs. Let me know if I am mistaken.
In safetensors, we have the following:
def load_model(
    model: torch.nn.Module, filename: Union[str, os.PathLike], strict: bool = True, device: Union[str, int] = "cpu"
) -> Tuple[List[str], List[str]]:
If you can revert the changes related to load_model_func_kwargs and only apply load_kwargs where we use load, that would be better.
As it is currently used, only the strict argument is missing, but that's OK. I reverted the load_model_func_kwargs changes.
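The split the thread settles on can be sketched roughly as follows. This is an illustrative stand-in under stated assumptions, not the code in `src/accelerate/checkpointing.py`: `load`, `load_model` and `load_state` here are simplified stubs written for this example, and the checkpoint file names are hypothetical.

```python
def load(path, map_location=None, **kwargs):
    # Stub for a generic checkpoint loader; records what it was given.
    return {"path": path, "map_location": map_location, "extra": kwargs}


def load_model(model, filename, strict=True, device="cpu"):
    # Stub mirroring the safetensors signature quoted above: no **kwargs.
    return {"filename": filename, "strict": strict, "device": device}


def load_state(input_dir, load_kwargs=None, **load_model_func_kwargs):
    # Default to None and build the dict here to avoid a mutable default.
    if load_kwargs is None:
        load_kwargs = {}
    # Model loading keeps its own keyword arguments.
    model_result = load_model(
        "model", f"{input_dir}/model.safetensors", **load_model_func_kwargs
    )
    # Every generic load call shares the same load_kwargs.
    optimizer_result = load(
        f"{input_dir}/optimizer.bin", map_location="cpu", **load_kwargs
    )
    scheduler_result = load(f"{input_dir}/scheduler.bin", **load_kwargs)
    return model_result, optimizer_result, scheduler_result
```

A caller would then pass loader options through `load_kwargs` and model options as plain keyword arguments, for example `load_state("ckpt", load_kwargs={"weights_only": False}, strict=False)`.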
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
commit 2f8fd72 Author: Simon <80467011+sorgfresser@users.noreply.github.com> Date: Tue Jun 10 13:50:34 2025 +0100 Remove device_count (#3587) commit d2e6b03 Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com> Date: Tue Jun 10 05:26:48 2025 -0700 [FSDP2] Refactor + FP8 (#3585) * Fix double wrap * Clocking off, ~equal to torch baseline * works? * Working version * Partial rewrite * FSDP2 path works * Fix back prepare * Almost done, proper AC left * Feat: should work, cleanup + test more benchmarks left * Style+quality * Feat: fp8 example * Feat: better example * Feat: add readme * Docs + should be done * Fix: typos * Fix: protect imports * Feat: address comments * Feat: add flops image commit b9fee48 Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com> Date: Tue Jun 10 13:24:43 2025 +0100 better handle FP8 with and without deepspeed (#3611) * use the state mixed precision which has undergone all preprocessing * Update src/accelerate/accelerator.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update src/accelerate/accelerator.py * accelerator state sets the mixed precision for deepspeed and fp8_enabled * fix * fix --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> commit 3a82b05 Author: Marc Sun <57196510+SunMarc@users.noreply.github.com> Date: Tue Jun 10 11:29:59 2025 +0200 Fix bf16 training with TP (#3610) * fix * Apply style fixes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> commit 6b61a37 Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com> Date: Fri Jun 6 13:48:43 2025 +0100 fix deepspeed regional compilation (#3609) commit 682691d Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com> Date: Tue Jun 3 12:36:56 2025 +0200 Update Gaudi Runners (#3593) * test * fix * push * in the morning * fix backend * run first * set habana modules * dynamo backend * trigger * remove on pr * 
remove on file change commit 791055b Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com> Date: Tue Jun 3 12:24:20 2025 +0200 Fix: list object has no attribute keys (#3603) commit 16bf1d8 Author: Yao Matrix <matrix.yao@intel.com> Date: Fri May 30 23:36:34 2025 +0800 enable torchao and pippy test cases on XPU (#3599) * enable torchao and pippy test cases on XPU Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix style Signed-off-by: Matrix YAO <matrix.yao@intel.com> --------- Signed-off-by: Matrix YAO <matrix.yao@intel.com> commit ab3c604 Author: Yao Matrix <matrix.yao@intel.com> Date: Fri May 30 23:23:26 2025 +0800 enable big_model_inference on xpu (#3595) * enable big_model_inference on XPU Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix style Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix quality Signed-off-by: Matrix YAO <matrix.yao@intel.com> --------- Signed-off-by: Matrix YAO <matrix.yao@intel.com> commit 273799c Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 27 20:08:59 2025 +0800 enable fsdp2 benchmark on XPU (#3590) * enable fsdp2 benchmark on XPU Signed-off-by: Matrix YAO <matrix.yao@intel.com> * add deterministic Signed-off-by: Matrix YAO <matrix.yao@intel.com> --------- Signed-off-by: Matrix YAO <matrix.yao@intel.com> commit 43526c5 Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 27 17:44:50 2025 +0800 add device-agnostic GradScaler (#3588) * add device-agnostic GradScaler Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix bug Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix review comments Signed-off-by: Matrix YAO <matrix.yao@intel.com> * fix Signed-off-by: Matrix YAO <matrix.yao@intel.com> * format Signed-off-by: Matrix YAO <matrix.yao@intel.com> * Apply style fixes --------- Signed-off-by: Matrix YAO <matrix.yao@intel.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> commit 07f2392 Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 27 17:17:18 2025 
+0800 change to use torch.device (#3594) Signed-off-by: Matrix YAO <matrix.yao@intel.com> commit ee2f48c Author: Fanli Lin <fanli.lin@intel.com> Date: Tue May 27 17:16:42 2025 +0800 [docs] no hard-coded cuda in the ddp documentation (#3589) * make device-agnostic * refactor commit 4f3abb7 Author: jiqing-feng <jiqing.feng@intel.com> Date: Mon May 26 21:55:10 2025 +0800 Set ccl and KMP param in simple launch (#3575) * Even 1 CPU mechine can also run multi process Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix ccl and kml param setting Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * set master addr only when processes > 1 Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix num process check Signed-off-by: jiqing-feng <jiqing.feng@intel.com> * fix ccl args check Signed-off-by: jiqing-feng <jiqing.feng@intel.com> --------- Signed-off-by: jiqing-feng <jiqing.feng@intel.com> commit db536cb Author: Yuanzhou Cai <80858000+yuanjua@users.noreply.github.com> Date: Mon May 26 21:08:13 2025 +0800 Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup (#3581) * Fix tracker initialize distributed before InitProcessGroupKwargs * Fix tracker initialize distributed before InitProcessGroupKwargs * Add test for bug #3550 * Improve test for #3550 * Remove redundant code Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * fix style --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> commit 4e9d0de Author: Yao Matrix <matrix.yao@intel.com> Date: Mon May 26 21:05:42 2025 +0800 enable regional_compilation benchmark on xpu (#3592) * enable regional_compilation benchmark on xpu Signed-off-by: Matrix YAO <matrix.yao@intel.com> * Apply style fixes --------- Signed-off-by: Matrix YAO <matrix.yao@intel.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> commit 8cb3ace Author: Luiz F. G. 
dos Santos <luiz.fernando0992@gmail.com> Date: Thu May 22 10:21:54 2025 -0500 Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` (#3540) * Added artifacts and figure tracking at MLFlow tracker * Added `log_artifact` to the MLFlowTracker * Remove changes * Added kwargs when loading state. * added doc string * Adjusted correct default types of kwargs * Changed the load kwargs to a single one * removed None value from kwargs * fix kwargs for loading the model * removed load_kwargs from optimizer state dict * make load_kwargs a dictionary * revert last changes * reverted load_kwargs * fix docstring * added dict initiation * Fix quality error during PR commit b6d97cb Author: Emmanuel Ferdman <emmanuelferdman@gmail.com> Date: Thu May 22 17:26:31 2025 +0300 Resolve logger warnings (#3582) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> commit 33967d4 Author: Francesco Laiti <25352428+laitifranz@users.noreply.github.com> Date: Tue May 20 12:29:53 2025 +0200 Add support for standalone mode when default port is occupied on single node (#3576) * add standalone mode and replace ConnectionError with a warning when the main process port is in use, allowing for automatic port selection * address review feedback: warn on port conflict only for single-node; raise error for multi-node * Apply style fixes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> commit 5b1fcda Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 20 18:04:24 2025 +0800 enable test_cli & test_example cases on XPU (#3578) * enable test_cli & test_example cases on XPU Signed-off-by: Matrix Yao <matrix.yao@intel.com> * fix style Signed-off-by: Matrix Yao <matrix.yao@intel.com> * fix style Signed-off-by: Matrix Yao <matrix.yao@intel.com> * remove print Signed-off-by: Matrix Yao <matrix.yao@intel.com> * fix ci issue Signed-off-by: YAO Matrix <matrix.yao@intel.com> --------- Signed-off-by: Matrix Yao 
<matrix.yao@intel.com> Signed-off-by: YAO Matrix <matrix.yao@intel.com> commit f55f053 Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 20 18:02:14 2025 +0800 goodbye torch_ccl (#3580) Signed-off-by: Matrix Yao <matrix.yao@intel.com> commit 1ec99f0 Author: Yao Matrix <yaoweifeng0301@126.com> Date: Mon May 19 17:27:40 2025 +0800 enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU (#3579) * enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU Signed-off-by: Matrix Yao <matrix.yao@intel.com> * fix style Signed-off-by: Matrix Yao <matrix.yao@intel.com> * Update test_load_checkpoint_and_dispatch_with_broadcast.py --------- Signed-off-by: Matrix Yao <matrix.yao@intel.com>
<matrix.yao@intel.com> Signed-off-by: YAO Matrix <matrix.yao@intel.com> commit f55f053 Author: Yao Matrix <matrix.yao@intel.com> Date: Tue May 20 18:02:14 2025 +0800 goodbye torch_ccl (#3580) Signed-off-by: Matrix Yao <matrix.yao@intel.com> commit 1ec99f0 Author: Yao Matrix <yaoweifeng0301@126.com> Date: Mon May 19 17:27:40 2025 +0800 enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU (#3579) * enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU Signed-off-by: Matrix Yao <matrix.yao@intel.com> * fix style Signed-off-by: Matrix Yao <matrix.yao@intel.com> * Update test_load_checkpoint_and_dispatch_with_broadcast.py --------- Signed-off-by: Matrix Yao <matrix.yao@intel.com>
What does this PR do?
This PR fixes issue #3539.
This PR adds support for passing keyword arguments to the load functions of optimizers, schedulers, and dataloaders when calling the Accelerator's `load_state` method, so callers can control how each of these states is loaded.
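The keyword-forwarding pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `load_file` and `load_training_state` are hypothetical names, and JSON files stand in for the real checkpoint format so that the example runs without PyTorch. It mirrors the review suggestion of a single `load_kwargs` argument that defaults to `None` and is turned into a dict inside the function.

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path
from typing import Any, Optional


def load_file(path: Path, **kwargs: Any) -> Any:
    """Hypothetical stand-in for a low-level `load` helper (JSON, not torch)."""
    return json.loads(path.read_text(), **kwargs)


def load_training_state(input_dir: str, load_kwargs: Optional[dict] = None) -> dict:
    """Load optimizer, scheduler and dataloader state, forwarding `load_kwargs`."""
    # Default to None and build the dict here, so calls never share a mutable default.
    load_kwargs = {} if load_kwargs is None else load_kwargs
    return {
        name: load_file(Path(input_dir) / f"{name}.json", **load_kwargs)
        for name in ("optimizer", "scheduler", "dataloader")
    }


def demo() -> dict:
    """Write three tiny state files, then reload them with a custom keyword argument."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("optimizer", "scheduler", "dataloader"):
            (Path(tmp) / f"{name}.json").write_text(json.dumps({"value": 0.1}))
        # `parse_float` is a real `json.loads` argument; it stands in for a torch-specific one.
        return load_training_state(tmp, load_kwargs={"parse_float": Decimal})


states = demo()
```

Keeping one shared dictionary means every load call in the function receives the same options, which matches the direction the review took; the model-loading keyword arguments stay separate because that loader accepts different options.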
The changes include:
Who can review?
@BenjaminBossan @SunMarc @zach-huggingface