@cloudnoize
Contributor

Addresses #1367

If the Aggregator crashes after collecting all the data needed to finish round 0 but before saving the model, the recovery process does not construct the model proto on restart, which leads to a crash.

This PR initializes the model proto with the initial tensors, which the recovery process can override if needed.
For safety, I also moved the other initializations ahead of the recovery process.
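
For readers skimming the thread, here is a minimal sketch of the ordering this change relies on: set defaults first, then let recovery override them. It is an illustration only, not the actual OpenFL code; `ToyAggregator`, `persisted_state`, `_recover`, and `save_model` are hypothetical names, and plain dicts stand in for the model proto.

```python
from typing import Dict, List, Optional


class ToyAggregator:
    """Hypothetical stand-in for an aggregator; not OpenFL's actual class."""

    def __init__(
        self,
        initial_tensors: Dict[str, List[float]],
        persisted_state: Optional[dict] = None,
    ):
        # Step 1: initialize every field from defaults, so the object is
        # usable even if recovery has nothing to restore.
        self.model = {name: list(values) for name, values in initial_tensors.items()}
        self.round_number = 0

        # Step 2: only then run recovery, which may override the defaults.
        if persisted_state is not None:
            self._recover(persisted_state)

    def _recover(self, state: dict) -> None:
        # Override only what the persisted state actually contains.
        self.round_number = state.get("round_number", self.round_number)
        saved_model = state.get("model")
        if saved_model is not None:
            self.model = saved_model

    def save_model(self) -> Dict[str, List[float]]:
        # Safe to call after a restart: self.model always exists.
        return {name: list(values) for name, values in self.model.items()}


# Simulated restart after a crash in round 0, before any model was saved:
# the persisted state has a round number but no model.
restarted = ToyAggregator({"w": [0.0, 0.0]}, persisted_state={"round_number": 0})
snapshot = restarted.save_model()
```

With the reverse ordering (recovery first, defaults never applied), `self.model` would be missing in the scenario above and `save_model` would fail, which is the shape of the crash described in the issue.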

@noopurintel
Collaborator

noopurintel commented Feb 12, 2025

Thanks for fixing this, @cloudnoize. I have successfully tested these changes and posted the run details on the issue page.

Collaborator

@noopurintel noopurintel left a comment

Minor review comment provided. Changes look good otherwise. Tested successfully.

@cloudnoize force-pushed the elerer/1367_restart_on_save_model branch from 645d29b to 24089d5 on February 12, 2025 14:34
@cloudnoize force-pushed the elerer/1367_restart_on_save_model branch from 24089d5 to 75ebfc2 on February 13, 2025 10:17
@noopurintel merged commit c679d64 into securefederatedai:develop on Feb 13, 2025
27 of 28 checks passed
yuliasherman pushed a commit to yuliasherman/openfl that referenced this pull request Mar 18, 2025
4 participants