-
Notifications
You must be signed in to change notification settings - Fork 234
Description
Describe the bug
This is an intermittent issue where after restart of aggregator process (by killing the process id and starting it again), it gets killed on its own with no error/exception logs to indicate the reason.
The resiliency test failing because of this is part of PR and PQ pipelines which are otherwise quite stable.
To Reproduce
Steps to reproduce the behavior:
- Start the federation with
torch/mnist, 2 collaborators and 10+ rounds. - Ensure that the rounds are increasing.
- Restart aggregator
- Aggregator is silently gone with collaborators running and still trying to connect to it.
Example failures -
-
When only aggregator restarts - https://github.com/securefederatedai/openfl/actions/runs/14839141823/job/41657945065#step:4:205
-
When aggregator and all collaborators restart - https://github.com/securefederatedai/openfl/actions/runs/15014267823/job/42188592296#step:4:322
aggregator.log - where
Starting the Aggregator Service.appears thrice indicating 3 start/restarts, but no error/exception etc.
Expected behavior
Irrespective of number/stage of restart for any participant, it should be able to come up and join the federation.