
Task Runner: Intermittent: Aggregator process is killed after restart without any error/exception log #1620

@noopurintel

Describe the bug
This is an intermittent issue: after the aggregator process is restarted (by killing its process ID and starting it again), it gets killed on its own, with no error or exception logs to indicate the reason.

The resiliency test that fails because of this is part of the PR and PQ pipelines, which are otherwise quite stable.

To Reproduce
Steps to reproduce the behavior:

  1. Start the federation with torch/mnist, 2 collaborators, and 10+ rounds.
  2. Ensure that the round number is increasing.
  3. Restart the aggregator.
  4. The aggregator is silently gone, while the collaborators keep running and keep trying to connect to it.
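
The restart step above can be sketched as a small harness that also records how the restarted process ended, which is the piece of information missing from the logs. This is a minimal sketch, not the test used in the pipelines: `describe_exit` and `restart_and_watch` are hypothetical helper names, the command passed in is whatever starts the aggregator in your setup, and the signal decoding relies on the POSIX convention that `subprocess` reports a negative return code when the child is terminated by a signal.

```python
import signal
import subprocess
import sys
import time


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable cause."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX, a negative code means the child was terminated by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def restart_and_watch(cmd, run_for=2.0, watch_for=5.0):
    """Start cmd, kill it after run_for seconds, start it again, then watch
    the restarted process for watch_for seconds and report how it ended."""
    first = subprocess.Popen(cmd)
    time.sleep(run_for)
    first.kill()  # SIGKILL on POSIX, mirroring "kill the process id"
    first.wait()

    second = subprocess.Popen(cmd)
    try:
        second.wait(timeout=watch_for)
    except subprocess.TimeoutExpired:
        # Survived the whole watch window; clean up and report that.
        second.kill()
        second.wait()
        return describe_exit(None)
    return describe_exit(second.returncode)


if __name__ == "__main__":
    # Stand-in command (an assumption): replace with the aggregator start command.
    stand_in = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(restart_and_watch(stand_in, run_for=0.5, watch_for=1.0))
```

If the restarted aggregator reports something like `killed by SIGKILL` or `killed by SIGTERM`, the process was terminated from outside rather than crashing in Python code, which would be consistent with the absence of an exception in the log.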

Example failures:

  1. When only the aggregator restarts: https://github.com/securefederatedai/openfl/actions/runs/14839141823/job/41657945065#step:4:205

    aggregator.log

  2. When the aggregator and all collaborators restart: https://github.com/securefederatedai/openfl/actions/runs/15014267823/job/42188592296#step:4:322

    aggregator.log, where "Starting the Aggregator Service." appears three times, indicating 3 starts/restarts, but with no error or exception.
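
The log check described above (counting start messages and confirming that no error was logged) can be automated with a few lines. This is a rough sketch: `summarize_log` is a hypothetical helper, the start marker is the message quoted in this issue, and the error keywords are an assumption about what a failure would look like in the log.

```python
START_MARKER = "Starting the Aggregator Service."
# Assumed keywords; adjust to whatever the logger actually emits on failure.
ERROR_HINTS = ("ERROR", "Exception", "Traceback")


def summarize_log(text):
    """Return (number of aggregator starts, list of suspicious lines)."""
    lines = text.splitlines()
    starts = sum(START_MARKER in line for line in lines)
    suspicious = [line for line in lines if any(h in line for h in ERROR_HINTS)]
    return starts, suspicious


if __name__ == "__main__":
    with open("aggregator.log", encoding="utf-8") as fh:
        starts, suspicious = summarize_log(fh.read())
    print(f"starts={starts}, suspicious_lines={len(suspicious)}")
```

A result with several starts and an empty list of suspicious lines matches the failure reported here: the process went away without logging why.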

Expected behavior
Irrespective of the number or stage of restarts, any participant should be able to come back up and join the federation.

Metadata

    Labels

    aggregator, resiliency, task_runner_e2e (Covers all the changes done as part of Task Runner E2E testing, be it native or dockerized env.)
