Add failing tests proving heartbeat surrender bug (#3404) #3407

Merged
Aaronontheweb merged 5 commits into akkadotnet:dev from Aaronontheweb:fix/lease-actor-heartbeat-repro-3404
Mar 9, 2026

Aaronontheweb (Member) commented Mar 9, 2026

Summary

What these tests assert

  • After successful acquisition and heartbeat start, a single transient heartbeat failure should not:
    • set Granted to false,
    • invoke leaseLostCallback, or
    • stop heartbeat retry attempts.
  • Instead, the actor should remain granted and schedule another heartbeat write attempt.
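
The assertions above can be sketched as a small language-neutral model. This is not the Akka.NET test code from this PR: `RetryingLeaseModel`, `on_heartbeat_failure`, and `check_single_failure_is_survivable` are hypothetical names chosen for illustration, and the model only tracks the three observable effects the tests care about.

```python
from dataclasses import dataclass


@dataclass
class RetryingLeaseModel:
    """Hypothetical stand-in for a lease actor that is already in its Granted state."""

    granted: bool = True
    lease_lost_calls: int = 0       # how many times the lease-lost callback fired
    scheduled_heartbeats: int = 0   # how many retry writes have been scheduled

    def on_heartbeat_failure(self) -> None:
        # Desired behavior: keep the lease and schedule another write attempt.
        self.scheduled_heartbeats += 1


def check_single_failure_is_survivable(actor) -> None:
    """Apply one transient failure, then check the three expectations."""
    actor.on_heartbeat_failure()
    assert actor.granted, "Granted must stay true"
    assert actor.lease_lost_calls == 0, "lease-lost callback must not fire"
    assert actor.scheduled_heartbeats >= 1, "a retry heartbeat must be scheduled"


model = RetryingLeaseModel()
check_single_failure_is_survivable(model)
```

The check takes any object exposing those three fields, so the same function could be pointed at a model of the current behavior to watch it fail.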

Current behavior (bug)

Both implementations immediately surrender the lease on Status.Failure while in the Granted state:

  • Azure: LeaseActor.cs in Granted handler (Status.Failure branch)
  • Kubernetes: LeaseActor.cs in Granted handler (Status.Failure branch)

Observed failure in both test suites:

  • Expected retry update message within heartbeat interval window
  • Actual: no retry; actor transitions to idle and marks lease lost
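
As a rough illustration of the behavior described above, the following model surrenders on the first failure. It is a sketch under assumptions, not the contents of either LeaseActor.cs: the class name, the string states, and the `pending_heartbeats` counter are all hypothetical, and the order of the side effects is arbitrary.

```python
class SurrenderingLeaseModel:
    """Hypothetical model of the reported bug: any failure while granted gives up the lease."""

    def __init__(self, lease_lost_callback):
        self.state = "granted"
        self.granted = True
        self.pending_heartbeats = 1  # one heartbeat write is assumed to be in flight
        self._lease_lost_callback = lease_lost_callback

    def on_status_failure(self, cause: Exception) -> None:
        if self.state != "granted":
            return
        # Reported behavior: no retry is scheduled; the lease is dropped at once.
        self.granted = False
        self.pending_heartbeats = 0
        self.state = "idle"
        self._lease_lost_callback(cause)


lost = []  # records every cause passed to the lease-lost callback
actor = SurrenderingLeaseModel(lost.append)
actor.on_status_failure(TimeoutError("transient write timeout"))
```

After the single failure the model is idle, has no heartbeat scheduled, and has reported the lease as lost, which matches the "Actual" line above.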

Why this PR

Previous fix attempts went straight to implementation. This PR is repro-only to build a deterministic, reviewable proof that the bug is real in both providers before proposing fixes.

Fixes #3404

Closes #3405

Aaronontheweb (Member, Author) commented

Integrated Gregorius' original Azure heartbeat-retry fix into this branch and extended it to Kubernetes for parity.

Included with attribution via cherry-pick:

  • 0cf2b4e (Retry heartbeat on transient failure instead of immediately surrendering lease)

Additional commit in this branch:

  • Kubernetes heartbeat retry implementation + matching test updates.

This preserves the repro-first evidence trail while carrying forward prior work with proper credit.

@Aaronontheweb Aaronontheweb enabled auto-merge (squash) March 9, 2026 20:11

Development

Successfully merging this pull request may close these issues.

Transient heartbeat failure immediately surrenders lease, causing unnecessary ~96s shard outage
