Add failing tests proving premature LeaseAcquired bug (#3402) #3406

Merged
Aaronontheweb merged 5 commits into akkadotnet:dev from Aaronontheweb:fix/lease-actor-premature-acquire
Mar 9, 2026

@Aaronontheweb Aaronontheweb commented Mar 9, 2026

Summary

Adds deterministic reproduction tests for the premature LeaseAcquired split-brain bug in both Azure and Kubernetes LeaseActor implementations.

These tests intentionally FAIL to prove the bug exists. The fix will come in a subsequent commit.

The Bug

When a CAS conflict occurs during the Granting state and the blob/configmap has no owner (the version has moved on), LeaseActor sends LeaseAcquired to the caller before the retry write completes (K8s line 365, Azure line 362). If the retry then fails because another node takes the lease, the caller already believes it holds the lease, but it does not: localGranted is never set to true and the heartbeat never starts.
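The ordering problem can be illustrated with a small, self-contained model. This is only a sketch in Python, not the Akka.NET code: `FakeStore`, `acquire`, `run`, and the node names are hypothetical stand-ins, and the `notify_before_retry` flag simply switches between "notify, then retry" and "retry, then notify".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the blob/configmap record."""
    owner: Optional[str] = None
    version: int = 0

    def compare_and_set(self, expected_version: int, new_owner: str) -> bool:
        # The write succeeds only if the record is unchanged since it was read.
        if expected_version != self.version:
            return False
        self.owner = new_owner
        self.version += 1
        return True


def acquire(
    store: FakeStore,
    node: str,
    stale_version: int,
    steal: Callable[[FakeStore], None],
    notify_before_retry: bool,
) -> List[str]:
    """Return the messages the caller would observe, in order."""
    if store.compare_and_set(stale_version, node):
        return ["LeaseAcquired"]
    # CAS conflict: re-read the record.
    if store.owner is not None:
        return ["LeaseTaken"]
    sent: List[str] = []
    current = store.version
    if notify_before_retry:
        sent.append("LeaseAcquired")  # premature: the retry has not run yet
    steal(store)  # another node takes the lease inside the race window
    if store.compare_and_set(current, node):
        if not notify_before_retry:
            sent.append("LeaseAcquired")
    else:
        sent.append("LeaseTaken")
    return sent


def run(notify_before_retry: bool) -> Tuple[List[str], Optional[str]]:
    # The version has moved on (1), but the record still has no owner.
    store = FakeStore(owner=None, version=1)

    def steal(s: FakeStore) -> None:
        s.compare_and_set(s.version, "node-b")

    messages = acquire(store, "node-a", 0, steal, notify_before_retry)
    return messages, store.owner


if __name__ == "__main__":
    print(run(True))   # notify-before-retry ordering
    print(run(False))  # notify-after-retry ordering
```

In this model, the notify-before-retry ordering reports LeaseAcquired to node-a first even though node-b ends up owning the record, while the notify-after-retry ordering reports only LeaseTaken.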

Test Results

| Test | K8s | Azure | Expected |
| --- | --- | --- | --- |
| ShouldNotSendPrematureLeaseAcquiredWhenConflictRetryIsStolen | FAIL | FAIL | Will pass after fix |
| ShouldGrantLeaseOnlyAfterConflictRetrySucceeds | PASS | PASS | Happy-path verification |

Failure Output

Expected a message of type LeaseTaken, but received LeaseAcquired

This directly proves the premature LeaseAcquired is sent before the retry write, confirming the split-brain scenario described in #3402.
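One way to make a race like this deterministic is to script the store's write outcomes up front instead of relying on timing. The sketch below illustrates that idea in Python. It is not the PR's test code: `ScriptedStore`, `grant`, and the two check functions are hypothetical, and `grant` assumes that the re-read after a conflict found no owner.

```python
from collections import deque
from typing import Deque, Iterable


class ScriptedStore:
    """Hypothetical test double: each write returns the next scripted outcome."""

    def __init__(self, outcomes: Iterable[bool]) -> None:
        self._outcomes: Deque[bool] = deque(outcomes)
        self.writes = 0

    def try_write(self, node: str) -> bool:
        self.writes += 1
        return self._outcomes.popleft()


def grant(store: ScriptedStore, node: str) -> str:
    """Report a result only after the final write has returned."""
    if store.try_write(node):
        return "LeaseAcquired"
    # First write hit a CAS conflict. This sketch assumes the re-read
    # showed no owner, so a single retry is attempted.
    if store.try_write(node):
        return "LeaseAcquired"
    return "LeaseTaken"


def check_retry_stolen() -> str:
    # Conflict on the first write, then the retry also fails (lease stolen).
    return grant(ScriptedStore([False, False]), "node-a")


def check_retry_succeeds() -> str:
    # Conflict on the first write, then the retry succeeds (happy path).
    return grant(ScriptedStore([False, True]), "node-a")
```

The two scripted sequences mirror the two scenarios in the table: a retry that is stolen should yield LeaseTaken, and a retry that succeeds should yield LeaseAcquired.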

All existing tests pass

  • K8s: 47 passed, 1 skipped, 0 failed
  • Azure: 45 passed, 1 skipped, 0 failed

Fixes #3402

Closes #3403

Tests reproduce the split-brain scenario: when a CAS conflict occurs during
granting and the blob/configmap has no owner, LeaseActor sends LeaseAcquired
BEFORE the retry write completes. If the retry fails (another node takes the
lease), the caller believes it holds the lease but it doesn't.

ShouldNotSendPrematureLeaseAcquiredWhenConflictRetryIsStolen:
  - FAILS: receives LeaseAcquired instead of expected LeaseTaken
  - Proves the bug exists in both Azure and Kubernetes implementations

ShouldGrantLeaseOnlyAfterConflictRetrySucceeds:
  - PASSES: verifies happy-path still works correctly
@Aaronontheweb Aaronontheweb force-pushed the fix/lease-actor-premature-acquire branch from 43d0c9e to 5c73d3f Compare March 9, 2026 18:10
@Aaronontheweb
Member Author

Integrated Gregorius' original fix commits into this branch and extended them to Kubernetes for parity.

Included with attribution via cherry-pick:
- 611112c (Fix premature LeaseAcquired in LeaseActor Granting state)
- a2ed0e2 (Harden test)

Additional commit in this branch:
- Kubernetes equivalent fix for premature LeaseAcquired in the null-owner conflict retry path.

This keeps the repro-first record intact while preserving prior work and author credit.

Development

Successfully merging this pull request may close these issues.

Premature LeaseAcquired in LeaseActor Granting state causes shard-level split brain
