registry: switch to fine-grained leasing for flow lifecycle #2127
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
/cc @evacchi
@LukeAVanDrie: GitHub didn't allow me to request PR reviews from the following users: evacchi. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.
This commit refactors the FlowController's request lifecycle management to hold the Flow Registry lease (`WithConnection`) for the entire duration of the request, including the queueing phase.

Previously, the lease was only held during the instantaneous distribution phase. If a flow had requests waiting in a queue (e.g., during scale-from-zero) but no new incoming traffic, the registry would incorrectly identify the flow as Idle and garbage collect it, orphaning the queued requests.

Changes:
- Hoisted `WithConnection` in `EnqueueAndWait` to wrap the retry loop and `awaitFinalization`.
- Updated the `ActiveFlowConnection` interface to expose `FlowKey()`, preventing data clumps in internal signatures.
- Refactored `selectDistributionCandidates` to use the active connection instead of re-acquiring it.
- Added a regression test (`Regression_LeaseHeldDuringQueueing`) ensuring the lease remains valid while the processor blocks.

This change relies on the optimistic concurrency model introduced in PR kubernetes-sigs#2127 to ensure that holding long-lived leases does not block the Garbage Collector or cause writer starvation.
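The scoping change described above can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: `Registry`, `Controller`, and the stub bodies are hypothetical stand-ins. Only the shape mirrors the commit: `WithConnection` now wraps the whole request lifecycle (queueing included) rather than just the distribution step.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real registry/controller types.
type FlowKey string

type Registry struct{ leases map[FlowKey]int }

// WithConnection holds a lease on the flow for the duration of fn,
// so the GC cannot collect the flow while fn is still running.
func (r *Registry) WithConnection(k FlowKey, fn func() error) error {
	r.leases[k]++
	defer func() { r.leases[k]-- }()
	return fn()
}

type Controller struct{ registry *Registry }

// EnqueueAndWait illustrates the hoisted scope: before the change only
// the distribution step ran under the lease; now the retry loop and
// finalization sit inside the closure too.
func (c *Controller) EnqueueAndWait(k FlowKey) error {
	return c.registry.WithConnection(k, func() error {
		if c.registry.leases[k] == 0 {
			return errors.New("lease dropped during queueing")
		}
		return nil // distribution retry loop + awaitFinalization would run here
	})
}

func main() {
	c := &Controller{registry: &Registry{leases: map[FlowKey]int{}}}
	fmt.Println(c.EnqueueAndWait("flow-a")) // <nil>: lease held throughout
	fmt.Println(c.registry.leases["flow-a"]) // 0: released only after the closure returns
}
```

The key design point is that the lease's lifetime is now tied to the closure, so a request blocked in a queue keeps the flow alive without any extra bookkeeping.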
/ok-to-test

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie
/hold
Want to address @evacchi's feedback and double-check one more condition.
Replaces the `flowState.gcLock` mutex with an atomic reference counting
mechanism ("leasing") and optimistic concurrency loop.
This change decouples flow lifecycle management from request processing,
solving two critical issues:
1. Deadlock Potential: Removes the lock hierarchy between Registry,
Shards, and Flows, eliminating the risk of deadlocks during complex
GC or scaling operations.
2. Prerequisite for fixing Request Starvation (Issue kubernetes-sigs#1982):
   Allows long-running queued requests to maintain a "lease" on a flow
   without holding a blocking mutex, preventing the GC from orphaning
   active queues.
The `WithConnection` hot path is now lock-free, using a `sync.Map` and
atomic CAS loop to pin flow state objects safely.
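A minimal sketch of the lock-free pinning described above, assuming a hypothetical `flowState` with an atomic lease counter (the real registry types differ): a sentinel value marks a retired object so that late acquirers retry against a fresh one, and the GC retires a state only via CAS, so it can never race a concurrent acquire.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flowState is an illustrative reference-counted flow entry; the names
// (flowState, Acquire, Release, tryRetire) are assumptions, not the
// actual identifiers in the registry package.
type flowState struct {
	// leases counts active connections; -1 marks the state as retired.
	leases atomic.Int64
}

// Acquire takes a lease via a CAS loop. It returns false if the GC has
// already retired this state, forcing the caller to re-fetch from the map.
func (f *flowState) Acquire() bool {
	for {
		n := f.leases.Load()
		if n < 0 {
			return false // retired; caller must retry with a fresh object
		}
		if f.leases.CompareAndSwap(n, n+1) {
			return true
		}
	}
}

// Release drops one lease.
func (f *flowState) Release() { f.leases.Add(-1) }

// tryRetire is what a GC sweep might do: retire the state only if no
// leases are held, using CAS so a concurrent Acquire wins the race.
func (f *flowState) tryRetire() bool {
	return f.leases.CompareAndSwap(0, -1)
}

func main() {
	var flows sync.Map // FlowKey -> *flowState
	flows.Store("flow-a", &flowState{})

	v, _ := flows.Load("flow-a")
	fs := v.(*flowState)
	fmt.Println(fs.Acquire())   // true: lease taken
	fmt.Println(fs.tryRetire()) // false: a lease is held, GC backs off
	fs.Release()
	fmt.Println(fs.tryRetire()) // true: idle, safe to retire
	fmt.Println(fs.Acquire())   // false: retired; caller retries
}
```

Because both sides use compare-and-swap on the same counter, exactly one of "new lease" and "retire" can win, which is what makes the hot path safe without a mutex.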
Updates:
- `pkg/epp/flowcontrol/registry`: Switch to optimistic model.
- `pkg/epp/flowcontrol/registry`: Update tests to verify atomic behavior.
Switches `flowState` internal management from independent atomics to a `sync.Mutex`. This ensures state transitions (lease updates and timestamp clearing) are atomic, simplifying the logic and preventing inconsistent states.

Introduces a `markedForDeletion` flag to fix the stalled GC race, where a request could acquire a flow that the GC had already decided to delete.

Hardens `cleanupFlowResources` by using a write lock and re-checking the map. This prevents the physical deletion of queues for flows that were resurrected (JIT provisioned) during the cleanup phase.
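A sketch of the mutex-guarded variant described above (field and method names are assumptions, not the actual registry code): because the lease count and the `markedForDeletion` flag are read and written under one mutex, the GC's mark and a request's acquire can never interleave inconsistently.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative flowState with mutex-guarded transitions.
type flowState struct {
	mu                sync.Mutex
	leases            int
	markedForDeletion bool
}

// Acquire takes a lease unless the GC has already marked this object.
// A marked object may still be physically present in the map, so the
// caller backs off rather than resurrecting it.
func (f *flowState) Acquire() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.markedForDeletion {
		return false
	}
	f.leases++
	return true
}

// Release drops one lease.
func (f *flowState) Release() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.leases--
}

// markIfIdle is the GC side: the lease check and the mark happen under
// the same lock, so the transition is atomic.
func (f *flowState) markIfIdle() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.leases > 0 {
		return false
	}
	f.markedForDeletion = true
	return true
}

func main() {
	f := &flowState{}
	fmt.Println(f.Acquire())    // true: lease taken
	fmt.Println(f.markIfIdle()) // false: GC backs off while a lease is held
	f.Release()
	fmt.Println(f.markIfIdle()) // true: idle, GC may proceed
	fmt.Println(f.Acquire())    // false: marked, new requests must retry
}
```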
Force-pushed from 3fa82e3 to 1814907.
Adds a targeted test case `ShouldBackOff_WhenFlowIsMarkedForDeletion_ButStillInMap` to verify the robustness of the flow acquisition retry loop. This ensures that if a request encounters a flow object that is logically deleted (marked) but physically present in the map, it correctly backs off and waits for a fresh object rather than resurrecting the dead one.
aishukamal left a comment:
Thanks for fixing the race! The changes look good to me, but I just want to point out that with this fix we are going back to pessimistic concurrency control (use of locks), so I suggest updating the PR title to reflect that.
/remove-hold
/lgtm |
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it:
This PR refactors the `FlowRegistry` concurrency model, shifting from a rigid, sharded locking hierarchy to a decoupled, fine-grained leasing model for flow lifecycle management.

Context

The previous architecture relied on a brittle lock hierarchy (`Registry` -> `Shard` -> `Flow`). This coupling made it difficult to safely implement Garbage Collection for parent resources (like Priority Bands) without risking deadlocks. Additionally, the previous GC logic relied on a mutex held during the entire distribution phase. This made it impossible to hold a reference during the (potentially long) queueing phase without blocking the GC completely.

Changes

- A concurrent map (`sync.Map`) paired with fine-grained per-flow locking (`sync.Mutex`). This isolates contention to individual flows, allowing the hot path (`WithConnection`) to proceed without blocking on shard-level or global locks.
- Deletion markers (`markedForDeletion`) and optimistic retry loops. This allows the system to safely detect and recover from race conditions (such as a flow being deleted by GC while a request is attempting to acquire it) without requiring "stop-the-world" pauses.

Note: This is a surgical fix intended to stabilize the flow control layer and unblock future work (like Band GC). While we are currently evaluating the performance profile of the sharded architecture, this change improves maintainability and correctness in the immediate term without requiring a full structural rewrite.
Which issue(s) this PR fixes:
Does this PR introduce a user-facing change?: