Increase default FlowGCTimeout to 1h to prevent premature GC #2143
kfswain merged 1 commit into kubernetes-sigs:main
Conversation
This commit increases the default Flow Garbage Collection timeout from 5 minutes to 1 hour. This serves as a mitigation for a race condition where requests pending in the queue for longer than the GC timeout (e.g., during scale-from-zero) could result in the underlying flow state being deleted while the request was still active.
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @aishukamal and @kfswain

@LukeAVanDrie: GitHub didn't allow me to request PR reviews from the following users: aishukamal, and. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

/lgtm (I don't think I'm allowed to lgtm PRs yet, but the change looks good to me)
@aishukamal: changing LGTM is restricted to collaborators
/ok-to-test

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, LukeAVanDrie. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
CI/CD seems to be hanging -- /retest

/retest

/cherrypick release-1.3
@kfswain: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.

@kfswain: new pull request created: #2154

What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR increases the default FlowGCTimeout to 1 hour.

Context

Currently, the Flow Registry's garbage collector relies on a leaseCount that tracks the distribution phase but not the queueing phase. If a request sits in the queue waiting for a backend (e.g., waiting for a Pod to spin up) longer than the configured GC timeout (and no new traffic for the flow arrives during this time), the Registry mistakenly identifies the flow as "Idle" and deletes the queue resources, causing the request to be orphaned. This is difficult to trigger under normal load, but it is relevant for scale-from-zero.

The Fix

By increasing the default timeout to 1h, we ensure that the GC timeout is strictly larger than any realistic queueing duration (which will likely hit client timeouts or other limits first). This makes the race condition unreachable in practice without requiring complex architectural changes in the release candidate.

A full architectural fix (switching to optimistic concurrency and lifecycle-aware leasing) is targeted for the next release cycle.
Which issue(s) this PR fixes:
Hack for #1982
Does this PR introduce a user-facing change?: