Increase default FlowGCTimeout to 1h to prevent premature GC#2143

Merged
kfswain merged 1 commit into kubernetes-sigs:main from LukeAVanDrie:fix/1982 on Jan 14, 2026

Conversation

@LukeAVanDrie
Contributor

@LukeAVanDrie LukeAVanDrie commented Jan 13, 2026

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR increases the default FlowGCTimeout to 1 hour.

Context

Currently, the Flow Registry's garbage collector relies on a leaseCount that tracks the distribution phase but not the queueing phase. If a request sits in the queue waiting for a backend (e.g., waiting for a Pod to spin up) longer than the configured GC timeout (and no new traffic for the flow arrives during this time), the Registry mistakenly identifies the flow as "Idle" and deletes the queue resources, causing the request to be orphaned. This is difficult to trigger under normal load, but it is relevant for Scale from Zero.

The Fix

By increasing the default timeout to 1h, we ensure that the GC timeout is strictly larger than any realistic queueing duration (which will likely hit client timeouts or other limits first). This makes the race condition unreachable in practice without requiring complex architectural changes in the release candidate.

A full architectural fix (switching to optimistic concurrency and lifecycle-aware leasing) is targeted for the next release cycle.

Which issue(s) this PR fixes:
Hack for #1982

Does this PR introduce a user-facing change?:

Increased the default Flow Control garbage collection timeout to 1 hour. This prevents the accidental deletion of active flows during long queueing periods, improving stability during scale-from-zero.

This commit increases the default Flow Garbage Collection timeout from
5 minutes to 1 hour.

This serves as a mitigation for a race condition where requests pending
in the queue for longer than the GC timeout (e.g., during
scale-from-zero) could result in the underlying flow state being deleted
while the request was still active.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 13, 2026
@netlify

netlify bot commented Jan 13, 2026

👷 Deploy Preview for gateway-api-inference-extension processing.

Name Link
🔨 Latest commit 2c03ee7
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6966ae6c875eb200084300d2

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 13, 2026
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 13, 2026
@LukeAVanDrie LukeAVanDrie changed the title epp: increase default FlowGCTimeout to 1h to prevent premature GC increase default FlowGCTimeout to 1h to prevent premature GC Jan 13, 2026
@LukeAVanDrie LukeAVanDrie changed the title increase default FlowGCTimeout to 1h to prevent premature GC Increase default FlowGCTimeout to 1h to prevent premature GC Jan 13, 2026
@LukeAVanDrie
Contributor Author

/cc @aishukamal and @kfswain

@k8s-ci-robot
Contributor

@LukeAVanDrie: GitHub didn't allow me to request PR reviews from the following users: aishukamal, and.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @aishukamal and @kfswain

@LukeAVanDrie
Contributor Author

Actual fix for this is #2127 and #2131. These just seem risky to cherry-pick.

@aishukamal
Contributor

/lgtm

(I don't think I'm allowed to lgtm PRs yet, but the change looks good to me)

@k8s-ci-robot
Contributor

@aishukamal: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm

(I don't think I'm allowed to lgtm PRs yet, but the change looks good to me)

@ahg-g
Contributor

ahg-g commented Jan 14, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2026
@ahg-g
Contributor

ahg-g commented Jan 14, 2026

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 14, 2026
@LukeAVanDrie
Contributor Author

CI/CD seems to be hanging -- /retest

@kfswain
Collaborator

kfswain commented Jan 14, 2026

/restest

@kfswain
Collaborator

kfswain commented Jan 14, 2026

/cherrypick release-1.3

@k8s-infra-cherrypick-robot

@kfswain: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release-1.3

@kfswain
Collaborator

kfswain commented Jan 14, 2026

I'm going to force merge this since this netlify job is hanging and this PR is not related to our docs:
image

@kfswain kfswain merged commit 6f54218 into kubernetes-sigs:main Jan 14, 2026
12 of 17 checks passed
@k8s-infra-cherrypick-robot

@kfswain: new pull request created: #2154

Details

In response to this:

/cherrypick release-1.3

RyanRosario pushed a commit to RyanRosario/gateway-api-inference-extension that referenced this pull request Jan 20, 2026
This commit increases the default Flow Garbage Collection timeout from
5 minutes to 1 hour.

This serves as a mitigation for a race condition where requests pending
in the queue for longer than the GC timeout (e.g., during
scale-from-zero) could result in the underlying flow state being deleted
while the request was still active.
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

6 participants