Inference: Fix EPP Endpoint Sync and Race Condition #12810

danehans · 2025-11-04T01:26:41Z

Description

Fixes an issue where the inference extension did not program the managed Envoy proxy until controller restart due to mismatched endpoint sources between DP and IR translation paths.

Changes:

- Updated processPoolBackendObjIR to use endpoints from IR instead of the pod index.
- Removed pod index threading from plugin and collection initialization.
- Relaxed route pass to not fail on empty endpoint sets (aligns with fail-open/503 behavior).
- Simplified unit tests to seed IR endpoints directly, removing mock KRT index setup.
- Passed errors to DP to ensure proper cluster/routing management.

Fixes: #12265

Change Type

/kind fix

Changelog

Fixes endpoint synchronization for inference extension plugin.

Additional Notes

This is not a backport from main since the Envoy-based data plane is being removed in v2.2.

- Stores endpoints via atomic.Value and adds setEndpoints/getEndpoints to snapshot safely without locks.
- Updates Equals to compare endpoint snapshots without locks, fixing a race condition in krt.Equal/DeepEqual.
- Switches error handling to hasErrors/snapshotErrors/setErrors. The backend path now returns an empty ClusterLoadAssignment when errors exist.
- Updates tests to seed errors via setErrors and avoid direct field access.
- Keeps DP collection returning Backend IR on empty endpoints and relaxes route pass to allow empty endpoint sets.
- Passes errors to DP to ensure cluster management.

Signed-off-by: Daneyon Hansen <[email protected]>
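For readers unfamiliar with the pattern, here is a minimal sketch of lock-free endpoint snapshots using `sync/atomic`. It is not the code from this PR: the `endpoint` and `pool` types, their fields, and the `equals` method are hypothetical stand-ins, and only the `setEndpoints`/`getEndpoints` names are taken from the commit message.

```go
package main

import (
	"slices"
	"sync/atomic"
)

// endpoint is a hypothetical stand-in for the plugin's endpoint type.
type endpoint struct {
	Address string
	Port    uint32
}

// pool is a hypothetical stand-in for the inference pool IR.
type pool struct {
	endpoints atomic.Value // only ever holds a []endpoint
}

// setEndpoints publishes a new snapshot. The slice is copied so that later
// writes to the caller's slice cannot race with concurrent readers.
func (p *pool) setEndpoints(eps []endpoint) {
	p.endpoints.Store(slices.Clone(eps))
}

// getEndpoints returns the current snapshot, or nil if none was stored yet.
// Callers must treat the returned slice as read-only.
func (p *pool) getEndpoints() []endpoint {
	eps, _ := p.endpoints.Load().([]endpoint)
	return eps
}

// equals compares two pools by their endpoint snapshots without taking locks.
func (p *pool) equals(other *pool) bool {
	return slices.Equal(p.getEndpoints(), other.getEndpoints())
}
```

Because each snapshot is replaced wholesale and never mutated in place, an equality check running concurrently with an update sees either the old or the new slice, never a partially written one.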

lgadban

LGTM

lgadban · 2025-11-14T16:46:03Z

internal/kgateway/extensions2/plugins/inferenceextension/endpointpicker/backends.go

 ) *ir.EndpointsForBackend {
-	// Build an endpoint list
 	irPool := in.ObjIr.(*inferencePool)
-	poolEps := irPool.resolvePoolEndpoints(podIdx)


we are no longer calling resolvePoolEndpoints in this function, so we are just relying on the endpoints to be computed elsewhere in the plugin?

danehans · Nov 17, 2025

Correct, endpoint resolution now happens only once, during ApplyForBackend(). If the pool has no endpoints at that point, translation does not fail. Instead, it keeps the route valid and provides an empty subset hint so the EPP returns a 5xx.
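The behavior described in this reply can be sketched roughly as follows. This is an illustration only, not the plugin's code: `lbEndpoint`, `loadAssignment`, and `buildLoadAssignment` are hypothetical, simplified stand-ins and do not reflect the real Envoy or kgateway types.

```go
package main

// lbEndpoint and loadAssignment are hypothetical, simplified stand-ins for
// Envoy's ClusterLoadAssignment types; they are not the go-control-plane API.
type lbEndpoint struct {
	Address string
	Port    uint32
}

type loadAssignment struct {
	ClusterName string
	Endpoints   []lbEndpoint
}

// buildLoadAssignment never fails on an empty endpoint set. When errors were
// recorded for the pool, or when there are no endpoints, it returns an
// assignment with zero endpoints so the cluster and route stay programmed.
func buildLoadAssignment(cluster string, eps []lbEndpoint, errs []error) loadAssignment {
	cla := loadAssignment{ClusterName: cluster}
	if len(errs) > 0 {
		// Errors exist: hand the data plane an empty assignment.
		return cla
	}
	cla.Endpoints = append(cla.Endpoints, eps...)
	return cla
}
```

The design choice shown here is that "no endpoints" is treated as a valid state rather than a translation error, so requests fail at request time with a 5xx instead of the route disappearing from the proxy configuration.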

github-actions bot added kind/fix Categorizes issue or PR as related to a bug. release-note labels Nov 4, 2025

danehans force-pushed the issue_12265 branch 3 times, most recently from 48cc1ab to fc451d6 Compare November 4, 2025 19:35

danehans changed the title ~~Inference: Fix EPP Endpoint Sync~~ Inference: Fix EPP Endpoint Sync and Race Condition Nov 4, 2025

danehans mentioned this pull request Nov 4, 2025

agentgateway: test llm-d support #12451

Open

danehans force-pushed the issue_12265 branch from fc451d6 to 591a07f Compare November 6, 2025 20:05

danehans force-pushed the issue_12265 branch from 591a07f to a1c089c Compare November 6, 2025 20:15

lgadban approved these changes Nov 14, 2025

danehans added this pull request to the merge queue Nov 17, 2025

Merged via the queue into kgateway-dev:v2.1.x with commit 531027d Nov 17, 2025
41 of 44 checks passed

danehans deleted the issue_12265 branch November 17, 2025 23:22
