Skip to content

Conversation

@danehans
Copy link
Contributor

@danehans danehans commented Nov 4, 2025

Description

Fixes an issue where the inference extension did not program the managed Envoy proxy until controller restart due to mismatched endpoint sources between DP and IR translation paths.

Changes:

  • Updated processPoolBackendObjIR to use endpoints from IR instead of the pod index.
  • Removed pod index threading from plugin and collection initialization.
  • Relaxed route pass to not fail on empty endpoint sets (aligns with fail-open/503 behavior).
  • Simplified unit tests to seed IR endpoints directly, removing mock KRT index setup.
  • Passes errors to DP to ensure proper cluster/routing management.

Fixes: #12265

Change Type

/kind fix

Changelog

Fixes endpoint synchronization for inference extension plugin.

Additional Notes

This is not a backport from main since the Envoy-based data plane is being removed in v2.2.

@github-actions github-actions bot added kind/fix Categorizes issue or PR as related to a bug. release-note labels Nov 4, 2025
@danehans danehans force-pushed the issue_12265 branch 3 times, most recently from 48cc1ab to fc451d6 Compare November 4, 2025 19:35
@danehans danehans changed the title Inference: Fix EPP Endpoint Sync Inference: Fix EPP Endpoint Sync and Race Condition Nov 4, 2025
- Stores endpoints via atomic.Value and adds setEndpoints/getEndpoints to snapshot
  safely without locks.
- Updates Equals to compare endpoint snapshots without locks, fixing race condition
  in krt.Equal/DeepEqual.
- Switches error handling to hasErrors/snapshotErrors/setErrors. The backend path now
  returns empty ClusterLoadAssinment when errors exist.
- Updates tests to seed errors via setErrors and avoid direct field access.
- Keeps DP collection returning Backend IR on empty endpoints and relaxes route pass
  to allow empty endpoint sets.
- Passes errors to DP to ensure cluster management.

Signed-off-by: Daneyon Hansen <[email protected]>
Copy link
Contributor

@lgadban lgadban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

) *ir.EndpointsForBackend {
// Build an endpoint list
irPool := in.ObjIr.(*inferencePool)
poolEps := irPool.resolvePoolEndpoints(podIdx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are no longer calling resolvePoolEndpoints in this function, so we are just relying on the endpoints to be computed elsewhere in the plugin?

Copy link
Contributor Author

@danehans danehans Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, the eps resolution now happens only once during ApplyForBackend(). If the pool has no endpoints during ApplyForBackend(), translation does not fail. Instead, it keeps the route valid and provides an empty subset hint so the EPP returns a 5xx.

@danehans danehans added this pull request to the merge queue Nov 17, 2025
Merged via the queue into kgateway-dev:v2.1.x with commit 531027d Nov 17, 2025
41 of 44 checks passed
@danehans danehans deleted the issue_12265 branch November 17, 2025 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/fix Categorizes issue or PR as related to a bug. release-note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants