
@jmcguire98 jmcguire98 commented Oct 29, 2025

Description

We want to make it easier to surface to end users the cases where configuration they provided was translated into config that the target gateway rejected (NACKed). In the past, this most often happened when a user created a policy with invalid CEL that the gateway NACKed (we have since added CEL validation before translation).

To achieve this, I added a NACK publisher that publishes Kubernetes events whenever a NACK occurs.
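A minimal sketch of the idea, assuming hypothetical types: `NackEvent`, `Recorder`, `memRecorder`, `NackPublisher`, and the `reasonNack` value below are illustrative stand-ins, not kgateway's or client-go's actual types. Each NACK becomes a Warning event on the affected gateway, created in the gateway's namespace, with the dataplane's error text as the message.

```go
package main

import "fmt"

// NackEvent is a hypothetical summary of one NACK reported on a gateway's
// xDS stream; the fields are illustrative, not kgateway's real type.
type NackEvent struct {
	Gateway   string // name of the gateway that rejected the update
	Namespace string // gateway namespace; the event must be created here
	ErrorMsg  string // error text returned by the dataplane
}

// Event is a stripped-down stand-in for a Kubernetes core/v1 Event.
type Event struct {
	Namespace, InvolvedName, Type, Reason, Message string
}

// Recorder is a hypothetical, simplified event recorder interface.
type Recorder interface {
	Event(namespace, name, eventType, reason, message string)
}

// memRecorder keeps events in memory so the sketch runs without a cluster.
type memRecorder struct{ events []Event }

func (r *memRecorder) Event(namespace, name, eventType, reason, message string) {
	r.events = append(r.events, Event{namespace, name, eventType, reason, message})
}

// reasonNack is a hypothetical reason string for NACK events.
const reasonNack = "Nack"

// NackPublisher turns each NACK into a Warning event on the affected gateway.
type NackPublisher struct{ rec Recorder }

func (p *NackPublisher) OnNack(e NackEvent) {
	// The dataplane's message is passed through verbatim.
	p.rec.Event(e.Namespace, e.Gateway, "Warning", reasonNack, e.ErrorMsg)
}

func main() {
	rec := &memRecorder{}
	pub := &NackPublisher{rec: rec}
	pub.OnNack(NackEvent{Gateway: "agw", Namespace: "team-a", ErrorMsg: "invalid CEL expression"})
	fmt.Printf("%+v\n", rec.events[0])
}
```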

Change Type

/kind new_feature

Changelog

Added event reporting for agentgateway gateways that indicates when a gateway has NACKed an update.

Additional Notes

fixes #12671

Signed-off-by: Joe McGuire <[email protected]>

…an limit the perf damage we're doing here

@github-actions github-actions bot added kind/feature Categorizes issue or PR as related to a new feature. release-note labels Oct 29, 2025
@npolshakova npolshakova self-requested a review October 29, 2025 19:01
@jmcguire98 jmcguire98 force-pushed the report-agw-nacks-on-gateways branch from d1de508 to d93746e Compare October 29, 2025 21:04
@jmcguire98 jmcguire98 marked this pull request as ready for review October 31, 2025 16:15
@jmcguire98 jmcguire98 changed the title agentgateway: report nacks via gateway status agentgateway: report nacks via events Nov 6, 2025
@jmcguire98 jmcguire98 requested a review from howardjohn November 6, 2025 23:00
```go
	corev1.EventSource{Component: wellknown.DefaultAgwControllerName},
)
eventBroadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
	Interface: client.Kube().CoreV1().Events(""),
```
Contributor

@npolshakova npolshakova Nov 7, 2025

Do we want the events to be cluster-wide (vs. only in the gateway's namespace)?

Contributor Author

When we initialize the event broadcaster, we have to give it permission to publish events cluster-wide, but when we actually publish an event, it must be created in the same namespace as its involvedObject (a Kubernetes API server requirement).
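The split described above can be modeled in a few lines. This is a stdlib-only sketch of client-go behavior as we understand it, not client-go itself: the recorder places each event in its involved object's namespace (falling back to `default` when that namespace is empty), and a sink built on `Events("")` can write to any namespace, whereas an Events client scoped to one namespace rejects events for other namespaces. `ObjectRef`, `eventNamespace`, and `sinkCreate` are hypothetical names.

```go
package main

import "fmt"

// ObjectRef is a minimal, hypothetical stand-in for corev1.ObjectReference.
type ObjectRef struct{ Kind, Namespace, Name string }

// eventNamespace models how the recorder chooses an event's namespace:
// the involved object's namespace, or "default" when it has none.
func eventNamespace(ref ObjectRef) string {
	if ref.Namespace == "" {
		return "default"
	}
	return ref.Namespace
}

// sinkCreate models the check made when an event is written through an
// Events client: a client scoped to one namespace rejects events for another,
// while a client built with Events("") accepts any namespace.
func sinkCreate(clientNS, eventNS string) error {
	if clientNS != "" && clientNS != eventNS {
		return fmt.Errorf("can't create an event with namespace %q in namespace %q", eventNS, clientNS)
	}
	return nil
}

func main() {
	gw := ObjectRef{Kind: "Gateway", Namespace: "team-a", Name: "agw"}
	ns := eventNamespace(gw)
	fmt.Println("event namespace:", ns)
	fmt.Println("all-namespace client:", sinkCreate("", ns))
	fmt.Println("client scoped to kgateway-system:", sinkCreate("kgateway-system", ns))
}
```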

```go
	UID: deployUID,
}

// Pass ErrorMsg as an argument so any '%' in it is not parsed as a format verb.
p.eventRecorder.Eventf(gatewayRef, corev1.EventTypeWarning, ReasonNack, "%s", event.ErrorMsg)
```
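One detail worth checking in this call: client-go's `EventRecorder.Eventf` builds its message with `fmt.Sprintf(messageFmt, args...)`, so if `event.ErrorMsg` is passed as `messageFmt`, any `%` in the dataplane's error text is parsed as a formatting verb. CEL names its modulo operator `_%_`, so an error quoting it would render as `_%!_(MISSING)`. The toy `recorder` below is a hypothetical stand-in, not client-go; it shows that `Event(msg)` and `Eventf("%s", msg)` both keep the text intact.

```go
package main

import "fmt"

// recorder is a toy stand-in for client-go's record.EventRecorder that only
// collects message strings, so the sketch runs without a cluster.
type recorder struct{ messages []string }

// Event stores the message verbatim, with no formatting step.
func (r *recorder) Event(message string) {
	r.messages = append(r.messages, message)
}

// Eventf formats first, the way client-go's Eventf calls
// fmt.Sprintf(messageFmt, args...) before recording.
func (r *recorder) Eventf(messageFmt string, args ...interface{}) {
	r.Event(fmt.Sprintf(messageFmt, args...))
}

func main() {
	// Illustrative CEL-style error; CEL names its modulo operator '_%_'.
	errorMsg := "found no matching overload for '_%_' applied to '(string, int)'"
	r := &recorder{}
	r.Event(errorMsg)        // safe: no format string involved
	r.Eventf("%s", errorMsg) // safe: the message is an argument, not the format
	fmt.Println(r.messages[0] == errorMsg, r.messages[1] == errorMsg)
}
```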
Contributor

Does the ErrorMsg include the type URL of the resource?

Contributor Author

@jmcguire98 jmcguire98 Nov 7, 2025

The ErrorMsg is pretty verbose:
[Screenshot 2025-11-06 at 2 45 38 PM]
Assuming agentgateway tells us which resource name it NACKed, it should be easy to track down.

@jmcguire98
Contributor Author

/merge

@gateway-bot gateway-bot added this pull request to the merge queue Nov 7, 2025
Merged via the queue into kgateway-dev:main with commit a1c9588 Nov 7, 2025
28 checks passed

```go
if nackHandler != nil {
	gateway := kgwxds.AgentgatewayID(con.node)
	// Collect resource names from the request only (subscribe/unsubscribe/initial versions)
```
Contributor

On error, agentgateway won't send any subscribe/unsubscribe/initial versions in the request, so this should always be empty, AFAIK.
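Following the reviewer's point, NACK detection can key off fields that are present on an error request instead of the resource-name fields. A minimal sketch, assuming a trimmed-down mirror of Envoy's delta xDS `DeltaDiscoveryRequest` (the real type comes from go-control-plane; `deltaRequest`, `errorDetail`, `nack`, and `detectNack` are hypothetical): a request counts as a NACK when it carries an error detail and the nonce of the response it rejects, and the type URL and message are read from those fields.

```go
package main

import "fmt"

// deltaRequest holds the few DeltaDiscoveryRequest fields this sketch needs;
// it is a hypothetical mirror, not the go-control-plane type.
type deltaRequest struct {
	TypeURL                  string
	ResponseNonce            string
	ResourceNamesSubscribe   []string
	ResourceNamesUnsubscribe []string
	InitialResourceVersions  map[string]string
	ErrorDetail              *errorDetail // set when the client rejected a response
}

// errorDetail stands in for google.rpc.Status.
type errorDetail struct {
	Code    int32
	Message string
}

// nack is what a hypothetical handler would report for a rejected response.
type nack struct{ TypeURL, Nonce, Message string }

// detectNack reports a NACK when req carries an error detail for the response
// named by its nonce. It ignores the subscribe/unsubscribe/initial-version
// fields, which the review comment notes are empty on an error request.
func detectNack(req deltaRequest) (nack, bool) {
	if req.ErrorDetail == nil || req.ResponseNonce == "" {
		return nack{}, false
	}
	return nack{TypeURL: req.TypeURL, Nonce: req.ResponseNonce, Message: req.ErrorDetail.Message}, true
}

func main() {
	req := deltaRequest{
		TypeURL:       "type.googleapis.com/example.Resource",
		ResponseNonce: "42",
		ErrorDetail:   &errorDetail{Message: "invalid CEL expression"},
	}
	n, ok := detectNack(req)
	fmt.Println(ok, n.TypeURL, n.Message)
}
```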

Contributor Author

Yeah, that's true.

I think you may have an old version loaded; I ended up getting rid of all this logic.


Development

Successfully merging this pull request may close these issues.

agentgateway: report statuses on gateways if the dataplane NACKs an update

4 participants