Add tracing entry span with W3C propagation to EPP handler #2057

Merged: k8s-ci-robot merged 1 commit into kubernetes-sigs:main from sallyom:tracing-spans on Jan 16, 2026
@sallyom sallyom commented Jan 5, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Add tracing entry span with W3C propagation to EPP handler
See #1520

Does this PR introduce a user-facing change?:

The EPP request handler now includes a distributed tracing entry span. When tracing is enabled via the existing --tracing flag, trace spans are created and W3C trace context is propagated to downstream services, enabling end-to-end request tracing. Tracing remains opt-in, and no breaking changes are introduced.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2026
netlify bot commented Jan 5, 2026

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit deba8b2
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695d4f70b8751f00089314f9
😎 Deploy Preview https://deploy-preview-2057--gateway-api-inference-extension.netlify.app
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 5, 2026
@sallyom sallyom marked this pull request as draft January 5, 2026 19:18
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
@sallyom sallyom force-pushed the tracing-spans branch 2 times, most recently from 3843677 to ee6df62 Compare January 5, 2026 19:37
@sallyom sallyom marked this pull request as ready for review January 6, 2026 17:44
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott January 6, 2026 17:44
Signed-off-by: sallyom <somalley@redhat.com>

sallyom commented Jan 7, 2026

For example, in llm-d, with the GAIE entry span and propagation, a trace looks like:
[image: gaie-entry-span-overview]
Drilling down shows the GAIE plugins and the vLLM end-to-end trace (with other llm-d components instrumented):
[image: gaie-entry-trace]

Without this PR (tracing enabled in GAIE, but no entry span or propagation), spans in individual components aren't connected:
[image: gaie-no-entry-span]

Comment on lines +137 to +149
// Inject trace context headers for propagation to downstream services
traceHeaders := make(map[string]string)
propagator := otel.GetTextMapPropagator()
propagator.Inject(ctx, propagation.MapCarrier(traceHeaders))
for key, value := range traceHeaders {
	headers = append(headers, &configPb.HeaderValueOption{
		Header: &configPb.HeaderValue{
			Key:      key,
			RawValue: []byte(value),
		},
	})
}

Contributor

I think this should only be done if the user requested tracing. I think we need to add either a command line argument to enable tracing or to add something in the EPP Configuration.

Member

you shouldn't need to manually propagate context like this, as long as the go context.Context is correctly passed around then the otel sdk will handle propagation for you

Contributor Author

thanks, @damemi! I wasn't sure about this, I will remove this and retest to be sure. TY again!

Contributor Author

I'll remove the manual propagation, then will verify with llm-d:

  1. Does vllm:llm_request span show up as a child of gateway.request?
  2. Does the trace ID remain consistent end-to-end?
  3. If there's an upstream traceparent, is it continued correctly?
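
The checks above can be approximated offline by comparing `traceparent` header values captured at each hop. A minimal sketch, assuming the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`); `parseTraceparent` and `sameTrace` are hypothetical helpers written for illustration and are not part of GAIE or the OpenTelemetry SDK:

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceparent holds the four fields of a W3C traceparent header value.
type traceparent struct {
	Version  string // 2 hex characters, "00" for the current spec version
	TraceID  string // 32 lowercase hex characters, shared by the whole trace
	ParentID string // 16 lowercase hex characters, the caller's span ID
	Flags    string // 2 hex characters, "01" means sampled
}

// parseTraceparent splits and validates a value such as
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
// This sketch only handles the four-field version 00 layout.
func parseTraceparent(v string) (traceparent, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 {
		return traceparent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] || p != strings.ToLower(p) {
			return traceparent{}, errors.New("traceparent: malformed field")
		}
		if _, err := hex.DecodeString(p); err != nil {
			return traceparent{}, errors.New("traceparent: field is not hex")
		}
	}
	// All-zero trace IDs and parent IDs are invalid.
	if strings.Trim(parts[1], "0") == "" || strings.Trim(parts[2], "0") == "" {
		return traceparent{}, errors.New("traceparent: all-zero ID")
	}
	return traceparent{Version: parts[0], TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}, nil
}

// sameTrace reports whether two header values belong to the same trace,
// i.e. whether the trace ID stayed consistent between two hops.
func sameTrace(upstream, downstream string) (bool, error) {
	up, err := parseTraceparent(upstream)
	if err != nil {
		return false, err
	}
	down, err := parseTraceparent(downstream)
	if err != nil {
		return false, err
	}
	return up.TraceID == down.TraceID, nil
}
```

In a continued trace, the trace ID is identical at every hop while the parent ID changes to the span ID of whichever component forwarded the request.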

Contributor

The entry point of request handling is: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/handlers/server.go#L128C49-L128C80

There, the Go context is wrapped inside srv, an extProcPb.ExternalProcessor_ProcessServer. Does OTel need the context to be passed explicitly in the function signature?

ref - https://pkg.go.dev/google.golang.org/grpc#ServerStream

Contributor Author

I did some testing with the context propagation. Because GAIE runs as an Envoy External Processor, it doesn't make HTTP requests directly, so it seems we need to manually propagate the trace headers. I confirmed that without manual propagation, trace context doesn't reach downstream services: we see separate, unconnected spans for the gateway-api-inference-extension and vllm services, rather than the vllm span appearing as a child that carries the propagated context headers. I'll leave the manual propagation in.
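
To make the mechanism concrete: an external processor does not forward the request itself; it hands header mutations back to Envoy, which applies them before sending the request on. A minimal sketch of turning a trace-context carrier into such mutations. The types `headerValue` and `headerValueOption` are hypothetical stand-ins for the Envoy protobuf types used in the PR, and the carrier here is an ordinary map that would, in the real handler, be filled by `propagator.Inject`:

```go
package main

import "sort"

// headerValue and headerValueOption are simplified stand-ins for the
// Envoy HeaderValue and HeaderValueOption messages (assumption: only the
// two fields used in the PR snippet are modeled).
type headerValue struct {
	Key      string
	RawValue []byte
}

type headerValueOption struct {
	Header *headerValue
}

// toHeaderOptions converts carrier entries (for example "traceparent" and
// "tracestate") into header options. Keys are sorted so the output order
// is deterministic, since Go map iteration order is not.
func toHeaderOptions(carrier map[string]string) []*headerValueOption {
	keys := make([]string, 0, len(carrier))
	for k := range carrier {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	opts := make([]*headerValueOption, 0, len(keys))
	for _, k := range keys {
		opts = append(opts, &headerValueOption{
			Header: &headerValue{Key: k, RawValue: []byte(carrier[k])},
		})
	}
	return opts
}
```

Sorting is a choice made for this sketch only; the loop in the PR appends headers in map iteration order, which is fine when order does not matter to the receiver.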

Member

@sallyom ah that's interesting, I didn't think about how this was working with envoy so there could be some work you need to do there. Not something I've worked with before but testing tells the truth

Comment on lines +131 to +136

// Start tracing span for the request
tracer := otel.Tracer("gateway-api-inference-extension")
ctx, span := tracer.Start(ctx, "gateway.request", trace.WithSpanKind(trace.SpanKindServer))
defer span.End()

Contributor

I think this should only be done if the user requested tracing. I think we need to add either a command line argument to enable tracing or to add something in the EPP Configuration.

Member

these calls are a zero-overhead no-op unless a TracerProvider is configured. So, all you should need to gate on the user enabling is the creation of the TracerProvider itself.

For reference, this is the same way that Kubernetes components implement tracing. They actually set up a no-op tracerprovider, but having no TracerProvider configured should be effectively the same.

Either way, it's not about feature gating the tracer.Start() calls, it's about the tracerprovider
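
The gating pattern described above can be illustrated without the OpenTelemetry SDK. A simplified sketch, assuming stand-in types: `tracerProvider`, `noopProvider`, `recordingProvider`, `initTracing`, and `handleRequest` are all hypothetical and only mimic the shape of the idea, not the OTel API. Call sites always start spans, and only the provider installed at startup decides whether anything is recorded:

```go
package main

// span is the minimal surface a call site needs.
type span interface{ End() }

// tracerProvider hands out spans. Call sites depend only on this interface.
type tracerProvider interface {
	Start(name string) span
}

// noopProvider is the default: starting and ending a span does nothing.
type noopSpan struct{}

func (noopSpan) End() {}

type noopProvider struct{}

func (noopProvider) Start(string) span { return noopSpan{} }

// recordingProvider remembers the names of spans that have ended.
type recordingProvider struct{ ended []string }

type recordedSpan struct {
	name string
	p    *recordingProvider
}

func (s recordedSpan) End() { s.p.ended = append(s.p.ended, s.name) }

func (p *recordingProvider) Start(name string) span {
	return recordedSpan{name: name, p: p}
}

// provider is the process-wide default, a no-op until tracing is enabled.
var provider tracerProvider = noopProvider{}

// initTracing is the only place gated on user configuration.
func initTracing(p tracerProvider) { provider = p }

// handleRequest is never feature-gated: it always starts a span.
func handleRequest() {
	s := provider.Start("gateway.request")
	defer s.End()
}
```

With this shape, disabling tracing means never calling `initTracing`, and the request path stays unchanged.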

Contributor Author

thanks, @damemi! I'll leave it as-is, but I'm still open to other opinions

Contributor

Currently, trace initialization is only invoked if tracing is enabled.

If InitTracing is not invoked, a default no-op provider will be used (correct me if I'm wrong here), so it should be fine to keep it the way the PR implements it.

@JeffLuoo
Contributor

lgtm, can any of the approvers help review it as well? Thanks!

cc: @nirrozenbaum @kfswain

@kfswain
Collaborator

kfswain commented Jan 13, 2026

/approve

Excited to have E2E tracing, thanks all! Will leave to reviewers for final stamp.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2026
Member

@damemi damemi left a comment

I don't know enough about the envoy handling to say for sure, but it could be worth a todo to look into the manual context propagation. Otherwise lgtm!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, JeffLuoo, kfswain, sallyom


@kfswain
Collaborator

kfswain commented Jan 16, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2026
@k8s-ci-robot k8s-ci-robot merged commit 812efb2 into kubernetes-sigs:main Jan 16, 2026
12 checks passed
RyanRosario pushed a commit to RyanRosario/gateway-api-inference-extension that referenced this pull request Jan 20, 2026
sallyom added a commit to sallyom/gateway-api-inference-extension that referenced this pull request Jan 25, 2026
@Gregory-Pereira
Member

/milestone v1.4

Not sure I have permissions though

@k8s-ci-robot
Contributor

@Gregory-Pereira: You must be a member of the kubernetes-sigs/gateway-api-inference-extension-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Inference Gateway Milestone Maintainers and have them propose you as an additional delegate for this responsibility.


@kfswain kfswain added this to the v1.4 milestone Feb 5, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

7 participants