feat(backend): add gRPC metrics to api-server (RPS/latency), optimize execution spec reporting #12010
base: master
Conversation
Hi @ntny. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚫 This command cannot be processed. Only organization members or owners can use the commands.
Force-pushed from 7366105 to 0f0b567
/ok-to-test
lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.
This project is deprecated and archived as the functionality moved to go-grpc-middleware repo since provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and modernized, yet functionality is similar to what v1.2.0 offered. All questions and issues you can submit here.
Is it possible to leverage the modules/structs from the middleware package instead?
Makes sense, I'm okay with using the go-grpc-middleware providers/prometheus package. Any thoughts from your end @HumairAK?
- add report gap histogram
- optimize create or update tasks query

Signed-off-by: ntny <[email protected]>
Signed-off-by: arpechenin <[email protected]>
Signed-off-by: arpechenin <[email protected]>
Force-pushed from 0f0b567 to ff66346
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
need to configure metric registration with the new go-grpc-middleware/providers/prometheus API
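For reference, a minimal sketch of what registration with the new provider could look like, assuming the v2 go-grpc-middleware prometheus provider; the function name and registry wiring below are illustrative, not this PR's actual code:

```go
package main

import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// newGRPCServerWithMetrics is illustrative: it wires the provider's server
// metrics (request counters for RPS, handling-time histogram for latency)
// into a gRPC server via chained interceptors.
func newGRPCServerWithMetrics(reg prometheus.Registerer) (*grpc.Server, *grpcprom.ServerMetrics) {
	srvMetrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(),
	)
	reg.MustRegister(srvMetrics)

	server := grpc.NewServer(
		grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
		grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
	)

	// After all services are registered on the server, calling
	// srvMetrics.InitializeMetrics(server) pre-populates the counters at zero.
	return server, srvMetrics
}
```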
Force-pushed from 90eb530 to 7d609e9
/unhold
Signed-off-by: arpechenin <[email protected]>
Force-pushed from 7d609e9 to bd7693d
Great work @ntny!
/lgtm
@HumairAK for approval
Description of your changes:
Add standard gRPC RPS and latency metrics to each API server endpoint. There are a lot of manually added metrics in run_server.go.
This PR replaces most of them with standardized gRPC metrics.
Additionally, many services in the api-server previously had no metrics at all — this PR adds basic observability for them as well.
I kept the old metrics to maintain backward compatibility with existing dashboards.
Add a metric that measures the delay between the creation of an Argo Workflow for a recurring run and the moment the persistence agent reports it to the API server (see the sketch below).
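A minimal sketch of what such a "report gap" histogram could look like; the metric name, bucket layout, and helper are illustrative, not the exact code in this PR:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// reportGapSeconds is an illustrative histogram of the delay between an Argo
// Workflow's creationTimestamp and the moment its report reaches the API server.
var reportGapSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "report_workflow_gap_seconds", // hypothetical metric name
	Help:    "Delay between Argo Workflow creation and its report to the API server.",
	Buckets: prometheus.ExponentialBuckets(0.5, 2, 12), // ~0.5s up to ~17min
})

// ObserveReportGap would be called from the report handler with the
// Workflow's metadata.creationTimestamp.
func ObserveReportGap(wfCreatedAt time.Time) {
	reportGapSeconds.Observe(time.Since(wfCreatedAt).Seconds())
}
```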
Optimize the patchExistingTasks MySQL query by filtering on the runId field so it can use the existing index (podName alone is not covered by an index); a sketch of the idea follows.
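A rough sketch of the idea behind the query change; the table and column names are assumptions for illustration, and the real store builds its queries differently:

```go
package storage

import (
	"fmt"
	"strings"
)

// patchExistingTasksQuery is illustrative: instead of matching rows by pod
// name alone (which has no index), it adds the indexed run UUID to the WHERE
// clause so MySQL narrows the scan to a single run before checking pod names.
func patchExistingTasksQuery(runID string, podNames []string) (string, []interface{}) {
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(podNames)), ",")
	args := make([]interface{}, 0, len(podNames)+1)
	args = append(args, runID)
	for _, name := range podNames {
		args = append(args, name)
	}
	// Column names (RunUUID, PodName) are assumptions for this sketch.
	query := fmt.Sprintf(
		"SELECT UUID FROM tasks WHERE RunUUID = ? AND PodName IN (%s)", placeholders)
	return query, args
}
```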
ReportWorkflowV1 Performance After Optimization (1.5M Tasks in MySQL):

Metrics will be available after the PR is merged.
This PR was motivated by an issue that emerged as the number of tasks in our system increased.
Recurring runs were created successfully, but their status was not updated for an extended period of time.
The same issue affected regular (one-off) runs as well, which experienced similar delays in status reporting.