Skip to content

feat(backend): add gRPC metrics to api-server (RPS/latency), optimize execution spec reporting #12010

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ntny
Copy link
Contributor

@ntny ntny commented Jun 24, 2025

Description of your changes:

  • Add standard gRPC RPS and latency metrics to each API server endpoint. There are a lot of manually added metrics in run_server.go.
    This PR replaces most of them with standardized gRPC metrics.
    Additionally, many services in the api-server previously had no metrics at all — this PR adds basic observability for them as well.
    I kept the old metrics to maintain backward compatibility with existing dashboards.

  • Add metrics to measure the delay between Argo Workflow creation for recurring runs and its reporting to the API server by the persistence agent

  • Optimize patchExistingTasks MySQL query by using the runId field to leverage the existing index (since podName is not covered by the index)

ReportWorkflowV1 Performance After Optimization (1.5M Tasks in MySQL):
Снимок экрана 2025-06-24 в 16 10 41

Metrics will be available after the PR is merged.

This PR was motivated by an issue that emerged as the number of tasks in our system increased.
Recurring runs were created successfully, but their status was not updated for an extended period of time.
The same issue also affected regular (one-off) runs, which also experienced delays in status reporting.

Copy link

Hi @ntny. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@ntny ntny changed the title - add standard grpc metrics to api-server feat(backend): add gRPC metrics to api-server (RPS/latency), optimize execution spec reporting Jun 24, 2025
@ntny ntny force-pushed the feat(backend)-api-server-metrics-and-optimize-reporting branch from 7366105 to 0f0b567 Compare June 24, 2025 13:18
@zazulam
Copy link
Collaborator

zazulam commented Jun 25, 2025

/ok-to-test

Copy link
Collaborator

@zazulam zazulam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.

This project is depreacted and archived as the functionality moved to go-grpc-middleware repo since provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and morernized, yet functionality is similar to what v1.2.0 offered. All questions and issues you can submit here.

Is it possible to leverage the modules/structs from the middleware package instead?

@ntny
Copy link
Contributor Author

ntny commented Jun 26, 2025

lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.

This project is depreacted and archived as the functionality moved to go-grpc-middleware repo since provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and morernized, yet functionality is similar to what v1.2.0 offered. All questions and issues you can submit here.

Is it possible to leverage the modules/structs from the middleware package instead?

lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.

This project is depreacted and archived as the functionality moved to go-grpc-middleware repo since provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and morernized, yet functionality is similar to what v1.2.0 offered. All questions and issues you can submit here.

Is it possible to leverage the modules/structs from the middleware package instead?

  1. I was a bit concerned about that too, but decided to stick with the v1 version since the project already uses github.com/grpc-ecosystem/go-grpc-middleware v1.3.0. Using go-grpc-prometheus/v2 would require upgrading to the v2 middleware stack, which I figured would be better handled in a separate merge request with proper testing.
  2. Theoretically, yes but this requires implementing all metrics manually with matching naming and building a custom interceptor. That adds some maintenance overhead. It might be an option to explore if there are concerns about using a package.

@zazulam
Copy link
Collaborator

zazulam commented Jun 26, 2025

  1. I was a bit concerned about that too, but decided to stick with the v1 version since the project already uses github.com/grpc-ecosystem/go-grpc-middleware v1.3.0. Using go-grpc-prometheus/v2 would require upgrading to the v2 middleware stack, which I figured would be better handled in a separate merge request with proper testing.
  2. Theoretically, yes but this requires implementing all metrics manually with matching naming and building a custom interceptor. That adds some maintenance overhead. It might be an option to explore if there are concerns about using a package.

Makes sense, I'm okay with using the go-grpc-prometheus package if we can get an issue for tracking the upgrade of it / the middleware stack.

Any thoughts from your end @HumairAK?

@mprahl
Copy link
Collaborator

mprahl commented Jul 11, 2025

@ntny I made a PR to update go-grpc-middleware, so once that's merged, you can rebase your PR to use v2:
#12043

I prefer this approach since it seems straight forward to update it rather than add another unsupported library to KFP.

arpechenin added 2 commits July 12, 2025 18:53
- add report gap histogram
- optimize create or update tasks query

Signed-off-by: ntny <[email protected]>
Signed-off-by: arpechenin <[email protected]>
@ntny ntny force-pushed the feat(backend)-api-server-metrics-and-optimize-reporting branch from 0f0b567 to ff66346 Compare July 12, 2025 21:11
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from humairak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ntny
Copy link
Contributor Author

ntny commented Jul 12, 2025

/hold

need to configure metric registration with the new go-grpc-middleware/providers/prometheus api

@ntny ntny force-pushed the feat(backend)-api-server-metrics-and-optimize-reporting branch from 90eb530 to 7d609e9 Compare July 12, 2025 22:53
@ntny
Copy link
Contributor Author

ntny commented Jul 12, 2025

/unhold

Signed-off-by: arpechenin <[email protected]>
@ntny ntny force-pushed the feat(backend)-api-server-metrics-and-optimize-reporting branch from 7d609e9 to bd7693d Compare July 12, 2025 23:12
Copy link
Collaborator

@zazulam zazulam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @ntny!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jul 13, 2025
@juliusvonkohout
Copy link
Member

@HumairAK for approval

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

5 participants