feat(backend): add gRPC metrics to api-server (RPS/latency), optimize execution spec reporting #12010
base: master
Conversation
Hi @ntny. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
🚫 This command cannot be processed. Only organization members or owners can use the commands.
Force-pushed from 7366105 to 0f0b567
/ok-to-test
lgtm, just curious if there is a reason behind sticking with this go-grpc-prometheus package when its github page has it listed as deprecated.
This project is deprecated and archived as the functionality moved to go-grpc-middleware repo since provider/[email protected] release. You can pull it using go get github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus. The API is simplified and modernized, yet functionality is similar to what v1.2.0 offered. All questions and issues you can submit here.
Is it possible to leverage the modules/structs from the middleware package instead?
Makes sense, I'm okay with using the go-grpc-middleware providers/prometheus package. Any thoughts from your end @HumairAK?
- add report gap histogram
- optimize create or update tasks query

Signed-off-by: ntny <[email protected]>
Signed-off-by: arpechenin <[email protected]>
Signed-off-by: arpechenin <[email protected]>
Force-pushed from 0f0b567 to ff66346
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
need to configure metric registration with the new go-grpc-middleware/providers/prometheus API
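For reference, a minimal sketch of what registration with the new provider could look like, assuming the v2 go-grpc-middleware prometheus provider; the function name and registry wiring below are illustrative, not this PR's actual code:

```go
package main

import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// newGRPCServerWithMetrics is illustrative: it wires the provider's server
// metrics (request counters for RPS, handling-time histogram for latency)
// into a gRPC server via chained interceptors.
func newGRPCServerWithMetrics(reg prometheus.Registerer) (*grpc.Server, *grpcprom.ServerMetrics) {
	srvMetrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(),
	)
	reg.MustRegister(srvMetrics)

	server := grpc.NewServer(
		grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
		grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
	)

	// After all services are registered on the server, calling
	// srvMetrics.InitializeMetrics(server) pre-populates the counters at zero.
	return server, srvMetrics
}
```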
Force-pushed from 90eb530 to 7d609e9
/unhold
Signed-off-by: arpechenin <[email protected]>
Force-pushed from 7d609e9 to bd7693d
Great work @ntny!
/lgtm
@HumairAK for approval
Description of your changes:
Add standard gRPC RPS and latency metrics to each API server endpoint. There are a lot of manually added metrics in run_server.go.
This PR replaces most of them with standardized gRPC metrics.
Additionally, many services in the api-server previously had no metrics at all — this PR adds basic observability for them as well.
I kept the old metrics to maintain backward compatibility with existing dashboards.
Add a metric that measures the delay between the creation of an Argo Workflow for a recurring run and the moment the persistence agent reports it to the API server (see the sketch below).
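A minimal sketch of what such a "report gap" histogram could look like; the metric name, bucket layout, and helper are illustrative, not the exact code in this PR:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// reportGapSeconds is an illustrative histogram of the delay between an Argo
// Workflow's creationTimestamp and the moment its report reaches the API server.
var reportGapSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "report_workflow_gap_seconds", // hypothetical metric name
	Help:    "Delay between Argo Workflow creation and its report to the API server.",
	Buckets: prometheus.ExponentialBuckets(0.5, 2, 12), // ~0.5s up to ~17min
})

// ObserveReportGap would be called from the report handler with the
// Workflow's metadata.creationTimestamp.
func ObserveReportGap(wfCreatedAt time.Time) {
	reportGapSeconds.Observe(time.Since(wfCreatedAt).Seconds())
}
```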
Optimize the patchExistingTasks MySQL query by filtering on the runId field so it can use the existing index (podName alone is not covered by an index); a sketch of the idea follows.
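A rough sketch of the idea behind the query change; the table and column names are assumptions for illustration, and the real store builds its queries differently:

```go
package storage

import (
	"fmt"
	"strings"
)

// patchExistingTasksQuery is illustrative: instead of matching rows by pod
// name alone (which has no index), it adds the indexed run UUID to the WHERE
// clause so MySQL narrows the scan to a single run before checking pod names.
func patchExistingTasksQuery(runID string, podNames []string) (string, []interface{}) {
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(podNames)), ",")
	args := make([]interface{}, 0, len(podNames)+1)
	args = append(args, runID)
	for _, name := range podNames {
		args = append(args, name)
	}
	// Column names (RunUUID, PodName) are assumptions for this sketch.
	query := fmt.Sprintf(
		"SELECT UUID FROM tasks WHERE RunUUID = ? AND PodName IN (%s)", placeholders)
	return query, args
}
```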
ReportWorkflowV1 Performance After Optimization (1.5M Tasks in MySQL):

Metrics will be available after the PR is merged.
This PR was motivated by an issue that emerged as the number of tasks in our system increased.
Recurring runs were created successfully, but their status was not updated for an extended period of time.
The same issue affected regular (one-off) runs as well, which experienced similar delays in status reporting.