Use the correct vLLM metric: gpu_cache_usage_perc --> kv_cache_usage_perc (#1905)
Conversation
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Welcome @ezrasilvera!

Hi @ezrasilvera. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/ok-to-test

In which vllm version was it changed?

It seems like a relatively old change. You can see vllm-project/vllm#18354 and vllm-project/vllm#24245.

Great catch @ezrasilvera.

/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ezrasilvera, nirrozenbaum. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Thank you @ezrasilvera for fixing this! Looks like this was changed in v0.10, and I confirmed that the latest llm-d image has the change. I filed an issue in vllm to add conformance tests to prevent such breaking changes:

@nirrozenbaum Any chance we cherrypick this to the release? |
@liu-cong @nirrozenbaum Unfortunately I don't think it's a vllm issue, and I don't think tests on the vllm side can help avoid such issues. They actually had a grace period in which both metrics existed.
I think there is value in adding tests both on the vllm side and here. A conformance test on the vllm side lets us get notified about breaking changes early so we can plan proactively; it also gives vllm maintainers a feedback channel to better understand downstream dependencies. In EPP, a conformance test that covers the latest N vllm versions is a great idea. I think we can add a "preflight" check feature as well (we should allow it to be turned off).
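The "preflight" check proposed above could look something like this minimal sketch. Note this is a hypothetical illustration, not code from this repo (the EPP itself is Go): at startup, fetch each endpoint's `/metrics` body and verify that the metric names the scheduler depends on are actually exposed. The `REQUIRED_METRICS` set here is an assumption.

```python
# Hypothetical preflight check: parse a Prometheus text-format exposition
# (the body of GET /metrics on a vLLM replica) and report which required
# metric names are missing, so a rename like gpu_cache_usage_perc ->
# kv_cache_usage_perc is caught at startup rather than as silent zeros.

REQUIRED_METRICS = {
    "vllm:kv_cache_usage_perc",   # post-v0.10 name
    "vllm:num_requests_waiting",  # illustrative second dependency
}

def missing_metrics(exposition_text: str, required=frozenset(REQUIRED_METRICS)) -> set:
    """Return the required metric names absent from the exposition text."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at '{' (labeled sample) or whitespace (value).
        seen.add(line.split("{", 1)[0].split(None, 1)[0])
    return set(required) - seen
```

A deployment could run this against every endpoint and fail fast (or log loudly, if the check is configured off) when the set is non-empty.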
Yes, theoretically we can, although this is configurable using a flag in the helm install.
I fully agree with both comments 😀 |
@nirrozenbaum True. I am creating a PR in llm-d to configure this correctly even without the patch release, just to be safe.
Indeed. This is exactly what we did to validate that the new name is working |
…erc (kubernetes-sigs#1905) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- IGW PR kubernetes-sigs/gateway-api-inference-extension#1905 landed fixing the metric name; no longer need to specify it
- tokenizer now pulls in vLLM and thus Torch: need writeable paths for torch on tokenization
- tokenizer moves health endpoint from `/health` --> `/healthz`
- image bumps

Signed-off-by: greg pereira <grpereir@redhat.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
The metric in vLLM was changed from `gpu_cache_usage_perc` to `kv_cache_usage_perc`.
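To illustrate how a consumer can ride out the rename during the deprecation window in which vLLM exposed both names, here is a rough Python sketch (the actual EPP scraper is Go; the function and matching logic are assumptions for illustration): prefer the new name, fall back to the old one.

```python
# Illustrative fallback reader for the renamed metric: given a Prometheus
# text-format /metrics body, try the post-v0.10 name first, then the
# deprecated pre-v0.10 name, and return the sample value as a float.

def kv_cache_usage(exposition_text: str):
    """Return the KV-cache usage fraction, or None if neither name is present."""
    for name in ("vllm:kv_cache_usage_perc", "vllm:gpu_cache_usage_perc"):
        for line in exposition_text.splitlines():
            # Match "name 0.42" or "name{label=...} 0.42"; HELP/TYPE lines
            # start with '#' and so never match.
            if line.startswith(name + " ") or line.startswith(name + "{"):
                return float(line.rsplit(None, 1)[-1])
    return None
```

A reader like this keeps working across the rename without any configuration change, at the cost of one extra scan over the exposition when only the old name exists.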
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: