Optimize KV cache distribution for asymmetric pipeline parallelism #25164
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a valuable optimization for distributing KV cache memory proportionally in asymmetric pipeline parallelism setups. The implementation is well-structured and includes comprehensive unit tests. My review focuses on simplifying the core calculation logic for better efficiency and maintainability. By removing an unnecessary loop, the code becomes cleaner and more direct. I've provided suggestions to refactor this logic in both `vllm/worker/worker.py` and `vllm/v1/worker/gpu_worker.py`.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
I'd like to add a compelling option to enable proportional KV cache memory distribution. I use pipeline parallelism with 2 GPUs and load layers between them asymmetrically. When using `--kv-cache-memory-bytes`, vLLM correctly allocates only as much VRAM as is needed to load the model layers on each device; however, for the KV cache it allocates memory uniformly. This means that if I split the layers, say, 48/16, the first rank ends up with ~1x max concurrency per request while the second rank ends up with ~3x, since the same byte budget spread over a third as many layers caches three times as many tokens. This is extremely wasteful: the second rank is mostly bottlenecked by the prior stage with its lower max concurrency. In short, this behavior causes wildly inefficient memory usage.
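For intuition, here is a back-of-the-envelope illustration of the imbalance (my own sketch; the byte budget is a made-up example, not a number from this PR):

```python
# Why a uniform per-rank KV cache budget is wasteful with a 48/16 layer split:
# KV cache size scales with layers * tokens, so for a fixed per-rank byte
# budget B, the number of cacheable tokens scales as B / layers.
B = 8 * 1024**3                     # example per-rank budget (illustrative)
tokens_rank0 = B / 48               # rank 0 holds 48 layers
tokens_rank1 = B / 16               # rank 1 holds only 16 layers
print(tokens_rank1 / tokens_rank0)  # -> 3.0: rank 1 supports ~3x the concurrency
```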
Solution:
Add an opt-in flag `--enable-pp-prop-kv-cache` which, when enabled, modifies the behavior of `--kv-cache-memory-bytes` so that the budget is distributed per device in proportion to the number of layers assigned to that device (as per `VLLM_PP_LAYER_PARTITION`) instead of being a uniform allocation per device. A minimal sketch of the idea follows.
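The sketch below is my own reconstruction of the idea, not the PR's actual code; the exact rounding/normalization in the PR may differ:

```python
def proportional_kv_cache_bytes(kv_cache_memory_bytes: int,
                                layers_per_rank: list[int]) -> list[int]:
    """Distribute the KV cache byte budget across pipeline ranks in
    proportion to each rank's layer count, so that every rank ends up
    with the same per-layer (and hence per-request) capacity."""
    total_layers = sum(layers_per_rank)
    return [kv_cache_memory_bytes * n // total_layers for n in layers_per_rank]

# Example: a 48/16 partition (cf. VLLM_PP_LAYER_PARTITION="48,16") with an
# 8 GiB budget. Uniform behavior would put 8 GiB on *both* ranks; the
# proportional split puts 6 GiB on rank 0 and 2 GiB on rank 1, equalizing
# the per-layer budget across stages.
print(proportional_kv_cache_bytes(8 * 1024**3, [48, 16]))
# -> [6442450944, 2147483648]
```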
Test Plan
In addition to unit tests, I manually tested a bunch of cases comparing behavior before and after my changes. I am only including the relevant arguments and a reference `vllm serve` invocation in what follows, but I list more details of my environment below and am happy to share the full engine arguments if anyone would like.
(A) Baseline run (prior to any of my changes).
(B) The same run, but with the new `--enable-pp-prop-kv-cache` flag enabled.
Unit tests were run as well; a sketch of the kind of property they check is shown below.
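The exact commands aren't reproduced above; as a stand-in, here is a sketch of the kind of property such unit tests could exercise (hypothetical test and helper names, not the PR's actual test file):

```python
import pytest

def split_kv_cache_bytes(total_bytes: int, layers_per_rank: list[int]) -> list[int]:
    # Same proportional rule as sketched in the Solution section above.
    total_layers = sum(layers_per_rank)
    return [total_bytes * n // total_layers for n in layers_per_rank]

@pytest.mark.parametrize("partition", [[48, 16], [32, 32], [40, 24]])
def test_split_equalizes_per_layer_budget(partition):
    budget = 8 * 1024**3
    shares = split_kv_cache_bytes(budget, partition)
    per_layer = [share / n for share, n in zip(shares, partition)]
    # Every rank should get (nearly) the same per-layer budget; integer
    # division may shave off at most a byte per rank.
    assert max(per_layer) - min(per_layer) <= 1.0
```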
Environment details: the `vllm/vllm-openai:v0.10.2` Docker image.
Test Result
(A) Baseline (pre-implementation) vs. (B) with `--enable-pp-prop-kv-cache` enabled (the engine log excerpts are not reproduced here). Note how Scenario A allocates the KV cache uniformly across ranks, while Scenario B allocates it in proportion to each rank's layer count. Note also that it doesn't normalize the result to 1.00 or anything; I just lined up the math perfectly on this one :)
In terms of numbers, in my scenario this decreased my KV cache memory usage by roughly 50% for the same performance! Other setups could benefit more or less than this depending on the partitioning and the concurrency multiplier.
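For intuition, the 50% figure follows from the arithmetic above (my own derivation, assuming the proportional split sketched earlier): with a 48/16 partition and a per-rank budget $B$, the uniform scheme allocates $2B$ in total, while the proportional scheme allocates

$$\frac{48}{64}B + \frac{16}{64}B = B,$$

i.e. half the total memory for the same effective concurrency.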