[Core] Implement disagg prefill by StatelessProcessGroup #10502
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: KuntaiDu <[email protected]> Co-authored-by: ApostaC <[email protected]> Co-authored-by: YaoJiayi <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Force-pushed from 4541111 to 1eadc94
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
… package Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
This is a known issue; it will be addressed in a future PR by @KuntaiDu. If you need a quick workaround, you can modify
Hello, I encountered the following issue while running disaggregated inference on the main branch. The actual KV cache shape is kv_cache[0] shape torch.Size([2162, 81920]). INFO 12-03 14:31:48 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241203-143148.pkl... My startup commands run the prefill instance with CUDA_VISIBLE_DEVICES=3, the decode instance with CUDA_VISIBLE_DEVICES=4, and the proxy with nohup python3 disagg_prefill_proxy_server.py > $Log_folder/proxy_server.log 2>&1 &
Hi @liweiqing1997, currently I have only tested Llama-style models. What kind of model are you using?
I guess you are using a GPU with the Volta or Turing architecture? I found this problem in an older version of this PR. @KuntaiDu, if you don't have bandwidth, I can propose a PR to fix this.
I am testing Qwen 1.5 14B Chat. Previously, I tested a version that had not yet been merged into vllm/main, and it ran successfully. However, the main-branch version does not work. I'm not sure whether something changed or whether there is an issue with my settings.
BTW, feel free to also comment on the disaggregated prefill roadmap (#10818).
NVIDIA A100-SXM4-80GB
OK, then this bug may affect a wider range than I thought. My solution is to obtain
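(The concrete fix is cut off above. As a rough illustration of the kind of layout conversion under discussion, here is a hypothetical sketch, assuming the PagedAttention key-cache shape [num_blocks, num_heads, head_size // x, block_size, x] used by non-FlashAttention backends; none of this is code from the PR.)

```python
import torch

# Hypothetical sketch: convert a PagedAttention-style key cache into the
# FlashAttention-style layout that the disagg-prefill connector expects.
# The shapes below are the standard vLLM paged KV layouts; the function
# and variable names are made up for illustration.
def key_cache_to_flash_layout(key_cache: torch.Tensor) -> torch.Tensor:
    """[num_blocks, num_heads, head_size // x, block_size, x]
    -> [num_blocks, block_size, num_heads, head_size]"""
    num_blocks, num_heads, head_size_div_x, block_size, x = key_cache.shape
    head_size = head_size_div_x * x
    # Bring block_size in front of the head dims, then merge the split
    # head dimension (head_size // x, x) back into head_size.
    return (key_cache.permute(0, 3, 1, 2, 4)
            .reshape(num_blocks, block_size, num_heads, head_size))

# The value cache ([num_blocks, num_heads, head_size, block_size]) would
# need its own permute(0, 3, 1, 2) in the same spirit.
```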
…t#10502) This PR provides initial support for single-node disaggregated prefill in the 1P1D scenario. Signed-off-by: KuntaiDu <[email protected]> Co-authored-by: ApostaC <[email protected]> Co-authored-by: YaoJiayi <[email protected]>
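(For readers new to the thread: 1P1D means one prefill instance producing KV caches and one decode instance consuming them. Below is a minimal sketch of how this feature is wired up offline, mirroring the disaggregated-prefill example shipped with vLLM; the model name is illustrative and the KVTransferConfig import path may differ across versions.)

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Prefill instance: KV producer, rank 0 of a 2-member KV-transfer group.
ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
    '"kv_rank":0,"kv_parallel_size":2}')
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc)

# max_tokens=1: this instance only needs to compute (and ship) the KV cache.
llm.generate(["Hello, my name is"], SamplingParams(max_tokens=1))

# The decode instance uses the same config except "kv_role":"kv_consumer"
# and "kv_rank":1; it receives the prefilled KV cache instead of
# recomputing it.
```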
Hello, I noticed that in #6170 you used torch.distributed.init_process_group to initialize all ranks for the prefill and decode nodes, but later changed it to StatelessProcessGroup for the KV cache transfer.
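(For context, a minimal sketch of the stateless approach, assuming the StatelessProcessGroup API in vllm/distributed/utils.py around the time of this PR; host, port, and payload are illustrative. Unlike torch.distributed.init_process_group, this does not touch the global default group, so the prefill and decode instances keep their own internal process groups intact.)

```python
from vllm.distributed.utils import StatelessProcessGroup

# Prefill side: rank 0 of a 2-process group formed over plain TCP.
pg = StatelessProcessGroup.create(
    host="127.0.0.1", port=18888, rank=0, world_size=2)
# Ship a small metadata object to the decode side (the KV tensors
# themselves go through a NCCL communicator built on top of this group).
pg.send_obj({"request_id": "req-0", "num_prefill_tokens": 128}, dst=1)

# Decode side, run in the other process:
# pg = StatelessProcessGroup.create(
#     host="127.0.0.1", port=18888, rank=1, world_size=2)
# meta = pg.recv_obj(src=0)
```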
@liuyumoye, can you take a look at #10884? I think the Mooncake transfer engine should support CPU transfer.
Thanks, I'll try your suggestion.
Hello, can you run tp=4 using this? @AmberLJC
@chenkaiyue I managed to do that in Dec 2024. We noticed some issues related to buffer data management under high request load, though, as well as a deadlock issue (a large request blocking later small requests). Not sure whether they've been fixed since.
I also observed a similar issue. It happens only in some dev environments and is hard to reproduce. Please use a third-party connector (e.g., Mooncake / LMCache); I'll work on a fix when I have bandwidth.
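(For anyone hitting this, the connector is chosen via the kv_connector field of the KV-transfer config. A hedged sketch of swapping in, e.g., the LMCache connector; the role and rank values here are assumptions, not a tested setup.)

```python
from vllm.config import KVTransferConfig

# Same wiring as the PyNccl example above, but routed through a
# third-party connector (values are illustrative, not a tested config).
ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"LMCacheConnector","kv_role":"kv_producer",'
    '"kv_rank":0,"kv_parallel_size":2}')
```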
Thanks Kuntai!
Thanks for the great work!
Also, the last latency result seems too high: it costs 117 ms to transfer 1 KB of data over 1000 iterations, i.e., nearly 95 µs per 1 KB transfer, while RDMA needs only 5-10 µs.
Hi @KuntaiDu, I would like to ask how to set up tp=4. Currently I'm using v0.8.2; when I set --tensor_parallel_size to 2, the program fails to run.
Thanks for the great work! I tried to deploy 1 prefill and 2 decode instances on my three 4090 GPUs. The deployment code is as follows. But I encountered a problem: the service only processes 2 requests correctly, and subsequent requests get no response. What could be the reason?
I encountered the same problem; the error log is as follows:
Did you resolve this problem?
A lightweight implementation of disaggregated prefill. I switched from PR #8498 to this PR in order to fix DCO issues.