
[Core] Implement disagg prefill by StatelessProcessGroup #10502


Merged
merged 35 commits on Dec 2, 2024

Conversation

KuntaiDu
Collaborator

@KuntaiDu KuntaiDu commented Nov 20, 2024

A lightweight implementation of disaggregated prefill. I switched from PR #8498 to this PR in order to fix DCO issues.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

KuntaiDu and others added 2 commits November 20, 2024 21:46
Signed-off-by: KuntaiDu <[email protected]>
Co-authored-by: ApostaC <[email protected]>
Co-authored-by: YaoJiayi <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
@KuntaiDu KuntaiDu force-pushed the kuntai-disagg-fix-DCO branch from 4541111 to 1eadc94 on November 20, 2024 21:47
Signed-off-by: KuntaiDu <[email protected]>
@KuntaiDu KuntaiDu added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 20, 2024

mergify bot commented Nov 22, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2024
@mergify mergify bot removed the needs-rebase label Nov 22, 2024

mergify bot commented Nov 22, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 22, 2024
@mergify mergify bot removed the needs-rebase label Nov 24, 2024
@ShangmingCai
Contributor

Is there a workaround for this timeout, or a way to modify the 5-min timeout? Thanks!

This is a known issue; it will be addressed in a future PR by @KuntaiDu. If you need a quick workaround, you can modify disagg_prefill_proxy_server.py to send a shadow request every 4 minutes through apscheduler, as sketched below.
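A minimal sketch of that workaround (the proxy URL, model name, and payload below are placeholders for your own setup, not values from this PR):

# Sketch of the "shadow request" workaround described above, not part of this PR:
# periodically exercise the serving path so the kv-transfer connection does not
# sit idle past the 5-minute timeout. URLs, model name, and payload are placeholders.
import requests
from apscheduler.schedulers.background import BackgroundScheduler

def send_shadow_request():
    # Keep the request tiny; only the side effect of touching the pipeline matters.
    requests.post(
        "http://localhost:8000/v1/completions",  # placeholder proxy endpoint
        json={"model": "placeholder-model", "prompt": "ping", "max_tokens": 1},
        timeout=30,
    )

scheduler = BackgroundScheduler()
scheduler.add_job(send_shadow_request, "interval", minutes=4)
scheduler.start()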

@liweiqing1997

Hello, I encountered the following issue while running disaggregated prefill on the 'main' branch:
ValueError: not enough values to unpack (expected 4, got 2).

The actual kv_cache shape is: kv_cache[0] shape torch.Size([2162, 81920])

INFO 12-03 14:31:48 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241203-143148.pkl...
INFO 12-03 14:31:48 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241203-143148.pkl.
ERROR 12-03 14:31:48 engine.py:135] ValueError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241203-143148.pkl): not enough values to unpack (expected 4, got 2)')
ERROR 12-03 14:31:48 engine.py:135] Traceback (most recent call last):
ERROR 12-03 14:31:48 engine.py:135] File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/vllm-kuntai-disagg-refactor_1202/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 12-03 14:31:48 engine.py:135] return func(*args, **kwargs)
ERROR 12-03 14:31:48 engine.py:135] File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/vllm-kuntai-disagg-refactor_1202/vllm/worker/model_runner.py", line 1718, in execute_model
ERROR 12-03 14:31:48 engine.py:135] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 12-03 14:31:48 engine.py:135] File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/vllm-kuntai-disagg-refactor_1202/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 12-03 14:31:48 engine.py:135] self.connector.send_kv_caches_and_hidden_states(
ERROR 12-03 14:31:48 engine.py:135] File "/mnt/data_disk101/data_disk/lwq/LLM_INFER/split_platform/opensource/vllm-kuntai-disagg-refactor_1202/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 134, in send_kv_caches_and_hidden_states
ERROR 12-03 14:31:48 engine.py:135] _, _, num_heads, head_size = kv_cache[0].shape
ERROR 12-03 14:31:48 engine.py:135] ValueError: not enough values to unpack (expected 4, got 2)
ERROR 12-03 14:31:48 engine.py:135]

My startup command is:

CUDA_VISIBLE_DEVICES=3 nohup python3 \
  -m vllm.entrypoints.openai.api_server \
  --model $model \
  --port 8100 \
  --max-model-len 1000 \
  --gpu-memory-utilization 0.7 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}' > $Log_folder/p.log 2>&1 &

CUDA_VISIBLE_DEVICES=4 nohup python3 \
  -m vllm.entrypoints.openai.api_server \
  --model $model \
  --port 8200 \
  --max-model-len 1000 \
  --gpu-memory-utilization 0.7 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}' > $Log_folder/D.log 2>&1 &

nohup python3 disagg_prefill_proxy_server.py > $Log_folder/proxy_server.log 2>&1 &
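For reference, a minimal smoke test against such a 1P1D setup (a sketch with assumptions: the proxy from disagg_prefill_proxy_server.py listens on port 8000 and exposes the OpenAI-compatible /v1/completions route, as in the vLLM example; the model name is a placeholder for $model):

# Hypothetical smoke test for the setup launched above; values are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "placeholder-model",   # use the same $model passed to the servers
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=120,
)
print(resp.status_code, resp.json())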


@KuntaiDu
Collaborator Author

KuntaiDu commented Dec 3, 2024

Hi @liweiqing1997, currently I have only tested Llama-style models. What kind of model are you using?

@ShangmingCai
Contributor

Hello, I encountered the following issue while running disaggregated prefill on the 'main' branch: ValueError: not enough values to unpack (expected 4, got 2).

The actual kv_cache shape is: kv_cache[0] shape torch.Size([2162, 81920])


I guess you are using a GPU card with Volta or Turing architecture? I found this problem in an older version of this PR. @KuntaiDu If you don't have bandwidth, I can propose a PR to fix this.

@liweiqing1997

liweiqing1997 commented Dec 3, 2024

Hi @liweiqing1997, currently I have only tested Llama-style models. What kind of model are you using?

I am testing Qwen1.5-14B-Chat. Previously, I tested a version that had not yet been merged into the vllm main branch, and it ran successfully. However, the main-branch version does not work. I'm not sure whether any changes were made or whether there is an issue with my settings.

@KuntaiDu
Collaborator Author

KuntaiDu commented Dec 3, 2024

BTW, feel free to also comment on the disaggregated prefill roadmap (#10818).

@liweiqing1997

I guess you are using a GPU card with Volta or Turing architecture? I have found this problem in an older version of this PR. @KuntaiDu If you don't have bandwidth, I can propose a PR to fix this.

NVIDIA A100-SXM4-80GB

@ShangmingCai
Contributor

NVIDIA A100-SXM4-80GB

OK, then this bug may affect a wider range than I thought. My solution is to obtain num_heads and head_size from model_executable.model.config instead of getting them from kv_cache[0].shape.
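A rough sketch of that idea (my reading of the suggestion, not the merged fix; the attribute names are the usual HF config fields, and the tensor-parallel division is an assumption):

# Sketch only: derive num_heads / head_size from the model config instead of
# unpacking kv_cache[0].shape, which fails when the cache tensor is 2-D.
def kv_head_dims_from_config(model_executable, tp_size: int = 1):
    config = model_executable.model.config
    # GQA/MQA models expose num_key_value_heads; fall back to num_attention_heads.
    total_kv_heads = getattr(config, "num_key_value_heads",
                             config.num_attention_heads)
    head_size = config.hidden_size // config.num_attention_heads
    # With tensor parallelism, each rank holds only a slice of the KV heads (assumption).
    num_heads = max(1, total_kv_heads // tp_size)
    return num_heads, head_size

# instead of:
#     _, _, num_heads, head_size = kv_cache[0].shape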

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
…t#10502)

This PR provides initial support for single-node disaggregated prefill in the 1P1D scenario.
Signed-off-by: KuntaiDu <[email protected]>
Co-authored-by: ApostaC <[email protected]>
Co-authored-by: YaoJiayi <[email protected]>
@liuyumoye

Hello, I noticed that in #6170 you used torch.distributed.init_process_group to initialize all ranks for the prefill node and the decode node, but later changed it to StatelessProcessGroup for KV cache transfer.
However, StatelessProcessGroup only supports the NCCL backend. If I want to use the CPU for transferring the KV cache, do you have any good suggestions? It seems that TCPStore might not be suitable for transferring large amounts of data.

@youkaichao
Member

@liuyumoye can you take a look at #10884? I think the Mooncake transfer engine should support CPU transfer.

@liuyumoye

@liuyumoye can you take a look at #10884? I think the Mooncake transfer engine should support CPU transfer.

Thanks, I'll try your suggestion

@chenkaiyue

chenkaiyue commented Jan 20, 2025

Thanks for the great work!

I managed to run examples/kv_transfer/disagg_prefill_example.sh on 1 GPU. May I ask how to configure models with TP=4? (There is some code I don't fully understand, so I'll just ask here.) Here is my script to set up P/D; please advise on the configuration that needs to be changed.


CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model "$model_name" \
    --port 8100 \ 
    --load-format dummy \ 
    --gpu-memory-utilization 0.9 \
    --kv-connector PyNcclConnector \
    --kv-role kv_producer \
    --kv-rank 0 \
    --tensor-parallel-size 4 \
    --kv-parallel-size 2 &
 
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model "$model_name" \
    --load-format dummy \ 
    --port 8200 \
    --max-num-batched-tokens 200000 \
    --gpu-memory-utilization 0.9 \
    --kv-connector PyNcclConnector \
    --kv-role kv_consumer \
    --kv-rank 1 \
    --tensor-parallel-size 4 \
    --kv-parallel-size 2 &

Hello, were you able to run TP=4 with this? @AmberLJC

@AmberLJC

@chenkaiyue I managed to do that in Dec 2024. We did notice some issues related to buffer data management under high request load, and a deadlock issue (a large request blocking later small requests), though. Not sure whether they have been fixed since.

@KuntaiDu
Collaborator Author

I also observed a similar issue. It happens only in some dev environments and is hard to reproduce. Please use a third-party connector (e.g. Mooncake / LMCache) for now; I'll work on a fix when I have bandwidth.

@AmberLJC

Thanks Kuntai!
(We found potential ways to fix, happy to talk if you want.)

anko-intel pushed a commit to HabanaAI/vllm-fork that referenced this pull request Feb 12, 2025
…t#10502)

This PR provides initial support for single-node disaggregated prefill in the 1P1D scenario.
Signed-off-by: KuntaiDu <[email protected]>
Co-authored-by: ApostaC <[email protected]>
Co-authored-by: YaoJiayi <[email protected]>
@WoShiAPei

WoShiAPei commented Feb 17, 2025

Thanks for the great work!
I am running test_send_recv.py and want to know what the code below is used for. There is a StatelessProcessGroup in the KV cache connector, which seems to make init_process_group redundant. I tried removing this code and found that test_send_recv.sh still runs normally.

    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://localhost:12452',
        world_size=2,
        rank=my_rank,
    )
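For context, a sketch of what a gloo group like this typically provides in a two-process test: CPU-side collectives (barrier, broadcast) that are separate from the pipe the connector builds internally. Whether test_send_recv.py actually relies on any of them is an assumption on my part; the observation that the test still runs after removing the code suggests it may not.

# Sketch (assumption, not the actual test): generic cross-process coordination
# that a gloo group offers, independent of the KV-cache pipe.
import torch
import torch.distributed as dist

def coordinate(my_rank: int):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:12452",  # same placeholder address as above
        world_size=2,
        rank=my_rank,
    )
    dist.barrier()                    # keep both test processes in lockstep
    flag = torch.tensor([my_rank])    # broadcast a small CPU tensor from rank 0
    dist.broadcast(flag, src=0)
    return flag.item()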

@WoShiAPei

Also, the latency result seems too high: it costs 117 ms to transfer 1 KB of data for 1000 iterations, and nearly 95 us for a single 1 KB transfer, while RDMA only needs 5~10 us.

@xidiancpy

Hi @KuntaiDu, I would like to ask how to set up TP=4. Currently, I'm using version v0.8.2. When I set --tensor_parallel_size to 2, the program fails to run.

@Dreamer-HIT

Thanks for the great work!

I tried to deploy 1 prefill and 2 decode instances on my 3 4090 GPUs. The deployment code is as follows. But I encountered a problem: the service only processes 2 requests correctly, and subsequent requests get no response. What could be the reason?

launch_disagg_prefill() {
  model="/root/autodl-tmp/Model/Meta-Llama-3.1-8B-Instruct" 
  # disagg prefill
  CUDA_VISIBLE_DEVICES=0 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":3,"kv_buffer_size":5e9}' &

  CUDA_VISIBLE_DEVICES=1 python3 \
    -m vllm.entrypoints.openai.api_server \
    --model $model \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":3,"kv_buffer_size":5e9}' &

    CUDA_VISIBLE_DEVICES=2 python3 \
        -m vllm.entrypoints.openai.api_server \
        --model $model \
        --port 8300 \
        --max-model-len 10000 \
        --gpu-memory-utilization 0.9 \
        --kv-transfer-config \
        '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":2,"kv_parallel_size":3,"kv_buffer_size":5e9}' &


  wait_for_server 8100
  wait_for_server 8200
  wait_for_server 8300
 
  python3 disagg_proxy_demo.py \
       --model $model  \
       --prefill localhost:8100   \
       --decode localhost:8200 localhost:8300   \
       --port 8000 &

  sleep 1
}

@tensorflowt

Thanks for the great work!

I tried to deploy 1 prefill and 2 decode instances on my 3 4090 GPUs with the deployment script above, but I encountered a problem: the service only processes 2 requests correctly, and subsequent requests get no response. What could be the reason?


I encountered the same problem; the error log is as follows:

[rank1]:[W411 00:33:04.667467898 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=135, addr=[localhost]:10002, remote=[localhost]:14581) returned 0, likely a timeout
[rank1]:[W411 00:33:04.669049648 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=135, addr=[localhost]:10002, remote=[localhost]:14581) timed out after 300000ms
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [pynccl_pipe.py:264] Encountering exception in KV receiving thread
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [pynccl_pipe.py:265] wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [pynccl_pipe.py:266] My device: cuda:1
(VllmWorkerProcess pid=50535) Traceback (most recent call last):
(VllmWorkerProcess pid=50535)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50535)     tensor = future.result()
(VllmWorkerProcess pid=50535)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50535)     return self.__get_result()
(VllmWorkerProcess pid=50535)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50535)     raise self._exception
(VllmWorkerProcess pid=50535)   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50535)     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50535)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50535)     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50535)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50535)     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50535)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50535)     self.store.get(
(VllmWorkerProcess pid=50535) torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2347, in run_method
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 91, in start_worker_execution_loop
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1744, in execute_model
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.connector.recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 289, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     ret = self.select(current_tokens,
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 142, in select
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.consumer_buffer.drop_select(input_tokens, roi)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 202, in drop_select
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     input_tokens = self.data_pipe.recv_tensor()
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 269, in recv_tensor
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise e
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     tensor = future.result()
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.__get_result()
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise self._exception
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     self.store.get(
(VllmWorkerProcess pid=50535) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
[rank3]:[W411 00:33:04.679038558 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=135, addr=[localhost]:10004, remote=[localhost]:14585) returned 0, likely a timeout
[rank3]:[W411 00:33:04.679241500 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=135, addr=[localhost]:10004, remote=[localhost]:14585) timed out after 300000ms
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [pynccl_pipe.py:264] Encountering exception in KV receiving thread
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [pynccl_pipe.py:265] wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [pynccl_pipe.py:266] My device: cuda:3
(VllmWorkerProcess pid=50537) Traceback (most recent call last):
(VllmWorkerProcess pid=50537)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50537)     tensor = future.result()
(VllmWorkerProcess pid=50537)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50537)     return self.__get_result()
(VllmWorkerProcess pid=50537)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50537)     raise self._exception
(VllmWorkerProcess pid=50537)   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50537)     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50537)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50537)     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50537)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50537)     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50537)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50537)     self.store.get(
(VllmWorkerProcess pid=50537) torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2347, in run_method
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 91, in start_worker_execution_loop
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1744, in execute_model
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.connector.recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 289, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     ret = self.select(current_tokens,
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 142, in select
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.consumer_buffer.drop_select(input_tokens, roi)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 202, in drop_select
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     input_tokens = self.data_pipe.recv_tensor()
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 269, in recv_tensor
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise e
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     tensor = future.result()
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.__get_result()
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise self._exception
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     self.store.get(
(VllmWorkerProcess pid=50537) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
[rank2]:[W411 00:33:04.699075374 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=135, addr=[localhost]:10002, remote=[localhost]:14583) returned 0, likely a timeout
[rank2]:[W411 00:33:04.700465258 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=135, addr=[localhost]:10002, remote=[localhost]:14583) timed out after 300000ms
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [pynccl_pipe.py:264] Encountering exception in KV receiving thread
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [pynccl_pipe.py:265] wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [pynccl_pipe.py:266] My device: cuda:2
(VllmWorkerProcess pid=50536) Traceback (most recent call last):
(VllmWorkerProcess pid=50536)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50536)     tensor = future.result()
(VllmWorkerProcess pid=50536)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50536)     return self.__get_result()
(VllmWorkerProcess pid=50536)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50536)     raise self._exception
(VllmWorkerProcess pid=50536)   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50536)     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50536)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50536)     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50536)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50536)     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50536)   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50536)     self.store.get(
(VllmWorkerProcess pid=50536) torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop.
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2347, in run_method
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 91, in start_worker_execution_loop
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.execute_model(execute_model_req=None)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     output = self.model_runner.execute_model(
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1744, in execute_model
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.connector.recv_kv_caches_and_hidden_states(
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 289, in recv_kv_caches_and_hidden_states
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     ret = self.select(current_tokens,
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 142, in select
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.consumer_buffer.drop_select(input_tokens, roi)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 202, in drop_select
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     input_tokens = self.data_pipe.recv_tensor()
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 269, in recv_tensor
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise e
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     tensor = future.result()
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.__get_result()
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     raise self._exception
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     result = self.fn(*self.args, **self.kwargs)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     metadata = self._recv_metadata()
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     return self.group.recv_obj(self.target_rank_for_recv)
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238]     self.store.get(
(VllmWorkerProcess pid=50536) ERROR 04-11 00:33:04 [multiproc_worker_utils.py:238] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
[rank0]:[W411 00:33:04.773462199 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=162, addr=[localhost]:21692, remote=[localhost]:14579) returned 0, likely a timeout
[rank0]:[W411 00:33:04.774374699 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=162, addr=[localhost]:21692, remote=[localhost]:14579) timed out after 300000ms
ERROR 04-11 00:33:04 [pynccl_pipe.py:264] Encountering exception in KV receiving thread
ERROR 04-11 00:33:04 [pynccl_pipe.py:265] wait timeout after 300000ms, keys: /send_to/1/40
ERROR 04-11 00:33:04 [pynccl_pipe.py:266] My device: cuda:0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
    tensor = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
    metadata = self._recv_metadata()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
    return self.group.recv_obj(self.target_rank_for_recv)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
    self.store.get(
torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
ERROR 04-11 00:33:04 [engine.py:160] DistStoreError('wait timeout after 300000ms, keys: /send_to/1/40')
ERROR 04-11 00:33:04 [engine.py:160] Traceback (most recent call last):
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 04-11 00:33:04 [engine.py:160]     self.run_engine_loop()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 04-11 00:33:04 [engine.py:160]     request_outputs = self.engine_step()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 247, in engine_step
ERROR 04-11 00:33:04 [engine.py:160]     raise e
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 04-11 00:33:04 [engine.py:160]     return self.engine.step()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1430, in step
ERROR 04-11 00:33:04 [engine.py:160]     outputs = self.model_executor.execute_model(
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 299, in execute_model
ERROR 04-11 00:33:04 [engine.py:160]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 04-11 00:33:04 [engine.py:160]     return self.driver_worker.execute_model(execute_model_req)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 04-11 00:33:04 [engine.py:160]     output = self.model_runner.execute_model(
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-11 00:33:04 [engine.py:160]     return func(*args, **kwargs)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1744, in execute_model
ERROR 04-11 00:33:04 [engine.py:160]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
ERROR 04-11 00:33:04 [engine.py:160]     return self.connector.recv_kv_caches_and_hidden_states(
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 289, in recv_kv_caches_and_hidden_states
ERROR 04-11 00:33:04 [engine.py:160]     ret = self.select(current_tokens,
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 142, in select
ERROR 04-11 00:33:04 [engine.py:160]     return self.consumer_buffer.drop_select(input_tokens, roi)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 202, in drop_select
ERROR 04-11 00:33:04 [engine.py:160]     input_tokens = self.data_pipe.recv_tensor()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 269, in recv_tensor
ERROR 04-11 00:33:04 [engine.py:160]     raise e
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 262, in recv_tensor
ERROR 04-11 00:33:04 [engine.py:160]     tensor = future.result()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
ERROR 04-11 00:33:04 [engine.py:160]     return self.__get_result()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
ERROR 04-11 00:33:04 [engine.py:160]     raise self._exception
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 04-11 00:33:04 [engine.py:160]     result = self.fn(*self.args, **self.kwargs)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 192, in _recv_impl
ERROR 04-11 00:33:04 [engine.py:160]     metadata = self._recv_metadata()
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/pynccl_pipe.py", line 167, in _recv_metadata
ERROR 04-11 00:33:04 [engine.py:160]     return self.group.recv_obj(self.target_rank_for_recv)
ERROR 04-11 00:33:04 [engine.py:160]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/utils.py", line 170, in recv_obj
ERROR 04-11 00:33:04 [engine.py:160]     self.store.get(
ERROR 04-11 00:33:04 [engine.py:160] torch.distributed.DistStoreError: wait timeout after 300000ms, keys: /send_to/1/40
CRITICAL 04-11 00:33:04 [launcher.py:116] MQLLMEngine is already dead, terminating server process
INFO:     127.0.0.1:30154 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [49945]
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 317, in _bootstrap
    util._exit_function()
  File "/usr/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 426, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
INFO 04-11 00:33:05 [multiproc_worker_utils.py:137] Terminating local vLLM worker processes
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@sjtuchenyp

Thanks for the great work!

I tried to deploy 1 prefill and 2 decode instances on my 3 4090 GPUs with the deployment script above, but I encountered a problem: the service only processes 2 requests correctly, and subsequent requests get no response. What could be the reason?


Did you resolve this problem?
