[Feature] Support torch profiler across omni stages #553
hsliuustc0106 merged 17 commits into vllm-project:main
Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
As discussed in the weekly meeting, we have these TODOs under this pull request:

1. Rebase
2. API design
```python
llm = LLM(
    model="xx",
    tensor_parallel_size=1,
    profiler_config={
        "profiler": "torch",
        "torch_profiler_dir": "./vllm_profile",
    },
)
llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
llm.stop_profile()
```

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'
```

3. Diffusion Pipeline Support
4. Provide examples
5. Documentation
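Since the proposed `profiler_config` is a plain mapping, a small validation helper can fail fast on misconfiguration. This is a minimal sketch assuming only the keys shown above; the helper name and its checks are illustrative, not part of the PR:

```python
# Illustrative validator for the proposed profiler_config mapping.
# The function name and the exact checks are assumptions for this sketch.
def validate_profiler_config(cfg: dict) -> dict:
    allowed = {"torch"}  # profilers supported by the proposed API
    profiler = cfg.get("profiler")
    if profiler not in allowed:
        raise ValueError(f"unsupported profiler: {profiler!r}")
    if profiler == "torch" and not cfg.get("torch_profiler_dir"):
        raise ValueError("torch profiler requires 'torch_profiler_dir'")
    return cfg
```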
@amy-why-3459 PTAL
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
cmd = task.get("command")
if cmd == "start" and has_profiler:
    try:
        await stage_engine.start_profile()
```
Should asyncio.create_task(stage_engine.start_profile()) be used to avoid blocking?
The current await should be the right choice for correctness and precision. Changing to create_task would risk inaccurate traces, since generation could begin before the profiler is actually active. If start_profile() ever became heavy (unlikely), the vLLM core would handle it differently.
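A minimal stdlib sketch of the ordering argument: with a direct `await`, the traced work cannot start until profiling is active, so trace boundaries stay deterministic. The stub coroutines below are hypothetical stand-ins, not the real engine calls:

```python
# Stub coroutines standing in for stage_engine.start_profile() and the
# generation step; the ordering guarantee is the point, not the bodies.
import asyncio

events = []

async def start_profile():
    await asyncio.sleep(0)        # stand-in for the real async work
    events.append("profiler on")

async def run_step():
    events.append("step")

async def main():
    await start_profile()         # blocks until profiling is active
    await run_step()              # traced work only begins afterwards

asyncio.run(main())
print(events)  # ['profiler on', 'step']
```

With `asyncio.create_task(start_profile())` instead, `run_step()` could be scheduled before the profiler callback finishes, blurring the trace start boundary.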
There is already a warning in the docs that enabling the profiler will reduce performance. #570
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
I think we do not need to call omni.close() manually to shut down the service or process.
From the log, it seems the CPU consumes the majority of the time. Is that accurate?
fix CI please
Sorry for being late and thanks for the review. I’ll work through the issues below ASAP:
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@wuhang2014 PTAL
I tested vLLM's profiler with the model (screenshots: original vs. after the setting). BUT, I'm confused why the trace file is far bigger than plain vLLM's under the current implementation. Looking at the pictures, compared with vLLM, the vLLM-Omni profiler records the additional parts highlighted in the red boxes, which account for the majority of the trace file. @lishunyang12 Could you also help take a look at these additional records?

vllm-omni:
vllm:
More details:
I think the Diffusion Pipeline does not support the profiler yet; we should add start_profile() and stop_profile() there.
Please submit a new issue for the weekly meeting and let's discuss it on Wednesday.
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
elif await handle_profiler_task_async(task_type):
    pass  # Profiler command handled
```
The logic here is a bit confusing. The type check should sit in the elif line, and the handler should then just execute for the matching types. I think the code here should look like:

```python
elif is_profiling_task(task_type):
    await handle_profiling_task_async(task_type)
    continue
...
```
Makes sense. I will modify it later. Thanks!
vllm_omni/entrypoints/omni_stage.py
Outdated
```python
if handle_profiler_task(task_type):
    continue
```
I think the code here should follow the same logic as in the async method:

```python
if is_profiling_task(task_type):
    handle_profiling_task(task_type)
    continue
```
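Both review comments point at the same predicate-plus-handler split. A sketch of what that could look like; the helper names follow the comments above, but the bodies (and the extra `engine` parameter) are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical helpers matching the reviewer's suggested shape: a pure
# predicate for dispatch, plus a handler mapping commands to engine calls.
PROFILING_TASKS = {"start_profile", "stop_profile"}

def is_profiling_task(task_type: str) -> bool:
    # Pure check: safe to call in an elif without side effects.
    return task_type in PROFILING_TASKS

def handle_profiling_task(task_type: str, engine) -> None:
    # Only called once is_profiling_task() has matched.
    if task_type == "start_profile":
        engine.start_profile()
    else:
        engine.stop_profile()
```

Keeping the predicate side-effect-free makes the dispatch loop easier to read: the elif line states *what* matched, and the handler body states *what happens*.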
```python
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
if profiler_enabled:
    omni_llm.start_profile(stages=[0])
```
We should add guidance in the docs for users to start/stop profiling in their own codebase.
```python
print(f"Request ID: {request_id}, Saved audio to {output_wav}")
```

```python
processed_count += len(stage_outputs.request_output)
if profiler_enabled and processed_count >= total_requests:
```
Why should we stop profiling when processed_count >= total_requests?
In [omni](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/entrypoints/omni.py), the worker is killed once we exit the loop, so I call stop_profile() just before exiting the loop.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
I have investigated the trace logs and confirmed that the large file size and high CPU time are expected artifacts of the current multi-process design.

Why? The profiler is accurate; we are seeing the cost of process synchronization (multiprocessing, shm_broadcast), not model execution (CUDA graphs, GEMM, FlashAttention).

Breakdown of the profiling results:

Component Analysis

Why is the high CPU time an illusion?
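A toy, stdlib-only illustration of that illusion: a blocking dequeue that polls spends wall-clock time inside a Python frame, so a trace attributes the wait to that frame even though no model work happens. The function below is invented for this example; in the real traces the equivalent cost sits in shm_broadcast's dequeue:

```python
# Invented stand-in for a shm_broadcast-style polling dequeue: it "burns"
# wall-clock time waiting, which a trace attributes to this frame as if
# it were CPU-heavy work.
import time

def fake_dequeue(poll_interval=0.001, deadline=0.05):
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        time.sleep(poll_interval)  # waiting, not computing
    return "msg"
```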
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Thanks for investigating @lishunyang12. IMO, it's hard not to trace shm_broadcast.py:dequeue if we want to reuse vLLM's profiler. Even though the trace file is large (~70 MB), the current profiler still works. So I suggest we skip this issue temporarily and merge this PR first (once the other issues are fixed), and I will open an issue to track it. After deeper discussion, I think we will find some ways to fix it in a follow-up PR. @hsliuustc0106
Thanks, but we have found a lot of problems with profiling. Please open new issues.
Can we get a similar time cost if we hack the code to force the use of UniProcExecutor? @gcanlin @lishunyang12
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com> Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com> Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com> Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com> Co-authored-by: lishunyang <lishunyang12@163.com>







Purpose
Close #481.
vLLM-Omni supports profiling via PyTorch's built-in profiler, allowing users to capture detailed performance traces of multi-stage inference pipelines. This document describes the architecture and usage of the profiling system.
Task Type Definition
Profiler commands are defined as part of the OmniStageTaskType enum in stage_utils.py.

Basic Usage (Offline Inference)
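A hedged sketch of what that enum could look like; the member names mirror the profiler commands discussed in this PR, but the real definition in stage_utils.py may differ:

```python
# Illustrative task-type enum; only the profiler members matter here.
# Inheriting str keeps members comparable to raw task-type strings.
from enum import Enum

class OmniStageTaskType(str, Enum):
    GENERATE = "generate"
    START_PROFILE = "start_profile"
    STOP_PROFILE = "stop_profile"
```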
Important Notes
Call Order: stop_profile() must be called AFTER inference completes, not immediately after generate(). The generate() method returns a generator, so actual inference happens during iteration.

Stage Selection: Pass stages=[0, 1, 2] to profile specific stages, or None to profile all stages.

Output: Traces are saved to VLLM_TORCH_PROFILER_DIR/ with stage-specific subdirectories.

Output Format

The profiler generates two types of output:

- *.pt.trace.json.gz: can be viewed in Perfetto
- profiler_out_*.txt: human-readable performance summary

Test Plan
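Since the *.pt.trace.json.gz output is gzipped Chrome-format trace JSON, a quick stdlib check can count recorded events before loading a file into Perfetto. The helper is illustrative and assumes the JSON object form with a top-level traceEvents list:

```python
# Count events in a gzipped Chrome-format trace file. Assumes the JSON
# object form with a top-level "traceEvents" list (the common layout for
# exported torch profiler traces).
import gzip
import json

def count_trace_events(path: str) -> int:
    with gzip.open(path, "rt") as f:
        trace = json.load(f)
    return len(trace.get("traceEvents", []))
```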
After investigating the vLLM profiler, I recommend using the env var below.
To avoid generating overly large trace files, it is recommended to limit the number of stages being profiled.
Test Result
profiler_out_0.txt:
Use https://ui.perfetto.dev to view the trace:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.