
[Feature] Support torch profiler across omni stages#553

Merged
hsliuustc0106 merged 17 commits into vllm-project:main from gcanlin:feat-profiler
Jan 5, 2026

Conversation

@gcanlin
Collaborator

@gcanlin gcanlin commented Dec 30, 2025


Purpose

Close #481.

vLLM-Omni supports profiling via PyTorch's built-in profiler, allowing users to capture detailed performance traces of multi-stage inference pipelines. This document describes the architecture and usage of the profiling system.

Task Type Definition

Profiler commands are defined as part of the OmniStageTaskType enum in stage_utils.py:

class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"
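For illustration, a stage loop might dispatch on these task types roughly as follows. This is a minimal sketch under assumptions: the `handle_task` function and its handler table are hypothetical, not the actual vllm-omni implementation, which forwards these commands to the stage's engine process.

```python
import enum

class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"

def handle_task(task_type: OmniStageTaskType) -> str:
    # Hypothetical handlers for illustration only; a real stage worker
    # would route PROFILER_START/PROFILER_STOP to its engine.
    handlers = {
        OmniStageTaskType.PROFILER_START: lambda: "profiler started",
        OmniStageTaskType.PROFILER_STOP: lambda: "profiler stopped",
    }
    handler = handlers.get(task_type)
    return handler() if handler else f"unhandled: {task_type.value}"
```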

Basic Usage (Offline Inference)

from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen3-Omni-7B")

# Start profiling for specific stages
omni.start_profile(stages=[0, 1])

# Run inference - profiling captures this
generator = omni.generate(prompts, sampling_params)
for output in generator:
    process(output)

# Stop profiling AFTER inference completes
omni.stop_profile()

omni.close()

Important Notes

  1. Call Order: stop_profile() must be called AFTER inference completes, not immediately after generate(). The generate() method returns a generator, so actual inference happens during iteration.

  2. Stage Selection: Pass stages=[0, 1, 2] to profile specific stages, or None to profile all stages.

  3. Output: Traces are saved to VLLM_TORCH_PROFILER_DIR/ with stage-specific subdirectories.
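The call-order point is plain Python generator semantics, which a small standalone sketch (unrelated to vllm-omni itself) makes concrete: nothing inside the generator body executes until the caller iterates, so stopping the profiler immediately after the call would capture an empty trace.

```python
events = []

def generate():
    # The body runs only while the generator is iterated,
    # not when generate() is called.
    events.append("inference step")
    yield "output"

gen = generate()
assert events == []  # calling generate() did no work yet

outputs = list(gen)  # iteration drives the actual work
assert events == ["inference step"]
assert outputs == ["output"]
```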

Output Format

Profiler generates two types of output:

  1. Chrome Trace (JSON): *.pt.trace.json.gz - Can be viewed in Perfetto
  2. Summary Table: profiler_out_*.txt - Human-readable performance summary
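The Chrome trace is ordinary gzipped JSON, so it can also be inspected with the standard library. Below is a hedged sketch that writes and reloads a tiny synthetic trace standing in for a real `*.pt.trace.json.gz` file; the event content is illustrative, not actual profiler output.

```python
import gzip
import json
import os
import tempfile

# Tiny synthetic stand-in for a real *.pt.trace.json.gz file.
trace = {"traceEvents": [{"name": "aten::mm", "ph": "X", "dur": 8}]}

path = os.path.join(tempfile.mkdtemp(), "stage0.pt.trace.json.gz")
with gzip.open(path, "wt") as f:
    json.dump(trace, f)

# Perfetto consumes the same gzipped JSON; here we just reload it.
with gzip.open(path, "rt") as f:
    loaded = json.load(f)

names = [event["name"] for event in loaded["traceEvents"]]
print(names)
```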

Test Plan

After investigating the vLLM profiler, I recommend using the environment variables below.

export VLLM_USE_MODELSCOPE=True
export VLLM_TORCH_PROFILER_DIR=./trace_perf
export VLLM_TORCH_PROFILER_WITH_STACK=1
export VLLM_TORCH_PROFILER_RECORD_SHAPES=0
export VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=0
export VLLM_TORCH_PROFILER_WITH_FLOPS=0
export VLLM_TORCH_PROFILER_USE_GZIP=1
export VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL=1
export VLLM_TORCH_PROFILER_DISABLE_ASYNC_LLM=1
export VLLM_PROFILER_MAX_ITERS=1
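Most of these are 0/1 flags. A worker would typically read them along these lines; this is a hedged sketch of the pattern, with `env_flag` as a hypothetical helper, not the actual vLLM parsing code.

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Interpret "1"/"true"/"yes" (case-insensitive) as enabled.
    raw = os.getenv(name, "1" if default else "0")
    return raw.strip().lower() in ("1", "true", "yes")

os.environ["VLLM_TORCH_PROFILER_WITH_STACK"] = "1"
os.environ["VLLM_TORCH_PROFILER_RECORD_SHAPES"] = "0"

assert env_flag("VLLM_TORCH_PROFILER_WITH_STACK") is True
assert env_flag("VLLM_TORCH_PROFILER_RECORD_SHAPES") is False
```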

To avoid generating overly large trace files, it is recommended to limit the number of profiled stages.

omni_llm.start_profile(stages=[1])

Test Result

profiler_out_0.txt:

Details
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 execute_new_1_cached_0         0.00%       0.000us         0.00%       0.000us       0.000us     162.909ms      3083.44%     162.909ms     162.909ms             1  
                           vllm::outplace_fused_experts        13.07%      26.381ms        14.32%      28.914ms       1.446ms       1.260ms        23.85%       1.606ms      80.312us            20  
                                       fused_moe_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.260ms        23.85%       1.260ms      31.502us            40  
                                               aten::mm         3.99%       8.050ms         6.17%      12.462ms     102.991us     949.932us        17.98%     978.540us       8.087us           121  
                                              aten::mul         0.65%       1.314ms         1.12%       2.252ms      12.372us     671.188us        12.70%     671.188us       3.688us           182  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     660.692us        12.51%     660.692us       3.671us           180  
ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f...         0.00%       0.000us         0.00%       0.000us       0.000us     504.757us         9.55%     504.757us       8.275us            61  
                                              aten::cat         0.78%       1.580ms         1.71%       3.458ms      26.809us     433.302us         8.20%     436.854us       3.386us           129  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us     251.577us         4.76%     251.577us       3.145us            80  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     212.222us         4.02%     212.222us       2.101us           101  
                                _vllm_fa2_C::varlen_fwd         0.14%     278.931us         2.74%       5.528ms     276.417us     206.365us         3.91%     232.957us      11.648us            20  
void flash::flash_fwd_splitkv_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us     206.365us         3.91%     206.365us      10.318us            20  
                                  Lazy Function Loading         0.76%       1.539ms         0.76%       1.539ms      54.959us     183.066us         3.46%     183.066us       6.538us            28  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us     165.725us         3.14%     165.725us       4.042us            41  
void cutlass::Kernel2<cutlass_80_tensorop_s16816gemm...         0.00%       0.000us         0.00%       0.000us       0.000us     159.549us         3.02%     159.549us       6.937us            23  
                           _moe_C::moe_align_block_size         0.07%     134.704us         0.24%     477.402us      23.870us     154.812us         2.93%     154.812us       7.741us            20  
                                            aten::copy_         0.34%     692.508us         1.17%       2.355ms      25.596us     149.885us         2.84%     224.732us       2.443us            92  
                       Runtime Triggered Module Loading        41.47%      83.744ms        41.47%      83.744ms       4.652ms     138.876us         2.63%     138.876us       7.715us            18  
                                            aten::index         0.22%     438.690us         0.48%     974.327us      44.288us     137.212us         2.60%     148.891us       6.768us            22  
                                              aten::sum         0.14%     291.610us         0.23%     464.187us      23.209us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::reduce_kernel<128, 4, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     134.780us         2.55%     134.780us       6.418us            21  
                                              aten::add         0.28%     556.915us         0.46%     932.664us       9.818us     134.270us         2.54%     137.918us       1.452us            95  
                                           _C::rms_norm         0.09%     174.059us         0.24%     490.130us      11.954us     133.343us         2.52%     133.343us       3.252us            41  
                                 _C::fused_add_rms_norm         0.11%     212.570us         0.24%     492.322us      12.308us     131.166us         2.48%     131.166us       3.279us            40  
std::enable_if<(((8)>(0)))&&vllm::_typeConvert<c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     131.166us         2.48%     131.166us       3.279us            40  
void vllm::rms_norm_kernel<c10::BFloat16, 8, 3>(c10:...         0.00%       0.000us         0.00%       0.000us       0.000us     130.367us         2.47%     130.367us       3.259us            40  
                                            aten::addmm         0.34%     695.435us         0.48%     959.588us      79.966us     119.358us         2.26%     144.606us      12.051us            12  
                                   _moe_C::topk_softmax         0.07%     144.369us         0.18%     362.292us      18.115us     116.286us         2.20%     116.286us       5.814us            20  
void vllm::moe::topkGatingSoftmax<8, 128, 4, 16, 32,...         0.00%       0.000us         0.00%       0.000us       0.000us     116.286us         2.20%     116.286us       5.814us            20  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     110.783us         2.10%     110.783us       2.770us            40  
void vllm::moe::moe_align_block_size_kernel<int>(int...         0.00%       0.000us         0.00%       0.000us       0.000us     106.684us         2.02%     106.684us       5.334us            20  
                                       _C::silu_and_mul         0.09%     173.470us         0.21%     428.884us      10.722us     101.052us         1.91%     101.052us       2.526us            40  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     101.052us         1.91%     101.052us       2.526us            40  
void cutlass::Kernel2<cutlass_80_wmma_tensorop_s1616...         0.00%       0.000us         0.00%       0.000us       0.000us      86.206us         1.63%      86.206us       4.310us            20  
                                              aten::sub         0.15%     309.465us         0.29%     593.790us      14.138us      83.840us         1.59%      86.080us       2.050us            42  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      74.783us         1.42%      74.783us       5.753us            13  
                  _C_cache_ops::reshape_and_cache_flash         0.07%     133.160us         1.72%       3.483ms     174.130us      69.920us         1.32%      80.096us       4.005us            20  
void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us      69.920us         1.32%      69.920us       3.496us            20  
void dot_kernel<float, 128, 0, cublasDotParams<cubla...         0.00%       0.000us         0.00%       0.000us       0.000us      60.639us         1.15%      60.639us       3.032us            20  
                                          aten::sigmoid         0.11%     221.586us         0.17%     342.447us      17.122us      53.726us         1.02%      53.726us       2.686us            20  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      53.726us         1.02%      53.726us       2.686us            20  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us       0.000us      50.879us         0.96%      50.879us       8.480us             6  
void reduce_1Block_kernel<float, 128, 7, cublasGemvT...         0.00%       0.000us         0.00%       0.000us       0.000us      48.222us         0.91%      48.222us       2.411us            20  
void vllm::moe::count_and_sort_expert_tokens_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      48.128us         0.91%      48.128us       2.406us            20  
                         Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us      39.040us         0.74%      39.040us       2.169us            18  
                                             aten::topk         1.18%       2.373ms        10.73%      21.659ms      21.659ms      30.335us         0.57%     151.675us     151.675us             1  
                              aten::_local_scalar_dense         0.05%     102.563us         0.16%     330.240us       6.740us      28.672us         0.54%      28.672us       0.585us            49  
                                          aten::nonzero         0.10%     200.502us         2.89%       5.843ms       1.948ms      27.678us         0.52%      76.953us      25.651us             3  
ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_6...         0.00%       0.000us         0.00%       0.000us       0.000us      25.727us         0.49%      25.727us       8.576us             3  
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      25.695us         0.49%      25.695us       1.977us            13  
                                               aten::eq         0.10%     198.200us         7.49%      15.118ms       1.260ms      23.647us         0.45%      35.615us       2.968us            12  
                                            aten::fill_         0.05%     110.020us         0.11%     221.114us       4.336us      23.039us         0.44%      23.039us       0.452us            51  
void at::native::sbtopk::gatherTopK<float, unsigned ...         0.00%       0.000us         0.00%       0.000us       0.000us      21.856us         0.41%      21.856us      21.856us             1  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us      19.776us         0.37%      19.776us       3.296us             6  
                                             aten::silu         0.04%      90.091us         0.07%     135.434us      22.572us      16.864us         0.32%      16.864us       2.811us             6  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      16.864us         0.32%      16.864us       2.811us             6  
                                 aten::_index_put_impl_         0.03%      58.090us         0.10%     209.052us      69.684us      12.864us         0.24%      22.720us       7.573us             3  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      12.864us         0.24%      12.864us       6.432us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, __nv...         0.00%       0.000us         0.00%       0.000us       0.000us      12.224us         0.23%      12.224us       4.075us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.808us         0.22%      11.808us       1.968us             6  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.496us         0.20%      10.496us       5.248us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us      10.304us         0.20%      10.304us       3.435us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.471us         0.18%       9.471us       2.368us             4  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       9.216us         0.17%       9.216us       3.072us             3  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.184us         0.17%       9.184us       1.837us             5  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       8.607us         0.16%       8.607us       1.721us             5  
void at::native::radixSortKVInPlace<-2, -1, 32, 4, f...         0.00%       0.000us         0.00%       0.000us       0.000us       8.479us         0.16%       8.479us       8.479us             1  
void at_cuda_detail::cub::DeviceSelectSweepKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.872us         0.15%       7.872us       2.624us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.488us         0.14%       7.488us       2.496us             3  
                                     aten::index_select         0.89%       1.796ms         0.96%       1.936ms     645.364us       7.231us         0.14%      12.159us       4.053us             3  
void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.231us         0.14%       7.231us       7.231us             1  
void at_cuda_detail::cub::DeviceReduceSingleTileKern...         0.00%       0.000us         0.00%       0.000us       0.000us       7.103us         0.13%       7.103us       2.368us             3  
                                           aten::argmax         0.01%      19.123us         0.01%      29.241us      29.241us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.784us         0.13%       6.784us       3.392us             2  
                                             aten::div_         0.01%      23.018us         0.02%      42.255us      21.127us       6.688us         0.13%       6.688us       3.344us             2  
                                     aten::masked_fill_         0.02%      30.954us         0.05%      99.651us      24.913us       6.464us         0.12%       8.640us       2.160us             4  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       6.176us         0.12%       6.176us       2.059us             3  
                                             aten::sub_         0.01%      27.663us         0.03%      68.612us      22.871us       6.176us         0.12%       8.160us       2.720us             3  
                                           aten::gather         0.02%      45.116us         0.06%     125.564us      62.782us       5.664us         0.11%       8.128us       4.064us             2  
void at_cuda_detail::cub::DeviceCompactInitKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.343us         0.10%       5.343us       1.781us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.248us         0.10%       5.248us       1.749us             3  
                                         aten::_softmax         0.01%      12.536us         0.01%      24.545us      24.545us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us         0.09%       4.831us       2.415us             2  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us         0.09%       4.768us       4.768us             1  
                                       aten::bitwise_or         0.02%      34.568us         5.54%      11.187ms       5.594ms       4.576us         0.09%      11.968us       5.984us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.576us         0.09%       4.576us       2.288us             2  
                                      aten::bitwise_not         0.02%      35.313us         0.05%      96.025us      48.013us       4.415us         0.08%       6.622us       3.311us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.415us         0.08%       4.415us       2.208us             2  
                                              aten::max         0.07%     143.851us         0.58%       1.181ms      34.721us       4.352us         0.08%       4.352us       0.128us            34  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       4.352us         0.08%       4.352us       4.352us             1  
                                               aten::gt         0.02%      34.013us         0.03%      54.536us      27.268us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us         0.08%       4.288us       2.144us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.192us         0.08%       4.192us       2.096us             2  
                                              aten::any         0.02%      46.241us        11.90%      24.028ms      24.028ms       3.968us         0.08%      15.872us      15.872us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us         0.08%       3.968us       3.968us             1  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.648us         0.07%       3.648us       3.648us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 201.919ms
Self CUDA time total: 5.283ms
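The summary footer can be scraped with a couple of lines, e.g. to compare CPU vs. CUDA totals across stages. A sketch over the literal footer format shown above; the regex is an assumption about that format, not part of vllm-omni.

```python
import re

# Footer lines exactly as they appear in profiler_out_0.txt above.
footer = (
    "Self CPU time total: 201.919ms\n"
    "Self CUDA time total: 5.283ms\n"
)

totals = dict(re.findall(r"Self (CPU|CUDA) time total: ([\d.]+)ms", footer))
print(totals)
```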

Use https://ui.perfetto.dev to view the trace:

[screenshot: trace opened in Perfetto]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin mentioned this pull request Dec 31, 2025
1 task
@lishunyang12
Contributor

As discussed in the weekly meeting, we have the following TODOs under this pull request:

1. Rebase
Recent updates introduce an omni_base class that is subclassed by the omni and async_omni classes. We can place start_profile() and stop_profile() in omni_base instead of replicating the same functions in both child classes.

2. API design
To ease the transition for vLLM developers moving to vLLM-Omni, we should stick to the vLLM profiler's API design.
https://docs.vllm.ai/en/stable/contributing/profiling/#profile-with-pytorch-profiler

  • Offline Inference:
llm = LLM(
    model="xx",
    tensor_parallel_size=1,
    profiler_config={
        "profiler": "torch",
        "torch_profiler_dir": "./vllm_profile",
    },
)

llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
llm.stop_profile()
  • Online Serving:
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

3. Diffusion Pipeline Support
The diffusion engine is newly introduced in Omni and should implement start_profile() and stop_profile(). We can reuse the vLLM profiler module.
https://github.com/vllm-project/vllm/tree/main/vllm/profiler

4. Provide examples
To align with vLLM, we can provide example scripts specifically for offline inference profiling. I recommend trying small models like qwen2.5-omni-7B/3B for Omni and z-image for Diffusion.

5. Documentation
Provide developers with a user guide at https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/.

@hsliuustc0106
Collaborator

@amy-why-3459 PTAL

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator Author

gcanlin commented Dec 31, 2025

  1. I have rebased the code and extracted the common methods into the OmniBase class.
  2. The offline profiler API has been implemented. However, the --profiler-config flag was introduced by [Cleanup] Refactor profiling env vars into a CLI config vllm#29912, which isn't included in vLLM 0.12.0, so we will have to support it in the next version.
  3. I'd prefer to implement the diffusion profiler in a follow-up PR.

cmd = task.get("command")
if cmd == "start" and has_profiler:
    try:
        await stage_engine.start_profile()
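The snippet above awaits the profiler command before continuing. As a generic asyncio illustration (not vllm-omni code), the ordering difference versus asyncio.create_task looks like this, with `start_profile` a toy stand-in:

```python
import asyncio

async def start_profile(order):
    order.append("profile started")

async def with_await():
    order = []
    await start_profile(order)        # command completes before moving on
    order.append("handle next task")
    return order

async def with_create_task():
    order = []
    task = asyncio.create_task(start_profile(order))
    order.append("handle next task")  # loop proceeds immediately
    await task                        # command runs at a later yield point
    return order

assert asyncio.run(with_await()) == ["profile started", "handle next task"]
assert asyncio.run(with_create_task()) == ["handle next task", "profile started"]
```

With create_task, the profiler would start at some later point in the event loop, which is why the reviewers preferred await for trace accuracy.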
Contributor

Should asyncio.create_task(stage_engine.start_profile()) be used to avoid blocking?

Contributor

The current await should be the right choice for correctness and precision. Changing to create_task would risk inaccurate traces. If start_profile() ever became heavy (unlikely), the vLLM core would handle it differently.

Contributor

There is already a warning in the docs that enabling the profiler will reduce performance. #570

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin and others added 3 commits December 31, 2025 16:43
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin marked this pull request as ready for review December 31, 2025 17:08
@amy-why-3459 amy-why-3459 mentioned this pull request Jan 4, 2026
35 tasks
@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) Jan 4, 2026
                       Runtime Triggered Module Loading        41.47%      83.744ms        41.47%      83.744ms       4.652ms     138.876us         2.63%     138.876us       7.715us            18  
                                            aten::index         0.22%     438.690us         0.48%     974.327us      44.288us     137.212us         2.60%     148.891us       6.768us            22  
                                              aten::sum         0.14%     291.610us         0.23%     464.187us      23.209us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::reduce_kernel<128, 4, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     134.780us         2.55%     134.780us       6.418us            21  
                                              aten::add         0.28%     556.915us         0.46%     932.664us       9.818us     134.270us         2.54%     137.918us       1.452us            95  
                                           _C::rms_norm         0.09%     174.059us         0.24%     490.130us      11.954us     133.343us         2.52%     133.343us       3.252us            41  
                                 _C::fused_add_rms_norm         0.11%     212.570us         0.24%     492.322us      12.308us     131.166us         2.48%     131.166us       3.279us            40  
std::enable_if<(((8)>(0)))&&vllm::_typeConvert<c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     131.166us         2.48%     131.166us       3.279us            40  
void vllm::rms_norm_kernel<c10::BFloat16, 8, 3>(c10:...         0.00%       0.000us         0.00%       0.000us       0.000us     130.367us         2.47%     130.367us       3.259us            40  
                                            aten::addmm         0.34%     695.435us         0.48%     959.588us      79.966us     119.358us         2.26%     144.606us      12.051us            12  
                                   _moe_C::topk_softmax         0.07%     144.369us         0.18%     362.292us      18.115us     116.286us         2.20%     116.286us       5.814us            20  
void vllm::moe::topkGatingSoftmax<8, 128, 4, 16, 32,...         0.00%       0.000us         0.00%       0.000us       0.000us     116.286us         2.20%     116.286us       5.814us            20  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     110.783us         2.10%     110.783us       2.770us            40  
void vllm::moe::moe_align_block_size_kernel<int>(int...         0.00%       0.000us         0.00%       0.000us       0.000us     106.684us         2.02%     106.684us       5.334us            20  
                                       _C::silu_and_mul         0.09%     173.470us         0.21%     428.884us      10.722us     101.052us         1.91%     101.052us       2.526us            40  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     101.052us         1.91%     101.052us       2.526us            40  
void cutlass::Kernel2<cutlass_80_wmma_tensorop_s1616...         0.00%       0.000us         0.00%       0.000us       0.000us      86.206us         1.63%      86.206us       4.310us            20  
                                              aten::sub         0.15%     309.465us         0.29%     593.790us      14.138us      83.840us         1.59%      86.080us       2.050us            42  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      74.783us         1.42%      74.783us       5.753us            13  
                  _C_cache_ops::reshape_and_cache_flash         0.07%     133.160us         1.72%       3.483ms     174.130us      69.920us         1.32%      80.096us       4.005us            20  
void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us      69.920us         1.32%      69.920us       3.496us            20  
void dot_kernel<float, 128, 0, cublasDotParams<cubla...         0.00%       0.000us         0.00%       0.000us       0.000us      60.639us         1.15%      60.639us       3.032us            20  
                                          aten::sigmoid         0.11%     221.586us         0.17%     342.447us      17.122us      53.726us         1.02%      53.726us       2.686us            20  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      53.726us         1.02%      53.726us       2.686us            20  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us       0.000us      50.879us         0.96%      50.879us       8.480us             6  
void reduce_1Block_kernel<float, 128, 7, cublasGemvT...         0.00%       0.000us         0.00%       0.000us       0.000us      48.222us         0.91%      48.222us       2.411us            20  
void vllm::moe::count_and_sort_expert_tokens_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      48.128us         0.91%      48.128us       2.406us            20  
                         Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us      39.040us         0.74%      39.040us       2.169us            18  
                                             aten::topk         1.18%       2.373ms        10.73%      21.659ms      21.659ms      30.335us         0.57%     151.675us     151.675us             1  
                              aten::_local_scalar_dense         0.05%     102.563us         0.16%     330.240us       6.740us      28.672us         0.54%      28.672us       0.585us            49  
                                          aten::nonzero         0.10%     200.502us         2.89%       5.843ms       1.948ms      27.678us         0.52%      76.953us      25.651us             3  
ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_6...         0.00%       0.000us         0.00%       0.000us       0.000us      25.727us         0.49%      25.727us       8.576us             3  
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      25.695us         0.49%      25.695us       1.977us            13  
                                               aten::eq         0.10%     198.200us         7.49%      15.118ms       1.260ms      23.647us         0.45%      35.615us       2.968us            12  
                                            aten::fill_         0.05%     110.020us         0.11%     221.114us       4.336us      23.039us         0.44%      23.039us       0.452us            51  
void at::native::sbtopk::gatherTopK<float, unsigned ...         0.00%       0.000us         0.00%       0.000us       0.000us      21.856us         0.41%      21.856us      21.856us             1  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us      19.776us         0.37%      19.776us       3.296us             6  
                                             aten::silu         0.04%      90.091us         0.07%     135.434us      22.572us      16.864us         0.32%      16.864us       2.811us             6  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      16.864us         0.32%      16.864us       2.811us             6  
                                 aten::_index_put_impl_         0.03%      58.090us         0.10%     209.052us      69.684us      12.864us         0.24%      22.720us       7.573us             3  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      12.864us         0.24%      12.864us       6.432us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, __nv...         0.00%       0.000us         0.00%       0.000us       0.000us      12.224us         0.23%      12.224us       4.075us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.808us         0.22%      11.808us       1.968us             6  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.496us         0.20%      10.496us       5.248us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us      10.304us         0.20%      10.304us       3.435us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.471us         0.18%       9.471us       2.368us             4  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       9.216us         0.17%       9.216us       3.072us             3  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.184us         0.17%       9.184us       1.837us             5  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       8.607us         0.16%       8.607us       1.721us             5  
void at::native::radixSortKVInPlace<-2, -1, 32, 4, f...         0.00%       0.000us         0.00%       0.000us       0.000us       8.479us         0.16%       8.479us       8.479us             1  
void at_cuda_detail::cub::DeviceSelectSweepKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.872us         0.15%       7.872us       2.624us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.488us         0.14%       7.488us       2.496us             3  
                                     aten::index_select         0.89%       1.796ms         0.96%       1.936ms     645.364us       7.231us         0.14%      12.159us       4.053us             3  
void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.231us         0.14%       7.231us       7.231us             1  
void at_cuda_detail::cub::DeviceReduceSingleTileKern...         0.00%       0.000us         0.00%       0.000us       0.000us       7.103us         0.13%       7.103us       2.368us             3  
                                           aten::argmax         0.01%      19.123us         0.01%      29.241us      29.241us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.784us         0.13%       6.784us       3.392us             2  
                                             aten::div_         0.01%      23.018us         0.02%      42.255us      21.127us       6.688us         0.13%       6.688us       3.344us             2  
                                     aten::masked_fill_         0.02%      30.954us         0.05%      99.651us      24.913us       6.464us         0.12%       8.640us       2.160us             4  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       6.176us         0.12%       6.176us       2.059us             3  
                                             aten::sub_         0.01%      27.663us         0.03%      68.612us      22.871us       6.176us         0.12%       8.160us       2.720us             3  
                                           aten::gather         0.02%      45.116us         0.06%     125.564us      62.782us       5.664us         0.11%       8.128us       4.064us             2  
void at_cuda_detail::cub::DeviceCompactInitKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.343us         0.10%       5.343us       1.781us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.248us         0.10%       5.248us       1.749us             3  
                                         aten::_softmax         0.01%      12.536us         0.01%      24.545us      24.545us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us         0.09%       4.831us       2.415us             2  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us         0.09%       4.768us       4.768us             1  
                                       aten::bitwise_or         0.02%      34.568us         5.54%      11.187ms       5.594ms       4.576us         0.09%      11.968us       5.984us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.576us         0.09%       4.576us       2.288us             2  
                                      aten::bitwise_not         0.02%      35.313us         0.05%      96.025us      48.013us       4.415us         0.08%       6.622us       3.311us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.415us         0.08%       4.415us       2.208us             2  
                                              aten::max         0.07%     143.851us         0.58%       1.181ms      34.721us       4.352us         0.08%       4.352us       0.128us            34  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       4.352us         0.08%       4.352us       4.352us             1  
                                               aten::gt         0.02%      34.013us         0.03%      54.536us      27.268us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us         0.08%       4.288us       2.144us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.192us         0.08%       4.192us       2.096us             2  
                                              aten::any         0.02%      46.241us        11.90%      24.028ms      24.028ms       3.968us         0.08%      15.872us      15.872us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us         0.08%       3.968us       3.968us             1  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.648us         0.07%       3.648us       3.648us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 201.919ms
Self CUDA time total: 5.283ms

Use https://ui.perfetto.dev to view the trace:
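Before uploading a trace, you can check which event names dominate the gzipped Chrome trace using only the standard library, which is handy when a trace file is unexpectedly large. This is a sketch assuming the standard Chrome trace layout (a JSON object with a `traceEvents` list whose entries carry `name` fields); the file path is a placeholder:

```python
import gzip
import json
from collections import Counter


def summarize_trace(path, top=10):
    """Count Chrome-trace events by name to see what dominates a
    *.pt.trace.json.gz file. Assumes the standard 'traceEvents' layout."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    counts = Counter(ev.get("name", "<unnamed>")
                     for ev in trace.get("traceEvents", []))
    return counts.most_common(top)


if __name__ == "__main__":
    # Placeholder path: substitute a real trace from VLLM_TORCH_PROFILER_DIR.
    for name, n in summarize_trace("stage_0.pt.trace.json.gz"):
        print(f"{n:8d}  {name}")
```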

image

Essential Elements of an Effective PR Description Checklist

* [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
* [ ] The test plan, such as providing test command.
* [ ] The test results, such as pasting the results comparison before and after, or e2e results
* [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
* [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

I think we do not need to call omni.close() manually to shut down the service or process.

@hsliuustc0106
Collaborator

From the log, it seems the CPU consumes the majority of the time. Is that accurate?

@hsliuustc0106
Collaborator

fix ci please

@gcanlin
Collaborator Author

gcanlin commented Jan 4, 2026

fix ci please

Sorry for being late and thanks for the review. I’ll work through the issues below ASAP:

  1. The CUDA time total seems inaccurate.
  2. CI fails.
  3. Based on @lishunyang12's feedback, the multi-stage profiler may have some bugs.

lishunyang12 and others added 3 commits January 4, 2026 23:09
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@hsliuustc0106
Collaborator

@wuhang2014 PTAL

@gcanlin
Collaborator Author

gcanlin commented Jan 5, 2026

I tested vLLM's profiler with the model Qwen/Qwen2.5-Omni-7B (thinker only) and got the results below, which explain why CUDA time is much smaller than CPU time. So yes, it's accurate, because we're reusing vLLM's profiler. And setting VLLM_PROFILER_MAX_ITERS=1 greatly reduces CUDA time.

(vllm) Original:

(EngineCore_DP0 pid=719568) Self CPU time total: 249.078ms
(EngineCore_DP0 pid=719568) Self CUDA time total: 184.041ms

(vllm) After setting export VLLM_PROFILER_MAX_ITERS=1:

(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms

BUT, I'm confused about why the trace file is far larger than vLLM's (without Omni) under the current implementation. Looking at the pictures, compared with vLLM, the vLLM-Omni profiler records additional parts (highlighted in the red boxes) that account for the majority of the trace file. @lishunyang12 Could you also help take a look at these additional records?

vllm-omni:

image

vllm:

image

The more details:

Details
INFO 01-05 02:28:17 [llm.py:360] Supported tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 553.36it/s]
Processed prompts:   0%|                                            | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=748665) INFO 01-05 02:28:18 [wrapper.py:108] Max profiling iterations reached. Stopping profiler...
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665)                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665)                                  execute_new_1_cached_0         0.00%       0.000us         0.00%       0.000us       0.000us      16.765ms       161.18%      16.765ms      16.765ms             1
(EngineCore_DP0 pid=748665) ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f...         0.00%       0.000us         0.00%       0.000us       0.000us       5.129ms        49.31%       5.129ms     176.877us            29
(EngineCore_DP0 pid=748665)       ampere_s16816gemm_bf16_128x64_ldg8_stages_32x6_tn         0.00%       0.000us         0.00%       0.000us       0.000us       2.462ms        23.67%       2.462ms      87.941us            28
(EngineCore_DP0 pid=748665)        ampere_s16816gemm_bf16_64x64_ldg8_stages_64x6_tn         0.00%       0.000us         0.00%       0.000us       0.000us     753.941us         7.25%     753.941us      26.926us            28
(EngineCore_DP0 pid=748665) ampere_bf16_s16816gemm_bf16_64x64_ldg8_relu_f2f_stag...         0.00%       0.000us         0.00%       0.000us       0.000us     751.668us         7.23%     751.668us      26.845us            28
(EngineCore_DP0 pid=748665)                                                aten::mm         0.02%      78.934us         0.02%      91.386us      91.386us     618.101us         5.94%     618.101us     618.101us             1
(EngineCore_DP0 pid=748665)                                 _vllm_fa2_C::varlen_fwd         0.07%     286.438us         0.94%       4.019ms     143.549us     292.121us         2.81%     317.849us      11.352us            28
(EngineCore_DP0 pid=748665) void flash::flash_fwd_splitkv_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us     292.121us         2.81%     292.121us      10.433us            28
(EngineCore_DP0 pid=748665)                                   Lazy Function Loading         0.11%     455.061us         0.11%     455.061us      41.369us     244.568us         2.35%     244.568us      22.233us            11
(EngineCore_DP0 pid=748665)                        Runtime Triggered Module Loading         4.13%      17.717ms         4.13%      17.717ms       4.429ms     156.540us         1.50%     156.540us      39.135us             4
(EngineCore_DP0 pid=748665)                                          aten::_softmax         0.01%      42.400us         0.02%      80.406us      40.203us     153.725us         1.48%     153.725us      76.862us             2
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us     153.725us         1.48%     153.725us      76.862us             2
(EngineCore_DP0 pid=748665) void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     149.120us         1.43%     149.120us       2.663us            56
(EngineCore_DP0 pid=748665)                   _C_cache_ops::reshape_and_cache_flash         0.03%     149.870us         0.08%     360.388us      12.871us     109.023us         1.05%     109.023us       3.894us            28
(EngineCore_DP0 pid=748665) void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us     109.023us         1.05%     109.023us       3.894us            28
(EngineCore_DP0 pid=748665)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2         0.00%       0.000us         0.00%       0.000us       0.000us     106.525us         1.02%     106.525us       3.804us            28
(EngineCore_DP0 pid=748665)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0         0.00%       0.000us         0.00%       0.000us       0.000us      91.771us         0.88%      91.771us       3.278us            28
(EngineCore_DP0 pid=748665) triton_poi_fused_cat_index_select_split_split_with_s...         0.00%       0.000us         0.00%       0.000us       0.000us      79.486us         0.76%      79.486us       2.944us            27
(EngineCore_DP0 pid=748665)                       triton_poi_fused_mul_silu_slice_1         0.00%       0.000us         0.00%       0.000us       0.000us      75.453us         0.73%      75.453us       2.695us            28
(EngineCore_DP0 pid=748665)                                              aten::sort         0.03%     149.280us        95.86%     410.913ms     410.913ms      69.694us         0.67%     420.692us     420.692us             1
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_4         0.00%       0.000us         0.00%       0.000us       0.000us      60.511us         0.58%      60.511us       2.241us            27
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortOnesweepKer...         0.00%       0.000us         0.00%       0.000us       0.000us      48.094us         0.46%      48.094us      12.023us             4
(EngineCore_DP0 pid=748665)                                             aten::copy_         0.06%     258.622us         0.53%       2.280ms      78.607us      40.800us         0.39%      59.584us       2.055us            29
(EngineCore_DP0 pid=748665)                                         cudaGraphLaunch         0.20%     843.738us         0.20%     843.738us      29.094us      14.496us         0.14%      14.496us       0.500us            29
(EngineCore_DP0 pid=748665)                                         Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      12.032us         0.12%      12.032us       1.719us             7
(EngineCore_DP0 pid=748665)                                            aten::argmax         0.01%      38.972us         0.02%      67.032us      67.032us      11.456us         0.11%      11.456us      11.456us             1
(EngineCore_DP0 pid=748665)                                   cudaStreamIsCapturing         0.01%      24.967us         0.01%      24.967us       0.780us      10.656us         0.10%      10.656us       0.333us            32
(EngineCore_DP0 pid=748665) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us      10.432us         0.10%      10.432us      10.432us             1
(EngineCore_DP0 pid=748665)                                           aten::scatter         0.01%      33.678us         0.02%      87.305us      87.305us      10.399us         0.10%      12.927us      12.927us             1
(EngineCore_DP0 pid=748665) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      10.399us         0.10%      10.399us      10.399us             1
(EngineCore_DP0 pid=748665)                          Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.368us         0.10%      10.368us       1.152us             9
(EngineCore_DP0 pid=748665)                                              aten::div_         0.01%      27.728us         0.01%      45.493us      22.746us       8.064us         0.08%       8.064us       4.032us             2
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.392us         0.07%       7.392us       2.464us             3
(EngineCore_DP0 pid=748665)                          Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.136us         0.07%       7.136us       2.379us             3
(EngineCore_DP0 pid=748665)                                      aten::index_select         0.01%      40.458us         0.03%     128.561us      64.280us       7.008us         0.07%      14.016us       7.008us             2
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.008us         0.07%       7.008us       7.008us             1
(EngineCore_DP0 pid=748665)                                             aten::index         0.01%      25.503us         0.02%      70.290us      70.290us       6.240us         0.06%       9.216us       9.216us             1
(EngineCore_DP0 pid=748665) void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       6.240us         0.06%       6.240us       6.240us             1
(EngineCore_DP0 pid=748665)                                            aten::cumsum         0.01%      53.262us         0.04%     188.388us     188.388us       6.047us         0.06%      18.141us      18.141us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.728us         0.06%       5.728us       5.728us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortHistogramKe...         0.00%       0.000us         0.00%       0.000us       0.000us       5.536us         0.05%       5.536us       5.536us             1
(EngineCore_DP0 pid=748665)                                      aten::exponential_         0.01%      21.808us         0.01%      40.378us      40.378us       4.960us         0.05%       4.960us       4.960us             1
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::distribution...         0.00%       0.000us         0.00%       0.000us       0.000us       4.960us         0.05%       4.960us       4.960us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceScanKernel<at_cuda_d...         0.00%       0.000us         0.00%       0.000us       0.000us       4.704us         0.05%       4.704us       4.704us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.256us         0.04%       4.256us       4.256us             1
(EngineCore_DP0 pid=748665)                                                aten::le         0.01%      23.814us         0.01%      42.912us      42.912us       4.224us         0.04%       4.224us       4.224us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.224us         0.04%       4.224us       4.224us             1
(EngineCore_DP0 pid=748665)                                    cudaDriverGetVersion         0.00%       4.996us         0.00%       4.996us       0.172us       4.032us         0.04%       4.032us       0.139us            29
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_0         0.00%       0.000us         0.00%       0.000us       0.000us       4.000us         0.04%       4.000us       4.000us             1
(EngineCore_DP0 pid=748665)                          Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us       3.936us         0.04%       3.936us       3.936us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.808us         0.04%       3.808us       3.808us             1
(EngineCore_DP0 pid=748665)                                               aten::sub         0.01%      60.373us         0.03%     117.496us      39.165us       3.520us         0.03%       5.312us       1.771us             3
(EngineCore_DP0 pid=748665)                                      triton_red_fused_1         0.00%       0.000us         0.00%       0.000us       0.000us       3.519us         0.03%       3.519us       3.519us             1
(EngineCore_DP0 pid=748665) at::native::(anonymous namespace)::fill_reverse_indi...         0.00%       0.000us         0.00%       0.000us       0.000us       2.880us         0.03%       2.880us       2.880us             1
(EngineCore_DP0 pid=748665)                                      aten::masked_fill_         0.00%      17.340us         0.01%      30.540us      30.540us       2.688us         0.03%       2.688us       2.688us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.688us         0.03%       2.688us       2.688us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       2.464us         0.02%       2.464us       2.464us             1
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_2         0.00%       0.000us         0.00%       0.000us       0.000us       2.336us         0.02%       2.336us       2.336us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       2.240us         0.02%       2.240us       2.240us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortExclusiveSu...         0.00%       0.000us         0.00%       0.000us       0.000us       2.176us         0.02%       2.176us       2.176us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.792us         0.02%       1.792us       1.792us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.728us         0.02%       1.728us       1.728us             1
(EngineCore_DP0 pid=748665)                        Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.536us         0.01%       1.536us       1.536us             1
(EngineCore_DP0 pid=748665)                                 Activity Buffer Request         0.33%       1.415ms         0.33%       1.415ms       1.415ms       1.504us         0.01%       1.504us       1.504us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceScanInitKernel<at_cu...         0.00%       0.000us         0.00%       0.000us       0.000us       1.343us         0.01%       1.343us       1.343us             1
(EngineCore_DP0 pid=748665)                                             aten::fill_         0.01%      37.675us         0.02%      97.573us      10.841us       1.312us         0.01%       2.624us       0.292us             9
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.312us         0.01%       1.312us       1.312us             1
(EngineCore_DP0 pid=748665)                                  execute_new_1_cached_0         1.24%       5.328ms         3.86%      16.535ms      16.535ms       0.000us         0.00%       1.126ms       1.126ms             1
(EngineCore_DP0 pid=748665)                                             aten::empty         0.05%     232.853us         0.05%     232.853us       3.105us       0.000us         0.00%       0.000us       0.000us            75
(EngineCore_DP0 pid=748665)                                                aten::to         0.00%      20.043us         0.06%     276.869us      18.458us       0.000us         0.00%      17.600us       1.173us            15
(EngineCore_DP0 pid=748665)                                        aten::lift_fresh         0.00%       5.186us         0.00%       5.186us       0.519us       0.000us         0.00%       0.000us       0.000us            10
(EngineCore_DP0 pid=748665)                                           aten::detach_         0.00%       2.036us         0.00%       2.036us       0.291us       0.000us         0.00%       0.000us       0.000us             7
(EngineCore_DP0 pid=748665)                                            aten::expand         0.04%     160.158us         0.05%     226.821us       2.493us       0.000us         0.00%       0.000us       0.000us            91
(EngineCore_DP0 pid=748665)                                        aten::as_strided         0.05%     233.168us         0.05%     233.168us       0.918us       0.000us         0.00%       0.000us       0.000us           254
(EngineCore_DP0 pid=748665)                                               aten::max         0.01%      27.310us         0.02%      84.039us      16.808us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                        aten::contiguous         0.00%       7.237us         0.01%      59.929us      11.986us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                             aten::clone         0.00%      15.910us         0.01%      52.692us      10.538us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                        aten::empty_like         0.01%      64.137us         0.06%     256.336us       7.324us       0.000us         0.00%       0.000us       0.000us            35
(EngineCore_DP0 pid=748665)                                               aten::add         0.00%      13.370us         0.00%      13.370us       2.674us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                              aten::item         0.00%       9.374us         0.00%      12.668us       2.534us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                               aten::_local_scalar_dense         0.00%       3.294us         0.00%       3.294us       0.659us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                               aten::cat         0.00%      20.270us         0.01%      39.340us      19.670us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                            aten::narrow         0.00%      12.987us         0.00%      19.070us       9.535us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                             aten::slice         0.06%     239.412us         0.08%     326.962us       3.336us       0.000us         0.00%       0.000us       0.000us            98
(EngineCore_DP0 pid=748665)                                         cudaMemcpyAsync         0.04%     188.137us         0.04%     188.137us      13.438us       0.000us         0.00%       0.000us       0.000us            14
(EngineCore_DP0 pid=748665)                                             aten::alias         0.00%       5.078us         0.00%       5.078us       2.539us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                           aten::flatten         0.00%       7.492us         0.00%      11.670us      11.670us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                                              aten::view         0.00%       8.955us         0.00%       8.955us       2.985us       0.000us         0.00%       0.000us       0.000us             3
(EngineCore_DP0 pid=748665)                                         aten::expand_as         0.00%       5.301us         0.00%      13.241us       6.621us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                        cudaLaunchKernel         0.19%     830.662us         0.27%       1.153ms      13.410us       0.000us         0.00%     160.251us       1.863us            86
(EngineCore_DP0 pid=748665)                                          aten::_to_copy         0.01%      30.874us         0.06%     256.826us      51.365us       0.000us         0.00%      17.600us       3.520us             5
(EngineCore_DP0 pid=748665)                                     aten::empty_strided         0.05%     219.729us         0.05%     219.729us       6.463us       0.000us         0.00%       0.000us       0.000us            34
(EngineCore_DP0 pid=748665)                                         aten::embedding         0.00%      12.115us         0.03%     135.502us     135.502us       0.000us         0.00%      14.016us      14.016us             1
(EngineCore_DP0 pid=748665)                                           aten::resize_         0.00%       6.732us         0.00%       6.732us       6.732us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                                       Pregraph bytecode         0.04%     176.615us         0.04%     176.615us     176.615us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                     vllm::unified_attention_with_output         0.54%       2.306ms         1.76%       7.555ms     269.834us       0.000us         0.00%     426.872us      15.245us            28
(EngineCore_DP0 pid=748665)                                            aten::unbind         0.03%     119.471us         0.07%     295.443us      10.552us       0.000us         0.00%       0.000us       0.000us            28
(EngineCore_DP0 pid=748665)                                            aten::select         0.03%     121.132us         0.04%     184.344us       3.234us       0.000us         0.00%       0.000us       0.000us            57
(EngineCore_DP0 pid=748665)                                  cudaDeviceGetAttribute         0.01%      28.127us         0.01%      28.127us       0.485us       0.000us         0.00%       0.000us       0.000us            58
(EngineCore_DP0 pid=748665)                                    cudaFuncSetAttribute         0.01%      46.421us         0.78%       3.349ms     119.621us       0.000us         0.00%      25.728us       0.919us            28
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms
(EngineCore_DP0 pid=748665)

@david6666666
Collaborator

As discussed in the weekly meeting, we have these TODOs under this pull request:

1. Rebase: Recent updates introduce an omni_base class that is subclassed by the omni and async_omni classes. We can place start_profile() and stop_profile() in omni_base instead of replicating the same functions in the two child classes.

2. API design: To help vLLM developers transition to vLLM-Omni, we should stick to the vLLM profiler's API design. https://docs.vllm.ai/en/stable/contributing/profiling/#profile-with-pytorch-profiler

  • Offline Inference:

    llm = LLM(
        model="xx",
        tensor_parallel_size=1,
        profiler_config={
            "profiler": "torch",
            "torch_profiler_dir": "./vllm_profile",
        },
    )

    llm.start_profile()
    outputs = llm.generate(prompts, sampling_params)
    llm.stop_profile()
  • Online Serving:
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

3. Diffusion Pipeline Support: The diffusion engine is newly introduced in Omni and should get implementations of start_profile() and stop_profile(). We can reuse vLLM's profiler module. https://github.com/vllm-project/vllm/tree/main/vllm/profiler

4. Provide examples: To align with vLLM, we can provide test files specifically for offline inference profiling. I recommend trying small models like Qwen2.5-Omni-7B/3B for Omni and Z-Image for Diffusion.

5. Documentation: Provide developers with a user guide at https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/.

I think the Diffusion Pipeline does not support the profiler yet; we should add start_profile() and stop_profile() there.
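The omni_base refactor from item 1 could be sketched roughly as below. This is only an illustration: the class and method names follow the comment, the stage-task strings follow the OmniStageTaskType enum in stage_utils.py, and the transport is stubbed out (the real code would forward tasks over the existing IPC channel).

```python
class OmniBase:
    """Hypothetical shared base for Omni and AsyncOmni (sketch only)."""

    def __init__(self):
        # Stand-in for the real IPC channel: just record what would be sent.
        self.sent = []

    def _send_profiler_task(self, task_type, stages):
        for stage in stages:
            self.sent.append((task_type, stage))

    def start_profile(self, stages=None):
        targets = stages if stages is not None else self.all_stages()
        self._send_profiler_task("profiler_start", targets)

    def stop_profile(self, stages=None):
        targets = stages if stages is not None else self.all_stages()
        self._send_profiler_task("profiler_stop", targets)

    def all_stages(self):
        raise NotImplementedError


class Omni(OmniBase):
    # Stub subclass; the real Omni would discover its stages at init time.
    def all_stages(self):
        return [0, 1, 2]
```

With this shape, both the sync and async entrypoints inherit the two methods and only override stage discovery and the send path.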

@hsliuustc0106
Collaborator

I tested vLLM's profiler with the model Qwen/Qwen2.5-Omni-7B (thinker only) and got the results below, which explain why the CUDA time is much smaller than the CPU time. So yes, it's accurate, since we're reusing vLLM's profiler. Also, setting VLLM_PROFILER_MAX_ITERS=1 reduces the CUDA time considerably.

(vllm) Original:

(EngineCore_DP0 pid=719568) Self CPU time total: 249.078ms
(EngineCore_DP0 pid=719568) Self CUDA time total: 184.041ms

(vllm) After setting export VLLM_PROFILER_MAX_ITERS=1:

(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms

BUT, I'm confused about why the trace file is far bigger than plain vLLM's under the current implementation. Looking at the pictures, compared with vLLM, the vLLM-Omni profiler records the additional parts highlighted in the red boxes, which account for the majority of the trace file. @lishunyang12 Could you also help take a look at these additional records?

vllm-omni: (trace screenshot)

vllm: (trace screenshot)

(further details collapsed)

Please submit a new issue for the weekly meeting and let's discuss it on Wednesday.

Comment on lines +1253 to +1254
elif await handle_profiler_task_async(task_type):
pass # Profiler command handled
Contributor

The logic here is a bit confusing. The type check should live in the elif line itself, and the handler should then just run for those task types. I think the code here should be like:

elif is_profiling_task(task_type):
    await handle_profiling_task_async(task_type)
    continue
...

Contributor

@gcanlin WDYT

Collaborator Author

Makes sense. I will modify it later. Thanks!

Comment on lines +669 to +670
if handle_profiler_task(task_type):
continue
Contributor

I think the code here has the same issue as in the async method:

if is_profiling_task(task_type):
    handle_profiling_task(task_type)
    continue
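The is_profiling_task / handle_profiling_task split suggested in these review comments could be backed by a small predicate over the task enum. A sketch: the enum values are copied from stage_utils.py as quoted in the PR description, while the helper bodies are assumptions standing in for the real worker-side profiler calls.

```python
import enum


class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"


_PROFILER_TASKS = {
    OmniStageTaskType.PROFILER_START,
    OmniStageTaskType.PROFILER_STOP,
}


def is_profiling_task(task_type: OmniStageTaskType) -> bool:
    """Predicate used in the worker loop's elif branch."""
    return task_type in _PROFILER_TASKS


def handle_profiling_task(task_type: OmniStageTaskType) -> str:
    # Stand-in for starting/stopping the torch profiler on this worker.
    if task_type is OmniStageTaskType.PROFILER_START:
        return "started"
    return "stopped"
```

With this shape, the loop reads `if is_profiling_task(task_type): handle_profiling_task(task_type); continue`, so the type check stays visible at the branch rather than being hidden inside the handler's return value.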

Comment on lines +380 to +382
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
if profiler_enabled:
omni_llm.start_profile(stages=[0])
Contributor

We should add guidance in the docs for users on how to start/stop profiling in their own codebase.

Contributor

OK, noted.

Collaborator Author

See #570.

print(f"Request ID: {request_id}, Saved audio to {output_wav}")

processed_count += len(stage_outputs.request_output)
if profiler_enabled and processed_count >= total_requests:
Contributor

Why should we stop profiling once processed_count >= total_requests?

Contributor

[In omni](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/entrypoints/omni.py)

Omni kills the workers once we exit the loop, so I call stop_profile() just before we are about to exit it.
(screenshot)
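That loop shape can be sketched as follows. The names mirror the example script, but `omni`, the generator, and the output objects are all stubs here, not the real API.

```python
def consume_outputs(omni, generator, total_requests, profiler_enabled):
    """Consume stage outputs; stop profiling once every request has been
    processed, i.e. just before the loop exits and workers are torn down.
    Sketch only: `omni` and the outputs are stand-ins for the real objects."""
    processed_count = 0
    for stage_outputs in generator:
        processed_count += len(stage_outputs.request_output)
        if profiler_enabled and processed_count >= total_requests:
            # Last chance to flush traces before the workers are killed.
            omni.stop_profile()
    return processed_count
```

Calling stop_profile() after the loop would be too late in this design, because worker teardown begins as soon as the loop exits.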

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@lishunyang12
Contributor

lishunyang12 commented Jan 5, 2026

I have investigated the trace logs and confirmed that the large file size and high CPU time are expected artifacts of the MultiprocExecutor used in vllm-omni.

  • Standard vLLM typically uses UniProcExecutor (single process), which has zero IPC overhead.
  • vLLM-Omni forces MultiprocExecutor (separate processes), so the worker process sits in a busy loop waiting for shared-memory synchronization. This "wait" time is recorded as CPU activity.

Why?
vllm-omni deals with multimodal inputs (audio and video), and uses multiple processes to prevent video/audio decoding from freezing the GPU(s).

The profiler is accurate; we are seeing the cost of process synchronization (multiprocessing, shm_broadcast), not model execution (CUDA graphs, GEMM, FlashAttention).

Reference: the logic for switching between UniProcExecutor and MultiprocExecutor is defined here: https://github.com/vllm-project/vllm/blob/main/vllm/v1/executor/abstract.py#L26

Breakdown of the profiling results:

| Metric         | Time     | % of Total |
| -------------- | -------- | ---------- |
| Execution Time | 10.4 ms  | ~2.4%      |
| Waiting Time   | 418.2 ms | ~97.6%     |
| Total Latency  | 428.6 ms | 100%       |
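The execution/waiting split in the table is just the difference between the two profiler totals reported earlier (Self CPU time 428.648 ms vs. Self CUDA time 10.401 ms); the variable names below are purely illustrative.

```python
self_cpu_ms = 428.648   # "Self CPU time total" from the profiler summary
self_cuda_ms = 10.401   # "Self CUDA time total"

# CPU time not backed by CUDA work is (mostly) synchronization wait.
waiting_ms = self_cpu_ms - self_cuda_ms
exec_pct = 100 * self_cuda_ms / self_cpu_ms

print(f"waiting ~= {waiting_ms:.1f} ms, execution ~= {exec_pct:.1f}% of total")
# -> waiting ~= 418.2 ms, execution ~= 2.4% of total
```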

Component Analysis

| Component          | Time Taken | Explanation |
| ------------------ | ---------- | ----------- |
| aten::sort         | 410.9 ms   | The CPU initiates the sort for token sampling but blocks here while waiting for the GPU and the other processes to sync. It is not actually computing for 400 ms; it is sleeping. |
| Runtime loading    | 17.7 ms    | One-time cost of loading dynamic libraries; this disappears after the first run. |
| GEMM (matrix math) | 8.3 ms     | The core linear layers of the model. |
| Flash Attention    | 0.3 ms     | FlashAttention-2. |

Why aten::sort?
The 410 ms you see is wait time, not sort time.

In the vLLM sampling phase, the CPU often needs to sort the logits to pick the next token.

  • The workflow, according to the metrics table:
    1. The GPU calculates the logits (takes ~10 ms).
    2. The CPU calls aten::sort to pick the best token.
    3. The bottleneck: to sort the data, the CPU must wait for the GPU to finish calculating and transfer the data.

The illusion: the profiler starts the clock when aten::sort is called. The CPU then sits there doing nothing while waiting for the GPU or the shared-memory sync. Once the data arrives, the sort finishes almost instantly.
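The attribution effect described here can be reproduced without a GPU. In the CPU-only analogy below, a "sort" step must first wait for a producer thread, and nearly all of the measured time turns out to be the wait, not the sort itself. This is only an analogy for the GPU/shared-memory sync, not vLLM code.

```python
import queue
import threading
import time


def producer(q: "queue.Queue") -> None:
    time.sleep(0.2)                     # stands in for GPU compute / shm sync
    q.put(list(range(1000, 0, -1)))     # "logits" arrive in reverse order


q: "queue.Queue" = queue.Queue()
threading.Thread(target=producer, args=(q,)).start()

t0 = time.perf_counter()
data = q.get()                          # the "aten::sort" clock starts here...
wait_s = time.perf_counter() - t0

t1 = time.perf_counter()
data.sort()                             # ...but the actual sort is near-instant
sort_s = time.perf_counter() - t1

print(f"wait {wait_s * 1e3:.0f} ms vs sort {sort_s * 1e3:.2f} ms")
```

A profiler that starts timing at the blocking call would charge all of `wait_s` to the sort, exactly as the aten::sort row in the table does.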

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator Author

gcanlin commented Jan 5, 2026

Thanks for investigating, @lishunyang12. IMO, it's hard to avoid tracing shm_broadcast.py:dequeue if we want to reuse vLLM's profiler. Even though the trace file is large (~70 MB), the current profiler still works. So I suggest we skip this issue temporarily and merge this PR first (once the other issues are fixed). I will open an issue to track it; after deeper discussion, I think we will find ways to fix it in a follow-up PR. @hsliuustc0106
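Until the extra records can be avoided at the source, one possible workaround is to post-process the exported Chrome trace and drop the synchronization events. The sketch below assumes the standard Chrome trace-event layout (a top-level "traceEvents" list of dicts with a "name" field); the shm_broadcast substring filter is an assumption about what the noisy events are called.

```python
import gzip
import json


def filter_trace(in_path: str, out_path: str,
                 drop_substrings=("shm_broadcast",)) -> int:
    """Rewrite a *.pt.trace.json.gz file, dropping events whose name
    contains any of the given substrings. Returns the number of kept events."""
    with gzip.open(in_path, "rt") as f:
        trace = json.load(f)
    events = trace.get("traceEvents", [])
    kept = [ev for ev in events
            if not any(s in str(ev.get("name", "")) for s in drop_substrings)]
    trace["traceEvents"] = kept
    with gzip.open(out_path, "wt") as f:
        json.dump(trace, f)
    return len(kept)
```

This keeps the trace loadable in Perfetto while shrinking the file, at the cost of hiding the synchronization timeline entirely.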

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

lgtm

@hsliuustc0106 hsliuustc0106 merged commit 52afcb6 into vllm-project:main Jan 5, 2026
7 checks passed
@hsliuustc0106
Collaborator

Thanks. That said, we have found a lot of problems with profiling along the way, so please open new issues for them.

@wuhang2014
Contributor


Can we get a similar time cost if we hack the code to force UniProcExecutor? @gcanlin @lishunyang12
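One hedged way to test that hypothesis: vLLM selects the executor class from the distributed_executor_backend setting (see the abstract.py link above), so forcing the single-process backend there could give a comparison point. Whether the "uni" backend is accepted unchanged in a vllm-omni build is an assumption to verify; this is a configuration sketch, not a tested recipe.

```python
# Sketch: force the single-process executor to compare profiler wait time.
# Assumes vLLM's `distributed_executor_backend` knob is honored here.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-Omni-7B",
    distributed_executor_backend="uni",  # instead of the default multiproc path
)
```

If the aten::sort wall time collapses under this configuration, that would confirm the wait-time explanation above.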

tzhouam pushed a commit to tzhouam/vllm-omni that referenced this pull request Jan 6, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
sniper35 pushed a commit to sniper35/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
ZJY0516 pushed a commit to LawJarp-A/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support torch profiler

6 participants