
[Feature] Support torch profiler across omni stages#553

Merged
hsliuustc0106 merged 17 commits into vllm-project:main from gcanlin:feat-profiler
Jan 5, 2026

Conversation

@gcanlin
Collaborator

@gcanlin gcanlin commented Dec 30, 2025


Purpose

Close #481.

vLLM-Omni supports profiling via PyTorch's built-in profiler, allowing users to capture detailed performance traces of multi-stage inference pipelines. This document describes the architecture and usage of the profiling system.

Task Type Definition

Profiler commands are defined as part of the OmniStageTaskType enum in stage_utils.py:

class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"
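For illustration, a stage loop might dispatch on these task types roughly as follows. This is a minimal sketch under assumptions: the `handle_task` function and its handler table are hypothetical, not the actual vllm-omni implementation, which forwards these commands to the stage's engine process.

```python
import enum

class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"

def handle_task(task_type: OmniStageTaskType) -> str:
    # Hypothetical handlers for illustration only; a real stage worker
    # would route PROFILER_START/PROFILER_STOP to its engine.
    handlers = {
        OmniStageTaskType.PROFILER_START: lambda: "profiler started",
        OmniStageTaskType.PROFILER_STOP: lambda: "profiler stopped",
    }
    handler = handlers.get(task_type)
    return handler() if handler else f"unhandled: {task_type.value}"
```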

Basic Usage (Offline Inference)

from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Qwen/Qwen3-Omni-7B")

# Start profiling for specific stages
omni.start_profile(stages=[0, 1])

# Run inference - profiling captures this
generator = omni.generate(prompts, sampling_params)
for output in generator:
    process(output)

# Stop profiling AFTER inference completes
omni.stop_profile()

omni.close()

Important Notes

  1. Call Order: stop_profile() must be called AFTER inference completes, not immediately after generate(). The generate() method returns a generator, so actual inference happens during iteration.

  2. Stage Selection: Pass stages=[0, 1, 2] to profile specific stages, or None to profile all stages.

  3. Output: Traces are saved to VLLM_TORCH_PROFILER_DIR/ with stage-specific subdirectories.
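The call-order point is plain Python generator semantics, which a small standalone sketch (unrelated to vllm-omni itself) makes concrete: nothing inside the generator body executes until the caller iterates, so stopping the profiler immediately after the call would capture an empty trace.

```python
events = []

def generate():
    # The body runs only while the generator is iterated,
    # not when generate() is called.
    events.append("inference step")
    yield "output"

gen = generate()
assert events == []  # calling generate() did no work yet

outputs = list(gen)  # iteration drives the actual work
assert events == ["inference step"]
assert outputs == ["output"]
```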

Output Format

Profiler generates two types of output:

  1. Chrome Trace (JSON): *.pt.trace.json.gz - Can be viewed in Perfetto
  2. Summary Table: profiler_out_*.txt - Human-readable performance summary
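The Chrome trace is ordinary gzipped JSON, so it can also be inspected with the standard library. Below is a hedged sketch that writes and reloads a tiny synthetic trace standing in for a real `*.pt.trace.json.gz` file; the event content is illustrative, not actual profiler output.

```python
import gzip
import json
import os
import tempfile

# Tiny synthetic stand-in for a real *.pt.trace.json.gz file.
trace = {"traceEvents": [{"name": "aten::mm", "ph": "X", "dur": 8}]}

path = os.path.join(tempfile.mkdtemp(), "stage0.pt.trace.json.gz")
with gzip.open(path, "wt") as f:
    json.dump(trace, f)

# Perfetto consumes the same gzipped JSON; here we just reload it.
with gzip.open(path, "rt") as f:
    loaded = json.load(f)

names = [event["name"] for event in loaded["traceEvents"]]
print(names)
```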

Test Plan

After investigating the vLLM profiler, I recommend using the environment variables below.

export VLLM_USE_MODELSCOPE=True
export VLLM_TORCH_PROFILER_DIR=./trace_perf
export VLLM_TORCH_PROFILER_WITH_STACK=1
export VLLM_TORCH_PROFILER_RECORD_SHAPES=0
export VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=0
export VLLM_TORCH_PROFILER_WITH_FLOPS=0
export VLLM_TORCH_PROFILER_USE_GZIP=1
export VLLM_TORCH_PROFILER_DUMP_CUDA_TIME_TOTAL=1
export VLLM_TORCH_PROFILER_DISABLE_ASYNC_LLM=1
export VLLM_PROFILER_MAX_ITERS=1
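Most of these are 0/1 flags. A worker would typically read them along these lines; this is a hedged sketch of the pattern, with `env_flag` as a hypothetical helper, not the actual vLLM parsing code.

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    # Interpret "1"/"true"/"yes" (case-insensitive) as enabled.
    raw = os.getenv(name, "1" if default else "0")
    return raw.strip().lower() in ("1", "true", "yes")

os.environ["VLLM_TORCH_PROFILER_WITH_STACK"] = "1"
os.environ["VLLM_TORCH_PROFILER_RECORD_SHAPES"] = "0"

assert env_flag("VLLM_TORCH_PROFILER_WITH_STACK") is True
assert env_flag("VLLM_TORCH_PROFILER_RECORD_SHAPES") is False
```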

To avoid generating overly large trace files, it is recommended to limit the number of profiled stages.

omni_llm.start_profile(stages=[1])

Test Result

profiler_out_0.txt:

Details
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                 execute_new_1_cached_0         0.00%       0.000us         0.00%       0.000us       0.000us     162.909ms      3083.44%     162.909ms     162.909ms             1  
                           vllm::outplace_fused_experts        13.07%      26.381ms        14.32%      28.914ms       1.446ms       1.260ms        23.85%       1.606ms      80.312us            20  
                                       fused_moe_kernel         0.00%       0.000us         0.00%       0.000us       0.000us       1.260ms        23.85%       1.260ms      31.502us            40  
                                               aten::mm         3.99%       8.050ms         6.17%      12.462ms     102.991us     949.932us        17.98%     978.540us       8.087us           121  
                                              aten::mul         0.65%       1.314ms         1.12%       2.252ms      12.372us     671.188us        12.70%     671.188us       3.688us           182  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     660.692us        12.51%     660.692us       3.671us           180  
ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f...         0.00%       0.000us         0.00%       0.000us       0.000us     504.757us         9.55%     504.757us       8.275us            61  
                                              aten::cat         0.78%       1.580ms         1.71%       3.458ms      26.809us     433.302us         8.20%     436.854us       3.386us           129  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us     251.577us         4.76%     251.577us       3.145us            80  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     212.222us         4.02%     212.222us       2.101us           101  
                                _vllm_fa2_C::varlen_fwd         0.14%     278.931us         2.74%       5.528ms     276.417us     206.365us         3.91%     232.957us      11.648us            20  
void flash::flash_fwd_splitkv_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us     206.365us         3.91%     206.365us      10.318us            20  
                                  Lazy Function Loading         0.76%       1.539ms         0.76%       1.539ms      54.959us     183.066us         3.46%     183.066us       6.538us            28  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us     165.725us         3.14%     165.725us       4.042us            41  
void cutlass::Kernel2<cutlass_80_tensorop_s16816gemm...         0.00%       0.000us         0.00%       0.000us       0.000us     159.549us         3.02%     159.549us       6.937us            23  
                           _moe_C::moe_align_block_size         0.07%     134.704us         0.24%     477.402us      23.870us     154.812us         2.93%     154.812us       7.741us            20  
                                            aten::copy_         0.34%     692.508us         1.17%       2.355ms      25.596us     149.885us         2.84%     224.732us       2.443us            92  
                       Runtime Triggered Module Loading        41.47%      83.744ms        41.47%      83.744ms       4.652ms     138.876us         2.63%     138.876us       7.715us            18  
                                            aten::index         0.22%     438.690us         0.48%     974.327us      44.288us     137.212us         2.60%     148.891us       6.768us            22  
                                              aten::sum         0.14%     291.610us         0.23%     464.187us      23.209us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::reduce_kernel<128, 4, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     134.780us         2.55%     134.780us       6.418us            21  
                                              aten::add         0.28%     556.915us         0.46%     932.664us       9.818us     134.270us         2.54%     137.918us       1.452us            95  
                                           _C::rms_norm         0.09%     174.059us         0.24%     490.130us      11.954us     133.343us         2.52%     133.343us       3.252us            41  
                                 _C::fused_add_rms_norm         0.11%     212.570us         0.24%     492.322us      12.308us     131.166us         2.48%     131.166us       3.279us            40  
std::enable_if<(((8)>(0)))&&vllm::_typeConvert<c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     131.166us         2.48%     131.166us       3.279us            40  
void vllm::rms_norm_kernel<c10::BFloat16, 8, 3>(c10:...         0.00%       0.000us         0.00%       0.000us       0.000us     130.367us         2.47%     130.367us       3.259us            40  
                                            aten::addmm         0.34%     695.435us         0.48%     959.588us      79.966us     119.358us         2.26%     144.606us      12.051us            12  
                                   _moe_C::topk_softmax         0.07%     144.369us         0.18%     362.292us      18.115us     116.286us         2.20%     116.286us       5.814us            20  
void vllm::moe::topkGatingSoftmax<8, 128, 4, 16, 32,...         0.00%       0.000us         0.00%       0.000us       0.000us     116.286us         2.20%     116.286us       5.814us            20  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     110.783us         2.10%     110.783us       2.770us            40  
void vllm::moe::moe_align_block_size_kernel<int>(int...         0.00%       0.000us         0.00%       0.000us       0.000us     106.684us         2.02%     106.684us       5.334us            20  
                                       _C::silu_and_mul         0.09%     173.470us         0.21%     428.884us      10.722us     101.052us         1.91%     101.052us       2.526us            40  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     101.052us         1.91%     101.052us       2.526us            40  
void cutlass::Kernel2<cutlass_80_wmma_tensorop_s1616...         0.00%       0.000us         0.00%       0.000us       0.000us      86.206us         1.63%      86.206us       4.310us            20  
                                              aten::sub         0.15%     309.465us         0.29%     593.790us      14.138us      83.840us         1.59%      86.080us       2.050us            42  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      74.783us         1.42%      74.783us       5.753us            13  
                  _C_cache_ops::reshape_and_cache_flash         0.07%     133.160us         1.72%       3.483ms     174.130us      69.920us         1.32%      80.096us       4.005us            20  
void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us      69.920us         1.32%      69.920us       3.496us            20  
void dot_kernel<float, 128, 0, cublasDotParams<cubla...         0.00%       0.000us         0.00%       0.000us       0.000us      60.639us         1.15%      60.639us       3.032us            20  
                                          aten::sigmoid         0.11%     221.586us         0.17%     342.447us      17.122us      53.726us         1.02%      53.726us       2.686us            20  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      53.726us         1.02%      53.726us       2.686us            20  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us       0.000us      50.879us         0.96%      50.879us       8.480us             6  
void reduce_1Block_kernel<float, 128, 7, cublasGemvT...         0.00%       0.000us         0.00%       0.000us       0.000us      48.222us         0.91%      48.222us       2.411us            20  
void vllm::moe::count_and_sort_expert_tokens_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      48.128us         0.91%      48.128us       2.406us            20  
                         Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us      39.040us         0.74%      39.040us       2.169us            18  
                                             aten::topk         1.18%       2.373ms        10.73%      21.659ms      21.659ms      30.335us         0.57%     151.675us     151.675us             1  
                              aten::_local_scalar_dense         0.05%     102.563us         0.16%     330.240us       6.740us      28.672us         0.54%      28.672us       0.585us            49  
                                          aten::nonzero         0.10%     200.502us         2.89%       5.843ms       1.948ms      27.678us         0.52%      76.953us      25.651us             3  
ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_6...         0.00%       0.000us         0.00%       0.000us       0.000us      25.727us         0.49%      25.727us       8.576us             3  
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      25.695us         0.49%      25.695us       1.977us            13  
                                               aten::eq         0.10%     198.200us         7.49%      15.118ms       1.260ms      23.647us         0.45%      35.615us       2.968us            12  
                                            aten::fill_         0.05%     110.020us         0.11%     221.114us       4.336us      23.039us         0.44%      23.039us       0.452us            51  
void at::native::sbtopk::gatherTopK<float, unsigned ...         0.00%       0.000us         0.00%       0.000us       0.000us      21.856us         0.41%      21.856us      21.856us             1  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us      19.776us         0.37%      19.776us       3.296us             6  
                                             aten::silu         0.04%      90.091us         0.07%     135.434us      22.572us      16.864us         0.32%      16.864us       2.811us             6  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      16.864us         0.32%      16.864us       2.811us             6  
                                 aten::_index_put_impl_         0.03%      58.090us         0.10%     209.052us      69.684us      12.864us         0.24%      22.720us       7.573us             3  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      12.864us         0.24%      12.864us       6.432us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, __nv...         0.00%       0.000us         0.00%       0.000us       0.000us      12.224us         0.23%      12.224us       4.075us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.808us         0.22%      11.808us       1.968us             6  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.496us         0.20%      10.496us       5.248us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us      10.304us         0.20%      10.304us       3.435us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.471us         0.18%       9.471us       2.368us             4  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       9.216us         0.17%       9.216us       3.072us             3  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.184us         0.17%       9.184us       1.837us             5  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       8.607us         0.16%       8.607us       1.721us             5  
void at::native::radixSortKVInPlace<-2, -1, 32, 4, f...         0.00%       0.000us         0.00%       0.000us       0.000us       8.479us         0.16%       8.479us       8.479us             1  
void at_cuda_detail::cub::DeviceSelectSweepKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.872us         0.15%       7.872us       2.624us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.488us         0.14%       7.488us       2.496us             3  
                                     aten::index_select         0.89%       1.796ms         0.96%       1.936ms     645.364us       7.231us         0.14%      12.159us       4.053us             3  
void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.231us         0.14%       7.231us       7.231us             1  
void at_cuda_detail::cub::DeviceReduceSingleTileKern...         0.00%       0.000us         0.00%       0.000us       0.000us       7.103us         0.13%       7.103us       2.368us             3  
                                           aten::argmax         0.01%      19.123us         0.01%      29.241us      29.241us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.784us         0.13%       6.784us       3.392us             2  
                                             aten::div_         0.01%      23.018us         0.02%      42.255us      21.127us       6.688us         0.13%       6.688us       3.344us             2  
                                     aten::masked_fill_         0.02%      30.954us         0.05%      99.651us      24.913us       6.464us         0.12%       8.640us       2.160us             4  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       6.176us         0.12%       6.176us       2.059us             3  
                                             aten::sub_         0.01%      27.663us         0.03%      68.612us      22.871us       6.176us         0.12%       8.160us       2.720us             3  
                                           aten::gather         0.02%      45.116us         0.06%     125.564us      62.782us       5.664us         0.11%       8.128us       4.064us             2  
void at_cuda_detail::cub::DeviceCompactInitKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.343us         0.10%       5.343us       1.781us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.248us         0.10%       5.248us       1.749us             3  
                                         aten::_softmax         0.01%      12.536us         0.01%      24.545us      24.545us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us         0.09%       4.831us       2.415us             2  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us         0.09%       4.768us       4.768us             1  
                                       aten::bitwise_or         0.02%      34.568us         5.54%      11.187ms       5.594ms       4.576us         0.09%      11.968us       5.984us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.576us         0.09%       4.576us       2.288us             2  
                                      aten::bitwise_not         0.02%      35.313us         0.05%      96.025us      48.013us       4.415us         0.08%       6.622us       3.311us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.415us         0.08%       4.415us       2.208us             2  
                                              aten::max         0.07%     143.851us         0.58%       1.181ms      34.721us       4.352us         0.08%       4.352us       0.128us            34  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       4.352us         0.08%       4.352us       4.352us             1  
                                               aten::gt         0.02%      34.013us         0.03%      54.536us      27.268us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us         0.08%       4.288us       2.144us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.192us         0.08%       4.192us       2.096us             2  
                                              aten::any         0.02%      46.241us        11.90%      24.028ms      24.028ms       3.968us         0.08%      15.872us      15.872us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us         0.08%       3.968us       3.968us             1  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.648us         0.07%       3.648us       3.648us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 201.919ms
Self CUDA time total: 5.283ms
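The summary footer can be scraped with a couple of lines, e.g. to compare CPU vs. CUDA totals across stages. A sketch over the literal footer format shown above; the regex is an assumption about that format, not part of vllm-omni.

```python
import re

# Footer lines exactly as they appear in profiler_out_0.txt above.
footer = (
    "Self CPU time total: 201.919ms\n"
    "Self CUDA time total: 5.283ms\n"
)

totals = dict(re.findall(r"Self (CPU|CUDA) time total: ([\d.]+)ms", footer))
print(totals)
```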

Use https://ui.perfetto.dev to view the trace:

[screenshot: trace opened in Perfetto]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin mentioned this pull request Dec 31, 2025
1 task
@lishunyang12
Contributor

As discussed in the weekly meeting, we have the following TODOs under this pull request:

1. Rebase
Recent updates introduce an omni_base class that is subclassed by the omni and async_omni classes. We can place start_profile() and stop_profile() in omni_base instead of replicating the same functions in both child classes.

2. API design
To ease the transition for vLLM developers moving to vLLM-Omni, we should stick to the vLLM profiler's API design.
https://docs.vllm.ai/en/stable/contributing/profiling/#profile-with-pytorch-profiler

  • Offline Inference:
llm = LLM(
    model="xx",
    tensor_parallel_size=1,
    profiler_config={
        "profiler": "torch",
        "torch_profiler_dir": "./vllm_profile",
    },
)

llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
llm.stop_profile()
  • Online Serving:
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

3. Diffusion Pipeline Support
The diffusion engine is newly introduced in Omni and should implement start_profile() and stop_profile(). We can reuse the vLLM profiler module.
https://github.com/vllm-project/vllm/tree/main/vllm/profiler

4. Provide examples
To align with vLLM, we can provide example scripts specifically for offline inference profiling. I recommend trying small models like qwen2.5-omni-7B/3B for Omni and z-image for Diffusion.

5. Documentation
Provide developers with a user guide at https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/.

@hsliuustc0106
Collaborator

@amy-why-3459 PTAL

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator Author

gcanlin commented Dec 31, 2025

  1. I have rebased the code and extracted the common methods into the OmniBase class.
  2. The offline profiler API has been implemented. However, the --profiler-config flag was introduced by [Cleanup] Refactor profiling env vars into a CLI config vllm#29912, which isn't included in vLLM 0.12.0, so we will have to support it in the next version.
  3. I'd prefer to implement the diffusion profiler in a follow-up PR.

cmd = task.get("command")
if cmd == "start" and has_profiler:
    try:
        await stage_engine.start_profile()
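The snippet above awaits the profiler command before continuing. As a generic asyncio illustration (not vllm-omni code), the ordering difference versus asyncio.create_task looks like this, with `start_profile` a toy stand-in:

```python
import asyncio

async def start_profile(order):
    order.append("profile started")

async def with_await():
    order = []
    await start_profile(order)        # command completes before moving on
    order.append("handle next task")
    return order

async def with_create_task():
    order = []
    task = asyncio.create_task(start_profile(order))
    order.append("handle next task")  # loop proceeds immediately
    await task                        # command runs at a later yield point
    return order

assert asyncio.run(with_await()) == ["profile started", "handle next task"]
assert asyncio.run(with_create_task()) == ["handle next task", "profile started"]
```

With create_task, the profiler would start at some later point in the event loop, which is why the reviewers preferred await for trace accuracy.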
Contributor

Should asyncio.create_task(stage_engine.start_profile()) be used to avoid blocking?

Contributor

The current await should be the right choice for correctness and precision. Changing to create_task would risk inaccurate traces. If start_profile() ever became heavy (unlikely), the vLLM core would handle it differently.

Contributor

There is already a warning in the docs that enabling the profiler will reduce performance. #570

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin and others added 3 commits December 31, 2025 16:43
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin marked this pull request as ready for review December 31, 2025 17:08
@amy-why-3459 amy-why-3459 mentioned this pull request Jan 4, 2026
35 tasks
@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) Jan 4, 2026
                       Runtime Triggered Module Loading        41.47%      83.744ms        41.47%      83.744ms       4.652ms     138.876us         2.63%     138.876us       7.715us            18  
                                            aten::index         0.22%     438.690us         0.48%     974.327us      44.288us     137.212us         2.60%     148.891us       6.768us            22  
                                              aten::sum         0.14%     291.610us         0.23%     464.187us      23.209us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::reduce_kernel<128, 4, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     136.925us         2.59%     136.925us       6.846us            20  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     134.780us         2.55%     134.780us       6.418us            21  
                                              aten::add         0.28%     556.915us         0.46%     932.664us       9.818us     134.270us         2.54%     137.918us       1.452us            95  
                                           _C::rms_norm         0.09%     174.059us         0.24%     490.130us      11.954us     133.343us         2.52%     133.343us       3.252us            41  
                                 _C::fused_add_rms_norm         0.11%     212.570us         0.24%     492.322us      12.308us     131.166us         2.48%     131.166us       3.279us            40  
std::enable_if<(((8)>(0)))&&vllm::_typeConvert<c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     131.166us         2.48%     131.166us       3.279us            40  
void vllm::rms_norm_kernel<c10::BFloat16, 8, 3>(c10:...         0.00%       0.000us         0.00%       0.000us       0.000us     130.367us         2.47%     130.367us       3.259us            40  
                                            aten::addmm         0.34%     695.435us         0.48%     959.588us      79.966us     119.358us         2.26%     144.606us      12.051us            12  
                                   _moe_C::topk_softmax         0.07%     144.369us         0.18%     362.292us      18.115us     116.286us         2.20%     116.286us       5.814us            20  
void vllm::moe::topkGatingSoftmax<8, 128, 4, 16, 32,...         0.00%       0.000us         0.00%       0.000us       0.000us     116.286us         2.20%     116.286us       5.814us            20  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     110.783us         2.10%     110.783us       2.770us            40  
void vllm::moe::moe_align_block_size_kernel<int>(int...         0.00%       0.000us         0.00%       0.000us       0.000us     106.684us         2.02%     106.684us       5.334us            20  
                                       _C::silu_and_mul         0.09%     173.470us         0.21%     428.884us      10.722us     101.052us         1.91%     101.052us       2.526us            40  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us     101.052us         1.91%     101.052us       2.526us            40  
void cutlass::Kernel2<cutlass_80_wmma_tensorop_s1616...         0.00%       0.000us         0.00%       0.000us       0.000us      86.206us         1.63%      86.206us       4.310us            20  
                                              aten::sub         0.15%     309.465us         0.29%     593.790us      14.138us      83.840us         1.59%      86.080us       2.050us            42  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      74.783us         1.42%      74.783us       5.753us            13  
                  _C_cache_ops::reshape_and_cache_flash         0.07%     133.160us         1.72%       3.483ms     174.130us      69.920us         1.32%      80.096us       4.005us            20  
void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us      69.920us         1.32%      69.920us       3.496us            20  
void dot_kernel<float, 128, 0, cublasDotParams<cubla...         0.00%       0.000us         0.00%       0.000us       0.000us      60.639us         1.15%      60.639us       3.032us            20  
                                          aten::sigmoid         0.11%     221.586us         0.17%     342.447us      17.122us      53.726us         1.02%      53.726us       2.686us            20  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      53.726us         1.02%      53.726us       2.686us            20  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us       0.000us      50.879us         0.96%      50.879us       8.480us             6  
void reduce_1Block_kernel<float, 128, 7, cublasGemvT...         0.00%       0.000us         0.00%       0.000us       0.000us      48.222us         0.91%      48.222us       2.411us            20  
void vllm::moe::count_and_sort_expert_tokens_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      48.128us         0.91%      48.128us       2.406us            20  
                         Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us      39.040us         0.74%      39.040us       2.169us            18  
                                             aten::topk         1.18%       2.373ms        10.73%      21.659ms      21.659ms      30.335us         0.57%     151.675us     151.675us             1  
                              aten::_local_scalar_dense         0.05%     102.563us         0.16%     330.240us       6.740us      28.672us         0.54%      28.672us       0.585us            49  
                                          aten::nonzero         0.10%     200.502us         2.89%       5.843ms       1.948ms      27.678us         0.52%      76.953us      25.651us             3  
ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_6...         0.00%       0.000us         0.00%       0.000us       0.000us      25.727us         0.49%      25.727us       8.576us             3  
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      25.695us         0.49%      25.695us       1.977us            13  
                                               aten::eq         0.10%     198.200us         7.49%      15.118ms       1.260ms      23.647us         0.45%      35.615us       2.968us            12  
                                            aten::fill_         0.05%     110.020us         0.11%     221.114us       4.336us      23.039us         0.44%      23.039us       0.452us            51  
void at::native::sbtopk::gatherTopK<float, unsigned ...         0.00%       0.000us         0.00%       0.000us       0.000us      21.856us         0.41%      21.856us      21.856us             1  
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us      19.776us         0.37%      19.776us       3.296us             6  
                                             aten::silu         0.04%      90.091us         0.07%     135.434us      22.572us      16.864us         0.32%      16.864us       2.811us             6  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      16.864us         0.32%      16.864us       2.811us             6  
                                 aten::_index_put_impl_         0.03%      58.090us         0.10%     209.052us      69.684us      12.864us         0.24%      22.720us       7.573us             3  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      12.864us         0.24%      12.864us       6.432us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, __nv...         0.00%       0.000us         0.00%       0.000us       0.000us      12.224us         0.23%      12.224us       4.075us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us      11.808us         0.22%      11.808us       1.968us             6  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      10.496us         0.20%      10.496us       5.248us             2  
void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us      10.304us         0.20%      10.304us       3.435us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.471us         0.18%       9.471us       2.368us             4  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       9.216us         0.17%       9.216us       3.072us             3  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us       9.184us         0.17%       9.184us       1.837us             5  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       8.607us         0.16%       8.607us       1.721us             5  
void at::native::radixSortKVInPlace<-2, -1, 32, 4, f...         0.00%       0.000us         0.00%       0.000us       0.000us       8.479us         0.16%       8.479us       8.479us             1  
void at_cuda_detail::cub::DeviceSelectSweepKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       7.872us         0.15%       7.872us       2.624us             3  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.488us         0.14%       7.488us       2.496us             3  
                                     aten::index_select         0.89%       1.796ms         0.96%       1.936ms     645.364us       7.231us         0.14%      12.159us       4.053us             3  
void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.231us         0.14%       7.231us       7.231us             1  
void at_cuda_detail::cub::DeviceReduceSingleTileKern...         0.00%       0.000us         0.00%       0.000us       0.000us       7.103us         0.13%       7.103us       2.368us             3  
                                           aten::argmax         0.01%      19.123us         0.01%      29.241us      29.241us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       6.816us         0.13%       6.816us       6.816us             1  
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.784us         0.13%       6.784us       3.392us             2  
                                             aten::div_         0.01%      23.018us         0.02%      42.255us      21.127us       6.688us         0.13%       6.688us       3.344us             2  
                                     aten::masked_fill_         0.02%      30.954us         0.05%      99.651us      24.913us       6.464us         0.12%       8.640us       2.160us             4  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       6.176us         0.12%       6.176us       2.059us             3  
                                             aten::sub_         0.01%      27.663us         0.03%      68.612us      22.871us       6.176us         0.12%       8.160us       2.720us             3  
                                           aten::gather         0.02%      45.116us         0.06%     125.564us      62.782us       5.664us         0.11%       8.128us       4.064us             2  
void at_cuda_detail::cub::DeviceCompactInitKernel<at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.343us         0.10%       5.343us       1.781us             3  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       5.248us         0.10%       5.248us       1.749us             3  
                                         aten::_softmax         0.01%      12.536us         0.01%      24.545us      24.545us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us         0.09%       4.896us       4.896us             1  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.831us         0.09%       4.831us       2.415us             2  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       4.768us         0.09%       4.768us       4.768us             1  
                                       aten::bitwise_or         0.02%      34.568us         5.54%      11.187ms       5.594ms       4.576us         0.09%      11.968us       5.984us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.576us         0.09%       4.576us       2.288us             2  
                                      aten::bitwise_not         0.02%      35.313us         0.05%      96.025us      48.013us       4.415us         0.08%       6.622us       3.311us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.415us         0.08%       4.415us       2.208us             2  
                                              aten::max         0.07%     143.851us         0.58%       1.181ms      34.721us       4.352us         0.08%       4.352us       0.128us            34  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       4.352us         0.08%       4.352us       4.352us             1  
                                               aten::gt         0.02%      34.013us         0.03%      54.536us      27.268us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.320us         0.08%       4.320us       2.160us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us         0.08%       4.288us       2.144us             2  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.192us         0.08%       4.192us       2.096us             2  
                                              aten::any         0.02%      46.241us        11.90%      24.028ms      24.028ms       3.968us         0.08%      15.872us      15.872us             1  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us         0.08%       3.968us       3.968us             1  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.648us         0.07%       3.648us       3.648us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 201.919ms
Self CUDA time total: 5.283ms

Use https://ui.perfetto.dev to view the trace:
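Before uploading a trace, you can check which event names dominate the gzipped Chrome trace using only the standard library, which is handy when a trace file is unexpectedly large. This is a sketch assuming the standard Chrome trace layout (a JSON object with a `traceEvents` list whose entries carry `name` fields); the file path is a placeholder:

```python
import gzip
import json
from collections import Counter


def summarize_trace(path, top=10):
    """Count Chrome-trace events by name to see what dominates a
    *.pt.trace.json.gz file. Assumes the standard 'traceEvents' layout."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    counts = Counter(ev.get("name", "<unnamed>")
                     for ev in trace.get("traceEvents", []))
    return counts.most_common(top)


if __name__ == "__main__":
    # Placeholder path: substitute a real trace from VLLM_TORCH_PROFILER_DIR.
    for name, n in summarize_trace("stage_0.pt.trace.json.gz"):
        print(f"{n:8d}  {name}")
```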

image

Essential Elements of an Effective PR Description Checklist

* [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
* [ ] The test plan, such as providing test command.
* [ ] The test results, such as pasting the results comparison before and after, or e2e results
* [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
* [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

I think we do not need to call omni.close() manually to shut down the service or process.

@hsliuustc0106
Collaborator

From the log, it seems the CPU consumes the majority of the time. Is that accurate?

@hsliuustc0106
Collaborator

fix ci please

@gcanlin
Collaborator Author

gcanlin commented Jan 4, 2026

fix ci please

Sorry for being late and thanks for the review. I’ll work through the issues below ASAP:

  1. The CUDA time total seems inaccurate.
  2. CI fails.
  3. Based on @lishunyang12's feedback, the multi-stage profiler may have some bugs.

lishunyang12 and others added 3 commits January 4, 2026 23:09
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@hsliuustc0106
Collaborator

@wuhang2014 PTAL

@gcanlin
Collaborator Author

gcanlin commented Jan 5, 2026

I tested vLLM's profiler with the model Qwen/Qwen2.5-Omni-7B (thinker only) and got the results below, which explain why CUDA time is much smaller than CPU time. So yes, it's accurate, because we're reusing vLLM's profiler. And setting VLLM_PROFILER_MAX_ITERS=1 greatly reduces CUDA time.

(vllm) Original:

(EngineCore_DP0 pid=719568) Self CPU time total: 249.078ms
(EngineCore_DP0 pid=719568) Self CUDA time total: 184.041ms

(vllm) After setting export VLLM_PROFILER_MAX_ITERS=1:

(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms

BUT, I'm confused about why the trace file is far larger than vLLM's (without Omni) under the current implementation. Looking at the pictures, compared with vLLM, the vLLM-Omni profiler records additional parts (highlighted in the red boxes) that account for the majority of the trace file. @lishunyang12 Could you also help take a look at these additional records?

vllm-omni:

image

vllm:

image

The more details:

Details
INFO 01-05 02:28:17 [llm.py:360] Supported tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 553.36it/s]
Processed prompts:   0%|                                            | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=748665) INFO 01-05 02:28:18 [wrapper.py:108] Max profiling iterations reached. Stopping profiler...
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665)                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665)                                  execute_new_1_cached_0         0.00%       0.000us         0.00%       0.000us       0.000us      16.765ms       161.18%      16.765ms      16.765ms             1
(EngineCore_DP0 pid=748665) ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f...         0.00%       0.000us         0.00%       0.000us       0.000us       5.129ms        49.31%       5.129ms     176.877us            29
(EngineCore_DP0 pid=748665)       ampere_s16816gemm_bf16_128x64_ldg8_stages_32x6_tn         0.00%       0.000us         0.00%       0.000us       0.000us       2.462ms        23.67%       2.462ms      87.941us            28
(EngineCore_DP0 pid=748665)        ampere_s16816gemm_bf16_64x64_ldg8_stages_64x6_tn         0.00%       0.000us         0.00%       0.000us       0.000us     753.941us         7.25%     753.941us      26.926us            28
(EngineCore_DP0 pid=748665) ampere_bf16_s16816gemm_bf16_64x64_ldg8_relu_f2f_stag...         0.00%       0.000us         0.00%       0.000us       0.000us     751.668us         7.23%     751.668us      26.845us            28
(EngineCore_DP0 pid=748665)                                                aten::mm         0.02%      78.934us         0.02%      91.386us      91.386us     618.101us         5.94%     618.101us     618.101us             1
(EngineCore_DP0 pid=748665)                                 _vllm_fa2_C::varlen_fwd         0.07%     286.438us         0.94%       4.019ms     143.549us     292.121us         2.81%     317.849us      11.352us            28
(EngineCore_DP0 pid=748665) void flash::flash_fwd_splitkv_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us     292.121us         2.81%     292.121us      10.433us            28
(EngineCore_DP0 pid=748665)                                   Lazy Function Loading         0.11%     455.061us         0.11%     455.061us      41.369us     244.568us         2.35%     244.568us      22.233us            11
(EngineCore_DP0 pid=748665)                        Runtime Triggered Module Loading         4.13%      17.717ms         4.13%      17.717ms       4.429ms     156.540us         1.50%     156.540us      39.135us             4
(EngineCore_DP0 pid=748665)                                          aten::_softmax         0.01%      42.400us         0.02%      80.406us      40.203us     153.725us         1.48%     153.725us      76.862us             2
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::cunn_SoftMax...         0.00%       0.000us         0.00%       0.000us       0.000us     153.725us         1.48%     153.725us      76.862us             2
(EngineCore_DP0 pid=748665) void cublasLt::splitKreduce_kernel<32, 16, int, floa...         0.00%       0.000us         0.00%       0.000us       0.000us     149.120us         1.43%     149.120us       2.663us            56
(EngineCore_DP0 pid=748665)                   _C_cache_ops::reshape_and_cache_flash         0.03%     149.870us         0.08%     360.388us      12.871us     109.023us         1.05%     109.023us       3.894us            28
(EngineCore_DP0 pid=748665) void vllm::reshape_and_cache_flash_kernel<__nv_bfloa...         0.00%       0.000us         0.00%       0.000us       0.000us     109.023us         1.05%     109.023us       3.894us            28
(EngineCore_DP0 pid=748665)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_2         0.00%       0.000us         0.00%       0.000us       0.000us     106.525us         1.02%     106.525us       3.804us            28
(EngineCore_DP0 pid=748665)      triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0         0.00%       0.000us         0.00%       0.000us       0.000us      91.771us         0.88%      91.771us       3.278us            28
(EngineCore_DP0 pid=748665) triton_poi_fused_cat_index_select_split_split_with_s...         0.00%       0.000us         0.00%       0.000us       0.000us      79.486us         0.76%      79.486us       2.944us            27
(EngineCore_DP0 pid=748665)                       triton_poi_fused_mul_silu_slice_1         0.00%       0.000us         0.00%       0.000us       0.000us      75.453us         0.73%      75.453us       2.695us            28
(EngineCore_DP0 pid=748665)                                              aten::sort         0.03%     149.280us        95.86%     410.913ms     410.913ms      69.694us         0.67%     420.692us     420.692us             1
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_4         0.00%       0.000us         0.00%       0.000us       0.000us      60.511us         0.58%      60.511us       2.241us            27
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortOnesweepKer...         0.00%       0.000us         0.00%       0.000us       0.000us      48.094us         0.46%      48.094us      12.023us             4
(EngineCore_DP0 pid=748665)                                             aten::copy_         0.06%     258.622us         0.53%       2.280ms      78.607us      40.800us         0.39%      59.584us       2.055us            29
(EngineCore_DP0 pid=748665)                                         cudaGraphLaunch         0.20%     843.738us         0.20%     843.738us      29.094us      14.496us         0.14%      14.496us       0.500us            29
(EngineCore_DP0 pid=748665)                                         Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      12.032us         0.12%      12.032us       1.719us             7
(EngineCore_DP0 pid=748665)                                            aten::argmax         0.01%      38.972us         0.02%      67.032us      67.032us      11.456us         0.11%      11.456us      11.456us             1
(EngineCore_DP0 pid=748665)                                   cudaStreamIsCapturing         0.01%      24.967us         0.01%      24.967us       0.780us      10.656us         0.10%      10.656us       0.333us            32
(EngineCore_DP0 pid=748665) void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us      10.432us         0.10%      10.432us      10.432us             1
(EngineCore_DP0 pid=748665)                                           aten::scatter         0.01%      33.678us         0.02%      87.305us      87.305us      10.399us         0.10%      12.927us      12.927us             1
(EngineCore_DP0 pid=748665) void at::native::_scatter_gather_elementwise_kernel<...         0.00%       0.000us         0.00%       0.000us       0.000us      10.399us         0.10%      10.399us      10.399us             1
(EngineCore_DP0 pid=748665)                          Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.368us         0.10%      10.368us       1.152us             9
(EngineCore_DP0 pid=748665)                                              aten::div_         0.01%      27.728us         0.01%      45.493us      22.746us       8.064us         0.08%       8.064us       4.032us             2
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       7.392us         0.07%       7.392us       2.464us             3
(EngineCore_DP0 pid=748665)                          Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.136us         0.07%       7.136us       2.379us             3
(EngineCore_DP0 pid=748665)                                      aten::index_select         0.01%      40.458us         0.03%     128.561us      64.280us       7.008us         0.07%      14.016us       7.008us             2
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::indexSelectS...         0.00%       0.000us         0.00%       0.000us       0.000us       7.008us         0.07%       7.008us       7.008us             1
(EngineCore_DP0 pid=748665)                                             aten::index         0.01%      25.503us         0.02%      70.290us      70.290us       6.240us         0.06%       9.216us       9.216us             1
(EngineCore_DP0 pid=748665) void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       6.240us         0.06%       6.240us       6.240us             1
(EngineCore_DP0 pid=748665)                                            aten::cumsum         0.01%      53.262us         0.04%     188.388us     188.388us       6.047us         0.06%      18.141us      18.141us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       5.728us         0.06%       5.728us       5.728us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortHistogramKe...         0.00%       0.000us         0.00%       0.000us       0.000us       5.536us         0.05%       5.536us       5.536us             1
(EngineCore_DP0 pid=748665)                                      aten::exponential_         0.01%      21.808us         0.01%      40.378us      40.378us       4.960us         0.05%       4.960us       4.960us             1
(EngineCore_DP0 pid=748665) void at::native::(anonymous namespace)::distribution...         0.00%       0.000us         0.00%       0.000us       0.000us       4.960us         0.05%       4.960us       4.960us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceScanKernel<at_cuda_d...         0.00%       0.000us         0.00%       0.000us       0.000us       4.704us         0.05%       4.704us       4.704us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       4.256us         0.04%       4.256us       4.256us             1
(EngineCore_DP0 pid=748665)                                                aten::le         0.01%      23.814us         0.01%      42.912us      42.912us       4.224us         0.04%       4.224us       4.224us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       4.224us         0.04%       4.224us       4.224us             1
(EngineCore_DP0 pid=748665)                                    cudaDriverGetVersion         0.00%       4.996us         0.00%       4.996us       0.172us       4.032us         0.04%       4.032us       0.139us            29
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_0         0.00%       0.000us         0.00%       0.000us       0.000us       4.000us         0.04%       4.000us       4.000us             1
(EngineCore_DP0 pid=748665)                          Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us       3.936us         0.04%       3.936us       3.936us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       3.808us         0.04%       3.808us       3.808us             1
(EngineCore_DP0 pid=748665)                                               aten::sub         0.01%      60.373us         0.03%     117.496us      39.165us       3.520us         0.03%       5.312us       1.771us             3
(EngineCore_DP0 pid=748665)                                      triton_red_fused_1         0.00%       0.000us         0.00%       0.000us       0.000us       3.519us         0.03%       3.519us       3.519us             1
(EngineCore_DP0 pid=748665) at::native::(anonymous namespace)::fill_reverse_indi...         0.00%       0.000us         0.00%       0.000us       0.000us       2.880us         0.03%       2.880us       2.880us             1
(EngineCore_DP0 pid=748665)                                      aten::masked_fill_         0.00%      17.340us         0.01%      30.540us      30.540us       2.688us         0.03%       2.688us       2.688us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.688us         0.03%       2.688us       2.688us             1
(EngineCore_DP0 pid=748665) void at::native::elementwise_kernel<128, 2, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us       2.464us         0.02%       2.464us       2.464us             1
(EngineCore_DP0 pid=748665)                                      triton_poi_fused_2         0.00%       0.000us         0.00%       0.000us       0.000us       2.336us         0.02%       2.336us       2.336us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       2.240us         0.02%       2.240us       2.240us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceRadixSortExclusiveSu...         0.00%       0.000us         0.00%       0.000us       0.000us       2.176us         0.02%       2.176us       2.176us             1
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.792us         0.02%       1.792us       1.792us             1
(EngineCore_DP0 pid=748665) void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.728us         0.02%       1.728us       1.728us             1
(EngineCore_DP0 pid=748665)                        Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.536us         0.01%       1.536us       1.536us             1
(EngineCore_DP0 pid=748665)                                 Activity Buffer Request         0.33%       1.415ms         0.33%       1.415ms       1.415ms       1.504us         0.01%       1.504us       1.504us             1
(EngineCore_DP0 pid=748665) void at_cuda_detail::cub::DeviceScanInitKernel<at_cu...         0.00%       0.000us         0.00%       0.000us       0.000us       1.343us         0.01%       1.343us       1.343us             1
(EngineCore_DP0 pid=748665)                                             aten::fill_         0.01%      37.675us         0.02%      97.573us      10.841us       1.312us         0.01%       2.624us       0.292us             9
(EngineCore_DP0 pid=748665) void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us       1.312us         0.01%       1.312us       1.312us             1
(EngineCore_DP0 pid=748665)                                  execute_new_1_cached_0         1.24%       5.328ms         3.86%      16.535ms      16.535ms       0.000us         0.00%       1.126ms       1.126ms             1
(EngineCore_DP0 pid=748665)                                             aten::empty         0.05%     232.853us         0.05%     232.853us       3.105us       0.000us         0.00%       0.000us       0.000us            75
(EngineCore_DP0 pid=748665)                                                aten::to         0.00%      20.043us         0.06%     276.869us      18.458us       0.000us         0.00%      17.600us       1.173us            15
(EngineCore_DP0 pid=748665)                                        aten::lift_fresh         0.00%       5.186us         0.00%       5.186us       0.519us       0.000us         0.00%       0.000us       0.000us            10
(EngineCore_DP0 pid=748665)                                           aten::detach_         0.00%       2.036us         0.00%       2.036us       0.291us       0.000us         0.00%       0.000us       0.000us             7
(EngineCore_DP0 pid=748665)                                            aten::expand         0.04%     160.158us         0.05%     226.821us       2.493us       0.000us         0.00%       0.000us       0.000us            91
(EngineCore_DP0 pid=748665)                                        aten::as_strided         0.05%     233.168us         0.05%     233.168us       0.918us       0.000us         0.00%       0.000us       0.000us           254
(EngineCore_DP0 pid=748665)                                               aten::max         0.01%      27.310us         0.02%      84.039us      16.808us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                        aten::contiguous         0.00%       7.237us         0.01%      59.929us      11.986us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                             aten::clone         0.00%      15.910us         0.01%      52.692us      10.538us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                        aten::empty_like         0.01%      64.137us         0.06%     256.336us       7.324us       0.000us         0.00%       0.000us       0.000us            35
(EngineCore_DP0 pid=748665)                                               aten::add         0.00%      13.370us         0.00%      13.370us       2.674us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                              aten::item         0.00%       9.374us         0.00%      12.668us       2.534us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                               aten::_local_scalar_dense         0.00%       3.294us         0.00%       3.294us       0.659us       0.000us         0.00%       0.000us       0.000us             5
(EngineCore_DP0 pid=748665)                                               aten::cat         0.00%      20.270us         0.01%      39.340us      19.670us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                            aten::narrow         0.00%      12.987us         0.00%      19.070us       9.535us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                             aten::slice         0.06%     239.412us         0.08%     326.962us       3.336us       0.000us         0.00%       0.000us       0.000us            98
(EngineCore_DP0 pid=748665)                                         cudaMemcpyAsync         0.04%     188.137us         0.04%     188.137us      13.438us       0.000us         0.00%       0.000us       0.000us            14
(EngineCore_DP0 pid=748665)                                             aten::alias         0.00%       5.078us         0.00%       5.078us       2.539us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                           aten::flatten         0.00%       7.492us         0.00%      11.670us      11.670us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                                              aten::view         0.00%       8.955us         0.00%       8.955us       2.985us       0.000us         0.00%       0.000us       0.000us             3
(EngineCore_DP0 pid=748665)                                         aten::expand_as         0.00%       5.301us         0.00%      13.241us       6.621us       0.000us         0.00%       0.000us       0.000us             2
(EngineCore_DP0 pid=748665)                                        cudaLaunchKernel         0.19%     830.662us         0.27%       1.153ms      13.410us       0.000us         0.00%     160.251us       1.863us            86
(EngineCore_DP0 pid=748665)                                          aten::_to_copy         0.01%      30.874us         0.06%     256.826us      51.365us       0.000us         0.00%      17.600us       3.520us             5
(EngineCore_DP0 pid=748665)                                     aten::empty_strided         0.05%     219.729us         0.05%     219.729us       6.463us       0.000us         0.00%       0.000us       0.000us            34
(EngineCore_DP0 pid=748665)                                         aten::embedding         0.00%      12.115us         0.03%     135.502us     135.502us       0.000us         0.00%      14.016us      14.016us             1
(EngineCore_DP0 pid=748665)                                           aten::resize_         0.00%       6.732us         0.00%       6.732us       6.732us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                                       Pregraph bytecode         0.04%     176.615us         0.04%     176.615us     176.615us       0.000us         0.00%       0.000us       0.000us             1
(EngineCore_DP0 pid=748665)                     vllm::unified_attention_with_output         0.54%       2.306ms         1.76%       7.555ms     269.834us       0.000us         0.00%     426.872us      15.245us            28
(EngineCore_DP0 pid=748665)                                            aten::unbind         0.03%     119.471us         0.07%     295.443us      10.552us       0.000us         0.00%       0.000us       0.000us            28
(EngineCore_DP0 pid=748665)                                            aten::select         0.03%     121.132us         0.04%     184.344us       3.234us       0.000us         0.00%       0.000us       0.000us            57
(EngineCore_DP0 pid=748665)                                  cudaDeviceGetAttribute         0.01%      28.127us         0.01%      28.127us       0.485us       0.000us         0.00%       0.000us       0.000us            58
(EngineCore_DP0 pid=748665)                                    cudaFuncSetAttribute         0.01%      46.421us         0.78%       3.349ms     119.621us       0.000us         0.00%      25.728us       0.919us            28
(EngineCore_DP0 pid=748665) -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms
(EngineCore_DP0 pid=748665)

@david6666666
Collaborator

As discussed in the weekly meeting, we have these TODOs under this pull request:

1. Rebase: Recent updates introduce an omni_base class that is subclassed by the omni and async_omni classes. We can place start_profile() and stop_profile() in omni_base instead of replicating the same functions in the two child classes.

2. API design: To help vLLM developers transition to vLLM-Omni, we should stick to the vLLM profiler's API design. https://docs.vllm.ai/en/stable/contributing/profiling/#profile-with-pytorch-profiler

  • Offline Inference:

    llm = LLM(
        model="xx",
        tensor_parallel_size=1,
        profiler_config={
            "profiler": "torch",
            "torch_profiler_dir": "./vllm_profile",
        },
    )

    llm.start_profile()
    outputs = llm.generate(prompts, sampling_params)
    llm.stop_profile()
  • Online Serving:
vllm serve meta-llama/Llama-3.1-8B-Instruct --profiler-config '{"profiler": "torch", "torch_profiler_dir": "./vllm_profile"}'

3. Diffusion Pipeline Support: The diffusion engine is newly introduced in Omni and should get implementations of start_profile() and stop_profile(). We can reuse vLLM's profiler module. https://github.com/vllm-project/vllm/tree/main/vllm/profiler

4. Provide examples: To align with vLLM, we can provide test files specifically for offline inference profiling. I recommend trying small models like Qwen2.5-Omni-7B/3B for Omni and Z-Image for Diffusion.

5. Documentation: Provide developers with a user guide at https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/profiling/.

I think the Diffusion Pipeline does not support the profiler yet; we should add start_profile() and stop_profile() there.
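The omni_base refactor from item 1 could be sketched roughly as below. This is only an illustration: the class and method names follow the comment, the stage-task strings follow the OmniStageTaskType enum in stage_utils.py, and the transport is stubbed out (the real code would forward tasks over the existing IPC channel).

```python
class OmniBase:
    """Hypothetical shared base for Omni and AsyncOmni (sketch only)."""

    def __init__(self):
        # Stand-in for the real IPC channel: just record what would be sent.
        self.sent = []

    def _send_profiler_task(self, task_type, stages):
        for stage in stages:
            self.sent.append((task_type, stage))

    def start_profile(self, stages=None):
        targets = stages if stages is not None else self.all_stages()
        self._send_profiler_task("profiler_start", targets)

    def stop_profile(self, stages=None):
        targets = stages if stages is not None else self.all_stages()
        self._send_profiler_task("profiler_stop", targets)

    def all_stages(self):
        raise NotImplementedError


class Omni(OmniBase):
    # Stub subclass; the real Omni would discover its stages at init time.
    def all_stages(self):
        return [0, 1, 2]
```

With this shape, both the sync and async entrypoints inherit the two methods and only override stage discovery and the send path.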

@hsliuustc0106
Collaborator

I tested vLLM's profiler with the model Qwen/Qwen2.5-Omni-7B (thinker only) and got the results below, which explain why the CUDA time is much smaller than the CPU time. So yes, it's accurate, since we're reusing vLLM's profiler. Also, setting VLLM_PROFILER_MAX_ITERS=1 reduces the CUDA time considerably.

(vllm) Original:

(EngineCore_DP0 pid=719568) Self CPU time total: 249.078ms
(EngineCore_DP0 pid=719568) Self CUDA time total: 184.041ms

(vllm) After setting export VLLM_PROFILER_MAX_ITERS=1:

(EngineCore_DP0 pid=748665) Self CPU time total: 428.648ms
(EngineCore_DP0 pid=748665) Self CUDA time total: 10.401ms

BUT, I'm confused about why the trace file is far bigger than plain vLLM's under the current implementation. Looking at the pictures, compared with vLLM, the vLLM-Omni profiler records the additional parts highlighted in the red boxes, which account for the majority of the trace file. @lishunyang12 Could you also help take a look at these additional records?

vllm-omni: (trace screenshot)

vllm: (trace screenshot)

(further details collapsed)

Please submit a new issue for the weekly meeting and let's discuss it on Wednesday.

Comment on lines +1253 to +1254
elif await handle_profiler_task_async(task_type):
pass # Profiler command handled
Contributor

The logic here is a bit confusing. The type check should live in the elif line itself, and the handler should then just run for those task types. I think the code here should be like:

elif is_profiling_task(task_type):
    await handle_profiling_task_async(task_type)
    continue
...

Contributor

@gcanlin WDYT

Collaborator Author

Makes sense. I will modify it later. Thanks!

Comment on lines +669 to +670
if handle_profiler_task(task_type):
continue
Contributor

I think the code here has the same issue as in the async method:

if is_profiling_task(task_type):
    handle_profiling_task(task_type)
    continue
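The is_profiling_task / handle_profiling_task split suggested in these review comments could be backed by a small predicate over the task enum. A sketch: the enum values are copied from stage_utils.py as quoted in the PR description, while the helper bodies are assumptions standing in for the real worker-side profiler calls.

```python
import enum


class OmniStageTaskType(enum.Enum):
    GENERATE = "generate"
    ABORT = "abort"
    SHUTDOWN = "shutdown"
    PROFILER_START = "profiler_start"
    PROFILER_STOP = "profiler_stop"


_PROFILER_TASKS = {
    OmniStageTaskType.PROFILER_START,
    OmniStageTaskType.PROFILER_STOP,
}


def is_profiling_task(task_type: OmniStageTaskType) -> bool:
    """Predicate used in the worker loop's elif branch."""
    return task_type in _PROFILER_TASKS


def handle_profiling_task(task_type: OmniStageTaskType) -> str:
    # Stand-in for starting/stopping the torch profiler on this worker.
    if task_type is OmniStageTaskType.PROFILER_START:
        return "started"
    return "stopped"
```

With this shape, the loop reads `if is_profiling_task(task_type): handle_profiling_task(task_type); continue`, so the type check stays visible at the branch rather than being hidden inside the handler's return value.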

Comment on lines +380 to +382
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))
if profiler_enabled:
omni_llm.start_profile(stages=[0])
Contributor

We should add guidance in the docs for users on how to start/stop profiling in their own codebase.

Contributor

OK, noted.

Collaborator Author

See #570.

print(f"Request ID: {request_id}, Saved audio to {output_wav}")

processed_count += len(stage_outputs.request_output)
if profiler_enabled and processed_count >= total_requests:
Contributor

Why should we stop profiling once processed_count >= total_requests?

Contributor

[In omni](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/entrypoints/omni.py)

Omni kills the workers once we exit the loop, so I call stop_profile() just before we are about to exit it.
(screenshot)
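That loop shape can be sketched as follows. The names mirror the example script, but `omni`, the generator, and the output objects are all stubs here, not the real API.

```python
def consume_outputs(omni, generator, total_requests, profiler_enabled):
    """Consume stage outputs; stop profiling once every request has been
    processed, i.e. just before the loop exits and workers are torn down.
    Sketch only: `omni` and the outputs are stand-ins for the real objects."""
    processed_count = 0
    for stage_outputs in generator:
        processed_count += len(stage_outputs.request_output)
        if profiler_enabled and processed_count >= total_requests:
            # Last chance to flush traces before the workers are killed.
            omni.stop_profile()
    return processed_count
```

Calling stop_profile() after the loop would be too late in this design, because worker teardown begins as soon as the loop exits.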

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@lishunyang12
Contributor

lishunyang12 commented Jan 5, 2026

I have investigated the trace logs and confirmed that the large file size and high CPU time are expected artifacts of the MultiprocExecutor used in vllm-omni.

  • Standard vLLM typically uses UniProcExecutor (single process), which has zero IPC overhead.
  • vLLM-Omni forces MultiprocExecutor (separate processes), so the worker process sits in a busy loop waiting for shared-memory synchronization. This "wait" time is recorded as CPU activity.

Why?
vllm-omni deals with multimodal inputs (audio and video), and uses multiple processes to prevent video/audio decoding from freezing the GPU(s).

The profiler is accurate; we are seeing the cost of process synchronization (multiprocessing, shm_broadcast), not model execution (CUDA graphs, GEMM, FlashAttention).

Reference: the logic for switching between UniProcExecutor and MultiprocExecutor is defined here: https://github.com/vllm-project/vllm/blob/main/vllm/v1/executor/abstract.py#L26

Breakdown of the profiling results:

| Metric         | Time     | % of Total |
| -------------- | -------- | ---------- |
| Execution Time | 10.4 ms  | ~2.4%      |
| Waiting Time   | 418.2 ms | ~97.6%     |
| Total Latency  | 428.6 ms | 100%       |
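The execution/waiting split in the table is just the difference between the two profiler totals reported earlier (Self CPU time 428.648 ms vs. Self CUDA time 10.401 ms); the variable names below are purely illustrative.

```python
self_cpu_ms = 428.648   # "Self CPU time total" from the profiler summary
self_cuda_ms = 10.401   # "Self CUDA time total"

# CPU time not backed by CUDA work is (mostly) synchronization wait.
waiting_ms = self_cpu_ms - self_cuda_ms
exec_pct = 100 * self_cuda_ms / self_cpu_ms

print(f"waiting ~= {waiting_ms:.1f} ms, execution ~= {exec_pct:.1f}% of total")
# -> waiting ~= 418.2 ms, execution ~= 2.4% of total
```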

Component Analysis

| Component          | Time Taken | Explanation |
| ------------------ | ---------- | ----------- |
| aten::sort         | 410.9 ms   | The CPU initiates the sort for token sampling but blocks here while waiting for the GPU and the other processes to sync. It is not actually computing for 400 ms; it is sleeping. |
| Runtime loading    | 17.7 ms    | One-time cost of loading dynamic libraries; this disappears after the first run. |
| GEMM (matrix math) | 8.3 ms     | The core linear layers of the model. |
| Flash Attention    | 0.3 ms     | FlashAttention-2. |

Why aten::sort?
The 410 ms you see is wait time, not sort time.

In the vLLM sampling phase, the CPU often needs to sort the logits to pick the next token.

  • The workflow, according to the metrics table:
    1. The GPU calculates the logits (takes ~10 ms).
    2. The CPU calls aten::sort to pick the best token.
    3. The bottleneck: to sort the data, the CPU must wait for the GPU to finish calculating and transfer the data.

The illusion: the profiler starts the clock when aten::sort is called. The CPU then sits there doing nothing while waiting for the GPU or the shared-memory sync. Once the data arrives, the sort finishes almost instantly.
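The attribution effect described here can be reproduced without a GPU. In the CPU-only analogy below, a "sort" step must first wait for a producer thread, and nearly all of the measured time turns out to be the wait, not the sort itself. This is only an analogy for the GPU/shared-memory sync, not vLLM code.

```python
import queue
import threading
import time


def producer(q: "queue.Queue") -> None:
    time.sleep(0.2)                     # stands in for GPU compute / shm sync
    q.put(list(range(1000, 0, -1)))     # "logits" arrive in reverse order


q: "queue.Queue" = queue.Queue()
threading.Thread(target=producer, args=(q,)).start()

t0 = time.perf_counter()
data = q.get()                          # the "aten::sort" clock starts here...
wait_s = time.perf_counter() - t0

t1 = time.perf_counter()
data.sort()                             # ...but the actual sort is near-instant
sort_s = time.perf_counter() - t1

print(f"wait {wait_s * 1e3:.0f} ms vs sort {sort_s * 1e3:.2f} ms")
```

A profiler that starts timing at the blocking call would charge all of `wait_s` to the sort, exactly as the aten::sort row in the table does.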

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator Author

gcanlin commented Jan 5, 2026

Thanks for investigating, @lishunyang12. IMO, it's hard to avoid tracing shm_broadcast.py:dequeue if we want to reuse vLLM's profiler. Even though the trace file is large (~70 MB), the current profiler still works. So I suggest we skip this issue temporarily and merge this PR first (once the other issues are fixed). I will open an issue to track it; after deeper discussion, I think we will find ways to fix it in a follow-up PR. @hsliuustc0106
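Until the extra records can be avoided at the source, one possible workaround is to post-process the exported Chrome trace and drop the synchronization events. The sketch below assumes the standard Chrome trace-event layout (a top-level "traceEvents" list of dicts with a "name" field); the shm_broadcast substring filter is an assumption about what the noisy events are called.

```python
import gzip
import json


def filter_trace(in_path: str, out_path: str,
                 drop_substrings=("shm_broadcast",)) -> int:
    """Rewrite a *.pt.trace.json.gz file, dropping events whose name
    contains any of the given substrings. Returns the number of kept events."""
    with gzip.open(in_path, "rt") as f:
        trace = json.load(f)
    events = trace.get("traceEvents", [])
    kept = [ev for ev in events
            if not any(s in str(ev.get("name", "")) for s in drop_substrings)]
    trace["traceEvents"] = kept
    with gzip.open(out_path, "wt") as f:
        json.dump(trace, f)
    return len(kept)
```

This keeps the trace loadable in Perfetto while shrinking the file, at the cost of hiding the synchronization timeline entirely.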

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

lgtm

@hsliuustc0106 hsliuustc0106 merged commit 52afcb6 into vllm-project:main Jan 5, 2026
7 checks passed
@hsliuustc0106
Collaborator

Thanks. That said, we have found a lot of problems with profiling along the way, so please open new issues for them.

@wuhang2014
Contributor


Can we get a similar time cost if we hack the code to force UniProcExecutor? @gcanlin @lishunyang12
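One hedged way to test that hypothesis: vLLM selects the executor class from the distributed_executor_backend setting (see the abstract.py link above), so forcing the single-process backend there could give a comparison point. Whether the "uni" backend is accepted unchanged in a vllm-omni build is an assumption to verify; this is a configuration sketch, not a tested recipe.

```python
# Sketch: force the single-process executor to compare profiler wait time.
# Assumes vLLM's `distributed_executor_backend` knob is honored here.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-Omni-7B",
    distributed_executor_backend="uni",  # instead of the default multiproc path
)
```

If the aten::sort wall time collapses under this configuration, that would confirm the wait-time explanation above.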

tzhouam pushed a commit to tzhouam/vllm-omni that referenced this pull request Jan 6, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
sniper35 pushed a commit to sniper35/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>
ZJY0516 pushed a commit to LawJarp-A/vllm-omni that referenced this pull request Jan 10, 2026
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Co-authored-by: lishunyang <lishunyang12@163.com>

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Support torch profiler

6 participants