# Qwen-Image's DiT inference with TensorRT-10
The project was tested in the following environment:
- Ubuntu 18.04
- NVIDIA Driver 525.125.06
- CUDA 11.8
- Python 3.10.18
- PyTorch 2.6.0+cu118
- Diffusers 0.36.0.dev0
- ONNX 1.19.0
- TensorRT 10.13.0.35
- cudnn-frontend 1.14.1
```bash
# Create conda env
conda create -n qwen-image python=3.10
conda activate qwen-image
# Install PyTorch
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# Install Diffusers
pip install git+https://github.com/huggingface/diffusers.git@fc337d585309c4b032e8d0180bea683007219df1
# Install ONNX
pip install onnx==1.19.0
# Install TensorRT
wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.13.0/tars/TensorRT-10.13.0.35.Linux.x86_64-gnu.cuda-11.8.tar.gz
tar -xf TensorRT-10.13.0.35.Linux.x86_64-gnu.cuda-11.8.tar.gz
pip install TensorRT-10.13.0.35/python/tensorrt-10.13.0.35-cp310-none-linux_x86_64.whl
export PATH=${PWD}/TensorRT-10.13.0.35/bin:$PATH
export LD_LIBRARY_PATH=${PWD}/TensorRT-10.13.0.35/lib:$LD_LIBRARY_PATH
# Install cudnn-frontend
# tensorrt-plugin is coming soon
```

Clone the project first:
```bash
git clone https://github.com/triple-Mu/Qwen-Image-TensorRT.git
cd Qwen-Image-TensorRT
```

Here are some scripts to test exporting ONNX:

```bash
python step_by_step/1-export-dit-directly.py --model_path Qwen/Qwen-Image --onnx_path transformer_step1.onnx
```

This script applies almost no modifications to the model, so the export fails with the following error:
File "/root/anaconda3/envs/qwen-image/lib/python3.10/site-packages/torch/onnx/_internal/jit_utils.py", line 308, in _create_node
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
RuntimeError: ScalarType ComplexFloat is an unexpected tensor scalar type
Since ONNX does not support complex operators, proceed to step 2.
```bash
python step_by_step/2-remove-complex-op.py --model_path Qwen/Qwen-Image --onnx_path transformer_step2.onnx
```

After removing self.pos_embed and replacing apply_rotary_emb_qwen with a real-valued equivalent, the export works fine.
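The replacement boils down to expanding the complex multiplication into real arithmetic. A minimal sketch (not necessarily the repo's exact code; it assumes the interleaved real/imaginary layout used by torch.view_as_complex, and the separate rope_real/rope_imag tensors that later appear as engine inputs):

```python
import torch

def apply_rotary_emb_real(x, rope_real, rope_imag):
    # x: (..., seq, dim) with even dim; rope_real/rope_imag: (seq, dim // 2).
    # Computes (x_even + i*x_odd) * (rope_real + i*rope_imag) with real
    # tensors only, so the ONNX exporter never sees a ComplexFloat value.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out_even = x_even * rope_real - x_odd * rope_imag
    out_odd = x_even * rope_imag + x_odd * rope_real
    # Re-interleave the two halves back into the original layout.
    return torch.stack((out_even, out_odd), dim=-1).flatten(-2)
```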
```bash
python step_by_step/3-merge-qkv-projection.py --model_path Qwen/Qwen-Image --onnx_path transformer_step3.onnx
```

Advanced: merging the Q/K/V projections into a single GEMM reduces kernel launches and increases throughput.
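The idea is that the three projections share the same input, so their weights can be concatenated into one larger matrix multiply. A rough sketch, assuming Diffusers-style to_q/to_k/to_v Linear layers (attribute names may differ in the actual script):

```python
import torch
from torch import nn

@torch.no_grad()
def merge_qkv(attn: nn.Module) -> nn.Linear:
    # Fuse three (in, out) projections into one (in, 3 * out) GEMM.
    q, k, v = attn.to_q, attn.to_k, attn.to_v
    qkv = nn.Linear(
        q.in_features,
        q.out_features + k.out_features + v.out_features,
        bias=q.bias is not None,
        dtype=q.weight.dtype,
        device=q.weight.device,
    )
    qkv.weight.copy_(torch.cat([q.weight, k.weight, v.weight], dim=0))
    if q.bias is not None:
        qkv.bias.copy_(torch.cat([q.bias, k.bias, v.bias], dim=0))
    return qkv  # at runtime: q, k, v = qkv(x).chunk(3, dim=-1)
```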
```bash
python step_by_step/4-cudnn-attention-plugin.py --model_path Qwen/Qwen-Image --onnx_path transformer_step4.onnx
```

Advanced: replacing sdpa with cudnn-attention yields a significant improvement on the A100 GPU.
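One common way to place such a plugin node into the ONNX graph with the TorchScript exporter is an autograd.Function with a symbolic method. A sketch under that assumption (the op name trt::CudnnAttention is illustrative, not necessarily what the script emits):

```python
import torch
import torch.nn.functional as F

class CudnnAttentionOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        # Eager/tracing fallback: plain scaled dot-product attention.
        return F.scaled_dot_product_attention(q, k, v)

    @staticmethod
    def symbolic(g, q, k, v):
        # During ONNX export, emit a single custom-domain node that the
        # TensorRT plugin of the same name can be bound to at build time.
        return g.op("trt::CudnnAttention", q, k, v)

# usage inside the attention module: out = CudnnAttentionOp.apply(q, k, v)
```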
After converting QwenImageTransformer2DModel to ONNX, the TensorRT engine can be built with trtexec.
Refer to 2-build_engine.sh.
Set TENSORRT_ROOT, ONNX_PATH and ENGINE_PATH first; the min/opt/max shapes can also be adjusted to your needs.
Then run:

```bash
bash scripts/2-build_engine.sh
```

The following log output will be shown:

```text
[09/07/2025-21:42:26] [I] === Trace details ===
[09/07/2025-21:42:26] [I] Trace averages of 10 runs:
[09/07/2025-21:42:26] [I] Average on 10 runs - GPU latency: 1666.2 ms - Host latency: 1666.9 ms (enqueue 1663.95 ms)
[09/07/2025-21:42:26] [I]
[09/07/2025-21:42:26] [I] === Performance summary ===
[09/07/2025-21:42:26] [I] Throughput: 0.562059 qps
[09/07/2025-21:42:26] [I] Latency: min = 1656.22 ms, max = 1674.64 ms, mean = 1666.9 ms, median = 1667.89 ms, percentile(90%) = 1673.26 ms, percentile(95%) = 1674.64 ms, percentile(99%) = 1674.64 ms
[09/07/2025-21:42:26] [I] Enqueue Time: min = 1650.99 ms, max = 1672.49 ms, mean = 1663.95 ms, median = 1663.63 ms, percentile(90%) = 1672.08 ms, percentile(95%) = 1672.49 ms, percentile(99%) = 1672.49 ms
[09/07/2025-21:42:26] [I] H2D Latency: min = 0.631348 ms, max = 0.640015 ms, mean = 0.635217 ms, median = 0.635742 ms, percentile(90%) = 0.63623 ms, percentile(95%) = 0.640015 ms, percentile(99%) = 0.640015 ms
[09/07/2025-21:42:26] [I] GPU Compute Time: min = 1655.52 ms, max = 1673.94 ms, mean = 1666.2 ms, median = 1667.19 ms, percentile(90%) = 1672.56 ms, percentile(95%) = 1673.94 ms, percentile(99%) = 1673.94 ms
[09/07/2025-21:42:26] [I] D2H Latency: min = 0.0585938 ms, max = 0.0664062 ms, mean = 0.0639648 ms, median = 0.0644531 ms, percentile(90%) = 0.0654297 ms, percentile(95%) = 0.0664062 ms, percentile(99%) = 0.0664062 ms
[09/07/2025-21:42:26] [I] Total Host Walltime: 17.7917 s
[09/07/2025-21:42:26] [I] Total GPU Compute Time: 16.662 s
[09/07/2025-21:42:26] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[09/07/2025-21:42:26] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[09/07/2025-21:42:26] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/07/2025-21:42:26] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v101300] [b35] # trtexec --onnx=transformer_step2.onnx --saveEngine=transformer_step2.plan --bf16 --optShapes=hidden_states:1x6032x64,encoder_hidden_states:1x128x3584,timestep:1,img_rope_real:6032x64,img_rope_imag:6032x64,txt_rope_real:128x64,txt_rope_imag:128x64 --minShapes=hidden_states:1x3364x64,encoder_hidden_states:1x1x3584,timestep:1,img_rope_real:3364x64,img_rope_imag:3364x64,txt_rope_real:1x64,txt_rope_imag:1x64 --maxShapes=hidden_states:1x10816x64,encoder_hidden_states:1x1024x3584,timestep:1,img_rope_real:10816x64,img_rope_imag:10816x64,txt_rope_real:1024x64,txt_rope_imag:1024x64 --shapes=hidden_states:1x10816x64,encoder_hidden_states:1x1024x3584,timestep:1,img_rope_real:10816x64,img_rope_imag:10816x64,txt_rope_real:1024x64,txt_rope_imag:1024x64
After converting the ONNX model to an engine, the TensorRT transformer can be plugged into Diffusers' pipeline.
Refer to run_trt_pipeline.py.
Run:

```bash
python run_trt_pipeline.py --model_path Qwen/Qwen-Image --trt_path transformer_step2.engine
```

The example output image will be saved as example.png.
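For reference, driving a TensorRT-10 engine from Python with torch tensors roughly looks like the sketch below; run_trt_pipeline.py presumably wraps something along these lines as the pipeline's transformer. Output buffers must be pre-allocated by the caller, and the tensor names must match the engine I/O:

```python
import tensorrt as trt
import torch

class TRTTransformer:
    """Minimal TensorRT-10 runner sketch for the exported DiT engine."""

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, "rb") as f:
            self.engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    def __call__(self, **tensors: torch.Tensor):
        # tensors maps engine I/O names (hidden_states, encoder_hidden_states,
        # timestep, *_rope_real/_imag, plus the pre-allocated output) to
        # contiguous CUDA tensors.
        for name, t in tensors.items():
            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.context.set_input_shape(name, tuple(t.shape))
            self.context.set_tensor_address(name, t.data_ptr())
        stream = torch.cuda.current_stream()
        self.context.execute_async_v3(stream.cuda_stream)
        stream.synchronize()
```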
Build the CUDNN-Attention plugin:
Refer to build.sh.
Set TENSORRT_ROOT, CUDNN_ROOT and ARCH first.
Then run:

```bash
bash plugins/build.sh
```

Build the engine with the CUDNN-Attention plugin:
Refer to 4-build_engine_cudnn_attention.sh.
Inference:

```bash
python run_trt_pipeline.py --model_path Qwen/Qwen-Image --trt_path transformer_step4.engine --plugin_path ./libcudnn_attention_plugin.so
```
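Passing --plugin_path presumably just loads the shared library before the engine is deserialized, which in Python amounts to:

```python
import ctypes
import tensorrt as trt

# Loading the .so runs its static initializers, which register the custom
# attention plugin creator with TensorRT's global plugin registry.
ctypes.CDLL("./libcudnn_attention_plugin.so")
trt.init_libnvinfer_plugins(trt.Logger(trt.Logger.WARNING), "")
```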