Description
Hi vLLM geniuses @WoosukKwon @zhuohan123,
After reading the Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning blog post and the llama-cuda-graph-example repo by Fireworks.ai's @jamesr66a, a few points stood out to me:
CUDA graphs address all of the sources of CPU overhead highlighted in the post: user-written logic, PyTorch dispatcher logic, memory allocation overhead, and GPU driver/kernel launch overhead.
Incremental generation is therefore often limited by CPU speed, which makes it a good candidate for CUDA graphs.
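For reference, this is roughly the capture-and-replay pattern PyTorch exposes through `torch.cuda.CUDAGraph`. A minimal sketch, using a plain `Linear` layer as a stand-in for a real decode step (the model, shapes, and loop here are illustrative assumptions, not vLLM code):

```python
import torch

# Placeholder decode step standing in for one incremental-generation
# forward pass; any fixed-shape callable is captured the same way.
model = torch.nn.Linear(4096, 4096).cuda().eval()

# Static buffer: a replayed graph reuses fixed memory addresses, so new
# data must be copied into this tensor rather than passed as a fresh one.
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so capture doesn't record one-time setup work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture a single decode step into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Per-token loop: refresh the static buffer, then replay. One replay()
# call launches the entire captured kernel sequence, skipping Python,
# dispatcher, and allocator overhead on every step.
for _ in range(10):
    static_input.copy_(torch.randn(1, 4096, device="cuda"))
    g.replay()
    logits = static_output.clone()  # read results out after replay
```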
While the sequence length grows every iteration under both the regular attention mechanism and the PagedAttention scheme, PagedAttention has a unique advantage when integrating with CUDA graphs: its block tables can be padded to a fixed maximum size, so the tensor shapes the captured kernels see stay static across replays.
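To illustrate why that helps, a hypothetical sketch of the padding idea (the `pad_block_table` helper and `MAX_BLOCKS_PER_SEQ` value are made-up names for illustration, not vLLM or Fireworks APIs):

```python
import torch

MAX_BLOCKS_PER_SEQ = 64  # assumed capture-time maximum (hypothetical value)

def pad_block_table(block_ids: list[int], pad_id: int = 0) -> torch.Tensor:
    """Pad one sequence's KV-cache block table to a fixed width.

    Because the padded table always has shape (MAX_BLOCKS_PER_SEQ,), a
    CUDA graph captured once can be replayed on every decode step; only
    the buffer *contents* change between steps, never the shape.
    """
    assert len(block_ids) <= MAX_BLOCKS_PER_SEQ
    table = torch.full((MAX_BLOCKS_PER_SEQ,), pad_id, dtype=torch.int32)
    table[: len(block_ids)] = torch.tensor(block_ids, dtype=torch.int32)
    return table

# Step 5 of decoding might use 2 KV-cache blocks and step 50 might use 7,
# but the tensor copied into the graph's static buffer has the same
# fixed shape either way.
static_block_table = pad_block_table([3, 17]).cuda()
```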
And their benchmark shows the payoff: without CUDA graphs, LLaMA-7B inference executes at 30 tokens/sec, but with CUDA graphs enabled it reaches 69 tokens/sec, a 2.3x speedup.
It may be worth referring to these examples and porting similar optimizations to vLLM. Cheers.