
CUDA Graph support #914

Closed

Description

@zhyncs

Hi vLLM geniuses @WoosukKwon @zhuohan123,

After reading *Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning* and the llama-cuda-graph-example by Fireworks.ai's @jamesr66a, a few points stood out:

> CUDA graphs address all sources of CPU overhead highlighted above: user-written logic, PyTorch dispatcher logic, memory allocation overhead, and GPU driver/kernel overhead.
>
> Thus, incremental generation can be limited by the CPU speed and thus is a good candidate for CUDA graphs.
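
For concreteness, here is a minimal sketch of the capture/replay pattern using PyTorch's `torch.cuda.CUDAGraph` API; the `Linear` layer, buffer sizes, and warmup count are placeholders of my own, not vLLM or Fireworks code. The whole kernel sequence of one decode-like step is recorded once, then relaunched with a single CPU-side call, skipping Python logic, the dispatcher, and per-kernel launch overhead on every subsequent step:

```python
import torch

# Placeholder model standing in for one incremental-generation step.
model = torch.nn.Linear(4096, 4096).cuda()

# Graph replay reuses fixed memory addresses, so inputs and outputs
# must live in static buffers that are updated in place between replays.
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so caching-allocator state settles before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one iteration's full kernel sequence into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Per-token loop: copy fresh data into the static buffer, then replay.
# g.replay() is one CPU call no matter how many kernels were captured.
for _ in range(10):
    static_input.copy_(torch.randn(1, 4096, device="cuda"))
    g.replay()
```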

> While both the regular attention mechanism and the PagedAttention scheme undergo shape changes over iterations, the latter provides a unique advantage when integrating with CUDA graphs.
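
The reason PagedAttention fits well, as I understand it: a captured graph requires fixed tensor shapes and addresses, and the paged KV cache lets every decode step consume fixed-shape inputs (one new token id per sequence plus an integer block table), with the indirection into the cache happening inside the kernel. A rough sketch of what such static inputs could look like; the sizing constants here are hypothetical, not vLLM's:

```python
import torch

MAX_BATCH = 8             # hypothetical padded batch size
MAX_BLOCKS_PER_SEQ = 64   # hypothetical block-table width

# Every decode iteration writes into the same fixed-shape buffers, so one
# captured graph can be replayed no matter how long sequences grow: growth
# is absorbed by the block table's *contents*, not its shape.
static_token_ids = torch.zeros(MAX_BATCH, dtype=torch.long, device="cuda")
static_block_tables = torch.zeros(
    MAX_BATCH, MAX_BLOCKS_PER_SEQ, dtype=torch.int32, device="cuda"
)
static_context_lens = torch.zeros(MAX_BATCH, dtype=torch.int32, device="cuda")

# By contrast, a contiguous KV cache tensor changes shape each step,
# which would invalidate the addresses baked into the captured graph.
```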

And from their benchmark:

> We find that without CUDA graphs, LLaMA-7B inference executes at 30 tokens/sec, but with CUDA graphs enabled it executes at 69 tokens/sec for a 2.3x speedup.

We could refer to this work and port similar optimizations to vLLM. Cheers.
