Description
Hi vLLM geniuses @WoosukKwon @zhuohan123,
After reading the Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning blog post and the llama-cuda-graph-example repo by Fireworks.ai's @jamesr66a, a few points stood out to me:
CUDA graphs address all of the sources of CPU overhead highlighted in the post: user-written logic, PyTorch dispatcher logic, memory allocation overhead, and GPU driver/kernel launch overhead.
Incremental generation is therefore often limited by CPU speed, which makes it a good candidate for CUDA graphs.
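For reference, this is roughly the capture-and-replay pattern PyTorch exposes through `torch.cuda.CUDAGraph`. A minimal sketch, using a plain `Linear` layer as a stand-in for a real decode step (the model, shapes, and loop here are illustrative assumptions, not vLLM code):

```python
import torch

# Placeholder decode step standing in for one incremental-generation
# forward pass; any fixed-shape callable is captured the same way.
model = torch.nn.Linear(4096, 4096).cuda().eval()

# Static buffer: a replayed graph reuses fixed memory addresses, so new
# data must be copied into this tensor rather than passed as a fresh one.
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so capture doesn't record one-time setup work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture a single decode step into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Per-token loop: refresh the static buffer, then replay. One replay()
# call launches the entire captured kernel sequence, skipping Python,
# dispatcher, and allocator overhead on every step.
for _ in range(10):
    static_input.copy_(torch.randn(1, 4096, device="cuda"))
    g.replay()
    logits = static_output.clone()  # read results out after replay
```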
While the sequence length grows every iteration under both the regular attention mechanism and the PagedAttention scheme, PagedAttention has a unique advantage when integrating with CUDA graphs: its block tables can be padded to a fixed maximum size, so the tensor shapes the captured kernels see stay static across replays.
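To illustrate why that helps, a hypothetical sketch of the padding idea (the `pad_block_table` helper and `MAX_BLOCKS_PER_SEQ` value are made-up names for illustration, not vLLM or Fireworks APIs):

```python
import torch

MAX_BLOCKS_PER_SEQ = 64  # assumed capture-time maximum (hypothetical value)

def pad_block_table(block_ids: list[int], pad_id: int = 0) -> torch.Tensor:
    """Pad one sequence's KV-cache block table to a fixed width.

    Because the padded table always has shape (MAX_BLOCKS_PER_SEQ,), a
    CUDA graph captured once can be replayed on every decode step; only
    the buffer *contents* change between steps, never the shape.
    """
    assert len(block_ids) <= MAX_BLOCKS_PER_SEQ
    table = torch.full((MAX_BLOCKS_PER_SEQ,), pad_id, dtype=torch.int32)
    table[: len(block_ids)] = torch.tensor(block_ids, dtype=torch.int32)
    return table

# Step 5 of decoding might use 2 KV-cache blocks and step 50 might use 7,
# but the tensor copied into the graph's static buffer has the same
# fixed shape either way.
static_block_table = pad_block_table([3, 17]).cuda()
```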
And their benchmark shows the payoff: without CUDA graphs, LLaMA-7B inference executes at 30 tokens/sec, but with CUDA graphs enabled it reaches 69 tokens/sec, a 2.3x speedup.
It may be worth referring to these examples and porting similar optimizations to vLLM. Cheers.