[V1][WIP] Hybrid allocator for full attention & sliding window attention interleaved models (Reference PR, do not merge) #11938
This PR implements step 1 of #11382, adding a hybrid allocator for models that interleave full attention and sliding window attention layers.
Benchmark results (accelerates hybrid models, with very little overhead on standard full-attention models)
Key modifications
- `num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()` (profile memory usage)
- `self.model_executor.initialize(num_gpu_blocks)` (allocate KV cache)
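For context, a minimal sketch of the initialization order these two calls imply, with simplified signatures modeled on the lines quoted above (the real vLLM V1 engine code differs in details):

```python
# Sketch of the engine initialization order referenced above.
# EngineCore and the executor interface are simplified stand-ins,
# not the PR's actual classes.

class EngineCore:
    def __init__(self, model_executor):
        self.model_executor = model_executor

    def _initialize_kv_caches(self) -> int:
        # Step 1: profile a forward pass to see how much GPU memory is
        # left for the KV cache, expressed as a number of blocks.
        num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()

        # Step 2: allocate the KV cache tensors for that many blocks.
        self.model_executor.initialize(num_gpu_blocks)
        return num_gpu_blocks
```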
I plan to split it into the following PRs:

- `HybridKVCacheManager`, a pluggable alternative to `KVCacheManager`; it won't touch the code path for standard models with only full-attention layers (see the sketch below).
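A hedged sketch of what "pluggable" could look like: selecting the manager class from the model's attention layout. The factory function and the `has_sliding_window` flag are illustrative names, not the PR's actual wiring.

```python
# Hypothetical selection logic for the pluggable manager; names here
# are illustrative, not the PR's actual wiring.

class KVCacheManager:
    """Existing manager for models with only full-attention layers."""

class HybridKVCacheManager(KVCacheManager):
    """Drop-in alternative for models that interleave full attention
    and sliding window attention layers."""

def make_kv_cache_manager(has_sliding_window: bool) -> KVCacheManager:
    # Standard full-attention models keep the original code path;
    # hybrid models opt into the new manager.
    if has_sliding_window:
        return HybridKVCacheManager()
    return KVCacheManager()
```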