kv-cache : add SWA support #13194
Conversation
It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious which tokens can be dropped from the cache. However, I think it is definitely worth it for the single-user case, which after all is the main use case of llama.cpp.
Yes, this is what I have been thinking about for months now. There is no better solution than to disable context caching in this case. An alternative is to let the user choose one of the two: either a proper SWA cache (good for memory) or allocating the full cache (good for reusing the cache).
I'm feeling 50/50 here. One of the biggest use cases would be to process a large and diverse set of documents locally. In this case, the user may never reuse the cache because each new request is a new document.
The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those whose position has fallen outside the sliding window. The rest of the logic is the same - it just operates on both sets of cells.
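For illustration, a minimal sketch of that idea, assuming a single sequence; the types and the prune/commit names below are made up for the example and are not the actual llama.cpp structures:

```cpp
// Illustrative sketch only (not the real llama.cpp types): the unified cache
// keeps two independent sets of cells, one for the non-SWA layers and one for
// the SWA layers. After each commit, SWA cells whose position fell out of the
// window of the newest committed position are freed so the slots can be reused.
// A multi-sequence cache would have to track pos_max per sequence.
#include <cstdint>
#include <vector>

struct kv_cell {
    int32_t pos = -1; // token position, -1 means the cell is free
};

struct kv_cells {
    std::vector<kv_cell> cells;

    // free every cell that future tokens can no longer attend to
    void prune_swa(int32_t pos_max, int32_t n_swa) {
        for (auto & c : cells) {
            if (c.pos >= 0 && c.pos < pos_max - n_swa) {
                c.pos = -1;
            }
        }
    }
};

struct kv_cache_unified {
    kv_cells cells_full;   // non-SWA layers keep the whole context
    kv_cells cells_swa;    // SWA layers only need the last n_swa positions
    int32_t  n_swa = 1024; // window size (model dependent, e.g. Gemma3)

    void commit(int32_t pos_max) {
        // the non-SWA cells are untouched; only the SWA cells are pruned
        cells_swa.prune_swa(pos_max, n_swa);
    }
};
```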
My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching, because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput.
Continuing to think about the logic for when to discard tokens from the cache: it's indeed tricky and not very clear how to do. For example, when doing speculative decoding, we can submit a draft batch whose tokens may later be rejected and rolled back.
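To make the difficulty concrete, here is a toy sketch, with made-up numbers and a plain vector standing in for the SWA cells, of what goes wrong if the pruning happens before the draft is accepted:

```cpp
// Illustrative only: positions held by the SWA layers are modelled as a plain
// vector of ints; the real cache is more involved.
#include <algorithm>
#include <cstdint>
#include <vector>

// free cached positions that fell out of the window of pos_max
static void prune(std::vector<int32_t> & swa_pos, int32_t pos_max, int32_t n_swa) {
    swa_pos.erase(
        std::remove_if(swa_pos.begin(), swa_pos.end(),
                       [&](int32_t p) { return p < pos_max - n_swa; }),
        swa_pos.end());
}

int main() {
    const int32_t n_swa = 8;

    // the SWA layers currently hold positions 92..99
    std::vector<int32_t> swa_pos = {92, 93, 94, 95, 96, 97, 98, 99};

    // a draft batch adds positions 100..107
    for (int32_t p = 100; p <= 107; ++p) {
        swa_pos.push_back(p);
    }

    // pruning right away evicts everything below 107 - 8 = 99, i.e. 92..98
    prune(swa_pos, /*pos_max=*/107, n_swa);

    // if the target model then rejects the draft at position 101, we roll back
    // to position 100 - but the next token (101) would need positions 93..100
    // on the SWA layers, and 93..98 are gone and must be recomputed.
    // Deferring the prune until the draft is accepted ("committed") avoids this.
    return 0;
}
```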
Force-pushed from e37f112 to 7e4b545
I second slaren's opinion. As far as I know, vllm also doesn't support iSWA, while HF transformers and Ollama do. vllm is geared toward the multi-user server use case; I suppose that's why they don't support it. Ideally, it should be implemented as a switch to let the user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.
Yes, I was thinking about this too. I think it can be a bit complicated to manage this case, but it is totally possible. We can let the user specify how many tokens are allocated in the sliding layers, and we may need to add an API to return the furthest position that is still available in the SWA cache.
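As a rough sketch of how such a switch might size the two kinds of layers (the swa_cache_config type and the n_swa_keep parameter are hypothetical, not an existing llama.cpp option):

```cpp
// Hypothetical sizing logic (n_swa_keep is not an existing llama.cpp option):
// non-SWA layers always hold the full context, while the user chooses how many
// tokens to keep around in the sliding layers.
//   n_swa_keep == n_ctx -> full cache on every layer (prompt caching works)
//   n_swa_keep == n_swa -> minimal memory, no prefix reuse beyond the window
#include <algorithm>
#include <cstdint>

struct swa_cache_config {
    int32_t n_ctx;      // total context size
    int32_t n_swa;      // model's sliding window (e.g. 1024 for Gemma3)
    int32_t n_swa_keep; // user choice, clamped to [n_swa, n_ctx]

    int32_t size_full_layer() const { return n_ctx; }
    int32_t size_swa_layer()  const { return std::clamp(n_swa_keep, n_swa, n_ctx); }
};
```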
I'd +1 the ability to allow the user to switch. Some use cases benefit greatly from prefix caching (example: on Metal systems with 48GB of RAM/VRAM, where pp is much slower than non-Metal pp and we have plenty of VRAM anyway), so allowing the user to choose would be optimal.
Is llama.cpp's single-user mode the most-used case because that's what the user base prefers, or is it like that because server performance goes down a lot with more than 3 users? (#10860) We are really thankful for all the work you main contributors do on this project, but please do not fall into this "self-fulfilling prophecy" trap.
I personally use llama.cpp for server use (with multiple users).
Force-pushed from 58115a2 to 7e79a42
According to the Gemma3 paper, interleaved sliding-window attention reduces KV cache memory usage to roughly 1/5, so the model would be much easier to run; right now its KV cache is much heavier than that of comparable models. If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per-use-case basis.

I think for cases where you use RAG / a vector DB it would prove very useful, since prompt caching does not work when the beginning of the context changes anyway. I would personally agree with Johannes here: faster token generation thanks to SWA would be more useful for me as well, since I'm using a vector DB. So for the short-prompt / RAG use cases it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably make it faster overall compared to SWA with no prompt cache. Overall, I think having the option would be a great addition to llama.cpp.

If it helps, Ollama implemented iSWA support for Gemma 3. Since the project is pretty similar to llama.cpp, perhaps it's useful to get a rough idea of how to implement it (although Ollama is written in a different language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190

I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it?
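For reference, the core of what the linked Ollama code sets up is the sliding-window causal mask; a self-contained sketch of that rule (not copied from either project, and the exact off-by-one convention may differ between implementations):

```cpp
#include <cstdint>

// On a sliding-window layer a key at position j is visible to a query at
// position i iff it is not in the future and at most n_swa - 1 positions back;
// non-SWA layers use the plain causal condition j <= i.
bool swa_visible(int32_t i, int32_t j, int32_t n_swa) {
    return j <= i && i - j < n_swa;
}
```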
target #12799
This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.
However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.
The reason we cannot do context caching with SWA enabled is that when the window slides, we "forget" the old KV data and there is no way to recover it without recomputing it. This means no prefix cache in llama-server (ok, just last-prefix caching works), no context shift, no context reuse, etc. So I am having some doubts whether this is really worth supporting. Any thoughts?
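To make the limitation concrete, a small worked example assuming a window of 1024 tokens (all numbers are illustrative):

```cpp
#include <cassert>

int main() {
    const int n_swa  = 1024; // assumed sliding window
    const int n_past = 6000; // tokens already processed

    // oldest position still present on the SWA layers
    const int oldest_swa_pos = n_past - n_swa; // 4976

    // a new request sharing only a 3000-token prefix with the previous one
    const int shared_prefix = 3000;

    // the shared prefix ends before the oldest cached SWA position, so it
    // cannot be served from the cache and has to be recomputed; only
    // continuing the most recent context ("last-prefix" caching) still works
    assert(shared_prefix < oldest_swa_pos);
    return 0;
}
```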