
kv-cache : add SWA support #13194

Draft · wants to merge 1 commit into master

Conversation

@ggerganov (Member) commented Apr 29, 2025

target #12799

This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention (SWA) in order to reduce the memory usage for models such as Gemma3.

However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

The reason we cannot do context caching with SWA enabled is that when the window slides, we "forget" the old KV data and there is no way to recover it without recomputing it. This means no prefix caching in llama-server (OK, caching just the last prefix still works), no context shift, no context reuse, etc. So I am having some doubts about whether this is really worth supporting.
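
For illustration, a minimal sketch of the masking rule that causes this (a hypothetical helper, not the actual llama.cpp code):

```cpp
// Minimal sketch of the sliding-window mask rule (not the actual llama.cpp
// code; the exact boundary convention may differ by one position).
#include <cstdint>

// A query token at position pos_q may attend to a cached entry at position
// pos_k only if that entry lies inside the window of size n_swa.
static bool swa_can_attend(int32_t pos_q, int32_t pos_k, int32_t n_swa) {
    return pos_k <= pos_q && pos_q - pos_k < n_swa;
}

// Once the newest position in a sequence is pos_max, any entry with
// pos_k <= pos_max - n_swa can never be attended to by future tokens, so it
// can be dropped - but a rollback (prefix reuse, context shift) would need
// exactly those entries again, and they can only be recovered by recomputation.
```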

Any thoughts?

@slaren (Member) commented Apr 29, 2025

It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

@ngxson (Collaborator) commented Apr 29, 2025

> However, while working on this, I realized that enabling this option would prevent context caching, which IMO is a pretty big deal. So I am wondering if I am missing something.

Yes, this is what I have been thinking about for months now. There is no better solution than to disable context caching in this case.

An alternative is to let the user choose one of the two: either a proper SWA cache (good for memory) or a full-size allocation (good for reusing the cache).

> So I am having some doubts about whether this is really worth supporting.

I'm feeling 50/50 here. One of the biggest use cases would be processing a large and diverse set of documents locally. In that case, the user may never reuse the cache because each new request is a new document.

@ggerganov (Member, Author)

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

The way I am approaching it is to have the "KV cells" information maintained separately for the non-SWA and SWA layers. This way, upon each KV cache commit (see #12799), we can do a pass over the SWA cells and automatically remove those that have position pos < pos_max(seq_id) - n_swa. Note that such tokens are only pruned from the SWA cells, while they remain in the non-SWA cells. When constructing the KQ mask for the graph, we use the non-SWA cells to construct the kq_mask and the SWA cells to construct the kq_mask_swa.

The rest of the logic is the same - it just operates on both sets of cells. For example, find_slot searches in both the non-SWA and SWA cells.
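
A rough sketch of this pruning pass, using simplified placeholder structures rather than the actual llama.cpp types:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Simplified stand-in for a KV cell; not the actual llama.cpp structures.
struct kv_cell {
    int32_t pos = -1;         // position of the cached token, -1 = empty
    std::set<int32_t> seq_id; // sequences this cell belongs to
};

// Prune SWA cells that have fallen out of the window for every sequence they
// belong to. The corresponding non-SWA cells are left untouched, so the full
// history remains available to the regular attention layers.
static void prune_swa_cells(std::vector<kv_cell> & cells_swa,
                            const std::vector<int32_t> & pos_max, // max pos per seq_id
                            int32_t n_swa) {
    for (auto & cell : cells_swa) {
        if (cell.pos < 0) {
            continue; // empty cell
        }
        bool in_window = false;
        for (const int32_t s : cell.seq_id) {
            if (cell.pos >= pos_max[s] - n_swa) {
                in_window = true; // still needed by this sequence
                break;
            }
        }
        if (!in_window) {
            cell.pos = -1;        // free the cell in the SWA cache only
            cell.seq_id.clear();
        }
    }
}
```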

@JohannesGaessler (Collaborator)

My experience with the Gemma models in the context of Elo HeLLM has been that they required a disproportionate amount of computational resources to run benchmarks. The reason is that I was able to fit comparatively fewer parallel slots on 1 or 2 GPUs and my throughput was lower as a consequence. At least for my use case I value low memory usage for the context more than I value prompt caching because I have O(10000) short prompts and I'm bottlenecked mostly by generation throughput.

@ggerganov (Member, Author)

Continuing to think about the logic for when to discard tokens from the cache: it's indeed tricky and not very clear how to do it. For example, when doing speculative decoding, we can submit a draft batch with D tokens to the target model. If we apply the pruning logic from my previous comment strictly, this would cause us to "forget" the D-1 oldest tokens in the SWA layers, which, depending on whether the draft gets rejected, would be problematic. This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.
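
A minimal sketch of that relaxed rule (the helper name and signature are illustrative, not existing llama.cpp code):

```cpp
#include <cstdint>

// Relaxed pruning rule: keep an extra margin of n_batch positions beyond the
// window, so that a rejected draft batch can be rolled back without the SWA
// cache having already discarded the entries it still needs. This assumes the
// SWA cache is sized with some slack, e.g. about n_swa + 2*n_batch cells.
static bool should_prune_swa(int32_t pos_cell, int32_t pos_max_seq,
                             int32_t n_swa, int32_t n_batch) {
    return pos_cell < pos_max_seq - n_swa - n_batch;
}
```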

@ggerganov force-pushed the gg/llama-kv-cache-v6 branch from e37f112 to 7e4b545 on April 30, 2025 07:22
@ymcki (Contributor) commented Apr 30, 2025

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

I second slaren's opinion. As far as I know, vLLM also doesn't support iSWA, while HF transformers and Ollama do. vLLM is geared toward the multi-user server use case; I suppose that's why they don't support it.

Ideally, it should be implemented as a switch that lets the user choose which one to use. By default, iSWA should be on for llama-cli but off for llama-server.

@ngxson (Collaborator) commented Apr 30, 2025

> This makes me think that we should probably have some "extra room" in the SWA cache - for example n_swa + 2*n_batch. And the prune logic should be something like: pos < pos_max(seq_id) - n_swa - n_batch.

Yes, I was thinking about this too. I think this case can be a bit complicated to manage, but it is totally possible.

We can let the user specify how many tokens are allocated in the sliding layers. For example, given n_swa=512, if the llama_context is created with n_ctx=4096 and n_ctx_swa=1024, this allows the user to roll back as far as n_past - (1024 - 512).

We could further set n_ctx_swa = n_ctx * scale by default to make it transparent to the end user, with, say, scale=0.5 by default. If scale=-1, then n_ctx_swa = n_swa.

And finally, we may need to add an API that returns the furthest n_past the user can roll back to, maybe something like llama_kv_self_get_minimum_pos?
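
A sketch of how a caller might use such an API; llama_kv_self_get_minimum_pos is the proposed (not yet existing) function, and the removal call below mirrors the existing KV-cache sequence API, whose exact name may vary across llama.cpp versions:

```cpp
#include "llama.h"

// Proposed API (does not exist yet): smallest position that sequence seq_id
// can still be rolled back to without having lost SWA entries.
llama_pos llama_kv_self_get_minimum_pos(struct llama_context * ctx, llama_seq_id seq_id);

// Example: decide between reusing the cache and reprocessing the prompt.
static bool try_rollback(struct llama_context * ctx, llama_seq_id seq_id, llama_pos target_pos) {
    const llama_pos pos_min = llama_kv_self_get_minimum_pos(ctx, seq_id);
    if (target_pos < pos_min) {
        return false; // SWA entries already discarded - caller must reprocess the prompt
    }
    // drop everything from target_pos onwards and continue decoding from there
    llama_kv_self_seq_rm(ctx, seq_id, target_pos, -1);
    return true;
}
```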

@isaac-mcfadyen (Contributor) commented Apr 30, 2025

I'd +1 the ability to allow the user to switch.

Some use cases benefit greatly from prefix caching (example: on Metal systems with 48 GB of RAM/VRAM, where prompt processing is much slower than on non-Metal systems and we have plenty of VRAM anyway), so allowing the user to choose would be optimal.

@ExtReMLapin (Contributor)

> It's not very clear to me how to handle SWA with a unified cache where there may be multiple sequences, and it is not always obvious what tokens can be dropped from the cache. However I think it is definitely worth it for the single user case, which after all is the main use case of llama.cpp.

Is llama.cpp's single-user mode the most common use case because that's what the user base prefers, or is it like that because server performance goes down a lot with more than 3 users? (#10860)

We are really thankful for all the work you main contributors do on this project, but please do not fall into this "self-fulfilling prophecy" trap.

@aviallon (Contributor) commented May 1, 2025

I personally use llama.cpp as a server (with multiple users).
I wonder if we could do something hybrid between iSWA and what is currently done - for example, whether a partial KV cache offload could work, with the iSWA cache on the accelerator and the slower full cache in RAM.

@ggerganov force-pushed the gg/llama-kv-cache-v6 branch 2 times, most recently from 58115a2 to 7e79a42 on May 2, 2025 13:02
Base automatically changed from gg/llama-kv-cache-v6 to master on May 2, 2025 14:48
@Dampfinchen

According to the Gemma 3 paper, interleaved sliding-window attention reduces KV cache memory usage to roughly 1/5 of what full attention would need, so these models would be much easier to run; right now their KV cache is much heavier than that of comparable models.
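
As a rough back-of-the-envelope check of that figure, assuming the 5:1 ratio of local to global layers and the 1024-token sliding window described in the Gemma 3 paper, at a 32k context:

$$
\frac{5 \cdot 1024 + 1 \cdot 32768}{6 \cdot 32768} = \frac{37888}{196608} \approx 0.19 \approx \tfrac{1}{5}
$$

That is, the interleaved cache stores only about a fifth of the positions a full-attention cache would hold at the same context length.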

If the drawback is the absence of prompt caching, then indeed it would make sense to give the option to the user and let them decide on a per-use-case basis. For cases where you use RAG / a vector DB, it would prove very useful, as prompt caching does not work when the beginning of the context changes anyway. I would personally agree with Johannes here: faster token generation thanks to SWA would be more useful for me as well, since I'm using a vector DB.

So for the short-prompt/RAG use cases it would make a lot of sense. For simple chat use cases without any RAG, prompt caching would probably be faster overall than SWA with no prompt cache. Overall, I think having the option would be a great addition to llama.cpp.

If it helps, Ollama implemented iSWA support for Gemma 3. Since that project is pretty similar to llama.cpp, it may be useful for getting a rough idea of how to implement it (although Ollama is written in a different language): https://github.com/ollama/ollama/blob/2fec73eef6e9482f606f185ebb2ae4f75ad1a37c/model/models/gemma3/model_text.go#L190

I've been thinking, does Ollama support prompt caching? Since Gemma 3 SWA is supported in Ollama, how did they handle it?
