Description
On Oct 9, 2025, this llama.cpp pull request was merged into mainline: ggml-org#16391
I think it would be extremely useful to add it to KoboldCpp! In particular, having a GUI option for it would make a lot of sense.
What is the problem?
Currently, KoboldCpp faces performance bottlenecks in two key scenarios when the KV cache must be dropped and rebuilt:
- Multi-tasking in front-ends (e.g., SillyTavern, KoboldAI Lite): When a user switches from a long chat conversation to a different task (like "generate an image prompt for SD"), the server must clear the main chat's KV cache. When the user returns to the chat, the entire chat history must be re-processed from scratch, causing a long wait.
- AI Horde worker inefficiency: As a worker, KoboldCpp processes jobs from many different users. Each new job likely has a completely different context, forcing the server to constantly drop and rebuild the KV cache. This leads to high "ingest" times and reduces the worker's overall throughput and efficiency.
What is the proposed solution?
The llama.cpp backend has implemented a native feature (server_prompt_cache) that solves this exact problem. It functions as a server-side "Context Shift" or KV cache manager.
This feature allows the server to intelligently cache multiple KV states in memory.
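Conceptually, a cache like this keeps several saved prompt/KV states in host RAM and, when a new request arrives, restores the entry that shares the longest token prefix with the incoming prompt, so only the remaining tokens need to be ingested. Below is a minimal Python sketch of that idea; the class and method names are made up purely for illustration and do not reflect the actual llama.cpp server_prompt_cache implementation:

```python
# Minimal sketch of the idea behind server-side prompt caching.
# PromptCache / CachedState are hypothetical names for illustration only;
# this is not the actual llama.cpp implementation.

class CachedState:
    def __init__(self, tokens, kv_blob):
        self.tokens = tokens    # token ids the saved KV cache was built from
        self.kv_blob = kv_blob  # opaque bytes holding the saved KV state

class PromptCache:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # host-RAM budget, analogous to -cram (here in bytes)
        self.entries = []           # oldest first, newest last

    @staticmethod
    def _common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def lookup(self, prompt_tokens):
        """Find the cached state sharing the longest token prefix with the prompt."""
        best, best_len = None, 0
        for entry in self.entries:
            k = self._common_prefix_len(entry.tokens, prompt_tokens)
            if k > best_len:
                best, best_len = entry, k
        # The server only has to ingest prompt_tokens[best_len:] instead of everything.
        return best, best_len

    def store(self, tokens, kv_blob):
        """Save a request's state; evict the oldest entries when over the RAM budget."""
        self.entries.append(CachedState(tokens, kv_blob))
        while self._total_bytes() > self.max_bytes and len(self.entries) > 1:
            self.entries.pop(0)

    def _total_bytes(self):
        return sum(len(e.kv_blob) for e in self.entries)
```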
What are the benefits of adding this?
- Instant context switching (SillyTavern): For front-end users, this would make the experience seamless. The server could cache the main chat state, process the image prompt, and then instantly restore the chat state, eliminating re-processing delays.
- Massively improved AI Horde performance: This is a major benefit for Horde workers. The server could hold multiple user contexts in the cache. When a new job arrives that matches a cached context (e.g., a follow-up request from the same user), the worker can skip the entire processing/ingest phase and begin generation immediately. This would dramatically increase the worker's job speed and efficiency.
- Better resource utilization: Instead of wasting cycles constantly re-processing prompts, the server can use its VRAM and memory to hold ready-to-use contexts.
How does it work?
- `-cram 8192`: use 8192 MiB of host RAM for caching prompts (default value)
- `-cram -1`: use as much host RAM as is available (i.e. no limit)
- `-cram 0`: disable prompt caching in RAM
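For example, assuming the flag spelling from the merged PR shown above, a llama.cpp server could be started with a 16 GiB prompt-cache budget via something like `llama-server -m model.gguf -cram 16384` (the model path here is just a placeholder). That budget is exactly the kind of setting a KoboldCpp GUI option could expose.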
This is a highly requested feature in other backends as well; see, for example, the Ollama issue tracker: ollama/ollama#8577