Describe the issue
When running llama.cpp with Sliding Window Attention (SWA) models (specifically Qwen3.5) in a RAG pipeline using dynamic context retrieval (e.g., bge-m3), the server enters a loop of continuous checkpoint invalidation.
The system attempts to validate cached checkpoints from previous requests against the current prompt. Because the retrieved document chunks change in length/content between requests, the token positions shift slightly. The new strict validation logic in the latest commit correctly identifies this mismatch as "invalid," deletes the checkpoint, and forces a full re-processing of the prompt.
This results in:
No actual benefit from cross-request caching (since it's always invalidated).
Significant log spam (repeated "erased invalidated context checkpoint" warnings).
Unnecessary CPU/GPU overhead re-computing the KV cache for the static parts of the prompt (System/User) on every single request.
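To illustrate the position shift described above, here is a minimal sketch with made-up token IDs. The IDs and the `common_prefix_len` helper are hypothetical and are not llama.cpp code; the point is only that once the retrieved documents differ, everything after the static system prefix sits at different positions, so a checkpoint taken past that prefix no longer lines up with the new prompt.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return how many leading tokens two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical token IDs, for illustration only.
system = [1, 10, 11, 12]      # static system prompt
docs_a = [100, 101, 102]      # documents retrieved for request 1
docs_b = [200, 201]           # documents retrieved for request 2
question = [50, 51, 52]       # static question template

prev_prompt = system + docs_a + question
curr_prompt = system + docs_b + question

# Only the system prompt is shared as a prefix.
reusable = common_prefix_len(prev_prompt, curr_prompt)

# The identical question tokens land at different positions in each prompt.
q_pos_prev = prev_prompt.index(question[0])
q_pos_curr = curr_prompt.index(question[0])

print(f"reusable prefix tokens: {reusable}")
print(f"question starts at {q_pos_prev} (previous) vs {q_pos_curr} (current)")
```

In this toy case only the 4 system tokens could be reused; a checkpoint saved at any later position of the previous prompt describes a different token sequence than the current one.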
Expected behavior
Ideally, if the static prefix (System Prompt + User Question template) matches, the model should be able to reuse the cache for that prefix even if the middle section (retrieved docs) varies, OR the warning should only trigger if the user explicitly enables aggressive checkpointing that conflicts with their workflow. Currently, it seems to treat any variation as a total cache failure, forcing a cold start every time.
Steps to reproduce
Run llama-server with an SWA model (e.g., Qwen3.5-35B).
Configure with --ctx-checkpoints (or rely on default checkpoint logic).
Use an external RAG pipeline (e.g., bge-m3) where the number of retrieved documents and their text content vary per query.
Observe logs showing repeated "Checking checkpoint ... against X" messages, each followed immediately by "forcing full prompt re-processing" and "erased invalidated context checkpoint".
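The steps above can be approximated without a real retriever. The sketch below is an assumption-laden stand-in: random filler text replaces the bge-m3 results, the system and question strings are made up, and `build_prompt`, `make_prompts`, and `send_all` are hypothetical helper names. It assumes llama-server's `/completion` endpoint with the `prompt`, `n_predict`, and `cache_prompt` fields; adjust if your setup uses the chat endpoint instead.

```python
import json
import random
import urllib.request

# Static parts of the prompt (illustrative text, not from the real pipeline).
SYSTEM = "You are a helpful assistant. Answer using only the provided documents.\n\n"
QUESTION = "\n\nQuestion: What does the documentation say about caching?\nAnswer:"


def build_prompt(docs):
    """Static system text, then a variable document block, then the question."""
    return SYSTEM + "\n---\n".join(docs) + QUESTION


def make_prompts(n_requests, seed=0):
    """Generate prompts whose document count and length differ per request."""
    rng = random.Random(seed)
    prompts = []
    for i in range(n_requests):
        doc_count = rng.randint(1, 5)
        docs = [
            f"Document {i}-{j}: " + "lorem ipsum " * rng.randint(5, 40)
            for j in range(doc_count)
        ]
        prompts.append(build_prompt(docs))
    return prompts


def send_all(url, prompts):
    """POST each prompt in turn; the effect is visible in the server log."""
    for prompt in prompts:
        body = json.dumps(
            {"prompt": prompt, "n_predict": 16, "cache_prompt": True}
        ).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()  # the generated text is irrelevant for this repro


prompts = make_prompts(5)
```

With the server running, something like `send_all("http://127.0.0.1:8080/completion", prompts)` (host and port are assumptions) should produce one round of the log messages per request if the behavior reproduces.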
Environment
Model: Qwen3.5-35B-A3B-Q8_0.gguf (SWA architecture)
Context Size: 131k (or 262k)
Embeddings: bge-m3 (variable chunk count/length)
llama.cpp Version: Latest (built last week, includes PR #13194 changes)
Hardware: Vulkan backend (Linux)
Logs (Excerpt)
I slot update_slots: id 0 | task 750 | Checking checkpoint with [2003, 2003] against 2...
W slot update_slots: id 0 | task 750 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory)
W slot update_slots: id 0 | task 750 | erased invalidated context checkpoint (pos_min = 421, pos_max = 421, n_tokens = 422, n_swa = 0, ...)
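To put a number on the log spam, a small script can tally these messages. This is a rough sketch that assumes the log lines look exactly like the excerpt above; the regular expression, the `count_events` helper, and the counter names are my own and not part of llama.cpp.

```python
import re
from collections import Counter

# Sample input copied from the excerpt above; replace with the real server log.
LOG = """\
I slot update_slots: id 0 | task 750 | Checking checkpoint with [2003, 2003] against 2...
W slot update_slots: id 0 | task 750 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory)
W slot update_slots: id 0 | task 750 | erased invalidated context checkpoint (pos_min = 421, pos_max = 421, n_tokens = 422, n_swa = 0, ...)
"""

# Assumed line layout: "<level> slot update_slots: id <slot> | task <task> | <message>"
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]) slot update_slots: id\s+(?P<slot>\d+) \| "
    r"task (?P<task>\d+) \| (?P<msg>.*)$"
)


def count_events(log_text):
    """Count checkpoint checks, forced re-processing, and erased checkpoints."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not slot update messages
        msg = m.group("msg")
        if msg.startswith("Checking checkpoint"):
            counts["checkpoint_checked"] += 1
        elif msg.startswith("forcing full prompt re-processing"):
            counts["full_reprocess"] += 1
        elif msg.startswith("erased invalidated context checkpoint"):
            counts["checkpoint_erased"] += 1
    return counts


counts = count_events(LOG)
print(dict(counts))
```

If `full_reprocess` tracks the number of requests one-to-one over a longer log, that would support the claim that no request benefits from the cache.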
Workaround found
Removing the --ctx-cross-request-cache flag (or --ctx-checkpoints) stops the loop. The system runs stably without the error spam, though it loses potential speedups for truly static conversations. This confirms the issue is specifically the mismatch tolerance between the cache validation logic and variable RAG inputs.
Additional context
This appears to be a friction point for users implementing RAG with dynamic context injection. While strict validation is correct for preventing hallucinations, it renders the checkpoint feature useless for any non-static prompt pattern. A configurable "tolerance" for middle-section changes or a specific flag for "RAG mode" might be useful.