Description
When an assistant response contains reasoning_content (thinking tokens), OpenCode correctly handles it for the immediate turn. However, on subsequent turns, when that assistant message is replayed as part of the conversation history, the thinking block is stripped.
This silently removes tokens from the conversation history, which invalidates the KV cache on local inference backends (like llamacpp) at the point where the thinking block used to be. The backend is then forced to reprocess the entire context from that turn forward on every subsequent turn.
While strict providers such as Kimi K2.5 hard-fail with a 400 error when this happens (#10996), permissive local backends fail silently. This makes thinking models effectively non-interactive on local setups because of the latency penalty of constant prompt reprocessing. The same stripping may also occur with some hosted providers for some models, reducing prompt cache hits and adding cost.
Note that reprocessing does not occur while the model is in a turn with itself (making tool calls back and forth within a single turn); the invalidation only happens once a completed turn is replayed as history with its thinking block stripped.
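The invalidation mechanics can be illustrated with a minimal prefix-match sketch (illustrative only; llama-server's actual slot logic is more involved, and the token ids below are made up). Removing tokens from the middle of the history shifts everything after them, so the cached prefix only matches up to the deletion point:

```python
def common_prefix_len(a, b):
    """Length of the shared prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Toy token ids: [system + tools][user turn 2][thinking block][tool call]
history_turn2 = [1, 2, 3] + [10, 11] + [20, 21, 22] + [30, 31]
# Same history replayed on turn 3 with the thinking block (20, 21, 22) stripped:
history_turn3 = [1, 2, 3] + [10, 11] + [30, 31] + [40]

# The cached prefix only matches up to where the thinking block used to be,
# so everything after it (tool call, tool result, new turn) is reprocessed.
assert common_prefix_len(history_turn2, history_turn3) == 5
```

With the thinking block preserved, the prefix would match through the entire earlier turn and only the newly appended messages would need processing.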
Plugins
No response
OpenCode version
1.2.27
Steps to reproduce
Setup: NixOS, OpenCode communicating with a local llama-server hosting a Qwen3.5-122B model. Qwen3.5 35B and GPT-OSS-120B/20B were also tested and showed the same stripping and resulting cache invalidation. All Qwen3.5 models, regardless of whether thinking is enabled, exhibit the same problem (non-thinking runs still produce an empty `<think>\n\n</think>` block, which is also stripped and forces reprocessing).
Prompt 1: Ask the model to "Say Hello". This initializes the system prompt and tools. Context is cached normally.
Prompt 2: Ask the model to do a task (e.g., "Read a file"). The model outputs a thinking block followed by the tool call/action. A plain text response reproduces the issue as well, but reading a file is quicker than generating 10k tokens.
Prompt 3: Say "Hello" again. OpenCode sends the conversation history, but strips the thinking block from the Prompt 2 assistant message. The llama-server KV cache is invalidated at the Prompt 2 boundary, forcing it to fully reprocess the output of Prompt 2.
Prompt 4: Say "Hello" again. The same thing happens; it has to reprocess the output of Prompt 3.
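A hedged sketch of the difference between current and expected replay behavior, assuming an OpenAI-style messages array with a `reasoning_content` field (field name taken from the description above; the message contents and tool-call id are hypothetical, and OpenCode's internals may differ):

```python
import json

# Assistant message as produced on the immediate turn (Prompt 2):
assistant_msg = {
    "role": "assistant",
    "reasoning_content": "The user wants me to read PROJECT_PLAN.md ...",
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "read",
                                 "arguments": '{"filePath": "..."}'}}],
}

def replay_stripped(msg):
    """Current behavior: thinking is dropped when the message is replayed."""
    return {k: v for k, v in msg.items() if k != "reasoning_content"}

def replay_preserved(msg):
    """Expected behavior: replay the message exactly as generated."""
    return dict(msg)

# The stripped replay no longer serializes to the same bytes, so the
# backend's prompt prefix diverges at this message.
assert json.dumps(replay_stripped(assistant_msg)) != json.dumps(assistant_msg)
assert json.dumps(replay_preserved(assistant_msg)) == json.dumps(assistant_msg)
```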
Screenshot and/or share link
When Prompt 2 completes (tool_call assistant message includes thinking):
```
<|im_start|>assistant
<think>
The user wants me to read PROJECT_PLAN.md and nothing else, then reply with "Done". Let me read this file.
</think>
<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```
When Prompt 3 replays the same message (thinking block stripped):
```
<|im_start|>assistant
<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```
Token counts from llama-server logs
| Request | Task | Tokens Sent | Tokens Processed | Cache Behavior |
| --- | --- | --- | --- | --- |
| Prompt 1 ("Hello") | 269 | 18,006 | 18,006 | Fresh start |
| Prompt 2a (tool call) | 312 | 18,048 | 46 | Not perfect cache hit (sim=0.998) |
| Prompt 2b (tool result) | 384 | 37,013 | 18,896 | Perfect cache hit (f_keep=1.000) |
| Prompt 3 ("Hello again") | 424 | 36,995 | 18,993 | Cache invalidated at token ~18,002 |
Prompt 3 sends 18 fewer total tokens than Prompt 2b despite adding two messages (the "Done" response and the user's Prompt 3). The net deficit is explained by the stripped `<think>` block, which more than offsets the tokens of the added messages.
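The 18-token figure follows directly from the table (a trivial back-of-envelope check on the reported numbers):

```python
# Tokens Sent, from the llama-server log table above.
prompt_2b_sent = 37_013
prompt_3_sent = 36_995

# Prompt 3 carries two extra messages yet sends fewer total tokens;
# the net deficit is what the stripped <think> block removed.
assert prompt_2b_sent - prompt_3_sent == 18
```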
llama-server cache rollback log (Prompt 3)
```
slot update_slots: id 0 | task 424 | n_past = 18046, slot.prompt.tokens.size() = 37045
slot update_slots: id 0 | task 424 | Checking checkpoint with [37008, 37008] against 18046...
slot update_slots: id 0 | task 424 | Checking checkpoint with [18001, 18001] against 18046...
slot update_slots: id 0 | task 424 | restored context checkpoint (pos_min = 18001, n_past = 18002)
slot update_slots: id 0 | task 424 | erased invalidated context checkpoint (pos_min = 26308, ...)
slot update_slots: id 0 | task 424 | erased invalidated context checkpoint (pos_min = 34500, ...)
slot update_slots: id 0 | task 424 | erased invalidated context checkpoint (pos_min = 35984, ...)
slot update_slots: id 0 | task 424 | erased invalidated context checkpoint (pos_min = 37008, ...)
prompt eval time = 75166.29 ms / 18993 tokens
```
The cache diverges at the tool call boundary (~token 18,002) and must reprocess everything after it.
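The latency cost implied by the log above, as a rough back-of-envelope from this specific run (numbers will vary by hardware and model):

```python
# From the llama-server log: prompt eval time = 75166.29 ms / 18993 tokens.
prompt_eval_ms = 75_166.29
reprocessed_tokens = 18_993

ms_per_token = prompt_eval_ms / reprocessed_tokens  # roughly 4 ms/token
# Every subsequent turn pays ~75 s of prompt reprocessing just to replay
# context that was already resident in the KV cache.
assert 3.9 < ms_per_token < 4.0
```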
Operating System
NixOS
Terminal
No response