
reasoning_content stripped from assistant messages on replay, causing KV cache invalidation on local inference #19081

@michal-zurkowski

Description

When an assistant response contains reasoning_content (thinking tokens), OpenCode correctly handles it for the immediate turn. However, on subsequent turns, when that assistant message is replayed as part of the conversation history, the thinking block is stripped.

This silently removes tokens from the conversation history, which invalidates the KV cache on local inference backends (like llamacpp) at the point where the thinking block used to be. The backend is then forced to reprocess the entire context from that turn forward on every subsequent turn.
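To illustrate the cache mechanics, here is a minimal sketch (not llama.cpp's actual implementation, and the token ids are made up): a prefix-caching backend can only reuse KV entries for the longest token prefix shared between the cached sequence and the incoming request, so deleting a block from the middle of the history invalidates everything after it.

```python
# Sketch: why stripping a mid-history block breaks KV cache reuse.
# A prefix-caching server reuses KV entries only for the longest common
# token prefix between the cached sequence and the new request.

def reusable_prefix(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens that match and can be served from cache."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token ids: [history] + [think block] + [tool call].
history = [1, 2, 3] + [100, 101, 102] + [4, 5, 6]  # as originally generated
replayed = [1, 2, 3] + [4, 5, 6]                   # think block stripped

# Only the 3 tokens before the stripped block are reusable; everything
# after it must be reprocessed.
print(reusable_prefix(history, replayed))  # → 3
```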

While strict providers like Kimi K2.5 hard-fail with a 400 error when this happens (#10996), permissive local backends fail silently. This makes thinking models non-interactive on local setups due to the massive latency penalty of constant prompt reprocessing. The same stripping may also affect some hosted providers for some models, reducing prompt-cache hits and therefore increasing cost.

Notably, reprocessing is not triggered while the model is mid-turn with itself (making tool calls back and forth within a single turn); it only happens once that assistant message is replayed as history on a later turn.
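A sketch of the shape of a fix, using assumed message dictionaries rather than OpenCode's actual internals: when rebuilding history for replay, the `reasoning_content` field on assistant messages should be carried through instead of dropped, so the replayed token stream stays byte-identical to what the backend cached.

```python
# Sketch (assumed message shapes, not OpenCode's real data model):
# preserve reasoning_content when replaying an assistant message.

def replay_message(msg: dict) -> dict:
    out = {"role": msg["role"], "content": msg.get("content", "")}
    # Keeping the thinking tokens keeps the prompt identical to what the
    # backend already has in its KV cache, so the prefix still matches.
    if msg["role"] == "assistant" and msg.get("reasoning_content"):
        out["reasoning_content"] = msg["reasoning_content"]
    if msg.get("tool_calls"):
        out["tool_calls"] = msg["tool_calls"]
    return out

replayed = replay_message({
    "role": "assistant",
    "content": "",
    "reasoning_content": "The user wants me to read the file...",
    "tool_calls": [{"name": "read"}],
})
print("reasoning_content" in replayed)  # → True
```

Whether to replay thinking blocks is model-specific (some providers require them dropped), so a real fix would likely gate this per provider/model.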

Plugins

No response

OpenCode version

1.2.27

Steps to reproduce

Setup: NixOS, OpenCode communicating with a local llama-server hosting a Qwen3.5-122B model. Qwen3.5 35B and GPT-OSS-120B/20B were also tested and showed the same stripping and resulting invalidation. All Qwen3.5 models, thinking or not, exhibit the same problem (non-thinking models still produce an empty <think>\n\n</think> block, which is also stripped and forces reprocessing).

Prompt 1: Ask the model to "Say Hello". This initializes the system prompt and tools. Context is cached normally.
Prompt 2: Ask the model to do a task (e.g., "Read a file"). The model outputs a thinking block followed by the tool call/action. Plain text output reproduces the issue too, but reading a file is quicker than generating 10k tokens.
Prompt 3: Say "Hello" again. OpenCode sends the conversation history, but strips the thinking block from the Prompt 2 assistant message. The llama-server KV cache is invalidated at the Prompt 2 boundary, forcing it to fully reprocess the output of Prompt 2.
Prompt 4: Say "Hello" again. The same thing happens; it has to reprocess the output of Prompt 3.
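To pinpoint where the replayed prompt diverges from the original, one can diff two captured prompts (e.g., from a logging proxy between OpenCode and llama-server). This sketch uses shortened, hypothetical prompt strings in the same chat-template format shown below:

```python
# Sketch: locate the first character where two successive prompts differ.

def divergence_index(a: str, b: str) -> int:
    """First index where the two prompt strings differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# Abbreviated stand-ins for the real captured prompts.
prompt2 = "<|im_start|>assistant\n<think>\n...\n</think>\n\n<tool_call>..."
prompt3 = "<|im_start|>assistant\n<tool_call>..."

# Divergence lands right after the assistant header, exactly where the
# <think> block was removed.
print(divergence_index(prompt2, prompt3))  # → 24
```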

Screenshot and/or share link

When Prompt 2 completes (tool_call assistant message includes thinking):

```
<|im_start|>assistant
<think>
The user wants me to read PROJECT_PLAN.md and nothing else, then reply with "Done". Let me read this file.
</think>

<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```

When Prompt 3 replays the same message (thinking block stripped):

```
<|im_start|>assistant
<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```

Token counts from llama-server logs

| Request | Task | Tokens Sent | Tokens Processed | Cache Behavior |
|---|---|---|---|---|
| Prompt 1 ("Hello") | 269 | 18,006 | 18,006 | Fresh start |
| Prompt 2a (tool call) | 312 | 18,048 | 46 | Not a perfect cache hit (sim=0.998) |
| Prompt 2b (tool result) | 384 | 37,013 | 18,896 | Perfect cache hit (f_keep=1.000) |
| Prompt 3 ("Hello again") | 424 | 36,995 | 18,993 | Cache invalidated at token ~18,002 |

Prompt 3 sends 18 fewer total tokens than Prompt 2b despite having two additional messages (the "Done" response + user's Prompt 3). The missing tokens correspond exactly to the stripped <think> block.
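The 18-token gap can be checked directly against the figures in the table:

```python
# Consistency check of the llama-server numbers above.
prompt_2b_sent = 37_013
prompt_3_sent = 36_995

# 18 tokens missing despite two extra messages in history — matching the
# stripped <think>...</think> block.
print(prompt_2b_sent - prompt_3_sent)  # → 18
```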

llama-server cache rollback log (Prompt 3)

```
slot update_slots: id  0 | task 424 | n_past = 18046, slot.prompt.tokens.size() = 37045
slot update_slots: id  0 | task 424 | Checking checkpoint with [37008, 37008] against 18046...
slot update_slots: id  0 | task 424 | Checking checkpoint with [18001, 18001] against 18046...
slot update_slots: id  0 | task 424 | restored context checkpoint (pos_min = 18001, n_past = 18002)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 26308, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 34500, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 35984, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 37008, ...)
prompt eval time = 75166.29 ms / 18993 tokens
```

The cache diverges at the tool call boundary (~token 18,002) and must reprocess everything after it.

Operating System

NixOS

Terminal

No response

Labels

bug (Something isn't working), core (Anything pertaining to core functionality of the application — opencode server stuff)
