
reasoning_content stripped from assistant messages on replay, causing KV cache invalidation on local inference #19081

@michal-zurkowski

Description

When an assistant response contains reasoning_content (thinking tokens), OpenCode correctly handles it for the immediate turn. However, on subsequent turns, when that assistant message is replayed as part of the conversation history, the thinking block is stripped.

This silently removes tokens from the conversation history, which invalidates the KV cache on local inference backends (like llamacpp) at the point where the thinking block used to be. The backend is then forced to reprocess the entire context from that turn forward on every subsequent turn.
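To illustrate the cache mechanics, here is a minimal sketch (not llama.cpp's actual implementation, and the token ids are made up): a prefix-caching backend can only reuse KV entries for the longest token prefix shared between the cached sequence and the incoming request, so deleting a block from the middle of the history invalidates everything after it.

```python
# Sketch: why stripping a mid-history block breaks KV cache reuse.
# A prefix-caching server reuses KV entries only for the longest common
# token prefix between the cached sequence and the new request.

def reusable_prefix(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens that match and can be served from cache."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token ids: [history] + [think block] + [tool call].
history = [1, 2, 3] + [100, 101, 102] + [4, 5, 6]  # as originally generated
replayed = [1, 2, 3] + [4, 5, 6]                   # think block stripped

# Only the 3 tokens before the stripped block are reusable; everything
# after it must be reprocessed.
print(reusable_prefix(history, replayed))  # → 3
```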

While strict providers like Kimi K2.5 hard-fail with a 400 error when this happens (#10996), permissive local backends fail silently. This makes thinking models non-interactive on local setups due to the massive latency penalty of constant prompt reprocessing. The same stripping may also affect some hosted providers for some models, reducing prompt-cache hits and therefore increasing cost.

Notably, reprocessing is not triggered while the model is mid-turn with itself (making tool calls back and forth within a single turn); it only happens once that assistant message is replayed as history on a later turn.
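A sketch of the shape of a fix, using assumed message dictionaries rather than OpenCode's actual internals: when rebuilding history for replay, the `reasoning_content` field on assistant messages should be carried through instead of dropped, so the replayed token stream stays byte-identical to what the backend cached.

```python
# Sketch (assumed message shapes, not OpenCode's real data model):
# preserve reasoning_content when replaying an assistant message.

def replay_message(msg: dict) -> dict:
    out = {"role": msg["role"], "content": msg.get("content", "")}
    # Keeping the thinking tokens keeps the prompt identical to what the
    # backend already has in its KV cache, so the prefix still matches.
    if msg["role"] == "assistant" and msg.get("reasoning_content"):
        out["reasoning_content"] = msg["reasoning_content"]
    if msg.get("tool_calls"):
        out["tool_calls"] = msg["tool_calls"]
    return out

replayed = replay_message({
    "role": "assistant",
    "content": "",
    "reasoning_content": "The user wants me to read the file...",
    "tool_calls": [{"name": "read"}],
})
print("reasoning_content" in replayed)  # → True
```

Whether to replay thinking blocks is model-specific (some providers require them dropped), so a real fix would likely gate this per provider/model.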

Plugins

No response

OpenCode version

1.2.27

Steps to reproduce

Setup: NixOS, OpenCode communicating with a local llama-server hosting a Qwen3.5-122B model. Qwen3.5 35B and GPT-OSS-120B/20B were also tested and showed the same stripping and resulting invalidation. All Qwen3.5 models, thinking or not, exhibit the same problem (non-thinking models still produce an empty <think>\n\n</think> block, which is also stripped and forces reprocessing).

Prompt 1: Ask the model to "Say Hello". This initializes the system prompt and tools. Context is cached normally.
Prompt 2: Ask the model to do a task (e.g., "Read a file"). The model outputs a thinking block followed by the tool call/action. Plain text output reproduces the issue too, but reading a file is quicker than generating 10k tokens.
Prompt 3: Say "Hello" again. OpenCode sends the conversation history, but strips the thinking block from the Prompt 2 assistant message. The llama-server KV cache is invalidated at the Prompt 2 boundary, forcing it to fully reprocess the output of Prompt 2.
Prompt 4: Say "Hello" again. The same thing happens; it has to reprocess the output of Prompt 3.
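To pinpoint where the replayed prompt diverges from the original, one can diff two captured prompts (e.g., from a logging proxy between OpenCode and llama-server). This sketch uses shortened, hypothetical prompt strings in the same chat-template format shown below:

```python
# Sketch: locate the first character where two successive prompts differ.

def divergence_index(a: str, b: str) -> int:
    """First index where the two prompt strings differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# Abbreviated stand-ins for the real captured prompts.
prompt2 = "<|im_start|>assistant\n<think>\n...\n</think>\n\n<tool_call>..."
prompt3 = "<|im_start|>assistant\n<tool_call>..."

# Divergence lands right after the assistant header, exactly where the
# <think> block was removed.
print(divergence_index(prompt2, prompt3))  # → 24
```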

Screenshot and/or share link

When Prompt 2 completes (tool_call assistant message includes thinking):

```
<|im_start|>assistant
<think>
The user wants me to read PROJECT_PLAN.md and nothing else, then reply with "Done". Let me read this file.
</think>

<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```

When Prompt 3 replays the same message (thinking block stripped):

```
<|im_start|>assistant
<tool_call>
<function=read>
<parameter=filePath>
/home/delgon/ai-controller/PROJECT_PLAN.md
</parameter>
</function>
</tool_call><|im_end|>
```

Token counts from llama-server logs

| Request | Task | Tokens Sent | Tokens Processed | Cache Behavior |
|---|---|---|---|---|
| Prompt 1 ("Hello") | 269 | 18,006 | 18,006 | Fresh start |
| Prompt 2a (tool call) | 312 | 18,048 | 46 | Not a perfect cache hit (sim=0.998) |
| Prompt 2b (tool result) | 384 | 37,013 | 18,896 | Perfect cache hit (f_keep=1.000) |
| Prompt 3 ("Hello again") | 424 | 36,995 | 18,993 | Cache invalidated at token ~18,002 |

Prompt 3 sends 18 fewer total tokens than Prompt 2b despite having two additional messages (the "Done" response + user's Prompt 3). The missing tokens correspond exactly to the stripped <think> block.
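The 18-token gap can be checked directly against the figures in the table:

```python
# Consistency check of the llama-server numbers above.
prompt_2b_sent = 37_013
prompt_3_sent = 36_995

# 18 tokens missing despite two extra messages in history — matching the
# stripped <think>...</think> block.
print(prompt_2b_sent - prompt_3_sent)  # → 18
```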

llama-server cache rollback log (Prompt 3)

```
slot update_slots: id  0 | task 424 | n_past = 18046, slot.prompt.tokens.size() = 37045
slot update_slots: id  0 | task 424 | Checking checkpoint with [37008, 37008] against 18046...
slot update_slots: id  0 | task 424 | Checking checkpoint with [18001, 18001] against 18046...
slot update_slots: id  0 | task 424 | restored context checkpoint (pos_min = 18001, n_past = 18002)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 26308, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 34500, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 35984, ...)
slot update_slots: id  0 | task 424 | erased invalidated context checkpoint (pos_min = 37008, ...)
prompt eval time = 75166.29 ms / 18993 tokens
```

The cache diverges at the tool call boundary (~token 18,002) and must reprocess everything after it.

Operating System

NixOS

Terminal

No response

Labels

bug (Something isn't working), core (Anything pertaining to core functionality of the application — opencode server stuff)
