42 changes: 35 additions & 7 deletions src/copaw/agents/model_factory.py
@@ -73,14 +73,43 @@ class FileBlockSupportFormatter(base_formatter_class):
"""Formatter with file block support for tool results."""

async def _format(self, msgs):
"""Override to sanitize tool messages before formatting.
"""Override to sanitize tool messages and preserve thinking blocks.

This prevents OpenAI API errors from improperly paired
tool messages.
- Sanitizes improperly paired tool messages.
- After base formatting, injects ``reasoning_content`` from
thinking blocks into assistant messages so that APIs like
Kimi K2.5 (which require it) don't reject the request.
"""
msgs = _sanitize_tool_messages(msgs)
messages = await super()._format(msgs)
return _strip_top_level_message_name(messages)

# Collect thinking content per assistant msg (in order).
Collaborator: Should we apply this kind of fix to AgentScope library's formatter instead of CoPaw? @rayrayraykk @DavdGao

Member: Agree

Collaborator: @lailoo can you make this fix to the AgentScope library instead? It will help keep this code base focused on the application, not the model interaction layer.

Author (@lailoo, Mar 3, 2026): Good point! I'll move this fix to the AgentScope library instead. Will create a PR there and link it back here. @ekzhu

Collaborator: Thank you! Closing this PR.

assistant_thinking: list[str] = []
for msg in msgs:
if msg.role == "assistant":
parts = [
b.get("thinking", "")
for b in msg.get_content_blocks()
if b.get("type") == "thinking"
]
assistant_thinking.append(
"\n".join(p for p in parts if p),
)

formatted = await super()._format(msgs)

# Inject reasoning_content into formatted assistant messages.
if assistant_thinking:
Copilot AI (Mar 1, 2026): The injection loop is guarded by if assistant_thinking:, but assistant_thinking is populated for every assistant message even when none contain thinking blocks (empty strings). This causes an unnecessary second pass over formatted on most calls; consider guarding with if any(assistant_thinking): (or tracking a has_thinking flag) so the injection work is skipped when there is nothing to inject.

Suggested change:
if assistant_thinking:
if any(assistant_thinking):
asst_idx = 0
for fmt_msg in formatted:
if fmt_msg.get("role") == "assistant":
if (
asst_idx < len(assistant_thinking)
and assistant_thinking[asst_idx]
):
fmt_msg["reasoning_content"] = assistant_thinking[asst_idx]
asst_idx += 1
Comment on lines +100 to +110

1. Token undercount vs reasoning (Bug, Reliability)

The formatter now injects reasoning_content into outgoing assistant messages, but the token counting logic used for /history and auto memory compaction only counts content. This under-estimates prompt size for thinking models and can delay compaction until after requests start failing due to context limits.

Issue context:
- FileBlockSupportFormatter._format() adds reasoning_content to formatted assistant messages.
- safe_count_message_tokens() relies on _extract_text_from_messages(), which currently concatenates only content fields.
- Memory compaction and /history both use this token estimate.

Fix focus areas:
- src/copaw/agents/utils/token_counting.py[58-86]
- src/copaw/agents/hooks/memory_compaction.py[161-169]
- src/copaw/agents/command_handler.py[245-251]
- src/copaw/agents/model_factory.py[75-112]

Suggested change:
- In _extract_text_from_messages(), append msg.get("reasoning_content") (when present) into parts, similarly to content.
- (Optional but recommended) Consider also incorporating text from tool call arguments if present in the formatted schema, to further reduce undercounting for tool-heavy conversations.
- Add/extend a unit test to ensure token counting increases when reasoning_content is present in formatted messages.
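The suggested change above can be sketched as follows. This is a minimal illustration under assumptions: only the function name _extract_text_from_messages comes from the review; the body and the message shape here are hypothetical, not CoPaw's actual implementation.

```python
# Hypothetical sketch of the reviewer's suggestion: include
# reasoning_content (and, optionally, tool call arguments) when
# extracting text for token estimation.
def _extract_text_from_messages(messages: list[dict]) -> str:
    parts: list[str] = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            parts.append(content)
        # Reviewer's fix: also count injected reasoning_content.
        reasoning = msg.get("reasoning_content")
        if reasoning:
            parts.append(reasoning)
        # Optional: count tool call arguments to reduce undercounting
        # in tool-heavy conversations.
        for call in msg.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments")
            if args:
                parts.append(args)
    return "\n".join(parts)


msgs = [
    {
        "role": "assistant",
        "content": "Let me check.",
        "reasoning_content": "I should call get_current_time.",
        "tool_calls": [
            {"function": {"name": "get_current_time", "arguments": "{}"}},
        ],
    },
]
text = _extract_text_from_messages(msgs)
```

With this sketch, the extracted text (and hence the token estimate) grows when reasoning_content is present, which is exactly what the suggested unit test would verify.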

Comment on lines +100 to +110

⚠️ Potential issue | 🟡 Minor

Web query: OpenAI API reasoning_content field, assistant messages, tool_calls specification.

Result: in the current OpenAI API docs, the chat/assistant message schema does not define a reasoning_content field. For reasoning models, "reasoning" is represented separately (in the Responses API) as reasoning items, not as an assistant-message field. For Chat Completions, an assistant message is essentially:

  • role: "assistant"
  • content: required unless you provide tool_calls (or the deprecated function_call)
  • tool_calls: an array of tool calls generated by the model, whose items include id (tool call id), type (currently "function"), and function: { name, arguments }, where arguments is a JSON string

In the Responses API, reasoning is carried as a distinct reasoning object/item with id, summary[] (summary text parts), optional content[] containing reasoning_text, and optional encrypted_content (when requested via include). The docs also note you should include these reasoning items back in input for later turns if you are manually managing context.

Web query: Kimi K2.5 / DeepSeek reasoning_content field, OpenAI-compatible API, tool_calls.

Result:

  • Kimi K2.5 (Moonshot): in thinking mode, K2.5 returns a separate reasoning_content alongside content (the answer); the K2.5 repo's example reads it as response.choices[0].message.reasoning_content. Thinking can be disabled via extra_body (e.g., {"thinking": {"type": "disabled"}} for the official API, or {"chat_template_kwargs": {"thinking": False}} for vLLM/SGLang). Some Moonshot "thinking" docs note the OpenAI SDK types may not expose reasoning_content directly, so you may need to access it defensively (e.g., hasattr/getattr). For tool calling, Moonshot recommends tool_calls (not legacy function_call), supports parallel tool calls, and stresses that every returned tool_call.id must be matched by a later role="tool" message with the correct tool_call_id, with the assistant message that contained tool_calls appended to history.
  • DeepSeek: the reasoning model returns both reasoning_content (chain of thought) and content (final answer) as separate fields. DeepSeek's reasoning model guide says not to feed reasoning_content back into later messages (it can error), but its newer thinking-mode tool-calls guide says that during a single question's multi-step "thinking + tool calls" loop you must send reasoning_content back so the model can continue; when the next user question starts, remove the prior reasoning_content.
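The defensive-access note above (that the OpenAI SDK's typed message classes may not expose reasoning_content) is typically handled with getattr and a default. A minimal sketch follows; StubMessage is a hypothetical stand-in for an SDK response message, not a real API object.

```python
# Sketch of defensively reading reasoning_content from an
# OpenAI-SDK-style message object that may not declare the field.
class StubMessage:
    """Hypothetical stand-in for response.choices[0].message."""

    def __init__(self) -> None:
        self.content = "The time is 14:30."
        self.reasoning_content = "User asked for the time."


def get_reasoning(message: object) -> str:
    # getattr with a default avoids AttributeError when the SDK's
    # typed message class does not expose reasoning_content.
    return getattr(message, "reasoning_content", "") or ""


reasoning = get_reasoning(StubMessage())
plain = get_reasoning(object())  # attribute absent, falls back to ""
```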


Gate reasoning_content injection to tool-call assistant messages only.

The tests verify reasoning_content exclusively on assistant messages with tool_calls (lines 65 and 182 in the test file). Backend documentation for DeepSeek and Kimi K2.5 also indicates reasoning_content is most relevant in tool-calling contexts. Injecting into all assistant messages with thinking blocks risks rejection by stricter backends.

Proposed diff:

             if assistant_thinking:
                 asst_idx = 0
                 for fmt_msg in formatted:
                     if fmt_msg.get("role") == "assistant":
                         if (
+                            fmt_msg.get("tool_calls")
+                            and
                             asst_idx < len(assistant_thinking)
                             and assistant_thinking[asst_idx]
                         ):
                             fmt_msg["reasoning_content"] = assistant_thinking[asst_idx]
                         asst_idx += 1


return _strip_top_level_message_name(formatted)

@staticmethod
def convert_tool_result_to_string(
@@ -282,8 +311,7 @@ def _create_remote_model_instance(
base_url = llm_cfg.base_url
else:
logger.warning(
"No active LLM configured — "
"falling back to DASHSCOPE_API_KEY env var",
"No active LLM configured — " "falling back to DASHSCOPE_API_KEY env var",
)
model_name = "qwen3-max"
api_key = os.getenv("DASHSCOPE_API_KEY", "")
204 changes: 204 additions & 0 deletions tests/test_thinking_block_fix.py
@@ -0,0 +1,204 @@
# -*- coding: utf-8 -*-
"""Verify fix for #155: thinking blocks preserved as reasoning_content."""
import asyncio
import json

from agentscope.message import (
Msg,
ThinkingBlock,
TextBlock,
ToolUseBlock,
ToolResultBlock,
)
from agentscope.model import OpenAIChatModel

from copaw.agents.model_factory import _create_formatter_instance


def test_reasoning_content_preserved():
"""Thinking blocks must appear as reasoning_content in formatted output."""
formatter = _create_formatter_instance(OpenAIChatModel)

msgs = [
Msg(
name="system",
role="system",
content="You are a helpful assistant.",
),
Msg(name="user", role="user", content="What time is it now?"),
Msg(
name="assistant",
role="assistant",
content=[
ThinkingBlock(
type="thinking",
thinking="I should call get_current_time to answer this.",
),
TextBlock(type="text", text="Let me check."),
ToolUseBlock(
type="tool_use",
id="call_001",
name="get_current_time",
input={},
),
],
),
Msg(
name="system",
role="system",
content=[
ToolResultBlock(
type="tool_result",
id="call_001",
name="get_current_time",
output="2026-03-01 14:30:00 CST",
),
],
),
]

formatted = asyncio.run(formatter.format(msgs))

assistant_msgs = [
m
for m in formatted
if m.get("role") == "assistant" and m.get("tool_calls")
]
assert len(assistant_msgs) == 1, (
f"Expected 1 assistant msg with tool_calls, "
f"got {len(assistant_msgs)}. "
f"All formatted: {json.dumps(formatted, ensure_ascii=False)}"
)

msg = assistant_msgs[0]
reasoning = msg.get("reasoning_content")
assert reasoning, (
f"reasoning_content missing! "
f"Formatted msg: {json.dumps(msg, ensure_ascii=False, indent=2)}"
)
assert "get_current_time" in reasoning
print(f"✅ PASS — reasoning_content preserved: {reasoning}")


def test_no_reasoning_when_no_thinking():
"""Messages without thinking blocks should not get reasoning_content."""
formatter = _create_formatter_instance(OpenAIChatModel)

msgs = [
Msg(name="system", role="system", content="Hi"),
Msg(name="user", role="user", content="Hello"),
Msg(
name="assistant",
role="assistant",
content=[TextBlock(type="text", text="Hi there!")],
),
]

formatted = asyncio.run(formatter.format(msgs))
assistant_msgs = [m for m in formatted if m.get("role") == "assistant"]
assert len(assistant_msgs) == 1

msg = assistant_msgs[0]
assert (
"reasoning_content" not in msg
), (
"reasoning_content should not be present "
"when there are no thinking blocks"
)
print("✅ PASS — no reasoning_content when no thinking blocks")


def test_multiple_assistant_messages():
"""Each assistant message gets its own reasoning_content."""
formatter = _create_formatter_instance(OpenAIChatModel)

msgs = [
Msg(name="system", role="system", content="Hi"),
Msg(name="user", role="user", content="Do two things"),
Msg(
name="assistant",
role="assistant",
content=[
ThinkingBlock(
type="thinking",
thinking="First task thinking",
),
TextBlock(type="text", text="Doing first thing."),
ToolUseBlock(
type="tool_use",
id="c1",
name="tool_a",
input={},
),
],
),
Msg(
name="system",
role="system",
content=[
ToolResultBlock(
type="tool_result",
id="c1",
name="tool_a",
output="done",
),
],
Comment on lines +138 to +145

Copilot AI (Mar 1, 2026): Several lines in this test file exceed the repo's max line length (79) and will fail flake8/black (e.g., the one-line ToolResultBlock call here). Please run black and/or wrap these long argument lists across multiple lines so E501 doesn't block CI/pre-commit.
),
Msg(
name="assistant",
role="assistant",
content=[
ThinkingBlock(
type="thinking",
thinking="Second task thinking",
),
TextBlock(type="text", text="Doing second thing."),
ToolUseBlock(
type="tool_use",
id="c2",
name="tool_b",
input={},
),
],
),
Msg(
name="system",
role="system",
content=[
ToolResultBlock(
type="tool_result",
id="c2",
name="tool_b",
output="done",
),
],
),
]

formatted = asyncio.run(formatter.format(msgs))
assistant_msgs = [
m
for m in formatted
if m.get("role") == "assistant" and m.get("tool_calls")
]
assert (
len(assistant_msgs) == 2
), f"Expected 2 assistant msgs, got {len(assistant_msgs)}"

assert (
assistant_msgs[0].get("reasoning_content") == "First task thinking"
)
assert (
assistant_msgs[1].get("reasoning_content") == "Second task thinking"
)
print(
"✅ PASS — multiple assistant messages "
"each get correct reasoning_content"
)


if __name__ == "__main__":
test_reasoning_content_preserved()
test_no_reasoning_when_no_thinking()
test_multiple_assistant_messages()
print("\nAll tests passed!")