fix(runtime): self-heal orphaned tool_result blocks on load + compact #5853

Open
itripn wants to merge 3 commits into zeroclaw-labs:master from itripn:fix/5813-orphan-tool-result-self-heal

Conversation

@itripn

@itripn itripn commented Apr 18, 2026

Summary

  • Base branch: master
  • What changed and why:
    • Orphaned tool_result messages (whose paired assistant tool_use was lost to compaction or a crash) bricked Signal-channel sessions with repeated Anthropic 400 "unexpected tool_use_id in tool_result blocks" errors. Users had to manually delete the session file.
    • Load paths (CLI load_interactive_session_history + channel orchestrator hydration) now run remove_orphaned_tool_messages so a corrupt persisted session heals on startup.
    • Compaction's repair_tool_pairs now delegates to the canonical remove_orphaned_tool_messages rather than a weak "tool adjacent to [CONTEXT SUMMARY]" heuristic — it catches orphans wherever they sit after splice.
    • Orphan detection now parses the assistant's structured tool_calls array instead of substring-matching content. Compaction summaries are instructed to preserve identifiers, so an orphan's tool_call_id can legitimately appear in summary prose — string matching falsely concluded the orphan was paired.
  • Scope boundary: No provider/API changes, no config changes, no new dependencies. The healing logic is the existing remove_orphaned_tool_messages function; this PR only strengthens it and wires it into the remaining gaps.
  • Blast radius: Runtime agent loop, context compression, and channel orchestrator session hydration. Same logic already runs pre-send; this closes the gaps on load and in the compaction repair pass.
  • Linked issue(s): Closes #5813
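
A minimal sketch of the structured-versus-substring distinction described above. The `Message` enum, the function names, the sample ids, and the choice to model the summary as an assistant message are all hypothetical and simplified for illustration; the real zeroclaw types and `remove_orphaned_tool_messages` may differ (for example, this sketch accepts a pairing with any earlier assistant message rather than only the adjacent one).

```rust
use std::collections::HashSet;

/// Hypothetical, simplified message shape; the real zeroclaw types may differ.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Message {
    User(String),
    /// Assistant turn: text content plus the ids from its structured tool_calls.
    Assistant { content: String, tool_call_ids: Vec<String> },
    /// Tool result answering the call identified by `tool_call_id`.
    Tool { tool_call_id: String, content: String },
}

/// Substring check (sketch): a tool result counts as paired if its id appears
/// anywhere in the text of an earlier assistant message.
fn count_orphans_by_substring(history: &[Message]) -> usize {
    let mut orphans = 0;
    for (i, msg) in history.iter().enumerate() {
        if let Message::Tool { tool_call_id, .. } = msg {
            let mentioned = history[..i].iter().any(|prev| {
                matches!(prev, Message::Assistant { content, .. }
                    if content.contains(tool_call_id.as_str()))
            });
            if !mentioned {
                orphans += 1;
            }
        }
    }
    orphans
}

/// Structured check (sketch): keep a tool result only if an earlier assistant
/// message declared its id in tool_calls. Returns the number removed.
fn remove_orphans_structured(history: &mut Vec<Message>) -> usize {
    let before = history.len();
    let mut declared: HashSet<String> = HashSet::new();
    history.retain(|msg| match msg {
        Message::Assistant { tool_call_ids, .. } => {
            declared.extend(tool_call_ids.iter().cloned());
            true
        }
        Message::Tool { tool_call_id, .. } => declared.contains(tool_call_id),
        Message::User(_) => true,
    });
    before - history.len()
}

/// Sample history: a summary that mentions toolu_01 in prose, the orphaned
/// result for toolu_01, then a properly paired toolu_02 call and result.
fn sample_history() -> Vec<Message> {
    vec![
        Message::Assistant {
            content: "[CONTEXT SUMMARY] The agent listed files via call toolu_01.".to_string(),
            tool_call_ids: vec![],
        },
        Message::Tool { tool_call_id: "toolu_01".to_string(), content: "a.txt".to_string() },
        Message::User("what next?".to_string()),
        Message::Assistant {
            // Assumption: serialized tool calls also show up in the content text.
            content: r#"{"tool_calls":[{"id":"toolu_02"}]}"#.to_string(),
            tool_call_ids: vec!["toolu_02".to_string()],
        },
        Message::Tool { tool_call_id: "toolu_02".to_string(), content: "ok".to_string() },
    ]
}

fn main() {
    let mut history = sample_history();
    println!("substring check sees {} orphan(s)", count_orphans_by_substring(&history));
    let removed = remove_orphans_structured(&mut history);
    println!("structured check removed {removed} orphan(s), {} message(s) remain", history.len());
}
```

In this sample the substring check reports no orphans because the summary prose mentions `toolu_01`, while the structured check removes the `toolu_01` result and keeps the paired `toolu_02` result.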

Validation Evidence (required)

cargo fmt --all -- --check    # clean
cargo clippy --all-targets -- -D warnings    # clean
cargo test -p zeroclaw-runtime --lib    # 1607 passed, 0 failed, 1 ignored
cargo test --workspace    # 3137 passed, 4 failed (pre-existing on master, unrelated)
  • Commands run and tail output:
    • Targeted tests (runtime lib): test result: ok. 1607 passed; 0 failed; 1 ignored
    • Full workspace: 4 pre-existing failures in zeroclaw-providers::compatible::tests::flatten_system_messages_* and 2 in zeroclaw-channels::orchestrator::tests::build_channel_by_id_*_telegram* — verified they fail on unmodified master (git stash + re-run).
  • Beyond CI — what did you manually verify? New unit tests exercise the three fix seams: CLI session load heal, orchestrator compaction repair heal, and remove_orphaned_tool_messages not being fooled by tool_call_id appearing in [CONTEXT SUMMARY] prose.
  • If any command was intentionally skipped, why: ./dev/ci.sh all was not run end-to-end given the pre-existing failures unrelated to this change.

Security & Privacy Impact (required)

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No
  • Secrets / tokens / credentials handling changed? No
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? No

Compatibility (required)

  • Backward compatible? Yes
  • Config / env / CLI surface changed? No

Rollback (required for risk: medium and risk: high)

  • Fast rollback command/path: git revert <sha> — self-healing is additive; reverting restores prior behavior (sessions remain bricked until manual deletion, matching pre-fix state).
  • Feature flags or config toggles: None.
  • Observable failure symptoms: Look for "Removed N orphaned tool message(s) from history" warn-level tracing events (emitted by remove_orphaned_tool_messages). A sudden spike in those events after deploy means many sessions needed healing; their absence means no persisted session needed repair. Anthropic 400 errors mentioning unexpected tool_use_id should disappear.

🤖 Generated with Claude Code

@singlerider
Collaborator

@itripn

The commit doesn't seem to be associated with a GitHub account: https://github.com/zeroclaw-labs/zeroclaw/commit/9a8eba4eb7143b42e7c23118de0aafa6ff3761ae.patch

Is your ~/.gitconfig set up to be associated with the correct username?

@singlerider
Collaborator

singlerider commented Apr 18, 2026

I'm not at a computer to review this, but I definitely want this bug to die. I'll self-assign. Before I review, though:

One minor thing: the channel orchestrator patch at line 5600 runs remove_orphaned_tool_messages after the MAX_CHANNEL_HISTORY drain but before the orphaned-user-turn closure. That ordering is correct (drain first so you don't waste cycles on messages about to be dropped). But if a future change adds another trim step between the drain and the orphan check, orphans could sneak through again. Not a real concern today, just a coupling that someone should be aware of. Mind addressing this in this PR so we avoid sneaky bugs in the future?

@singlerider singlerider self-assigned this Apr 18, 2026
@singlerider singlerider added the needs-author-action Author action required before merge label Apr 18, 2026
@singlerider
Collaborator

@WareWolf-MoonWall, could you do a review? This is the bug that refuses to die.

@singlerider singlerider added needs-maintainer-review and removed needs-author-action Author action required before merge labels Apr 19, 2026
itripn pushed a commit to itripn/zeroclaw that referenced this pull request Apr 20, 2026
Run remove_orphaned_tool_messages as the final mutation before pushing
to conversation_histories, rather than mid-pipeline. Any future trim or
rewrite step inserted into the hydration loop is now automatically
covered by the heal guard.

Addresses reviewer feedback on zeroclaw-labs#5853: the previous ordering worked, but
coupled the heal's correctness to the current shape of the pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@itripn
Author

itripn commented Apr 20, 2026

Thanks for the review! Addressed in b944428 — moved remove_orphaned_tool_messages to run as the final mutation before pushing to conversation_histories, right after the orphan-user-turn closure. Any trim/rewrite step added into the loop in the future is now automatically covered by the heal guard, instead of relying on reviewers to notice the coupling.
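
A minimal sketch of the "heal as the final mutation" ordering, under stated assumptions: `Msg`, `drain_to_max`, `heal`, `hydrate`, and the `MAX_HISTORY` value are hypothetical stand-ins, not the orchestrator's actual code, and the orphan-user-turn closure is left out because its behavior is not described here.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified message shape; the real zeroclaw types may differ.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Msg {
    User(String),
    Assistant { tool_call_ids: Vec<String> },
    Tool { tool_call_id: String },
}

/// Stand-in for MAX_CHANNEL_HISTORY; the real value is not given in this PR.
const MAX_HISTORY: usize = 4;

/// Drop the oldest messages so at most MAX_HISTORY remain. The cut can land
/// between an assistant tool call and its result, leaving an orphan.
fn drain_to_max(history: &mut Vec<Msg>) {
    if history.len() > MAX_HISTORY {
        let excess = history.len() - MAX_HISTORY;
        history.drain(..excess);
    }
}

/// Remove tool results whose id was not declared by an earlier assistant turn.
fn heal(history: &mut Vec<Msg>) -> usize {
    let before = history.len();
    let mut declared: HashSet<String> = HashSet::new();
    history.retain(|msg| match msg {
        Msg::Assistant { tool_call_ids } => {
            declared.extend(tool_call_ids.iter().cloned());
            true
        }
        Msg::Tool { tool_call_id } => declared.contains(tool_call_id),
        Msg::User(_) => true,
    });
    before - history.len()
}

/// Hydration sketch: every trim or rewrite step runs first, the heal runs last.
fn hydrate(mut history: Vec<Msg>) -> Vec<Msg> {
    drain_to_max(&mut history);
    // Any future trim or rewrite step would go here, still ahead of the heal.
    heal(&mut history);
    history
}

/// Six messages: two user turns, each followed by a tool call and its result.
fn sample() -> Vec<Msg> {
    vec![
        Msg::User("hi".to_string()),
        Msg::Assistant { tool_call_ids: vec!["t1".to_string()] },
        Msg::Tool { tool_call_id: "t1".to_string() },
        Msg::User("again".to_string()),
        Msg::Assistant { tool_call_ids: vec!["t2".to_string()] },
        Msg::Tool { tool_call_id: "t2".to_string() },
    ]
}

fn main() {
    let healed = hydrate(sample());
    println!("{} message(s) after hydration: {:?}", healed.len(), healed);
}
```

With this sample the drain cuts off the `t1` assistant turn but keeps its result, and the trailing heal then drops that orphan. Because the heal sits at the end of `hydrate`, a step added earlier in the function cannot leave an orphan behind.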

On the commit-author note: that's my gitconfig email not being linked to the GitHub profile — I'll rebase the trailers once I've added it at https://github.com/settings/emails so the history attributes cleanly.

@itripn
Author

itripn commented Apr 20, 2026

Heads up: about to force-push this branch to re-author the three commits under a GitHub-linked email. No content changes, just author attribution.

itripn and others added 3 commits April 19, 2026 21:42
…zeroclaw-labs#5813)

Orphaned tool_result messages (whose paired assistant tool_use was lost
to compaction or a crash) bricked Signal-channel sessions with repeated
Anthropic 400 "unexpected tool_use_id in tool_result blocks" errors.

- Load paths (CLI session file + orchestrator channel hydration) now run
  remove_orphaned_tool_messages so a corrupt persisted session heals
  instead of requiring manual file deletion.
- Compaction's repair pass now delegates to the canonical
  remove_orphaned_tool_messages instead of a weak adjacency heuristic.
- Orphan detection parses the assistant's structured tool_calls array
  instead of substring-matching content — summaries that preserve the
  orphan's id in prose no longer fool the check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI clippy (1.93) flagged clippy::useless_format on a format!() with no
interpolation args. Replace with a plain &str literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run remove_orphaned_tool_messages as the final mutation before pushing
to conversation_histories, rather than mid-pipeline. Any future trim or
rewrite step inserted into the hydration loop is now automatically
covered by the heal guard.

Addresses reviewer feedback on zeroclaw-labs#5853: the previous ordering worked, but
coupled the heal's correctness to the current shape of the pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@itripn itripn force-pushed the fix/5813-orphan-tool-result-self-heal branch from b944428 to 8061121 Compare April 20, 2026 04:42
