Skip to content

feat(agent): HMAC tool execution receipts for hallucination detection#4831

Closed
singlerider wants to merge 7 commits intozeroclaw-labs:masterfrom
singlerider:feat/tool-receipts
Closed

feat(agent): HMAC tool execution receipts for hallucination detection#4831
singlerider wants to merge 7 commits intozeroclaw-labs:masterfrom
singlerider:feat/tool-receipts

Conversation

@singlerider
Copy link
Copy Markdown
Collaborator

@singlerider singlerider commented Mar 27, 2026

Summary

  • Base branch target: master
  • Problem: LLMs hallucinate tool calls — claiming "I ran X and got Y" without executing anything. No way to verify tool execution from output alone.
  • Why it matters: Observed in production: bot fabricated an entire tool-assisted workflow, then later admitted fabrication — but the work had actually been completed. Neither the user nor the system could resolve the contradiction.
  • What changed: HMAC-SHA256 receipts generated after each tool execution. Receipts appended to tool results, optionally shown in user-visible messages. Leak detector exempts receipt tokens. System prompt instructs LLM to preserve receipts.
  • What did not change: Tool execution flow. Provider API calls. Conversation history format. No enforcement (Phase 1 is passive).

Label Snapshot (required)

  • Risk label: risk: low
  • Size label: size: M
  • Scope labels: agent, security, config
  • Module labels: agent: tool_execution, security: leak_detector
  • Contributor tier label: N/A

Change Metadata

  • Change type: feature
  • Primary scope: agent

Linked Issue

Validation Evidence (required)

cargo fmt --all -- --check     # clean
cargo clippy --all-targets -- -D warnings  # clean
cargo test --lib               # 5811 pass (13 new)
  • Evidence provided: 12 receipt unit tests (deterministic, adversarial, format), 1 leak detector test, manual production testing on Discord
  • If any command is intentionally skipped: N/A

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? Yes — ephemeral HMAC key generated per session (in-memory only, never logged/exposed)
  • File system access scope changed? No
  • If Yes: The HMAC key is 256-bit random, generated via ring::SystemRandom, stored only in ReceiptGenerator struct in memory. Never serialized, never sent to the LLM, never logged. Destroyed when the session ends.

Privacy and Data Hygiene (required)

  • Data-hygiene status: pass
  • Redaction/anonymization notes: Receipt hashes are derived from tool results (which are already sanitized by scrub_credentials). The HMAC itself reveals no information about the key.
  • Neutral wording confirmation: Yes

Compatibility / Migration

  • Backward compatible? Yes — disabled by default
  • Config/env changes? New optional [agent.tool_receipts] section with enabled and show_in_response
  • Migration needed? No

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? Yes (new docs/security/tool-receipts.md, config reference update)
  • Locale navigation parity updated? No — English only for initial feature documentation
  • If No: New feature, English docs first. Localization in follow-up.

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: Receipt generation on Discord (shell, weather tools). show_in_response footer. Leak detector exemption. Bot echoes real receipts same-turn. Bot fabricates ULID-format fake when receipts not wired up.
  • Edge cases checked: Multi-tool (2 weather calls = 2 distinct receipts). Cross-turn recall failure (known Phase 1 limitation). Failed tool execution produces no receipt.
  • What was not verified: Matrix channel (expected identical — same code path). Active validation (Phase 2).

Verified test exchanges (reconstructed from production, sanitized)

Test 1 — same-turn echo works:

User: Run date in a shell and give me the response.
Bot: Fri Mar 27 09:48:19 UTC 2026 [receipt: zc-receipt-1774604899-fVRG...]
"showed you were fabricated. This one has the zc-receipt- prefix. The others didn't. I was lying."

Debug log: Tool receipt generated tool=shell receipt=zc-receipt-1774604899-fVRG...

Test 2 — fabrication before wiring:

User: Repeat the exact tool result including any [receipt: lines.
Bot: [receipt: 01JGN1YFRXHFEJQZFV4W6YBMJK]

No zc-receipt- prefix. Bot fabricated a ULID. Immediately distinguishable. ❌ (expected — receipts not configured)

Test 3 — multi-tool with show_in_response:

User: What's the weather in [City A] and [City B]?
Bot: [weather results]

Tool receipts:
weather: zc-receipt-1774609588-f8Rh0E3j...
weather: zc-receipt-1774609588-ArUgr4xu...

Debug log: Two Tool receipt generated events with matching hashes. ✅

Test 4 — cross-turn recall failure (Phase 1 limitation):

User: Give me the two tool calls you just made.
Bot: "The receipts I gave you earlier were wrong. I fabricated the first set."

Debug log: Bot re-executed weather twice (timestamps 1774609791 vs originals 1774609588). Both sets valid — different invocations. ❌ (known limitation)

Associated debug log pattern:

2026-03-27T11:06:28.845504Z DEBUG zeroclaw::agent::loop_: Tool receipt generated tool=weather receipt=zc-receipt-1774609588-f8Rh0E3jVv2lhzehPrYRym9L3N8tYL9wuq7dZebfYYw
2026-03-27T11:06:28.845525Z DEBUG zeroclaw::agent::loop_: Tool receipt generated tool=weather receipt=zc-receipt-1774609588-ArUgr4xuHzyWi3J25gHGWOToi2L-2_taZH_0xBUl76Q

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Tool execution outcome, agent loop result injection, system prompt, leak detector, config schema
  • Potential unintended effects: Tool results are slightly longer (receipt appended). LLM may echo receipts in user-visible responses (intended when show_in_response is false — the system prompt encourages it).
  • Guardrails/monitoring: Tool receipt generated debug log. Receipt appears in tool result content.

Agent Collaboration Notes (recommended)

  • Agent tools used: Claude Code (Opus 4.6)
  • Workflow/plan summary: Research → design (arXiv:2603.10060) → implement receipt module → wire into execution → test on production → iterate on leak detector exemption and show_in_response
  • Verification focus: HMAC integrity, fabrication distinguishability, no new dependencies
  • Confirmation: naming + architecture boundaries followed

Rollback Plan (required)

  • Fast rollback command/path: Set agent.tool_receipts.enabled = false or remove section from config
  • Feature flags or config toggles: enabled = false (default), show_in_response = false (default)
  • Observable failure symptoms: None — disabling receipts restores exact previous behavior

Risks and Mitigations

  • Risk: Phase 1 is passive — receipts don't prevent hallucinations, only make them detectable
  • Risk: Cross-turn recall fails — bot re-executes tools when asked to repeat receipts
    • Mitigation: The show_in_response runtime footer is the reliable verification path. Phase 2 adds persistent audit table.

Zero new dependencies

All cryptographic primitives already in the dependency tree: hmac (v0.12), sha2, ring (for SecureRandom), base64. No binary size increase beyond ~200 lines of receipt logic.

References

  1. Basu, A. (2026). "Tool Receipts, Not Zero-Knowledge Proofs." arXiv:2603.10060. 94.2% detection, <15ms overhead.
  2. Anthropic Tool Use Docs
  3. tool-gating-mcp — complementary pattern

🤖 Generated with Claude Code

@github-actions github-actions Bot added agent Auto scope: src/agent/** changed. channel Auto scope: src/channels/** changed. config Auto scope: src/config/** changed. tool Auto scope: src/tools/** changed. labels Mar 27, 2026
@github-actions github-actions Bot added the security Auto scope: src/security/** changed. label Mar 27, 2026
@singlerider
Copy link
Copy Markdown
Collaborator Author

I'm not 100% sure if anyone else would find something like this useful like I would. It's boring to constantly see the bots hallucinate. Being able to prove to them that they have done so is often a good way to kickstart them in the right direction. It feels in-the-spirit with the audit-first culture of this repo.

If this is a bad idea, cool, I get it, but it could also be the start of an interesting way to audit consumer LLM tech in an intuitive way.

I'm in no rush to get this merged, but I would love for a discussion either here (regarding implementation) and/or in #4830 (regarding the idea in general) to see if this opt-in feature set should be included (or not).

@github-actions github-actions Bot added the docs Auto scope: docs/markdown/template files changed. label Mar 27, 2026
@singlerider singlerider marked this pull request as ready for review March 27, 2026 12:49
…n detection

When enabled via agent.tool_receipts.enabled, every tool execution
produces a cryptographic HMAC-SHA256 receipt appended to the tool
result. The LLM cannot forge valid receipts because the ephemeral
session key is never exposed. Receipts create an independent ground
truth about which tools actually executed.

Adds ReceiptGenerator with per-session ephemeral keys, config flag
(disabled by default), system prompt instruction when enabled, and
12 unit tests including adversarial verification (tampered results,
wrong keys, fabricated receipts).

Based on: Basu (2026), "Tool Receipts, Not Zero-Knowledge Proofs:
Practical Hallucination Detection for AI Agents," arXiv:2603.10060.
When agent.tool_receipts.show_in_response is true, append collected
tool receipts to the user-visible response message. Uses the
universal delivered_response finalization point — works across all
channels and streaming modes without per-channel implementation.

Receipts are collected via a shared Mutex<Vec> passed to the tool
loop, then appended after sanitization but before send. Default
false — receipts remain internal/audit only unless opted in.
…action

When agent.tool_receipts.show_in_response is true, append collected
tool receipts to the user-visible response message. Uses the
universal delivered_response finalization point — works across all
channels and streaming modes without per-channel implementation.

Exempt zc-receipt- tokens from the leak detector's high-entropy
redaction so receipts are visible to the user when the LLM echoes
them. Without this, the sanitizer replaces receipts with
[REDACTED_HIGH_ENTROPY_TOKEN], making inline and appended receipts
appear different for the same tool call.

Refs zeroclaw-labs#4830
Add docs/security/tool-receipts.md covering receipt format, config
options, security properties, what receipts detect vs don't prevent,
how to view receipts in logs and responses, and Phase 1 limitations.

Add tool_receipts section to config-reference.md with enabled and
show_in_response options. Link from security README.

Apply cargo fmt to leak_detector.rs.
Rust 2024 reserves `gen` as a keyword. Rename all test variables
and closure parameters to `receipt_gen` in tool_receipts.rs and
tool_execution.rs.
singlerider added a commit to singlerider/zeroclaw that referenced this pull request Mar 28, 2026
Every tool execution produces a cryptographic HMAC-SHA256 receipt proving
the tool actually ran. The LLM cannot forge valid receipts because it
never sees the ephemeral session key.

New module src/agent/tool_receipts.rs with ReceiptGenerator. Wired
through tool_execution.rs, loop_.rs, and channels/mod.rs. Opt-in via
config: agent.tool_receipts.enabled and agent.tool_receipts.show_in_response.

Leak detector updated to exempt zc-receipt- tokens from entropy redaction.

Closes zeroclaw-labs#4830
Supersedes zeroclaw-labs#4831, zeroclaw-labs#4921
singlerider added a commit to singlerider/zeroclaw that referenced this pull request Mar 29, 2026
Every tool execution produces a cryptographic HMAC-SHA256 receipt proving
the tool actually ran. The LLM cannot forge valid receipts because it
never sees the ephemeral session key.

New module src/agent/tool_receipts.rs with ReceiptGenerator. Wired
through tool_execution.rs, loop_.rs, and channels/mod.rs. Opt-in via
config: agent.tool_receipts.enabled and agent.tool_receipts.show_in_response.

Leak detector updated to exempt zc-receipt- tokens from entropy redaction.

Closes zeroclaw-labs#4830
Supersedes zeroclaw-labs#4831, zeroclaw-labs#4921
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Auto scope: src/agent/** changed. channel Auto scope: src/channels/** changed. config Auto scope: src/config/** changed. docs Auto scope: docs/markdown/template files changed. security Auto scope: src/security/** changed. tool Auto scope: src/tools/** changed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: HMAC tool execution receipts for hallucination detection

1 participant