
fix: error context preservation, WAL checkpoint, hook timeout#6622

Open
aaronlab wants to merge 2 commits into NousResearch:main from aaronlab:fix/error-context-wal-checkpoint-hook-timeout

Conversation

@aaronlab
Contributor

@aaronlab aaronlab commented Apr 9, 2026

Summary

This PR makes four reliability fixes across three files, found during iteration #4 of a deep code audit:

  • Preserve original error context in call_llm retry chain (agent/auxiliary_client.py): When the max_tokens retry also fails with a payment error, the original error was silently overwritten via first_err = retry_err (line 2089), losing the initial diagnostic context. Now uses retry_err.__cause__ = first_err for proper Python exception chaining, preserving the full error sequence for debugging.

  • Guard payment fallback provider call (agent/auxiliary_client.py): The fallback API call (line 2105) had no try/except. If the fallback provider also failed, there was zero logging and no indication that a fallback was even attempted. Added error handling with a warning log message identifying which fallback provider failed.

  • Add WAL checkpoint to holographic memory store (plugins/memory/holographic/store.py): WAL mode was enabled (line 130) but no checkpoint mechanism existed anywhere in the class, unlike hermes_state.py which has proper _try_wal_checkpoint(). This causes unbounded WAL file growth over long-running sessions. Added periodic PASSIVE checkpoint every 50 writes and a final checkpoint on close(), following the established pattern.

  • Add timeout protection to plugin hook invocation (hermes_cli/plugins.py): Hook callbacks had good exception isolation (try/except per callback) but no timeout protection. A misbehaving plugin could block the agent loop indefinitely with a blocking call. Added a 30-second timeout using ThreadPoolExecutor, with warning logging on timeout.
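The first two items can be sketched as follows. This is a minimal illustration of exception chaining plus a guarded fallback, not the actual code in agent/auxiliary_client.py: `PaymentError`, `call_with_retry`, and the three callables are hypothetical stand-ins, and the real control flow in `call_llm` may differ.

```python
import logging

logger = logging.getLogger("auxiliary_client_sketch")  # hypothetical logger name


class PaymentError(Exception):
    """Hypothetical stand-in for a provider's payment/credit error."""


def call_with_retry(primary, retry, fallback, fallback_name="fallback"):
    """Try the primary call, then a retry, then a fallback provider.

    Each failure stays reachable through __cause__ instead of being
    overwritten by the next one.
    """
    try:
        return primary()
    except Exception as first_err:
        try:
            return retry()
        except PaymentError as retry_err:
            # Chain rather than overwrite: the first failure remains visible.
            retry_err.__cause__ = first_err
            try:
                return fallback()
            except Exception as fallback_err:
                # Guarded fallback: log which provider failed, then re-raise
                # with the whole sequence attached.
                logger.warning(
                    "Fallback provider %s failed: %s", fallback_name, fallback_err
                )
                raise fallback_err from retry_err
```

With this shape, a traceback for the final error prints all three failures in order, joined by "The above exception was the direct cause of the following exception".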

Files Changed

| File | Change |
| --- | --- |
| agent/auxiliary_client.py | Exception chaining for error context + fallback error handling |
| plugins/memory/holographic/store.py | WAL checkpoint method + periodic + on-close checkpoint |
| hermes_cli/plugins.py | 30s timeout for hook callbacks via ThreadPoolExecutor |

Test plan

  • Simulate primary provider max_tokens error → payment retry → verify original error in __cause__
  • Simulate fallback provider failure → verify warning log is emitted
  • Insert 100+ facts into holographic store → verify WAL file stays bounded
  • Create a plugin with time.sleep(60) in pre_llm_call hook → verify 30s timeout warning
  • Run existing test suite: pytest tests/
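For the WAL item in the test plan, the checkpoint cadence can be exercised with a small standalone store. This is a sketch under assumptions: `FactStore`, its `facts` table, and the `checkpoints` counter are hypothetical and only mirror the behavior described in this PR (a PASSIVE checkpoint every 50 writes plus one on `close()`). It counts checkpoints rather than measuring the WAL file size, since a PASSIVE checkpoint lets SQLite reuse the WAL but does not by itself shrink the file.

```python
import os
import sqlite3
import tempfile

CHECKPOINT_EVERY = 50  # interval named in the PR description


class FactStore:
    """Hypothetical minimal store; not the real holographic store class."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("CREATE TABLE IF NOT EXISTS facts (body TEXT)")
        self._conn.commit()
        self._writes = 0
        self.checkpoints = 0  # instrumentation for the demo only

    def _try_wal_checkpoint(self):
        """Best-effort PASSIVE checkpoint; never raises into the caller."""
        try:
            self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
            self.checkpoints += 1
        except sqlite3.Error:
            pass

    def add_fact(self, body):
        self._conn.execute("INSERT INTO facts (body) VALUES (?)", (body,))
        self._conn.commit()
        self._writes += 1
        if self._writes % CHECKPOINT_EVERY == 0:
            self._try_wal_checkpoint()

    def close(self):
        self._try_wal_checkpoint()  # final checkpoint before closing
        self._conn.close()


with tempfile.TemporaryDirectory() as tmp:
    store = FactStore(os.path.join(tmp, "facts.db"))
    for i in range(120):
        store.add_fact(f"fact {i}")
    store.close()
    checkpoint_count = store.checkpoints  # two periodic + one on close
```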

🤖 Generated with Claude Code

aaronlab and others added 2 commits April 9, 2026 20:54
…agent loop reliability

## Summary
Found 5 bugs (2 critical, 1 high, 2 medium) in async error handling, context compression, and cron scheduling:

**CRITICAL (2):**
1. Role violation after context compression (context_compressor.py:694-728)
   - Tool message validation missing when merging summary
   - Causes API crash and data loss after compression

2. Double-execution race condition in cron scheduler (scheduler.py:843-892)
   - File lock released before job execution completes
   - Allows duplicate jobs to be executed (DoS, duplicate messages)

**HIGH (1):**
3. Unhandled context compression exceptions in main loop (run_agent.py:8204,8262,8338)
   - Silent crash when summarizer fails during API loop
   - No graceful degradation

**MEDIUM (2):**
4. Error swallowing in auxiliary_client (auxiliary_client.py:2074-2106)
   - Original error overwritten on retry failure
   - Lost error context, unreachable fallback logic

5. Session ID change without exception recovery (run_agent.py:6041-6071)
   - Session state corruption on DB failures
   - Broken session lineage
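
The double-execution issue (bug 2) comes down to lock scope: the job has to run while the lock is still held. The sketch below illustrates that idea only. It assumes a POSIX system with `fcntl.flock`; `job_lock` and `run_job_once` are hypothetical names, and the locking code in scheduler.py is not shown in this PR and may look different.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def job_lock(lock_path):
    """Yield True if the exclusive lock was acquired, False if already held."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # another run holds the lock
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def run_job_once(lock_path, job):
    """Run job() under the lock; return None if another run is in progress."""
    with job_lock(lock_path) as acquired:
        if not acquired:
            return None
        # The job executes inside the with-block, so the lock is released
        # only after it finishes (or raises).
        return job()
```

A second caller that arrives while the job is running gets `None` back instead of starting a duplicate run.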

## Details
Full analysis with code snippets, scenarios, and fixes in:
- AUDIT_ITERATION_2.md (400 lines, detailed technical analysis)
- AUDIT_ITERATION_2_SUMMARY.txt (visual summary, testing recommendations)

## Recommended Priority
1. Bug #1 (Role Violation) - FIX IMMEDIATELY
2. Bug #2 (Double Execution) - FIX IMMEDIATELY
3. Bug #3 (Unhandled Exceptions) - FIX SOON
4. Bug #4 & #5 - FIX AFTER critical bugs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tection

- Preserve original error context in call_llm retry chain (agent/auxiliary_client.py):
  When max_tokens retry fails with a payment error, the original error was silently
  overwritten via `first_err = retry_err`, losing diagnostic context. Now chains
  the original error via `__cause__` for proper Python exception chaining.

- Guard fallback provider call (agent/auxiliary_client.py):
  Payment fallback API call at line 2105 had no try/except. If the fallback
  provider also failed, there was no logging and no indication the fallback was
  attempted. Added error handling with warning log on fallback failure.

- Add WAL checkpoint to holographic memory store (plugins/memory/holographic/store.py):
  WAL mode was enabled but no checkpoint mechanism existed, causing unbounded WAL
  file growth over time. Added periodic checkpoint every 50 writes and a final
  checkpoint on close(), following the same pattern used in hermes_state.py.

- Add timeout protection to plugin hook invocation (hermes_cli/plugins.py):
  Plugin hook callbacks had exception isolation but no timeout protection. A
  misbehaving plugin could block the agent loop indefinitely. Added 30-second
  timeout using ThreadPoolExecutor with proper warning logging on timeout.

Co-Authored-By: Claude Code <noreply@anthropic.com>
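
The hook timeout described in the last item of this commit can be sketched with `concurrent.futures`. This is an illustration under assumptions, not the code in hermes_cli/plugins.py: `invoke_hooks` and the example hooks are hypothetical, and the demo uses a short timeout instead of the 30 seconds named in the PR so that it finishes quickly.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("plugins_sketch")  # hypothetical logger name
HOOK_TIMEOUT_SECONDS = 30.0  # value named in the PR description


def invoke_hooks(callbacks, *args, timeout=HOOK_TIMEOUT_SECONDS, **kwargs):
    """Call each hook with a time limit; collect results from those that finish."""
    results = []
    for cb in callbacks:
        name = getattr(cb, "__name__", repr(cb))
        executor = ThreadPoolExecutor(max_workers=1)
        future = executor.submit(cb, *args, **kwargs)
        try:
            results.append(future.result(timeout=timeout))
        except FutureTimeout:
            logger.warning("Hook %s timed out after %.2fs", name, timeout)
        except Exception as exc:
            # Per-callback isolation: one failing hook does not stop the rest.
            logger.warning("Hook %s raised: %s", name, exc)
        finally:
            # Do not wait for a stuck worker; the caller moves on.
            executor.shutdown(wait=False)
    return results


def slow_hook():
    time.sleep(0.5)  # stands in for a blocking plugin call
    return "slow"


def fast_hook():
    return "fast"


hook_results = invoke_hooks([slow_hook, fast_hook], timeout=0.05)
```

One limitation of this approach is worth stating: Python cannot kill a running thread, so a timed-out hook keeps running in the background. The timeout unblocks the caller but does not stop the plugin's work, and a worker that never returns can still delay interpreter shutdown.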
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), and comp/plugins (Plugin system and bundled plugins) labels on Apr 29, 2026
@alt-glitch
Collaborator

Related to #16510 (WAL checkpoint pattern) and #6684 (write_count race in hermes_state.py). WAL checkpoint portion overlaps with existing work.

