fix: error context preservation, WAL checkpoint, hook timeout #6622
Open
aaronlab wants to merge 2 commits into
…agent loop reliability

## Summary

Found 5 bugs in async error handling, context compression, and cron scheduling:

**CRITICAL (2):**

1. Role violation after context compression (context_compressor.py:694-728)
   - Tool message validation missing when merging summary
   - Causes API crash and data loss after compression
2. Double-execution race condition in cron scheduler (scheduler.py:843-892)
   - File lock released before job execution completes
   - Allows duplicate jobs to be executed (DoS, duplicate messages)

**HIGH (1):**

3. Unhandled context compression exceptions in main loop (run_agent.py:8204,8262,8338)
   - Silent crash when summarizer fails during API loop
   - No graceful degradation

**MEDIUM (2):**

4. Error swallowing in auxiliary_client (auxiliary_client.py:2074-2106)
   - Original error overwritten on retry failure
   - Lost error context, unreachable fallback logic
5. Session ID change without exception recovery (run_agent.py:6041-6071)
   - Session state corruption on DB failures
   - Broken session lineage

## Details

Full analysis with code snippets, scenarios, and fixes in:

- AUDIT_ITERATION_2.md (400 lines, detailed technical analysis)
- AUDIT_ITERATION_2_SUMMARY.txt (visual summary, testing recommendations)

## Recommended Priority

1. Bug #1 (Role Violation) - FIX IMMEDIATELY
2. Bug #2 (Double Execution) - FIX IMMEDIATELY
3. Bug #3 (Unhandled Exceptions) - FIX SOON
4. Bug #4 & #5 - FIX AFTER critical bugs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
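The error-swallowing bug listed above (original error overwritten on retry failure) can be illustrated with a minimal sketch. `PaymentError`, `call_primary`, `call_with_retry`, and `describe_chain` are hypothetical stand-ins, not the actual `auxiliary_client.py` code; the sketch only shows how `raise ... from ...` keeps the first error reachable through `__cause__` instead of discarding it.

```python
class PaymentError(Exception):
    """Hypothetical stand-in for a provider payment/credit error."""


def call_primary(max_tokens: int) -> str:
    # Hypothetical provider call: the first attempt fails on max_tokens,
    # and the reduced-token retry fails with a payment error.
    if max_tokens > 1024:
        raise ValueError("max_tokens too large")
    raise PaymentError("insufficient credits")


def call_with_retry() -> str:
    try:
        return call_primary(max_tokens=4096)
    except ValueError as first_err:
        try:
            return call_primary(max_tokens=1024)
        except PaymentError as retry_err:
            # Chain instead of overwriting: the retry error propagates,
            # and the original error stays attached as __cause__.
            raise retry_err from first_err


def describe_chain(exc: BaseException) -> list:
    """Walk the __cause__ links and return one line per exception."""
    chain = []
    current = exc
    while current is not None:
        chain.append(f"{type(current).__name__}: {current}")
        current = current.__cause__
    return chain


if __name__ == "__main__":
    try:
        call_with_retry()
    except PaymentError as exc:
        for line in describe_chain(exc):
            print(line)
```

With the chain intact, a traceback or a logger using `exc_info` shows both failures, so the reason for the retry is not lost when the retry itself fails.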
…tection

- Preserve original error context in call_llm retry chain (agent/auxiliary_client.py): When max_tokens retry fails with a payment error, the original error was silently overwritten via `first_err = retry_err`, losing diagnostic context. Now chains the original error via `__cause__` for proper Python exception chaining.
- Guard fallback provider call (agent/auxiliary_client.py): Payment fallback API call at line 2105 had no try/except. If the fallback provider also failed, there was no logging and no indication the fallback was attempted. Added error handling with warning log on fallback failure.
- Add WAL checkpoint to holographic memory store (plugins/memory/holographic/store.py): WAL mode was enabled but no checkpoint mechanism existed, causing unbounded WAL file growth over time. Added periodic checkpoint every 50 writes and a final checkpoint on close(), following the same pattern used in hermes_state.py.
- Add timeout protection to plugin hook invocation (hermes_cli/plugins.py): Plugin hook callbacks had exception isolation but no timeout protection. A misbehaving plugin could block the agent loop indefinitely. Added 30-second timeout using ThreadPoolExecutor with proper warning logging on timeout.

Co-Authored-By: Claude Code <noreply@anthropic.com>
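The WAL checkpoint item above can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions, not the code in `plugins/memory/holographic/store.py`: the class `CheckpointingStore`, the `facts` table, the `checkpoints` counter, and the constant name `CHECKPOINT_EVERY` are all hypothetical. Only the 50-write interval and the checkpoint on `close()` come from the commit message.

```python
import logging
import os
import sqlite3
import tempfile

logger = logging.getLogger(__name__)

CHECKPOINT_EVERY = 50  # interval taken from the commit message


class CheckpointingStore:
    """Hypothetical store; not the actual holographic memory store."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("CREATE TABLE IF NOT EXISTS facts (body TEXT)")
        self._conn.commit()
        self._writes = 0
        self.checkpoints = 0  # kept only so the sketch is observable

    def _try_checkpoint(self) -> None:
        # PASSIVE copies as many WAL frames as it can into the main
        # database without waiting on readers or writers.
        try:
            self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
            self.checkpoints += 1
        except sqlite3.Error as exc:
            logger.warning("WAL checkpoint failed: %s", exc)

    def add(self, body: str) -> None:
        self._conn.execute("INSERT INTO facts (body) VALUES (?)", (body,))
        self._conn.commit()
        self._writes += 1
        if self._writes % CHECKPOINT_EVERY == 0:
            self._try_checkpoint()

    def close(self) -> None:
        # Final checkpoint so the WAL is flushed before the handle closes.
        self._try_checkpoint()
        self._conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = CheckpointingStore(os.path.join(tmp, "memory.db"))
        for i in range(120):
            store.add(f"fact {i}")
        store.close()
        print("checkpoints run:", store.checkpoints)
```

A checkpoint failure is logged and otherwise ignored here, on the assumption that a missed checkpoint should never fail a write; the next interval or `close()` gets another chance.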
## Summary

This PR addresses four reliability gaps found during iteration #4 of a deep code audit:

- Preserve original error context in call_llm retry chain (`agent/auxiliary_client.py`): When the `max_tokens` retry also fails with a payment error, the original error was silently overwritten via `first_err = retry_err` (line 2089), losing the initial diagnostic context. Now uses `retry_err.__cause__ = first_err` for proper Python exception chaining, preserving the full error sequence for debugging.
- Guard payment fallback provider call (`agent/auxiliary_client.py`): The fallback API call (line 2105) had no try/except. If the fallback provider also failed, there was zero logging and no indication that a fallback was even attempted. Added error handling with a warning log message identifying which fallback provider failed.
- Add WAL checkpoint to holographic memory store (`plugins/memory/holographic/store.py`): WAL mode was enabled (line 130) but no checkpoint mechanism existed anywhere in the class, unlike `hermes_state.py`, which has a proper `_try_wal_checkpoint()`. This causes unbounded WAL file growth over long-running sessions. Added a periodic PASSIVE checkpoint every 50 writes and a final checkpoint on `close()`, following the established pattern.
- Add timeout protection to plugin hook invocation (`hermes_cli/plugins.py`): Hook callbacks had good exception isolation (try/except per callback) but no timeout protection. A misbehaving plugin could block the agent loop indefinitely with a blocking call. Added a 30-second timeout using `ThreadPoolExecutor`, with warning logging on timeout.

## Files Changed

- `agent/auxiliary_client.py`
- `plugins/memory/holographic/store.py`
- `hermes_cli/plugins.py`

## Test plan

- … `__cause__`
- `time.sleep(60)` in pre_llm_call hook → verify 30s timeout warning
- `pytest tests/`

🤖 Generated with Claude Code
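The hook-timeout check in the test plan can be tried against a small sketch of the technique. This is an assumption-laden illustration, not the code in `hermes_cli/plugins.py`: `invoke_hook_with_timeout`, `slow_hook`, and `fast_hook` are hypothetical names, and the demo uses sub-second values instead of the PR's 30-second limit so it finishes quickly.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger(__name__)


def invoke_hook_with_timeout(callback, timeout_s=30.0, **kwargs):
    """Run one hook callback; return its result, or None on timeout/error."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(callback, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        logger.warning("hook %r timed out after %.1fs", callback, timeout_s)
        return None
    except Exception as exc:
        # Exception isolation: a failing hook must not break the caller.
        logger.warning("hook %r raised: %s", callback, exc)
        return None
    finally:
        # wait=False so the caller is not blocked by a still-running hook.
        executor.shutdown(wait=False)


def slow_hook(**_):
    # Hypothetical misbehaving plugin hook.
    time.sleep(0.5)
    return "late"


def fast_hook(name):
    # Hypothetical well-behaved plugin hook.
    return f"hello {name}"


if __name__ == "__main__":
    print(invoke_hook_with_timeout(slow_hook, timeout_s=0.1))
    print(invoke_hook_with_timeout(fast_hook, name="agent"))
```

One design caveat applies to any thread-based timeout in Python: the worker thread cannot be killed, so a timed-out callback keeps running in the background until it returns. The timeout only frees the caller. Using the executor as a context manager would defeat the purpose, because leaving the `with` block waits for the running task.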