[Bug]: hermes curator run can lose LLM reports because CLI exits while background daemon thread is still running #20555

@steezkelly

Describe the bug

Manual hermes curator run / hermes curator run --dry-run can report that the LLM pass is running in the background, then return control to the shell before the report is written. In a short-lived CLI process, the background review is run in a daemon thread; when the CLI process exits, that daemon thread can be terminated before it writes REPORT.md / run.json or updates last_report_path correctly.

The visible symptom is that hermes curator status shows either an old/stale last report path or a path that does not exist, even though hermes curator run appeared to start successfully.

Example observed status:

curator: ENABLED
  runs:           1
  last summary:   auto: no changes
  last report:    /tmp/pytest-of-steve/pytest-215/popen-gw2/test_state_atomic_write_no_tmp0/.hermes/logs/curator/20260506-011414
...

The /tmp/pytest-of-... path was stale test state, and no fresh report was produced by the manual CLI invocation.

Expected behavior

A manual CLI run should either:

  1. complete synchronously and write its report before returning, or
  2. explicitly opt into a reliable long-lived background execution mode.

This matters because manual runs are usually used to verify curator behavior and inspect reports, and the RFC/documented debug path is hermes curator run --sync.

Actual behavior

The CLI default can start the LLM/report phase in a background daemon thread and then exit. Because the process exits, the daemon thread may be killed before report generation and state update complete.

Root cause

The core agent.curator.run_curator_review(...) handles its synchronous flag correctly. The problem is that the CLI wrapper defaults manual invocations to the background/daemon-thread path.

That background path is appropriate for the gateway/idle hook, where the parent process is long-lived. It is not reliable for a short-lived hermes curator run CLI command.
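To illustrate the lifecycle problem outside of Hermes, here is a minimal standalone sketch. It does not use any Hermes code: the child script, write_report, and report_written are hypothetical names made up for this demonstration. It relies only on documented Python behavior, namely that daemon threads are abandoned when the interpreter exits.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child script standing in for a short-lived CLI process.
# It starts a daemon thread that writes a report after a delay, then
# either waits for the thread ("join") or returns immediately ("exit").
CHILD = """
import sys, threading, time
from pathlib import Path

def write_report(path):
    time.sleep(1.0)  # stand-in for the slow LLM pass
    Path(path).write_text("report")

t = threading.Thread(target=write_report, args=(sys.argv[1],), daemon=True)
t.start()
if sys.argv[2] == "join":
    t.join()  # synchronous behavior: block until the report exists
"""


def report_written(mode: str) -> bool:
    """Run the child in the given mode and check whether REPORT.md appeared."""
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "REPORT.md"
        subprocess.run(
            [sys.executable, "-c", CHILD, str(report), mode],
            check=True,
        )
        return report.exists()
```

Calling report_written("exit") mirrors the failing case (the process ends while the daemon thread is still sleeping, so no file is written), while report_written("join") mirrors the synchronous path.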

Local fix validated

I patched my local checkout so that hermes_cli/curator.py makes manual CLI run synchronous by default, while retaining an explicit --background flag for the old non-blocking behavior.

Local commit:

c6c74385d fix(curator): make manual runs synchronous by default

Main change:

  • hermes curator run and hermes curator run --dry-run now pass synchronous=True by default.
  • --background opts into the legacy non-blocking path.
  • --sync remains accepted and wins over --background if both are supplied.
  • hermes curator status marks a saved report path as missing/stale when the path no longer exists.
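For reviewers, the flag precedence described above can be sketched roughly as follows. This is not the code from the local commit: build_parser and resolve_synchronous are hypothetical names, and the real hermes_cli/curator.py may wire its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flags discussed in this issue."""
    parser = argparse.ArgumentParser(prog="hermes curator run")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--sync", action="store_true")
    parser.add_argument("--background", action="store_true")
    return parser


def resolve_synchronous(args: argparse.Namespace) -> bool:
    """Decide the value that would be passed as synchronous=...

    --sync always wins; otherwise the run is synchronous unless
    --background was requested explicitly.
    """
    if args.sync:
        return True
    return not args.background
```

With this shape, a bare run and --dry-run both resolve to synchronous, --background alone resolves to non-blocking, and --sync --background still resolves to synchronous.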

Verification performed

Focused tests:

python -m pytest \
  tests/hermes_cli/test_curator_run.py \
  tests/hermes_cli/test_curator_status.py \
  tests/hermes_cli/test_curator_archive_prune.py \
  tests/agent/test_curator_reports.py \
  tests/agent/test_curator.py \
  tests/agent/test_curator_activity.py \
  tests/agent/test_curator_classification.py \
  tests/agent/test_curator_backup.py \
  -q

Result:

145 passed

Static/syntax checks:

ruff check hermes_cli/curator.py tests/hermes_cli/test_curator_run.py tests/hermes_cli/test_curator_status.py
python -m py_compile hermes_cli/curator.py tests/hermes_cli/test_curator_run.py tests/hermes_cli/test_curator_status.py

Result:

All checks passed

I also ran an isolated temp HERMES_HOME E2E smoke test with a stub LLM response. The synchronous dry-run path created both:

REPORT.md
run.json

and updated state to the fresh report directory before returning.

Suggested fix

Change the CLI wrapper so manual hermes curator run defaults to synchronous execution, and make background mode explicit:

hermes curator run              # synchronous, reliable report write
hermes curator run --dry-run    # synchronous, reliable preview report
hermes curator run --background # legacy non-blocking behavior

Also helpful: when curator status shows last_report_path, check whether the path exists and annotate missing paths as stale/missing rather than presenting them as valid latest reports.
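A rough sketch of what the status annotation could look like, assuming the saved state exposes last_report_path as a string or None. The function name format_last_report and the exact "(missing/stale)" wording are illustrative assumptions, not the existing status code.

```python
from pathlib import Path
from typing import Optional


def format_last_report(last_report_path: Optional[str]) -> str:
    """Render the 'last report' status line, flagging paths that are gone."""
    label = "last report:    "
    if not last_report_path:
        return label + "(none)"
    if Path(last_report_path).exists():
        return label + last_report_path
    # The path was recorded in state but no longer exists on disk.
    return label + last_report_path + " (missing/stale)"
```

This keeps the stored path visible for debugging while making it obvious that it does not point at a usable report.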

Related context

This appears distinct from the existing first-run dry-run/approval safety fix (#18389) and the per-run reports feature (#17307). It is specifically a CLI lifecycle issue: daemon-thread background work is not reliable after a short-lived CLI command exits.


Labels: P2 (Medium: degraded but workaround exists), comp/cli (CLI entry point, hermes_cli/, setup wizard), type/bug (Something isn't working)
