Bug Description
After commit c5b4c4816 (fix: lazy session creation — defer DB row until first message (#18370)), the gateway agent occasionally returns response=0 chars — the agent completes a full run (many API calls, long elapsed time) but produces no output and sends nothing back to the user.
Symptoms
From gateway agent.log, entries like:
INFO gateway.run: response ready: platform=weixin chat=... time=943.0s api_calls=2 response=0 chars
INFO gateway.run: response ready: platform=weixin chat=... time=788.9s api_calls=10 response=0 chars
INFO gateway.run: response ready: platform=weixin chat=... time=227.6s api_calls=15 response=0 chars
The agent is clearly doing work (high api_calls count, long elapsed time) but returns nothing. This is a silent failure — no error is logged.
Frequency
Observed 4 times in ~24 hours before any upstream update on May 2. The issue predates the May 2 systemd unit update.
Suspected Root Cause
In run_agent.py, the _ensure_db_session() method added by c5b4c4816 can raise an exception that is caught and logged, after which the agent continues running with _session_db_created = False. The next message to the same session will retry, but the current run may proceed with a partially initialized session state, causing the final response to be discarded.
The old code called ensure_session() (idempotent, INSERT OR IGNORE) in _flush_messages_or_raise, which never failed silently. The new code relies on _ensure_db_session() being called at the top of run_conversation(). When session row creation fails there, the conversation loop still runs to completion but may return no response.
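The suspected pattern can be illustrated with a minimal sketch. This is not the actual run_agent.py code: the AgentSketch class, the sessions table, and the echo reply are hypothetical stand-ins, and only the names _ensure_db_session, run_conversation, and _session_db_created come from the report. It shows only that a swallowed exception lets the run continue with the flag still False; how the response is then lost is not established here.

```python
import logging
import sqlite3

logger = logging.getLogger("sketch")


class AgentSketch:
    """Hypothetical stand-in for the agent; not the real run_agent.py code."""

    def __init__(self, db, session_id):
        self.db = db
        self.session_id = session_id
        self._session_db_created = False

    def _ensure_db_session(self):
        # Suspected pattern: the failure is caught and logged, never re-raised.
        if self._session_db_created:
            return
        try:
            self.db.execute(
                "INSERT INTO sessions (id) VALUES (?)", (self.session_id,)
            )
            self._session_db_created = True
        except sqlite3.Error as exc:
            logger.warning("session creation failed: %s", exc)

    def run_conversation(self, message):
        self._ensure_db_session()
        # The loop runs regardless of whether the session row exists.
        return f"echo: {message}"


db = sqlite3.connect(":memory:")  # no sessions table, so the INSERT fails
agent = AgentSketch(db, "s1")
reply = agent.run_conversation("hello")
```

After this runs, agent._session_db_created is still False even though run_conversation() returned normally, which is the state the report suspects the real agent ends up in.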
Environment
- Platform: macOS (WeChat gateway)
- Hermes: latest from main (commit f98b5d0)
- Python: 3.11
- Config: gateway mode with WeChat adapter
Logs
2026-05-01 21:01:05,565 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=943.0s api_calls=2 response=0 chars
2026-05-02 09:13:05,677 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=5.2s api_calls=0 response=0 chars
2026-05-02 11:18:43,925 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=788.9s api_calls=10 response=0 chars
2026-05-02 16:14:12,097 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=227.6s api_calls=15 response=0 chars
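To count such runs in a longer agent.log, a small filter along these lines may help. It is a sketch that assumes the "response ready" line format shown above stays stable; the function name find_silent_failures and the api_calls > 0 condition are choices made here, not part of Hermes. Drop that condition to count all four entries.

```python
import re

# Field names are taken from the log lines above; the exact format is assumed.
LINE_RE = re.compile(
    r"response ready: platform=(?P<platform>\S+) chat=(?P<chat>\S+) "
    r"time=(?P<time>[\d.]+)s api_calls=(?P<api_calls>\d+) "
    r"response=(?P<chars>\d+) chars"
)


def find_silent_failures(lines):
    """Return entries with an empty response despite at least one API call."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        if int(m["chars"]) == 0 and int(m["api_calls"]) > 0:
            hits.append(
                {"time": float(m["time"]), "api_calls": int(m["api_calls"])}
            )
    return hits


sample = [
    "2026-05-01 21:01:05,565 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=943.0s api_calls=2 response=0 chars",
    "2026-05-02 09:13:05,677 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=5.2s api_calls=0 response=0 chars",
    "2026-05-02 11:18:43,925 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=788.9s api_calls=10 response=0 chars",
    "2026-05-02 16:14:12,097 INFO gateway.run: response ready: platform=weixin chat=o9cq807w... time=227.6s api_calls=15 response=0 chars",
]
hits = find_silent_failures(sample)
```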
Proposed Fix
The retry logic in _ensure_db_session() should not silently continue the conversation if session creation fails. Either:
- Make _ensure_db_session() raise instead of silently catching (and let the caller handle it), OR
- Fall back to the old ensure_session() call in _flush_messages_or_raise as a safety net, OR
- Add a success flag check at the end of run_conversation() and return an error response if the session was never created
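A rough sketch of how the first two options could combine, assuming a SQLite-backed sessions table: session creation is idempotent via INSERT OR IGNORE, a failure is re-raised, and the caller turns it into a visible error instead of an empty response. SessionCreationError, the table schema, and the function signatures are hypothetical and do not reflect the real Hermes API.

```python
import sqlite3


class SessionCreationError(RuntimeError):
    """Hypothetical error type for a session row that could not be created."""


def ensure_session(db, session_id):
    # Idempotent: INSERT OR IGNORE is a no-op when the row already exists.
    try:
        db.execute(
            "INSERT OR IGNORE INTO sessions (id) VALUES (?)", (session_id,)
        )
        db.commit()
    except sqlite3.Error as exc:
        # Re-raise so the caller decides what to do, instead of continuing.
        raise SessionCreationError(
            f"could not create session {session_id}"
        ) from exc


def run_conversation(db, session_id, message):
    try:
        ensure_session(db, session_id)
    except SessionCreationError as exc:
        # Return a visible error rather than an empty response.
        return f"Error: {exc}"
    return f"echo: {message}"


db = sqlite3.connect(":memory:")
failed = run_conversation(db, "s1", "hello")  # table missing: error text
db.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")
first = run_conversation(db, "s1", "hello")
second = run_conversation(db, "s1", "again")  # same id, no duplicate row
row_count = db.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
```

With this shape, a failed session write surfaces to the user as an error message, and repeated calls for the same session id leave a single row.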
Commit
- c5b4c4816, "fix: lazy session creation — defer DB row until first message" (#18370): suspected culprit
- f98b5d00a, "fix: gateway systemd unit now retries indefinitely with backoff" (#18639): related (exposes the issue more due to the restart behavior change)