
[Bug]: hermes_tools RPC client mismatches responses under concurrent tool calls from execute_code #17770

@Heltman

Description

Summary

When a script run via execute_code calls sandbox tools (terminal, read_file, …) from multiple threads — e.g. ThreadPoolExecutor — each thread frequently receives another thread's response. The individual responses are well-formed and complete; they just get delivered to the wrong caller. This silently corrupts any concurrent tool-calling pattern inside execute_code.

Environment

  • hermes-agent main at 828d3a3 (also reproduces on older mains with the same _call() code)
  • macOS / Python 3.11 (UDS / local backend). The same bug class exists on the file-based transport used by remote backends.
  • Affected code: tools/code_execution_tool.py (_UDS_TRANSPORT_HEADER._call and _FILE_TRANSPORT_HEADER._call).

Minimal repro

Run this as an execute_code script (or paste into any ThreadPool-driven use of hermes_tools):

from hermes_tools import terminal
import concurrent.futures

def run(i):
    res = terminal(f"echo START-{i}; sleep 0.3; echo END-{i}", timeout=10)
    return i, res.get("output", "").strip()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(run, range(10)))

mismatches = 0
for i, out in results:
    ok = (f"START-{i}" in out) and (f"END-{i}" in out)
    flag = "OK" if ok else "MISMATCH"
    if not ok:
        mismatches += 1
    print(f"i={i} {flag}: {out!r}")
print(f"\nmismatches: {mismatches}/10")

Observed

i=0 MISMATCH: 'START-9\nEND-9'
i=1 MISMATCH: 'START-7\nEND-7'
i=2 MISMATCH: 'START-6\nEND-6'
i=3 MISMATCH: 'START-5\nEND-5'
i=4 MISMATCH: 'START-8\nEND-8'
i=5 MISMATCH: 'START-4\nEND-4'
i=6 MISMATCH: 'START-3\nEND-3'
i=7 MISMATCH: 'START-1\nEND-1'
i=8 MISMATCH: 'START-0\nEND-0'
i=9 MISMATCH: 'START-2\nEND-2'

mismatches: 10/10

Every START-X/END-X pair is intact (streams don't interleave), but pairs are delivered to the wrong caller.

Expected

Each run(i) sees START-i/END-i in its own output. mismatches: 0/10.

Root cause

tools/code_execution_tool.py — the UDS transport (_UDS_TRANSPORT_HEADER) used on the local backend:

_sock = None

def _connect():
    global _sock
    if _sock is None:
        _sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        _sock.connect(os.environ["HERMES_RPC_SOCKET"])
        _sock.settimeout(300)
    return _sock

def _call(tool_name, args):
    conn = _connect()
    request = json.dumps({"tool": tool_name, "args": args}) + "\n"
    conn.sendall(request.encode())
    buf = b""
    while True:
        chunk = conn.recv(65536)
        ...
  • _sock is a shared module-level connection.
  • The newline-framed RPC protocol has no request-id.
  • Server-side _rpc_server_loop accepts a single connection and handles requests serially (one response per request, in arrival order).
  • There is no lock around sendall + recv.

With concurrent callers, multiple threads sendall() their requests, the server responds in FIFO order, and each client thread races to recv() whatever newline-terminated blob arrives next on the shared socket. The GIL makes each individual socket call atomic, but the multi-step sendall-then-recv sequence is not, so responses get matched to whichever caller happens to win the race. There is no interleaved byte corruption (responses arrive whole); it is purely wrong-addressee delivery, which is exactly what the repro above shows.

The file transport (_FILE_TRANSPORT_HEADER._call, used on remote backends) has a sibling bug: _seq += 1 is a non-atomic read-modify-write, so concurrent threads can allocate the same sequence number and clobber each other's request/response files.
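
A minimal sketch of the file-transport half of the fix, assuming the module-level _seq described above (_seq_lock matches the name the regression tests below assert on; _next_seq is a hypothetical helper used only for illustration):

import threading

_seq = 0
_seq_lock = threading.Lock()

def _next_seq():
    # _seq += 1 is a read-modify-write (load, add, store), so two threads can
    # read the same value, both write the same successor, and then clobber
    # each other's request/response files. The lock makes allocation atomic.
    global _seq
    with _seq_lock:
        _seq += 1
        return _seq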

Impact

Anything inside execute_code that does concurrent tool calls silently gets wrong data — ThreadPoolExecutor/asyncio.to_thread/threading.Thread over terminal(), read_file(), search_files(), web_search(), etc. Because responses are individually well-formed, the bug doesn't raise; it just returns the wrong values. Easy to hit in:

  • Parallel crawls/fan-outs across regions or shards (how I found it — parallel argus engine trino clusters -r <region> for 13 regions came back completely cross-matched).
  • Concurrent file reads, parallel searches, N-wide API fan-out patterns — all of which are encouraged by the execute_code tool description ("batch processing / loop over N items / parallel requests").

Fix

Smallest correct fix: wrap the send+recv round-trip (UDS) and the seq allocation (file) in a threading.Lock. No protocol change, no server change.
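
A sketch of the UDS side, assuming the _connect/_call shape quoted above (_call_lock matches the name the regression tests below assert on; the newline-framed response handling, elided in the excerpt, is reconstructed here and may differ from the real code):

import json
import threading

_call_lock = threading.Lock()

def _call(tool_name, args):
    request = json.dumps({"tool": tool_name, "args": args}) + "\n"
    # Hold the lock across the whole round-trip: with no request-id on the
    # wire, the only way to match a response to its request is to keep at
    # most one round-trip in flight on the shared socket. Calling _connect()
    # inside the lock also closes the lazy-init race on _sock.
    with _call_lock:
        conn = _connect()
        conn.sendall(request.encode())
        buf = b""
        while b"\n" not in buf:
            chunk = conn.recv(65536)
            if not chunk:
                raise ConnectionError("RPC socket closed mid-response")
            buf += chunk
    return json.loads(buf.split(b"\n", 1)[0])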

A longer-term fix would add a request-id to the protocol so concurrent round-trips could genuinely be in flight (server and both _calls would need updating), but for now serialization preserves correctness at negligible cost — tool dispatch in the parent already happens on a single connection anyway.
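
For illustration, a hedged sketch of what client-side request-id demultiplexing could look like (every name here is hypothetical, including _pending, _reader_loop, and the id/result wire fields; the server and both transports would need matching changes):

import itertools
import json
import threading
from concurrent.futures import Future

_ids = itertools.count(1)
_pending = {}                  # request id -> Future awaiting the response
_pending_lock = threading.Lock()
_send_lock = threading.Lock()  # sendall() of large payloads can interleave

def _reader_loop(conn):
    # One background thread owns recv(): it reassembles newline-framed JSON
    # responses and routes each to the Future registered under its id.
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            msg = json.loads(line)
            with _pending_lock:
                fut = _pending.pop(msg["id"], None)
            if fut is not None:
                fut.set_result(msg["result"])

def _call(tool_name, args):
    conn = _connect()  # would also start the _reader_loop thread once
    req_id = next(_ids)
    fut = Future()
    with _pending_lock:
        _pending[req_id] = fut
    payload = json.dumps({"id": req_id, "tool": tool_name, "args": args}) + "\n"
    with _send_lock:
        conn.sendall(payload.encode())
    return fut.result(timeout=300)

A single reader thread owning recv() is what makes multiple in-flight round-trips safe; the locks only protect the id bookkeeping and the write side.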

PR (with regression tests, including one that fails without the fix): #NNN (replace with link after opening).

Regression tests

The PR adds three tests in tests/tools/test_code_execution.py:

  • test_uds_transport_serializes_concurrent_calls — asserts _call_lock present and used in generated UDS transport.
  • test_file_transport_serializes_seq_allocation — asserts _seq_lock present and used in generated file transport.
  • test_concurrent_tool_calls_match_responses — runs a sandboxed ThreadPoolExecutor of 10 terminal() calls against a slow mock dispatcher and asserts every caller sees its own tag. Fails 10/10 without the fix; passes with it. A sketch of its shape follows the list.
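
For concreteness, a self-contained sketch of the shape of that third test, using a socketpair and a deliberately slow FIFO echo server in place of the real dispatcher (all names and the harness are illustrative, not the PR's actual code):

import json
import socket
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def fifo_server(conn):
    # Serial one-response-per-request loop, like the real _rpc_server_loop.
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            tag = json.loads(line)["tag"]
            time.sleep(0.01)  # slow dispatch lets client threads pile up
            conn.sendall((json.dumps({"echo": tag}) + "\n").encode())

client, server = socket.socketpair()
threading.Thread(target=fifo_server, args=(server,), daemon=True).start()

_call_lock = threading.Lock()

def call(tag):
    with _call_lock:  # delete this lock and the assertion below fails
        client.sendall((json.dumps({"tag": tag}) + "\n").encode())
        buf = b""
        while b"\n" not in buf:
            buf += client.recv(65536)
    return json.loads(buf.split(b"\n", 1)[0])["echo"]

with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(call, range(10)))

assert results == list(range(10)), f"cross-matched responses: {results}"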

Metadata

Labels

    P1 (High — major feature broken, no workaround)
    tool/code-exec (execute_code sandbox)
    type/bug (Something isn't working)
