Summary
When a script run via execute_code calls sandbox tools (terminal, read_file, …) from multiple threads — e.g. ThreadPoolExecutor — each thread frequently receives another thread's response. The individual responses are well-formed and complete; they just get delivered to the wrong caller. This silently corrupts any concurrent tool-calling pattern inside execute_code.
Environment
- hermes-agent main at 828d3a3 (also reproduces on older mains with the same _call() code)
- macOS / Python 3.11 (UDS / local backend). The same bug class exists on the file-based transport used by remote backends.
- Affected code: tools/code_execution_tool.py — _UDS_TRANSPORT_HEADER._call and _FILE_TRANSPORT_HEADER._call.
Minimal repro
Run this as an execute_code script (or paste into any ThreadPool-driven use of hermes_tools):
from hermes_tools import terminal
import concurrent.futures

def run(i):
    res = terminal(f"echo START-{i}; sleep 0.3; echo END-{i}", timeout=10)
    return i, res.get("output", "").strip()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(run, range(10)))

mismatches = 0
for i, out in results:
    ok = (f"START-{i}" in out) and (f"END-{i}" in out)
    flag = "OK" if ok else "MISMATCH"
    if not ok:
        mismatches += 1
    print(f"i={i} {flag}: {out!r}")
print(f"\nmismatches: {mismatches}/10")
Observed
i=0 MISMATCH: 'START-9\nEND-9'
i=1 MISMATCH: 'START-7\nEND-7'
i=2 MISMATCH: 'START-6\nEND-6'
i=3 MISMATCH: 'START-5\nEND-5'
i=4 MISMATCH: 'START-8\nEND-8'
i=5 MISMATCH: 'START-4\nEND-4'
i=6 MISMATCH: 'START-3\nEND-3'
i=7 MISMATCH: 'START-1\nEND-1'
i=8 MISMATCH: 'START-0\nEND-0'
i=9 MISMATCH: 'START-2\nEND-2'
mismatches: 10/10
Every START-X/END-X pair is intact (streams don't interleave), but pairs are delivered to the wrong caller.
Expected
Each run(i) sees START-i/END-i in its own output. mismatches: 0/10.
Root cause
tools/code_execution_tool.py — the UDS transport (_UDS_TRANSPORT_HEADER) used on the local backend:
_sock = None

def _connect():
    global _sock
    if _sock is None:
        _sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        _sock.connect(os.environ["HERMES_RPC_SOCKET"])
        _sock.settimeout(300)
    return _sock

def _call(tool_name, args):
    conn = _connect()
    request = json.dumps({"tool": tool_name, "args": args}) + "\n"
    conn.sendall(request.encode())
    buf = b""
    while True:
        chunk = conn.recv(65536)
        ...
- _sock is a shared module-level connection.
- The newline-framed RPC protocol has no request-id.
- Server-side, _rpc_server_loop accepts a single connection and handles requests serially (one response per request, in arrival order).
- There is no lock around sendall + recv.
With concurrent callers, multiple threads sendall() their requests, the server responds in FIFO order, and each client thread races to recv() whatever newline-terminated blob arrives next on the shared socket. Each individual socket call delivers these small payloads intact, but the multi-step sendall-then-recv sequence is not atomic, so responses get matched to the wrong caller. There is no interleaved byte corruption (responses arrive whole), just wrong-addressee delivery, which is exactly what the repro above shows.
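The failure mode can be reproduced without hermes-agent at all. The following is a self-contained toy of the same design (all names are illustrative, not from the codebase): a newline-framed RPC with no request-id, a serial FIFO server, and two client threads doing an unlocked sendall-then-recv on one shared socket, with sleeps timed so the cross-delivery happens deterministically.

```python
import socket
import threading
import time

# Shared socket, as in the UDS transport; no lock anywhere.
cli, srv = socket.socketpair()

def server():
    """Answer requests strictly in arrival order, one reply per request."""
    buf = b""
    while True:
        chunk = srv.recv(65536)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            time.sleep(0.2)  # simulated tool latency
            srv.sendall(b"reply-to-" + line + b"\n")

threading.Thread(target=server, daemon=True).start()

results = {}

def call(name, delay_before_send, delay_before_recv):
    # Between our sendall and our recv, another thread's reply can land on
    # the shared socket and we will happily consume it.
    time.sleep(delay_before_send)
    cli.sendall(f"req-{name}\n".encode())
    time.sleep(delay_before_recv)
    results[name] = cli.recv(65536).decode().strip()

# Timed so B recv()s while only A's reply is on the wire, and A recv()s
# after its own reply has already been stolen.
t_a = threading.Thread(target=call, args=("A", 0.0, 0.5))
t_b = threading.Thread(target=call, args=("B", 0.1, 0.0))
t_a.start(); t_b.start(); t_a.join(); t_b.join()

print(results)  # A holds reply-to-req-B and B holds reply-to-req-A
```

Both replies are individually well-formed; only the addressing is wrong, matching the 10/10 cross-match in the repro output.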
The file transport (_FILE_TRANSPORT_HEADER._call, used on remote backends) has a sibling bug: _seq += 1 is a non-atomic read-modify-write, so concurrent threads can allocate the same sequence number and clobber each other's request/response files.
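The seq race is the textbook lost update: _seq += 1 compiles to a read, an add, and a write, and a thread can be preempted between the read and the write, so two threads can observe the same value and allocate the same number. A minimal sketch of lock-protected allocation (the _seq_lock name is assumed to mirror the proposed fix), with a quick uniqueness check:

```python
import threading

_seq = 0
_seq_lock = threading.Lock()

def _next_seq():
    global _seq
    with _seq_lock:  # makes the read-modify-write atomic
        _seq += 1
        return _seq

allocated = []

def worker():
    for _ in range(10_000):
        allocated.append(_next_seq())  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(allocated), len(set(allocated)))  # 80000 80000: no duplicate seq numbers
```

With the lock removed, duplicates appear only intermittently under contention, which is exactly why the clobbered request/response files are hard to reproduce on demand.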
Impact
Anything inside execute_code that does concurrent tool calls silently gets wrong data — ThreadPoolExecutor/asyncio.to_thread/threading.Thread over terminal(), read_file(), search_files(), web_search(), etc. Because responses are individually well-formed, the bug doesn't raise; it just returns the wrong values. Easy to hit in:
- Parallel crawls/fan-outs across regions or shards (how I found it: parallel argus engine trino clusters -r <region> for 13 regions came back completely cross-matched).
- Concurrent file reads, parallel searches, N-wide API fan-out patterns, all of which are encouraged by the execute_code tool description ("batch processing / loop over N items / parallel requests").
Fix
Smallest correct fix: wrap the send+recv round-trip (UDS) and the seq allocation (file) in a threading.Lock. No protocol change, no server change.
A longer-term fix would add a request-id to the protocol so concurrent round-trips could genuinely be in flight (server and both _calls would need updating), but for now serialization preserves correctness at negligible cost — tool dispatch in the parent already happens on a single connection anyway.
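One possible shape for that longer-term variant, entirely hypothetical (the class, the "id" field, and every name below are assumptions; nothing like this exists in the codebase yet): a background reader thread demultiplexes responses by id to per-request queues, so N round-trips can genuinely overlap while only the writes stay serialized.

```python
import itertools
import json
import queue
import threading

class RpcClient:
    """Hypothetical request-id transport over a newline-framed socket."""

    def __init__(self, sock):
        self._sock = sock
        self._ids = itertools.count()
        self._send_lock = threading.Lock()  # writes still one-at-a-time
        self._pending = {}                  # id -> Queue the caller blocks on
        self._pending_lock = threading.Lock()
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        # Single reader owns recv(); responses are routed by id, so arrival
        # order no longer has to match request order.
        for line in self._sock.makefile("rb"):
            msg = json.loads(line)
            with self._pending_lock:
                waiter = self._pending.pop(msg["id"], None)
            if waiter is not None:
                waiter.put(msg)

    def call(self, tool, args, timeout=300):
        req_id = next(self._ids)
        waiter = queue.Queue()
        with self._pending_lock:
            self._pending[req_id] = waiter
        payload = json.dumps({"id": req_id, "tool": tool, "args": args}) + "\n"
        with self._send_lock:
            self._sock.sendall(payload.encode())
        return waiter.get(timeout=timeout)  # raises queue.Empty on timeout
```

The server would need to echo the id back (and could then even reply out of order); both generated _call headers would switch to this client. That is the cost the issue defers in favor of the lock.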
PR (with regression tests, including one that fails without the fix): #NNN (replace with link after opening).
Regression tests
The PR adds three tests in tests/tools/test_code_execution.py:
- test_uds_transport_serializes_concurrent_calls — asserts _call_lock is present and used in the generated UDS transport.
- test_file_transport_serializes_seq_allocation — asserts _seq_lock is present and used in the generated file transport.
- test_concurrent_tool_calls_match_responses — runs a sandboxed ThreadPoolExecutor of 10 terminal() calls against a slow mock dispatcher and asserts every caller sees its own tag. Fails 10/10 without the fix; passes with it.