
[Bug]: hermes_tools RPC client mismatches responses under concurrent tool calls from execute_code #17770

@Heltman

Description

Summary

When a script run via execute_code calls sandbox tools (terminal, read_file, …) from multiple threads — e.g. ThreadPoolExecutor — each thread frequently receives another thread's response. The individual responses are well-formed and complete; they just get delivered to the wrong caller. This silently corrupts any concurrent tool-calling pattern inside execute_code.

Environment

  • hermes-agent main at 828d3a3 (also reproduces on older mains with the same _call() code)
  • macOS / Python 3.11 (UDS / local backend). The same bug class exists on the file-based transport used by remote backends.
  • Affected code: tools/code_execution_tool.py (_UDS_TRANSPORT_HEADER._call and _FILE_TRANSPORT_HEADER._call).

Minimal repro

Run this as an execute_code script (or paste into any ThreadPool-driven use of hermes_tools):

from hermes_tools import terminal
import concurrent.futures

def run(i):
    res = terminal(f"echo START-{i}; sleep 0.3; echo END-{i}", timeout=10)
    return i, res.get("output", "").strip()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(run, range(10)))

mismatches = 0
for i, out in results:
    ok = (f"START-{i}" in out) and (f"END-{i}" in out)
    flag = "OK" if ok else "MISMATCH"
    if not ok:
        mismatches += 1
    print(f"i={i} {flag}: {out!r}")
print(f"\nmismatches: {mismatches}/10")

Observed

i=0 MISMATCH: 'START-9\nEND-9'
i=1 MISMATCH: 'START-7\nEND-7'
i=2 MISMATCH: 'START-6\nEND-6'
i=3 MISMATCH: 'START-5\nEND-5'
i=4 MISMATCH: 'START-8\nEND-8'
i=5 MISMATCH: 'START-4\nEND-4'
i=6 MISMATCH: 'START-3\nEND-3'
i=7 MISMATCH: 'START-1\nEND-1'
i=8 MISMATCH: 'START-0\nEND-0'
i=9 MISMATCH: 'START-2\nEND-2'

mismatches: 10/10

Every START-X/END-X pair is intact (streams don't interleave), but pairs are delivered to the wrong caller.

Expected

Each run(i) sees START-i/END-i in its own output. mismatches: 0/10.

Root cause

tools/code_execution_tool.py — the UDS transport (_UDS_TRANSPORT_HEADER) used on the local backend:

_sock = None

def _connect():
    global _sock
    if _sock is None:
        _sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        _sock.connect(os.environ["HERMES_RPC_SOCKET"])
        _sock.settimeout(300)
    return _sock

def _call(tool_name, args):
    conn = _connect()
    request = json.dumps({"tool": tool_name, "args": args}) + "\n"
    conn.sendall(request.encode())
    buf = b""
    while True:
        chunk = conn.recv(65536)
        ...
  • _sock is a shared module-level connection.
  • The newline-framed RPC protocol has no request-id.
  • Server-side _rpc_server_loop accepts a single connection and handles requests serially (one response per request, in arrival order).
  • There is no lock around sendall + recv.

With concurrent callers, multiple threads sendall() their requests, the server responds in FIFO order, and each client thread races to recv() whatever newline-terminated blob arrives next on the shared socket. The GIL makes each individual socket call atomic, but the multi-step sendall-then-recv sequence is not, so responses get matched to whichever caller happens to win the race. There is no interleaved byte corruption (responses arrive whole); it is purely wrong-addressee delivery, which is exactly what the repro above shows.

The file transport (_FILE_TRANSPORT_HEADER._call, used on remote backends) has a sibling bug: _seq += 1 is a non-atomic read-modify-write, so concurrent threads can allocate the same sequence number and clobber each other's request/response files.
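
A minimal sketch of the file-transport half of the fix, assuming the module-level _seq described above (_seq_lock matches the name the regression tests below assert on; _next_seq is a hypothetical helper used only for illustration):

import threading

_seq = 0
_seq_lock = threading.Lock()

def _next_seq():
    # _seq += 1 is a read-modify-write (load, add, store), so two threads can
    # read the same value, both write the same successor, and then clobber
    # each other's request/response files. The lock makes allocation atomic.
    global _seq
    with _seq_lock:
        _seq += 1
        return _seq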

Impact

Anything inside execute_code that does concurrent tool calls silently gets wrong data — ThreadPoolExecutor/asyncio.to_thread/threading.Thread over terminal(), read_file(), search_files(), web_search(), etc. Because responses are individually well-formed, the bug doesn't raise; it just returns the wrong values. Easy to hit in:

  • Parallel crawls/fan-outs across regions or shards (how I found it — parallel argus engine trino clusters -r <region> for 13 regions came back completely cross-matched).
  • Concurrent file reads, parallel searches, N-wide API fan-out patterns — all of which are encouraged by the execute_code tool description ("batch processing / loop over N items / parallel requests").

Fix

Smallest correct fix: wrap the send+recv round-trip (UDS) and the seq allocation (file) in a threading.Lock. No protocol change, no server change.
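
A sketch of the UDS side, assuming the _connect/_call shape quoted above (_call_lock matches the name the regression tests below assert on; the newline-framed response handling, elided in the excerpt, is reconstructed here and may differ from the real code):

import json
import threading

_call_lock = threading.Lock()

def _call(tool_name, args):
    request = json.dumps({"tool": tool_name, "args": args}) + "\n"
    # Hold the lock across the whole round-trip: with no request-id on the
    # wire, the only way to match a response to its request is to keep at
    # most one round-trip in flight on the shared socket. Calling _connect()
    # inside the lock also closes the lazy-init race on _sock.
    with _call_lock:
        conn = _connect()
        conn.sendall(request.encode())
        buf = b""
        while b"\n" not in buf:
            chunk = conn.recv(65536)
            if not chunk:
                raise ConnectionError("RPC socket closed mid-response")
            buf += chunk
    return json.loads(buf.split(b"\n", 1)[0])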

A longer-term fix would add a request-id to the protocol so concurrent round-trips could genuinely be in flight (server and both _calls would need updating), but for now serialization preserves correctness at negligible cost — tool dispatch in the parent already happens on a single connection anyway.
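
For illustration, a hedged sketch of what client-side request-id demultiplexing could look like (every name here is hypothetical, including _pending, _reader_loop, and the id/result wire fields; the server and both transports would need matching changes):

import itertools
import json
import threading
from concurrent.futures import Future

_ids = itertools.count(1)
_pending = {}                  # request id -> Future awaiting the response
_pending_lock = threading.Lock()
_send_lock = threading.Lock()  # sendall() of large payloads can interleave

def _reader_loop(conn):
    # One background thread owns recv(): it reassembles newline-framed JSON
    # responses and routes each to the Future registered under its id.
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            msg = json.loads(line)
            with _pending_lock:
                fut = _pending.pop(msg["id"], None)
            if fut is not None:
                fut.set_result(msg["result"])

def _call(tool_name, args):
    conn = _connect()  # would also start the _reader_loop thread once
    req_id = next(_ids)
    fut = Future()
    with _pending_lock:
        _pending[req_id] = fut
    payload = json.dumps({"id": req_id, "tool": tool_name, "args": args}) + "\n"
    with _send_lock:
        conn.sendall(payload.encode())
    return fut.result(timeout=300)

A single reader thread owning recv() is what makes multiple in-flight round-trips safe; the locks only protect the id bookkeeping and the write side.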

PR (with regression tests, including one that fails without the fix): #NNN (replace with link after opening).

Regression tests

The PR adds three tests in tests/tools/test_code_execution.py:

  • test_uds_transport_serializes_concurrent_calls — asserts _call_lock present and used in generated UDS transport.
  • test_file_transport_serializes_seq_allocation — asserts _seq_lock present and used in generated file transport.
  • test_concurrent_tool_calls_match_responses — runs a sandboxed ThreadPoolExecutor of 10 terminal() calls against a slow mock dispatcher and asserts every caller sees its own tag. Fails 10/10 without the fix; passes with it. A sketch of its shape follows the list.
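
For concreteness, a self-contained sketch of the shape of that third test, using a socketpair and a deliberately slow FIFO echo server in place of the real dispatcher (all names and the harness are illustrative, not the PR's actual code):

import json
import socket
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def fifo_server(conn):
    # Serial one-response-per-request loop, like the real _rpc_server_loop.
    buf = b""
    while True:
        chunk = conn.recv(65536)
        if not chunk:
            return
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            tag = json.loads(line)["tag"]
            time.sleep(0.01)  # slow dispatch lets client threads pile up
            conn.sendall((json.dumps({"echo": tag}) + "\n").encode())

client, server = socket.socketpair()
threading.Thread(target=fifo_server, args=(server,), daemon=True).start()

_call_lock = threading.Lock()

def call(tag):
    with _call_lock:  # delete this lock and the assertion below fails
        client.sendall((json.dumps({"tag": tag}) + "\n").encode())
        buf = b""
        while b"\n" not in buf:
            buf += client.recv(65536)
    return json.loads(buf.split(b"\n", 1)[0])["echo"]

with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(call, range(10)))

assert results == list(range(10)), f"cross-matched responses: {results}"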

Metadata

Labels

    P1 (High — major feature broken, no workaround)
    tool/code-exec (execute_code sandbox)
    type/bug (Something isn't working)
