feat(gateway/ws): tool-approval back-channel via WsApprovalChannel#6387
feat(gateway/ws): tool-approval back-channel via WsApprovalChannel#6387singlerider wants to merge 3 commits intozeroclaw-labs:masterfrom
Conversation
β¦se WS protocol Step 1 of the gateway-approval-bypass fix. Document the WS frame contract for tool approvals so a maintainer reviewer can sign off on the wire shape before the deeper plumbing lands. This commit only adds documentation: - Module-level rustdoc gains a "Tool approvals (in progress, see zeroclaw-labs#6207)" section spelling out the JSON shape both directions, the timeout + auto-deny semantics, and the parity goal with process_channel_message + ApprovalManager. - A TODO comment in the TurnEvent forwarder names the exact code locations that still need wiring: TurnEvent::ApprovalRequest in zeroclaw-runtime, the approval gate in crates/zeroclaw-runtime/src/agent/loop_.rs:1572-1619, and the ApprovalManager construction site that does not yet exist on the gateway path. Not in this commit (deliberately separate to keep the protocol contract reviewable in isolation): - TurnEvent::ApprovalRequest variant on the runtime side. - Threading Option<&ApprovalManager> through Agent::turn_streamed (or switching the gateway to the same channel-message path the actual channels use, which gives approvals for free). - Web UI: rendering the approval card with Approve / Deny / Always buttons in web/src/. - Auto-deny on client disconnect / timeout. Refs zeroclaw-labs#6207.
CI flagged the standalone comment block between match arms; rustfmt prefers it hung off the closing brace of the previous arm. Apply `cargo fmt` to satisfy the lint. Refs zeroclaw-labs#6207.
The agent's tool loop already iterates `channel_handles.ask_user`
looking for a Channel that can answer ChannelApprovalRequest. The
gateway WS path constructed an Agent but never registered itself as
such a channel, so any supervised tool call routed through /ws/chat
would auto-deny with no operator surface.
Add a TurnEvent::ApprovalRequest variant on the runtime side and a
WsApprovalChannel that, when the agent calls request_approval, mints
a request_id, emits the event over a connection-level mpsc, and
parks on a oneshot waiter. The WS forward task drains the mpsc onto
the wire; inbound `approval_response` frames pop the waiter from the
shared pending map. Timeouts and disconnects auto-deny so the agent
never sees None and never falls back to the implicit no-op path.
Wire format:
Server -> Client: { "type": "approval_request",
"request_id": "<uuid>", "tool": ...,
"arguments_summary": ...,
"timeout_secs": 120 }
Client -> Server: { "type": "approval_response",
"request_id": "<uuid>",
"decision": "approve" | "deny" | "always" }
`arguments_summary` is the redacted summary the runtime synthesises
via `zeroclaw_runtime::approval::summarize_args`; raw args (including
secret-bearing fields) never reach the wire. The ACP channel keeps
its own `request_choice` path; ApprovalRequest events are dropped on
the ACP transport since approvals there flow through
session/request_permission.
Closes zeroclaw-labs#6207
Audacity88
left a comment
There was a problem hiding this comment.
I checked the current PR snapshot at head 58a98d4, linked issue #6207, current labels and status checks, the absence of prior comments/reviews, the five-file PR diff, and the adjacent runtime/channel approval contracts. Line references below include current PR-head source and adjacent source where the behavior depends on existing contracts. The request-id/oneshot shape is a good direction, but I think this needs changes because the new path does not actually round-trip an approval during a WebSocket turn yet, and the new WS contract relies on a redaction guarantee the current helper does not provide.
π’ What looks good β reusing the channel approval contract
Using ChannelApprovalRequest / ChannelApprovalResponse for the gateway path is the right abstraction. Once the transport pump is wired into the active turn, this should let WS approvals share the same approve/deny/always policy and audit path as other channel-backed approvals instead of inventing a separate gateway-only policy.
π΄ Blocking β approval requests are queued in a loop that is not running during the turn
handle_socket awaits process_chat_message(...) when it receives a message frame (crates/zeroclaw-gateway/src/ws.rs:536; the first-message fallback has the same shape around :393). While that await is in progress, the outer socket select! is suspended. That matters because the new approval pump also lives in that outer loop: outbound approval_request frames are drained from approval_event_rx at ws.rs:552-575, and inbound approval_response frames are read at ws.rs:466-490.
So a supervised tool call can insert the pending oneshot and send the TurnEvent::ApprovalRequest in crates/zeroclaw-gateway/src/ws_approval.rs:88-108, but the client never sees that request because the only task that forwards it is waiting for the turn to finish. The same blocked loop also cannot read the user's approval response. The practical result is timeout/deny, and then the outer loop can resume and send a stale approval prompt after the pending entry was already removed.
That means this does not satisfy #6207's core requirement yet: the web client still cannot approve the supervised tool call before timeout. I would move approval-event forwarding and approval_response handling into the same concurrent turn driver, or route the approval request through the per-turn event stream that process_chat_message already drains while also keeping response reads active during the turn. A regression test should prove that /ws/chat sends an approval_request, accepts an approve response, and executes the tool before the timeout.
π΄ Blocking β arguments_summary is documented as redacted but the current helper only stringifies values
The new API/WS comments say arguments_summary is secret-redacted (crates/zeroclaw-api/src/agent.rs:36-39, crates/zeroclaw-gateway/src/ws.rs:42-51), and the PR body says raw args with secret-bearing fields never reach the wire. But zeroclaw_runtime::approval::summarize_args currently iterates every key/value and truncates the stringified value (crates/zeroclaw-runtime/src/approval/mod.rs:224-247). WsApprovalChannel copies that string into the new event (crates/zeroclaw-gateway/src/ws_approval.rs:92-96), and the gateway forwards it directly to the browser (crates/zeroclaw-gateway/src/ws.rs:560-566, :705-718).
That means an approval request with an argument such as api_key, token, or another secret-shaped field can still put the secret value into the approval_request frame, just truncated. The helper gap may be pre-existing for other approval transports, but this PR makes that string a documented WebSocket protocol field and uses the claimed redaction as the security/privacy mitigation for the new gateway boundary. I would either route the summary through a real redaction/scrubbing helper before creating or forwarding the approval event, with coverage for secret-shaped fields, or remove the redaction claim until that guarantee actually exists. Given the gateway/security boundary and the PR's explicit privacy claim, I would block this until the wire frame cannot expose secret-bearing tool arguments.
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
Review β PR #6387 Β· feat(gateway/ws): tool-approval back-channel via WsApprovalChannel
Verdict: --request-changes
Head reviewed: 58a98d49071ab03f891d3cb7c9dc9943c7ce64d3
Take-stock before writing
Active block holders:
- Audacity88: CHANGES_REQUESTED β two π΄ blockers (turn-loop concurrency; redaction claim vs. implementation)
- JordanTheJet: Requested, no review posted
- WareWolf-MoonWall: Requested (this review)
- theonlyhennygod: Requested, no review posted
Risk classification: This PR touches crates/zeroclaw-gateway/src/** and crates/zeroclaw-api/src/agent.rs, both of which are explicitly listed as High risk in AGENTS.md. Per FND-006 Β§4.5, code at a security or trust boundary requires explicit, demonstrable correctness β not just a passing CI run.
I independently verified both of Audacity88's blockers against the live source. Both are real. I have additional analysis below, including one finding about a duplicated serialization path that sharpens the picture on the concurrency issue.
π΄ [blocking] β Approval responses can never arrive during a turn: the receive path is not concurrent with process_chat_message
Audacity88 identified this correctly. I want to be precise about the mechanism because the fix path matters.
How the current code works (before and after this PR):
handle_socket runs an outer tokio::select! loop with two arms: client_msg = receiver.next() and event = broadcast_rx.recv(). When a message frame arrives, the client_msg arm takes control and calls:
process_chat_message(&state, &mut agent, &mut sender, &content, &session_key).await;
While this .await runs, the outer select! is fully suspended. receiver.next() is not being polled. No new WebSocket frames can be read from the client.
What the PR adds:
- An
approval_event = approval_event_rx.recv()arm in the outerselect!β unreachable during a turn for the same reason. - An
approval_responsehandler inside theclient_msgarm β also unreachable during a turn, because it's gated onreceiver.next()firing, which the suspendedselect!cannot do. - Inside
process_chat_message's internalforward_fut, aTurnEvent::ApprovalRequestmatch arm that serializes the frame and sends it to the client.
Item (3) is actually correct: process_chat_message uses tokio::join!(turn_fut, forward_fut) internally, so TurnEvent::ApprovalRequest events emitted by the agent can reach the client during a turn β the outbound path works.
The problem is exclusively the inbound path: the client's approval_response frame arrives at the WebSocket as a new message, and there is nothing polling receiver while process_chat_message is running. The WsApprovalChannel's oneshot receiver parks indefinitely until the timeout, then the agent receives Deny and the turn completes. Only then does the outer select! resume and drain the now-stale approval_response frame β but the pending entry was already removed on timeout.
Concretely: the operator sees a prompt, clicks Approve, and the tool gets denied anyway. This doesn't satisfy #6207.
What a resolution looks like:
The process_chat_message signature already holds &mut sender (for streaming output) and &mut agent. A clean fix would thread a mutable reference to the receiver through process_chat_message as well, and drive it concurrently inside the function β alongside turn_fut and forward_fut β routing any approval_response frames to the shared pending_approvals map while the turn runs. The internal tokio::join! (or a tokio::select! inside forward_fut) could handle the three concurrent streams: turn events β wire, broadcast events β wire, and client messages β pending map.
A regression test should demonstrate the complete round-trip: /ws/chat sends an approval_request, the client sends an approve response, and the tool executes successfully before the 120s timeout.
π΄ [blocking] β The arguments_summary redaction claim in the wire-protocol comments and PR body is false
Audacity88 identified this too. I verified it against the source.
The PR introduces arguments_summary as a documented wire-protocol field with an explicit security property. The module doc at crates/zeroclaw-gateway/src/ws.rs states:
"The runtime must not echo any
#[secret]or#[derived_from_secret]field (auth tokens, API keys, OAuth secrets) into the summary."
The PR body says:
"
arguments_summaryis generated upstream bysummarize_args, which already redacts#[secret]and#[derived_from_secret]fields."
I read summarize_args in crates/zeroclaw-runtime/src/approval/mod.rs. The function iterates every key/value in the arguments object, formats each as key: truncated_value, and joins them. It does not check for any attribute annotations, does not consult a secret-field registry, and does not skip or redact any key. A tool argument named api_key or oauth_token appears in the summary with its value truncated to 80 characters but otherwise intact.
This means the documented security guarantee does not exist. A browser client receiving an approval_request frame for a tool call that includes a credential-bearing argument will see that credential in the summary. Per FND-006 Β§4.5:
"An error in an authorization path cannot [be recovered from gracefully]. Know which kind of function you are writing, and let that determination drive how aggressively you surface failures from it."
The PR makes arguments_summary a gateway wire-protocol boundary β exactly the location where this guarantee must be enforced. There are two valid resolutions:
Option A: Implement actual secret-field scrubbing in summarize_args (or in a new scrub_args_for_summary function called before the event is created). This means checking the argument keys against a registry of known secret-shaped field names and replacing their values with [redacted]. A test with api_key, token, password, and oauth_secret arguments must prove redaction. This is the right long-term fix.
Option B: Remove the redaction claim from the wire-protocol docs and the PR body until Option A lands. Change the security impact statement to accurately describe what summarize_args actually does (truncates to 80 chars). Accept that the current behavior has a known gap and track it explicitly in a follow-up issue.
Given the gateway/security boundary classification and the explicit privacy claim in the PR description, I would require Option A before this merges, or a clear acknowledgement of the gap via Option B with a tracking issue. Leaving a false security claim in the wire-protocol documentation is worse than acknowledging the known limitation.
π’ [praise] β Using ChannelApprovalRequest/ChannelApprovalResponse is the right abstraction
Reusing the existing channel approval contract for the gateway path is the architecturally correct choice. Once the transport concurrency is fixed, WS approvals will share the same approve/deny/always policy and audit path as Telegram, CLI, and ACP approvals β no gateway-specific policy divergence. The request-id/oneshot shape is clean and testable. This is worth preserving as the foundation of the fix.
π’ [praise] β ACP TurnEvent::ApprovalRequest drop is correctly handled
The notification_for_turn_event change in acp_server.rs returns Option<JsonRpcNotification> and the ApprovalRequest arm returns None. The comment is accurate: ACP sessions use session/request_permission for approval, not the turn-event stream. The PR correctly avoids routing ApprovalRequest events over ACP β and the tests are updated to expect("...") on the Some(_) variants. This is the right shape.
π‘ [warning] β process_chat_message has a duplicate ApprovalRequest serialization path that should be removed
The PR adds TurnEvent::ApprovalRequest handling in two places:
- Inside
process_chat_message'sforward_fut(ws.rsaround line 705) β this path is reachable and will forward the frame to the client during a turn. This is the path that should survive. - In the outer
select!'sapproval_event_rxarm (ws.rsaround line 552) β this arm drains a separatempscchannel and is unreachable during a turn (blocked by theprocess_chat_messageawait). It exists becauseWsApprovalChannel::request_approvalsends toapproval_event_tx, which feedsapproval_event_rx. But if (1) is already forwarding the frame, this arm is either redundant or represents a design conflict about which event stream carries the approval events.
If the fix for the concurrency blocker threads receiver through process_chat_message (as described above), the approval_event_rx arm in the outer select! may no longer be needed at all β the turn's event channel already carries TurnEvent::ApprovalRequest. Clarifying which path is canonical before adding more code to either will make the fix cleaner.
π΅ [suggestion] β The 120s hardcoded timeout should be a per-connection or per-session config knob
The PR body notes: "The 120s timeout is hardcoded for now and matches TelegramConfig::approval_timeout_secs's default." This is fine for an initial implementation. When this lands, the follow-up should plumb GatewayConfig::ws_approval_timeout_secs (or reuse an existing approval-timeout field) through handle_socket so operators on low-latency setups or batch automation contexts can tune it. Filing a tracking issue for this before merge would prevent it from becoming invisible drift.
π΅ [suggestion] β register_channel("ws", ...) unconditionally registers the approval channel
WsApprovalChannel is registered before the main loop regardless of whether the agent has any supervised tools. In sessions where no tool supervision is active, this burns a parking_lot::Mutex and two Arc allocations per connection and adds a no-op channel to the ask_user iteration. Consider registering only when the agent's ApprovalManager is configured β or document that the cost is acceptable for the connection lifecycle. Non-blocking.
Summary
Both of Audacity88's π΄ blockers are real and independently verified:
- The approval response round-trip doesn't work because
receiveris not polled during the turn. The outbound frame reaches the client; the inbound response never does. summarize_argsdoesn't redact secret-shaped fields, making the documented wire-protocol security guarantee false.
The abstraction choice (channel approval contract, request-id/oneshot, ACP drop) is sound and worth building on. Fix the transport concurrency and either implement real secret-field redaction or honestly document the current behavior, and this will be in good shape.
|
@JordanTheJet β milestone alignment needed: this PR does not clearly fit within the scope boundary of any open milestone. Please advise on placement or deferral. |
Summary
The agent's tool loop already iterates
channel_handles.ask_userlooking for a Channel that can answerChannelApprovalRequest. The gateway WS path constructed an Agent but never registered itself as such a channel, so any supervised tool call routed through/ws/chatwould auto-deny with no operator surface.Two-part fix:
TurnEvent::ApprovalRequest { request_id, tool_name, arguments_summary, timeout_secs }variant on the runtime side.WsApprovalChannelin the gateway. When the agent callsrequest_approval, the channel mints a request_id, emits the event over a connection-level mpsc, and parks on a oneshot waiter. The WS forward task drains the mpsc onto the wire; inboundapproval_responseframes pop the waiter from the shared pending map. Timeouts and disconnects auto-deny so the agent never seesOk(None)and never falls back to the implicit no-op path.Wire format:
arguments_summaryis the redacted summary the runtime synthesises viazeroclaw_runtime::approval::summarize_args; raw args (including secret-bearing fields) never reach the wire.The ACP channel keeps its own
request_choicepath;ApprovalRequestevents are dropped on the ACP transport since approvals there flow throughsession/request_permission.Closes #6207
Validation Evidence
cargo fmt --all -- --check cargo clippy --workspace --exclude zeroclaw-desktop --all-targets --features ci-all -- -D warnings cargo test --package zeroclaw-channels --features channel-acp-server --lib turn_tool_eventsAll three pass locally.
Security & Privacy Impact
arguments_summaryis generated upstream bysummarize_args, which already redacts#[secret]and#[derived_from_secret]fields. Rawargsare not passed across the WS boundary.Compatibility
approval_requestframes simply do not respond, in which case the channel times out and auto-denies after 120s, matching the existing channel default.TelegramConfig::approval_timeout_secs's default.Rollback
git revert <merge-sha>. The runtime variant and the WS channel revert in lockstep; no on-disk state to clean up.