Skip to content

feat(gateway/ws): tool-approval back-channel via WsApprovalChannel#6387

Open
singlerider wants to merge 3 commits intozeroclaw-labs:masterfrom
singlerider:feat/6207-gateway-approval-manager
Open

feat(gateway/ws): tool-approval back-channel via WsApprovalChannel#6387
singlerider wants to merge 3 commits intozeroclaw-labs:masterfrom
singlerider:feat/6207-gateway-approval-manager

Conversation

@singlerider
Copy link
Copy Markdown
Collaborator

@singlerider singlerider commented May 5, 2026

Summary

The agent's tool loop already iterates channel_handles.ask_user looking for a Channel that can answer ChannelApprovalRequest. The gateway WS path constructed an Agent but never registered itself as such a channel, so any supervised tool call routed through /ws/chat would auto-deny with no operator surface.

Two-part fix:

  • New TurnEvent::ApprovalRequest { request_id, tool_name, arguments_summary, timeout_secs } variant on the runtime side.
  • WsApprovalChannel in the gateway. When the agent calls request_approval, the channel mints a request_id, emits the event over a connection-level mpsc, and parks on a oneshot waiter. The WS forward task drains the mpsc onto the wire; inbound approval_response frames pop the waiter from the shared pending map. Timeouts and disconnects auto-deny so the agent never sees Ok(None) and never falls back to the implicit no-op path.

Wire format:

Server -> Client: { "type": "approval_request",
                    "request_id": "<uuid>", "tool": ...,
                    "arguments_summary": ...,
                    "timeout_secs": 120 }
Client -> Server: { "type": "approval_response",
                    "request_id": "<uuid>",
                    "decision": "approve" | "deny" | "always" }

arguments_summary is the redacted summary the runtime synthesises via zeroclaw_runtime::approval::summarize_args; raw args (including secret-bearing fields) never reach the wire.

The ACP channel keeps its own request_choice path; ApprovalRequest events are dropped on the ACP transport since approvals there flow through session/request_permission.

Closes #6207

Validation Evidence

cargo fmt --all -- --check
cargo clippy --workspace --exclude zeroclaw-desktop --all-targets --features ci-all -- -D warnings
cargo test --package zeroclaw-channels --features channel-acp-server --lib turn_tool_events

All three pass locally.

Security & Privacy Impact

  • New permissions, capabilities, or filesystem access scope? No. The change is the missing-back-channel fix; supervised tool gating works the same way every other channel does.
  • New external network calls? No.
  • Secrets / tokens / credentials handling changed? arguments_summary is generated upstream by summarize_args, which already redacts #[secret] and #[derived_from_secret] fields. Raw args are not passed across the WS boundary.
  • PII or real identities in diff, tests, fixtures, or docs? No.

Compatibility

  • Backward compatible? Yes. Clients that do not understand approval_request frames simply do not respond, in which case the channel times out and auto-denies after 120s, matching the existing channel default.
  • Config / env / CLI surface changed? No. The 120s timeout is hardcoded for now and matches TelegramConfig::approval_timeout_secs's default.

Rollback

git revert <merge-sha>. The runtime variant and the WS channel revert in lockstep; no on-disk state to clean up.

…se WS protocol

Step 1 of the gateway-approval-bypass fix. Document the WS frame
contract for tool approvals so a maintainer reviewer can sign off on
the wire shape before the deeper plumbing lands.

This commit only adds documentation:

- Module-level rustdoc gains a "Tool approvals (in progress, see zeroclaw-labs#6207)"
  section spelling out the JSON shape both directions, the timeout +
  auto-deny semantics, and the parity goal with
  process_channel_message + ApprovalManager.
- A TODO comment in the TurnEvent forwarder names the exact code
  locations that still need wiring: TurnEvent::ApprovalRequest in
  zeroclaw-runtime, the approval gate in
  crates/zeroclaw-runtime/src/agent/loop_.rs:1572-1619, and the
  ApprovalManager construction site that does not yet exist on the
  gateway path.

Not in this commit (deliberately separate to keep the protocol
contract reviewable in isolation):

- TurnEvent::ApprovalRequest variant on the runtime side.
- Threading Option<&ApprovalManager> through Agent::turn_streamed (or
  switching the gateway to the same channel-message path the actual
  channels use, which gives approvals for free).
- Web UI: rendering the approval card with Approve / Deny / Always
  buttons in web/src/.
- Auto-deny on client disconnect / timeout.

Refs zeroclaw-labs#6207.
@singlerider singlerider added bug Something isn't working gateway Auto scope: src/gateway/** changed. labels May 5, 2026
@singlerider singlerider requested a review from Audacity88 May 5, 2026 04:14
@singlerider singlerider added runtime Auto scope: src/runtime/** changed. security Auto scope: src/security/** changed. risk: high Auto risk: security/runtime/gateway/tools/workflows. labels May 5, 2026
@singlerider singlerider added the size: XS Auto size: <=80 non-doc changed lines. label May 5, 2026
@singlerider singlerider self-assigned this May 5, 2026
@github-actions github-actions Bot removed gateway Auto scope: src/gateway/** changed. runtime Auto scope: src/runtime/** changed. security Auto scope: src/security/** changed. labels May 5, 2026
CI flagged the standalone comment block between match arms; rustfmt
prefers it hung off the closing brace of the previous arm. Apply
`cargo fmt` to satisfy the lint.

Refs zeroclaw-labs#6207.
The agent's tool loop already iterates `channel_handles.ask_user`
looking for a Channel that can answer ChannelApprovalRequest. The
gateway WS path constructed an Agent but never registered itself as
such a channel, so any supervised tool call routed through /ws/chat
would auto-deny with no operator surface.

Add a TurnEvent::ApprovalRequest variant on the runtime side and a
WsApprovalChannel that, when the agent calls request_approval, mints
a request_id, emits the event over a connection-level mpsc, and
parks on a oneshot waiter. The WS forward task drains the mpsc onto
the wire; inbound `approval_response` frames pop the waiter from the
shared pending map. Timeouts and disconnects auto-deny so the agent
never sees None and never falls back to the implicit no-op path.

Wire format:

    Server -> Client: { "type": "approval_request",
                        "request_id": "<uuid>", "tool": ...,
                        "arguments_summary": ...,
                        "timeout_secs": 120 }
    Client -> Server: { "type": "approval_response",
                        "request_id": "<uuid>",
                        "decision": "approve" | "deny" | "always" }

`arguments_summary` is the redacted summary the runtime synthesises
via `zeroclaw_runtime::approval::summarize_args`; raw args (including
secret-bearing fields) never reach the wire. The ACP channel keeps
its own `request_choice` path; ApprovalRequest events are dropped on
the ACP transport since approvals there flow through
session/request_permission.

Closes zeroclaw-labs#6207
@singlerider singlerider added this to the v0.7.5 milestone May 5, 2026
@singlerider singlerider changed the title feat(gateway/ws): scaffold #6207 approval-request/response WS protocol feat(gateway/ws): tool-approval back-channel via WsApprovalChannel May 5, 2026
@singlerider singlerider marked this pull request as ready for review May 5, 2026 06:29
@Audacity88 Audacity88 added size: M Auto size: 251-500 non-doc changed lines. and removed size: XS Auto size: <=80 non-doc changed lines. labels May 5, 2026
Copy link
Copy Markdown
Collaborator

@Audacity88 Audacity88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the current PR snapshot at head 58a98d4, linked issue #6207, current labels and status checks, the absence of prior comments/reviews, the five-file PR diff, and the adjacent runtime/channel approval contracts. Line references below include current PR-head source and adjacent source where the behavior depends on existing contracts. The request-id/oneshot shape is a good direction, but I think this needs changes because the new path does not actually round-trip an approval during a WebSocket turn yet, and the new WS contract relies on a redaction guarantee the current helper does not provide.

🟒 What looks good β€” reusing the channel approval contract

Using ChannelApprovalRequest / ChannelApprovalResponse for the gateway path is the right abstraction. Once the transport pump is wired into the active turn, this should let WS approvals share the same approve/deny/always policy and audit path as other channel-backed approvals instead of inventing a separate gateway-only policy.

πŸ”΄ Blocking β€” approval requests are queued in a loop that is not running during the turn

handle_socket awaits process_chat_message(...) when it receives a message frame (crates/zeroclaw-gateway/src/ws.rs:536; the first-message fallback has the same shape around :393). While that await is in progress, the outer socket select! is suspended. That matters because the new approval pump also lives in that outer loop: outbound approval_request frames are drained from approval_event_rx at ws.rs:552-575, and inbound approval_response frames are read at ws.rs:466-490.

So a supervised tool call can insert the pending oneshot and send the TurnEvent::ApprovalRequest in crates/zeroclaw-gateway/src/ws_approval.rs:88-108, but the client never sees that request because the only task that forwards it is waiting for the turn to finish. The same blocked loop also cannot read the user's approval response. The practical result is timeout/deny, and then the outer loop can resume and send a stale approval prompt after the pending entry was already removed.

That means this does not satisfy #6207's core requirement yet: the web client still cannot approve the supervised tool call before timeout. I would move approval-event forwarding and approval_response handling into the same concurrent turn driver, or route the approval request through the per-turn event stream that process_chat_message already drains while also keeping response reads active during the turn. A regression test should prove that /ws/chat sends an approval_request, accepts an approve response, and executes the tool before the timeout.

πŸ”΄ Blocking β€” arguments_summary is documented as redacted but the current helper only stringifies values

The new API/WS comments say arguments_summary is secret-redacted (crates/zeroclaw-api/src/agent.rs:36-39, crates/zeroclaw-gateway/src/ws.rs:42-51), and the PR body says raw args with secret-bearing fields never reach the wire. But zeroclaw_runtime::approval::summarize_args currently iterates every key/value and truncates the stringified value (crates/zeroclaw-runtime/src/approval/mod.rs:224-247). WsApprovalChannel copies that string into the new event (crates/zeroclaw-gateway/src/ws_approval.rs:92-96), and the gateway forwards it directly to the browser (crates/zeroclaw-gateway/src/ws.rs:560-566, :705-718).

That means an approval request with an argument such as api_key, token, or another secret-shaped field can still put the secret value into the approval_request frame, just truncated. The helper gap may be pre-existing for other approval transports, but this PR makes that string a documented WebSocket protocol field and uses the claimed redaction as the security/privacy mitigation for the new gateway boundary. I would either route the summary through a real redaction/scrubbing helper before creating or forwarding the approval event, with coverage for secret-shaped fields, or remove the redaction claim until that guarantee actually exists. Given the gateway/security boundary and the PR's explicit privacy claim, I would block this until the wire frame cannot expose secret-bearing tool arguments.

Copy link
Copy Markdown
Collaborator

@WareWolf-MoonWall WareWolf-MoonWall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review β€” PR #6387 Β· feat(gateway/ws): tool-approval back-channel via WsApprovalChannel

Verdict: --request-changes
Head reviewed: 58a98d49071ab03f891d3cb7c9dc9943c7ce64d3


Take-stock before writing

Active block holders:

  • Audacity88: CHANGES_REQUESTED β€” two πŸ”΄ blockers (turn-loop concurrency; redaction claim vs. implementation)
  • JordanTheJet: Requested, no review posted
  • WareWolf-MoonWall: Requested (this review)
  • theonlyhennygod: Requested, no review posted

Risk classification: This PR touches crates/zeroclaw-gateway/src/** and crates/zeroclaw-api/src/agent.rs, both of which are explicitly listed as High risk in AGENTS.md. Per FND-006 Β§4.5, code at a security or trust boundary requires explicit, demonstrable correctness β€” not just a passing CI run.

I independently verified both of Audacity88's blockers against the live source. Both are real. I have additional analysis below, including one finding about a duplicated serialization path that sharpens the picture on the concurrency issue.


πŸ”΄ [blocking] β€” Approval responses can never arrive during a turn: the receive path is not concurrent with process_chat_message

Audacity88 identified this correctly. I want to be precise about the mechanism because the fix path matters.

How the current code works (before and after this PR):

handle_socket runs an outer tokio::select! loop with two arms: client_msg = receiver.next() and event = broadcast_rx.recv(). When a message frame arrives, the client_msg arm takes control and calls:

process_chat_message(&state, &mut agent, &mut sender, &content, &session_key).await;

While this .await runs, the outer select! is fully suspended. receiver.next() is not being polled. No new WebSocket frames can be read from the client.

What the PR adds:

  1. An approval_event = approval_event_rx.recv() arm in the outer select! β€” unreachable during a turn for the same reason.
  2. An approval_response handler inside the client_msg arm β€” also unreachable during a turn, because it's gated on receiver.next() firing, which the suspended select! cannot do.
  3. Inside process_chat_message's internal forward_fut, a TurnEvent::ApprovalRequest match arm that serializes the frame and sends it to the client.

Item (3) is actually correct: process_chat_message uses tokio::join!(turn_fut, forward_fut) internally, so TurnEvent::ApprovalRequest events emitted by the agent can reach the client during a turn β€” the outbound path works.

The problem is exclusively the inbound path: the client's approval_response frame arrives at the WebSocket as a new message, and there is nothing polling receiver while process_chat_message is running. The WsApprovalChannel's oneshot receiver parks indefinitely until the timeout, then the agent receives Deny and the turn completes. Only then does the outer select! resume and drain the now-stale approval_response frame β€” but the pending entry was already removed on timeout.

Concretely: the operator sees a prompt, clicks Approve, and the tool gets denied anyway. This doesn't satisfy #6207.

What a resolution looks like:

The process_chat_message signature already holds &mut sender (for streaming output) and &mut agent. A clean fix would thread a mutable reference to the receiver through process_chat_message as well, and drive it concurrently inside the function β€” alongside turn_fut and forward_fut β€” routing any approval_response frames to the shared pending_approvals map while the turn runs. The internal tokio::join! (or a tokio::select! inside forward_fut) could handle the three concurrent streams: turn events β†’ wire, broadcast events β†’ wire, and client messages β†’ pending map.

A regression test should demonstrate the complete round-trip: /ws/chat sends an approval_request, the client sends an approve response, and the tool executes successfully before the 120s timeout.


πŸ”΄ [blocking] β€” The arguments_summary redaction claim in the wire-protocol comments and PR body is false

Audacity88 identified this too. I verified it against the source.

The PR introduces arguments_summary as a documented wire-protocol field with an explicit security property. The module doc at crates/zeroclaw-gateway/src/ws.rs states:

"The runtime must not echo any #[secret] or #[derived_from_secret] field (auth tokens, API keys, OAuth secrets) into the summary."

The PR body says:

"arguments_summary is generated upstream by summarize_args, which already redacts #[secret] and #[derived_from_secret] fields."

I read summarize_args in crates/zeroclaw-runtime/src/approval/mod.rs. The function iterates every key/value in the arguments object, formats each as key: truncated_value, and joins them. It does not check for any attribute annotations, does not consult a secret-field registry, and does not skip or redact any key. A tool argument named api_key or oauth_token appears in the summary with its value truncated to 80 characters but otherwise intact.

This means the documented security guarantee does not exist. A browser client receiving an approval_request frame for a tool call that includes a credential-bearing argument will see that credential in the summary. Per FND-006 Β§4.5:

"An error in an authorization path cannot [be recovered from gracefully]. Know which kind of function you are writing, and let that determination drive how aggressively you surface failures from it."

The PR makes arguments_summary a gateway wire-protocol boundary β€” exactly the location where this guarantee must be enforced. There are two valid resolutions:

Option A: Implement actual secret-field scrubbing in summarize_args (or in a new scrub_args_for_summary function called before the event is created). This means checking the argument keys against a registry of known secret-shaped field names and replacing their values with [redacted]. A test with api_key, token, password, and oauth_secret arguments must prove redaction. This is the right long-term fix.

Option B: Remove the redaction claim from the wire-protocol docs and the PR body until Option A lands. Change the security impact statement to accurately describe what summarize_args actually does (truncates to 80 chars). Accept that the current behavior has a known gap and track it explicitly in a follow-up issue.

Given the gateway/security boundary classification and the explicit privacy claim in the PR description, I would require Option A before this merges, or a clear acknowledgement of the gap via Option B with a tracking issue. Leaving a false security claim in the wire-protocol documentation is worse than acknowledging the known limitation.


🟒 [praise] β€” Using ChannelApprovalRequest/ChannelApprovalResponse is the right abstraction

Reusing the existing channel approval contract for the gateway path is the architecturally correct choice. Once the transport concurrency is fixed, WS approvals will share the same approve/deny/always policy and audit path as Telegram, CLI, and ACP approvals β€” no gateway-specific policy divergence. The request-id/oneshot shape is clean and testable. This is worth preserving as the foundation of the fix.


🟒 [praise] β€” ACP TurnEvent::ApprovalRequest drop is correctly handled

The notification_for_turn_event change in acp_server.rs returns Option<JsonRpcNotification> and the ApprovalRequest arm returns None. The comment is accurate: ACP sessions use session/request_permission for approval, not the turn-event stream. The PR correctly avoids routing ApprovalRequest events over ACP β€” and the tests are updated to expect("...") on the Some(_) variants. This is the right shape.


🟑 [warning] β€” process_chat_message has a duplicate ApprovalRequest serialization path that should be removed

The PR adds TurnEvent::ApprovalRequest handling in two places:

  1. Inside process_chat_message's forward_fut (ws.rs around line 705) β€” this path is reachable and will forward the frame to the client during a turn. This is the path that should survive.
  2. In the outer select!'s approval_event_rx arm (ws.rs around line 552) β€” this arm drains a separate mpsc channel and is unreachable during a turn (blocked by the process_chat_message await). It exists because WsApprovalChannel::request_approval sends to approval_event_tx, which feeds approval_event_rx. But if (1) is already forwarding the frame, this arm is either redundant or represents a design conflict about which event stream carries the approval events.

If the fix for the concurrency blocker threads receiver through process_chat_message (as described above), the approval_event_rx arm in the outer select! may no longer be needed at all β€” the turn's event channel already carries TurnEvent::ApprovalRequest. Clarifying which path is canonical before adding more code to either will make the fix cleaner.


πŸ”΅ [suggestion] β€” The 120s hardcoded timeout should be a per-connection or per-session config knob

The PR body notes: "The 120s timeout is hardcoded for now and matches TelegramConfig::approval_timeout_secs's default." This is fine for an initial implementation. When this lands, the follow-up should plumb GatewayConfig::ws_approval_timeout_secs (or reuse an existing approval-timeout field) through handle_socket so operators on low-latency setups or batch automation contexts can tune it. Filing a tracking issue for this before merge would prevent it from becoming invisible drift.


πŸ”΅ [suggestion] β€” register_channel("ws", ...) unconditionally registers the approval channel

WsApprovalChannel is registered before the main loop regardless of whether the agent has any supervised tools. In sessions where no tool supervision is active, this burns a parking_lot::Mutex and two Arc allocations per connection and adds a no-op channel to the ask_user iteration. Consider registering only when the agent's ApprovalManager is configured β€” or document that the cost is acceptable for the connection lifecycle. Non-blocking.


Summary

Both of Audacity88's πŸ”΄ blockers are real and independently verified:

  1. The approval response round-trip doesn't work because receiver is not polled during the turn. The outbound frame reaches the client; the inbound response never does.
  2. summarize_args doesn't redact secret-shaped fields, making the documented wire-protocol security guarantee false.

The abstraction choice (channel approval contract, request-id/oneshot, ACP drop) is sound and worth building on. Fix the transport concurrency and either implement real secret-field redaction or honestly document the current behavior, and this will be in good shape.

@WareWolf-MoonWall
Copy link
Copy Markdown
Collaborator

@JordanTheJet β€” milestone alignment needed: this PR does not clearly fit within the scope boundary of any open milestone. Please advise on placement or deferral.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working risk: high Auto risk: security/runtime/gateway/tools/workflows. size: M Auto size: 251-500 non-doc changed lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Web dashboard / WebSocket gateway path bypasses ApprovalManager β€” supervised tool approvals never surface in daemon web UI

3 participants