Skip to content

fix(gateway): keep websocket steering additive and persist committed streamed output#5161

Closed
casualjim wants to merge 1 commit into
zeroclaw-labs:masterfrom
casualjim:casualjim/steering
Closed

fix(gateway): keep websocket steering additive and persist committed streamed output#5161
casualjim wants to merge 1 commit into
zeroclaw-labs:masterfrom
casualjim:casualjim/steering

Conversation

@casualjim
Copy link
Copy Markdown

Summary

Describe this PR in 2-5 bullets:

  • Base branch target (master for all contributions): master
  • Problem: WebSocket steering needed to remain additive without invalidating already-streamed output, while keeping transcript/session persistence correct under steering, queue pressure, and mid-turn failure.
  • Why it matters: In shared OSS runtime behavior, websocket/direct flows must preserve committed output semantics and transcript integrity, and queue admission errors must be explicit/non-persistent.
  • What changed:
    • Added steering-aware streamed turn path in agent (turn_streamed_with_steering_state) that:
      • drains accepted steering messages at safe boundaries,
      • treats already-streamed assistant text as committed,
      • returns both full response and consumed user messages,
      • preserves additive continuation semantics.
    • Updated gateway websocket loop to:
      • route first and subsequent message frames through a shared validation/enqueue path,
      • keep steering connection-local via inbox queue,
      • serialize only top-level turns via session queue,
      • map queue admission failures to explicit websocket error codes.
    • Centralized transcript persistence for admitted turns:
      • accepted user messages persisted as separate entries,
      • assistant output persisted once on success,
      • committed partial assistant output persisted on failure.
    • Updated response cache keying to use full provider-visible transcript encoding (not last-user-only keying).
    • Added focused tests for steering semantics, queue behavior, partial-output persistence, and cache-key structural separation.
  • What did not change (scope boundary):
    • No chunk_reset behavior reintroduced.
    • No changes to auth model, permissions, or gateway security boundaries.
    • No schema/config migrations required.
    • No steering support added to non-websocket channel types.

Label Snapshot (required)

  • Risk label (risk: low|medium|high): risk: high (touches src/gateway/**)
  • Size label (size: XS|S|M|L|XL, auto-managed/read-only): auto-managed (expected: size: L)
  • Scope labels (core|agent|channel|config|cron|daemon|doctor|gateway|health|heartbeat|integration|memory|observability|onboard|provider|runtime|security|service|skillforge|skills|tool|tunnel|docs|dependencies|ci|tests|scripts|dev, comma-separated): agent,gateway,tests,memory
  • Module labels (<module>: <component>, for example channel: telegram, provider: kimi, tool: shell): gateway: websocket, agent: streaming
  • Contributor tier label (trusted contributor|experienced contributor|principal contributor|distinguished contributor, auto-managed/read-only; author merged PRs >=5/10/20/50): auto-managed
  • If any auto-label is incorrect, note requested correction: N/A

Change Metadata

  • Change type (bug|feature|refactor|docs|security|chore): bug
  • Primary scope (runtime|provider|channel|memory|security|ci|docs|multi): multi (agent + gateway)

Linked Issue

  • Closes #N/A
  • Related #N/A
  • Depends on #N/A (if stacked)
  • Supersedes #N/A (if replacing older PR)

Supersede Attribution (required when Supersedes # is used)

  • Superseded PRs + authors (#<pr> by @<author>, one per line): N/A
  • Integrated scope by source PR (what was materially carried forward): N/A
  • Co-authored-by trailers added for materially incorporated contributors? (Yes/No): N/A
  • If No, explain why (for example: inspiration-only, no direct code/design carry-over): N/A
  • Trailer format check (separate lines, no escaped \n): (Pass/Fail): N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test
  • Evidence provided (test/log/trace/screenshot/perf): local command output; websocket integration tests in src/gateway/ws.rs and agent cache/streaming tests in src/agent/agent.rs pass in cargo test.
  • If any command is intentionally skipped, explain why: ./dev/ci.sh all was not used here due local Docker credential helper setup; equivalent Rust validation commands were run directly.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • New external network calls? (Yes/No): No (existing provider calls only)
  • Secrets/tokens handling changed? (Yes/No): No
  • File system access scope changed? (Yes/No): No
  • If any Yes, describe risk and mitigation: N/A

Privacy and Data Hygiene (required)

  • Data-hygiene status (pass|needs-follow-up): pass
  • Redaction/anonymization notes: tests and fixtures use neutral synthetic values.
  • Neutral wording confirmation (use ZeroClaw/project-native labels if identity-like wording is needed): confirmed.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? (Yes/No): No
  • If Yes, locale navigation parity updated in README*, docs/README*, and docs/SUMMARY.md for supported locales (en, zh-CN, ja, ru, fr, vi)? (Yes/No)
  • If Yes, localized runtime-contract docs updated where equivalents exist (minimum for fr/vi: commands-reference, config-reference, troubleshooting)? (Yes/No/N.A.)
  • If Yes, Vietnamese canonical docs under docs/i18n/vi/** synced and compatibility shims under docs/*.vi.md validated? (Yes/No/N.A.)
  • If any No/N.A., link follow-up issue/PR and explain scope decision: N/A

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios:
    • Steering continuation remains additive across streamed partial output.
    • Accepted steering messages persist as separate transcript entries.
    • Rejected queue-overflow messages are not persisted.
    • Session queue full/timeout produce explicit websocket errors and do not persist rejected turns.
    • Committed partial assistant output persists when later continuation fails.
  • Edge cases checked:
    • First-frame message path equals later-frame acceptance path.
    • No chunk_reset in steering flow.
    • Structurally different transcripts do not collide under updated response-cache transcript encoding.
  • What was not verified:
    • Live provider credentials/end-to-end external live tests.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: websocket gateway chat loop, streamed agent turn handling, response cache key construction, websocket integration tests.
  • Potential unintended effects: broader response-cache key behavior affects all agent turns (intentional but wider than websocket-only path).
  • Guardrails/monitoring for early detection: websocket error codes (STEERING_QUEUE_FULL, SESSION_QUEUE_FULL, SESSION_QUEUE_TIMEOUT) and existing event broadcast logs (agent_start, agent_end, error).

Agent Collaboration Notes (recommended)

  • Agent tools used (if any): local git/status/diff + cargo validation commands.
  • Workflow/plan summary (if any): review-first iteration focused on websocket steering semantics, transcript persistence, and queue behavior; added/updated tests to prove corrected behavior.
  • Verification focus: additive streaming semantics, persistence correctness, queue admission boundaries, cache-key structural correctness.
  • Confirmation: naming + architecture boundaries followed (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

  • Fast rollback command/path: revert this PR commit(s) (git revert <commit>), redeploy.
  • Feature flags or config toggles (if any): none.
  • Observable failure symptoms: websocket turn stalls, incorrect transcript persistence (missing/extra user entries), assistant output mismatch between streamed chunks and persisted transcript.

Risks and Mitigations

List real risks in this PR (or write None).

  • Risk: Response-cache key logic now depends on full transcript encoding and affects all agent paths.
    • Mitigation: deterministic encoding helper + unit and behavior-level tests for structurally distinct transcripts.
  • Risk: Websocket coordination complexity (reader/writer/inbox/session-queue) could regress ordering.
    • Mitigation: integration tests cover happy path, overflow, queue full/timeout, and partial-output-on-error persistence.

@github-actions github-actions Bot added agent Auto scope: src/agent/** changed. gateway Auto scope: src/gateway/** changed. labels Apr 1, 2026
@casualjim casualjim force-pushed the casualjim/steering branch from f2c6245 to 4daa05b Compare April 3, 2026 14:49
Keep websocket steering additive so accepted follow-up messages extend an active turn without discarding committed streamed output. Persist accepted user messages as separate transcript entries, surface queue admission errors explicitly, and key response caching on the full provider-visible transcript to avoid cross-turn collisions.
@casualjim casualjim force-pushed the casualjim/steering branch from 4daa05b to 83e6c11 Compare April 3, 2026 16:23
@singlerider singlerider added the needs-author-action Author action required before merge label Apr 17, 2026
@singlerider singlerider added this to the v0.7.5-web milestone Apr 27, 2026
@singlerider singlerider requested a review from Audacity88 April 29, 2026 08:18
@singlerider singlerider added bug Something isn't working risk: high Auto risk: security/runtime/gateway/tools/workflows. size: XL Auto size: >1000 non-doc changed lines. labels Apr 29, 2026
@singlerider singlerider modified the milestones: v0.7.5-web, v0.7.5 May 1, 2026
@Audacity88
Copy link
Copy Markdown
Collaborator

Quick status note: this PR is currently dirty against master and still targets pre-workspace-split paths (src/agent/agent.rs, src/gateway/ws.rs). I do not think it is reviewable as-is. If the websocket steering fix is still wanted, the next useful step is probably a fresh or rebased PR against the current crate layout.

@Audacity88
Copy link
Copy Markdown
Collaborator

Thanks for the detailed work here. I’m going to close this PR because the branch is dirty against current master and still targets the pre-workspace-split paths src/agent/agent.rs and src/gateway/ws.rs. The active code now lives under crates/zeroclaw-runtime and crates/zeroclaw-gateway, so this needs a fresh implementation path rather than reviewer cleanup.

I filed #6661 to preserve the useful behavior from this branch: websocket steering should stay additive and should not lose already-streamed committed output or corrupt transcript persistence. If you want to revisit it, a fresh focused PR against the current crate layout, with the steering/transcript tests from this branch carried forward, would be the right next step.

@Audacity88 Audacity88 closed this May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent Auto scope: src/agent/** changed. bug Something isn't working gateway Auto scope: src/gateway/** changed. needs-author-action Author action required before merge risk: high Auto risk: security/runtime/gateway/tools/workflows. size: XL Auto size: >1000 non-doc changed lines.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

4 participants