fix(observability): restore broken SSE /logs stream; add build-stamped version and health pulse for remote/Docker deployments #6553

Open
WareWolf-MoonWall wants to merge 1 commit into master from fix/sse-gateway-observability-wiring

Conversation

@WareWolf-MoonWall
Collaborator

Summary

  • Base branch: master
  • What changed and why:
    • The /logs page in the gateway WebUI has been functionally broken since it was built. The SSE connection itself opened successfully (the browser showed the green “Connected” indicator), but no agent events ever appeared. Every agent entry point (process_message, agent::run(), the cron scheduler, and the heartbeat worker) called create_observer() internally and discarded the result, so the BroadcastObserver wired to the SSE bus was never reached. “Waiting for events…” was the permanent state for all users.
    • This is especially severe for remote and Docker deployments where the terminal is inaccessible: the WebUI and configured channels are the only windows into runtime behaviour. A permanently empty /logs page, no version indicator, and no proof-of-life signal between agent turns is a significant trust and UX gap.
    • This PR fixes the broken wiring, extends SSE coverage to cron and heartbeat agent runs, adds a 30-second health pulse so idle systems show proof-of-life, surfaces daemon lifecycle events (reload, restart, shutdown), and adds a build-stamped version display that is useful for both release users and source builders.
  • Scope boundary: Does not change observability config schema, memory/provider/channel/security behaviour, Prometheus or OTel observer implementations, or the CLI interactive path. No config keys added. No new crate dependencies.
  • Blast radius:
    • execute_job_now (public API): gains a third Option<Arc<dyn Observer>> parameter — breaking for external callers, but this crate is publish = false and the only workspace caller (gateway api.rs) is updated in this PR.
    • agent::run() (public API): same — gains an observer parameter; all five workspace call sites updated (None for CLI and integration tests, Some(observer) for cron and heartbeat).
    • SSE stream: clients receive new event types (pulse, gateway_started, daemon_*, component_restart, heartbeat_tick, channel_message, turn_complete, llm_response); unknown types fall through gracefully in the frontend.
    • spawn_component_supervisor: gains an event_tx parameter — private function, zero external blast radius.
    • Health pulse adds one JSON broadcast every 30 seconds (256-slot buffer); negligible overhead.
  • Linked issue(s): None (emerged from operational user feedback on remote/Docker deployments).
  • Labels: risk: high, gateway, observability, agent, daemon, cron, heartbeat
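
A minimal sketch of the wiring change described above, assuming a simplified Observer trait. The trait shape, NoopObserver, ChannelObserver, and the string events are hypothetical stand-ins, not the project's real types. It only illustrates the idea of an entry point accepting Option<Arc<dyn Observer>> and falling back to its own observer when the caller passes None.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the project's Observer trait.
trait Observer: Send + Sync {
    fn record(&self, event: &str);
}

/// Stand-in for whatever create_observer() builds by default: drops events.
struct NoopObserver;

impl Observer for NoopObserver {
    fn record(&self, _event: &str) {}
}

/// Stand-in for an SSE-backed observer: forwards events onto a channel.
struct ChannelObserver {
    tx: Mutex<Sender<String>>,
}

impl Observer for ChannelObserver {
    fn record(&self, event: &str) {
        // A missing subscriber should not fail the agent, so send errors are ignored.
        let _ = self.tx.lock().unwrap().send(event.to_string());
    }
}

fn create_observer() -> Arc<dyn Observer> {
    Arc::new(NoopObserver)
}

/// Entry point that uses the caller's observer when one is supplied.
fn run(message: &str, observer: Option<Arc<dyn Observer>>) -> String {
    let observer = observer.unwrap_or_else(create_observer);
    observer.record("agent_start");
    let reply = format!("echo: {message}");
    observer.record("agent_end");
    reply
}

/// Usage: a cron-style caller passes Some(observer) and reads back what was emitted.
fn demo_events() -> Vec<String> {
    let (tx, rx) = channel();
    let observer: Arc<dyn Observer> = Arc::new(ChannelObserver { tx: Mutex::new(tx) });
    run("hello", Some(observer));
    rx.try_iter().collect()
}
```

In this sketch a CLI-style caller would pass None and see unchanged behaviour, while a gateway-style caller passes Some(observer) so its events reach the shared stream.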

Validation Evidence (required)

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test
  • Commands run and tail output:
    • cargo fmt --all -- --check: ✅ clean exit, no output
    • cargo clippy --all-targets -- -D warnings: ✅ Finished dev profile [unoptimized + debuginfo] target(s) in 1m 42s — zero warnings, zero errors
    • cargo test: ✅ all tests pass including scheduled_no_conversation_leak_5415 integration test which exercises the changed agent::run() signature
    • npx tsc --noEmit (web/): ✅ zero TypeScript errors
    • cargo check -p zeroclaw-gateway after adding emit_git_info() to build.rs: ✅ compiles in 2.85s
  • Beyond CI, what did you manually verify: Signature changes thread through all five agent::run() call sites and the full cron scheduler chain (run() → catch_up_overdue_jobs → process_due_jobs → execute_and_persist_job → execute_job_with_retry → run_agent_job → agent::run()). Confirmed the i18n.ts diff is exactly one line ('dashboard.version': 'Version'), with no collateral formatting churn.
  • If any command was intentionally skipped, why: N/A

Security & Privacy Impact (required)

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No
  • Secrets / tokens / credentials handling changed? No — version string is CARGO_PKG_VERSION + git SHA + dirty flag, all build-time constants
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? No
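
A minimal sketch of how such a version string could be composed, assuming the three pieces arrive as plain values. The function name format_version is hypothetical; the three output shapes follow the formats listed in the commit message ('0.7.5 (771cbbc)', '0.7.5 (771cbbc, dirty)', and plain '0.7.5').

```rust
/// Compose a display version from build-time pieces.
/// `git_sha` is None (or empty) when git was unavailable at build time.
fn format_version(pkg_version: &str, git_sha: Option<&str>, dirty: bool) -> String {
    match git_sha {
        // A usable SHA is present: append it, plus the dirty marker if set.
        Some(sha) if !sha.is_empty() => {
            if dirty {
                format!("{pkg_version} ({sha}, dirty)")
            } else {
                format!("{pkg_version} ({sha})")
            }
        }
        // No SHA: fall back to the bare package version.
        _ => pkg_version.to_string(),
    }
}
```

In a real build the first argument would presumably come from env!("CARGO_PKG_VERSION") and the others from variables that build.rs exports; the exact variable names are not shown in this PR description, so they are left out here.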

Compatibility (required)

  • Backward compatible? No. execute_job_now and agent::run() signatures changed. Both crates are publish = false; all workspace callers are updated in this PR.
  • Config / env / CLI surface changed? No. /api/status gains an additive version field; SSE gains new event types. Both are backwards-compatible at the consumer level. Custom SSE clients that consume /api/events will receive new event types they can safely ignore.
  • Exact upgrade steps for existing users: None required. No config changes needed.
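
For custom SSE clients, a minimal sketch of tolerant event handling, assuming the client has already extracted the event's type string. The SseEvent enum and both function names are hypothetical; the type strings are taken from the event list in the Summary.

```rust
#[derive(Debug, PartialEq)]
enum SseEvent {
    Pulse,
    TurnComplete,
    LlmResponse,
    /// Any type this client does not recognise yet.
    Unknown(String),
}

/// Map a raw type string to a known event, keeping unknown types instead of failing.
fn classify(event_type: &str) -> SseEvent {
    match event_type {
        "pulse" => SseEvent::Pulse,
        "turn_complete" => SseEvent::TurnComplete,
        "llm_response" => SseEvent::LlmResponse,
        other => SseEvent::Unknown(other.to_string()),
    }
}

/// A client can skip rendering unknown events rather than treating them as errors.
fn should_render(event: &SseEvent) -> bool {
    !matches!(event, SseEvent::Unknown(_))
}
```

A client written this way keeps working when the server starts emitting additional types such as component_restart, since those simply classify as Unknown.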

Rollback (required for risk: medium and risk: high)

  • Fast rollback command/path: git revert 866861b5e
  • Feature flags or config toggles: None — the health pulse task is unconditionally active when the gateway runs; the observer wiring is structural.
  • Observable failure symptoms after rollback: /logs returns to “Waiting for events…” permanently; version card shows ; pulse events stop; cron/heartbeat agent activity disappears from the stream.

@WareWolf-MoonWall WareWolf-MoonWall self-assigned this May 9, 2026
@github-actions github-actions Bot added the core Auto scope: root src/*.rs files changed. label May 9, 2026
@WareWolf-MoonWall WareWolf-MoonWall added this to the v0.7.6 milestone May 9, 2026
@WareWolf-MoonWall WareWolf-MoonWall force-pushed the fix/sse-gateway-observability-wiring branch 2 times, most recently from c6b6f65 to 33fa707 Compare May 9, 2026 15:17
…d version and health pulse for remote/Docker deployments

The gateway's /logs page has been functionally broken since it was built.
The SSE connection itself opened successfully (showing the green 'Connected'
indicator), but no agent events ever appeared. Every agent entry point
(process_message, agent::run(), the cron scheduler, and the heartbeat worker)
called create_observer() internally and discarded the result, so the
BroadcastObserver wired to the SSE bus was never reached. 'Waiting for
events...' was the permanent state for all users.

This is especially impactful for remote and Docker deployments where the
terminal is inaccessible: the WebUI and its configured channels are the
only windows into runtime behaviour. A permanently empty /logs page, no
version indicator, and no proof-of-life signal between agent turns creates
a serious trust and UX gap.

What this PR fixes and adds:

FIXES (broken behaviour):
- Wire broadcast observer through process_message() — gateway webhook/WS
  agent activity now reaches /logs
- Introduce SseBroadcastObserver in zeroclaw-runtime so cron scheduler and
  heartbeat worker can emit to SSE without importing gateway types
- Wire broadcast observer through agent::run() — cron-triggered and
  heartbeat-triggered agent runs now reach /logs
- Extend BroadcastObserver to forward HeartbeatTick, ChannelMessage,
  TurnComplete, and LlmResponse (previously silently dropped by _ => return)

ADDS (observability for remote/Docker users):
- 30-second health pulse task in run_gateway broadcasts {type:"pulse",
  uptime_seconds, components} so the dashboard shows the daemon is alive
  even when the agent is idle — replaces silence with a proof-of-life signal
- gateway_started, daemon_reload, daemon_shutdown, and component_restart
  events so daemon lifecycle is visible without terminal access
- version field in /api/status composed from CARGO_PKG_VERSION + git SHA
  + dirty flag (captured at build time via build.rs) — shows as
  '0.7.5 (771cbbc)' for source builds, '0.7.5 (771cbbc, dirty)' for
  dirty builds, and plain '0.7.5' when git is unavailable (tarball/CI)
- Version status card in Dashboard Overview tab (5th in the grid)
- Logs.tsx banner updated to accurately describe what now flows over SSE
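
A minimal sketch of the pulse payload from the first item above, built by hand so it has no dependencies. The field names type, uptime_seconds, and components come from the description; the shape of components (a name-to-status object) and the function name pulse_payload are assumptions, and the sketch assumes names and statuses need no JSON escaping.

```rust
/// Build the JSON text for one pulse event.
/// `components` is assumed to be (name, status) pairs with no characters
/// that would need JSON escaping.
fn pulse_payload(uptime_seconds: u64, components: &[(&str, &str)]) -> String {
    let entries: Vec<String> = components
        .iter()
        .map(|(name, status)| format!("\"{name}\":\"{status}\""))
        .collect();
    format!(
        "{{\"type\":\"pulse\",\"uptime_seconds\":{uptime_seconds},\"components\":{{{}}}}}",
        entries.join(",")
    )
}
```

A real gateway would more likely serialise a typed struct and send it from a task on a 30-second timer; the sketch only shows what one broadcast message might look like.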

API surface changes (all workspace-internal, publish = false):
- execute_job_now gains Option<Arc<dyn Observer>> parameter; gateway caller updated
- agent::run() gains observer: Option<Arc<dyn Observer>> parameter; all five
  call sites updated (None for CLI and integration tests)
@WareWolf-MoonWall WareWolf-MoonWall force-pushed the fix/sse-gateway-observability-wiring branch from 33fa707 to d9801ae Compare May 9, 2026 15:33
@WareWolf-MoonWall
Collaborator Author

CI note: The matrix-sdk query depth overflow in the Lint job is pre-existing on master and unrelated to this PR. Confirmed by running the full CI clippy command against the master tip before our commit — same failure is present there. Everything else in the lint output is clean.

@WareWolf-MoonWall WareWolf-MoonWall marked this pull request as ready for review May 9, 2026 19:39
@Audacity88 Audacity88 added bug Something isn't working cron Auto scope: src/cron/** changed. daemon Auto scope: src/daemon/** changed. gateway Auto scope: src/gateway/** changed. heartbeat Auto scope: src/heartbeat/** changed. observability Auto scope: src/observability/** changed. risk: high Auto risk: security/runtime/gateway/tools/workflows. runtime Auto scope: src/runtime/** changed. size: L Auto size: 501-1000 non-doc changed lines. labels May 10, 2026
@WareWolf-MoonWall WareWolf-MoonWall requested a review from tidux May 12, 2026 02:56