feat: multi-modal agent — voice, file sending, vision by brandontan · Pull Request #644 · qwibitai/nanoclaw

brandontan · 2026-03-02T09:27:33Z

Summary

Voice transcription: Telegram voice/audio messages and Discord audio attachments are transcribed via OpenRouter (Gemini 2.0 Flash) before reaching the agent. Graceful fallback to placeholder on API failure.
File/image sending: send_message tool accepts optional filePath (relative to /workspace/group/). IPC resolves paths with traversal protection. Discord uses AttachmentBuilder, Telegram uses sendPhoto/sendDocument.
Vision: Images are downloaded to groups/{folder}/.attachments/, their metadata is stored in the DB (new attachments column), and they are passed to Claude as base64 content blocks. Maximum size is 5MB; files older than 24h are cleaned up automatically.
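
The path traversal protection mentioned for file sending could be sketched roughly as follows. This is a minimal illustration, not the code in src/ipc.ts: the function name `resolveGroupPath` is hypothetical, and only the `/workspace/group/` base directory comes from the summary. It does not resolve symlinks, which a real implementation may also need to consider.

```typescript
import * as path from "node:path";

// Base directory taken from the PR summary; everything else here is illustrative.
const GROUP_ROOT = "/workspace/group";

/**
 * Resolve a caller-supplied relative path against `root` and return null
 * if the result is not strictly inside that directory.
 */
function resolveGroupPath(filePath: string, root: string = GROUP_ROOT): string | null {
  // The tool contract says "relative", so absolute paths are rejected outright.
  if (path.posix.isAbsolute(filePath)) return null;

  const resolved = path.posix.resolve(root, filePath);
  const relative = path.posix.relative(root, resolved);

  // "" means the root itself; ".." or "../x" means the path climbed out of root.
  if (relative === "" || relative === ".." || relative.startsWith("../")) return null;
  return resolved;
}
```

With this sketch, `resolveGroupPath("out/chart.png")` yields a path under the group directory, while `resolveGroupPath("../../etc/passwd")` yields `null`.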

Files changed

src/transcription.ts — new shared transcription utility
src/channels/telegram.ts — voice/audio/photo handling + file sending
src/channels/discord.ts — audio/image handling + file sending
src/types.ts — FileAttachment, MessageAttachment, updated Channel interface
src/router.ts — attachment metadata in formatted messages, file param on routeOutbound
src/ipc.ts — filePath handling with path traversal protection
src/db.ts — attachments column migration + serialization
src/index.ts — attachment collection, file param passthrough
src/container-runner.ts — ContainerInputAttachment type
container/agent-runner/src/index.ts — multi-part content blocks, image cleanup
container/agent-runner/src/tools/messaging.ts — filePath param on send_message
.env.example — multi-modal documentation

Test plan

Send a voice note in Telegram → agent receives transcript, responds to content
Send an audio file in Discord → agent receives transcript
Ask agent "Create an image and send it to me" → agent generates + sends file
Send a photo in Telegram → agent describes the image
Send an image in Discord → agent describes the image
Verify graceful degradation: disable API key → voice messages fall back to placeholder

🤖 Generated with Claude Code

Core features built on NanoClaw:
- Delegation system: agents spawn sub-agents via IPC (max 3 concurrent workers)
- BM25 memory search: pure JS, zero deps, with EMBEDDING_URL hook for semantic search
- x402 payments: host-side handler, private keys never enter containers
- Credential scrubbing: API keys/tokens auto-redacted from logs and outbound messages
- DM allowlist: restrict who can interact with the agent
- Per-task model override: cheap models for grunt work, smart models for conversations
- Cron auto-pause: auto-disables after 5 consecutive failures
- Discord channel support
- Agent OS template: universal agent config inherited by all agents
- Expanded container: python3, ffmpeg, imagemagick, 15+ npm packages pre-installed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ed observations Implements issue #1 (v0.2.0 milestone). After substantial Discord conversations (5+ user messages), the observer compresses messages into dated, prioritized observations (🔴 Critical / 🟡 Useful / 🟢 Noise) via Sonnet 4.6. Observations are appended to daily/observer/{date}.md and found by BM25 recall. Security: credential scrubbing, hard delimiters, injection validation, output scrubbing. Operational: 1-call step budget, ~$0.03 cost ceiling, 30s timeout, circuit breaker (3 failures + 15min auto-reset), pino trace logging, kill switch (OBSERVER_ENABLED=false). Fire-and-forget — never blocks conversation delivery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
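
The circuit breaker described above (3 failures, 15-minute auto-reset) can be illustrated with a small class. This is a sketch under assumptions: the class and method names are hypothetical, and it treats "3 failures" as three consecutive failures, which the commit message does not spell out.

```typescript
/** Minimal circuit breaker: opens after N consecutive failures, closes again after a cool-off. */
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly resetMs: number;

  constructor(maxFailures = 3, resetMs = 15 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
  }

  /** True while calls should be skipped. `now` is injectable so tests need no real clock. */
  isOpen(now: number = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.resetMs) {
      // Cool-off elapsed: close the breaker and start counting afresh.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = now;
  }
}
```

A caller would check `isOpen()` before the LLM call and skip the observer entirely while the breaker is open, which keeps the fire-and-forget path cheap during an outage.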

feat: observer agent — auto-compress conversations into observations

Add central schema registry (schemas.ts) with Zod schemas for all 7 agent types, plus a reusable LLM output validation utility (validate-llm.ts) that parses JSON, validates against Zod, and retries once with error feedback. Retrofit observer to request structured JSON from LLM instead of freeform markdown, validate with Zod before writing to disk. Fix credential scrubbing to run after parse (scrubbing raw JSON breaks structure due to greedy regex). 46 tests pass (16 schema + 10 validation + 15 observer + 5 eval assertions), 4 eval scaffolds skipped. Typecheck and build clean. Closes #20 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
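
The parse, validate, retry-once flow described above might look something like the sketch below. To stay dependency-free it swaps the Zod schema for a hand-written `Validator` function, so it shows the control flow only, not the contents of validate-llm.ts; all names here are hypothetical.

```typescript
type Result<T> = { ok: true; value: T } | { ok: false; error: string };
type Validator<T> = (value: unknown) => Result<T>;
type Llm = (prompt: string) => Promise<string>;

/** Parse raw text as JSON, then hand the parsed value to the validator. */
function parseAndValidate<T>(raw: string, validate: Validator<T>): Result<T> {
  try {
    return validate(JSON.parse(raw));
  } catch (err) {
    return { ok: false, error: `invalid JSON: ${(err as Error).message}` };
  }
}

/** Call the model, validate its reply, and retry exactly once with the error as feedback. */
async function validateLlmOutput<T>(
  llm: Llm,
  prompt: string,
  validate: Validator<T>,
): Promise<Result<T>> {
  let currentPrompt = prompt;
  let last: Result<T> = { ok: false, error: "no attempt made" };

  for (let attempt = 0; attempt < 2; attempt++) {
    last = parseAndValidate(await llm(currentPrompt), validate);
    if (last.ok) return last;
    // Feed the validation error back so the second attempt can correct it.
    currentPrompt = `${prompt}\n\nYour previous reply was rejected: ${last.error}\nReply with valid JSON only.`;
  }
  return last;
}
```

Consistent with the scrubbing fix noted in the commit, any credential scrubbing would run on the validated value rather than on the raw JSON string.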

feat: Zod-validated LLM output schemas (#20)

Add JSONL conversation logger that scores each interaction with heuristic quality signals (positive/negative/neutral) based on user language patterns. Enables offline analysis of which topics the bot handles well vs poorly.
- Heuristic signal extraction (no LLM call, zero cost)
- JSONL append to {group}/store/conversations.jsonl
- Credential scrubbing on all logged content
- 1MB file size cap, kill switch, message truncation
- Fire-and-forget hook in processGroupMessages
- ConversationLogEntrySchema added to central schema registry

18 tests pass. Typecheck and build clean. Closes #6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: conversation quality tracker with JSONL logging (#6)

When a user corrects the bot ("No, it's X not Y", "Actually...", "I meant..."), detect the correction via regex, call LLM to extract structured learning (wrong → right, knowledge file, context), and append to {group}/learnings/LEARNINGS.md. Regex gate before LLM call ensures zero cost for non-correction messages. Same operational safeguards as observer: circuit breaker, cooldown, file size cap, credential scrubbing, kill switch. 21 tests pass. Typecheck and build clean. Closes #4 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
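
The "regex gate before LLM call" idea can be sketched as below. The patterns are hypothetical, modeled only on the three example phrasings in the commit message; the project's actual patterns are not shown in this PR.

```typescript
// Hypothetical patterns based on the quoted examples ("No, it's X not Y", "Actually...", "I meant...").
const CORRECTION_PATTERNS: RegExp[] = [
  /^\s*no,?\s+(it'?s|that'?s)\b/i,
  /^\s*actually\b/i,
  /\bI meant\b/i,
];

/** Cheap local check: does the message look like the user correcting the bot? */
function looksLikeCorrection(message: string): boolean {
  return CORRECTION_PATTERNS.some((re) => re.test(message));
}

/** Only invoke the (costly) extractor when the regex gate passes. */
async function maybeExtractLearning<T>(
  message: string,
  extract: (message: string) => Promise<T>,
): Promise<T | null> {
  if (!looksLikeCorrection(message)) return null; // no LLM call for ordinary messages
  return extract(message);
}
```

Because the gate is pure string matching, non-correction messages never reach the extractor, which is what keeps their cost at zero.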

feat: add auto-learner — detect corrections and log learnings

Combines Zod schema validation with content-level checks (length bounds, input grounding) to catch hallucinations and bad data before they're written to disk. Retry-once logic with error feedback on failure. 32 tests covering helpers, validators, schema/content validation flows, retry behavior, and StepValidation schema conformance. Closes #22 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add per-step evaluation — validate agent outputs before use

Prunes observer entries by priority + age: noise at 30 days, useful at 90 days, critical kept forever. Parses observer markdown blocks, applies configurable retention policy, rewrites or deletes files. No LLM calls. 30 tests covering parser, age computation, filtering, reassembly, integration, and ReflectorOutput schema conformance. Closes #2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
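
The retention policy above (noise at 30 days, useful at 90 days, critical kept forever) reduces to a small pure filter. The sketch below assumes observations have already been parsed out of the markdown files; the `Observation` shape and function names are hypothetical.

```typescript
type Priority = "critical" | "useful" | "noise";

interface Observation {
  date: string; // ISO date, e.g. "2026-03-02"
  priority: Priority;
  text: string;
}

// Maximum age in days per priority; null means "keep forever".
const RETENTION_DAYS: Record<Priority, number | null> = {
  critical: null,
  useful: 90,
  noise: 30,
};

const DAY_MS = 24 * 60 * 60 * 1000;

function ageInDays(isoDate: string, now: Date): number {
  return Math.floor((now.getTime() - new Date(isoDate).getTime()) / DAY_MS);
}

/** Keep an observation only while it is younger than its priority's limit. */
function prune(
  observations: Observation[],
  now: Date,
  policy: Record<Priority, number | null> = RETENTION_DAYS,
): Observation[] {
  return observations.filter((o) => {
    const limit = policy[o.priority];
    return limit === null || ageInDays(o.date, now) < limit;
  });
}
```

Passing `now` and `policy` as arguments keeps the function deterministic, in line with the "no LLM calls" design of the reflector.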

feat: add reflector — deterministic memory garbage collection

Splits memory into domain files (operational, people, incidents, decisions) with regex-based categorization. Append-only writes with credential scrubbing, 200KB cap per domain, and migration helper for single-file to structured split. No LLM calls. 23 tests covering categorization, CRUD, migration, schema conformance, and credential scrubbing. Closes #3 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add structured memory — categorized knowledge storage

…ions (#5) (#36) Detects frustration/failure signals (explicit frustration, abandonment, repeated corrections) via regex gate requiring >= 2 signals before triggering LLM analysis. Extracts structured HindsightReport (failureType, whatWentWrong, whatShouldHaveBeen, actionableLearning, severity) and appends to LEARNINGS.md. Operational safeguards: kill switch, circuit breaker (3 failures / 15min auto-reset), 10min per-group cooldown, 200KB file cap, credential scrubbing, 30s LLM timeout, message truncation, never-throws. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…#37) Replaces 4 hardcoded fire-and-forget blocks in index.ts with a single rule engine entry point. Rules are evaluated in order against conversation context (message count, correction patterns, frustration signals). Config-driven via optional router-rules.json per group (Zod-validated, falls back to defaults). Supports composable conditions (always, minMessages, correctionDetected, frustrationDetected, all, any). Every routing decision is trace-logged with input → rule matched → action taken. No LLM calls — all decisions are code-based. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
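
The composable conditions listed above map naturally onto a discriminated union with a recursive evaluator. The sketch below is an assumption about shape, not the project's schema: the `Rule` fields and action strings are hypothetical, and it returns every matching action in order since the commit does not say whether evaluation stops at the first match.

```typescript
interface Context {
  messageCount: number;
  correctionDetected: boolean;
  frustrationDetected: boolean;
}

type Condition =
  | { type: "always" }
  | { type: "minMessages"; count: number }
  | { type: "correctionDetected" }
  | { type: "frustrationDetected" }
  | { type: "all"; conditions: Condition[] }
  | { type: "any"; conditions: Condition[] };

interface Rule {
  name: string;
  when: Condition;
  action: string;
}

/** Recursively evaluate one condition against the conversation context. */
function matches(c: Condition, ctx: Context): boolean {
  switch (c.type) {
    case "always":
      return true;
    case "minMessages":
      return ctx.messageCount >= c.count;
    case "correctionDetected":
      return ctx.correctionDetected;
    case "frustrationDetected":
      return ctx.frustrationDetected;
    case "all":
      return c.conditions.every((x) => matches(x, ctx));
    case "any":
      return c.conditions.some((x) => matches(x, ctx));
  }
}

/** Evaluate rules in order and collect the actions of those that match. */
function evaluate(rules: Rule[], ctx: Context): string[] {
  return rules.filter((r) => matches(r.when, ctx)).map((r) => r.action);
}
```

Since conditions are plain data, the same structure could be loaded from a per-group JSON file and validated before use.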

…re (#24) (#38)
Stage 1: recall tool returns compact summaries (file, category, score, first line)
Stage 2: recall_detail tool fetches full file content on demand
- Add src/progressive-recall.ts with pure functions for summary extraction
- Add src/progressive-recall.test.ts (35 tests)
- Add mode param (layered/full) to container recall tool, default layered
- Add recall_detail tool with path traversal protection
- Remove duplicate grep-based recall tool (bug fix)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

… MCP tool calls (#25) (#39)
Wraps every MCP tool handler with timing, credential scrubbing, and async JSONL logging. Daily-rotated files at /workspace/ipc/tool-calls-YYYY-MM-DD.jsonl.
- Add src/tool-observability.ts with pure functions for log entries and scrubbing
- Add src/tool-observability.test.ts (19 tests)
- Monkey-patch server.tool in container to inject observability wrapper
- Fire-and-forget async append: zero impact on tool execution latency

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

) (#40)
Keyword classifier determines task type (research, grunt, conversation, analysis, content, code, quick-check) then routes to the best model. Explicit model override always wins. Configurable per-group via model-routing.json.
- Add src/model-router.ts with classifier, selector, Zod config schema
- Add src/model-router.test.ts (40 tests)
- Wire into task-scheduler.ts and delegation-handler.ts

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
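
A keyword classifier with an explicit-override rule, as described above, could be sketched like this. The task types come from the commit message; the keyword lists, the default of `conversation`, and the model names are placeholders rather than the contents of src/model-router.ts or model-routing.json.

```typescript
type TaskType =
  | "research" | "grunt" | "conversation" | "analysis" | "content" | "code" | "quick-check";

// Placeholder keyword lists; the first matching entry wins.
const KEYWORDS: Array<[TaskType, RegExp]> = [
  ["code", /\b(bug|refactor|function|compile)\b/i],
  ["research", /\b(research|investigate|compare)\b/i],
  ["grunt", /\b(rename|reformat|convert|bulk)\b/i],
];

// Placeholder model identifiers, not real model names.
const MODEL_FOR: Record<TaskType, string> = {
  research: "large-model",
  analysis: "large-model",
  code: "large-model",
  conversation: "large-model",
  content: "medium-model",
  grunt: "small-model",
  "quick-check": "small-model",
};

/** Classify a prompt by keyword, falling back to "conversation". */
function classify(prompt: string): TaskType {
  for (const [task, pattern] of KEYWORDS) {
    if (pattern.test(prompt)) return task;
  }
  return "conversation";
}

/** An explicit override always wins; otherwise route by task type. */
function selectModel(prompt: string, override?: string): string {
  return override ?? MODEL_FOR[classify(prompt)];
}
```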

…ern guardrails (#9) (#41) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

#10) (#42) Blocks dangerous patterns (rm -rf, DROP TABLE, etc.) in both MCP tools and Bash commands. Per-agent config via tool-guard.json with 3-tier fallback. Whitespace-normalized matching resists evasion. Config injection hardened. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
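
Whitespace-normalized matching, mentioned above as the evasion defense, amounts to collapsing runs of whitespace before testing the blocked patterns. A minimal sketch follows; the two patterns are the examples named in the commit message, and the function names are hypothetical.

```typescript
// Patterns are written against the normalized form (single spaces only).
const BLOCKED_PATTERNS: RegExp[] = [/\brm -rf\b/i, /\bdrop table\b/i];

/** Collapse tabs, newlines, and repeated spaces into single spaces. */
function normalizeWhitespace(command: string): string {
  return command.replace(/\s+/g, " ").trim();
}

/** True if the normalized command contains any blocked pattern. */
function isBlocked(command: string): boolean {
  const normalized = normalizeWhitespace(command);
  return BLOCKED_PATTERNS.some((re) => re.test(normalized));
}
```

Without the normalization step, `rm   -rf /` (three spaces) would slip past a pattern written with a single space. This sketch covers spacing tricks only; other rewrites such as `rm -r -f` would need their own patterns.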

…rk (#28) Rate limits: send_sms 10/hr, make_call 5/hr. Daily spend cap $10 on x402_fetch. In-memory per-session state. Completes 5-axis audit (UX, guardrails, concurrency, observability, autonomy) across all 16 MCP tools. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
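
The per-hour limits above (send_sms 10/hr, make_call 5/hr) with in-memory state could be implemented in several ways. The sketch below assumes a sliding one-hour window, which the commit message does not specify; the class name is hypothetical.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Limits quoted in the commit message; tools not listed here are treated as unlimited.
const HOURLY_LIMITS: Record<string, number> = { send_sms: 10, make_call: 5 };

/** In-memory sliding-window limiter; state is lost when the session ends. */
class RateLimiter {
  private calls = new Map<string, number[]>();

  /** Returns true and records the call if the tool is under its hourly limit. */
  tryAcquire(tool: string, now: number = Date.now()): boolean {
    const limit = HOURLY_LIMITS[tool];
    if (limit === undefined) return true;

    // Keep only the timestamps that still fall inside the last hour.
    const recent = (this.calls.get(tool) ?? []).filter((t) => now - t < HOUR_MS);
    const allowed = recent.length < limit;
    if (allowed) recent.push(now);
    this.calls.set(tool, recent);
    return allowed;
  }
}
```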

feat: tool guardrails audit — rate limits, spend caps (#28)

Progressive disclosure MCP tool: overview when asked "what can you do?", detailed section on request. Reads structured capabilities.json from workspace. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: self-knowledge — agent explains its own capabilities (#27)

Implements Agent Client Protocol so Sovereign agents can be driven from Zed, Cursor, and other ACP-compatible clients. Bridges ACP sessions to the container-runner pipeline. Off by default (ACP_ENABLED=true to enable). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Immutable release snapshots in releases/<sha>/, atomic symlink switch via rename, instant rollback to previous release, auto-prune keeping last 5 releases. Pure functions with 20 tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
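
The "atomic symlink switch via rename" technique works by creating a new symlink under a temporary name and renaming it over the live one. The sketch below assumes a POSIX filesystem; the link name `current` and the temp-name scheme are hypothetical, and only the `releases/<sha>/` layout comes from the commit message.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** Point `<root>/current` at `<root>/releases/<sha>` in a single atomic step. */
function activateRelease(root: string, sha: string): void {
  const link = path.join(root, "current");
  const tmp = `${link}.tmp-${process.pid}`;

  fs.rmSync(tmp, { force: true });                  // clear a stale temp link, if any
  fs.symlinkSync(path.join("releases", sha), tmp);  // relative target, resolved from root
  fs.renameSync(tmp, link);                         // rename replaces the old link in one step
}

// Usage against a throwaway directory.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "releases-demo-"));
for (const sha of ["aaa111", "bbb222"]) {
  fs.mkdirSync(path.join(demoRoot, "releases", sha), { recursive: true });
}
activateRelease(demoRoot, "aaa111");
activateRelease(demoRoot, "bbb222"); // switch forward
activateRelease(demoRoot, "aaa111"); // rollback is the same operation
```

Readers of `current` see either the old target or the new one, never a missing link, which is what makes rollback effectively instant.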

feat: atomic rollback deploys — symlink-based release management (#11)

The Claude Agent SDK writes debug logs to /home/node/.claude/debug/ inside the container. This directory was never created, causing every agent invocation to crash with ENOENT. Found during live deployment testing on Hetzner. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Claude Agent SDK writes to /home/node/.claude/debug/ but container-runner.ts mounts host sessions dir over /home/node/.claude/, hiding the debug dir created in the Dockerfile. Create it host-side. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Host creates the dir as root but the container runs as UID 1000 (node). Without world-writable permissions the agent SDK gets EACCES. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Host creates volume-mounted directories as root but the container runs as UID 1000 (node). This caused EACCES on IPC file unlink and .claude debug writes. Applies chmod 777 to group dir, sessions dir, debug dir, and all IPC subdirectories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

6-step guided onboarding flow for non-technical users: Welcome → Identity → AI Engine → Channel → Build → Done
- Validates API keys live (Anthropic + OpenRouter)
- Validates Discord/Slack tokens via platform APIs
- WhatsApp path with QR code explanation
- Personalized build phases ("Compiling Adam's brain...")
- Human-readable error messages throughout
- Localhost-only, first-run-only security
- State persisted to store/wizard-state.json
- Writes .env, model-routing.json, groups/main/CLAUDE.md
- Dashboard starts automatically during wizard mode

UX flow refined with Gemini 2.5 Pro review feedback: combined 8 steps to 6, momentum-based ordering, humanized build step, error message philosophy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dash (default /bin/sh on Ubuntu) interprets parentheses in `process.exit(0)` as a subshell, causing the validate phase to fail with exit code 2. Removing shell: true also eliminates shell injection risk. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Weekly GitHub Action checks NanoClaw upstream for new commits, opens PR with merge (flags conflicts if any)
- Separate job checks for Claude Code SDK version bumps in the container Dockerfile
- Deploy script sets up launchd (macOS) or systemd (Linux) service automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Quickstart now leads with the setup wizard instead of manual .env editing
- Added bug report and feature request issue templates with structured forms
- Added SECURITY.md at repo root for GitHub Security tab
- Manual setup preserved as "advanced" fallback section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Animated GIF showing 5-step wizard flow (identity → provider → channel → build → done)
- README Quick Start now points to wizard instead of manual .env editing
- Fix: skip Docker check during wizard mode so new users without Docker can still see the wizard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Brings in CJK font support, WhatsApp message normalization fix, /update-nanoclaw skill (replaces old /update engine, -1508 lines), and docs updates. Kept Sovereign name/version, SignalWire/wallet secrets, and Mac commands in README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

New users without .env get the wizard immediately instead of crashing on WhatsApp/Discord auth. Early return in main() when wizard is incomplete — only init DB and start dashboard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previous OpenClaw comparison had several inaccurate claims (false: "single-threaded", "no payments", "manual setup", "Mac only"). Rewritten to be honest and focus on real differentiators: security by default, self-improving memory, codebase simplicity, revenue tools. Also adds FAQ section and corrects line count to ~20K. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… skills, Telegram channel
5 features for agent autonomy:
1. Identity tools (update_identity, read_identity): agent can modify its own CLAUDE.md with guardrails (immutable sections, injection blocking, audit log)
2. Runtime npm packages: NODE_PATH extended to include persistent /workspace/group/.packages/, npm_config_ignore_scripts=true for supply chain safety
3. Skills tool (list_skills, create_skill): agent can create persistent custom skills that survive across sessions, built-in skills protected from overwrite
4. Telegram channel: grammy-based, same Channel interface as Discord/Slack, JID format tg:{chat_id}, activated via TELEGRAM_BOT_TOKEN env var
5. Container memory limits: --memory 1536m prevents OOM crashes on VPS, configurable via CONTAINER_MEMORY_LIMIT env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Reverse NODE_PATH order so trusted /app/node_modules loads first
- Move npm_config_ignore_scripts after build-time installs (fixes sharp)
- Add ALLOWED_USERS check to Telegram channel
- Block immutable heading injection in update_identity content
- Remove read_identity (redundant), BLOCKED_PATTERNS (bypassable), dead code
- Sanitize YAML newlines in skills.ts description
- Add 50KB content size limit to create_skill

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Keep only Discord + Telegram as active channels
- Remove bump-version, skill-drift, update-tokens, upstream-sync workflows (require upstream GitHub App secrets, always fail on fork)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Voice: transcribe audio/voice messages via OpenRouter (Gemini Flash) before storing, so the agent sees text instead of placeholders. File sending: extend send_message tool with optional filePath param, pipe through IPC with path traversal protection, send via platform APIs (Discord AttachmentBuilder, Telegram sendPhoto/sendDocument). Vision: download images to .attachments/, persist metadata in DB, pass as base64 content blocks to Claude so the agent can see photos. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
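
For the vision path, passing an image "as base64 content blocks" means building a multi-part message content array. The block shape below follows the Anthropic Messages API image format; the helper names are hypothetical, and reading the 5MB cap from the PR summary as 5 × 1024 × 1024 bytes is an assumption.

```typescript
// Size cap from the PR summary, interpreted here as mebibytes (an assumption).
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
}

interface TextBlock {
  type: "text";
  text: string;
}

/** Encode image bytes as a base64 content block, or return null if the image is too large. */
function toImageBlock(bytes: Uint8Array, mediaType: string): ImageBlock | null {
  if (bytes.byteLength > MAX_IMAGE_BYTES) return null;
  return {
    type: "image",
    source: {
      type: "base64",
      media_type: mediaType,
      data: Buffer.from(bytes).toString("base64"),
    },
  };
}

/** Build multi-part content: accepted images first, then the user's text. */
function buildContent(
  text: string,
  images: Array<{ bytes: Uint8Array; mediaType: string }>,
): Array<ImageBlock | TextBlock> {
  const blocks: Array<ImageBlock | TextBlock> = [];
  for (const image of images) {
    const block = toImageBlock(image.bytes, image.mediaType);
    if (block !== null) blocks.push(block); // oversized images are skipped
  }
  blocks.push({ type: "text", text });
  return blocks;
}
```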

brandontan · 2026-03-02T09:38:26Z

Merged directly into fork's main. Not contributing upstream.

brandontan and others added 30 commits February 27, 2026 16:37

docs: Sovereign README with architecture, roadmap, and quickstart

52cc2b8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #19 from brandontan/feat/observer-agent

b986e61

feat: observer agent — auto-compress conversations into observations

Merge pull request #29 from brandontan/feat/zod-validated-outputs

dab383b

feat: Zod-validated LLM output schemas (#20)

Merge pull request #31 from brandontan/feat/quality-tracker

4898fd4

feat: conversation quality tracker with JSONL logging (#6)

Merge pull request #32 from brandontan/feat/auto-learning

521beec

feat: add auto-learner — detect corrections and log learnings

Merge pull request #33 from brandontan/feat/per-step-eval

0e58377

feat: add per-step evaluation — validate agent outputs before use

Merge pull request #34 from brandontan/feat/reflector

a61a3e5

feat: add reflector — deterministic memory garbage collection

Merge pull request #35 from brandontan/feat/structured-memory

319477f

feat: add structured memory — categorized knowledge storage

feat: add task templates — reusable structured prompts with anti-patt…

bc92477

…ern guardrails (#9) (#41) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #43 from brandontan/feat/tool-guardrails-audit

eba1231

feat: tool guardrails audit — rate limits, spend caps (#28)

feat: self-knowledge — agent can explain its own capabilities (#27)

494bddc

Progressive disclosure MCP tool: overview when asked "what can you do?", detailed section on request. Reads structured capabilities.json from workspace. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #44 from brandontan/feat/self-knowledge

1d0b873

feat: self-knowledge — agent explains its own capabilities (#27)

Merge pull request #45 from brandontan/feat/atomic-deploys

d6be27e

feat: atomic rollback deploys — symlink-based release management (#11)

brandontan and others added 21 commits March 2, 2026 07:22

merge: production hardening and security refactor

6ad3f55

hardening: fix .env inline comment parsing and cover dashboard env

4aa53a4

merge: second production hardening pass

692358c

fix: chmod debug dir so container node user can write to it

bcba32b

Host creates the dir as root but the container runs as UID 1000 (node). Without world-writable permissions the agent SDK gets EACCES. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: prettier formatting pass

e666c5f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add unreachable throw to satisfy TypeScript exhaustiveness check

4ba230f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

brandontan requested review from gabi-simons and gavrielc as code owners March 2, 2026 09:27

brandontan closed this Mar 2, 2026

github-actions Bot mentioned this pull request Mar 3, 2026

🦞 OpenClaw Ecosystem Daily 2026-03-03 duanyytop/agents-radar#46

Closed

brandontan deleted the feat/multi-modal-agent branch March 3, 2026 07:07

This was referenced Mar 3, 2026

🦞 OpenClaw Ecosystem Daily 2026-03-03 duanyytop/agents-radar#61

Closed

🦞 OpenClaw Ecosystem Daily 2026-03-03 duanyytop/agents-radar#66

Open

🦞 OpenClaw Ecosystem Daily 2026-03-04 duanyytop/agents-radar#71

Open

brandontan wants to merge 74 commits into qwibitai:main from brandontan:feat/multi-modal-agent
