feat: multi-modal agent — voice, file sending, vision#644

Closed
brandontan wants to merge 74 commits into qwibitai:main from brandontan:feat/multi-modal-agent

@brandontan

Summary

  • Voice transcription: Telegram voice/audio messages and Discord audio attachments are transcribed via OpenRouter (Gemini 2.0 Flash) before reaching the agent. Graceful fallback to placeholder on API failure.
  • File/image sending: send_message tool accepts optional filePath (relative to /workspace/group/). IPC resolves paths with traversal protection. Discord uses AttachmentBuilder, Telegram uses sendPhoto/sendDocument.
  • Vision: Images are downloaded to groups/{folder}/.attachments/, their metadata is stored in the DB (new attachments column), and they are passed to Claude as base64 content blocks. Images are capped at 5MB, and files older than 24h are cleaned up automatically.
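
The path traversal protection mentioned for filePath could look roughly like the sketch below. The helper name `resolveGroupFile` is hypothetical and the PR's actual check in src/ipc.ts is not shown here; the sketch only assumes that paths are relative to /workspace/group/ and must not escape it. It does not follow symlinks, which a real check may also need to consider.

```typescript
import * as path from "node:path";

// Base directory taken from the PR summary.
const GROUP_ROOT = "/workspace/group";

// Hypothetical helper: resolves a tool-supplied relative path and returns
// null if the result would be the root itself or land outside it.
function resolveGroupFile(filePath: string, root: string = GROUP_ROOT): string | null {
  if (path.posix.isAbsolute(filePath)) return null;
  const resolved = path.posix.resolve(root, filePath);
  const rel = path.posix.relative(root, resolved);
  if (rel === "" || rel === ".." || rel.startsWith("../")) return null;
  return resolved;
}
```

With this sketch, `resolveGroupFile("out/chart.png")` yields `/workspace/group/out/chart.png`, while `../secrets.env` and `/etc/passwd` are both rejected.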

Files changed

  • src/transcription.ts - new shared transcription utility
  • src/channels/telegram.ts — voice/audio/photo handling + file sending
  • src/channels/discord.ts — audio/image handling + file sending
  • src/types.ts — FileAttachment, MessageAttachment, updated Channel interface
  • src/router.ts — attachment metadata in formatted messages, file param on routeOutbound
  • src/ipc.ts — filePath handling with path traversal protection
  • src/db.ts — attachments column migration + serialization
  • src/index.ts — attachment collection, file param passthrough
  • src/container-runner.ts — ContainerInputAttachment type
  • container/agent-runner/src/index.ts — multi-part content blocks, image cleanup
  • container/agent-runner/src/tools/messaging.ts — filePath param on send_message
  • .env.example — multi-modal documentation

Test plan

  • Send a voice note in Telegram → agent receives transcript, responds to content
  • Send an audio file in Discord → agent receives transcript
  • Ask agent "Create an image and send it to me" → agent generates + sends file
  • Send a photo in Telegram → agent describes the image
  • Send an image in Discord → agent describes the image
  • Verify graceful degradation: disable API key → voice messages fall back to placeholder
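
The graceful-degradation case in the last item can be illustrated with a small wrapper. This is a sketch under assumptions: the `Transcriber` type, the function name, and the placeholder text are all hypothetical, and the real OpenRouter call is replaced by an injected function so the fallback logic can be exercised without network access.

```typescript
// Hypothetical signature for whatever performs the real API call.
type Transcriber = (audio: Uint8Array) => Promise<string>;

// Hypothetical placeholder text; the PR does not state the actual string.
const VOICE_PLACEHOLDER = "[voice message]";

// Returns the transcript, or the placeholder when no transcriber is
// configured (e.g. missing API key), the call fails, or the result is empty.
async function transcribeOrPlaceholder(
  audio: Uint8Array,
  transcribe: Transcriber | null,
): Promise<string> {
  if (!transcribe) return VOICE_PLACEHOLDER;
  try {
    const text = (await transcribe(audio)).trim();
    return text.length > 0 ? text : VOICE_PLACEHOLDER;
  } catch {
    return VOICE_PLACEHOLDER;
  }
}
```

Because the wrapper never throws, message delivery can continue even when transcription is unavailable.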

🤖 Generated with Claude Code

brandontan and others added 30 commits February 27, 2026 16:37
Core features built on NanoClaw:
- Delegation system: agents spawn sub-agents via IPC (max 3 concurrent workers)
- BM25 memory search: pure JS, zero deps, with EMBEDDING_URL hook for semantic search
- x402 payments: host-side handler, private keys never enter containers
- Credential scrubbing: API keys/tokens auto-redacted from logs and outbound messages
- DM allowlist: restrict who can interact with the agent
- Per-task model override: cheap models for grunt work, smart models for conversations
- Cron auto-pause: auto-disables after 5 consecutive failures
- Discord channel support
- Agent OS template: universal agent config inherited by all agents
- Expanded container: python3, ffmpeg, imagemagick, 15+ npm packages pre-installed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
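
For readers unfamiliar with the BM25 memory search listed above, here is a minimal, dependency-free sketch of Okapi BM25 scoring. It is not the code from this commit: the tokenizer, the function names, and the `k1`/`b` values (common defaults) are assumptions.

```typescript
// Assumed tokenizer: lowercase, split on non-alphanumeric characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Scores every document against the query with Okapi BM25.
function bm25Scores(query: string, docs: string[], k1 = 1.5, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  const queryTerms = Array.from(new Set(tokenize(query)));

  return tokenized.map((doc) => {
    const counts = new Map<string, number>();
    for (const term of doc) counts.set(term, (counts.get(term) ?? 0) + 1);

    let score = 0;
    for (const term of queryTerms) {
      const tf = counts.get(term) ?? 0;
      if (tf === 0) continue;
      // Number of documents containing the term at least once.
      const docFreq = tokenized.filter((d) => d.includes(term)).length;
      const idf = Math.log(1 + (docs.length - docFreq + 0.5) / (docFreq + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}
```

Documents that share no terms with the query score zero, and documents matching more (or rarer) query terms rank higher.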
…ed observations

Implements issue #1 (v0.2.0 milestone). After substantial Discord conversations
(5+ user messages), the observer compresses messages into dated, prioritized
observations (🔴 Critical / 🟡 Useful / 🟢 Noise) via Sonnet 4.6. Observations
are appended to daily/observer/{date}.md and found by BM25 recall.

Security: credential scrubbing, hard delimiters, injection validation, output
scrubbing. Operational: 1-call step budget, ~$0.03 cost ceiling, 30s timeout,
circuit breaker (3 failures + 15min auto-reset), pino trace logging, kill switch
(OBSERVER_ENABLED=false). Fire-and-forget — never blocks conversation delivery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
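
The circuit breaker described above (3 failures, 15-minute auto-reset) might be sketched as follows. The class and method names are hypothetical, and treating a success as resetting the failure count is an assumption; timestamps are passed in explicitly so the behavior is easy to test.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly resetMs: number;

  constructor(maxFailures = 3, resetMs = 15 * 60 * 1000) {
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
  }

  // True if a call may proceed at time `now` (milliseconds).
  canRun(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.resetMs) {
      // Auto-reset: close the breaker and forget earlier failures.
      this.failures = 0;
      this.openedAt = null;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = now;
  }
}
```

A caller would check `canRun()` before each observer call and report the outcome with `recordSuccess()` or `recordFailure()`.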
feat: observer agent — auto-compress conversations into observations
Add central schema registry (schemas.ts) with Zod schemas for all 7
agent types, plus a reusable LLM output validation utility
(validate-llm.ts) that parses JSON, validates against Zod, and retries
once with error feedback.

Retrofit observer to request structured JSON from LLM instead of
freeform markdown, validate with Zod before writing to disk. Fix
credential scrubbing to run after parse (scrubbing raw JSON breaks
structure due to greedy regex).

46 tests pass (16 schema + 10 validation + 15 observer + 5 eval
assertions), 4 eval scaffolds skipped. Typecheck and build clean.

Closes #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Zod-validated LLM output schemas (#20)
Add JSONL conversation logger that scores each interaction with
heuristic quality signals (positive/negative/neutral) based on user
language patterns. Enables offline analysis of which topics the bot
handles well vs poorly.

- Heuristic signal extraction (no LLM call, zero cost)
- JSONL append to {group}/store/conversations.jsonl
- Credential scrubbing on all logged content
- 1MB file size cap, kill switch, message truncation
- Fire-and-forget hook in processGroupMessages
- ConversationLogEntrySchema added to central schema registry

18 tests pass. Typecheck and build clean.

Closes #6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: conversation quality tracker with JSONL logging (#6)
When a user corrects the bot ("No, it's X not Y", "Actually...",
"I meant..."), detect the correction via regex, call LLM to extract
structured learning (wrong → right, knowledge file, context), and
append to {group}/learnings/LEARNINGS.md.

Regex gate before LLM call ensures zero cost for non-correction
messages. Same operational safeguards as observer: circuit breaker,
cooldown, file size cap, credential scrubbing, kill switch.

21 tests pass. Typecheck and build clean.

Closes #4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
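
The regex gate can be pictured with a sketch like the one below. The patterns are illustrative only, modeled on the three example phrasings in the commit message; the actual pattern list is not shown in this PR, and `looksLikeCorrection` is a hypothetical name.

```typescript
// Illustrative patterns based on the examples quoted in the commit message.
const CORRECTION_PATTERNS: RegExp[] = [
  /^no[,.!]?\s+(it'?s|that'?s|it is|that is)\b/i,
  /^actually\b/i,
  /\bi meant\b/i,
];

// Cheap pre-filter: only messages that pass this check would be sent
// to the LLM for structured extraction.
function looksLikeCorrection(message: string): boolean {
  const text = message.trim();
  return CORRECTION_PATTERNS.some((re) => re.test(text));
}
```

Messages that fail the gate never trigger an LLM call, which is what keeps the cost at zero for ordinary conversation.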
feat: add auto-learner — detect corrections and log learnings
Combines Zod schema validation with content-level checks (length bounds,
input grounding) to catch hallucinations and bad data before they're
written to disk. Retry-once logic with error feedback on failure.

32 tests covering helpers, validators, schema/content validation flows,
retry behavior, and StepValidation schema conformance.

Closes #22

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add per-step evaluation — validate agent outputs before use
Prunes observer entries by priority + age: noise at 30 days, useful at
90 days, critical kept forever. Parses observer markdown blocks, applies
configurable retention policy, rewrites or deletes files. No LLM calls.

30 tests covering parser, age computation, filtering, reassembly,
integration, and ReflectorOutput schema conformance.

Closes #2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
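
The retention policy above is simple enough to sketch as a pure function. The type and function names here are hypothetical, and the sketch assumes an entry is pruned once its age reaches the limit (for example, noise is dropped at exactly 30 days); parsing of the observer markdown blocks is left out.

```typescript
type Priority = "critical" | "useful" | "noise";

interface Observation {
  date: string; // ISO date, e.g. "2026-02-20"
  priority: Priority;
  text: string;
}

// Maximum age in days per priority; null means keep forever.
const RETENTION_DAYS: Record<Priority, number | null> = {
  critical: null,
  useful: 90,
  noise: 30,
};

const DAY_MS = 24 * 60 * 60 * 1000;

function ageInDays(isoDate: string, now: Date): number {
  return Math.floor((now.getTime() - new Date(isoDate).getTime()) / DAY_MS);
}

// Keeps only the observations that are still within their retention window.
function prune(observations: Observation[], now: Date): Observation[] {
  return observations.filter((o) => {
    const limit = RETENTION_DAYS[o.priority];
    return limit === null || ageInDays(o.date, now) < limit;
  });
}
```

Since the policy is a lookup table plus date arithmetic, no LLM call is needed and the result is fully deterministic.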
feat: add reflector — deterministic memory garbage collection
Splits memory into domain files (operational, people, incidents,
decisions) with regex-based categorization. Append-only writes with
credential scrubbing, 200KB cap per domain, and migration helper
for single-file to structured split. No LLM calls.

23 tests covering categorization, CRUD, migration, schema conformance,
and credential scrubbing.

Closes #3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add structured memory — categorized knowledge storage
…ions (#5) (#36)

Detects frustration/failure signals (explicit frustration, abandonment,
repeated corrections) via regex gate requiring >= 2 signals before
triggering LLM analysis. Extracts structured HindsightReport (failureType,
whatWentWrong, whatShouldHaveBeen, actionableLearning, severity) and
appends to LEARNINGS.md.

Operational safeguards: kill switch, circuit breaker (3 failures / 15min
auto-reset), 10min per-group cooldown, 200KB file cap, credential
scrubbing, 30s LLM timeout, message truncation, never-throws.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…#37)

Replaces 4 hardcoded fire-and-forget blocks in index.ts with a single
rule engine entry point. Rules are evaluated in order against conversation
context (message count, correction patterns, frustration signals).

Config-driven via optional router-rules.json per group (Zod-validated,
falls back to defaults). Supports composable conditions (always,
minMessages, correctionDetected, frustrationDetected, all, any).

Every routing decision is trace-logged with input → rule matched →
action taken. No LLM calls — all decisions are code-based.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
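
The composable conditions listed above map naturally onto a recursive evaluator. The sketch below is one possible shape, not the commit's code: the type names, the `of` field, and the action strings are assumptions, and it assumes every matching rule fires (in order) rather than only the first.

```typescript
type Condition =
  | { type: "always" }
  | { type: "minMessages"; count: number }
  | { type: "correctionDetected" }
  | { type: "frustrationDetected" }
  | { type: "all"; of: Condition[] }
  | { type: "any"; of: Condition[] };

interface ConversationContext {
  messageCount: number;
  correctionDetected: boolean;
  frustrationDetected: boolean;
}

interface Rule {
  action: string;
  when: Condition;
}

// Recursively evaluates a condition tree against the conversation context.
function matches(cond: Condition, ctx: ConversationContext): boolean {
  switch (cond.type) {
    case "always": return true;
    case "minMessages": return ctx.messageCount >= cond.count;
    case "correctionDetected": return ctx.correctionDetected;
    case "frustrationDetected": return ctx.frustrationDetected;
    case "all": return cond.of.every((c) => matches(c, ctx));
    case "any": return cond.of.some((c) => matches(c, ctx));
  }
}

// Returns the actions of all matching rules, preserving rule order.
function route(rules: Rule[], ctx: ConversationContext): string[] {
  return rules.filter((r) => matches(r.when, ctx)).map((r) => r.action);
}
```

A router-rules.json file could then deserialize directly into `Rule[]` after schema validation.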
…re (#24) (#38)

Stage 1: recall tool returns compact summaries (file, category, score, first line)
Stage 2: recall_detail tool fetches full file content on demand

- Add src/progressive-recall.ts with pure functions for summary extraction
- Add src/progressive-recall.test.ts (35 tests)
- Add mode param (layered/full) to container recall tool, default layered
- Add recall_detail tool with path traversal protection
- Remove duplicate grep-based recall tool (bug fix)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
… MCP tool calls (#25) (#39)

Wraps every MCP tool handler with timing, credential scrubbing, and async
JSONL logging. Daily-rotated files at /workspace/ipc/tool-calls-YYYY-MM-DD.jsonl.

- Add src/tool-observability.ts with pure functions for log entries and scrubbing
- Add src/tool-observability.test.ts (19 tests)
- Monkey-patch server.tool in container to inject observability wrapper
- Fire-and-forget async append — zero impact on tool execution latency

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
) (#40)

Keyword classifier determines task type (research, grunt, conversation,
analysis, content, code, quick-check) then routes to the best model.
Explicit model override always wins. Configurable per-group via
model-routing.json.

- Add src/model-router.ts with classifier, selector, Zod config schema
- Add src/model-router.test.ts (40 tests)
- Wire into task-scheduler.ts and delegation-handler.ts

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
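
A keyword classifier with an override, as described above, might look like this sketch. The keyword lists and model names are placeholders, not values from src/model-router.ts, and defaulting to "conversation" when nothing matches is an assumption.

```typescript
type TaskType =
  | "research" | "grunt" | "conversation" | "analysis"
  | "content" | "code" | "quick-check";

// Placeholder keyword lists; the first matching entry wins.
const KEYWORDS: Array<[TaskType, RegExp]> = [
  ["code", /\b(refactor|debug|compile|stack trace)\b/i],
  ["research", /\b(research|investigate|compare)\b/i],
  ["grunt", /\b(rename|reformat|convert|dedupe)\b/i],
  ["quick-check", /\b(ping|status)\b/i],
];

// Placeholder model names; a per-group config would replace these.
const MODEL_FOR: Record<TaskType, string> = {
  research: "model-large",
  analysis: "model-large",
  code: "model-large",
  content: "model-medium",
  conversation: "model-medium",
  grunt: "model-small",
  "quick-check": "model-small",
};

function classify(prompt: string): TaskType {
  for (const [type, re] of KEYWORDS) {
    if (re.test(prompt)) return type;
  }
  return "conversation"; // assumed default when no keyword matches
}

// An explicit override always wins over the classifier.
function selectModel(prompt: string, override?: string): string {
  return override ?? MODEL_FOR[classify(prompt)];
}
```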
…ern guardrails (#9) (#41)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
#10) (#42)

Blocks dangerous patterns (rm -rf, DROP TABLE, etc.) in both MCP tools and
Bash commands. Per-agent config via tool-guard.json with 3-tier fallback.
Whitespace-normalized matching resists evasion. Config injection hardened.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
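
Whitespace-normalized matching can be sketched in a few lines. The deny-list below contains only the two patterns named in the commit message, and the function names are hypothetical; the per-agent tool-guard.json loading and fallback tiers are omitted.

```typescript
// Example deny-list; the commit names these two patterns among others.
const BLOCKED_PATTERNS = ["rm -rf", "drop table"];

// Lowercases and collapses any run of whitespace (spaces, tabs, newlines)
// into a single space, so "rm   -rf" and "rm\t-rf" compare equal.
function normalizeCommand(command: string): string {
  return command.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns the first blocked pattern found in the command, or null.
function findBlockedPattern(command: string): string | null {
  const normalized = normalizeCommand(command);
  return BLOCKED_PATTERNS.find((p) => normalized.includes(p)) ?? null;
}
```

Substring matching of this kind only handles whitespace tricks; other evasion techniques (quoting, variable expansion) would need separate handling.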
…rk (#28)

Rate limits: send_sms 10/hr, make_call 5/hr. Daily spend cap $10 on x402_fetch.
In-memory per-session state. Completes 5-axis audit (UX, guardrails, concurrency,
observability, autonomy) across all 16 MCP tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
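
The hourly limits above could be enforced with an in-memory limiter like this sketch. The class name is hypothetical, and the sliding one-hour window is an assumption; the commit only states the limits and that state is per-session and in-memory.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Per-tool hourly limits quoted in the commit message.
const HOURLY_LIMITS: Record<string, number> = { send_sms: 10, make_call: 5 };

class HourlyRateLimiter {
  // Timestamps (ms) of accepted calls, per tool.
  private readonly calls = new Map<string, number[]>();

  // Returns true and records the call if `tool` is under its limit.
  tryAcquire(tool: string, now: number = Date.now()): boolean {
    const limit = HOURLY_LIMITS[tool];
    if (limit === undefined) return true; // unlisted tools are not limited here
    const recent = (this.calls.get(tool) ?? []).filter((t) => now - t < HOUR_MS);
    const allowed = recent.length < limit;
    if (allowed) recent.push(now);
    this.calls.set(tool, recent);
    return allowed;
  }
}
```

Because the state lives in memory, the counters reset whenever the session restarts.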
feat: tool guardrails audit — rate limits, spend caps (#28)
Progressive disclosure MCP tool: overview when asked "what can you do?",
detailed section on request. Reads structured capabilities.json from workspace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: self-knowledge — agent explains its own capabilities (#27)
Implements Agent Client Protocol so Sovereign agents can be driven from
Zed, Cursor, and other ACP-compatible clients. Bridges ACP sessions to
the container-runner pipeline. Off by default (ACP_ENABLED=true to enable).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Immutable release snapshots in releases/<sha>/, atomic symlink switch
via rename, instant rollback to previous release, auto-prune keeping
last 5 releases. Pure functions with 20 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
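
The atomic symlink switch can be sketched as below. This assumes a POSIX filesystem, where renaming one symlink over another replaces it atomically. The `current` link name and the function name are assumptions; only the `releases/<sha>/` layout comes from the commit message.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Points `<root>/current` at `<root>/releases/<sha>` by creating a
// temporary symlink and renaming it over the existing one.
function activateRelease(root: string, sha: string): void {
  const target = path.join("releases", sha); // relative link target
  if (!fs.existsSync(path.join(root, target))) {
    throw new Error(`unknown release: ${sha}`);
  }
  const tmpLink = path.join(root, `current.tmp-${process.pid}`);
  fs.rmSync(tmpLink, { force: true }); // clear any leftover temp link
  fs.symlinkSync(target, tmpLink);
  fs.renameSync(tmpLink, path.join(root, "current"));
}
```

Rollback is then the same call with the previous release's sha, since old snapshots stay untouched on disk.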
feat: atomic rollback deploys — symlink-based release management (#11)
brandontan and others added 21 commits March 2, 2026 07:22
The Claude Agent SDK writes debug logs to /home/node/.claude/debug/
inside the container. This directory was never created, causing every
agent invocation to crash with ENOENT. Found during live deployment
testing on Hetzner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Claude Agent SDK writes to /home/node/.claude/debug/ but
container-runner.ts mounts host sessions dir over /home/node/.claude/,
hiding the debug dir created in the Dockerfile. Create it host-side.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Host creates the dir as root but the container runs as UID 1000 (node).
Without world-writable permissions the agent SDK gets EACCES.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Host creates volume-mounted directories as root but the container runs
as UID 1000 (node). This caused EACCES on IPC file unlink and .claude
debug writes. Applies chmod 777 to group dir, sessions dir, debug dir,
and all IPC subdirectories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6-step guided onboarding flow for non-technical users:
Welcome → Identity → AI Engine → Channel → Build → Done

- Validates API keys live (Anthropic + OpenRouter)
- Validates Discord/Slack tokens via platform APIs
- WhatsApp path with QR code explanation
- Personalized build phases ("Compiling Adam's brain...")
- Human-readable error messages throughout
- Localhost-only, first-run-only security
- State persisted to store/wizard-state.json
- Writes .env, model-routing.json, groups/main/CLAUDE.md
- Dashboard starts automatically during wizard mode

UX flow refined with Gemini 2.5 Pro review feedback:
combined 8 steps to 6, momentum-based ordering,
humanized build step, error message philosophy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dash (default /bin/sh on Ubuntu) interprets parentheses in
`process.exit(0)` as a subshell, causing the validate phase to
fail with exit code 2. Removing shell: true also eliminates
shell injection risk.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
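
To illustrate the fix, the sketch below runs a Node one-liner without a shell. `runNodeSnippet` is a hypothetical helper, not the wizard's actual code; the point is that arguments passed as an array go straight to the child process, so characters such as parentheses are never parsed by /bin/sh.

```typescript
import { spawnSync } from "node:child_process";

// Runs a one-line Node script and returns its exit code.
// No `shell: true`: the code string is handed to node verbatim.
function runNodeSnippet(code: string): number | null {
  const result = spawnSync(process.execPath, ["-e", code], { stdio: "ignore" });
  return result.status;
}
```

For example, `runNodeSnippet("process.exit(0)")` reports the child's own exit code instead of a shell syntax error.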
- Weekly GitHub Action checks NanoClaw upstream for new commits,
  opens PR with merge (flags conflicts if any)
- Separate job checks for Claude Code SDK version bumps in
  the container Dockerfile
- Deploy script sets up launchd (macOS) or systemd (Linux)
  service automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Quickstart now leads with the setup wizard instead of manual .env editing
- Added bug report and feature request issue templates with structured forms
- Added SECURITY.md at repo root for GitHub Security tab
- Manual setup preserved as "advanced" fallback section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Animated GIF showing 5-step wizard flow (identity → provider → channel → build → done)
- README Quick Start now points to wizard instead of manual .env editing
- Fix: skip Docker check during wizard mode so new users without Docker can still see the wizard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in CJK font support, WhatsApp message normalization fix,
/update-nanoclaw skill (replaces old /update engine, -1508 lines),
and docs updates. Kept Sovereign name/version, SignalWire/wallet
secrets, and Mac commands in README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New users without .env get the wizard immediately instead of
crashing on WhatsApp/Discord auth. Early return in main() when
wizard is incomplete — only init DB and start dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous OpenClaw comparison had several inaccurate claims (false:
"single-threaded", "no payments", "manual setup", "Mac only").
Rewritten to be honest and focus on real differentiators: security
by default, self-improving memory, codebase simplicity, revenue
tools. Also adds FAQ section and corrects line count to ~20K.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… skills, Telegram channel

5 features for agent autonomy:

1. Identity tools (update_identity, read_identity) — agent can modify its own
   CLAUDE.md with guardrails: immutable sections, injection blocking, audit log
2. Runtime npm packages — NODE_PATH extended to include persistent
   /workspace/group/.packages/, npm_config_ignore_scripts=true for supply chain safety
3. Skills tool (list_skills, create_skill) — agent can create persistent custom
   skills that survive across sessions, built-in skills protected from overwrite
4. Telegram channel — grammy-based, same Channel interface as Discord/Slack,
   JID format tg:{chat_id}, activated via TELEGRAM_BOT_TOKEN env var
5. Container memory limits — --memory 1536m prevents OOM crashes on VPS,
   configurable via CONTAINER_MEMORY_LIMIT env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reverse NODE_PATH order so trusted /app/node_modules loads first
- Move npm_config_ignore_scripts after build-time installs (fixes sharp)
- Add ALLOWED_USERS check to Telegram channel
- Block immutable heading injection in update_identity content
- Remove read_identity (redundant), BLOCKED_PATTERNS (bypassable), dead code
- Sanitize YAML newlines in skills.ts description
- Add 50KB content size limit to create_skill

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Keep only Discord + Telegram as active channels
- Remove bump-version, skill-drift, update-tokens, upstream-sync workflows
  (require upstream GitHub App secrets, always fail on fork)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Voice: transcribe audio/voice messages via OpenRouter (Gemini Flash)
before storing, so the agent sees text instead of placeholders.

File sending: extend send_message tool with optional filePath param,
pipe through IPC with path traversal protection, send via platform
APIs (Discord AttachmentBuilder, Telegram sendPhoto/sendDocument).

Vision: download images to .attachments/, persist metadata in DB,
pass as base64 content blocks to Claude so the agent can see photos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
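
Passing an image as a base64 content block might look like the following sketch. The block shape follows the Anthropic Messages API image format; the function name is hypothetical, and returning null for images over the 5MB cap (taken from the PR summary) is an assumption about how oversize files are handled.

```typescript
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // 5MB cap from the PR summary

interface ImageBlock {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
}

// Builds an image content block, or returns null if the image is too large.
function toImageBlock(bytes: Uint8Array, mediaType: string): ImageBlock | null {
  if (bytes.byteLength > MAX_IMAGE_BYTES) return null;
  const data = Buffer.from(bytes).toString("base64");
  return { type: "image", source: { type: "base64", media_type: mediaType, data } };
}
```

The resulting block can be placed alongside a text block in the user message's content array.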
@brandontan
Author

Merged directly into fork's main. Not contributing upstream.
