Commit ded598f

doxycomp, M2noa, claude, doxy, and cursoragent authored
Merge main: keep upstream tooling & security, add discord_voice, Wyoming, Polly/Edge/Deepgram, fallbacks (#11)
* feat(tts): add Kokoro TTS provider
  - Add `kokoro-js` dependency
  - Implement `KokoroTTSProvider` in `src/tts.ts` for offline/local TTS
  - Update configuration schema in `src/config.ts` to support Kokoro options
  - Update documentation (README, SKILL, CHANGELOG) with usage instructions
  Co-Authored-By: Claude (gemini-3-pro-low) <noreply@anthropic.com>
* feat(stt): add Local Whisper STT provider
  - Add `@xenova/transformers` and `wavefile` dependencies
  - Implement `LocalWhisperSTT` in `src/stt.ts` using Transformers.js
  - Update configuration in `src/config.ts` to support `local-whisper` options
  - Update documentation with usage instructions for offline STT
  Co-Authored-By: Claude (gemini-3-pro-low) <noreply@anthropic.com>
* feat: add local whisper STT support and update docs
* refactor(stt): fix race condition in local model init and use wavefile for wav generation
* fix(voice): prevent stream.push() after EOF crash by handling stream lifecycle manually
* feat: add logging interface and type checking script
  - Introduced a `StreamingSTTLogger` interface for improved logging capabilities in the streaming STT classes.
  - Updated `DeepgramStreamingSTT` and `StreamingSTTManager` to accept a logger instance for better error and info logging.
  - Added a new script in `package.json` for TypeScript type checking.
* fix: improve audio handling during streaming STT sessions
  - Implemented a mechanism to buffer audio chunks while the WebSocket connection is being established, preventing audio loss during the handshake.
  - Introduced a `waitForReady()` method to ensure callers can await the connection before sending audio.
  - Updated `sendAudio` to buffer audio instead of dropping it when the connection is not yet ready.
* fix: enhance audio buffering and connection handling during streaming STT
  - Improved the audio buffering mechanism to handle scenarios where the WebSocket connection is delayed.
  - Ensured that audio chunks are stored until the connection is ready, preventing loss during the initial handshake.
  - Updated the `sendAudio` method to utilize the buffering system effectively.
* feat: OpenClaw compatibility, core-bridge, manifest, smoke test
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: add support for GPT-4o mini transcription and enhance STT provider options
  - Introduced `gpt4o-mini` as a new Speech-to-Text provider for higher quality transcriptions.
  - Updated configuration files and documentation to reflect the new STT provider options.
  - Enhanced ElevenLabs model ID configuration to support v2, v3, and turbo models.
  - Implemented a new `Gpt4oMiniTranscribeSTT` class for handling transcriptions using OpenAI's GPT-4o mini model.
* feat: enhance STT provider options and add thinking sound configuration
  - Expanded Speech-to-Text provider options to include `gpt4o-transcribe` and `gpt4o-transcribe-diarize` for improved transcription capabilities.
  - Introduced a new `thinkingSound` configuration to play a sound while processing, with customizable options for enabling, file path, and volume.
  - Updated documentation and configuration files to reflect these changes, ensuring clarity on new features and usage.
* feat: enhance OpenClaw integration and troubleshooting documentation
  - Added a fallback mechanism to resolve the OpenClaw root directory using `require`, improving compatibility with the OpenClaw gateway process.
  - Updated README.md to include troubleshooting steps for the "Unable to resolve OpenClaw root" error, detailing how to set the `OPENCLAW_ROOT` environment variable.
  - Documented the `OPENCLAW_ROOT` variable in the troubleshooting section for clarity on its usage.
* feat: enhance OpenClaw integration with configurable root path and thinking sound
  - Added `openclawRoot` configuration option to specify the OpenClaw package root directory, improving compatibility with the OpenClaw gateway.
  - Updated `thinkingSound` documentation in README.md to clarify configuration options for enabling, file path, and volume.
  - Refactored core functions to utilize the new `openclawRoot` parameter for loading core dependencies.
  - Enhanced JSON schema in plugin manifests to include `openclawRoot` description for better user guidance.
* fix: update README and package.json for troubleshooting and dependency management
  - Added troubleshooting steps in README.md for the "Cannot find module structures/ClientUser" error, including commands to reinstall dependencies.
  - Updated package.json to include the @sinclair/typebox dependency in the main dependencies section, ensuring it is available for runtime use.
* Phase 0: voice pipeline latency, config safeguards, English docs
  - Kokoro: return null from createStreamingTTSProvider (direct batch fallback)
  - silenceThresholdMs default 800ms
  - thinkingSound.stopDelayMs configurable (default 50ms)
  - Validate timeout values (silenceThresholdMs, minAudioMs, etc.) >= 0
  - REFACTORING.md: translate to English, update LOC, mark Phase 0 done
  - README: stopDelayMs, silenceThresholdMs defaults
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: configurable no-emoji hint for TTS voice prompt
  - noEmojiHint: true | false | string
    - true (default): inject default 'no emojis' text
    - false: do not inject
    - string: use custom hint text
  - Export DEFAULT_NO_EMOJI_HINT from config
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: TTS fallback provider for quota/rate limit errors
  - ttsFallbackProvider config (openai | elevenlabs | kokoro)
  - Detect retryable errors: quota_exceeded, 401, 429, 503
  - Try fallback when primary fails (streaming + batch)
  - Kokoro as free local fallback option
  - README: fallback config, Kokoro marked as free
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: convert Kokoro PCM to WAV for Discord playback
  Kokoro returns raw PCM; Discord needs a format FFmpeg can decode.
  Wrap PCM in WAV header via wavefile before createAudioResource. Shared createResourceFromTTSResult() for main and fallback flows.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: add kokoro and local-whisper to plugin config schema
  - ttsProvider enum: add kokoro (was missing, caused validation error)
  - sttProvider enum: add local-whisper
  - openclaw.plugin.json + clawdbot.plugin.json
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(kokoro): convert PCM Buffer to Int16Array for wavefile WAV creation
  wavefile.fromScratch() expects sample values (-32768..32767), not raw bytes. Passing a Buffer caused corrupt WAV output and no audio playback.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: session fallback, emoji strip, provider voices, ElevenLabs default
  - TTS fallback: stay on fallback provider for rest of session once switched
  - Emoji stripping: remove emojis before TTS when noEmojiHint is set (Kokoro)
  - Provider-specific voices: openai.voice (nova), elevenlabs.voiceId, kokoro.voice
  - OpenAI: validate voice, resolve to nova if invalid
  - ElevenLabs: default modelId eleven_turbo_v2_5
  - README: docs for new options
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: add Deepgram & Polly TTS, ordered fallback providers
  - Add DeepgramTTS (REST API, Opus/OGG)
  - Add PollyTTS (AWS SDK, MP3)
  - Add ttsFallbackProviders array for ordered fallback chain
  - Backward compat: ttsFallbackProvider (single) still supported
  - Session stores which fallback succeeded for rest of channel
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: use deepgram model for TTS when it's an Aura voice (e.g. aura-2-kara-de)
  When model is set to an Aura TTS model (aura-*), use it for TTS instead of defaulting to aura-asteria-en. STT model stays nova-2.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: add Wyoming Faster Whisper STT provider (remote over TCP)
  - WyomingWhisperSTT connects to wyoming-faster-whisper server via TCP
  - Config: wyomingWhisper.host, port, uri, language, connectTimeoutMs
  - Wyoming protocol: transcribe -> audio-start/chunk/stop -> transcript
  - Default: 127.0.0.1:10300
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: add STT fallback providers (quota, rate limit, Wyoming unreachable)
  - sttFallbackProviders array (sttFallbackProvider legacy)
  - isRetryableSttError: quota, 401/429/503, ECONNREFUSED, timeout, wyoming
  - Session stores fallbackSttProvider on success
  - tryTranscribeWithProvider for fallback transcription
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: increase Wyoming Whisper default timeout to 60s
  faster-whisper inference can take 15-30s; 10s was too short
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: Wyoming protocol - parse data_length/payload_length correctly
  Wyoming sends transcript text in separate data bytes, not in header. Implement proper Wyoming message parsing (header + data_length + payload_length). Default timeout back to 10s.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(stt): lazy-load Xenova to prevent OpenClaw crash when local-whisper not used
  Top-level import of @xenova/transformers could crash OpenClaw/Electron on startup. Now Xenova is loaded only when LocalWhisperSTT is actually used (dynamic import in ensureInitialized).
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(voice): prevent duplicate speech and unheard TTS when agent uses discord_voice speak
  - Add spokeViaToolThisRun flag to skip processRecording speak when agent already spoke via discord_voice tool
  - Inject system prompt: 'do not use discord_voice speak tool; just return reply as text' for voice transcript flow
  - Stop thinking sound and re-subscribe main player before speak() when agent calls speak tool during run (connection was on thinkingPlayer, so TTS was not heard)
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(voice): prevent duplicate transcript processing and double speak
  - Clear previous silence timer on each 'end' event to avoid multiple processRecording calls from duplicate Discord speaking.end events
  - Add transcript dedupe: skip same transcript within 5s window
  - Use current session from map for spokeViaToolThisRun check (agent auto-join may replace session during run)
  - Clear silence timer handle when callback fires
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: prevent duplicate plugin init when loaded via both openclaw and clawdbot extensions
  Add discordVoiceRegistered singleton guard - skip duplicate register() to avoid two Discord clients and double voice event processing. Reset on stop/enabled=false.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(stt): Wyoming always retried, other primaries stick to fallback
  - When primary is wyoming-whisper: try primary each time (service may be temporarily down), never persist fallback
  - When primary is other (gpt4o-mini, etc.): stick to fallback for rest of session once switched
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat: runtime provider/model/think overrides via set-stt, set-tts, set-model, set-think
  - Session overrides for STT/TTS provider, model, thinkLevel
  - TTS speak() uses effective provider (override or config)
  - handleTranscript uses session model/think overrides for agent run
  - Gateway: discord-voice.set-stt, set-tts, set-model, set-think, models
  - Agent tool: set-stt, set-tts, set-model, set-think actions
  - CLI: voice set-stt, set-tts, set-model, set-think
  - getAvailableModels() from agents.list/defaults for suggestions
  - Status includes model, thinkLevel, availableModels
  - README: document new commands
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: add wyomingWhisper and polly.accessKeyId/secretAccessKey to clawdbot schema
  Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor: use discord_voice prefix for slash commands and CLI to avoid overlap with TTS/voice commands
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: allow additional config properties, add localWhisper and thinkingSound.stopDelayMs to schema
  Co-authored-by: Cursor <cursoragent@cursor.com>
* chore: exclude .claude/ and REFACTORING.md from repo
  Co-authored-by: Cursor <cursoragent@cursor.com>
* Allow additional config properties in plugin schema
  Set additionalProperties: true so OpenClaw config validation accepts config keys not explicitly listed (fixes 'must NOT have additional properties' error when loading plugin).
  Co-authored-by: Cursor <cursoragent@cursor.com>
* chore: add hono@4.11.9 to satisfy peer dep and fix npm ci
  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Minoa <altgithub@minoa.cat>
Co-authored-by: Claude (gemini-3-pro-low) <noreply@anthropic.com>
Co-authored-by: doxy <doxycomp@gmx.net>
Co-authored-by: Cursor <cursoragent@cursor.com>
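
The TTS fallback items in the commit message describe an ordered chain: try the primary provider, move to the next entry of `ttsFallbackProviders` on a retryable error, and stay on whichever fallback succeeded for the rest of the session. The following is a minimal TypeScript sketch of that control flow, not the plugin's code. `Synthesize`, `Session`, `speakWithFallback`, and the substring matching in `isRetryableError` are hypothetical names and simplifications; only the retryable markers (`quota_exceeded`, 401, 429, 503) are taken from the commit message.

```typescript
// Hypothetical provider shape; the real interfaces in src/tts.ts may differ.
type Synthesize = (text: string) => Promise<Uint8Array>;

interface Session {
  // Fallback provider that last succeeded; reused for the rest of the session.
  stickyProvider?: string;
}

// Markers named in the commit message. Substring matching is an assumption.
const RETRYABLE_MARKERS = ["quota_exceeded", "401", "429", "503"];

function isRetryableError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return RETRYABLE_MARKERS.some((marker) => message.includes(marker));
}

async function speakWithFallback(
  text: string,
  primary: string,
  fallbacks: string[],
  providers: Map<string, Synthesize>,
  session: Session,
): Promise<{ provider: string; audio: Uint8Array }> {
  // Start from the sticky fallback if one exists, then primary, then the rest.
  const order = Array.from(
    new Set([session.stickyProvider ?? primary, primary, ...fallbacks]),
  );
  let lastError: unknown = new Error("no TTS provider available");
  for (const name of order) {
    const synthesize = providers.get(name);
    if (synthesize === undefined) continue; // not configured, skip
    try {
      const audio = await synthesize(text);
      if (name !== primary) session.stickyProvider = name;
      return { provider: name, audio };
    } catch (err) {
      lastError = err;
      if (!isRetryableError(err)) throw err; // do not mask non-retryable errors
    }
  }
  throw lastError;
}
```

Rethrowing non-retryable errors immediately keeps a misconfiguration (for example an invalid voice) from being hidden by a silent switch to another provider.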
1 parent 0a7723b commit ded598f
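
The Kokoro fix in the commit message states that `wavefile.fromScratch()` expects sample values in the range -32768..32767 rather than raw bytes. The sketch below shows one way such a conversion could look. It assumes 16-bit little-endian PCM (the byte order is not stated in the commit), and `pcmBytesToInt16` is a hypothetical helper name, not a function from the repository.

```typescript
// Reinterpret raw 16-bit PCM bytes as signed samples.
// Assumption: little-endian byte order; a trailing odd byte is dropped.
function pcmBytesToInt16(bytes: Uint8Array): Int16Array {
  const sampleCount = Math.floor(bytes.byteLength / 2);
  // Respect byteOffset so views into a larger buffer are read correctly.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Int16Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    samples[i] = view.getInt16(i * 2, true); // true selects little-endian
  }
  return samples;
}

// Three samples: minimum, maximum, and -1.
const demoSamples = pcmBytesToInt16(
  new Uint8Array([0x00, 0x80, 0xff, 0x7f, 0xff, 0xff]),
);
console.log(Array.from(demoSamples)); // [-32768, 32767, -1]
```

A Node.js `Buffer` is a `Uint8Array` subclass, so it can be passed to this helper directly. Reading through a `DataView` avoids the alignment requirement that constructing an `Int16Array` over an arbitrary byte offset would impose.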

16 files changed: +2158 -1084 lines changed

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Streaming TTS for real-time audio playback
 - Barge-in support to interrupt bot responses
 - Auto-reconnect with heartbeat monitoring
-- Discord slash commands: `/voice join`, `/voice leave`, `/voice status`
+- Discord slash commands: `/discord_voice join`, `/discord_voice leave`, `/discord_voice status`
 - CLI commands for voice management
 - Agent tool `discord_voice` for programmatic control
 - Configurable VAD sensitivity (low/medium/high)

README.md

Lines changed: 144 additions & 33 deletions
Large diffs are not rendered by default.

SKILL.md

Lines changed: 6 additions & 6 deletions
@@ -184,21 +184,21 @@ Add these to your bot's OAuth2 URL or configure in Discord Developer Portal.

 Once registered with Discord, use these commands:

-- `/voice join <channel>` - Join a voice channel
-- `/voice leave` - Leave the current voice channel
-- `/voice status` - Show voice connection status
+- `/discord_voice join <channel>` - Join a voice channel
+- `/discord_voice leave` - Leave the current voice channel
+- `/discord_voice status` - Show voice connection status

 ### CLI Commands

 ```bash
 # Join a voice channel
-clawdbot voice join <channelId>
+clawdbot discord_voice join <channelId>

 # Leave voice
-clawdbot voice leave --guild <guildId>
+clawdbot discord_voice leave --guild <guildId>

 # Check status
-clawdbot voice status
+clawdbot discord_voice status
 ```

 ### Agent Tool

clawdbot.plugin.json

Lines changed: 87 additions & 68 deletions
@@ -27,32 +27,50 @@
   },
   "configSchema": {
     "type": "object",
-    "additionalProperties": false,
+    "additionalProperties": true,
     "properties": {
-      "enabled": {
-        "type": "boolean",
-        "default": true
-      },
+      "enabled": { "type": "boolean", "default": true },
       "sttProvider": {
         "type": "string",
-        "enum": ["whisper", "gpt4o-mini", "gpt4o-transcribe", "gpt4o-transcribe-diarize", "deepgram", "local-whisper"],
+        "enum": [
+          "whisper",
+          "gpt4o-mini",
+          "gpt4o-transcribe",
+          "gpt4o-transcribe-diarize",
+          "deepgram",
+          "local-whisper",
+          "wyoming-whisper"
+        ],
         "default": "whisper"
       },
+      "sttFallbackProviders": {
+        "type": "array",
+        "items": {
+          "type": "string",
+          "enum": [
+            "whisper",
+            "gpt4o-mini",
+            "gpt4o-transcribe",
+            "gpt4o-transcribe-diarize",
+            "deepgram",
+            "local-whisper",
+            "wyoming-whisper"
+          ]
+        },
+        "description": "Fallback STT when primary fails (quota, rate limit, Wyoming unreachable)"
+      },
       "streamingSTT": {
         "type": "boolean",
         "default": true,
         "description": "Use streaming STT for lower latency (Deepgram only)"
       },
       "ttsProvider": {
         "type": "string",
-        "enum": ["openai", "elevenlabs", "kokoro"],
+        "enum": ["openai", "elevenlabs", "deepgram", "polly", "edge", "kokoro"],
         "default": "openai",
-        "description": "openai, elevenlabs, or kokoro (free local)"
-      },
-      "ttsVoice": {
-        "type": "string",
-        "default": "nova"
+        "description": "openai, elevenlabs, edge (free), or kokoro (free local)"
       },
+      "ttsVoice": { "type": "string", "default": "nova" },
       "vadSensitivity": {
         "type": "string",
         "enum": ["low", "medium", "high"],
@@ -73,42 +91,24 @@
         "default": "off",
         "description": "Thinking level for voice responses (lower = faster)"
       },
-      "allowedUsers": {
-        "type": "array",
-        "items": { "type": "string" },
-        "default": []
-      },
-      "silenceThresholdMs": {
-        "type": "number",
-        "default": 1500
-      },
-      "minAudioMs": {
-        "type": "number",
-        "default": 500
-      },
-      "maxRecordingMs": {
-        "type": "number",
-        "default": 30000
-      },
+      "allowedUsers": { "type": "array", "items": { "type": "string" }, "default": [] },
+      "silenceThresholdMs": { "type": "number", "default": 1500 },
+      "minAudioMs": { "type": "number", "default": 500 },
+      "maxRecordingMs": { "type": "number", "default": 30000 },
       "heartbeatIntervalMs": {
         "type": "number",
         "default": 30000,
         "description": "Connection health check interval in ms"
       },
-      "autoJoinChannel": {
-        "type": "string",
-        "description": "Voice channel ID to auto-join on startup"
-      },
-      "openclawRoot": {
-        "type": "string",
-        "description": "OpenClaw package root if auto-detection fails"
-      },
+      "autoJoinChannel": { "type": "string", "description": "Voice channel ID to auto-join on startup" },
+      "openclawRoot": { "type": "string", "description": "OpenClaw package root if auto-detection fails" },
       "thinkingSound": {
         "type": "object",
         "properties": {
           "enabled": { "type": "boolean", "default": true },
           "path": { "type": "string", "default": "assets/thinking.mp3" },
-          "volume": { "type": "number", "default": 0.7, "minimum": 0, "maximum": 1 }
+          "volume": { "type": "number", "default": 0.7, "minimum": 0, "maximum": 1 },
+          "stopDelayMs": { "type": "number", "default": 50 }
         }
       },
       "openai": {
124124
}
125125
}
126126
},
127+
"polly": {
128+
"type": "object",
129+
"properties": {
130+
"region": { "type": "string", "default": "us-east-1" },
131+
"voiceId": { "type": "string", "default": "Joanna" },
132+
"engine": { "type": "string", "enum": ["standard", "neural", "long-form", "generative"] },
133+
"accessKeyId": { "type": "string" },
134+
"secretAccessKey": { "type": "string" }
135+
}
136+
},
137+
"edge": {
138+
"type": "object",
139+
"properties": {
140+
"voice": { "type": "string", "default": "de-DE-KatjaNeural" },
141+
"lang": { "type": "string", "default": "de-DE" },
142+
"outputFormat": { "type": "string", "default": "webm-24khz-16bit-mono-opus" },
143+
"rate": { "type": "string" },
144+
"pitch": { "type": "string" },
145+
"volume": { "type": "string" },
146+
"proxy": { "type": "string" },
147+
"timeoutMs": { "type": "number" }
148+
}
149+
},
127150
"kokoro": {
128151
"type": "object",
129152
"properties": {
@@ -148,8 +171,25 @@
         "type": "object",
         "properties": {
           "apiKey": { "type": "string" },
-          "model": { "type": "string", "default": "nova-2" }
+          "model": { "type": "string", "default": "nova-2" },
+          "ttsModel": { "type": "string", "default": "aura-asteria-en" }
         }
+      },
+      "wyomingWhisper": {
+        "type": "object",
+        "properties": {
+          "host": { "type": "string", "default": "127.0.0.1" },
+          "port": { "type": "number", "default": 10300 },
+          "uri": { "type": "string" },
+          "language": { "type": "string" },
+          "connectTimeoutMs": { "type": "number", "default": 10000 }
+        },
+        "description": "Wyoming Faster Whisper (remote STT over TCP)"
+      },
+      "localWhisper": {
+        "type": "object",
+        "properties": { "model": { "type": "string" }, "quantized": { "type": "boolean" } },
+        "description": "Local Whisper STT (Xenova)"
       }
     }
   },
@@ -164,7 +204,7 @@
     },
     "ttsProvider": {
       "label": "Text-to-Speech Provider",
-      "help": "Use 'openai' or 'elevenlabs'"
+      "help": "openai, elevenlabs, deepgram, polly, edge, or kokoro"
     },
     "ttsVoice": {
       "label": "TTS Voice (deprecated)",
@@ -174,36 +214,15 @@
       "label": "OpenAI TTS Voice",
       "help": "nova, shimmer, echo, onyx, fable, alloy, ash, sage, coral"
     },
-    "kokoro.voice": {
-      "label": "Kokoro TTS Voice",
-      "help": "af_heart, af_bella, af_nicole, etc."
-    },
-    "vadSensitivity": {
-      "label": "VAD Sensitivity",
-      "help": "Voice activity detection sensitivity (low/medium/high)"
-    },
-    "allowedUsers": {
-      "label": "Allowed Users",
-      "help": "Discord user IDs allowed to use voice (empty = all allowed)"
-    },
-    "openai.apiKey": {
-      "label": "OpenAI API Key",
-      "sensitive": true
-    },
-    "elevenlabs.apiKey": {
-      "label": "ElevenLabs API Key",
-      "sensitive": true
-    },
-    "elevenlabs.modelId": {
-      "label": "ElevenLabs Model",
-      "help": "turbo, flash, v2, v3 (or full model ID)"
-    },
+    "kokoro.voice": { "label": "Kokoro TTS Voice", "help": "af_heart, af_bella, af_nicole, etc." },
+    "vadSensitivity": { "label": "VAD Sensitivity", "help": "Voice activity detection sensitivity (low/medium/high)" },
+    "allowedUsers": { "label": "Allowed Users", "help": "Discord user IDs allowed to use voice (empty = all allowed)" },
+    "openai.apiKey": { "label": "OpenAI API Key", "sensitive": true },
+    "elevenlabs.apiKey": { "label": "ElevenLabs API Key", "sensitive": true },
+    "elevenlabs.modelId": { "label": "ElevenLabs Model", "help": "turbo, flash, v2, v3 (or full model ID)" },
     "thinkingSound.enabled": { "label": "Thinking Sound", "help": "Play sound while processing" },
     "thinkingSound.path": { "label": "Thinking Sound File", "help": "Path to MP3" },
     "thinkingSound.volume": { "label": "Thinking Sound Volume", "help": "Volume 0-1" },
-    "deepgram.apiKey": {
-      "label": "Deepgram API Key",
-      "sensitive": true
-    }
+    "deepgram.apiKey": { "label": "Deepgram API Key", "sensitive": true }
   }
 }
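
To make the schema changes in this file concrete, here is a hedged TypeScript sketch of a configuration that uses the new keys, plus a small enum check. The provider lists and key names are copied from the diff. `exampleConfig` and `findConfigErrors` are hypothetical and only mirror part of what a JSON Schema validator would do. Unknown keys are deliberately not reported, matching `"additionalProperties": true`.

```typescript
// Enum values copied from the schema diff; everything else is illustrative.
const STT_PROVIDERS: readonly string[] = [
  "whisper", "gpt4o-mini", "gpt4o-transcribe", "gpt4o-transcribe-diarize",
  "deepgram", "local-whisper", "wyoming-whisper",
];
const TTS_PROVIDERS: readonly string[] = [
  "openai", "elevenlabs", "deepgram", "polly", "edge", "kokoro",
];

// Example settings using a subset of the keys that appear in the diff.
const exampleConfig: Record<string, unknown> = {
  sttProvider: "wyoming-whisper",
  sttFallbackProviders: ["local-whisper", "whisper"],
  ttsProvider: "edge",
  wyomingWhisper: { host: "127.0.0.1", port: 10300, connectTimeoutMs: 10000 },
  edge: { voice: "de-DE-KatjaNeural", lang: "de-DE" },
  thinkingSound: { enabled: true, volume: 0.7, stopDelayMs: 50 },
};

// Hypothetical helper: returns readable problems, or an empty array.
function findConfigErrors(config: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const isOneOf = (allowed: readonly string[], value: unknown): boolean =>
    typeof value === "string" && allowed.indexOf(value) !== -1;

  const stt = config["sttProvider"];
  if (stt !== undefined && !isOneOf(STT_PROVIDERS, stt)) {
    errors.push(`unknown sttProvider: ${String(stt)}`);
  }
  const tts = config["ttsProvider"];
  if (tts !== undefined && !isOneOf(TTS_PROVIDERS, tts)) {
    errors.push(`unknown ttsProvider: ${String(tts)}`);
  }
  const fallbacks = config["sttFallbackProviders"];
  if (fallbacks !== undefined) {
    if (!Array.isArray(fallbacks)) {
      errors.push("sttFallbackProviders must be an array");
    } else {
      for (const name of fallbacks) {
        if (!isOneOf(STT_PROVIDERS, name)) {
          errors.push(`unknown STT fallback: ${String(name)}`);
        }
      }
    }
  }
  return errors; // extra keys are allowed, so they are never flagged
}

console.log(findConfigErrors(exampleConfig)); // []
```

Both provider keys are optional in this sketch because the schema gives them defaults (`whisper` and `openai`).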
