Real-time voice conversations in Discord voice channels. Join a voice channel, speak, and have your words transcribed, processed by Claude, and spoken back.
- Join/Leave Voice Channels: Via slash commands, CLI, or agent tool
- Voice Activity Detection (VAD): Automatically detects when users are speaking
- Speech-to-Text: Whisper API (OpenAI), Deepgram, or Local Whisper (Offline)
- Streaming STT: Real-time transcription with Deepgram WebSocket (~1s latency reduction)
- Agent Integration: Transcribed speech is routed through the Clawdbot agent
- Text-to-Speech: OpenAI TTS, ElevenLabs, Deepgram Aura, Amazon Polly, Edge TTS (Microsoft, free), or Kokoro (Local/Offline)
- Audio Playback: Responses are spoken back in the voice channel
- Barge-in Support: Stops speaking immediately when user starts talking
- Thinking Sound: Optional looping sound while processing (configurable)
- Auto-reconnect: Automatic heartbeat monitoring and reconnection on disconnect
- Discord bot with voice permissions (Connect, Speak, Use Voice Activity)
- API keys for STT and TTS providers
- System dependencies for voice:
  - ffmpeg (audio processing)
  - Native build tools for @discordjs/opus and sodium-native
```sh
# Ubuntu/Debian
sudo apt-get install ffmpeg build-essential python3

# Fedora/RHEL
sudo dnf install ffmpeg gcc-c++ make python3

# macOS
brew install ffmpeg
```

Install the plugin's dependencies:

```sh
# When installed as OpenClaw plugin
cd ~/.openclaw/extensions/discord-voice
npm install

# Or for development (link from OpenClaw workspace)
openclaw plugins install ./path/to/discord-voice
```

Minimal configuration:

```json5
{
  plugins: {
    entries: {
      "discord-voice": {
        enabled: true,
        config: {
          sttProvider: "whisper",
          ttsProvider: "openai",
          ttsVoice: "nova",
          vadSensitivity: "medium",
          allowedUsers: [], // Empty = allow all users
          silenceThresholdMs: 800,
          maxRecordingMs: 30000,
          openai: {
            apiKey: "sk-...", // Or use OPENAI_API_KEY env var
          },
        },
      },
    },
  },
}
```

Complete example (Grok + ElevenLabs + GPT-4o-mini STT):
```json5
{
  plugins: {
    entries: {
      "discord-voice": {
        enabled: true,
        config: {
          autoJoinChannel: "DISCORDCHANNELID",
          model: "xai/grok-4-1-fast-non-reasoning",
          thinkLevel: "off",
          sttProvider: "gpt4o-mini",
          ttsProvider: "elevenlabs",
          ttsVoice: "VOICEID",
          vadSensitivity: "medium",
          bargeIn: true,
          openai: { apiKey: "sk-proj-..." },
          elevenlabs: { apiKey: "sk_...", modelId: "turbo" },
        },
      },
    },
  },
}
```

Replace `DISCORDCHANNELID` with your Discord voice channel ID and `VOICEID` with your ElevenLabs voice ID.
Ensure your Discord bot has these permissions:
- Connect - Join voice channels
- Speak - Play audio
- Use Voice Activity - Detect when users speak
Add these to your bot's OAuth2 URL or configure in Discord Developer Portal.
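The three permissions correspond to Discord permission bit flags; a quick sketch (flag values taken from the Discord API docs) of how the `permissions` integer for the OAuth2 invite URL is composed:

```typescript
// Discord permission bit flags (per the Discord API docs).
const CONNECT = 1 << 20;   // Connect
const SPEAK = 1 << 21;     // Speak
const USE_VAD = 1 << 25;   // Use Voice Activity
// Combined value for the OAuth2 URL's "permissions" parameter.
console.log(CONNECT | SPEAK | USE_VAD); // 36700160
```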
| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Enable/disable the plugin |
| `sttProvider` | string | `"whisper"` | `"whisper"`, `"local-whisper"`, `"wyoming-whisper"`, `"gpt4o-mini"`, `"gpt4o-transcribe"`, `"gpt4o-transcribe-diarize"` (OpenAI), or `"deepgram"` |
| `sttFallbackProvider` | string | undefined | Single fallback (legacy). Prefer `sttFallbackProviders`. |
| `sttFallbackProviders` | string[] | undefined | Fallback STT when primary fails (quota, rate limit, Wyoming unreachable). E.g. `["local-whisper", "wyoming-whisper"]`. |
| `streamingSTT` | boolean | `true` | Use streaming STT (Deepgram only, ~1s faster) |
| `ttsProvider` | string | `"openai"` | `"openai"`, `"elevenlabs"`, `"deepgram"`, `"polly"`, `"edge"`, or `"kokoro"` |
| `ttsVoice` | string | `"nova"` | Deprecated – use provider-specific: `openai.voice`, `elevenlabs.voiceId`, `kokoro.voice` |
| `vadSensitivity` | string | `"medium"` | `"low"`, `"medium"`, or `"high"` |
| `bargeIn` | boolean | `true` | Stop speaking when user talks |
| `allowedUsers` | string[] | `[]` | User IDs allowed (empty = all) |
| `silenceThresholdMs` | number | `800` | Silence before processing (ms); lower = snappier |
| `maxRecordingMs` | number | `30000` | Max recording length (ms) |
| `heartbeatIntervalMs` | number | `30000` | Connection health check interval |
| `autoJoinChannel` | string | undefined | Channel ID to auto-join on startup |
| `openclawRoot` | string | undefined | OpenClaw package root if auto-detection fails |
| `thinkingSound` | object | see Thinking Sound | Sound played while processing |
| `noEmojiHint` | boolean \| string | `true` | Inject TTS hint into agent prompt; when set, emojis are also stripped from responses before TTS (avoids Kokoro reading them aloud) |
| `ttsFallbackProvider` | string | undefined | Single fallback (legacy). Prefer `ttsFallbackProviders`. |
| `ttsFallbackProviders` | string[] | undefined | Fallback TTS providers tried in order when primary fails (quota/rate limit). E.g. `["edge", "polly", "kokoro"]`. Once one succeeds, the session stays on it until the bot leaves the channel. |
When a plugin option is not set, the plugin uses values from the main OpenClaw config when available:
| Plugin option | Fallback source(s) |
|---|---|
| `model` | `agents.defaults.model.primary` or `agents.list[0].model` |
| `ttsProvider` | `tts.provider` |
| `ttsVoice` | `tts.voice` |
| OpenAI `apiKey` | `talk.apiKey`, `providers.openai.apiKey`, or `models.providers.openai.apiKey` |
| ElevenLabs `apiKey` | `plugins.entries.elevenlabs.config.apiKey` |

The Discord bot token is always read from `channels.discord.token` (or `channels.discord.accounts.default.token`).
```json5
{
  openai: {
    apiKey: "sk-...",
    whisperModel: "whisper-1", // or use sttProvider: "gpt4o-mini"
    ttsModel: "tts-1",
    voice: "nova", // nova, shimmer, echo, onyx, fable, alloy, ash, sage, coral (default: nova)
  },
}
```

OpenAI STT options: `whisper` (legacy), `gpt4o-mini` (faster, cheaper), `gpt4o-transcribe` (higher quality), `gpt4o-transcribe-diarize` (with speaker identification).
```json5
{
  elevenlabs: {
    apiKey: "...",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel (ElevenLabs voice ID)
    modelId: "turbo", // turbo | flash | v2 | v3
  },
}
```

- `modelId: "turbo"` – eleven_turbo_v2_5 (default, fastest, lowest latency)
- `modelId: "flash"` – eleven_flash_v2_5 (fast)
- `modelId: "v2"` – eleven_multilingual_v2 (balanced)
- `modelId: "v3"` – eleven_multilingual_v3 (most expressive)
```json5
{
  sttProvider: "deepgram",
  deepgram: {
    apiKey: "...",
    model: "nova-2", // STT model
    ttsModel: "aura-asteria-en", // TTS model (Aura), default
  },
}
```

Use `ttsProvider: "deepgram"` for TTS. Aura models: `aura-asteria-en`, `aura-2-thalia-en`, etc. Output: Opus/OGG for Discord.
Uses AWS credentials (env vars, profile, or explicit keys). Default voice: Joanna.
```json5
{
  ttsProvider: "polly",
  polly: {
    region: "us-east-1",
    voiceId: "Joanna",
    engine: "neural", // optional: standard | neural | long-form | generative
    accessKeyId: "...", // optional, else uses AWS default chain
    secretAccessKey: "...",
  },
}
```

No API key required. Runs locally using Xenova/Transformers.
```json5
{
  sttProvider: "local-whisper",
  localWhisper: {
    model: "Xenova/whisper-tiny.en", // Optional, default
    quantized: true, // Optional, smaller/faster
  },
}
```

Connects to a Wyoming Faster Whisper server over TCP. Run the server on a host (e.g. Docker) and point the plugin at `host:port`.
```json5
{
  sttProvider: "wyoming-whisper",
  wyomingWhisper: {
    host: "192.168.1.10", // or "remote-host.local"
    port: 10300, // default Wyoming port
    language: "de", // optional hint (de, en, etc.)
  },
}
```

```json5
// Alternative: use uri instead of host+port
{
  wyomingWhisper: {
    uri: "192.168.1.10:10300",
  },
}
```

Run Wyoming Faster Whisper (Docker):

```sh
docker run -p 10300:10300 -v /data:/data rhasspy/wyoming-whisper --model tiny-int8 --language en
```
No API key required. Uses Microsoft's online neural TTS via node-edge-tts. Default voice: Katja (de-DE). Output format optimized for Discord (WebM/Opus).
```json5
{
  ttsProvider: "edge",
  edge: {
    voice: "de-DE-KatjaNeural", // Default: Katja (German)
    lang: "de-DE",
    outputFormat: "webm-24khz-16bit-mono-opus", // Best for Discord
    rate: "+0%", // Optional: e.g. "+10%", "-5%"
    pitch: "+0%", // Optional
    volume: "+0%", // Optional
    proxy: undefined, // Optional: proxy URL
    timeoutMs: 30000,
  },
}
```

No API key required. Runs locally on CPU. Use as primary or in `ttsFallbackProviders` when ElevenLabs/OpenAI hit quota limits. With `noEmojiHint` enabled (default), emojis are stripped from responses before TTS so Kokoro does not try to read them aloud.
```json5
{
  ttsProvider: "kokoro",
  kokoro: {
    voice: "af_heart", // af_heart, af_bella, af_nicole, etc. (default: af_heart)
    modelId: "onnx-community/Kokoro-82M-v1.0-ONNX", // Optional
    dtype: "fp32", // Optional: "fp32", "q8", "q4"
  },
}
```

When the primary TTS provider fails with a quota or rate-limit error, fallback providers are tried in order. Once one succeeds, the session stays on it until the bot leaves the voice channel.
```json5
// Multiple fallbacks (tried in order: edge → polly → kokoro)
{
  ttsProvider: "elevenlabs",
  ttsFallbackProviders: ["edge", "polly", "kokoro"],
  elevenlabs: { apiKey: "...", voiceId: "...", modelId: "turbo" },
}

// Single fallback (legacy, same as ttsFallbackProviders: ["kokoro"])
{
  ttsProvider: "elevenlabs",
  ttsFallbackProvider: "kokoro",
  elevenlabs: { apiKey: "...", voiceId: "...", modelId: "turbo" },
}
```

When the primary STT provider fails (quota, rate limit, or Wyoming unreachable), fallback providers are tried in order. Once one succeeds, the session stays on it until the bot leaves the voice channel.
```json5
{
  sttProvider: "wyoming-whisper",
  sttFallbackProviders: ["local-whisper", "whisper"],
  wyomingWhisper: { host: "192.168.1.10", port: 10300 },
}
```

Once registered with Discord, use these commands (prefixed `/discord_voice` to avoid overlap with other voice/TTS commands):
- `/discord_voice join <channel>` - Join a voice channel
- `/discord_voice leave` - Leave the current voice channel
- `/discord_voice status` - Show voice connection status, STT/TTS provider, model, think level, and available models
- `/discord_voice reset-fallback` - Reset STT/TTS fallbacks; next request will try primary providers again
- `/discord_voice set-stt <provider>` - Set STT provider (whisper, gpt4o-mini, deepgram, local-whisper, wyoming-whisper, etc.)
- `/discord_voice set-tts <provider>` - Set TTS provider (openai, elevenlabs, deepgram, polly, kokoro, edge)
- `/discord_voice set-model <model>` - Set LLM model (e.g. google-gemini-cli/gemini-3-fast-preview, xai/grok-4-1-fast-non-reasoning)
- `/discord_voice set-think <level>` - Set thinking level (off, low, medium, high)
```sh
# Join a voice channel
clawdbot discord_voice join <channelId>

# Leave voice
clawdbot discord_voice leave --guild <guildId>

# Check status (includes STT/TTS, model, think level, available models)
clawdbot discord_voice status

# Set STT provider
clawdbot discord_voice set-stt <provider> [--guild <guildId>]

# Set TTS provider
clawdbot discord_voice set-tts <provider> [--guild <guildId>]

# Set LLM model
clawdbot discord_voice set-model <model> [--guild <guildId>]

# Set thinking level
clawdbot discord_voice set-think <level> [--guild <guildId>]

# Reset fallbacks – use primary providers on next request
clawdbot discord_voice reset-fallback --guild <guildId>
```

The agent can use the `discord_voice` tool:
Join voice channel 1234567890
The tool supports actions:
- `join` - Join a voice channel (requires channelId)
- `leave` - Leave voice channel
- `speak` - Speak text in the voice channel
- `status` - Get current voice status (STT/TTS, model, think level, available models)
- `reset-fallback` - Reset fallbacks; next request tries primary providers
- `set-stt` - Set STT provider for session
- `set-tts` - Set TTS provider for session
- `set-model` - Set LLM model (e.g. google-gemini-cli/gemini-3-fast-preview)
- `set-think` - Set thinking level (off, low, medium, high)
- Join: Bot joins the specified voice channel
- Listen: VAD detects when users start/stop speaking
- Record: Audio is buffered while user speaks
- Transcribe: On silence, audio is sent to STT provider
- Process: Transcribed text is sent to Clawdbot agent
- Synthesize: Agent response is converted to audio via TTS
- Play: Audio is played back in the voice channel
When using Deepgram as your STT provider, streaming mode is enabled by default. This provides:
- ~1 second faster end-to-end latency
- Real-time feedback with interim transcription results
- Automatic keep-alive to prevent connection timeouts
- Fallback to batch transcription if streaming fails
To use streaming STT:
```json5
{
  sttProvider: "deepgram",
  streamingSTT: true, // default
  deepgram: {
    apiKey: "...",
    model: "nova-2",
  },
}
```

When enabled (default), the bot will immediately stop speaking if a user starts talking. This creates a more natural conversational flow where you can interrupt the bot.
To disable (let the bot finish speaking):
```json5
{
  bargeIn: false,
}
```

While the bot processes speech and generates a response, it can play a short looping sound. A default `thinking.mp3` is included in `assets/`. Configure via `thinkingSound`:
```json5
{
  thinkingSound: {
    enabled: true,
    path: "assets/thinking.mp3",
    volume: 0.7,
    stopDelayMs: 50,
  },
}
```

- `enabled`: `true` by default. Set to `false` to disable.
- `path`: Path to MP3 (relative to plugin root or absolute). Default `assets/thinking.mp3`.
- `volume`: 0–1, default `0.7`.
- `stopDelayMs`: Delay (ms) after stopping the thinking sound before playing the response. Default `50`. Range 0–500. Lower = snappier.
If the file is missing, no sound is played. Any short ambient or notification MP3 works (e.g. 2–5 seconds, looped).
The plugin includes automatic connection health monitoring:
- Heartbeat checks every 30 seconds (configurable)
- Auto-reconnect on disconnect with exponential backoff
- Max 3 attempts before giving up
If the connection drops, you'll see logs like:
```
[discord-voice] Disconnected from voice channel
[discord-voice] Reconnection attempt 1/3
[discord-voice] Reconnected successfully
```
- low: Picks up quiet speech, may trigger on background noise
- medium: Balanced (recommended)
- high: Requires louder, clearer speech
If you see this when processing voice input, set `openclawRoot` in your plugin config to the directory that contains `dist/extensionAPI.js`:

```json5
{
  plugins: {
    entries: {
      "discord-voice": {
        enabled: true,
        config: {
          openclawRoot: "/home/openclaw-user/.openclaw/extensions/discord-voice/node_modules/openclaw",
        },
      },
    },
  },
}
```

Example: if your `extensionAPI.js` is at `.../discord-voice/node_modules/openclaw/dist/extensionAPI.js`, use the `node_modules/openclaw` path (the directory containing `dist/`). Alternatively, set the `OPENCLAW_ROOT` environment variable.
Ensure the Discord channel is configured and the bot is connected before using voice.
This can happen with a corrupted or incomplete install. In the plugin directory, run:
```sh
cd ~/.openclaw/extensions/discord-voice
rm -rf node_modules package-lock.json
npm install
```

Then restart the gateway.
Install build tools:
```sh
npm install -g node-gyp
npm rebuild @discordjs/opus sodium-native
```

- Check bot has Connect + Speak permissions
- Check bot isn't server muted
- Verify TTS API key is valid
- Check STT API key is valid
- Check audio is being recorded (see debug logs)
- Try adjusting VAD sensitivity
```sh
DEBUG=discord-voice clawdbot gateway start
```

| Variable | Description |
|---|---|
| `OPENCLAW_ROOT` | OpenClaw package root (if auto-detection fails) |
| `OPENAI_API_KEY` | OpenAI API key (Whisper + TTS) |
| `ELEVENLABS_API_KEY` | ElevenLabs API key |
| `DEEPGRAM_API_KEY` | Deepgram API key |
- Only one voice channel per guild at a time
- Maximum recording length: 30 seconds (configurable)
- Requires stable network for real-time audio
- TTS output may have slight delay due to synthesis
This plugin targets OpenClaw (formerly Clawdbot). It uses the same core bridge pattern as the official voice-call plugin: it loads the agent API from OpenClaw's `dist/extensionAPI.js`. The plugin is discovered via `openclaw.extensions` in `package.json` and `openclaw.plugin.json`.
If auto-detection fails, set `openclawRoot` in the plugin config (see Troubleshooting) or `OPENCLAW_ROOT` to the directory containing `dist/extensionAPI.js` (e.g. `.../discord-voice/node_modules/openclaw`).
MIT