feat(voice): audio transcription for voice messages with Local Whisper support #1476
Conversation
Discord voice messages are `.ogg` files, but agentscope's `OpenAIChatFormatter` only accepts wav and mp3 extensions, causing "Unsupported audio file extension" errors. This fix:
- Adds `.ogg`, `.flac`, `.m4a`, `.aac` to the `_media_type_from_path` mapping
- Adds `_convert_audio_to_wav()` using ffmpeg to convert unsupported audio formats before sending to the formatter
- Gracefully falls back to the original file if ffmpeg is unavailable
https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Discord/Telegram voice messages (.ogg) fail because the agentscope
OpenAIChatFormatter only accepts wav/mp3, and most models (Ollama/Qwen)
can't process audio at all.
Add a transcription layer using the OpenAI-compatible
/v1/audio/transcriptions endpoint (which accepts ogg natively):
- New audio_transcription.py: finds an OpenAI-compatible provider
and transcribes audio via whisper-1
- New audio_mode config ("auto"|"transcribe"|"native"):
- auto (default): try transcription, fall back to native audio
- transcribe: always convert audio to text
- native: send audio blocks directly (needs ffmpeg for ogg→wav)
- Refactor message_processing.py to route audio blocks through
the appropriate path based on config
https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
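The routing this commit describes might be sketched as follows. The block shapes, the `transcribe` callback, and the placeholder text are illustrative assumptions, not the actual `message_processing.py` implementation:

```python
from typing import Callable, Optional


def route_audio_block(
    audio_mode: str,
    local_path: str,
    transcribe: Callable[[str], Optional[str]],
) -> dict:
    """Turn an audio block into a text block (transcription) or a native
    audio block, depending on the configured audio_mode."""
    if audio_mode in ("auto", "transcribe"):
        text = transcribe(local_path)
        if text is not None:
            return {"type": "text", "text": text}
        if audio_mode == "transcribe":
            # transcribe mode never falls back to native audio
            return {"type": "text", "text": "(transcription unavailable)"}
    # native mode, or auto mode with no usable transcription provider
    return {"type": "audio", "path": local_path}
```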
Add user-facing configuration for audio_mode (auto/transcribe/native):
- API: GET/PUT `/agent/audio-mode` endpoints
- CLI: audio mode prompt in `copaw init`
- Console: new Voice Transcription settings page under Settings
- i18n: English and Chinese translations
https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Code Review
This pull request introduces a significant feature for handling audio messages, particularly .ogg files from Discord, by adding transcription capabilities. It adds a new audio_mode configuration to control whether audio is transcribed, sent natively, or handled automatically. The changes span the backend, CLI, and the console UI, including a new settings page.
The implementation is well-structured. I've provided a few suggestions to improve maintainability and debuggability:
- Making the transcription model name configurable instead of hardcoding it.
- Improving the extensibility of finding transcription providers.
- Enhancing error logging for ffmpeg conversion failures and for API calls in the frontend.
Overall, this is a great addition that significantly improves the agent's ability to handle multimedia messages.
Pull request overview
Adds configurable handling for incoming voice/audio messages (notably Discord .ogg) by introducing auto transcription via an OpenAI-compatible Whisper endpoint, plus UI/CLI/API surfaces to configure behavior.
Changes:
- Introduces `agents.audio_mode` (auto/transcribe/native) and exposes it via the CLI init flow and new agent API endpoints.
- Updates message media processing to support more audio MIME types, attempt transcription, and optionally convert audio to `.wav` via ffmpeg for native-audio forwarding.
- Adds a Console UI settings page (plus navigation + i18n) for configuring voice transcription/audio handling.
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds agents.audio_mode config field. |
| src/copaw/cli/init_cmd.py | Prompts for audio_mode during interactive init. |
| src/copaw/app/routers/agent.py | Adds GET/PUT /agent/audio-mode endpoints. |
| src/copaw/agents/utils/message_processing.py | Adds audio transcription flow and ffmpeg .ogg→.wav conversion fallback; expands audio MIME mapping. |
| src/copaw/agents/utils/audio_transcription.py | New utility to call OpenAI-compatible /v1/audio/transcriptions. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New settings page to view/update audio mode. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styles for the new settings page. |
| console/src/locales/en.json | Adds nav label + page strings for Voice Transcription. |
| console/src/locales/zh.json | Adds nav label + page strings for Voice Transcription. |
| console/src/layouts/Sidebar.tsx | Adds sidebar entry + route mapping for Voice Transcription. |
| console/src/layouts/MainLayout/index.tsx | Adds route + selection mapping for Voice Transcription page. |
| console/src/api/modules/agent.ts | Adds getAudioMode / updateAudioMode API calls. |
- Add `console.error()` logging in frontend catch blocks for debuggability
- Add Japanese (ja.json) and Russian (ru.json) translations for the voiceTranscription nav key and settings page strings
- Include ffmpeg stderr output in audio conversion error logs
https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Pull request overview
Adds configurable handling for incoming voice/audio messages so Discord .ogg voice notes can work across model backends by transcribing to text via an OpenAI-compatible Whisper endpoint, with an optional ffmpeg conversion path for native-audio models.
Changes:
- Introduces `agents.audio_mode` (auto/transcribe/native), surfaced via config and `copaw init`.
- Adds `/agent/audio-mode` GET/PUT endpoints and a Console settings page to view/update the setting.
- Extends message media processing to (a) transcribe audio to text and (b) optionally convert unsupported audio formats to `.wav` with ffmpeg.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds agents.audio_mode config field. |
| src/copaw/cli/init_cmd.py | Prompts for audio mode during interactive init. |
| src/copaw/app/routers/agent.py | Adds API endpoints to get/set audio mode. |
| src/copaw/agents/utils/message_processing.py | Implements audio transcription + ffmpeg conversion behavior for audio blocks. |
| src/copaw/agents/utils/audio_transcription.py | New Whisper transcription helper using OpenAI-compatible /v1/audio/transcriptions. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New Console UI page for selecting audio mode. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styles for the new settings page. |
| console/src/api/modules/agent.ts | Adds getAudioMode / updateAudioMode client calls. |
| console/src/layouts/Sidebar.tsx | Adds nav entry/route mapping for Voice Transcription. |
| console/src/layouts/MainLayout/index.tsx | Wires the new route to the new page. |
| console/src/locales/en.json | Adds nav + page strings. |
| console/src/locales/zh.json | Adds nav + page strings. |
| console/src/locales/ru.json | Adds nav + page strings. |
| console/src/locales/ja.json | Adds nav + page strings. |
| console/src/pages/Settings/Models/components/modals/ProviderConfigModal.tsx | Formatting-only change. |
| console/src/pages/Settings/Models/components/cards/RemoteProviderCard.tsx | Formatting-only change. |
| console/src/pages/Control/Sessions/index.tsx | Formatting-only change. |
| console/src/components/MarkdownCopy/MarkdownCopy.tsx | Formatting-only change. |
- Wrap `_convert_audio_to_wav` calls with `asyncio.to_thread` so the blocking subprocess (up to a 30 s timeout) doesn't stall the event loop
- Change `AgentsConfig.audio_mode` from `str` to `Literal["auto", "transcribe", "native"]` for load-time validation
https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
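Both fixes in this commit can be illustrated with a small sketch. `slow_convert` is a stand-in for the real blocking ffmpeg call, not the project's function:

```python
import asyncio
import time
from typing import Literal

# Literal narrows the config field so invalid values fail at load time
# instead of surfacing later as silent misbehavior.
AudioMode = Literal["auto", "transcribe", "native"]


def slow_convert(path: str) -> str:
    """Stand-in for the blocking ffmpeg subprocess (up to 30 s)."""
    time.sleep(0.05)  # simulate blocking work
    return path.rsplit(".", 1)[0] + ".wav"


async def process_audio(path: str) -> str:
    # Run the blocking conversion in a worker thread so the event loop
    # keeps serving other messages while ffmpeg runs.
    return await asyncio.to_thread(slow_convert, path)


result = asyncio.run(process_audio("voice.ogg"))
```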
Pull request overview
Adds configurable handling for incoming voice/audio messages (notably Discord .ogg) by introducing transcription support and exposing an audio_mode setting across backend, CLI, API, and Console UI.
Changes:
- Add `audio_mode` config (auto/transcribe/native) and expose it via CLI init prompts and new FastAPI endpoints.
- Implement audio block processing that can transcribe via an OpenAI-compatible Whisper endpoint and optionally convert audio via ffmpeg.
- Add Console UI settings page + i18n strings for "Voice Transcription".
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds AgentsConfig.audio_mode setting. |
| src/copaw/cli/init_cmd.py | Prompts for audio_mode during copaw init interactive flow. |
| src/copaw/app/routers/agent.py | Adds GET/PUT /agent/audio-mode endpoints. |
| src/copaw/agents/utils/message_processing.py | Adds transcription/conversion pipeline for audio blocks and expands supported audio media types. |
| src/copaw/agents/utils/audio_transcription.py | New utility to transcribe audio via OpenAI-compatible /audio/transcriptions. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New settings page to view/edit audio mode. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styling for the new settings page. |
| console/src/pages/Settings/Models/components/modals/ProviderConfigModal.tsx | Formatting-only change. |
| console/src/pages/Settings/Models/components/cards/RemoteProviderCard.tsx | Formatting-only change. |
| console/src/pages/Control/Sessions/index.tsx | Formatting-only change. |
| console/src/locales/en.json | Adds nav + page strings for Voice Transcription. |
| console/src/locales/zh.json | Adds nav + page strings for Voice Transcription. |
| console/src/locales/ru.json | Adds nav + page strings for Voice Transcription. |
| console/src/locales/ja.json | Adds nav + page strings for Voice Transcription. |
| console/src/layouts/Sidebar.tsx | Adds sidebar entry/route mapping for Voice Transcription. |
| console/src/layouts/MainLayout/index.tsx | Adds route for the Voice Transcription page. |
| console/src/components/MarkdownCopy/MarkdownCopy.tsx | Formatting-only change. |
| console/src/api/modules/agent.ts | Adds getAudioMode / updateAudioMode API calls. |
Comments suppressed due to low confidence (1)
src/copaw/agents/utils/message_processing.py:343
- For audio blocks that get transcribed/replaced with a text block (or replaced with the "(transcription unavailable)" placeholder), `_process_single_block` still returns `local_path`. This later triggers `process_file_and_media_blocks_in_message` to insert a "User uploaded a file, downloaded to …" text block, which is noisy and can leak local filesystem paths even though the model no longer needs the audio file. Consider returning `None` (or a separate flag) when the audio block is converted to text, and only adding entries to `downloaded_files` for blocks that remain as downloadable media.
```python
"Updated %s block with local path: %s",
block_type,
local_path,
)
return local_path
```
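The fix the reviewer suggests could look like this sketch. The function name and block shape are illustrative, not the project's actual types:

```python
from typing import Optional


def downloaded_path_for_block(block: dict, local_path: str) -> Optional[str]:
    """Record a path in downloaded_files only for blocks that remain
    downloadable media. Transcribed audio was replaced by a text block,
    so returning None avoids the noisy "User uploaded a file, downloaded
    to ..." message and keeps local paths out of the prompt."""
    if block.get("type") == "text":
        # The audio became a transcription (or a placeholder); the model
        # no longer needs the underlying file.
        return None
    return local_path
```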
Remove formatting-only diffs in files not related to voice transcription to keep the PR focused. https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Show an ffmpeg installation status alert when native audio mode is selected, similar to the existing dependency check for Local Whisper. Helps users verify ffmpeg is available before selecting this mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds end-to-end voice message handling by introducing configurable audio handling modes and transcription backends (remote Whisper API or local openai-whisper), plus Console UI and API endpoints to configure and monitor the feature.
Changes:
- Add agent config + CLI init prompts for audio mode and transcription provider type.
- Implement audio block handling in message processing (transcribe in `auto`, send audio in `native` with ffmpeg conversion fallback).
- Add backend API endpoints and Console settings page (incl. Local Whisper dependency checks) for voice transcription configuration.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds audio_mode + transcription-related config fields. |
| src/copaw/cli/init_cmd.py | Adds interactive prompts for audio mode and transcription provider type. |
| src/copaw/app/routers/agent.py | Adds REST endpoints for audio/transcription settings and status checks. |
| src/copaw/app/channels/dingtalk/content_utils.py | Emits AudioContent for DingTalk voice messages. |
| src/copaw/app/channels/dingtalk/constants.py | Maps DingTalk voice type to audio. |
| src/copaw/app/channels/base.py | Lets audio-only messages bypass no-text debounce. |
| src/copaw/agents/utils/message_processing.py | Implements transcription/native-audio handling and ffmpeg conversion for audio blocks. |
| src/copaw/agents/utils/audio_transcription.py | New utility implementing Whisper API + Local Whisper transcription flows. |
| pyproject.toml | Adds optional dependency extra copaw[whisper]. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New Console settings page to configure voice transcription and check provider status. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styling for the new settings page. |
| console/src/layouts/Sidebar.tsx | Adds navigation entry for Voice Transcription settings. |
| console/src/layouts/MainLayout/index.tsx | Adds route for /voice-transcription. |
| console/src/api/modules/agent.ts | Adds API client methods for new backend endpoints. |
| console/src/locales/en.json | Adds nav label + full Voice Transcription translations. |
| console/src/locales/zh.json | Adds nav label + full Voice Transcription translations. |
| console/src/locales/ru.json | Adds nav label + full Voice Transcription translations. |
| console/src/locales/ja.json | Adds nav label + full Voice Transcription translations. |
Ensure OpenAI-compatible provider base URLs end with /v1 before passing to AsyncOpenAI. Fixes transcription failures for providers configured without the /v1 suffix (e.g. DeepSeek). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
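The fix in this commit amounts to a small normalization step before constructing the client. This is a sketch; the actual helper name in the codebase is an assumption:

```python
def ensure_v1_suffix(base_url: str) -> str:
    """Normalize an OpenAI-compatible base URL so it ends with /v1,
    letting AsyncOpenAI resolve the /audio/transcriptions route even for
    providers configured without the suffix (e.g. DeepSeek)."""
    url = base_url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```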
Pull request overview
This PR adds end-to-end voice message support by introducing configurable audio handling (transcribe vs native audio), a transcription backend abstraction (remote Whisper API or local openai-whisper), backend APIs to manage these settings, and a Console UI settings page to configure and validate the setup.
Changes:
- Add agent configuration for audio handling mode and transcription backend selection.
- Implement audio block processing: secure local-path handling, optional Whisper transcription, and ffmpeg-based conversion for native audio.
- Add Console settings page + backend API endpoints to manage and inspect voice transcription configuration/status.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds config schema fields for audio_mode and transcription settings. |
| src/copaw/cli/init_cmd.py | Adds copaw init prompts for audio mode and transcription provider type. |
| src/copaw/app/routers/agent.py | Adds REST endpoints to get/set audio mode, transcription provider type/provider ID, and local whisper dependency status. |
| src/copaw/app/channels/dingtalk/content_utils.py | Switches DingTalk voice to an AudioContent block (vs file). |
| src/copaw/app/channels/dingtalk/constants.py | Maps DingTalk voice messages to the audio content type. |
| src/copaw/app/channels/base.py | Bypasses no-text debounce buffering for messages containing audio blocks. |
| src/copaw/agents/utils/message_processing.py | Implements audio-mode aware audio processing (transcribe vs native + conversion) and expands media allowlist roots. |
| src/copaw/agents/utils/audio_transcription.py | New transcription utility supporting Whisper API and local openai-whisper. |
| pyproject.toml | Adds optional whisper extra and includes it in full. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New Console settings page for voice transcription configuration and status. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styling for the new settings page. |
| console/src/locales/{en,zh,ru,ja}.json | Adds UI strings for Voice Transcription page + nav entry. |
| console/src/layouts/Sidebar.tsx | Adds navigation entry for Voice Transcription. |
| console/src/layouts/MainLayout/index.tsx | Adds route for the Voice Transcription settings page. |
| console/src/api/modules/agent.ts | Adds client functions for the new backend endpoints. |
Pull request overview
Adds configurable voice-message handling so audio attachments (e.g., Discord/Telegram .ogg) can work across model backends by either transcribing to text (remote Whisper API or local openai-whisper) or sending audio natively (with ffmpeg conversion), plus Console UI and API/CLI configuration surfaces.
Changes:
- Introduces `audio_mode` and transcription provider configuration in backend config + CLI init prompts.
- Adds backend audio block processing (transcription and ffmpeg conversion) and a new `audio_transcription` utility.
- Adds Console Settings page + navigation + API bindings for configuring voice transcription.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/copaw/config/config.py | Adds new agent config fields for audio mode + transcription provider selection. |
| src/copaw/cli/init_cmd.py | Prompts for audio mode and transcription provider during copaw init. |
| src/copaw/app/routers/agent.py | Adds REST endpoints for reading/updating audio/transcription settings and status. |
| src/copaw/app/channels/dingtalk/content_utils.py | Switches DingTalk voice payloads to runtime AudioContent. |
| src/copaw/app/channels/dingtalk/constants.py | Maps DingTalk voice message type to audio. |
| src/copaw/app/channels/base.py | Bypasses no-text debounce for audio-only messages so voice messages are processed immediately. |
| src/copaw/agents/utils/message_processing.py | Adds audio-specific processing: transcription in auto mode and ffmpeg conversion + native send in native mode. |
| src/copaw/agents/utils/audio_transcription.py | New module providing Whisper API and local-whisper transcription backends + provider listing/status helpers. |
| pyproject.toml | Adds copaw[whisper] extra and includes it in copaw[full]. |
| console/src/pages/Settings/VoiceTranscription/index.tsx | New Console settings page for audio mode and transcription provider configuration. |
| console/src/pages/Settings/VoiceTranscription/index.module.less | Styles for the new settings page. |
| console/src/locales/en.json | Adds navigation label and page strings for Voice Transcription settings. |
| console/src/locales/zh.json | Adds navigation label and page strings for Voice Transcription settings. |
| console/src/locales/ru.json | Adds navigation label and page strings for Voice Transcription settings. |
| console/src/locales/ja.json | Adds navigation label and page strings for Voice Transcription settings. |
| console/src/layouts/Sidebar.tsx | Adds Settings nav entry for Voice Transcription. |
| console/src/layouts/MainLayout/index.tsx | Adds /voice-transcription route to render the new page. |
| console/src/api/modules/agent.ts | Adds frontend API wrappers for the new backend endpoints. |
```python
return AudioContent(
    type=ContentType.AUDIO,
    data=url,
    format="amr",
```

Description

Voice messages from channels (Discord, Telegram, DingTalk, etc.) are `.ogg` files, which most LLM backends cannot process directly. This PR adds audio transcription support so voice messages work with all model backends, along with a configurable audio handling mode and a Console UI settings page.

Key changes:
- Transcription via `/v1/audio/transcriptions` (Whisper API) or the locally installed `openai-whisper` library
- Configurable audio handling mode: `auto` (transcribe if a provider is available, else show a file placeholder) and `native` (send audio directly to the model)
- `openai-whisper` available as an optional dependency via `copaw[whisper]`
- `copaw init` prompts for audio mode and transcription provider

Related Issue: Fixes Discord voice message support
Security Considerations: N/A
Type of Change
Component(s) Affected
Checklist
- Ran `pre-commit run --all-files` locally and it passes
- Added/updated tests (`pytest` or as relevant) and they pass

Testing
- `audio_mode: "native"` with ffmpeg installed → audio is converted to wav and sent natively to the model
- `copaw init` → verify audio mode and transcription provider prompts appear

Local Verification Evidence