
feat(voice): audio transcription for voice messages with Local Whisper support#1476

Merged
xieyxclack merged 25 commits into agentscope-ai:main from ekzhu:claude/fix-discord-voice-messages-NVPD1
Mar 17, 2026
Conversation

@ekzhu
Collaborator

@ekzhu ekzhu commented Mar 14, 2026

Description

Voice messages from channels (Discord, Telegram, DingTalk, etc.) are .ogg files, which most LLM backends cannot process directly. This PR adds audio transcription support so voice messages work with all model backends, along with a configurable audio handling mode and a Console UI settings page.

Screenshot 2026-03-16 10:11:49

Key changes:

  • Audio transcription via OpenAI-compatible /v1/audio/transcriptions (Whisper API) or locally installed openai-whisper library
  • Two audio modes: auto (transcribe if provider available, else show file placeholder) and native (send audio directly to model)
  • Transcription provider selection: Disabled, Whisper API (remote) or Local Whisper (requires ffmpeg + openai-whisper)
  • openai-whisper available as optional dependency via copaw[whisper]
  • Console UI settings page for Voice Transcription with provider status checks
  • CLI copaw init prompts for audio mode and transcription provider
  • ffmpeg-based ogg→wav conversion for native audio mode

Related Issue: Fixes Discord voice message support

Security Considerations: N/A

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Refactoring

Component(s) Affected

  • Core / Backend (app, agents, config, providers, utils, local_models)
  • Console (frontend web UI)
  • Channels (DingTalk, Feishu, QQ, Discord, iMessage, etc.)
  • Skills
  • CLI
  • Documentation (website)
  • Tests
  • CI/CD
  • Scripts / Deploy

Checklist

  • I ran pre-commit run --all-files locally and it passes
  • If pre-commit auto-fixed files, I committed those changes and reran checks
  • I ran tests locally (pytest or as relevant) and they pass
  • Documentation updated (if needed)
  • Ready for review

Testing

  1. Send a voice message (.ogg) from Discord/Telegram with an OpenAI provider configured → voice is transcribed to text via Whisper API
  2. Send a voice message with transcription disabled or no provider configured → a file-uploaded placeholder is shown to the model
  3. Set audio_mode: "native" with ffmpeg installed → audio is converted to wav and sent natively to model
  4. Set transcription provider type to Local Whisper with ffmpeg + openai-whisper installed → transcription runs locally
  5. Open Console UI → Settings → Voice Transcription → verify audio mode toggle, provider type selection, provider picker, and Local Whisper status all work
  6. Run copaw init → verify audio mode and transcription provider prompts appear
  7. Existing text and image messages continue to work normally

Local Verification Evidence

pre-commit run --all-files   # All passed
npm run format               # All formatted

claude added 3 commits March 14, 2026 04:43
Discord voice messages are .ogg files, but agentscope's
OpenAIChatFormatter only accepts wav and mp3 extensions,
causing "Unsupported audio file extension" errors. This fix:

- Adds .ogg, .flac, .m4a, .aac to _media_type_from_path mapping
- Adds _convert_audio_to_wav() using ffmpeg to convert unsupported
  audio formats before sending to the formatter
- Gracefully falls back to original file if ffmpeg is unavailable

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Discord/Telegram voice messages (.ogg) fail because the agentscope
OpenAIChatFormatter only accepts wav/mp3, and most models (Ollama/Qwen)
can't process audio at all.

Add a transcription layer using the OpenAI-compatible
/v1/audio/transcriptions endpoint (which accepts ogg natively):

- New audio_transcription.py: finds an OpenAI-compatible provider
  and transcribes audio via whisper-1
- New audio_mode config ("auto"|"transcribe"|"native"):
  - auto (default): try transcription, fall back to native audio
  - transcribe: always convert audio to text
  - native: send audio blocks directly (needs ffmpeg for ogg→wav)
- Refactor message_processing.py to route audio blocks through
  the appropriate path based on config

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
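The transcription call shape against an OpenAI-compatible /v1/audio/transcriptions endpoint is sketched below. The `client.audio.transcriptions.create` call matches the official `openai` SDK; the client is injected here so the flow can be exercised without network access, and the wrapper itself is hypothetical:

```python
def transcribe_file(client, audio_path: str, model: str = "whisper-1") -> str:
    """Transcribe an audio file via an OpenAI-compatible client.

    `client` is expected to expose `client.audio.transcriptions.create`,
    as the official `openai` SDK's AsyncOpenAI/OpenAI clients do.
    The endpoint accepts .ogg uploads natively, which is what makes
    this path work for Discord/Telegram voice notes.
    """
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model=model, file=f)
    return result.text.strip()
```

In production the client would be built from the discovered provider's base URL and API key.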
Add user-facing configuration for audio_mode (auto/transcribe/native):
- API: GET/PUT /agent/audio-mode endpoints
- CLI: audio mode prompt in `copaw init`
- Console: new Voice Transcription settings page under Settings
- i18n: English and Chinese translations

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Copilot AI review requested due to automatic review settings March 14, 2026 06:16
@ekzhu ekzhu had a problem deploying to maintainer-approved March 14, 2026 06:16 — with GitHub Actions Failure
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's ability to process incoming audio messages, particularly addressing Discord's .ogg format. It introduces a flexible audio handling mechanism that allows users to choose between automatic transcription, forced transcription, or native audio processing with optional format conversion. This ensures broader compatibility with different AI models and improves the overall user experience for voice interactions.

Highlights

  • Audio Transcription via OpenAI-compatible API: Implemented audio transcription using OpenAI-compatible /v1/audio/transcriptions (Whisper) to enable voice messages to work with all model backends, addressing the issue of .ogg files from Discord.
  • Configurable Audio Mode: Introduced a new configurable audio_mode setting (auto/transcribe/native) which is exposed through the configuration, CLI, API, and a new Console UI page.
  • FFmpeg-based OGG to WAV Conversion: Added fallback FFmpeg-based .ogg to .wav conversion for models that natively support audio, providing a robust solution for various audio handling scenarios.


Changelog
  • console/src/api/modules/agent.ts
    • Added getAudioMode and updateAudioMode API functions for managing audio processing settings.
  • console/src/layouts/MainLayout/index.tsx
    • Imported VoiceTranscriptionPage and added a new route for it under settings.
  • console/src/layouts/Sidebar.tsx
    • Added a new 'Voice Transcription' entry to the sidebar navigation with a microphone icon.
  • console/src/locales/en.json
    • Added new localization keys and values for the 'Voice Transcription' settings page in English.
  • console/src/locales/zh.json
    • Added new localization keys and values for the 'Voice Transcription' settings page in Chinese.
  • console/src/pages/Settings/VoiceTranscription/index.module.less
    • Added new CSS styles for the Voice Transcription settings page layout and components.
  • console/src/pages/Settings/VoiceTranscription/index.tsx
    • Added a new React component for the Voice Transcription settings page, allowing users to configure audio mode.
  • src/copaw/agents/utils/audio_transcription.py
    • Added a new utility module for transcribing audio files using OpenAI-compatible API endpoints, including provider discovery logic.
  • src/copaw/agents/utils/message_processing.py
    • Expanded supported audio media types to include .ogg, .flac, .m4a, and .aac.
    • Added a new function _convert_audio_to_wav to convert audio files to WAV format using FFmpeg if not natively supported.
    • Introduced _process_audio_block to handle audio blocks based on the configured audio_mode, integrating transcription and conversion logic.
    • Modified _process_single_block to dispatch audio blocks to the new _process_audio_block function.
  • src/copaw/app/routers/agent.py
    • Added new GET and PUT API endpoints (/audio-mode) to retrieve and update the agent's audio handling mode.
  • src/copaw/cli/init_cmd.py
    • Added an interactive prompt during CLI initialization to allow users to select their preferred audio mode.
  • src/copaw/config/config.py
    • Added an audio_mode field to the AgentsConfig class with a default value of 'auto' and a description of its purpose.
Activity
  • The author has indicated that pre-commit run --all-files and pytest are pending, suggesting local verification is still in progress.
  • No other review comments or activity have been recorded yet.

@ekzhu ekzhu had a problem deploying to maintainer-approved March 14, 2026 06:19 — with GitHub Actions Failure
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant feature for handling audio messages, particularly .ogg files from Discord, by adding transcription capabilities. It adds a new audio_mode configuration to control whether audio is transcribed, sent natively, or handled automatically. The changes span the backend, CLI, and the console UI, including a new settings page.

The implementation is well-structured. I've provided a few suggestions to improve maintainability and debuggability:

  • Making the transcription model name configurable instead of hardcoding it.
  • Improving the extensibility of finding transcription providers.
  • Enhancing error logging for ffmpeg conversion failures and for API calls in the frontend.

Overall, this is a great addition that significantly improves the agent's ability to handle multimedia messages.

Comment thread console/src/pages/Settings/VoiceTranscription/index.tsx Outdated
Comment thread console/src/pages/Settings/VoiceTranscription/index.tsx Outdated
Comment thread src/copaw/agents/utils/audio_transcription.py Outdated
Comment thread src/copaw/agents/utils/audio_transcription.py Outdated
Comment thread src/copaw/agents/utils/message_processing.py
Contributor

Copilot AI left a comment


Pull request overview

Adds configurable handling for incoming voice/audio messages (notably Discord .ogg) by introducing auto transcription via an OpenAI-compatible Whisper endpoint, plus UI/CLI/API surfaces to configure behavior.

Changes:

  • Introduces agents.audio_mode (auto / transcribe / native) and exposes it via CLI init flow and new agent API endpoints.
  • Updates message media processing to support more audio MIME types, attempt transcription, and optionally convert audio to .wav via ffmpeg for native-audio forwarding.
  • Adds a Console UI settings page (plus navigation + i18n) for configuring voice transcription/audio handling.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds agents.audio_mode config field.
src/copaw/cli/init_cmd.py Prompts for audio_mode during interactive init.
src/copaw/app/routers/agent.py Adds GET/PUT /agent/audio-mode endpoints.
src/copaw/agents/utils/message_processing.py Adds audio transcription flow and ffmpeg .ogg→.wav conversion fallback; expands audio MIME mapping.
src/copaw/agents/utils/audio_transcription.py New utility to call OpenAI-compatible /v1/audio/transcriptions.
console/src/pages/Settings/VoiceTranscription/index.tsx New settings page to view/update audio mode.
console/src/pages/Settings/VoiceTranscription/index.module.less Styles for the new settings page.
console/src/locales/en.json Adds nav label + page strings for Voice Transcription.
console/src/locales/zh.json Adds nav label + page strings for Voice Transcription.
console/src/layouts/Sidebar.tsx Adds sidebar entry + route mapping for Voice Transcription.
console/src/layouts/MainLayout/index.tsx Adds route + selection mapping for Voice Transcription page.
console/src/api/modules/agent.ts Adds getAudioMode / updateAudioMode API calls.


Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread console/src/layouts/Sidebar.tsx
Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/agents/utils/audio_transcription.py Outdated
Comment thread src/copaw/agents/utils/audio_transcription.py Outdated
Comment thread src/copaw/config/config.py Outdated
@ekzhu ekzhu had a problem deploying to maintainer-approved March 14, 2026 06:35 — with GitHub Actions Failure
- Add console.error() logging in frontend catch blocks for debuggability
- Add Japanese (ja.json) and Russian (ru.json) translations for
  voiceTranscription nav key and settings page strings
- Include ffmpeg stderr output in audio conversion error logs

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
Copilot AI review requested due to automatic review settings March 14, 2026 09:46
@ekzhu ekzhu had a problem deploying to maintainer-approved March 14, 2026 09:46 — with GitHub Actions Failure
Contributor

Copilot AI left a comment


Pull request overview

Adds configurable handling for incoming voice/audio messages so Discord .ogg voice notes can work across model backends by transcribing to text via an OpenAI-compatible Whisper endpoint, with an optional ffmpeg conversion path for native-audio models.

Changes:

  • Introduces agents.audio_mode (auto/transcribe/native) surfaced via config and copaw init.
  • Adds /agent/audio-mode GET/PUT endpoints and a Console settings page to view/update the setting.
  • Extends message media processing to (a) transcribe audio to text and (b) optionally convert unsupported audio formats to .wav with ffmpeg.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds agents.audio_mode config field.
src/copaw/cli/init_cmd.py Prompts for audio mode during interactive init.
src/copaw/app/routers/agent.py Adds API endpoints to get/set audio mode.
src/copaw/agents/utils/message_processing.py Implements audio transcription + ffmpeg conversion behavior for audio blocks.
src/copaw/agents/utils/audio_transcription.py New Whisper transcription helper using OpenAI-compatible /v1/audio/transcriptions.
console/src/pages/Settings/VoiceTranscription/index.tsx New Console UI page for selecting audio mode.
console/src/pages/Settings/VoiceTranscription/index.module.less Styles for the new settings page.
console/src/api/modules/agent.ts Adds getAudioMode / updateAudioMode client calls.
console/src/layouts/Sidebar.tsx Adds nav entry/route mapping for Voice Transcription.
console/src/layouts/MainLayout/index.tsx Wires the new route to the new page.
console/src/locales/en.json Adds nav + page strings.
console/src/locales/zh.json Adds nav + page strings.
console/src/locales/ru.json Adds nav + page strings.
console/src/locales/ja.json Adds nav + page strings.
console/src/pages/Settings/Models/components/modals/ProviderConfigModal.tsx Formatting-only change.
console/src/pages/Settings/Models/components/cards/RemoteProviderCard.tsx Formatting-only change.
console/src/pages/Control/Sessions/index.tsx Formatting-only change.
console/src/components/MarkdownCopy/MarkdownCopy.tsx Formatting-only change.


Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/config/config.py Outdated
- Wrap _convert_audio_to_wav calls with asyncio.to_thread so the
  blocking subprocess doesn't stall the event loop (up to 30s timeout)
- Change AgentsConfig.audio_mode from str to
  Literal["auto", "transcribe", "native"] for load-time validation

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
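The two fixes in this commit can be sketched together: a `Literal` type gives load-time validation of the mode string, and `asyncio.to_thread` keeps the blocking ffmpeg subprocess off the event loop. The conversion stand-in below is hypothetical; only the call shapes are the point:

```python
import asyncio
from typing import Literal, get_args

# Load-time-validated audio mode, as the commit describes.
AudioMode = Literal["auto", "transcribe", "native"]

def _blocking_convert(src: str) -> str:
    # Stand-in for the ffmpeg subprocess call (blocking, up to 30 s).
    return src.rsplit(".", 1)[0] + ".wav"

async def convert_audio(src: str) -> str:
    # Run the blocking conversion in a worker thread so the event
    # loop keeps serving other requests while ffmpeg runs.
    return await asyncio.to_thread(_blocking_convert, src)
```

With a plain `str` field, a typo like `"nativ"` would only surface at message-processing time; with the `Literal`, pydantic rejects it when the config is loaded.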
Copilot AI review requested due to automatic review settings March 15, 2026 08:00
@ekzhu ekzhu had a problem deploying to maintainer-approved March 15, 2026 08:00 — with GitHub Actions Failure
Contributor

Copilot AI left a comment


Pull request overview

Adds configurable handling for incoming voice/audio messages (notably Discord .ogg) by introducing transcription support and exposing an audio_mode setting across backend, CLI, API, and Console UI.

Changes:

  • Add audio_mode config (auto/transcribe/native) and expose it via CLI init prompts and new FastAPI endpoints.
  • Implement audio block processing that can transcribe via an OpenAI-compatible Whisper endpoint and optionally convert audio via ffmpeg.
  • Add Console UI settings page + i18n strings for “Voice Transcription”.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds AgentsConfig.audio_mode setting.
src/copaw/cli/init_cmd.py Prompts for audio_mode during copaw init interactive flow.
src/copaw/app/routers/agent.py Adds GET/PUT /agent/audio-mode endpoints.
src/copaw/agents/utils/message_processing.py Adds transcription/conversion pipeline for audio blocks and expands supported audio media types.
src/copaw/agents/utils/audio_transcription.py New utility to transcribe audio via OpenAI-compatible /audio/transcriptions.
console/src/pages/Settings/VoiceTranscription/index.tsx New settings page to view/edit audio mode.
console/src/pages/Settings/VoiceTranscription/index.module.less Styling for the new settings page.
console/src/pages/Settings/Models/components/modals/ProviderConfigModal.tsx Formatting-only change.
console/src/pages/Settings/Models/components/cards/RemoteProviderCard.tsx Formatting-only change.
console/src/pages/Control/Sessions/index.tsx Formatting-only change.
console/src/locales/en.json Adds nav + page strings for Voice Transcription.
console/src/locales/zh.json Adds nav + page strings for Voice Transcription.
console/src/locales/ru.json Adds nav + page strings for Voice Transcription.
console/src/locales/ja.json Adds nav + page strings for Voice Transcription.
console/src/layouts/Sidebar.tsx Adds sidebar entry/route mapping for Voice Transcription.
console/src/layouts/MainLayout/index.tsx Adds route for the Voice Transcription page.
console/src/components/MarkdownCopy/MarkdownCopy.tsx Formatting-only change.
console/src/api/modules/agent.ts Adds getAudioMode / updateAudioMode API calls.
Comments suppressed due to low confidence (1)

src/copaw/agents/utils/message_processing.py:343

  • For audio blocks that get transcribed/replaced with a text block (or replaced with the “(transcription unavailable)” placeholder), _process_single_block still returns local_path. This later triggers process_file_and_media_blocks_in_message to insert a “User uploaded a file, downloaded to …” text block, which is noisy and can leak local filesystem paths even though the model no longer needs the audio file. Consider returning None (or a separate flag) when the audio block is converted to text, and only adding entries to downloaded_files for blocks that remain as downloadable media.
                "Updated %s block with local path: %s",
                block_type,
                local_path,
            )
            return local_path


Comment thread src/copaw/agents/utils/message_processing.py Outdated
Comment thread src/copaw/app/routers/agent.py Outdated
Comment thread src/copaw/agents/utils/audio_transcription.py Outdated
Remove formatting-only diffs in files not related to voice
transcription to keep the PR focused.

https://claude.ai/code/session_01HG6R9iZT7aGtYpQvkk1Dtb
@ekzhu ekzhu had a problem deploying to maintainer-approved March 15, 2026 08:07 — with GitHub Actions Failure
@ekzhu ekzhu changed the title Fix Discord voice messages (.ogg) with auto-transcription (feat) Discord voice messages (.ogg) with auto-transcription Mar 15, 2026
@ekzhu ekzhu changed the title (feat) Discord voice messages (.ogg) with auto-transcription feat (channel) Discord voice messages (.ogg) with auto-transcription Mar 15, 2026
@ekzhu ekzhu requested review from rayrayraykk and xieyxclack March 16, 2026 09:21
@xieyxclack
Member

Minor suggestion: For native mode, consider adding a dependency-status check for ffmpeg similar to what local_whisper mode already has, so users can see whether ffmpeg is installed before selecting this mode.

Copilot AI review requested due to automatic review settings March 17, 2026 00:48
@ekzhu ekzhu had a problem deploying to maintainer-approved March 17, 2026 00:48 — with GitHub Actions Failure
Show an ffmpeg installation status alert when native audio mode is
selected, similar to the existing dependency check for Local Whisper.
Helps users verify ffmpeg is available before selecting this mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
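The ffmpeg dependency check this commit adds can be as simple as a PATH lookup, reported in a shape the Console can render as an installed/missing alert (the function and dict keys here are illustrative, not the actual endpoint payload):

```python
import shutil

def ffmpeg_status() -> dict:
    """Report whether ffmpeg is on PATH, for a dependency-status alert.

    shutil.which returns the resolved executable path, or None when
    the binary is not installed/visible to the server process.
    """
    path = shutil.which("ffmpeg")
    return {"installed": path is not None, "path": path}
```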
@ekzhu ekzhu had a problem deploying to maintainer-approved March 17, 2026 00:52 — with GitHub Actions Failure
Contributor

Copilot AI left a comment


Pull request overview

Adds end-to-end voice message handling by introducing configurable audio handling modes and transcription backends (remote Whisper API or local openai-whisper), plus Console UI and API endpoints to configure and monitor the feature.

Changes:

  • Add agent config + CLI init prompts for audio mode and transcription provider type.
  • Implement audio block handling in message processing (transcribe in auto, send audio in native with ffmpeg conversion fallback).
  • Add backend API endpoints and Console settings page (incl. Local Whisper dependency checks) for voice transcription configuration.
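For the local backend, the openai-whisper library loads a model once (`whisper.load_model("base")`, which requires ffmpeg) and then calls `model.transcribe(path)`, which returns a dict with a `"text"` key. A thin wrapper with the model injected keeps the logic testable without the heavy dependency; the wrapper itself is a sketch, not the PR's actual helper:

```python
def local_transcribe(model, audio_path: str) -> str:
    """Transcribe with a locally loaded openai-whisper model.

    In the real flow `model` would come from whisper.load_model(...);
    it is injected here so the surrounding logic can be exercised
    with a stub. openai-whisper's transcribe() returns a dict whose
    "text" field holds the full transcript.
    """
    result = model.transcribe(audio_path)
    return result["text"].strip()
```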

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds audio_mode + transcription-related config fields.
src/copaw/cli/init_cmd.py Adds interactive prompts for audio mode and transcription provider type.
src/copaw/app/routers/agent.py Adds REST endpoints for audio/transcription settings and status checks.
src/copaw/app/channels/dingtalk/content_utils.py Emits AudioContent for DingTalk voice messages.
src/copaw/app/channels/dingtalk/constants.py Maps DingTalk voice type to audio.
src/copaw/app/channels/base.py Lets audio-only messages bypass no-text debounce.
src/copaw/agents/utils/message_processing.py Implements transcription/native-audio handling and ffmpeg conversion for audio blocks.
src/copaw/agents/utils/audio_transcription.py New utility implementing Whisper API + Local Whisper transcription flows.
pyproject.toml Adds optional dependency extra copaw[whisper].
console/src/pages/Settings/VoiceTranscription/index.tsx New Console settings page to configure voice transcription and check provider status.
console/src/pages/Settings/VoiceTranscription/index.module.less Styling for the new settings page.
console/src/layouts/Sidebar.tsx Adds navigation entry for Voice Transcription settings.
console/src/layouts/MainLayout/index.tsx Adds route for /voice-transcription.
console/src/api/modules/agent.ts Adds API client methods for new backend endpoints.
console/src/locales/en.json Adds nav label + full Voice Transcription translations.
console/src/locales/zh.json Adds nav label + full Voice Transcription translations.
console/src/locales/ru.json Adds nav label + full Voice Transcription translations.
console/src/locales/ja.json Adds nav label + full Voice Transcription translations.
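The base.py change in the table above (audio-only messages bypassing the no-text debounce) amounts to a content-blocks predicate along these lines; the function name and block shape are assumptions for illustration:

```python
def should_bypass_debounce(blocks: list[dict]) -> bool:
    """True when a message carries audio but no text.

    The no-text debounce buffers messages while waiting for text to
    arrive; a voice note has no text by design, so it should be
    processed immediately instead of sitting in the buffer.
    """
    has_audio = any(b.get("type") == "audio" for b in blocks)
    has_text = any(b.get("type") == "text" for b in blocks)
    return has_audio and not has_text
```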


Comment thread src/copaw/agents/utils/message_processing.py
Comment thread src/copaw/agents/utils/message_processing.py
Comment thread src/copaw/agents/utils/audio_transcription.py
@ekzhu ekzhu had a problem deploying to maintainer-approved March 17, 2026 00:58 — with GitHub Actions Failure
@ekzhu
Collaborator Author

ekzhu commented Mar 17, 2026

Minor suggestion: For native mode, consider adding a dependency-status check for ffmpeg similar to what local_whisper mode already has, so users can see whether ffmpeg is installed before selecting this mode.

Done.

Screenshot 2026-03-17 08:58:14

Ensure OpenAI-compatible provider base URLs end with /v1 before
passing to AsyncOpenAI. Fixes transcription failures for providers
configured without the /v1 suffix (e.g. DeepSeek).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
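The base-URL normalization this commit describes is a small idempotent string fix; a sketch (the helper name is hypothetical):

```python
def normalize_base_url(base_url: str) -> str:
    """Ensure an OpenAI-compatible base URL ends with /v1.

    Providers configured without the suffix (e.g. a bare
    https://api.deepseek.com) would otherwise hit the wrong path
    when handed to AsyncOpenAI. Idempotent: URLs that already end
    in /v1 pass through unchanged.
    """
    url = base_url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```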
Copilot AI review requested due to automatic review settings March 17, 2026 02:45
@ekzhu ekzhu temporarily deployed to maintainer-approved March 17, 2026 02:45 — with GitHub Actions Inactive
Contributor

Copilot AI left a comment


Pull request overview

This PR adds end-to-end voice message support by introducing configurable audio handling (transcribe vs native audio), a transcription backend abstraction (remote Whisper API or local openai-whisper), backend APIs to manage these settings, and a Console UI settings page to configure and validate the setup.

Changes:

  • Add agent configuration for audio handling mode and transcription backend selection.
  • Implement audio block processing: secure local-path handling, optional Whisper transcription, and ffmpeg-based conversion for native audio.
  • Add Console settings page + backend API endpoints to manage and inspect voice transcription configuration/status.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds config schema fields for audio_mode and transcription settings.
src/copaw/cli/init_cmd.py Adds copaw init prompts for audio mode and transcription provider type.
src/copaw/app/routers/agent.py Adds REST endpoints to get/set audio mode, transcription provider type/provider ID, and local whisper dependency status.
src/copaw/app/channels/dingtalk/content_utils.py Switches DingTalk voice to an AudioContent block (vs file).
src/copaw/app/channels/dingtalk/constants.py Maps DingTalk voice messages to the audio content type.
src/copaw/app/channels/base.py Bypasses no-text debounce buffering for messages containing audio blocks.
src/copaw/agents/utils/message_processing.py Implements audio-mode aware audio processing (transcribe vs native + conversion) and expands media allowlist roots.
src/copaw/agents/utils/audio_transcription.py New transcription utility supporting Whisper API and local openai-whisper.
pyproject.toml Adds optional whisper extra and includes it in full.
console/src/pages/Settings/VoiceTranscription/index.tsx New Console settings page for voice transcription configuration and status.
console/src/pages/Settings/VoiceTranscription/index.module.less Styling for the new settings page.
console/src/locales/{en,zh,ru,ja}.json Adds UI strings for Voice Transcription page + nav entry.
console/src/layouts/Sidebar.tsx Adds navigation entry for Voice Transcription.
console/src/layouts/MainLayout/index.tsx Adds route for the Voice Transcription settings page.
console/src/api/modules/agent.ts Adds client functions for the new backend endpoints.


Comment thread src/copaw/agents/utils/message_processing.py
Comment thread src/copaw/agents/utils/message_processing.py
Comment thread src/copaw/agents/utils/audio_transcription.py
Comment thread console/src/pages/Settings/VoiceTranscription/index.tsx
@ekzhu ekzhu had a problem deploying to maintainer-approved March 17, 2026 07:57 — with GitHub Actions Failure
Copilot AI review requested due to automatic review settings March 17, 2026 07:58
@ekzhu ekzhu had a problem deploying to maintainer-approved March 17, 2026 07:58 — with GitHub Actions Failure
Contributor

Copilot AI left a comment


Pull request overview

Adds configurable voice-message handling so audio attachments (e.g., Discord/Telegram .ogg) can work across model backends by either transcribing to text (remote Whisper API or local openai-whisper) or sending audio natively (with ffmpeg conversion), plus Console UI and API/CLI configuration surfaces.

Changes:

  • Introduces audio_mode and transcription provider configuration in backend config + CLI init prompts.
  • Adds backend audio block processing (transcription and ffmpeg conversion) and a new audio_transcription utility.
  • Adds Console Settings page + navigation + API bindings for configuring voice transcription.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/copaw/config/config.py Adds new agent config fields for audio mode + transcription provider selection.
src/copaw/cli/init_cmd.py Prompts for audio mode and transcription provider during copaw init.
src/copaw/app/routers/agent.py Adds REST endpoints for reading/updating audio/transcription settings and status.
src/copaw/app/channels/dingtalk/content_utils.py Switches DingTalk voice payloads to runtime AudioContent.
src/copaw/app/channels/dingtalk/constants.py Maps DingTalk voice message type to audio.
src/copaw/app/channels/base.py Bypasses no-text debounce for audio-only messages so voice messages are processed immediately.
src/copaw/agents/utils/message_processing.py Adds audio-specific processing: transcription in auto mode and ffmpeg conversion + native send in native mode.
src/copaw/agents/utils/audio_transcription.py New module providing Whisper API and local-whisper transcription backends + provider listing/status helpers.
pyproject.toml Adds copaw[whisper] extra and includes it in copaw[full].
console/src/pages/Settings/VoiceTranscription/index.tsx New Console settings page for audio mode and transcription provider configuration.
console/src/pages/Settings/VoiceTranscription/index.module.less Styles for the new settings page.
console/src/locales/en.json Adds navigation label and page strings for Voice Transcription settings.
console/src/locales/zh.json Adds navigation label and page strings for Voice Transcription settings.
console/src/locales/ru.json Adds navigation label and page strings for Voice Transcription settings.
console/src/locales/ja.json Adds navigation label and page strings for Voice Transcription settings.
console/src/layouts/Sidebar.tsx Adds Settings nav entry for Voice Transcription.
console/src/layouts/MainLayout/index.tsx Adds /voice-transcription route to render the new page.
console/src/api/modules/agent.ts Adds frontend API wrappers for the new backend endpoints.


Comment thread src/copaw/app/channels/dingtalk/content_utils.py (code context):
    return AudioContent(
        type=ContentType.AUDIO,
        data=url,
        format="amr",
Comment thread src/copaw/app/routers/agent.py
Comment thread src/copaw/agents/utils/message_processing.py
Member

@xieyxclack xieyxclack left a comment


LGTM

@xieyxclack xieyxclack merged commit 6b7abac into agentscope-ai:main Mar 17, 2026
7 of 8 checks passed
@ekzhu ekzhu deleted the claude/fix-discord-voice-messages-NVPD1 branch March 18, 2026 07:31
tudan110 pushed a commit to tudan110/QwenPaw that referenced this pull request Apr 4, 2026