GradiumSTTService improvements #4066

Merged
markbackman merged 10 commits into main from mb/gradium-stt-improvements on Mar 18, 2026

Contributor

@markbackman markbackman commented Mar 18, 2026

Summary

  • Improved GradiumSTTService transcription completeness by switching from silence-frame flushing to the flush API with text accumulation. Previously, trailing words could be dropped when the server's flushed response arrived before all text tokens were delivered.
  • Added a transcript aggregation delay (100ms) after flush to capture trailing tokens before finalizing the transcription.
  • Decoupled audio encoding from sample rate. The encoding parameter now takes a base type ("pcm", "wav", "opus") and the sample rate is derived from the pipeline's audio_in_sample_rate, assembled dynamically via input_format_from_encoding(). This fixes the mismatch where SAMPLE_RATE=24000 was passed to the base class while encoding defaulted to "pcm_16000".
  • Added model_name to the WebSocket setup message.
  • Aligned VAD handling with base class patterns and removed duplicate reconnection logic.
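
The encoding and sample rate decoupling from the third bullet can be pictured with a small sketch. The helper below is hypothetical: it borrows the name input_format_from_encoding() from this PR, but its signature, the "pcm_<rate>" naming, and the pass-through for "wav" and "opus" are assumptions made for illustration, not the merged implementation.

```python
SUPPORTED_ENCODINGS = ("pcm", "wav", "opus")


def input_format_from_encoding(encoding: str, sample_rate: int) -> str:
    """Build a wire format string from a base encoding and a sample rate.

    Assumption: raw PCM formats are named "pcm_<rate>" (as in "pcm_16000"),
    while container formats such as wav and opus pass through unchanged.
    """
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"Unsupported encoding: {encoding!r}")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if encoding == "pcm":
        # The rate comes from the pipeline, so it can no longer disagree
        # with a rate baked into the encoding string.
        return f"pcm_{sample_rate}"
    return encoding


if __name__ == "__main__":
    print(input_format_from_encoding("pcm", 24000))
    print(input_format_from_encoding("opus", 24000))
```

Under these assumptions, a legacy value such as "pcm_16000" is rejected outright, which matches the migration advice in the breaking-change note: pass "pcm" or omit the parameter.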

Breaking Changes

  • GradiumSTTService encoding parameter default changed from "pcm_16000" to "pcm". If you were passing encoding="pcm_16000" explicitly, change it to encoding="pcm" or omit it entirely.

Testing

  • Run uv run python examples/foundational/07zf-interruptible-gradium.py
  • Verify complete utterances appear in LLM context (no dropped trailing words)
  • Verify the WebSocket stays connected during pauses in speech

The _receive_messages method had its own while-True reconnect loop,
duplicating the reconnection handling already provided by
WebsocketService._receive_task_handler (exponential backoff, max
retries, error reporting). Flatten to just the inner message loop
and let the base class handle reconnection.

Replace the process_frame override with a _handle_vad_user_stopped_speaking
override, which is the proper hook provided by STTService. Move
start_processing_metrics() into run_stt (matching Gladia's pattern).
Remove unused FrameDirection and VADUserStartedSpeakingFrame imports.

Enable the base class keepalive mechanism (10s timeout, 5s interval)
and override _send_keepalive to wrap silence in Gradium's audio
message format. Prevents idle connection timeouts, especially
behind a ServiceSwitcher.

Inline _process_response into _receive_messages, add required
model_name field to the setup message per Gradium docs, and
improve _handle_text docstring.

Replace silence-based flushing with Gradium flush/flushed protocol.
Accumulate word-level text fragments as InterimTranscriptionFrames and
emit a single TranscriptionFrame on flush completion. Align VAD handling
with CartesiaSTTService pattern using process_frame override. Remove
keepalive (not supported by Gradium) and pass language to transcription
frames.
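
The accumulate-then-finalize flow in the flush/flushed commit message can be sketched without any Pipecat or Gradium dependency. Everything below is illustrative: the TranscriptAccumulator class, the "text" and "flushed" message shapes, and the ("interim", ...) / ("final", ...) tuples are assumptions standing in for the real frames and wire protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TranscriptAccumulator:
    """Collects word-level fragments and emits one final transcript on flush."""

    fragments: List[str] = field(default_factory=list)

    def handle_message(self, message: dict) -> Optional[Tuple[str, str]]:
        """Return ("interim", text), ("final", text), or None.

        The "text" and "flushed" message types are assumed for illustration.
        """
        kind = message.get("type")
        if kind == "text":
            # Each fragment extends the running transcript (interim result).
            self.fragments.append(message["text"])
            return ("interim", " ".join(self.fragments))
        if kind == "flushed":
            # Flush completion: emit everything gathered so far exactly once.
            final = " ".join(self.fragments)
            self.fragments.clear()
            return ("final", final) if final else None
        return None


if __name__ == "__main__":
    acc = TranscriptAccumulator()
    for msg in ({"type": "text", "text": "hello"},
                {"type": "text", "text": "world"},
                {"type": "flushed"}):
        print(acc.handle_message(msg))
```

Finalizing immediately on "flushed", as this sketch does, is the behavior that a later commit in this PR softens with a short aggregation delay.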
…kens

Gradium flushed response can arrive before all text tokens have been
delivered. Instead of finalizing immediately on flushed, start a short
timer (100ms) that allows trailing tokens to accumulate before pushing
the final TranscriptionFrame.
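
A minimal asyncio sketch of this delayed finalization follows. The DelayedFinalizer class and its method names are hypothetical; only the idea (start a 100ms timer on flushed, keep accumulating, then emit one final transcript) comes from the commit message above.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class DelayedFinalizer:
    """Waits briefly after 'flushed' so trailing tokens make it into the final text."""

    def __init__(self, on_final: Callable[[str], Awaitable[None]], delay: float = 0.1):
        self._on_final = on_final
        self._delay = delay  # seconds to wait after flushed
        self._fragments: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    def on_text(self, text: str) -> None:
        # Tokens are accepted both before and after the flushed signal.
        self._fragments.append(text)

    def on_flushed(self) -> None:
        # Restart the timer if one is already pending.
        if self._timer and not self._timer.done():
            self._timer.cancel()
        self._timer = asyncio.create_task(self._finalize_after_delay())

    async def _finalize_after_delay(self) -> None:
        await asyncio.sleep(self._delay)
        text = " ".join(self._fragments)
        self._fragments.clear()
        if text:
            await self._on_final(text)


async def demo() -> List[str]:
    finals: List[str] = []

    async def collect(text: str) -> None:
        finals.append(text)

    finalizer = DelayedFinalizer(collect, delay=0.1)
    finalizer.on_text("turn")
    finalizer.on_text("left")
    finalizer.on_flushed()        # flushed arrives early
    await asyncio.sleep(0.02)
    finalizer.on_text("please")   # trailing token lands inside the window
    await asyncio.sleep(0.2)      # let the timer fire
    return finals


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is a fixed 100ms of added latency per utterance in exchange for not losing tokens that trail the flushed response.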
markbackman added a commit that referenced this pull request Mar 18, 2026
@markbackman markbackman force-pushed the mb/gradium-stt-improvements branch from f4c3d1e to c6945c5 Compare March 18, 2026 03:12
@markbackman markbackman requested a review from aconchillo March 18, 2026 03:25
codecov Bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 21.42857% with 55 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/pipecat/services/gradium/stt.py | 21.42% | 55 Missing ⚠️

Files with missing lines | Coverage Δ
src/pipecat/services/gradium/stt.py | 31.25% <21.42%> (-1.24%) ⬇️

... and 13 files with indirect coverage changes

@markbackman markbackman force-pushed the mb/gradium-stt-improvements branch from a7d331d to b0f77bc Compare March 18, 2026 13:02
Comment thread src/pipecat/services/gradium/stt.py Outdated
# and pushed as a TranscriptionFrame.
self._accumulated_text: list[str] = []
self._flush_counter = 0
self._transcript_aggregation_delay = 0.1 # seconds to wait after flushed
Contributor

Is this something we would like to allow users to change? Otherwise, I think this could be a constant.

Contributor Author

Yup!

Comment on lines +203 to +204
self._accumulated_text: list[str] = []
self._flush_counter = 0
Contributor

Do we need to reset these values when we disconnect the WebSocket, for example in case of a reconnection?

Contributor Author

Yes, updating.
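
One way to picture the reset discussed in this thread is the sketch below. It reuses the attribute names from the reviewed snippet (_accumulated_text, _flush_counter), but the SketchSTTConnection class, its connect/disconnect methods, and the reading of _flush_counter as a count of flush requests are assumptions, not the code that was merged.

```python
from typing import List


class SketchSTTConnection:
    """Hypothetical stand-in for a WebSocket-backed STT service."""

    def __init__(self) -> None:
        self.connected = False
        self._accumulated_text: List[str] = []
        self._flush_counter = 0  # assumed to count flush requests sent

    def connect(self) -> None:
        self.connected = True

    def disconnect(self) -> None:
        self.connected = False
        # Drop fragments and flush bookkeeping tied to the old socket so a
        # reconnection cannot finalize stale text.
        self._reset_state()

    def _reset_state(self) -> None:
        self._accumulated_text.clear()
        self._flush_counter = 0


if __name__ == "__main__":
    conn = SketchSTTConnection()
    conn.connect()
    conn._accumulated_text.append("partial")
    conn._flush_counter = 2
    conn.disconnect()
    print(conn._accumulated_text, conn._flush_counter)
```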

Contributor

@filipi87 filipi87 left a comment

Looks good. I just added a few possible improvements, but nothing major.

@markbackman markbackman force-pushed the mb/gradium-stt-improvements branch from fb0da4a to ef794ff Compare March 18, 2026 19:52
The encoding parameter now takes just the base type (pcm, wav, opus)
and the sample rate is derived from the pipeline audio_in_sample_rate,
assembled dynamically via input_format_from_encoding(). This fixes the
mismatch where SAMPLE_RATE=24000 was passed to the base class while
encoding defaulted to pcm_16000.
@markbackman markbackman force-pushed the mb/gradium-stt-improvements branch from ef794ff to 4d9d8af Compare March 18, 2026 19:53
@markbackman markbackman merged commit 4b704e6 into main Mar 18, 2026
6 checks passed
@markbackman markbackman deleted the mb/gradium-stt-improvements branch March 18, 2026 19:57
markbackman added a commit that referenced this pull request Mar 21, 2026
* Remove duplicate reconnection logic from Gradium STT

The _receive_messages method had its own while-True reconnect loop,
duplicating the reconnection handling already provided by
WebsocketService._receive_task_handler (exponential backoff, max
retries, error reporting). Flatten to just the inner message loop
and let the base class handle reconnection.

* Align Gradium STT VAD handling with base class patterns

Replace the process_frame override with a _handle_vad_user_stopped_speaking
override, which is the proper hook provided by STTService. Move
start_processing_metrics() into run_stt (matching Gladia's pattern).
Remove unused FrameDirection and VADUserStartedSpeakingFrame imports.

* Add transcript aggregation delay after flushed to capture trailing tokens

Gradium flushed response can arrive before all text tokens have been
delivered. Instead of finalizing immediately on flushed, start a short
timer (100ms) that allows trailing tokens to accumulate before pushing
the final TranscriptionFrame.

* Add changelog for PR #4066

* Change default encoding to pcm_16000

* Decouple encoding from sample_rate in Gradium STT

The encoding parameter now takes just the base type (pcm, wav, opus)
and the sample rate is derived from the pipeline audio_in_sample_rate,
assembled dynamically via input_format_from_encoding(). This fixes the
mismatch where SAMPLE_RATE=24000 was passed to the base class while
encoding defaulted to pcm_16000.