
feat: handle server_content.interrupted for faster interruptions #3429

Merged
kompfner merged 2 commits into pipecat-ai:main from lukepayyapilli:fix/gemini-live-interrupted-signal
Jan 28, 2026

Conversation

@lukepayyapilli
Contributor

@lukepayyapilli lukepayyapilli commented Jan 13, 2026

Summary

  • Handle server_content.interrupted signal in GeminiLiveLLMService message loop.
  • Provides ~700-1100ms faster interruptions when not using local VAD.

Problem

When using GeminiLiveLLMService without local VAD (Silero), interruptions were delayed because the service waited for input_transcription. Gemini sends server_content.interrupted instantly when its VAD detects speech.

Approach

Added inline handling in the message loop - no new methods or config flags.

Alternatives considered:

  • Add config flag (use_native_vad_interruptions) - rejected per YAGNI, no one has asked for it
  • Create separate handler method - rejected, 3 lines of code doesn't warrant abstraction

Why this is safe: _handle_interruption() is idempotent, so duplicate signals (from both local VAD and Gemini) are harmless.
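As a minimal sketch of the idempotency argument, the toy class below shows why a second interruption signal changes nothing once the first has been handled. `InterruptibleBot`, `handle_interruption`, and the two fields are hypothetical stand-ins for illustration only; they are not the pipecat implementation of `_handle_interruption()`.

```python
import asyncio


class InterruptibleBot:
    """Hypothetical stand-in for a service; not the pipecat class."""

    def __init__(self) -> None:
        self.bot_speaking = True
        self.cancelled_responses = 0

    async def handle_interruption(self) -> None:
        # A repeated call finds nothing left to cancel and returns early,
        # so duplicate signals leave the state unchanged.
        if not self.bot_speaking:
            return
        self.bot_speaking = False
        self.cancelled_responses += 1


async def demo() -> InterruptibleBot:
    bot = InterruptibleBot()
    await bot.handle_interruption()  # e.g. triggered by local VAD
    await bot.handle_interruption()  # e.g. triggered by Gemini's signal
    return bot


bot = asyncio.run(demo())
```

Under these assumptions only one response is cancelled, regardless of how many sources report the same barge-in.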

Fixes #3381

@codecov

codecov Bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pipecat/services/google/gemini_live/llm.py | 0.00% | 4 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/pipecat/services/google/gemini_live/llm.py | 19.75% <0.00%> (-0.09%) ⬇️ |

... and 23 files with indirect coverage changes


if message.server_content and message.server_content.model_turn:
if message.server_content and message.server_content.interrupted:
    logger.debug("Gemini VAD: interrupted signal received")
    await self._handle_interruption()

Contributor

I think we ought to use the standard interruption-triggering pattern here:

await self.push_interruption_task_frame_and_wait()

(We never push InterruptionFrames directly from services anymore; we always use this mechanism instead. This guarantees that the whole pipeline handles the interruption and makes it easier for us to maintain the interruption mechanism going forward.)

As a nice side-effect of moving to this pattern, I think we'd no longer need to call await self._handle_interruption(), as GeminiLiveLLMService would receive and process an InterruptionFrame as usual.
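To illustrate the side effect described above, here is a rough sketch of the idea using toy classes. `ToyProcessor`, `ToyPipeline`, `ToyLiveService`, and `request_interruption` are all hypothetical names invented for this example; they only model the fan-out behavior and do not reproduce how `push_interruption_task_frame_and_wait()` works internally.

```python
import asyncio
from typing import List, Optional


class ToyProcessor:
    """Hypothetical pipeline element that counts the interruptions it sees."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.interruptions = 0

    async def on_interruption(self) -> None:
        self.interruptions += 1


class ToyPipeline:
    """Hypothetical coordinator that fans an interruption out to every element."""

    def __init__(self, processors: List[ToyProcessor]) -> None:
        self.processors = processors

    async def request_interruption(self) -> None:
        for processor in self.processors:
            await processor.on_interruption()


class ToyLiveService(ToyProcessor):
    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.pipeline: Optional[ToyPipeline] = None

    async def on_server_interrupted(self) -> None:
        # Ask the pipeline to interrupt instead of handling it privately.
        # The service then receives the interruption like any other element,
        # so no separate local handler call is needed.
        assert self.pipeline is not None
        await self.pipeline.request_interruption()


async def demo() -> List[int]:
    service = ToyLiveService("live_service")
    transport = ToyProcessor("transport")
    aggregator = ToyProcessor("aggregator")
    pipeline = ToyPipeline([transport, service, aggregator])
    service.pipeline = pipeline
    await service.on_server_interrupted()
    return [p.interruptions for p in pipeline.processors]


counts = asyncio.run(demo())
```

In this toy model every element, including the service that raised the request, sees exactly one interruption.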

Contributor

@kompfner kompfner Jan 14, 2026

I also think we need to precede the interruption with:

await self.broadcast_frame(UserStartedSpeakingFrame)

This is an important signal for the context-management system (i.e. LLMContextAggregatorPair) that records user and assistant messages in context.

Although...this is reminding me of a similar contribution (which appears never to have gotten merged) for leveraging AWS Nova Sonic's built-in VAD: #2431. There, we discussed the importance of also sending UserStoppedSpeakingFrame, if there's no local VAD in the pipeline, as that signal is also essential to context recording.

Can we find a way to broadcast a UserStoppedSpeakingFrame from this service, too?

cc @aconchillo, who last touched UserStartedSpeakingFrame/UserStoppedSpeakingFrame in other speech-to-speech services (OpenAI, Grok)...what was the effect of having the speech-to-speech service emit these frames in the case where the pipeline also had local VAD configured?

Contributor

@kompfner kompfner Jan 14, 2026

Ah, the recommendation if you want to have context management in your pipeline without local VAD (i.e. LLMContextAggregatorPair) is that you'd configure the user aggregator with ExternalUserTurnStrategies(). Though...it may not hurt to just turn off local VAD without updating the user aggregator in that way...

Contributor Author

@lukepayyapilli lukepayyapilli Jan 15, 2026

Thanks for the thorough review @kompfner! Updated to address your feedback:

  • Using push_interruption_task_frame_and_wait() instead of pushing InterruptionFrame directly.
  • Added broadcast_frame(UserStartedSpeakingFrame()) before the interruption for context management.

Regarding UserStoppedSpeakingFrame - I'll defer to @aconchillo on whether that's needed here, given the open question about interaction with local VAD.

Contributor Author

@kompfner bumping this for another review when you have a moment - thanks!

Contributor

Ah sorry for not being clearer—regardless of the open question about interaction with local VAD, if we're emitting UserStartedSpeakingFrame from this service, we'll also need to emit UserStoppedSpeakingFrame.

I can help with some testing to ensure that the duplicate signal (in the case where we also have local VAD) doesn't cause issues. I suspect we'd be OK (OpenAI Realtime already always emits interruptions + user started/stopped speaking).

if message.server_content and message.server_content.interrupted:
    logger.debug("Gemini VAD: interrupted signal received")
    await self.broadcast_frame(UserStartedSpeakingFrame())
    await self.push_interruption_task_frame_and_wait()

Contributor

Ah just realized that this condition (server_content.interrupted) only occurs for a barge-in, not for a "normal" user utterance that follows the assistant response.

If this service is responsible for firing UserStartedSpeakingFrame (and UserStoppedSpeakingFrame) it should be able to do so in all circumstances, barge-in or not.

Contributor

OK, let's go with the simplest thing for now: just the await self.push_interruption_task_frame_and_wait() (no UserStartedSpeakingFrame).

Maybe add this comment above that line:

# NOTE: while the service triggers interruptions in
# the specific case of barge-ins, it does *not*
# emit UserStarted/StoppedSpeakingFrames, as the
# Gemini Live API does not give us broadly reliable
# signals to base those off of. Pipelines that
# require turn tracking (like those using context
# aggregators) still need an independent way to
# track turns, such as local Silero VAD in
# combination with the context aggregator default
# turn strategies.
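The behavior settled on here can be sketched roughly as follows. This is a simplified model, not the merged code: `SketchLiveService` and `handle_message` are hypothetical, the messages are plain `SimpleNamespace` objects rather than Gemini SDK types, and the `push_interruption_task_frame_and_wait` method below is a recording stub that only borrows the name used in the review.

```python
import asyncio
from types import SimpleNamespace
from typing import List


class SketchLiveService:
    """Hypothetical service that records what it would emit."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def push_interruption_task_frame_and_wait(self) -> None:
        # Recording stub standing in for the real pipeline mechanism.
        self.events.append("interruption")

    async def handle_message(self, message: SimpleNamespace) -> None:
        content = getattr(message, "server_content", None)
        if content and getattr(content, "interrupted", False):
            # Barge-in only: request an interruption, but emit no
            # user started/stopped speaking frames.
            await self.push_interruption_task_frame_and_wait()


async def demo() -> List[str]:
    service = SketchLiveService()
    barge_in = SimpleNamespace(server_content=SimpleNamespace(interrupted=True))
    normal = SimpleNamespace(server_content=SimpleNamespace(interrupted=None))
    for message in (normal, barge_in, normal):
        await service.handle_message(message)
    return service.events


events = asyncio.run(demo())
```

Only the barge-in message produces an event in this sketch, and no turn frames are recorded at all, which matches the intent of the suggested comment.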

Contributor Author

Done - implemented as suggested. Thanks!

@lukepayyapilli
Contributor Author

Good point @kompfner - interrupted only fires for barge-ins. For this PR, the intent was specifically to improve barge-in latency by leveraging Gemini's signal. Full VAD responsibility (handling normal utterances too) feels like a larger architectural decision that might warrant a separate discussion/issue. Would it be acceptable to keep this PR focused on the barge-in case, with the understanding that local VAD is still needed for complete coverage?

@lukepayyapilli lukepayyapilli force-pushed the fix/gemini-live-interrupted-signal branch from f89ae45 to c65a89c Compare January 23, 2026 15:39
@lukepayyapilli lukepayyapilli force-pushed the fix/gemini-live-interrupted-signal branch from c65a89c to cadced3 Compare January 23, 2026 15:41
@kompfner
Contributor

> Good point @kompfner - interrupted only fires for barge-ins. For this PR, the intent was specifically to improve barge-in latency by leveraging Gemini's signal. Full VAD responsibility (handling normal utterances too) feels like a larger architectural decision that might warrant a separate discussion/issue. Would it be acceptable to keep this PR focused on the barge-in case, with the understanding that local VAD is still needed for complete coverage?

Trying to think through the ramifications of a service only firing events for barge-in and not for "regular" back and forth...I believe that inconsistency could be a problem because then users wouldn't get a clear sense of whether or not they needed to also have independent VAD-based turn detection in their pipeline. So I do think it might unfortunately be an all-or-nothing thing—either a service is responsible for emitting turn-related events (user started/stopped) or it's not.

Based on some brief research, it looks to me like Gemini Live sadly doesn't have timely events we can listen to to detect when it thinks user speech has actually started...

...but maybe for the purpose of context management (LLMContextAggregatorPair) working properly, it doesn't have to be that timely?

Except, shoot, no, the user started/stopped speaking frames also drive "on_user_turn_started" events, not to mention that folks might have custom processors or observers in their pipeline that expect user started/stopped frames to actually correspond in time to when the user has actually started and stopped speaking...

OK, here's my current thinking: maybe we just say, for now, that this service does not act as a turn controller (i.e. it doesn't emit user started/stopped speaking frames), and if you do need turn tracking in your pipeline (for context recording, say) then you need to also BYO turn tracking (like enabling local VAD + using context aggregator defaults). I can help do some testing to ensure that emitting just the interruption frame doesn't have any adverse effects, in pipelines with context recording.

Pardon the circling around on this question, it's proving a bit tricky!
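The all-or-nothing concern can be made concrete with a toy model. Everything here is hypothetical (`Utterance`, the two helper functions, and the sample utterances); it only illustrates that a source which reports barge-ins alone misses ordinary turns, whereas an independent turn tracker sees every one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    is_barge_in: bool  # True if the user spoke over the assistant


def turns_from_barge_in_signal(utterances: List[Utterance]) -> List[str]:
    # A signal that fires only on barge-in reports a subset of user turns.
    return [u.text for u in utterances if u.is_barge_in]


def turns_from_independent_tracking(utterances: List[Utterance]) -> List[str]:
    # Independent turn tracking (for example local VAD) reports every turn.
    return [u.text for u in utterances]


conversation = [
    Utterance("what's the weather?", is_barge_in=False),
    Utterance("wait, stop", is_barge_in=True),
    Utterance("thanks", is_barge_in=False),
]

barge_in_only = turns_from_barge_in_signal(conversation)
independent = turns_from_independent_tracking(conversation)
```

Here the barge-in signal accounts for one of three user turns, which is why anything that depends on complete turn tracking still needs its own source.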

Comment thread changelog/3429.added.md Outdated
@@ -0,0 +1 @@
- Added handling for `server_content.interrupted` signal in Gemini Live services for faster interruption response.

Contributor

@kompfner kompfner Jan 23, 2026

Suggested change
- Added handling for `server_content.interrupted` signal in Gemini Live services for faster interruption response.
- Added handling for `server_content.interrupted` signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.

Contributor Author

Updated with your suggested text. Thanks!

@kompfner
Contributor

kompfner commented Jan 23, 2026

OK, after the last few suggestions (README update, removing UserStartedSpeakingFrame, and the code comment), let's get this thing in. Thanks for your patience, took a while to reason through this one.

@lukepayyapilli
Contributor Author

@kompfner Implemented the suggested changes - removed UserStartedSpeakingFrame, added the explanatory comment, and updated the changelog. Thanks for the clear guidance!

Contributor

@kompfner kompfner left a comment

Thanks for the contribution! And for your patience with my back-and-forth suggestions 🙏

@kompfner kompfner merged commit 312caab into pipecat-ai:main Jan 28, 2026
6 checks passed
@lukepayyapilli lukepayyapilli deleted the fix/gemini-live-interrupted-signal branch January 28, 2026 16:58

Development

Successfully merging this pull request may close these issues.

GeminiLiveLLMService doesn't handle server_content.interrupted - causes delayed interruptions

2 participants