feat: handle server_content.interrupted for faster interruptions (#3429)
Conversation
    if message.server_content and message.server_content.model_turn:
        ...
    if message.server_content and message.server_content.interrupted:
        logger.debug("Gemini VAD: interrupted signal received")
        await self._handle_interruption()
I think we ought to use the standard interruption-triggering pattern here:
await self.push_interruption_task_frame_and_wait()
(We never push InterruptionFrames directly from services anymore, we always use this mechanism instead; this guarantees that the whole pipeline handles the interruption and makes it easier for us to maintain the interruption mechanism going forward.)
As a nice side-effect of moving to this pattern, I think we'd no longer need to call await self._handle_interruption(), as GeminiLiveLLMService would receive and process an InterruptionFrame as usual.
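The pattern described above can be sketched in isolation. The class below is a hypothetical stand-in, not the real Pipecat service; only the method name `push_interruption_task_frame_and_wait()` comes from the review comment, and its internals here are simplified:

```python
import asyncio


# Hypothetical stand-in for the relevant slice of a Pipecat service.
class ServiceSketch:
    def __init__(self):
        self.events = []

    async def push_interruption_task_frame_and_wait(self):
        # In the real framework, this hands the interruption to the
        # pipeline task and waits until the whole pipeline has handled
        # it, so the service never pushes an InterruptionFrame itself.
        self.events.append("interruption-task")

    async def on_server_interrupted(self):
        # Preferred pattern: delegate to the pipeline task rather than
        # pushing an InterruptionFrame directly from the service.
        await self.push_interruption_task_frame_and_wait()


service = ServiceSketch()
asyncio.run(service.on_server_interrupted())
print(service.events)  # ['interruption-task']
```

The point of the indirection is maintainability: services signal "an interruption should happen" and one central mechanism decides how the pipeline carries it out.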
I also think we need to precede the interruption with:
await self.broadcast_frame(UserStartedSpeakingFrame)
This is an important signal for the context-management system (i.e. LLMContextAggregatorPair) that records user and assistant messages in context.
Although...this is reminding me of a similar contribution (which appears never to have gotten merged) for leveraging AWS Nova Sonic's built-in VAD: #2431. There, we discussed the importance of also sending UserStoppedSpeakingFrame, if there's no local VAD in the pipeline, as that signal is also essential to context recording.
Can we find a way to broadcast a UserStoppedSpeakingFrame from this service, too?
cc @aconchillo, who last touched UserStartedSpeakingFrame/UserStoppedSpeakingFrame in other speech-to-speech services (OpenAI, Grok)...what was the effect of having the speech-to-speech service emit these frames in the case where the pipeline also had local VAD configured?
Ah, the recommendation if you want to have context management in your pipeline without local VAD (i.e. LLMContextAggregatorPair) is that you'd configure the user aggregator with ExternalUserTurnStrategies(). Though...it may not hurt to just turn off local VAD without updating the user aggregator in that way...
Thanks for the thorough review @kompfner! Updated to address your feedback:
- Using `push_interruption_task_frame_and_wait()` instead of pushing `InterruptionFrame` directly.
- Added `broadcast_frame(UserStartedSpeakingFrame())` before the interruption for context management.
Regarding UserStoppedSpeakingFrame - I'll defer to @aconchillo on whether that's needed here, given the open question about interaction with local VAD.
@kompfner bumping this for another review when you have a moment - thanks!
Ah sorry for not being clearer—regardless of the open question about interaction with local VAD, if we're emitting UserStartedSpeakingFrame from this service, we'll also need to emit UserStoppedSpeakingFrame.
I can help with some testing to ensure that the duplicate signal (in the case where we also have local VAD) doesn't cause issues. I suspect we'd be OK (OpenAI Realtime already always emits interruptions + user started/stopped speaking).
    if message.server_content and message.server_content.interrupted:
        logger.debug("Gemini VAD: interrupted signal received")
        await self.broadcast_frame(UserStartedSpeakingFrame())
        await self.push_interruption_task_frame_and_wait()
Ah just realized that this condition (server_content.interrupted) only occurs for a barge-in, not for a "normal" user utterance that follows the assistant response.
If this service is responsible for firing UserStartedSpeakingFrame (and UserStoppedSpeakingFrame) it should be able to do so in all circumstances, barge-in or not.
OK, let's go with the simplest thing for now: just the await self.push_interruption_task_frame_and_wait() (no UserStartedSpeakingFrame).
Maybe add this comment above that line:
# NOTE: while the service triggers interruptions in
# the specific case of barge-ins, it does *not*
# emit UserStarted/StoppedSpeakingFrames, as the
# Gemini Live API does not give us broadly reliable
# signals to base those off of. Pipelines that
# require turn tracking (like those using context
# aggregators) still need an independent way to
# track turns, such as local Silero VAD in
# combination with the context aggregator default
# turn strategies.
Done - implemented as suggested. Thanks!
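Putting the final suggestion together, the handler in the message loop would look roughly like this. This is a sketch with stand-in objects (`SimpleNamespace` messages, a fake wait method); the real `GeminiLiveLLMService` internals are not shown in this thread:

```python
import asyncio
from types import SimpleNamespace


class GeminiLiveSketch:
    """Hypothetical stand-in for the relevant slice of GeminiLiveLLMService."""

    def __init__(self):
        self.interrupted = False

    async def push_interruption_task_frame_and_wait(self):
        # The real version hands the interruption to the pipeline task
        # and waits; here we just record that it was triggered.
        self.interrupted = True

    async def _receive_task_handler(self, message):
        if message.server_content and message.server_content.interrupted:
            # NOTE: while the service triggers interruptions in the
            # specific case of barge-ins, it does *not* emit
            # UserStarted/StoppedSpeakingFrames, as the Gemini Live API
            # does not give us broadly reliable signals to base those
            # off of. Pipelines that require turn tracking (like those
            # using context aggregators) still need an independent way
            # to track turns, such as local Silero VAD in combination
            # with the context aggregator default turn strategies.
            await self.push_interruption_task_frame_and_wait()


msg = SimpleNamespace(server_content=SimpleNamespace(interrupted=True))
svc = GeminiLiveSketch()
asyncio.run(svc._receive_task_handler(msg))
print(svc.interrupted)  # True
```

Note that, per the discussion above, no `UserStartedSpeakingFrame` is broadcast here; the interruption is the only signal this service emits for a barge-in.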
Good point @kompfner - …
Force-pushed f89ae45 to c65a89c
Force-pushed c65a89c to cadced3
Trying to think through the ramifications of a service only firing events for barge-in and not for "regular" back and forth... I believe that inconsistency could be a problem, because then users wouldn't get a clear sense of whether or not they needed to also have independent VAD-based turn detection in their pipeline. So I do think it might unfortunately be an all-or-nothing thing: either a service is responsible for emitting turn-related events (user started/stopped) or it's not. Based on some brief research, it looks to me like Gemini Live sadly doesn't have timely events we can listen to to detect when it thinks user speech has actually started...

...but maybe for the purpose of context management (…)

Except, shoot, no, the user started/stopped speaking frames also drive "on_user_turn_started" events, not to mention that folks might have custom processors or observers in their pipeline that expect user started/stopped frames to actually correspond in time to when the user has actually started and stopped speaking...

OK, here's my current thinking: maybe we just say, for now, that this service does not act as a turn controller (i.e. it doesn't emit user started/stopped speaking frames), and if you do need turn tracking in your pipeline (for context recording, say) then you need to also BYO turn tracking (like enabling local VAD + using context aggregator defaults).

I can help do some testing to ensure that emitting just the interruption frame doesn't have any adverse effects in pipelines with context recording. Pardon the circling around on this question, it's proving a bit tricky!
Changelog entry (@@ -0,0 +1 @@):

    - Added handling for `server_content.interrupted` signal in Gemini Live services for faster interruption response.
Suggested change:

    - Added handling for `server_content.interrupted` signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.
Updated with your suggested text. Thanks!
OK, after the last few suggestions (README update, removing …

@kompfner Implemented the suggested changes - removed …
kompfner left a comment:
Thanks for the contribution! And for your patience with my back-and-forth suggestions 🙏
Summary

Handles the `server_content.interrupted` signal in the `GeminiLiveLLMService` message loop.

Problem

When using `GeminiLiveLLMService` without local VAD (Silero), interruptions were delayed because the service waited for `input_transcription`. Gemini sends `server_content.interrupted` instantly when its VAD detects speech.

Approach

Added inline handling in the message loop - no new methods or config flags.

Alternatives considered:

- A config flag (`use_native_vad_interruptions`) - rejected per YAGNI, no one has asked for it

Why this is safe:

- `_handle_interruption()` is idempotent, so duplicate signals (from both local VAD and Gemini) are harmless.

Fixes #3381
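The idempotency claim can be illustrated with a small sketch. The guard flag below is hypothetical (the real `_handle_interruption()` internals aren't shown in this thread); it just demonstrates why a duplicate signal from two VAD sources is harmless:

```python
import asyncio


class InterruptionHandler:
    """Sketch of an idempotent interruption handler: duplicate signals
    (e.g. from both local VAD and Gemini's server-side VAD) collapse
    into a single cancellation."""

    def __init__(self):
        self._bot_speaking = False
        self.cancellations = 0

    def start_bot_turn(self):
        self._bot_speaking = True

    async def _handle_interruption(self):
        if not self._bot_speaking:
            return  # already interrupted; a duplicate signal is a no-op
        self._bot_speaking = False
        self.cancellations += 1  # cancel in-flight audio exactly once


h = InterruptionHandler()
h.start_bot_turn()
asyncio.run(h._handle_interruption())  # local VAD fires
asyncio.run(h._handle_interruption())  # Gemini signal arrives second
print(h.cancellations)  # 1
```

Because the second call finds the bot already silenced, running both interruption paths in one pipeline costs nothing.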