Conversation
Codecov ReportAttention: Patch coverage is
🚀 New features to boost your workflow:
|
| transport = DailyTransport( | ||
| room_url, | ||
| token, | ||
| "Storytelling Bot", |
There was a problem hiding this comment.
Nit: should we change to "Multiple voices Bot" or anything like this.
There was a problem hiding this comment.
When I first looked at the example, I thought,: Why aren’t we using function calling for this ? 😅
But after reading the description, it made total sense.
That makes me wonder if we could improve the example name, maybe something like 35-multiple-voices-bot or another name that clearly indicates the bot will automatically play multiple characters with different voices.
There was a problem hiding this comment.
The benefit of using this approach is that function calls are slow, adding noticeable latency to the response. With this approach, the LLM can output many encoded instructions in a single turn. Also, the applications extend beyond just voice switching. You can now encode any information into the LLM response and just parse it out. Two other common cases I've heard are DTMF codes and thinking tokens.
There was a problem hiding this comment.
Yep, I think it makes a lot of sense. 👍
| - Added foundational example `35-voice-switching.py` showing how to use the new | ||
| `PatternPairAggregator`. |
There was a problem hiding this comment.
Maybe leave the description more complete:
Added foundational example 35-voice-switching.py showing how to use the new
PatternPairAggregator to make the bot automatically play multiple characters with different voices.
There was a problem hiding this comment.
I'm taking a chance to educate a bit:
- Added foundational example `35-voice-switching.py` showing how to use the new
`PatternPairAggregator`. This example shows how to encode information for the
LLM to instruct TTS voice changes, but this can be used to encode any
information into the LLM response, which you want to parse and use in other
parts of your application.
I'll make this clear in docs too.
| pattern_id: Unique identifier for this pattern pair. | ||
| start_pattern: Pattern that marks the beginning of content. | ||
| end_pattern: Pattern that marks the end of content. | ||
| remove_match: Whether to remove the matched content from the text. |
| pattern_aggregator.on_pattern_match("voice_tag", on_voice_tag) | ||
|
|
||
| # Set the pattern aggregator on the TTS service | ||
| tts._text_aggregator = pattern_aggregator |
There was a problem hiding this comment.
I guess we should pass the text_aggregator when creating the CartesiaTTSService, because otherwise, it feels like we are modifying a private variable.
There was a problem hiding this comment.
Totally! Clearly the late night was getting to me.
There was a problem hiding this comment.
I've updated to:
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id=VOICE_IDS["narrator"],
text_aggregator=pattern_aggregator,
)
8e17917 to
8bbb856
Compare
|
Thanks for the quick review @filipi87! This should be ready again. |
8bbb856 to
b282764
Compare
| voice_name = match.content.strip().lower() | ||
| if voice_name in VOICE_IDS: | ||
| voice_id = VOICE_IDS[voice_name] | ||
| tts.set_voice(voice_id) |
There was a problem hiding this comment.
I think this is fine since the processor executing this code is actually the TTS. In general, we would want to use frames. Maybe it's worth adding a comment about this.
There was a problem hiding this comment.
This needs to execute very quickly, so the method is the best way to go, I think. Even with the method, sometimes it's not fast enough.
| processed_text = processed_text.replace(full_match, "", 1) | ||
| modified = True | ||
|
|
||
| return processed_text, modified |
There was a problem hiding this comment.
Maybe this could have gone to utils.string. The function signature would allow passing the set of pairs (start_tag, end_tag, remove_match).
There was a problem hiding this comment.
Happy to move it if you feel strongly. I don't really have a preference.
Please describe the changes in your PR. If it is addressing an issue, please reference that as well.
Extends the BaseTextAggregator with one that's aimed at: