Add PatternPairAggregator #1387

Merged
markbackman merged 5 commits into main from mb/pattern-aggregator
Mar 18, 2025

Conversation

Contributor

@markbackman markbackman commented Mar 18, 2025

Please describe the changes in your PR. If it is addressing an issue, please reference that as well.

Extends the BaseTextAggregator with a new aggregator, PatternPairAggregator, that:

  • Removes text from the LLM output before it is provided to the TTS service
  • Includes a handler that receives the content between the pattern pair, so the content can be used for things like changing voices, which I show in a demo
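
To make the two points above concrete, here is a minimal, self-contained sketch of the idea. It is not the Pipecat implementation: `MiniPatternAggregator`, `add_pattern_pair`, and the `<voice>` tags are hypothetical names chosen for illustration, and the sketch always removes the matched text and only releases text at sentence-ending punctuation. Only `on_pattern_match` and `match.content` mirror names that appear in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class PatternMatch:
    pattern_id: str
    content: str  # text found between the start and end patterns


class MiniPatternAggregator:
    """Toy illustration only; NOT the Pipecat PatternPairAggregator."""

    def __init__(self) -> None:
        self._text = ""
        self._pairs: Dict[str, Tuple[str, str]] = {}
        self._handlers: Dict[str, Callable[[PatternMatch], None]] = {}

    def add_pattern_pair(self, pattern_id: str, start: str, end: str) -> None:
        # Hypothetical registration method for this sketch.
        self._pairs[pattern_id] = (start, end)

    def on_pattern_match(
        self, pattern_id: str, handler: Callable[[PatternMatch], None]
    ) -> None:
        self._handlers[pattern_id] = handler

    def aggregate(self, chunk: str) -> Optional[str]:
        """Add a streamed chunk; return speakable text, or None to keep buffering."""
        self._text += chunk
        for pattern_id, (start, end) in self._pairs.items():
            while True:
                s = self._text.find(start)
                if s == -1:
                    break
                e = self._text.find(end, s + len(start))
                if e == -1:
                    # Start pattern seen without its end: hold everything back.
                    return None
                content = self._text[s + len(start):e]
                handler = self._handlers.get(pattern_id)
                if handler is not None:
                    handler(PatternMatch(pattern_id, content))
                # Remove the whole pair so it never reaches the TTS service.
                self._text = self._text[:s] + self._text[e + len(end):]
        # Assumption: release text only once a sentence looks complete.
        if self._text.rstrip().endswith((".", "!", "?")):
            out, self._text = self._text, ""
            return out
        return None


aggregator = MiniPatternAggregator()
aggregator.add_pattern_pair("voice_tag", "<voice>", "</voice>")
seen = []
aggregator.on_pattern_match("voice_tag", lambda m: seen.append(m.content))

spoken = []
for chunk in ["<voi", "ce>narr", "ator</voice>Once ", "upon a time."]:
    result = aggregator.aggregate(chunk)
    if result is not None:
        spoken.append(result)
```

In this sketch the handler fires as soon as the closing pattern arrives, before the sentence is released, which is what lets a voice change take effect ahead of the text it applies to.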

codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 94.66667% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pipecat/utils/text/pattern_pair_aggregator.py | 94.66% | 4 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/pipecat/utils/text/pattern_pair_aggregator.py | 94.66% <94.66%> (ø) |

transport = DailyTransport(
    room_url,
    token,
    "Storytelling Bot",
Contributor

Nit: should we change this to "Multiple Voices Bot" or something similar?

Contributor

When I first looked at the example, I thought: why aren't we using function calling for this? 😅

But after reading the description, it made total sense.

That makes me wonder if we could improve the example name, maybe something like 35-multiple-voices-bot or another name that clearly indicates the bot will automatically play multiple characters with different voices.

Contributor Author

The benefit of this approach is that it avoids function calls, which are slow and add noticeable latency to the response. With this approach, the LLM can output many encoded instructions in a single turn. Also, the applications extend beyond just voice switching. You can now encode any information into the LLM response and just parse it out. Two other common cases I've heard are DTMF codes and thinking tokens.
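
As a rough illustration of "many encoded instructions in a single turn", the sketch below splits one LLM response into an ordered list of actions in a single pass. The `<voice>` and `<dtmf>` tag names and the `split_actions` helper are assumptions made for this example; the PR itself only demonstrates a voice tag.

```python
import re
from typing import List, Tuple

# Hypothetical tag names for illustration; \1 requires the closing tag
# to match the opening one.
TAG_RE = re.compile(r"<(voice|dtmf)>(.*?)</\1>", re.DOTALL)


def split_actions(response: str) -> List[Tuple[str, str]]:
    """Return ("say", text) and (tag, payload) actions in their original order."""
    actions: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(response):
        text = response[pos:match.start()].strip()
        if text:
            actions.append(("say", text))
        actions.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    tail = response[pos:].strip()
    if tail:
        actions.append(("say", tail))
    return actions


response = (
    "<voice>narrator</voice>Dialing support now."
    "<dtmf>1234#</dtmf><voice>robot</voice>Please hold."
)
actions = split_actions(response)
```

Every instruction here arrives inline with the spoken text, so no extra round trip to the model is needed to act on it.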

Contributor

Yep, I think it makes a lot of sense. 👍

Comment thread CHANGELOG.md Outdated
Comment on lines +124 to +125
- Added foundational example `35-voice-switching.py` showing how to use the new
`PatternPairAggregator`.
Contributor

@filipi87 filipi87 Mar 18, 2025

Maybe make the description more complete:

- Added foundational example `35-voice-switching.py` showing how to use the new
  `PatternPairAggregator` to make the bot automatically play multiple characters with different voices.

Contributor Author

I'm taking the chance to educate a bit:

- Added foundational example `35-voice-switching.py` showing how to use the new
  `PatternPairAggregator`. This example shows how to encode information for the
  LLM to instruct TTS voice changes, but this can be used to encode any
  information into the LLM response, which you want to parse and use in other
  parts of your application.

I'll make this clear in docs too.

pattern_id: Unique identifier for this pattern pair.
start_pattern: Pattern that marks the beginning of content.
end_pattern: Pattern that marks the end of content.
remove_match: Whether to remove the matched content from the text.
Contributor

This is a nice one.

pattern_aggregator.on_pattern_match("voice_tag", on_voice_tag)

# Set the pattern aggregator on the TTS service
tts._text_aggregator = pattern_aggregator
Contributor

I guess we should pass the text_aggregator when creating the CartesiaTTSService, because otherwise, it feels like we are modifying a private variable.

Contributor Author

Totally! Clearly the late night was getting to me.

Contributor Author

I've updated to:

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id=VOICE_IDS["narrator"],
    text_aggregator=pattern_aggregator,
)

@markbackman markbackman force-pushed the mb/pattern-aggregator branch from 8e17917 to 8bbb856 on March 18, 2025 11:47
@markbackman
Contributor Author

Thanks for the quick review @filipi87! This should be ready again.

Contributor

@filipi87 filipi87 left a comment

Pretty cool. 🚀

@markbackman markbackman merged commit 4677c34 into main Mar 18, 2025
6 checks passed
@markbackman markbackman deleted the mb/pattern-aggregator branch March 18, 2025 12:46
voice_name = match.content.strip().lower()
if voice_name in VOICE_IDS:
    voice_id = VOICE_IDS[voice_name]
    tts.set_voice(voice_id)
Contributor

I think this is fine since the processor executing this code is actually the TTS. In general, we would want to use frames. Maybe it's worth adding a comment about this.

Contributor Author

This needs to execute very quickly, so the method is the best way to go, I think. Even with the method, sometimes it's not fast enough.

processed_text = processed_text.replace(full_match, "", 1)
modified = True

return processed_text, modified
Contributor

Maybe this could have gone to utils.string. The function signature would allow passing the set of pairs (start_tag, end_tag, remove_match).
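
For reference, a standalone helper along the lines suggested here might look like the sketch below. The name `strip_pattern_pairs` and its exact signature are hypothetical; it only mirrors the `(processed_text, modified)` return shape and the `replace(full_match, "", 1)` step visible in the diff, and it treats the tags as literal strings.

```python
import re
from typing import Iterable, Tuple

# (start_tag, end_tag, remove_match), as proposed in the review comment.
PatternPair = Tuple[str, str, bool]


def strip_pattern_pairs(text: str, pairs: Iterable[PatternPair]) -> Tuple[str, bool]:
    """Remove complete start/end pairs flagged with remove_match.

    Returns the processed text and whether anything was removed.
    """
    processed_text = text
    modified = False
    for start_tag, end_tag, remove_match in pairs:
        if not remove_match:
            continue
        # Escape the tags so they match literally; match lazily across newlines.
        regex = re.compile(
            re.escape(start_tag) + r".*?" + re.escape(end_tag), re.DOTALL
        )
        for full_match in [m.group(0) for m in regex.finditer(processed_text)]:
            processed_text = processed_text.replace(full_match, "", 1)
            modified = True
    return processed_text, modified


result = strip_pattern_pairs(
    "Hi <v>a</v>there <t>x</t>!",
    [("<v>", "</v>", True), ("<t>", "</t>", False)],
)
```

A pure function like this is easy to unit test on its own, while the aggregator would remain responsible for buffering incomplete pairs and invoking handlers.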

Contributor Author

Happy to move it if you feel strongly. I don't really have a preference.

3 participants