Context
I've been studying Pipecat through the official examples, and the storytelling-chatbot example stands out as a compelling concept for demonstrating multimodal AI capabilities. However, based on reviewing the codebase and comparing it to newer examples like local-smart-turn, word-wrangler-gemini-live, and the foundational examples in the main Pipecat repo, the storytelling example appears to use patterns that could benefit from modernization.
The storytelling example is particularly valuable because it demonstrates:
- Multi-turn interactive narratives
- Image generation synchronized with narration
- Voice-driven user input for "choose your own adventure" experiences
- Integration of Gemini 2.0 LLM with Google Imagen
However, the implementation could leverage more recent Pipecat architectural patterns to improve clarity and extensibility.
🎯 Proposed Modernization
1. Structured Text Segmentation with PatternPairAggregator + LLMTextProcessor
Current approach (assumed): Text parsing likely uses manual regex or string splitting to separate narration from image prompts.
Modern approach: Use PatternPairAggregator with XML-style tags for clean segmentation:
```python
from pipecat.processors.llm_text_processor import LLMTextProcessor
from pipecat.utils.text.pattern_pair_aggregator import MatchAction, PatternPairAggregator

# Configure pattern aggregator for story segments
pattern_aggregator = PatternPairAggregator()

# Define patterns for different content types
pattern_aggregator.add_pattern(
    type="narration",
    start_pattern="<narration>",
    end_pattern="</narration>",
    action=MatchAction.AGGREGATE,
)
pattern_aggregator.add_pattern(
    type="image_prompt",
    start_pattern="<image_prompt>",
    end_pattern="</image_prompt>",
    action=MatchAction.AGGREGATE,
)

# Create processor to segment LLM output
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)
```
Benefits:
- No manual regex parsing
- Clear separation between narration and image generation instructions
- Structured metadata attached to each segment
- Easier to extend with additional segment types (e.g., sound effects, scene transitions)
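To make the tagging concrete, here is a minimal, self-contained sketch of the segmentation the aggregator would perform on tagged LLM output. The story text and the `segment_story` helper are hypothetical; this is a one-shot, regex-based stand-in for illustration only, whereas the aggregator would do the same job incrementally over streamed tokens:

```python
import re

# Example of the tagged output the system prompt would ask the LLM to produce
# (tag names from this proposal; the story content itself is made up).
story_chunk = (
    "<narration>The dragon circled the tower twice, then landed.</narration>"
    "<image_prompt>A red dragon perched on a stone tower at dusk</image_prompt>"
)

def segment_story(text: str) -> list[tuple[str, str]]:
    """Split tagged LLM output into (segment_type, content) pairs.

    A non-streaming stand-in for what the pattern aggregator yields
    as matched pairs arrive.
    """
    pattern = re.compile(r"<(narration|image_prompt)>(.*?)</\1>", re.DOTALL)
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

segments = segment_story(story_chunk)
```

Each `(type, content)` pair maps onto one downstream action: narration segments go to TTS, image-prompt segments go to the image generator.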
2. Dedicated Orchestration with StoryOrchestratorProcessor
Proposal: Create a custom processor that manages story flow and coordinates multimodal outputs.
```python
class StoryOrchestratorProcessor(FrameProcessor):
    """Coordinates story pages, image generation, and narration sequencing."""

    def __init__(self, image_generator):
        super().__init__()
        self._current_page = 0
        self._pages = []
        self._image_generator = image_generator

    async def process_frame(self, frame, direction):
        # FrameProcessor subclasses should let the base class see every frame first
        await super().process_frame(frame, direction)

        # Handle narration segments
        if isinstance(frame, AggregatedTextFrame) and frame.aggregated_by == "narration":
            await self._queue_narration(frame.text)
        # Handle image prompts
        elif isinstance(frame, AggregatedTextFrame) and frame.aggregated_by == "image_prompt":
            await self._generate_and_queue_image(frame.text)

        await self.push_frame(frame, direction)
```
Responsibilities:
- Maintain story state (current page, history)
- Trigger image generation for each `<image_prompt>` segment
- Forward narration frames to the TTS pipeline
- Handle user input for story choices
This separates orchestration logic from parsing logic, making the pipeline more modular.
3. Synchronize Narration & Images with Frame Observers
Current approach (assumed): Timing may be implicit or based on delays.
Modern approach: Use frame observers to detect narration completion:
```python
class StorySyncObserver(FrameObserver):
    """Observes TTS completion to trigger the next story page."""

    def __init__(self, orchestrator):
        self._orchestrator = orchestrator

    async def on_frame(self, frame):
        if isinstance(frame, TTSAudioEndFrame):
            # Narration finished, show the next image
            await self._orchestrator.advance_to_next_page()
```
Flow:
- Narration 1 plays → `TTSAudioEndFrame` → Show Image 2 → Play Narration 2
- Narration 2 plays → `TTSAudioEndFrame` → Show Image 3 → Play Narration 3
This creates deterministic, page-by-page storytelling without timing hacks.
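The event-driven sequencing above can be sketched without any Pipecat dependencies. In this self-contained illustration (all names hypothetical), an `asyncio.Event` stands in for `TTSAudioEndFrame` reaching the observer, and the loop only advances to the next page once the "audio" signals completion:

```python
import asyncio

async def play_narration(text: str, done: asyncio.Event) -> None:
    """Stand-in for TTS playback; setting the event plays the role
    of TTSAudioEndFrame reaching the observer."""
    await asyncio.sleep(0.01)  # simulated audio duration
    done.set()

async def run_story(pages: list[str]) -> list[str]:
    events: list[str] = []
    for i, narration in enumerate(pages, start=1):
        events.append(f"show image {i}")  # display the page's image
        done = asyncio.Event()
        asyncio.create_task(play_narration(narration, done))
        events.append(f"play narration {i}")
        await done.wait()                 # advance only on "audio end"
    return events

events = asyncio.run(run_story(["Once upon a time...", "The dragon appeared."]))
```

Because progression is gated on a completion signal rather than a fixed `sleep`, the pacing stays correct no matter how long each narration actually takes.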
4. Clear Component Separation
| Responsibility | Component |
| --- | --- |
| Text segmentation | `LLMTextProcessor` with `PatternPairAggregator` |
| Story flow / page management | `StoryOrchestratorProcessor` |
| Image generation | `StoryImageProcessor` (or existing Imagen integration) |
| Narration timing | `StorySyncObserver` (observes `TTSAudioEndFrame`) |
| UI updates | Transport image frames |
✅ Benefits of Modernization
- Educational Value: Demonstrates recommended Pipecat patterns for multimodal agents
- Extensibility: Easy to add features like:
  - Multiple story paths with user choice
  - Background music synchronized with scene changes
  - Video generation (future)
  - Story progress UI
- Maintainability: Clear separation of concerns makes debugging easier
- Consistency: Aligns with patterns in newer examples (`local-smart-turn`, `simple-chatbot`)
- Performance: Async image generation doesn't block narration pipeline
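The performance point can be demonstrated with a small, self-contained sketch (the `generate_image` stub and prompt are invented stand-ins for the Imagen call): generation is kicked off as a background task and only awaited when the result is actually needed, so narration frames keep flowing in the meantime:

```python
import asyncio

async def generate_image(prompt: str) -> bytes:
    """Stand-in for an Imagen call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"image-bytes:{prompt}".encode()

async def main() -> tuple[list[str], bytes]:
    # Kick off generation without awaiting it, so narration keeps flowing.
    image_task = asyncio.create_task(generate_image("a red dragon at dusk"))

    narrated: list[str] = []
    for sentence in ["The dragon circled the tower.", "Then it landed."]:
        narrated.append(sentence)  # narration work continues unblocked

    image = await image_task  # awaited only when the page is displayed
    return narrated, image

narrated, image = asyncio.run(main())
```

The design choice here is simply `asyncio.create_task` plus a deferred `await`, which is how a processor can avoid stalling the pipeline on a slow external call.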
📝 Implementation Checklist
If this proposal is approved, I'm happy to contribute a PR with:
- Prompt updates introducing `<narration>` and `<image_prompt>` tags
- `PatternPairAggregator` configuration for segmentation
- `StoryOrchestratorProcessor` for flow control
- `StorySyncObserver` for narration/image synchronization
🤔 Questions for Maintainers
- Should this be a refactor of the existing storytelling-chatbot, or a new example (e.g., `storytelling-chatbot-v2`)?
- Are there specific Pipecat patterns you'd like demonstrated in this example?
Additional Deliverables (if helpful)
I can also provide:
- ✔ Pipeline architecture diagram (ASCII or Mermaid)
- ✔ Step-by-step migration guide from old to new pattern
- ✔ Comparative example showing before/after code
This example has great potential to showcase Pipecat's multimodal capabilities. Modernizing it would provide a clearer learning path for developers building similar experiences.