Skip to content

🔄 Modernize storytelling-chatbot Example with Latest Pipecat Patterns #126

@sohampirale

Description

@sohampirale

Context

I've been studying Pipecat through the official examples, and the storytelling-chatbot example stands out as a compelling concept for demonstrating multimodal AI capabilities. However, based on reviewing the codebase and comparing it to newer examples like local-smart-turn, word-wrangler-gemini-live, and the foundational examples in the main Pipecat repo, the storytelling example appears to use patterns that could benefit from modernization.

The storytelling example is particularly valuable because it demonstrates:

  • Multi-turn interactive narratives
  • Image generation synchronized with narration
  • Voice-driven user input for "choose your own adventure" experiences
  • Integration of Gemini 2.0 LLM with Google Imagen

However, the implementation could leverage more recent Pipecat architectural patterns to improve clarity and extensibility.


🎯 Proposed Modernization

1. Structured Text Segmentation with PatternPairAggregator + LLMTextProcessor

Current approach (assumed): Text parsing likely uses manual regex or string splitting to separate narration from image prompts.

Modern approach: Use PatternPairAggregator with XML-style tags for clean segmentation:

from pipecat.processors.llm_text_processor import LLMTextProcessor
from pipecat.utils.text.pattern_pair_aggregator import PatternPairAggregator, MatchAction

# Configure pattern aggregator for story segments
pattern_aggregator = PatternPairAggregator()

# Define patterns for different content types
pattern_aggregator.add_pattern(
    type="narration",
    start_pattern="<narration>",
    end_pattern="</narration>",
    action=MatchAction.AGGREGATE
)

pattern_aggregator.add_pattern(
    type="image_prompt",
    start_pattern="<image_prompt>",
    end_pattern="</image_prompt>",
    action=MatchAction.AGGREGATE
)

# Create processor to segment LLM output
llm_text_processor = LLMTextProcessor(text_aggregator=pattern_aggregator)

Benefits:

  • No manual regex parsing
  • Clear separation between narration and image generation instructions
  • Structured metadata attached to each segment
  • Easier to extend with additional segment types (e.g., sound effects, scene transitions)

2. Dedicated Orchestration with StoryOrchestratorProcessor

Proposal: Create a custom processor that manages story flow and coordinates multimodal outputs.

class StoryOrchestratorProcessor(FrameProcessor):
    """Coordinates story pages, image generation, and narration sequencing."""
    
    def __init__(self, image_generator):
        super().__init__()
        self._current_page = 0
        self._pages = []
        self._image_generator = image_generator
        
    async def process_frame(self, frame, direction):
        # Handle narration segments
        if isinstance(frame, AggregatedTextFrame) and frame.aggregated_by == "narration":
            await self._queue_narration(frame.text)
            
        # Handle image prompts
        elif isinstance(frame, AggregatedTextFrame) and frame.aggregated_by == "image_prompt":
            await self._generate_and_queue_image(frame.text)
            
        await self.push_frame(frame, direction)

Responsibilities:

  • Maintain story state (current page, history)
  • Trigger image generation for each <image_prompt> segment
  • Forward narration frames to TTS pipeline
  • Handle user input for story choices

This separates orchestration logic from parsing logic, making the pipeline more modular.


3. Synchronize Narration & Images with Frame Observers

Current approach (assumed): Timing may be implicit or based on delays.

Modern approach: Use frame observers to detect narration completion:

class StorySyncObserver(FrameObserver):
    """Observes TTS completion to trigger next story page."""
    
    def __init__(self, orchestrator):
        self._orchestrator = orchestrator
        
    async def on_frame(self, frame):
        if isinstance(frame, TTSAudioEndFrame):
            # Narration finished, show next image
            await self._orchestrator.advance_to_next_page()

Flow:

  1. Narration 1 plays → TTSAudioEndFrame → Show Image 2 → Play Narration 2
  2. Narration 2 plays → TTSAudioEndFrame → Show Image 3 → Play Narration 3

This creates deterministic, page-by-page storytelling without timing hacks.


4. Clear Component Separation

Responsibility Component
Text segmentation LLMTextProcessor with PatternPairAggregator
Story flow / page management StoryOrchestratorProcessor
Image generation StoryImageProcessor (or existing Imagen integration)
Narration timing StorySyncObserver (observes TTSAudioEndFrame)
UI updates Transport image frames

✅ Benefits of Modernization

  1. Educational Value: Demonstrates recommended Pipecat patterns for multimodal agents
  2. Extensibility: Easy to add features like:
    • Multiple story paths with user choice
    • Background music synchronized with scene changes
    • Video generation (future)
    • Story progress UI
  3. Maintainability: Clear separation of concerns makes debugging easier
  4. Consistency: Aligns with patterns in newer examples (local-smart-turn, simple-chatbot)
  5. Performance: Async image generation doesn't block narration pipeline

📝 Implementation Checklist

If this proposal is approved, I'm happy to contribute a PR with:

  • Updated LLM prompt to output <narration> and <image_prompt> tags
  • PatternPairAggregator configuration for segmentation
  • StoryOrchestratorProcessor for flow control
  • StorySyncObserver for narration/image synchronization
  • Updated README with architecture explanation
  • Frontend updates to display incoming image frames
  • Example demonstrating the pattern

🤔 Questions for Maintainers

  1. Should this be a refactor of the existing storytelling-chatbot, or a new example (e.g., storytelling-chatbot-v2)?
  2. Are there specific Pipecat patterns you'd like demonstrated in this example?

Additional Deliverables (if helpful)

I can also provide:

  • ✔ Pipeline architecture diagram (ASCII or Mermaid)
  • ✔ Step-by-step migration guide from old to new pattern
  • ✔ Comparative example showing before/after code

This example has great potential to showcase Pipecat's multimodal capabilities. Modernizing it would provide a clearer learning path for developers building similar experiences.


References:

Metadata

Metadata

Assignees

No one assigned

    Labels

    help wantedLooking for someone to take on this issue

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions