A comprehensive RAG (Retrieval-Augmented Generation) application that helps generate unique and compelling talk proposals for cloud-native conferences by combining historical KubeCon talk data with real-time web research.
The application is built as a four-stage pipeline:
- Data Collection - Extract and crawl KubeCon talk URLs
- Data Processing - Parse and structure talk information
- Vector Storage - Generate embeddings and store in Couchbase
- RAG Application - Combine historical data with real-time research for intelligent suggestions
Prerequisites:
- Python 3.8+
- Couchbase Server with Vector Search capabilities
- OpenAI-compatible API access (Nebius AI)
- Environment variables configured (see .env setup below)
Purpose: Extract all KubeCon talk URLs from conference schedule pages.
# Save the KubeCon schedule HTML to a file, then run:
python extract_events.py < schedule.html
What it does:
- Parses HTML content from stdin
- Extracts all event URLs with pattern event/
- Merges with existing URLs in event_urls.txt
- Outputs the count of new URLs discovered
Output: event_urls.txt - Contains all unique talk URLs
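The extraction and merge steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of extract_events.py: the regular expression, the `base_url` parameter, and the function names are assumptions, and the real script may parse the HTML differently.

```python
import re
import sys
from pathlib import Path
from typing import Iterable, List, Set, Tuple
from urllib.parse import urljoin

# Assumption: talk links appear as href attributes containing "event/".
EVENT_HREF = re.compile(r'href=["\']([^"\']*event/[^"\']+)["\']')


def extract_event_urls(html: str, base_url: str = "") -> Set[str]:
    """Return every href containing 'event/', resolved against base_url."""
    return {urljoin(base_url, href) for href in EVENT_HREF.findall(html)}


def merge_urls(existing: Iterable[str], found: Iterable[str]) -> Tuple[List[str], int]:
    """Merge newly found URLs into the known ones; return (all URLs, new count)."""
    known = {line.strip() for line in existing if line.strip()}
    new = set(found) - known
    return sorted(known | new), len(new)


def main(url_file: str = "event_urls.txt") -> None:
    """Read HTML from stdin, update the URL file, and report new URLs."""
    path = Path(url_file)
    existing = path.read_text().splitlines() if path.exists() else []
    merged, added = merge_urls(existing, extract_event_urls(sys.stdin.read()))
    path.write_text("\n".join(merged) + "\n")
    print(f"{added} new URLs discovered")
```

Keeping the merge as a set operation makes repeated runs idempotent: feeding the same schedule page twice reports zero new URLs the second time.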
Purpose: Crawl individual talk pages and extract structured information.
python couchbase_utils.py
What it does:
- Reads URLs from event_urls.txt
- Uses AsyncWebCrawler to fetch talk pages in batches
- Extracts structured data:
  - Title
  - Description
  - Speaker(s)
  - Category
  - Date
  - Location
- Stores directly to Couchbase with document keys like talk_<event_id>
Features:
- Batch processing (5 URLs at a time)
- Error handling and retry logic
- Progress tracking with success/failure counts
- Automatic document key generation
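The batching, retry, and key-generation behaviour listed above might look something like the sketch below. It is written under several assumptions: `fetch` is a hypothetical coroutine standing in for the AsyncWebCrawler call, the event ID is taken to be the path segment right after `event`, and the retry count and backoff delays are illustrative values rather than the script's actual settings.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterator, List, Optional, Tuple
from urllib.parse import urlparse


def document_key(url: str) -> str:
    """Build 'talk_<event_id>' (assumes the ID follows the 'event' path segment)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return f"talk_{parts[parts.index('event') + 1]}"


def batches(items: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[Any]],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[Any]:
    """Call fetch(url), retrying with exponential backoff; None if all attempts fail."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt < retries - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    return None


async def crawl_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[Any]],
    batch_size: int = 5,
    retries: int = 3,
    base_delay: float = 1.0,
) -> Tuple[Dict[str, Any], int]:
    """Crawl URLs batch by batch; return ({document key: page data}, failure count)."""
    results: Dict[str, Any] = {}
    failed = 0
    for batch in batches(urls, batch_size):
        pages = await asyncio.gather(
            *(fetch_with_retry(fetch, u, retries, base_delay) for u in batch)
        )
        for url, page in zip(batch, pages):
            if page is None:
                failed += 1
            else:
                results[document_key(url)] = page
        print(f"progress: {len(results)} succeeded, {failed} failed")
    return results, failed
```

Running each batch through `asyncio.gather` keeps at most five requests in flight at a time, which bounds the load on the schedule site while still overlapping network waits.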
Purpose: Alternative approach to store pre-crawled talk data from JSON files.
python crawl_talks.py
What it does:
- Reads from talk_results1.json
- Processes and stores talks in Couchbase
- Handles document conflicts with upsert operations
- Adds crawling timestamps
Use Case: When you have pre-existing JSON data to import
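A rough sketch of such an import is shown below. The structure of talk_results1.json is not documented here, so the `event_id` field, the `crawled_at` field name, and the positional fallback key are all assumptions. `collection` is expected to be a Couchbase SDK collection (or any object with a compatible `upsert(key, value)` method).

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def load_talks(path: str = "talk_results1.json") -> List[Dict[str, Any]]:
    """Load talks from a JSON file holding either a list or a single object."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data if isinstance(data, list) else [data]


def import_talks(talks: List[Dict[str, Any]], collection: Any) -> int:
    """Upsert each talk with a timestamp; return the number of documents written."""
    stored = 0
    for position, talk in enumerate(talks):
        # Assumption: each record carries an "event_id"; otherwise use its position.
        key = f"talk_{talk.get('event_id', position)}"
        doc = dict(talk)
        # Assumption: the timestamp field is called "crawled_at".
        doc["crawled_at"] = datetime.now(timezone.utc).isoformat()
        # upsert overwrites an existing document with the same key instead of failing.
        collection.upsert(key, doc)
        stored += 1
    return stored
```

Because `upsert` replaces any existing document under the same key, re-running the import after a partial failure does not raise conflicts; the later record simply wins.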
Purpose: Generate vector embeddings for semantic search capabilities.
python embeddinggeneration.py
What it does:
- Queries all documents from Couchbase
- Combines title, description, and category into searchable text
- Generates embeddings using the intfloat/e5-mistral-7b-instruct model
- Updates documents with embedding vectors
- Enables vector search functionality
Model: Uses Nebius AI's embedding endpoint for high-quality vectors
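The text-combination and embedding steps can be sketched as follows, assuming the `openai` Python client pointed at the Nebius endpoint. The function names, the newline separator, and the batch size of 16 are illustrative choices, not taken from embeddinggeneration.py.

```python
from typing import Any, Dict, List

MODEL = "intfloat/e5-mistral-7b-instruct"


def build_embedding_text(doc: Dict[str, Any]) -> str:
    """Join title, description, and category, skipping missing or empty fields."""
    fields = (doc.get("title"), doc.get("description"), doc.get("category"))
    return "\n".join(str(f).strip() for f in fields if f)


def embed_documents(client: Any, docs: List[Dict[str, Any]], batch_size: int = 16) -> List[List[float]]:
    """Return one embedding per document, requesting them in batches.

    `client` is expected to behave like openai.OpenAI, for example:
        OpenAI(api_key=os.environ["NEBIUS_API_KEY"],
               base_url=os.environ["NEBIUS_API_BASE"])
    """
    vectors: List[List[float]] = []
    for start in range(0, len(docs), batch_size):
        texts = [build_embedding_text(d) for d in docs[start:start + batch_size]]
        response = client.embeddings.create(model=MODEL, input=texts)
        # Sort by index so vectors line up with the input order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

Each returned vector would then be written back to its document under the `embedding` field, which is the field name the search index definition expects.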
Purpose: Interactive Streamlit application for generating talk proposals.
streamlit run kubecon-talk-agent/talk_suggestions_app.py
Core Features:
- Historical Context: Vector search through stored KubeCon talks
- Real-time Context: Web research via ADK (Agent Development Kit)
- Research Phase: ADK agent researches current trends
- Retrieval Phase: Vector search finds similar historical talks
- Synthesis Phase: LLM combines both contexts for unique proposals
- Avoids duplicating existing talks
- Incorporates latest industry trends
- Focuses on end-user perspectives
- Provides structured output with learning objectives
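To make the synthesis phase concrete, here is one way the two contexts could be merged into a single LLM prompt. This is only a sketch: the function name, the prompt wording, and the 200-character description cut-off are assumptions, and the actual app may structure its prompt quite differently.

```python
from typing import Any, Dict, List


def build_synthesis_prompt(
    query: str,
    research_notes: str,
    similar_talks: List[Dict[str, Any]],
) -> str:
    """Combine web research and retrieved historical talks into one prompt."""
    if similar_talks:
        history = "\n".join(
            f"- {t.get('title', 'Untitled')} "
            f"({t.get('category', 'uncategorized')}): "
            f"{str(t.get('description', ''))[:200]}"
            for t in similar_talks
        )
    else:
        history = "(no similar talks found)"
    return (
        f"Topic: {query}\n\n"
        f"Current trends from web research:\n{research_notes}\n\n"
        f"Previously presented KubeCon talks (do not duplicate these):\n{history}\n\n"
        "Propose one new talk that builds on the trends above, avoids overlap "
        "with the listed talks, and takes an end-user perspective. "
        "Return a title, an abstract, and three learning objectives."
    )
```

Listing the retrieved talks explicitly as material to avoid is what lets the model steer away from duplicates, while the research notes supply the recency that the stored data lacks.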
Create a .env file with the following variables:
# Couchbase Configuration
CB_CONNECTION_STRING=couchbase://your-cluster-url
CB_USERNAME=your-username
CB_PASSWORD=your-password
CB_BUCKET=kubecon-talks
CB_COLLECTION=talks
CB_SEARCH_INDEX=kubecontalks
# AI APIs
NEBIUS_API_KEY=your-nebius-api-key
NEBIUS_API_BASE=https://api.tokenfactory.nebius.com/v1
OPENAI_API_KEY=your-openai-key # Optional fallback
-- Create bucket
CREATE BUCKET `kubecon-talks`;
-- Create collection (if not using default)
CREATE COLLECTION `kubecon-talks`.`_default`.`talks`;
{
  "name": "kubecontalks",
  "type": "fulltext-index",
  "params": {
    "mapping": {
      "default_mapping": {
        "enabled": false
      },
      "type_field": "_type",
      "types": {
        "_default": {
          "enabled": true,
          "dynamic": true,
          "properties": {
            "embedding": {
              "enabled": true,
              "dynamic": false,
              "fields": [
                {
                  "name": "embedding",
                  "type": "vector",
                  "dims": 4096,
                  "similarity": "dot_product"
                }
              ]
            }
          }
        }
      }
    }
  },
  "sourceType": "gocbcore",
  "sourceName": "kubecon-talks"
}
Query: "OpenTelemetry distributed tracing in microservices"
The system will:
- Research current OpenTelemetry discussions online
- Find similar historical talks in the database
- Generate a unique proposal that builds on existing knowledge
Other example topics:
- Emerging Technologies: "WebAssembly in Kubernetes workloads"
- Implementation Stories: "Migration from Prometheus to OpenTelemetry"
- Best Practices: "Cost optimization strategies for multi-cloud Kubernetes"
kubecontalksagent/
├── extract_events.py # URL extraction from HTML
├── couchbase_utils.py # Main crawling and storage logic
├── crawl_talks.py # JSON-to-Couchbase import
├── embeddinggeneration.py # Vector embedding generation
├── kubecon-talk-agent/
│ ├── talk_suggestions_app.py # Main Streamlit RAG app
│ └── adk_research_agent.py # Web research agent
├── event_urls.txt # Extracted URLs (generated)
├── talk_results1.json # Optional: pre-crawled data
└── .env # Environment configuration
- Model: intfloat/e5-mistral-7b-instruct (4096 dimensions)
- Similarity: Dot product
- Search Strategy: Combines text matching with vector similarity
- Connection timeouts and retries
- Graceful degradation when services are unavailable
- Comprehensive logging and user feedback
- Batch processing for crawling
- Connection pooling for Couchbase
- Async operations where possible
- Configurable timeouts
If vector search returns no results:
- Ensure the search index is created and built
- Verify embedding dimensions match (4096)
- Check Couchbase FTS service is running
If embedding generation is slow:
- Consider using a local embedding model
- Implement caching for repeated queries
- Use batch embedding generation
If connections time out:
- Increase timeout values in environment
- Check network connectivity to Couchbase
- Verify credentials and permissions
- Support for multiple conference data sources
- Real-time talk trend analysis
- Speaker recommendation system
- Integration with CFP (Call for Papers) platforms
- Multi-language support for international conferences
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
For questions and support, please open an issue in the GitHub repository or contact the maintainers.
Built with ❤️ for the Cloud Native Community