Retrieving accurate answers from large document collections is often slow and imprecise. Traditional retrieval-augmented generation (RAG) pipelines face challenges such as:
- High latency due to repeated context retrieval for similar queries
- Inaccurate answers caused by irrelevant or outdated context
- Increased computational load and database queries for frequently asked questions
These issues make it difficult to deliver fast and reliable answers in real-world applications where users may ask similar or related questions repeatedly.
CRag solves these problems by integrating context caching into the RAG pipeline. The key features of CRag are:
- Context Caching: Stores relevant context chunks for previously asked or similar queries.
- Cache Lookup: Checks the cache before querying the vector database, reducing redundant retrievals.
- Efficient Answer Generation: Uses cached context to generate answers quickly and accurately.
- Automatic Cache Update: Updates the cache with new context when a query is not found, ensuring freshness and relevance.
This approach significantly reduces response times, improves answer accuracy, and lowers the load on the vector database.
Below is the workflow for CRag:

[Image placeholder: add your workflow flow chart here.]
Workflow Steps:
- User Query: The user submits a question.
- Cache Lookup: The system checks if a similar query's context exists in the cache.
- Context Retrieval:
- If cached, use the stored context.
- If not cached, retrieve context from the vector database.
- Answer Generation: The RAG pipeline generates an answer using the retrieved context.
- Cache Update: If new context was retrieved, store it in the cache for future queries.
CRag uses a cache (in-memory or persistent) to store context chunks for queries. When a new query arrives, the system:
- Receives Query: User submits a question.
- Checks Cache: Looks for a matching or similar query in the cache.
- Retrieves Context:
- If found, uses cached context.
- If not found, fetches context from the vector database.
- Generates Answer: Uses the RAG pipeline to generate an answer from the context.
- Updates Cache: Stores new context for future queries, improving speed and accuracy for repeated questions.
Benefits:
- Faster response times: Answers are generated quickly for repeated or similar queries.
- Improved answer accuracy: Cached context ensures relevant information is used.
- Reduced database load: Fewer queries to the vector database, lowering computational costs.
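The cache-first flow described above can be sketched in a few lines. This is an illustrative sketch, not the project's implementation: `ContextCache`, `retrieve_context`, and `generate_answer` are hypothetical stand-ins, and the `difflib` fuzzy match is just one simple way to treat "similar" queries.

```python
from difflib import SequenceMatcher

class ContextCache:
    """Minimal in-memory context cache keyed by query text (illustrative only)."""

    def __init__(self, similarity_threshold=0.9):
        self.store = {}  # query text -> context chunks
        self.threshold = similarity_threshold

    def lookup(self, query):
        # Return cached context for an identical or sufficiently similar query.
        for cached_query, context in self.store.items():
            ratio = SequenceMatcher(None, query.lower(), cached_query.lower()).ratio()
            if ratio >= self.threshold:
                return context
        return None

    def update(self, query, context):
        self.store[query] = context

def answer(query, cache, retrieve_context, generate_answer):
    context = cache.lookup(query)           # 1. check the cache first
    if context is None:
        context = retrieve_context(query)   # 2. fall back to the vector database
        cache.update(query, context)        # 3. store the context for future queries
    return generate_answer(query, context)  # 4. generate the answer from context
```

A repeated or near-duplicate query then skips the vector-database round trip entirely, which is where the latency and load savings come from.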
CRag supports enabling and disabling the context caching functionality through environment variables. This gives you control over when to use caching vs. direct processing.
The cache can be controlled using the CACHE_ENABLED environment variable:
Enable Cache (Default):

```shell
CACHE_ENABLED=true   # or 1, yes, on (case-insensitive)
```

Disable Cache:

```shell
CACHE_ENABLED=false  # or 0, no, off (case-insensitive)
```

CRag now includes advanced async and parallel processing capabilities for faster document ingestion, based on LlamaIndex's IngestionPipeline with parallel execution.
- Sequential Processing: Traditional single-threaded processing
- Parallel Processing: Multi-worker parallel processing using ProcessPoolExecutor
- Async Processing: Asynchronous processing with parallel workers
- Async Batch Processing: Process multiple documents concurrently
Control parallel processing behavior with these environment variables:
```shell
# Enable/disable parallel processing (default: true)
PARALLEL_PROCESSING_ENABLED=true

# Number of parallel workers per document (default: 4)
MAX_PARALLEL_WORKERS=4

# Maximum concurrent documents in batch processing (default: 2)
MAX_CONCURRENT_DOCUMENTS=2

# Enable/disable async processing (default: true)
ASYNC_PROCESSING_ENABLED=true
```

Based on LlamaIndex benchmarks, the performance hierarchy is typically:
- Async + Parallel: Fastest for large workloads
- Async Only: Good for I/O bound operations
- Parallel Only: Good for CPU bound operations
- Sequential: Slowest but most stable
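For intuition, the sequential, parallel, and async-batch modes can be sketched with Python's standard library. This is a generic illustration, not CRag's actual code: `chunk_and_embed` is a placeholder for the real per-document work, and a thread pool stands in for the `ProcessPoolExecutor` the project uses, to keep the sketch portable.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def chunk_and_embed(doc):
    # Placeholder for the heavy per-document work (chunking + embedding).
    return f"processed:{doc}"

def process_sequential(docs):
    # Sequential: one document at a time, single-threaded.
    return [chunk_and_embed(d) for d in docs]

def process_parallel(docs, num_workers=4):
    # Parallel: fan work out across workers (mirrors MAX_PARALLEL_WORKERS).
    # CPU-bound work would use ProcessPoolExecutor instead of threads.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(chunk_and_embed, docs))

async def process_async_batch(docs, max_concurrent=2):
    # Async batch: bound concurrency with a semaphore
    # (mirrors MAX_CONCURRENT_DOCUMENTS).
    sem = asyncio.Semaphore(max_concurrent)

    async def one(doc):
        async with sem:
            return await asyncio.to_thread(chunk_and_embed, doc)

    return await asyncio.gather(*(one(d) for d in docs))
```

All three return the same results; they differ only in how the per-document work is scheduled, which is why the fastest mode depends on whether the workload is I/O-bound or CPU-bound.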
Add these configurations to your .env file:
```shell
# Cache Configuration
CACHE_ENABLED=true

# Parallel Processing Configuration
PARALLEL_PROCESSING_ENABLED=true
MAX_PARALLEL_WORKERS=4
MAX_CONCURRENT_DOCUMENTS=2
ASYNC_PROCESSING_ENABLED=true

# Other required variables
GEMINI_API_KEY=your_gemini_api_key_here
HACKRX_TOKEN=your_hackrx_token_here
LLAMA_CLOUD_API_KEY=your_llama_cloud_api_key_here

# Optional directories
DATA_DIR=data/
TEMP_DIR=temp/
CHROMA_PERSIST_DIR=chroma_db/
```

Single Document Processing:
```python
from utils.embeddings_chunking import process_document_for_rag, process_document_for_rag_async

# Using config defaults
success = process_document_for_rag(document_url)

# Explicit parallel processing
success = process_document_for_rag(document_url, use_parallel=True, num_workers=4)

# Async processing (call from within an async function)
success = await process_document_for_rag_async(document_url, num_workers=4)
```

Batch Document Processing:
```python
from utils.embeddings_chunking import process_documents_batch, process_multiple_documents_async

# Batch processing with config defaults
results = process_documents_batch(document_urls)

# Batch processing with explicit async workers
results = process_documents_batch(document_urls, use_async=True, num_workers=4)

# Advanced async batch processing (call from within an async function)
results = await process_multiple_documents_async(document_urls, num_workers=4)
```

Test the different processing modes:
```shell
# Test parallel processing performance
python test_parallel_processing.py

# Test cache toggle functionality
python test_cache_toggle.py
```

When Cache is ENABLED (CACHE_ENABLED=true):
- Documents are processed and stored in Gemini's context cache
- Questions are answered using cached document context
- Faster response times for repeated queries
- Cache mappings are saved/loaded for persistence across sessions
- API endpoints show cache status and allow cache management
When Cache is DISABLED (CACHE_ENABLED=false):
- Documents are processed directly without caching
- Each question triggers a fresh API call to Gemini
- No cache files are created or loaded
- Cache management functions are bypassed
- Useful for testing, debugging, or avoiding cache-related issues
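The case-insensitive truthy parsing described for CACHE_ENABLED (true/1/yes/on) can be implemented with a small helper. This is a sketch of one plausible approach; `env_flag` is a hypothetical name, not necessarily the project's actual function.

```python
import os

# Values accepted as "enabled", compared case-insensitively.
TRUTHY = {"true", "1", "yes", "on"}

def env_flag(name, default=True):
    """Parse a boolean environment variable; unset falls back to the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY
```

Anything outside the truthy set (false, 0, no, off, or any other string) disables the flag, while leaving the variable unset keeps the documented default of enabled.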
When cache is enabled, the following endpoints are available:
- `GET /hackrx/cache/status`: Get cache status and configuration
- `DELETE /hackrx/cache/{document_hash}`: Delete a specific document's cache
- `DELETE /hackrx/cache/all`: Clear all caches
For more details, see the code and documentation in this repository.
Follow these steps to run CRag locally:
1. Clone the repository:

   ```shell
   git clone https://github.com/shreesha345/Rag_system.git
   cd Rag_system
   ```

2. Install dependencies:

   Make sure you have Python 3.12+ installed, then use uv for fast dependency management:

   ```shell
   uv sync
   ```

   This installs all dependencies specified in `pyproject.toml` and `uv.lock`.

3. Start the application:

   ```shell
   uv run python main.py
   ```

4. Add your workflow image: place your workflow flow chart image in the project directory and link it in the [Image] section above.
For troubleshooting or more details, refer to the code and comments in this repository.