A retrieval-augmented generation (RAG) system engineered for high-accuracy document analysis with source verification. Unlike standard "Chat with PDF" tutorials, this project implements a Hybrid Architecture that decouples the embedding layer (Local CPU) from the inference layer (Groq Cloud) to optimize for both latency and cost.
This project moves beyond simple API wrapping by implementing a cost-efficient, high-performance pipeline:
- Agentic "Router" Architecture: The system intelligently decides whether to answer from the local PDFs or search the web.
- Internal Questions: Routed to the Vector DB for verified citations.
- External Questions: Routed to DuckDuckGo for real-time data (e.g., stock prices, news).
- Smart Query Refinement: Before searching the web, the system analyzes the PDF context to refine the search query (e.g., converting "stock price of this company" -> "Microsoft stock price" based on document context).
- Zero-Hallucination Protocol: Strict adherence to context. If information is missing in the PDF, it triggers a web search instead of fabricating facts.
- Hybrid Compute Model:
  - Embeddings: Calculated locally (`all-MiniLM-L6-v2`) for privacy and zero cost.
  - Inference: Offloaded to Groq LPUs for sub-second generation.
I chose this specific stack to balance performance, privacy, and cost.
| Component | Technology | Engineering Decision (Why?) |
|---|---|---|
| Orchestration | LangChain | Provides a unified interface to swap components (e.g., changing Vector DBs) without rewriting core logic. |
| LLM Inference | Groq (Llama 3.1) | Chosen over GPT-4 for speed. Groq's LPU architecture delivers inference 10x faster than standard GPUs, crucial for real-time chat. |
| Embeddings | HuggingFace (Local) | Running all-MiniLM-L6-v2 locally removes the dependency on paid APIs for document processing. It runs efficiently on standard CPUs. |
| Vector DB | ChromaDB | Selected for its ability to run in-memory for development, avoiding the latency and complexity of cloud-based vector stores like Pinecone. |
| Chunking | RecursiveCharacter | Instead of hard splits, this respects sentence boundaries and semantic flow, reducing "context fragmentation" errors. |
| Web Search | DuckDuckGo | Selected for its privacy-focused, API-key-free access, enabling the system to fetch real-time external data without extra costs. |
| User Interface | Streamlit | Provides a clean, interactive chat interface with drag-and-drop file support, making the tool accessible to non-engineers. |
These are the questions I asked myself during the design process.
- Why use RAG instead of Fine-Tuning?
  - Fine-tuning teaches a model a behavior (style/format), but it is static. If the document changes, I would have to retrain the model (expensive). RAG allows for dynamic knowledge updates: I simply update the Vector Database, and the model knows the new information immediately.
- Why decouple Embeddings from the LLM?
  - By using a local embedding model (HuggingFace) and a cloud LLM (Groq), I created a Vendor-Agnostic architecture. If Groq becomes too expensive, I can swap the LLM to OpenAI or Anthropic without having to re-index all my documents (since the embeddings are local and independent).
- How do you handle "Lost in the Middle" issues?
  - I implemented `chunk_overlap=200`. When splitting text, vital context often exists at the boundaries of chunks. Overlapping ensures that semantic meaning flows across splits, improving retrieval accuracy for queries that span multiple paragraphs.
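The effect of overlap is easy to see with a minimal character-level splitter. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter` (which additionally recurses over separators such as paragraphs and sentences); the function name is illustrative:

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 200) -> list[str]:
    """Naive fixed-width splitter: each chunk repeats the last
    `chunk_overlap` characters of the previous one, so meaning that
    straddles a boundary appears intact in at least one chunk."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghij" * 100, chunk_size=500, chunk_overlap=200)
# Adjacent chunks share a 200-character window.
assert chunks[0][-200:] == chunks[1][:200]
```

With the pipeline's settings (`chunk_size=500`, `chunk_overlap=200`), consecutive chunks share 40% of their text, so a sentence cut by one boundary survives whole in a neighboring chunk.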
Real-world problems encountered during development and how they were solved.
- The "Parrot" Problem (Instruction Drift)
  - When using the smaller Llama-3-8b model with large context windows, the model would often get "overwhelmed" and start repeating the input text verbatim instead of answering the question.
  - Solution:
    - Prompt Hardening: I engineered a stricter System Prompt that explicitly forbids regurgitation ("Do NOT just repeat context. Extract specific answers.").
    - Noise Reduction: I lowered the retrieval `k` value from 4 to 2. This feeds the LLM less "noise," forcing it to focus only on the most critical paragraphs.
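A hardened prompt along these lines is sketched below. Only the quoted anti-regurgitation rule comes from the text above; the remaining wording and the `SEARCH_WEB` escape token (consumed by the router pattern described under Small Model Tool Calling) are assumptions:

```python
# Illustrative hardened system prompt -- the exact wording in rag_groq.py may differ.
SYSTEM_PROMPT = """You are a precise document-analysis assistant.
Rules:
1. Answer ONLY from the provided context.
2. Do NOT just repeat context. Extract specific answers.
3. If the context does not contain the answer, reply with the single token SEARCH_WEB.
4. Cite the source passage for every claim."""

def build_prompt(context: str, question: str) -> str:
    """Assemble the final prompt: system rules + the k=2 retrieved chunks + question."""
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```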
- Model Deprecation (API Volatility)
  - During development, Groq deprecated the original Llama 3 model, breaking the pipeline with a `400 Model Decommissioned` error.
  - Solution:
    - Refactored the code to use `llama-3.1-8b-instant`.
    - Abstracted the model name into a configuration variable to make future upgrades seamless without touching core logic.
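That configuration variable can be as simple as a module-level constant with an environment override (the `GROQ_MODEL` env var name is an assumption):

```python
import os

# Single source of truth for the model name. When Groq deprecates a model,
# this line (or the env var) is the only thing that changes -- core logic stays untouched.
GROQ_MODEL = os.environ.get("GROQ_MODEL", "llama-3.1-8b-instant")
```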
- The "Rate Limit" Wall (Groq Free Tier)
  - The standard RAG pipeline with large PDF chunks quickly hit Groq's 6,000 TPM (Tokens Per Minute) limit, causing 413 errors.
  - Solution: I implemented an Optimized Ingestion Strategy:
    - Reduced chunk size from 1,000 to 500 characters.
    - Lowered retrieval count (`k`) from 3 to 2.
    - Added a hard truncation limit (2,000 chars) on tool outputs to guarantee API stability.
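The output cap can be sketched as a guard applied to every tool result before it enters the prompt. Only the 2,000-character figure comes from the text above; the function name and truncation marker are assumptions:

```python
MAX_TOOL_OUTPUT_CHARS = 2000  # keeps total prompt tokens safely under Groq's 6,000 TPM cap

def truncate_tool_output(output: str, limit: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Hard cap on tool output so a single oversized web result cannot trigger a 413."""
    if len(output) <= limit:
        return output
    return output[:limit] + "\n[...truncated]"
```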
- The "Ambiguous Search" Problem
  - When asking "What is the stock price of this company?", the web search tool would fail because it didn't know which company the PDF was talking about.
  - Solution: I built a Query Refinement Step. Before searching, the LLM reads the PDF context, extracts the entity name (e.g., "Microsoft"), and rewrites the search query (e.g., "Microsoft current stock price") for high-precision results.
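The refinement step boils down to one extra LLM call with a prompt like the one below. The prompt wording and the 1,000-character excerpt cap are assumptions; `llm.invoke` and `ddg_search.run` in the usage comment stand in for the real LLM and search tool:

```python
def build_refinement_prompt(pdf_context: str, user_query: str) -> str:
    """Ask the LLM to resolve vague references ("this company") against the
    PDF context before the query is sent to DuckDuckGo."""
    return (
        "Using the document excerpt below, rewrite the user's web search query "
        "so it is self-contained. Replace vague references (e.g., 'this company') "
        "with the concrete entity named in the excerpt. Reply with the query only.\n\n"
        f"Excerpt:\n{pdf_context[:1000]}\n\n"
        f"User query: {user_query}\n"
        "Refined query:"
    )

# Usage (illustrative):
# refined = llm.invoke(build_refinement_prompt(context, "stock price of this company"))
# results = ddg_search.run(refined)   # e.g. searches "Microsoft current stock price"
```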
- Small Model Tool Calling
  - Llama-3-8b often struggled with strict JSON function calling, leading to parsing errors.
  - Solution: I switched to a Router Pattern using robust Python logic (`if "SEARCH_WEB" in response`) instead of relying on the fragile native function-calling API. This increased system stability to near 100%.
- Python 3.10+
- A Groq API Key (https://console.groq.com)
- Clone the Repository

  ```bash
  git clone https://github.com/yourusername/verifiable-rag.git
  cd verifiable-rag
  ```

- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure API Key

  Open `rag_groq.py` and paste your API key:

  ```python
  os.environ["GROQ_API_KEY"] = "gsk_..."
  ```

- Run the App

  Run the full chat application with multi-PDF support:

  ```bash
  streamlit run app.py
  ```

Example questions to try:

- The Specific: "What is the kernel size in AlexNet?" (Retrieves from PDF)
- The Comparative: "Compare the depth of VGG-16 and ResNet-152." (Synthesizes multiple PDFs)
- The Agentic (Hybrid): "Who are the authors of the ResNet paper, and what is the current stock price of the company they worked for?" (Combines PDF authors with Web Search for stock price).
- Conversation Memory
  - Add a `ConversationBufferWindowMemory` to allow follow-up questions (currently stateless).
- Re-ranking
  - Implement a Cross-Encoder step (using Cohere or BAAI) to re-rank retrieved documents for higher precision before sending them to the LLM.
- Dockerization
  - Containerize the application for easier deployment across different environments.
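The conversation-memory enhancement above amounts to keeping only the last `k` exchanges in the prompt. A minimal pure-Python stand-in for what LangChain's `ConversationBufferWindowMemory(k=...)` would contribute (class and method names are mine):

```python
from collections import deque

class WindowMemory:
    """Keeps the last k (user, assistant) turns; older turns fall out of the
    window so the prompt stays small enough for the TPM budget."""

    def __init__(self, k: int = 3):
        self.turns = deque(maxlen=k)

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_context(self) -> str:
        """Render the remembered turns for injection into the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```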

