A sophisticated full-stack platform that ingests news from across the web, analyzes it using AI agents, and delivers personalized content through a modern web app and WhatsApp.
- ✨ Key Features
- 🏗️ Project Architecture
- 🛠️ Tech Stack
- 📚 Module Documentation
- 💾 Database Schema
- 🚀 How to Run the Project
- 📖 API Documentation
- 🔧 Development
- Dual Summarization: Professional factual summaries + child-friendly story formats
- Fact-Checking: Web-search verified claims with boolean verdicts
- Sentiment Analysis: Positive/Negative/Neutral classification with reasoning
- Named Entity Recognition: Extracts Person, Location, Organization entities
- Behavioral Analysis: Learns from user interactions and preferences
- Vector Similarity Matching: Cosine similarity-based article recommendations
- Real-time Updates: Dynamic preference profile generation
- Top 10 Curation: Delivers most relevant articles per user
- Modern Web App: React + TypeScript frontend with Tailwind CSS
- WhatsApp Integration: Scheduled personalized news delivery via Twilio
- Responsive Design: Optimized for all devices
- API Failover: Automatic Gemini → Groq switching on rate limits
- Deduplication: Smart article grouping to avoid redundant processing
- Hybrid Scoring: Combines AI analysis (60%) with user feedback (40%)
- Comprehensive Logging: MongoDB + local JSON file backups
The project is built on a decoupled monorepo structure, containing three main parts:
```text
NewsVerse/
├── 📱 frontend/              # React + TypeScript web application
├── ⚙️ backend/               # Python microservice pipeline
│   ├── Scraping_Crawling/
│   ├── Summarization/
│   ├── Fact_Checker/
│   ├── Sentiment_Analysis/
│   ├── Name_Entity_Recognition/
│   ├── Embedding_Creation/
│   ├── Article_Scorer/
│   ├── Recommendation_Engine/
│   └── Whatsapp_Messaging/
└── 🧪 Raw_code_developer/    # Development sandbox & experiments
```
```text
Web Sources → Scraping → Processing Pipeline → Embeddings → Scoring → Recommendations → Users
                 ↓             ↓                   ↓           ↓             ↓
              MongoDB      AI Agents          Vector DB    Quality      Personalized
                                                           Scores       Feed
```
| Area | Technology |
|---|---|
| Frontend | React 18+, TypeScript, Tailwind CSS, Vite, shadcn/ui |
| Backend | Python 3.10+, FastAPI, MongoDB, APScheduler |
| AI Framework | Agno Framework - Agent orchestration & LLM integration |
| AI / ML | LangChain, SentenceTransformers (all-MiniLM-L6-v2), scikit-learn |
| LLM Providers | Google Gemini (Primary), Groq (Failover via Agno) |
| Data Ingestion | Crawl4ai, BeautifulSoup, Requests |
| Messaging | Twilio API (WhatsApp) |
| Authentication | Google OAuth 2.0 |
| Vector Database | MongoDB (embedding storage) |
| Task Scheduling | APScheduler (CRON jobs) |
Agno is a powerful Python framework for building AI agents. In NewsVerse, we use Agno to:
- Create specialized AI agents (Summarization, Fact-Checking, Sentiment Analysis, NER, Article Scoring)
- Handle LLM provider management (Gemini, Groq)
- Implement automatic API failover mechanisms
- Orchestrate multi-agent workflows
Special Thanks: This project is built using the Agno Framework created by Ashpreet B. (CEO of Agno). The framework's agent-based architecture made it seamless to build and manage multiple AI agents for different tasks.
High-performance Python web framework used for:
- RESTful API endpoints
- User authentication (Google OAuth)
- Background task scheduling
- Pipeline orchestration endpoints
Modern frontend stack providing:
- Type-safe component development
- Responsive UI with Tailwind CSS
- Real-time article recommendations
- User preference management
Document database storing:
- Articles with embeddings
- User profiles and preferences
- Recommendation cache
- Processing status tracking
Directory: `backend/Scraping_Crawling/`
Purpose: Fetch raw articles (links, titles, content) from multiple news sources.
- 🌐 Broad Discovery — Crawl4ai
  - Automatically discovers news articles across the web
  - Handles dynamic content and JavaScript-rendered pages
- 🎯 Reliable Extraction — Custom Parsers
  - For stable, major sources (BBC, CNN, HT, Benzinga), custom parser functions ensure consistency
  - Handles source-specific HTML structures
```python
def parse_bbc(soup):
    """Custom parser for BBC News articles."""
    content = soup.find('article').text
    return content

def parse_cnn(soup):
    """Custom parser for CNN articles."""
    content = soup.find('div', class_='article__content').text
    return content

# parse_ht and parse_benzinga follow the same pattern for their sources
PARSER_MAP = {
    'bbc.com': parse_bbc,
    'cnn.com': parse_cnn,
    'hindustantimes.com': parse_ht,
    'benzinga.com': parse_benzinga
}
```

Each scraped article is stored in MongoDB as a document like:

```json
{
  "_id": "HT_20250908_141827_5388",
  "source": "HT",
  "title": "Vice-president election on Sept 9...",
  "date": "2025-09-08",
  "time": "13:43:25",
  "content": "The stage is set for...",
  "url": "https://www.hindustantimes.com/...",
  "scraped_at": "2025-09-08T14:18:27.311+00:00",
  "processed_status": {
    "summarized": false,
    "fact_checked": false,
    "sentiment": false,
    "ner": false,
    "scored": false
  }
}
```

Once raw articles are collected, a series of AI agents enrich the data. All modules feature resilient API handling with automatic failover from Gemini to Groq when rate limits are encountered.
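The failover pattern shared by these modules can be sketched as a small wrapper. This is an illustrative stand-in, not the project's actual `api_manager`; the function and exception names here are assumptions:

```python
class RateLimitError(Exception):
    """Stand-in for a provider rate-limit error (e.g. Gemini's ResourceExhausted)."""

def call_with_failover(prompt, primary, fallback):
    """Try the primary LLM first; on a rate-limit error, retry once with the fallback."""
    try:
        return primary(prompt)
    except RateLimitError:
        return fallback(prompt)

# Stub providers standing in for Gemini and Groq:
def gemini(prompt):
    raise RateLimitError("quota exceeded")

def groq(prompt):
    return f"groq-answer:{prompt}"

print(call_with_failover("Summarize this article...", gemini, groq))
```

The key design point is that callers never see the rate-limit error; the switch to the fallback provider is transparent.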
Directory: `backend/Summarization/`
Agents Used: 2 (Summarization Agent, Story Agent)
Purpose: Generates two types of summaries for each article:
- Factual Summary — Concise, professional summary (2-4 sentences)
- Story Summary — Child-friendly, engaging story format (3-5 sentences, ages 6-12)
How it Works (`run_summarization.py`):
- Fetch Articles: Retrieves articles from MongoDB that need summarization
- Generate Factual Summary: Calls `get_factual_summary()`, which:
  - Uses the Summarization Agent (Groq model: `openai/gpt-oss-120b`)
  - Handles API failover (Gemini → Groq on rate limits)
  - Returns JSON with a `summary` field
- Generate Story Summary: Calls `get_story_summary()`, which:
  - Uses the Story Agent (Groq model: `openai/gpt-oss-120b`)
  - Creates child-friendly summaries with simple language
  - Returns JSON with a `story_summary` field
- Update Database: Saves both summaries to the article document
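Because the agents are instructed to return a raw JSON object, the pipeline has to parse the model's reply defensively (models sometimes wrap output in markdown fences or stray text). A minimal sketch of such a parser; the helper name is hypothetical, not from the repo:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of an LLM reply, tolerating stray text or markdown fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))

# Works even if the model wraps the object in a fenced block:
print(extract_json('```json\n{"summary": "Elections set for Sept 9."}\n```'))
```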
Agent Prompts (agents.py):
Summarization Agent:
```text
You are a news article summarizer. Summarize the given article text in 2-4 sentences.
Return JSON in this exact format:
{
  "summary": "<short, concise summary of the article>"
}
Do NOT include anything outside the JSON object.
```
Story Agent:
```text
You are a children's story writer. Read the article carefully and summarize it in a fun,
simple, and easy-to-read way for kids.
Rules:
1. Use simple language suitable for 6-12 year old children.
2. Make it engaging like a short story.
3. Keep the summary concise (3-5 sentences max).
4. Focus on the main events or important points, but avoid technical jargon.
5. Return JSON in this exact format:
{
  "story_summary": "<summary written as a story for kids>"
}
6. Do NOT include anything outside the JSON object.
```
Output Format:
```json
{
  "summarization": {
    "summary": "Concise factual summary...",
    "story_summary": "Child-friendly story format..."
  }
}
```

Directory: `backend/Fact_Checker/`
Agents Used: 1 (Fact-Checker Agent with Web Search Tools)
Purpose: Verifies factual claims in articles using web search tools and returns a boolean verdict.
How it Works (fact_checker.py):
- Fetch Articles: Retrieves all articles from MongoDB
- Extract Main Claim: The agent identifies the primary factual claim
- Web Search Verification: Uses ONE of the following tools:
  - `DuckDuckGoTools`
  - `GoogleSearchTools`
  - `WebBrowserTools`
  - `WebsiteTools`
- Compare Results: Compares claim against top 3 reputable sources (BBC, Reuters, AP, Bloomberg, etc.)
- Generate Verdict: Returns boolean result with explanation
- Update Database: Saves fact-check results to article document
- Save Local Copy: Writes results to `fact_check_results.json`
Agent Instructions (agents.py):
```text
Step 1: Read the provided news article text.
Step 2: Extract the main factual claim from the article.
Step 3: Use ONLY ONE search (DuckDuckGo, GoogleSearch, WebBrowser, or WebsiteTools) for that claim.
Step 4: Compare the claim to the top 3 reputable search results (BBC, Reuters, AP, Bloomberg, etc.).
Step 5: Decide if the claim is factually correct (true or false).
Step 6: Output the result ONLY in a raw JSON object (no markdown block or surrounding text).
The JSON MUST have exactly two fields: 'llm_verdict' (boolean: true/false) and
'fact_check_explanation' (string: short reason).
Example: {"llm_verdict": true, "fact_check_explanation": "The claim is supported by multiple reputable sources."}
```
Output Format:
```json
{
  "fact_check": {
    "llm_verdict": false,
    "fact_check_explanation": "The article claims that Jagdeep Dhankhar resigned as Vice President on..."
  }
}
```

Features:
- ✅ API Failover: Automatically switches from Gemini to Groq on rate limits
- ✅ Tool-Based Verification: Uses real web search to verify claims
- ✅ Reputable Source Focus: Prioritizes trusted news sources for verification
Directory: `backend/Sentiment_Analysis/`
Agents Used: 1 (Sentiment Agent)
Purpose: Classifies article sentiment as Positive, Negative, or Neutral with reasoning.
How it Works (sentiment.py):
- Fetch Articles: Retrieves articles that need sentiment analysis
- Analyze Content: Agent analyzes tone and language
- Classify Sentiment: Returns classification with reason
- Update Database: Saves sentiment to article document
- Save Local Copy: Writes results to `sentiment_analysis.json`
Agent Instructions (agents.py):
```text
You are a sentiment evaluation agent. Analyze the tone and language of the article text.
Determine sentiment strictly as **Positive, Negative, or Neutral** based on these rules:
1. **Positive** → The article contains positive keywords (e.g., growth, profit, gain, recovery,
   expansion, strong, successful), or the overall tone is optimistic and confidence-building.
2. **Negative** → The article contains negative keywords (e.g., loss, decline, fall, risk, weak,
   downgrade, failure), or the overall tone is pessimistic, warning, or confidence-reducing.
3. **Neutral** → The article is mainly factual, descriptive, or balanced — with no clear
   positive or negative tone. Includes objective reporting, announcements, or mixed signals.
4. Return only JSON in this exact format:
{
  "sentiment": "<Positive|Negative|Neutral>",
  "reason": "<short reason explaining the classification>"
}
5. Reason should be brief (1-2 sentences).
6. Do NOT include anything outside the JSON object.
```
Output Format:
```json
{
  "sentiment": "Neutral"
}
```

Features:
- ✅ API Failover: Automatically switches from Gemini to Groq on rate limits
- ✅ Detailed Classification: Provides reasoning for sentiment classification
- ✅ Keyword-Based Analysis: Uses keyword detection and tone analysis
Directory: `backend/Name_Entity_Recognition/`
Agents Used: 1 (NER Agent)
Purpose: Extracts and aggregates named entities (Person, Location, Organization) from all articles a user has liked, storing them in the user's profile.
How it Works (NER.py):
- Fetch Users: Retrieves all users from the user collection
- Get User's Liked Articles: For each user, retrieves their `title_id_list` (articles they've interacted with)
- Process Each Article: For each article in the user's list:
  - Fetches article content from the article collection
  - Runs the NER agent on the content
  - Extracts entities (Person, Location, Organization)
- Aggregate Entities: Combines entities from all of the user's articles into a single aggregated list (removing duplicates)
- Update User Profile: Saves aggregated entities to the user's `ner_data` field in MongoDB
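The aggregation and deduplication step can be sketched like this (a simplified stand-in for the real `NER.py` logic; the per-article results here are hard-coded for illustration):

```python
def aggregate_entities(per_article_results):
    """Merge per-article NER outputs into one deduplicated profile, preserving first-seen order."""
    merged = {"Person": [], "Location": [], "Organization": []}
    for result in per_article_results:
        for category, names in result.items():
            for name in names:
                if name not in merged[category]:
                    merged[category].append(name)
    return merged

article_1 = {"Person": ["Rohit Arya"], "Location": ["Mumbai"], "Organization": ["BCCI"]}
article_2 = {"Person": ["Suryakumar Yadav"], "Location": ["Mumbai", "Pune"], "Organization": []}
print(aggregate_entities([article_1, article_2]))
# {'Person': ['Rohit Arya', 'Suryakumar Yadav'], 'Location': ['Mumbai', 'Pune'], 'Organization': ['BCCI']}
```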
Agent Instructions (agents.py):
```text
Extract all unique named entities from the following news article text and categorize them
as "Person", "Location" (including cities/countries/regions), or "Organization"
(companies, institutions).
Return strictly a JSON object in this format:
{
  "Person": [list of unique person names],
  "Location": [list of unique locations],
  "Organization": [list of unique organizations]
}
Do not include any other text, comments, or explanations. Return valid JSON only.
```
Output Format (in User Collection):
```json
{
  "ner_data": {
    "Person": ["Rohit Arya", "Deepak Kesarkar", "Ashish Shelar", ...],
    "Location": ["Mumbai", "Pune", "Maharashtra", ...],
    "Organization": ["BCCI", "School Education Department", ...]
  }
}
```

Key Features:
- ✅ User-Centric: Processes entities per user, not per article
- ✅ Aggregation: Combines entities from all user's liked articles
- ✅ Deduplication: Removes duplicate entities automatically
- ✅ Profile Building: Used to build user interest profiles for recommendations
Note: This module updates the User Collection, not the Article Collection, as it builds user preference profiles based on their reading history.
Directory: `backend/Embedding_Creation/`
Purpose: Converts article text into high-dimensional vector embeddings for semantic similarity matching.
How it Works (embeddings.py):
- Connect to Database: Establishes connection to MongoDB article collection
- Fetch Unprocessed Articles: Retrieves articles that don't have embeddings yet
- Generate Embeddings: For each article:
  - Combines `title` and `content` for richer context
  - Uses the SentenceTransformer model (`all-MiniLM-L6-v2`)
  - Generates a 384-dimensional vector embedding
  - Converts it to a list for MongoDB storage
- Update Database: Saves embedding array to article document
- Logging: Tracks progress and errors for each article
Model Details:
- Model: `all-MiniLM-L6-v2` (SentenceTransformers)
- Dimensions: 384
- Purpose: Semantic similarity matching for recommendations
- Input: Article title + content (combined text)
- Output: Vector array stored in the `embedding` field
Code Example:
```python
from Embedding_Creation.model_loader import embedding_model

# Combine title and content for richer embedding
text_to_embed = f"{article.get('title', '')} {article.get('content', '')}"

# Generate the embedding
embedding = embedding_model.encode(text_to_embed).tolist()

# Store in MongoDB
db_manager.updateArticleEmbedding(collection, article["_id"], embedding)
```

Key Features:
- ✅ Efficient Processing: Only processes articles without embeddings
- ✅ Rich Context: Combines title and content for better semantic representation
- ✅ Batch Processing: Handles multiple articles efficiently
- ✅ Error Resilience: Continues processing even if individual articles fail
Directory: `backend/Article_Scorer/`
Purpose: This module assigns a hybrid "quality" score to each article by combining an AI-generated "knowledge depth" score with an optional user-provided score. It is designed to be resilient, with built-in failover from the Gemini API to Groq.
Agents Used: 1 (The Article Scoring Agent)
This module uses a highly specific prompt with a 0-9 rubric based on "knowledge depth". The agent is instructed to return only a JSON object.
```python
# This is the exact prompt from agents.py
"You are an evaluator of news articles.\n"
"Score each article from 0 to 9 based on knowledge depth:\n\n"
"0–2: Poor — highly superficial, incomplete, or factually questionable.\n"
"3–5: Moderate — covers basics but lacks depth or misses key points.\n"
"6–8: Good — detailed, covers multiple aspects, balanced and factual.\n"
"9: Exceptional — comprehensive, in-depth, authoritative, and well-structured.\n\n"
"Return valid JSON only in the format:\n"
"{\n"
'  "score": <integer 0–9>,\n'
'  "reason": "<short reason>"\n'
"}"
```

The main script `article_scorer.py` orchestrates the scoring process through several key steps.
The script fetches all articles from MongoDB and groups them by title to de-duplicate the scoring process. This ensures that duplicate articles (same title, different sources) receive the same score, avoiding redundant API calls.
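The grouping step can be sketched as follows (illustrative; the real script reads articles from MongoDB rather than a list, and whether it normalizes title case is an assumption here):

```python
from collections import defaultdict

def group_by_title(articles):
    """Group article documents by title so each distinct story is scored only once."""
    groups = defaultdict(list)
    for article in articles:
        # Normalizing whitespace/case is an illustrative choice for matching duplicates
        groups[article["title"].strip().lower()].append(article)
    return groups

articles = [
    {"_id": "HT_1", "title": "Vice-president election on Sept 9"},
    {"_id": "BBC_2", "title": "vice-president election on Sept 9"},  # same story, another source
    {"_id": "CNN_3", "title": "Markets rally on rate-cut hopes"},
]
groups = group_by_title(articles)
print(len(groups))  # 2 groups -> 2 LLM scoring calls instead of 3
```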
For one representative article from each group, the script calls the get_llm_score function. This function is designed to be robust and resilient:
Process:
- Attempts to get a score from the primary model (Gemini) via the `api_manager`
- The agent analyzes the article content and returns JSON with:
  - `score`: Integer from 0-9
  - `reason`: Short explanation

API Failover Mechanism:
- If the Gemini API fails due to rate limits (`ResourceExhausted`), the `api_manager` is instructed to `switch_to_groq()`
- The function automatically retries the request using the Groq model as a failover
- This ensures the scoring process continues even when one API is unavailable
The script iterates through the grouped articles to find any existing `user_article_score` (e.g., from a user's manual rating or feedback).
Purpose: Incorporates user feedback into the final score when available.
The `final_custom_score` is a weighted average that combines AI analysis with user feedback.
Formula Used (`article_scorer.py`):
```python
# This is the exact formula from article_scorer.py
final_score = round((llm_score * 0.6) + (user_score * 0.4), 2) if user_score is not None else llm_score
```

Scoring Logic:
- If `user_score` exists: `final_score = (60% × llm_score) + (40% × user_score)`
- If no `user_score`: `final_score = llm_score`

This weighted approach gives AI analysis more weight (60%) while still incorporating valuable user feedback (40%) when available.
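Worked through with the values that appear in the sample article document elsewhere in this README (`llm_score: 6`, `user_article_score: 5`), the formula yields:

```python
llm_score = 6
user_score = 5

# 60% AI analysis + 40% user feedback
final_score = round((llm_score * 0.6) + (user_score * 0.4), 2) if user_score is not None else llm_score
print(final_score)  # 5.6
```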
Update MongoDB:
- The `final_score` and its components (`llm_score`, `user_article_score`) are saved back to all articles in the group in MongoDB
- This ensures all duplicate articles receive the same score

Save Local Copy:
- All scores are also saved to a local `article_scores.json` file for logging and backup purposes
- ✅ Resilient API Handling: Automatic failover from Gemini to Groq prevents scoring failures
- ✅ Deduplication: Groups articles by title to avoid redundant scoring
- ✅ Hybrid Scoring: Combines AI analysis (60%) with user feedback (40%)
- ✅ Comprehensive Logging: Saves scores both to MongoDB and local JSON file
- ✅ Knowledge Depth Focus: Uses a 0-9 rubric specifically designed to evaluate article depth and quality
Directory: `backend/Recommendation_Engine/`
Purpose: Matches users with the most relevant articles using a multi-stage pipeline that combines user behavior analysis, AI-powered profile generation, and vector similarity matching.
The recommendation system follows a precise, multi-step process that transforms raw user interactions into personalized article recommendations.
The pipeline begins when a user performs an action in the frontend:
- Liking an article (via `ArticleCard.tsx`)
- Defining interests in their profile (via `UserPreferences.tsx`)

These actions are logged in MongoDB, creating records of:
- `liked_article_ids` — List of articles the user has interacted with
- `explicit_preferences` — Raw text preferences (e.g., "I like AI and finance")
When a recommendation is needed, this script creates a unified "profile" of the user's interests.
Process:
- Fetches two data sources from MongoDB:
  - The user's `liked_article_ids`
  - The user's `explicit_preferences` (raw text)
- Retrieves the full text content (or summaries) of all liked articles
- Collects all preference data into a single dataset

Output: A collection of raw, "noisy" text data (e.g., 5 liked articles + 3 preference phrases)
The raw user data is processed by an AI agent to distill it into a clean, meaningful profile.
Agent: User Analyzer Agent (defined in backend/Recommendation_Engine/agents.py)
Example Prompt:
```text
You are a user profile analyzer. Based on the following articles a user has liked
({liked_article_content}) and their stated interests ({explicit_preferences}),
generate a single, dense paragraph that summarizes this user's true, nuanced interests.
Identify key topics, entities, and recurring themes.
```
Example Transformation:
Input:
- Article on Tesla
- Article on NVIDIA stock
- Preference: "AI"
Agent Output:
"This user is interested in high-growth technology, specifically in the electric vehicle and artificial intelligence sectors. They follow key companies like Tesla and NVIDIA, and are interested in the financial market implications of new tech."
Result: A single, high-quality "interest paragraph" that captures the user's true preferences.
The clean "interest paragraph" from Step 2 is converted into a mathematical representation.
Process:
- The interest paragraph is fed into the embedding model (from `backend/Embedding_Creation/embeddings.py`)
- Uses SentenceTransformer (`all-MiniLM-L6-v2`) to generate vector embeddings
- Output: A single User Profile Vector (e.g., a `[1, 384]` array)
Storage: This vector is saved in the user's MongoDB document for quick retrieval, avoiding recomputation on every request.
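The caching behavior can be sketched with a mock store (a dict standing in for the MongoDB document; the encoder is a stub, and the `interest_paragraph` field name is hypothetical):

```python
def get_profile_vector(user_doc, encode):
    """Return the cached profile vector, computing and storing it only on a cache miss."""
    if user_doc.get("user_profile_vector") is None:
        user_doc["user_profile_vector"] = encode(user_doc["interest_paragraph"])
    return user_doc["user_profile_vector"]

# Stub encoder: real code would call SentenceTransformer's .encode()
calls = []
def fake_encode(text):
    calls.append(text)
    return [0.1] * 384

user = {"email": "user@example.com", "interest_paragraph": "AI and finance",
        "user_profile_vector": None}
get_profile_vector(user, fake_encode)
get_profile_vector(user, fake_encode)
print(len(calls))  # encoded once; the second call hits the cache
```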
This is the core matching engine, triggered by:
- Frontend requests (from `News.tsx`)
- The WhatsApp service (from `whatsapp_sender.py`)

Process:
- Fetch User Profile Vector: Retrieves the pre-calculated User Profile Vector from MongoDB
- Load Article Vectors: Loads all Article Vectors from the database (created by `Embedding_Creation/embeddings.py` when articles were first scraped)
- Calculate Similarity: Uses cosine similarity to compute the mathematical "closeness" between the User Profile Vector and all Article Vectors
Code Implementation:
```python
from sklearn.metrics.pairwise import cosine_similarity

# user_vector.shape is [1, 384]
# all_article_vectors.shape is [N, 384] (N = number of articles)
# This calculates the similarity of the user to EVERY article
similarity_scores = cosine_similarity(user_vector, all_article_vectors)

# Result has shape [1, N], e.g.: [[0.91, 0.23, 0.88, 0.05, ...]]
```

The final step sorts and delivers the most relevant articles.
Process:
- Sorts the `similarity_scores` array from highest to lowest
- Takes the Top 10 article IDs from the sorted list
- Returns the final list as a JSON response to the frontend
- The frontend displays these articles to the user in `News.tsx`
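The sort-and-slice step amounts to ranking articles by their similarity score. A sketch in plain Python (the real code likely operates on the NumPy score array instead):

```python
def top_n_articles(article_ids, similarity_scores, n=10):
    """Pair each article with its score, sort descending, and keep the n best IDs."""
    ranked = sorted(zip(article_ids, similarity_scores), key=lambda pair: pair[1], reverse=True)
    return [article_id for article_id, _ in ranked[:n]]

ids = ["HT_1", "BBC_2", "CNN_3", "IE_4"]
scores = [0.91, 0.23, 0.88, 0.05]
print(top_n_articles(ids, scores, n=2))  # ['HT_1', 'CNN_3']
```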
Result: Users receive personalized article recommendations that match their interests, behavior, and stated preferences.
- `user_analyzer.py` — Analyzes user behavior and preferences
- `agents.py` — Contains the User Analyzer Agent (LLM-based profile generation)
- `article_recommender.py` — Core matching engine using cosine similarity
- `engine.py` — Orchestrates the recommendation pipeline
- `model_loader.py` — Loads embedding models for vectorization
Directory: `backend/Whatsapp_Messaging/`
Purpose: Delivers personalized news recommendations to users via WhatsApp using Twilio API, with scheduled delivery based on user preferences.
How it Works:
- Scheduled Tasks (`scheduler_tasks.py`):
  - Uses APScheduler to run periodic tasks
  - Fetches users with phone numbers and preferred delivery times
  - Triggers recommendation generation for each user
- Message Generation (`whatsapp_sender.py`):
  - Retrieves the top 10 recommended articles for each user
  - Formats articles into a WhatsApp-friendly message
  - Includes article titles, summaries, and links
- Sending (`whatsapp_service.py`):
  - Uses the Twilio API to send messages
  - Handles message formatting and delivery
  - Logs delivery status
- Integration:
  - Connected to the Recommendation Engine for article selection
  - Respects the user's `preferred_time` setting
  - Sends daily/weekly digests based on user preferences
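The message-generation step might look something like this (illustrative only; the real `whatsapp_sender.py` template is not shown in this README):

```python
def format_digest(name, articles):
    """Render recommended articles as a WhatsApp-friendly text digest."""
    lines = [f"Hi {name}, here are your top stories:"]
    for i, article in enumerate(articles, start=1):
        lines.append(f"{i}. {article['title']}\n{article['summary']}\n{article['url']}")
    return "\n\n".join(lines)

digest = format_digest("Darsh", [
    {"title": "Vice-president election on Sept 9",
     "summary": "The stage is set for the vote.",
     "url": "https://example.com/article"},
])
print(digest)
```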
Key Features:
- ✅ Scheduled Delivery: CRON-based scheduling via APScheduler
- ✅ Personalized Content: Uses recommendation engine for article selection
- ✅ Time Preferences: Respects user's preferred delivery time
- ✅ Twilio Integration: Reliable message delivery via Twilio API
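Respecting a stored `preferred_time` like `"01:38"` amounts to computing the next occurrence of that wall-clock time. A sketch under the assumption that times are naive local `HH:MM` strings (the real service registers a CRON job via APScheduler instead of computing this by hand):

```python
from datetime import datetime, timedelta

def next_delivery(preferred_time: str, now: datetime) -> datetime:
    """Return the next datetime matching an 'HH:MM' preference, today or tomorrow."""
    hour, minute = map(int, preferred_time.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # already passed today; deliver tomorrow
    return candidate

now = datetime(2025, 9, 8, 14, 0)
print(next_delivery("01:38", now))  # 2025-09-09 01:38:00
```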
Files:
- `whatsapp_sender.py` — Main sending logic
- `whatsapp_service.py` — Twilio API integration
- `scheduler_tasks.py` — Scheduled task management
- `recommender.py` — Article recommendation integration
The NewsVerse platform uses MongoDB to store articles, user data, preferences, and recommendations. Below are the detailed schemas for each collection.
The main collection storing all scraped and processed articles.
Collection Name: articles (or similar, as configured)
Document Structure:
```json
{
  "_id": "HT_20250908_141827_5388",
  "source": "HT",
  "title": "Vice-president election on Sept 9: Numbers back NDA as Radhakrishnan b…",
  "date": "2025-09-08",
  "time": "13:43:25",
  "content": "The stage is set for CP Radhakrishnan and Sudershan Reddy to battle it o…",
  "url": "https://www.hindustantimes.com/india-news/vice-president-election-on-s…",
  "scraped_at": "2025-09-08T14:18:27.311+00:00",
  "summarization": {
    "summary": "Factual summary text...",
    "story_summary": "Child-friendly story format..."
  },
  "sentiment": "Neutral",
  "fact_check": {
    "llm_verdict": false,
    "fact_check_explanation": "The article claims that Jagdeep Dhankhar resigned as Vice President on…"
  },
  "article_score": {
    "user_article_score": 5,
    "llm_score": 6,
    "final_custom_score": 5.6
  },
  "embedding": [/* Array of 384 dimensions */],
  "rated_by": ["darshvaishnani1234@gmail.com"],
  "processed_status": {
    "summarized": true,
    "fact_checked": true,
    "sentiment": true,
    "ner": true,
    "scored": true
  }
}
```

Key Fields:
- `_id`: Unique identifier (format: `{SOURCE}_{DATE}_{TIME}_{RANDOM}`)
- `source`: News source abbreviation (e.g., "HT", "BBC", "CNN")
- `embedding`: Vector embedding (384 dimensions) for similarity matching
- `article_score`: Quality score from the Article Scorer module
- `rated_by`: Array of user emails who have rated this article
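An ID in the `{SOURCE}_{DATE}_{TIME}_{RANDOM}` shape could be produced like this (a guess at the generator for illustration; the repo's actual implementation is not shown in this README):

```python
import random
from datetime import datetime

def make_article_id(source: str, scraped_at: datetime) -> str:
    """Build an ID like HT_20250908_141827_5388 from source, timestamp, and a random suffix."""
    stamp = scraped_at.strftime("%Y%m%d_%H%M%S")
    suffix = random.randint(1000, 9999)  # 4-digit random component, as in the examples
    return f"{source}_{stamp}_{suffix}"

article_id = make_article_id("HT", datetime(2025, 9, 8, 14, 18, 27))
print(article_id)  # e.g. HT_20250908_141827_5388
```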
Stores user profile information, preferences, and interaction history.
Collection Name: users (or similar, as configured)
Document Structure:
```json
{
  "_id": "68be98c16b193cc8e8317f73",
  "email": "darshvaishnani1234@gmail.com",
  "name": "Darsh Vaishnani",
  "picture": "https://lh3.googleusercontent.com/a/ACg8ocJ_ZBbPNqLG1JJikCsw90INemaPXd…",
  "phone_number": "+919375981112",
  "preferred_time": "01:38",
  "rated_articles": ["BBC_20250908_141827_6617"],
  "ner_data": {
    "Person": [
      "Rohit Arya",
      "Deepak Kesarkar",
      "Ashish Shelar",
      "Mohsin Naqvi",
      "Devajit Sakia",
      "Shukla",
      "Suryakumar Yadav",
      "Salman Agha"
    ],
    "Location": [/* Array of location entities */],
    "Organization": [/* Array of organization entities */]
  },
  "title_id_list": [
    "IndianExpress_20251031_230549_6828",
    "IndianExpress_20251001_004427_9329"
  ],
  "title_list": [
    "Behind the Powai tragedy: Rohit Arya's long fight with Maharashtra's S…",
    "BCCI ex-officio leaves ACC meeting midway in protest, says Mohsin Naqv…"
  ],
  "user_profile_vector": [/* Array of 384 dimensions - optional */],
  "explicit_preferences": [/* Array of user-stated interests - optional */]
}
```

Key Fields:
- `rated_articles`: Array of article IDs the user has rated/liked
- `ner_data`: Named entities extracted from the user's liked articles
- `title_id_list`: IDs of articles the user has interacted with
- `title_list`: Titles of articles for quick reference
- `user_profile_vector`: Pre-computed embedding vector for recommendations (optional, cached)
Stores the AI-generated detailed summary of user interests.
Collection Name: user_preference_analysis (or similar, as configured)
Document Structure:
```json
{
  "_id": {
    "$oid": "690516fce88e1f8c72949dee"
  },
  "email": "darshvaishnani1234@gmail.com",
  "name": "Darsh Vaishnani",
  "detailed_summary": "Based on the provided entities, the user seems to be interested in news related to education initiatives in India, particularly in Maharashtra (given mentions of 'School Education Department', 'Mazi Shala Sundar Shala', 'School Education Commissionerate', 'Powai', 'Pune', 'Mumbai'). They also seem interested in events and campaigns like 'Mahatma Gandhi Jayanti Se Sardar Patel Jayanti Tak', 'Vikasit Bharat Buildothon', 'Veer Gatha 5.0', 'Ek Ped Ma Ke Naam', and 'Mission Life Eco Club'. There's also a strong interest in cricket, with mentions of 'Suryakumar Yadav', 'Salman Agha', 'Board of Control for Cricket (BCCI)', 'Asian Cricket Council (ACC)', and 'Pakistan Cricket Board (PCB)', implying an interest in India-Pakistan cricket relations and tournaments possibly held in 'Dubai'. The user may also follow news from 'The Indian Express'.\n"
}
```

Purpose: This collection stores the output from Step 2 of the Recommendation Pipeline (the Analysis Agent). The `detailed_summary` is the distilled interest paragraph that gets vectorized for recommendations.
Stores pre-computed article recommendations for each user.
Collection Name: recommended_articles (or similar, as configured)
Document Structure:
```json
{
  "_id": {
    "$oid": "690516ae8b1f3d3714c87f83"
  },
  "email": "darshvaishnani1234@gmail.com",
  "articles": [
    {
      "_id": "IndianExpress_20251031_230549_6828",
      "title": "Behind the Powai tragedy: Rohit Arya's long fight with Maharashtra's School Education Dept",
      "similarity": 0.2422
    },
    {
      "_id": "IndianExpress_20251001_004427_9329",
      "title": "BCCI ex-officio leaves ACC meeting midway in protest, says Mohsin Naqvi gave no clarity over Asia Cup trophy",
      "similarity": 0.2483
    }
    // ... up to 10 articles
  ]
}
```

Key Fields:
- `email`: User identifier
- `articles`: Array of top 10 recommended articles
  - `_id`: Article identifier
  - `title`: Article title for display
  - `similarity`: Cosine similarity score (0.0–1.0) indicating match quality
Purpose: This collection caches the results from Step 5 of the Recommendation Pipeline, allowing quick retrieval of personalized recommendations without recalculating similarity scores on every request.
- Node.js ≥ 18
- Python ≥ 3.10
- MongoDB (local or cloud instance)
- API Keys (stored in a `.env` file):
  - Google OAuth credentials
  - OpenAI/Groq API keys
  - Gemini API key
  - Twilio credentials
  - MongoDB connection URI
```bash
# Navigate to backend directory
cd backend
```

Create and activate a virtual environment.

Mac/Linux:
```bash
python -m venv venv
source venv/bin/activate
```

Windows:
```powershell
python -m venv venv
.\venv\Scripts\activate
```

Install dependencies:
```bash
pip install -r requirements.txt
```

Create a `.env` file in the `backend/` directory:
```env
# MongoDB
MONGO_URI=your_mongodb_connection_string

# Google OAuth
GOOGLE_CLIENT_ID=your_google_client_id
GOOGLE_CLIENT_SECRET=your_google_client_secret
SECRET_KEY=your_secret_key

# LLM APIs
GEMINI_API_KEY=your_gemini_api_key
GROQ_API_KEY=your_groq_api_key

# Twilio
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token
TWILIO_PHONE_NUMBER=your_twilio_phone

# Frontend URL
FRONTEND_URL=http://localhost:5173
```

Start the FastAPI server:
```bash
uvicorn main:app --reload
```

Server runs at:
👉 http://localhost:8000
API Documentation:
👉 http://localhost:8000/docs (Swagger UI)
👉 http://localhost:8000/redoc (ReDoc)
```bash
# Navigate to frontend directory
cd frontend
npm install
npm run dev
```

App runs at:
👉 http://localhost:5173
The FastAPI server provides the following key endpoints:
Authentication:
- `GET /auth/google` — Initiate Google OAuth login
- `GET /auth/callback` — OAuth callback handler
- `GET /auth/logout` — Log out the user
- `GET /user` — Get current user information

Articles:
- `GET /articles` — Get articles (with filtering/pagination)
- `GET /articles/{article_id}` — Get a specific article
- `POST /articles/{article_id}/rate` — Rate an article

Recommendations:
- `GET /recommendations` — Get personalized recommendations for the current user
- `POST /preferences` — Update user preferences

Pipeline:
- `POST /pipeline/summarize` — Run the summarization pipeline
- `POST /pipeline/fact-check` — Run the fact-checking pipeline
- `POST /pipeline/sentiment` — Run the sentiment analysis pipeline
- `POST /pipeline/score` — Run the article scoring pipeline
- `POST /pipeline/preprocess` — Run the full preprocessing pipeline

WhatsApp:
- `POST /whatsapp/send` — Send a WhatsApp message (internal)
Directory: `Raw_code_developer/`
This directory contains:
- 🧪 Early prototypes - Initial proof-of-concept implementations
- 📓 Notebook experiments - Jupyter notebooks for testing
- 📚 Crawl4ai tutorials - Learning resources and examples
- 📊 Sample datasets - Example JSON files for testing
  - `BBC_filtered_news_articles.json`
  - `fact_check_results.json`
  - `sentiment_results.json`
  - `user_database.json`
- 💻 Initial code - Early versions of Summarization, Fact-Check, and Recommendation modules
See LICENSE file for details.
Built with ❤️ using AI, React, and Python