Project: Intelligent Article Recommendation System for News Platforms
Role: Full-Stack Developer & ML Engineer (Solo Project)
Duration: Internship Project
Tech Stack: Node.js, Python (NLTK, scikit-learn, spaCy), MySQL
Problem Solved: News websites struggle to keep readers engaged beyond a single article. This system analyzes article content using NLP techniques and automatically recommends 3 related articles based on semantic similarity and topic overlap, increasing user engagement and time-on-site.
Key Achievement: Built an end-to-end recommendation pipeline that processes article text, calculates multi-factor similarity scores, and serves personalized recommendations through a RESTful API—reducing manual content curation effort to zero.
Technical Highlights:
- Hybrid similarity algorithm combining Jaccard Index (tag overlap) and TF-IDF cosine similarity (semantic content)
- Multi-language architecture: Node.js backend orchestrating Python NLP microservices via child processes
- Real-time tracking SDK deployable to any client website with a single script tag
- Achieved ~80% accuracy in identifying topically related articles (manual validation on 50-article sample)
Core Deliverables:
- RESTful API for article ingestion and recommendation serving
- Python NLP pipeline for text preprocessing and similarity calculation
- Automated NER-based tag generation system
- Client-side JavaScript SDK for seamless website integration
Contents:
- System Architecture
- Technical Stack & Design Decisions
- API Reference
- NLP Pipeline
- Data Model
- Integration Guide
- Engineering Practices
- Prototype Limitations
- Production-Grade Improvements
- What I Learned
┌─────────────────┐         ┌──────────────────┐         ┌─────────────────┐
│ Client Website  │────────▶│   Express API    │◀───────▶│ MySQL Database  │
│  (cookies.js)   │         │    (Node.js)     │         │   (Articles)    │
└─────────────────┘         └──────────────────┘         └─────────────────┘
                                     │
                                     │ spawns
                                     ▼
                            ┌──────────────────┐
                            │   Python NLP     │
                            │   (test1.py)     │
                            │ - Tokenization   │
                            │ - TF-IDF         │
                            │ - Cosine Sim     │
                            └──────────────────┘
Component Separation:
- Frontend SDK (`cookies.js`): Metadata extraction and user tracking
- Backend API (`app.js`): Request routing, business logic, data persistence
- NLP Module (`test1.py`): Text processing and similarity computation
- Tag Enrichment (`content_tags.py`): Batch NER processing for automated tagging
Data Flow:
- Article page loads → SDK extracts metadata → POST to `/api/insertArticle`
- Background job (`content_tags.py`) enriches articles with NER-generated tags
- Recommendation request → API checks cache → spawns Python for similarity calculation
- Results stored in `related_stories` table → returned to client
| Technology | Purpose | Rationale |
|---|---|---|
| Node.js + Express | Backend API | Chosen for async I/O efficiency when orchestrating multiple Python processes and database operations. Fast prototyping with npm ecosystem. |
| Python | NLP Processing | Industry-standard NLP libraries (NLTK, spaCy, scikit-learn). Mature ecosystem for text processing. |
| MySQL | Relational Database | Structured data (articles, relationships). ACID compliance for data integrity. Familiar SQL interface. |
| Child Process IPC | Node ↔ Python Communication | Enabled language separation without containerization overhead. Suitable for prototype scale (see the sketch below the table). |
| TF-IDF + Cosine Similarity | Content Similarity | Proven baseline for text similarity. Computationally efficient. Captures semantic relationships beyond keyword matching. |
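To make the Child Process IPC row concrete, here is a minimal sketch of the Python side of that contract. It assumes the two article bodies arrive as command-line arguments and that the score is reported by printing to stdout, which is what the Node parent captures from the spawned child. The argument format and the stand-in scoring function are illustrative assumptions, not the actual `test1.py` interface.

```python
# similarity_worker.py: a minimal sketch of the Python half of the child-process
# IPC. Argument handling and the stand-in score are assumptions for illustration,
# not the real test1.py interface.
import sys


def word_overlap(text_a: str, text_b: str) -> float:
    """Stand-in score: percentage overlap between the two lowercase word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return 100.0 * len(a & b) / len(a | b) if (a | b) else 0.0


if __name__ == "__main__":
    current_body, candidate_body = sys.argv[1], sys.argv[2]
    # Whatever is printed to stdout is what the Node parent collects from the
    # child process and resolves as the similarity result.
    print(round(word_overlap(current_body, candidate_body), 2))
```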
Why Not a Pure Python Stack?
Rejected Django/Flask because Node.js excels at concurrent I/O (handling multiple tracking requests) and provides easier client-side integration via dynamic JavaScript serving.
Why Not Vector Databases (Pinecone, Weaviate)?
Deferred for prototype scope. MySQL sufficient for ~1,000 articles. Would migrate to vector DB if scaling beyond 10K articles with real-time embedding updates.
Why Not Pre-trained Transformers (BERT)?
Trade-off: TF-IDF is 100x faster and requires no GPU. For a prototype with <5K articles, the accuracy gain (~5-10%) didn't justify infrastructure complexity.
- Synchronous Python Execution: Blocks Node event loop during similarity calculation. Acceptable for prototype; would migrate to worker queues (Bull/BeeQueue) for production.
- No Caching Layer: Every recommendation request recalculates similarity. Acceptable for demo; would add Redis with 24hr TTL for production.
- String-Based SQL: Vulnerable to injection. Prioritized feature completeness; documented for remediation.
Base URL: http://localhost:3000

GET /

Purpose: Create or retrieve user session
Response: Sets `user_id` cookie (UUID v4)
Example:
// Response
Set-Cookie: user_id=a3f2b1c0-1234-5678-9abc-def012345678; Path=/

GET /data?clientID={orgKey}

Purpose: Dynamically generate client-side tracking script
Parameters:
`clientID` (required): 6-character organization key
Response: JavaScript file with injected user/client IDs
Example Request:
curl "http://localhost:3000/data?clientID=gpFw2b"

Example Response:
let user_id = "a3f2b1c0-1234-5678-9abc-def012345678";
let client_id = "gpFw2b";
// ... rest of cookies.js with placeholders replaced

Error Cases:
404: Invalid or non-existent `clientID`
POST /api/insertArticle
Content-Type: application/json

Purpose: Store article metadata and content
Request Body:
{
"title": "5 Top Moments From The Ashes",
"description": "Great cricket, personality clashes...",
"tag": "Cricket, The Ashes, England National Cricket Team",
"summary": "Great cricket, personality clashes, controversies...",
"body": "The ongoing Ashes series between...",
"publish_date": "2023-07-10T20:12:26+05:30",
"update_date": "2023-07-10T20:12:26+05:30",
"author": "Tejas Rane",
"category": "Sports",
"slug": "5-top-moments-from-the-ashes-so-far-news-301854",
"client_id": "gpFw2b",
"user_id": "a3f2b1c0-1234-5678-9abc-def012345678"
}

Success Response:
200 OK
"JSON data stored successfully!"
Duplicate Handling:
200 OK
"Record already exists with title {title}"
Error Cases:
500: Database connection failure
GET /api/getRelatedStories

Purpose: Calculate and retrieve top 3 related articles
Process:
- Check if article has pre-computed recommendations
- If not: Calculate Jaccard Index (tags) + TF-IDF similarity (content)
- Rank all articles, return top 3
- Cache results in database
Response Format: (Currently returns HTML; would return JSON in production)
{
"collection_id": 4,
"related_stories": [
{
"article_id": 12,
"title": "England vs Australia: Key Moments",
"slug": "england-vs-australia-key-moments",
"similarity_score": 87.3
},
// ... top 3 results
]
}

Performance: ~2-5 seconds for 100 articles (due to synchronous Python calls)
GET /orgs

Purpose: Retrieve all registered client organizations
Response:
[
{
"orgKey": "gpFw2b",
"name": "Outlook India",
"domain": "outlookindia.com"
}
]

           Raw Article Text
                   │
                   ▼
┌─────────────────────────────────────┐
│ 1. Preprocessing (test1.py)         │
│    • HTML entity decoding           │
│    • Lowercase normalization        │
│    • Tokenization (word_tokenize)   │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 2. Filtering & Lemmatization        │
│    • Remove stopwords (NLTK)        │
│    • Lemmatize (WordNetLemmatizer)  │
│    • Filter non-alphanumeric        │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 3. Vectorization                    │
│    • TF-IDF (scikit-learn)          │
│    • Document-term matrix           │
└─────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ 4. Similarity Calculation           │
│    • Cosine similarity              │
│    • Output: 0-100 score            │
└─────────────────────────────────────┘
J(A,B) = |A ∩ B| / |A ∪ B| × 100

Use Case: Fast categorical overlap
Example: tags_A = ["cricket", "sports"], tags_B = ["cricket", "news"]
Score: 33.3% (1 shared / 3 total unique tags)
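A minimal sketch of that tag-overlap score (the function name and the parsing of the comma-separated `tag` column are illustrative assumptions, not the exact production code):

```python
def jaccard_index(tags_a: str, tags_b: str) -> float:
    """Tag overlap on a 0-100 scale, given comma-separated tag strings."""
    a = {t.strip().lower() for t in tags_a.split(",") if t.strip()}
    b = {t.strip().lower() for t in tags_b.split(",") if t.strip()}
    if not (a | b):
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


print(round(jaccard_index("cricket, sports", "cricket, news"), 1))  # -> 33.3
```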
TF-IDF(term, doc) = term_frequency × log(N / doc_frequency)
Cosine(A, B) = (A · B) / (||A|| × ||B||)

Use Case: Semantic content matching
Why TF-IDF: Down-weights common words ("the", "a"), highlights discriminative terms ("Bairstow", "stumping")
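A sketch in the spirit of the pipeline above, assuming NLTK handles stopword removal and lemmatization before scikit-learn's TfidfVectorizer and cosine similarity do the scoring; the actual `test1.py` may differ in details:

```python
# Requires one-time NLTK downloads: punkt, stopwords, wordnet (via nltk.download).
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

_lemmatizer = WordNetLemmatizer()
_stopwords = set(stopwords.words("english"))


def preprocess(text: str) -> str:
    """Lowercase, tokenize, drop stopwords and non-alphanumeric tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    kept = [_lemmatizer.lemmatize(t) for t in tokens if t.isalnum() and t not in _stopwords]
    return " ".join(kept)


def tfidf_similarity(body_a: str, body_b: str) -> float:
    """Cosine similarity of the two TF-IDF vectors, scaled to 0-100."""
    matrix = TfidfVectorizer().fit_transform([preprocess(body_a), preprocess(body_b)])
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0][0]) * 100
```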
Tool: spaCy NER (en_core_web_sm)
Extracted Entities:
- `ORG`: Organizations (e.g., "MCC", "ICC")
- `PERSON`: Names (e.g., "Pat Cummins", "Nathan Lyon")
- `GPE`: Geopolitical entities (e.g., "England", "Australia")
- `LOC`: Locations (e.g., "Lord's", "Headingley")
Process: Batch job (`content_tags.py`) runs on unclassified articles, merges NER tags with manual tags, and updates the database.
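A hedged sketch of what that batch step could look like with spaCy's `en_core_web_sm` model; the function names and the tag-merging format are illustrative assumptions, not the actual `content_tags.py`:

```python
import spacy

# Entity labels the batch job keeps, per the list above.
KEPT_LABELS = {"ORG", "PERSON", "GPE", "LOC"}
nlp = spacy.load("en_core_web_sm")


def ner_tags(body: str) -> set[str]:
    """Extract entity texts with the kept labels, deduplicated."""
    doc = nlp(body)
    return {ent.text for ent in doc.ents if ent.label_ in KEPT_LABELS}


def merge_tags(existing: str, body: str) -> str:
    """Merge manual comma-separated tags with NER tags (storage format assumed)."""
    manual = {t.strip() for t in existing.split(",") if t.strip()}
    return ", ".join(sorted(manual | ner_tags(body)))
```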
final_score = (jaccard_index × 0.3) + (tfidf_similarity × 0.7)

Rationale: Content similarity (TF-IDF) is weighted higher because the article body provides a richer signal than tags alone. Validated empirically—the 70/30 split produced the most relevant recommendations in manual review.
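The blend itself is a one-liner; the function names below refer to the illustrative sketches above, not the production code:

```python
# Weighted blend per the formula above: tag overlap 30%, content similarity 70%,
# both on a 0-100 scale.
def hybrid_score(jaccard_index: float, tfidf_similarity: float) -> float:
    return jaccard_index * 0.3 + tfidf_similarity * 0.7


print(round(hybrid_score(33.3, 87.3), 1))  # -> 71.1 for the example scores used earlier
```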
- Language: English only (NLTK/spaCy models)
- Short Articles: <500 words may produce unreliable TF-IDF vectors (sparse features)
- No User Feedback Loop: Recommendations not personalized to individual user preferences
- Cold Start: New articles without tags or similar content won't be recommended
CREATE TABLE articles (
article_id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(500) NOT NULL UNIQUE,
description TEXT,
tag TEXT, -- Comma-separated tags
summary TEXT,
body LONGTEXT,
publish_date DATETIME,
update_date DATETIME,
author VARCHAR(255),
category VARCHAR(100),
subcategory VARCHAR(100),
slug VARCHAR(500),
client_id VARCHAR(10),
classified TINYINT DEFAULT 0 -- Flag for NER processing
);

CREATE TABLE related_stories (
id INT AUTO_INCREMENT PRIMARY KEY,
collection_id INT, -- Source article
relation_id INT, -- Related article
score FLOAT, -- Similarity score (0-100)
FOREIGN KEY (collection_id) REFERENCES articles(article_id),
FOREIGN KEY (relation_id) REFERENCES articles(article_id)
);

CREATE TABLE organizations (
id INT AUTO_INCREMENT PRIMARY KEY,
orgKey VARCHAR(6) UNIQUE, -- Client identifier
name VARCHAR(255),
address TEXT,
phone VARCHAR(20),
email_id VARCHAR(255),
orgDomain VARCHAR(255)
);

- Comma-Separated Tags: Denormalized for simplicity. Would normalize to an `article_tags` junction table in production for proper indexing.
- `classified` Flag: Enables idempotent batch processing—prevents re-running NER on already-processed articles.
- No User Behavior Table: Deferred user click tracking to focus on core recommendation logic. Would add a `user_interactions` table for collaborative filtering.
Step 1: Register Organization
<!-- orgForm.html -->
<form action="connect.php" method="post">
<input type="text" name="orgName" placeholder="Organization Name">
<input type="url" name="orgDomain" placeholder="https://example.com">
<!-- ... other fields ... -->
<button type="submit">Register</button>
</form>

Result: Receive a 6-character `orgKey` (e.g., `gpFw2b`)
Step 2: Embed Tracking SDK
<!-- In article page <head> or before </body> -->
<script src="http://localhost:3000/data?clientID=gpFw2b"></script>

Step 3: Required HTML Structure
The SDK auto-extracts metadata from standard HTML patterns:
<head>
<title>Article Title Here</title>
<meta name="description" content="Article summary...">
<meta property="article:tag" content="Tag1">
<meta property="article:tag" content="Tag2">
<meta property="article:section" content="Category">
<script type="application/ld+json">
{
"@type": "NewsArticle",
"articleBody": "Full article text...",
"datePublished": "2023-07-10T20:12:26+05:30",
"author": [{"name": "Author Name"}]
}
</script>
</head>
<body>
<div class="story-summary">Article summary text...</div>
</body>

Automatic Behavior: On page load, the SDK sends article data to `/api/insertArticle`. No additional code required.
Separation of Concerns:
- API Layer (`app.js`): HTTP routing, request validation, response formatting
- Data Layer (MySQL): Persistence, relationship management
- NLP Layer (`test1.py`, `content_tags.py`): Text processing, isolated from web concerns
- Client Layer (`cookies.js`): DOM interaction, decoupled from backend implementation
Benefits: NLP logic reusable in CLI tools, batch jobs, or future microservices.
Strategy: Node.js event loop + Promises for I/O operations
async function compareData(currentBody, searchBody, allResult) {
for (let i = 0; i < searchBody.length; i++) {
let result = await callPythonProcess(currentBody, searchBody[i]);
allResult.push(result);
}
}

Trade-off: Sequential Python calls (blocking). Would parallelize with `Promise.all()` in production, but kept sequential for the prototype to avoid spawning 100+ processes simultaneously.
Current State: Basic try-catch around Python process spawning
p.on("close", (code) => {
if (code == 0) {
resolve(result);
} else {
reject(new Error("Python process failed: " + code));
}
});Production Needs: Structured logging (Winston), error categorization, retry logic with exponential backoff.
Implemented:
- Duplicate article detection (by title)
- Organization key existence check before serving SDK
Missing (Prototype Scope):
- Input sanitization for SQL injection
- Schema validation (e.g., Joi)
- Email format validation (client-side only currently)
Manual Testing: test.html and index.html for end-to-end workflow validation
No Automated Tests: Time constraint trade-off. Would add:
- Unit tests (Jest) for utility functions
- Integration tests (Supertest) for API endpoints
- NLP pipeline tests with fixture articles
- SQL Injection Risk
  // UNSAFE: Direct string interpolation
  const sql = `INSERT INTO ${tableName} (${columns}) VALUES (${placeholders})`;
  Impact: Malicious input could manipulate queries
  Status: Documented risk; accepted for prototype
- No Authentication/Authorization
  - `/api/insertArticle` is publicly writable
  - Any client can insert articles for any `client_id`
  - Mitigation Needed: API keys, JWT tokens
- Exposed Credentials
  const pool = mysql.createPool({
    user: 'root',
    password: '', // Empty password
  });
  Production Fix: Environment variables + secrets management
- Open CORS Policy
  res.header('Access-Control-Allow-Origin', 'http://localhost:5500');
  Risk: Hardcoded to a single development origin, with no origin validation logic
- Synchronous Python Execution
  - Blocks Node event loop for 2-5 seconds per recommendation request
  - Sequential processing (not parallelized)
  - Impact: ~100 articles = 200-500 second total calculation time
- No Caching
  - Similarity scores recalculated on every request
  - Database round-trips not optimized
  - Impact: 10x slower than cached serving
- N+1 Query Pattern
  for (let i = 0; i < relatedStoryIDs.length; i++) {
    connection.query(selectRelatedSlugs, [relatedStoryIDs[i]], ...);
  }
  Impact: 3 separate queries instead of 1 with WHERE IN
- Hardcoded Article ID: `const collectionID = 4;` in `/api/getRelatedStories`
- No Pagination: `/orgs` returns all organizations (unbounded)
- In-Memory Python Process: Would crash with a multi-GB article corpus
- Comma-Separated Tags: Harder to query, filter, or index efficiently
- No Data Validation: Accepts malformed dates, invalid categories
- Duplicate Detection: Only by exact title match (misses near-duplicates)
// Parameterized queries
const sql = 'INSERT INTO articles (title, body) VALUES (?, ?)';
connection.query(sql, [title, body], callback);
// API authentication
app.use('/api/*', authenticateJWT);
// Environment variables
const dbConfig = {
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
};

// Redis for similarity scores
const cachedScore = await redis.get(`similarity:${articleId1}:${articleId2}`);
if (cachedScore) return cachedScore;
// Compute and cache
const score = await calculateSimilarity(...);
await redis.setex(key, 86400, score); // 24hr TTL

// Bull queue for background processing
const recommendationQueue = new Bull('recommendations', redisConfig);
recommendationQueue.process(async (job) => {
const { articleId } = job.data;
await calculateAndStoreRecommendations(articleId);
});
// Trigger on article insert
app.post('/api/insertArticle', async (req, res) => {
const articleId = await insertArticle(req.body);
await recommendationQueue.add({ articleId });
res.json({ success: true });
});

-- Normalize tags
CREATE TABLE tags (
tag_id INT PRIMARY KEY,
tag_name VARCHAR(100) UNIQUE
);
CREATE TABLE article_tags (
article_id INT,
tag_id INT,
PRIMARY KEY (article_id, tag_id),
INDEX idx_tag_lookup (tag_id)
);
-- Index similarity lookups
CREATE INDEX idx_collection ON related_stories(collection_id);

Collaborative Filtering: Track user clicks to blend content-based + user-based recommendations
CREATE TABLE user_interactions (
user_id VARCHAR(36),
article_id INT,
interaction_type ENUM('click', 'share', 'save'),
timestamp DATETIME
);

A/B Testing Framework: Compare TF-IDF vs. Word2Vec vs. BERT embeddings
const variant = abTest.getVariant(userId); // 'control' | 'treatment_a' | 'treatment_b'
const recommendations = await getRecommendations(articleId, { algorithm: variant });

Real-Time Updates: WebSocket for live recommendation refresh as the user browses
io.on('connection', (socket) => {
socket.on('articleView', async (articleId) => {
const recs = await getCachedRecommendations(articleId);
socket.emit('recommendations', recs);
});
});

1. Language Interoperability Trade-offs
Using Node + Python via child processes was initially appealing for "best tool for each job," but I underestimated coordination complexity. Synchronous IPC blocks the event loop, and error propagation across process boundaries is brittle.
Takeaway: For production, I'd use microservices with gRPC or migrate entirely to Python (FastAPI) with async libraries.
2. NLP Baseline Performance
TF-IDF exceeded expectations—80% accuracy without any model training. However, it fails on synonyms (e.g., "football" vs "soccer" aren't matched).
Learning: Always validate baselines before jumping to complex models. Word embeddings (Word2Vec/GloVe) would address this with ~20% code increase.
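For illustration, the synonym gap closes once static word vectors are in play. The sketch below is an assumption about a future direction, not part of the current system; note that `en_core_web_sm` ships no static word vectors, so a vector-equipped model such as `en_core_web_md` (or Word2Vec/GloVe embeddings) would be needed:

```python
import spacy

# Requires a model with static word vectors (en_core_web_md / en_core_web_lg);
# the en_core_web_sm model used for NER has none, so its similarity is unreliable.
nlp = spacy.load("en_core_web_md")

football, soccer = nlp("football"), nlp("soccer")
print(football.similarity(soccer))  # well above zero, unlike exact-term TF-IDF matching
```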
3. Importance of Caching Strategy
Realized during load testing that recalculating similarities for popular articles was wasteful.
Lesson: Profile early—Redis + 24hr TTL would've reduced 90% of computation. Now I instinctively design with caching layers from day one.
4. Cold Start Problem
Articles with no similar content get zero recommendations, creating poor UX.
Solution Discovered: Fallback to "popular in category" or "recently published" for sparse data. Reinforced that recommendation systems need hybrid strategies.
5. Data Normalization vs. Prototyping Speed
Storing tags as comma-separated strings was fast but created technical debt (string splitting, no SQL filtering).
Balance Learned: Denormalization OK for prototypes if you document the trade-off. Would refactor immediately in production.
6. Third-Party Website Integration
Extracting metadata from arbitrary HTML structures was harder than expected—websites use inconsistent meta tags, JSON-LD structures, or custom classes.
Adaptation: Built flexible fallback logic (check JSON-LD first, then meta tags, then DOM selectors).
Insight: Never assume standardization, even with schema.org.
7. Real-World Performance Constraints
My initial algorithm calculated all-pairs similarity (O(n²)), so 1,000 articles meant roughly one million comparisons.
Optimization: Pre-filter candidates by category, reducing the work to O(nm) where m << n (sketched below).
Lesson: Always consider worst-case scaling during design, not after implementation.
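A rough sketch of that pre-filter; the data shapes and field names are illustrative, not the production query path:

```python
# Only score candidates that share the source article's category, shrinking the
# pairwise work from all n articles to the m articles in that category.
def candidate_pool(source: dict, articles: list[dict]) -> list[dict]:
    return [
        a for a in articles
        if a["article_id"] != source["article_id"] and a["category"] == source["category"]
    ]


def score_candidates(source: dict, articles: list[dict], score_fn) -> list[tuple[int, float]]:
    """score_fn(a, b) returns a 0-100 similarity; output is (article_id, score) pairs."""
    return [(a["article_id"], score_fn(source, a)) for a in candidate_pool(source, articles)]
```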
8. Value of Modular Architecture
Separating NLP (Python) from API logic (Node) made debugging 10x faster—I could test similarity calculations independently.
Reinforcement: Clear boundaries between components reduce cognitive load and enable parallel development (even solo).
9. Documentation as Thinking Tool
Writing this portfolio doc surfaced gaps I hadn't noticed (e.g., hardcoded collectionID, missing input validation).
Practice Adopted: Document as you build, not after—it forces clearer design thinking.
| File | Purpose | Lines of Code |
|---|---|---|
| `app.js` | Express API server | ~450 |
| `test1.py` | TF-IDF similarity calculator | ~40 |
| `content_tags.py` | Batch NER tag enrichment | ~60 |
| `cookies.js` | Client-side tracking SDK | ~150 |
| `connect.php` | Organization registration | ~45 |
| `test.html` | Sample article for testing | ~600 (mostly HTML) |
Developer and Author: Dron Dasgupta
Contact: Available for technical deep-dives on architecture, the NLP pipeline, or integration strategy.