RPP - RAG Preparation Pipeline

A modular pipeline for scraping, parsing, and processing content into RAG-ready JSON artifacts.

Production: https://rag-scrape-pipeline-974351967139.us-west1.run.app
Local: http://localhost:9090
Source: https://github.com/susom/rag_scrape_pipeline

Features

  • Web API with HTML UI for interactive processing
  • URL scraping (HTML snapshots, main content extraction, PDF/DOCX attachment detection)
  • Batch document upload (multiple PDF, DOCX, TXT files in one operation)
  • Link following:
    • Web URLs: Follow PDF/DOCX attachments in main content
    • Uploaded docs: Extract and scrape web links, supporting both HTML and PDF URLs (optional, 1 level deep, rate-limited)
  • Source-aware AI extraction:
    • Web pages: Remove structural cruft (nav, ads, scripts), preserve policy content
    • Uploaded docs: Conservative preservation of all substantive content
    • Critical: Preserves metadata labels and dry regulatory language
  • AI-powered content filtering via SecureChatAI gateway
  • Multi-model support (GPT-4.1, Claude, Gemini, Llama, DeepSeek, etc.)
  • PDF parsing (via pdfplumber)
  • Local caching (cache/raw for raw HTML/PDF text)
  • Sliding window processing with deduplication
  • Canonical JSON output (cache/rag_ready/{run_id}.json)
  • GCS storage integration (optional)
  • SharePoint integration (input/output storage, automation)
  • CI/CD deployment (auto-deploy on git push)

Pipeline Flow

flowchart LR
    A[URLs] --> B[Scraper/PDF Parser] --> C[cache/raw]
    C --> D[Sliding Window] --> E[cache/rag_ready JSON]
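
A generic illustration of the sliding-window-with-deduplication step in Python; the window size, overlap, and hash-based dedup key here are illustrative, not the pipeline's actual parameters:

import hashlib

def sliding_windows(text: str, size: int = 2000, overlap: int = 200):
    """Yield overlapping character windows, skipping exact duplicate chunks."""
    seen = set()
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunk = text[start:start + size]
        key = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if key in seen:
            continue  # exact duplicate of an earlier window
        seen.add(key)
        yield chunk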

Quick Start

  1. Clone the repo
  2. Create a .env file with your credentials:
    REDCAP_API_URL=https://your-redcap-instance/api/
    REDCAP_API_TOKEN=your_token_here
    GCS_BUCKET=your-bucket-name  # optional
  3. Build and run:
    docker-compose build
    docker-compose up
  4. Open http://localhost:9090 in your browser

Usage

Web API (Primary)

Start the server and visit http://localhost:9090:

docker-compose up

The web UI allows you to:

  • Process URLs: Enter web URLs with optional PDF/DOCX attachment following
  • Upload documents: PDF, DOCX, TXT files (batch upload supported)
  • Follow web links in documents: Optional checkbox to extract and scrape URLs found in uploaded files (1 level deep)
    • ⚠️ Large batches with link following can take 30-60+ minutes
    • UI warns when uploading >3 files with link following enabled
    • Server timeout: 2 hours (sufficient for very large batches)
  • Configure prompts: Customize AI extraction behavior
  • Select AI model: Choose from multiple available models

API Endpoints

POST /run - Process URLs

curl -X POST http://localhost:9090/run \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "follow_links": true}'

Response:

{
  "status": "completed",
  "run_id": "rpp_2026-01-06T18-30-00Z_a1b2c3d4",
  "output_path": "cache/rag_ready/rpp_2026-01-06T18-30-00Z_a1b2c3d4.json",
  "stats": {"documents_processed": 1, "total_sections": 5, ...},
  "warnings": []
}
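
The same request from Python, assuming the requests package is available; the payload fields match the curl example above:

import requests

resp = requests.post(
    "http://localhost:9090/run",
    json={"urls": ["https://example.com"], "follow_links": True},
    timeout=7200,  # generous client timeout: link-following runs can be slow
)
resp.raise_for_status()
result = resp.json()
print(result["run_id"], result["output_path"])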

POST /upload - Upload and process documents

curl -X POST http://localhost:9090/upload \
  -F "[email protected]" \
  -F "follow_doc_links=true" \
  -F "model=gpt-4.1"

Parameters:

  • files: One or more files (PDF, DOCX, TXT)
  • follow_doc_links: Extract and scrape URLs found in documents (optional, default: false)
    • Max 20 URLs per document, 2-second delay between requests (rate limiting)
  • model: AI model to use (optional)
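
A batch-upload sketch in Python, again assuming the requests package; the field names match the curl example above:

import requests

files = [
    ("files", open("policy.pdf", "rb")),
    ("files", open("handbook.docx", "rb")),
]
data = {"follow_doc_links": "true", "model": "gpt-4.1"}

resp = requests.post("http://localhost:9090/upload", files=files, data=data, timeout=7200)
resp.raise_for_status()
print(resp.json())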

GET /download/{run_id} - Download JSON output for a run

GET /health - Health check
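
For example, using the run_id returned by /run:

curl http://localhost:9090/health
curl -o output.json http://localhost:9090/download/rpp_2026-01-06T18-30-00Z_a1b2c3d4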

CLI

Run the pipeline from the command line:

docker-compose run --rm scraper python -m rag_pipeline.main
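
To run the same entrypoint outside Docker (assuming a local Python 3 environment; the module path comes from the command above):

pip install -r requirements.txt
python -m rag_pipeline.main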

Output Format

RPP produces a single canonical JSON file per run at cache/rag_ready/{run_id}.json.

Schema version: rpp.v1

{
  "schema_version": "rpp.v1",
  "rpp_version": "0.2.0",
  "run": {
    "run_id": "rpp_2026-01-06T18-30-00Z_a1b2c3d4",
    "timestamp_start": "2026-01-06T18:30:00Z",
    "timestamp_end": "2026-01-06T18:32:15Z",
    "triggered_by": "web_api",
    "run_mode": "deterministic",
    "follow_links": true,
    "tags": []
  },
  "documents": [...],
  "aggregate_stats": {...},
  "warnings": [...]
}
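
A short Python sketch of consuming the artifact; only the top-level keys shown in the schema above are assumed (the document-level fields are elided here):

import json
from pathlib import Path

run_id = "rpp_2026-01-06T18-30-00Z_a1b2c3d4"  # example run_id
artifact = json.loads(Path(f"cache/rag_ready/{run_id}.json").read_text())

assert artifact["schema_version"] == "rpp.v1"
print(artifact["run"]["run_id"], len(artifact["documents"]), "documents")
for warning in artifact["warnings"]:
    print("warning:", warning)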

Project Structure

.
├── cache/
│   ├── raw/           # raw scraped HTML/PDF text
│   └── rag_ready/     # canonical JSON output
├── config/
│   ├── urls.txt       # default URL list
│   └── sliding_window_prompts.json
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
├── CLAUDE.md
└── rag_pipeline/
    ├── web.py              # FastAPI web interface (primary)
    ├── main.py             # CLI entrypoint + run_pipeline()
    ├── cli.py              # Interactive CLI
    ├── output_json.py      # Canonical JSON writer
    ├── scraping/
    │   ├── scraper.py
    │   └── pdf_parser.py
    ├── processing/
    │   ├── ai_client.py    # SecureChatAI proxy
    │   └── sliding_window.py
    ├── storage/
    │   └── storage.py
    └── utils/
        └── logger.py

Environment Variables

Variable          Required  Description
REDCAP_API_URL    Yes       REDCap API endpoint for SecureChatAI
REDCAP_API_TOKEN  Yes       REDCap API token
GCS_BUCKET        No        GCS bucket for artifact upload
STORAGE_MODE      No        local (default) or gcs

Production Deployment (Cloud Run)

⚠️ CRITICAL: When deploying to Cloud Run, you must increase the request timeout from the default 5 minutes to 60 minutes (3600 seconds) to support link-following operations.

Via Google Cloud Console:

  1. Go to Cloud Run → Select your service
  2. "Edit & Deploy New Revision" → "Container" tab
  3. Set "Request timeout" to 3600 seconds
  4. Deploy

Via gcloud CLI:

gcloud run services update YOUR_SERVICE_NAME --timeout=3600 --region=YOUR_REGION

Without this change, link-following operations on large batches will time out and fail.
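
To confirm the new timeout took effect (this just greps the timeout field out of the service description):

gcloud run services describe YOUR_SERVICE_NAME --region=YOUR_REGION --format=yaml | grep timeoutSeconds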


Recent Updates

Bug Fixes (2026-01-14)

Problem: Followed web links were being skipped, causing content loss for Stanford policy pages.

Root Causes:

  1. Field name typo in validation logic (content vs text)
  2. Over-aggressive AI prompts removing dry regulatory language
  3. Missing metadata label preservation

Fixes:

  1. ✅ Fixed field name in content validation (web.py, main.py)
  2. ✅ Updated AI prompts to explicitly preserve policy content and metadata labels
  3. ✅ Migrated to source-type-specific prompts (WebPage, DOCX, PDF, default)
  4. ✅ Added web link following to core pipeline (main.py)

Impact: Followed URLs now correctly preserve Stanford policy content.

Files Modified:

  • rag_pipeline/web.py - Fixed validation, preserved AI processing for followed links
  • rag_pipeline/processing/sliding_window.py - Updated default prompts
  • rag_pipeline/main.py - Added "web" follow mode
  • config/sliding_window_prompts.json - Nested structure with source-specific prompts

AI Extraction Philosophy

The pipeline uses source-type-aware extraction to apply the right level of filtering:

Web Pages (URL scraping):

  • PRESERVE: All policy content (even if dry/formal), metadata labels
  • REMOVE: Navigation, menus, headers, footers, ads, scripts, exact duplicates
  • Note: "Boilerplate" terminology removed - regulatory language is NOT boilerplate

Uploaded Documents (DOCX, PDF, TXT):

  • PRESERVE: ALL substantive content, references, citations, links, tables, metadata labels
  • REMOVE ONLY: Format artifacts, OCR errors, page numbers, corrupted characters

Followed Web Links:

  • Process with "WebPage" prompts (same as URL scraping)
  • Rate limited: max 20 URLs per document, 2-second delay

Prompts: Configured in config/sliding_window_prompts.json with nested structure.
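
A minimal sketch of source-type-aware prompt selection, assuming the config is a mapping keyed by the source types named above (WebPage, DOCX, PDF, default); the structure below the source-type key is not shown in this README, so the entry is returned as-is:

import json

with open("config/sliding_window_prompts.json") as f:
    prompts = json.load(f)

def prompts_for(source_type: str):
    # "WebPage", "DOCX", and "PDF" are the source-specific entries;
    # fall back to "default" for anything else.
    return prompts.get(source_type, prompts["default"])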


SharePoint Integration

The pipeline integrates with SharePoint for input/output storage and automation:

Input Sources:

  • Source Documents library: DOCX, PDF, TXT files to process
  • Source URLs library: .txt files with URL lists (one per line)

Outputs:

  • Pipeline Outputs library: Generated JSON files organized by date
  • Processing Log list: Metadata tracking (run_id, timestamp, status, files processed)

Automation Strategies:

  • Power Automate: Monitor SharePoint libraries for changes, trigger processing
  • Delta Detection: Track file modification dates and content hashes to skip unchanged content (see the sketch after this list)
  • Scheduled Runs: Weekly/daily processing of URL lists with delta checking
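
A minimal delta-detection sketch in Python, assuming a local JSON manifest of content hashes; the manifest path and hash choice are illustrative, not part of the pipeline:

import hashlib
import json
from pathlib import Path

MANIFEST = Path("cache/processed_hashes.json")  # hypothetical manifest location

def should_process(path: Path) -> bool:
    """Return True only if the file's content hash is new or has changed."""
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if seen.get(str(path)) == digest:
        return False  # unchanged since the last run; skip redundant processing
    seen[str(path)] = digest
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return True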

Benefits:

  • Centralized document storage
  • Audit trail and version control via SharePoint
  • Automated processing on file updates
  • 70-90% reduction in redundant processing

See the SharePoint wiki page for detailed integration workflows.


Run Modes

Mode           Description
ai_always      Every chunk passes through AI normalization (recommended)
deterministic  Pure text extraction, no AI calls
ai_auto        AI triggered by noise detection heuristics
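
If the run mode is exposed as a request field, a deterministic run might be requested as below; the run_mode field name mirrors the run metadata in the output schema and is an assumption, not a documented parameter:

import requests

payload = {
    "urls": ["https://example.com"],
    "follow_links": False,
    "run_mode": "deterministic",  # assumption: field name taken from the run metadata
}
resp = requests.post("http://localhost:9090/run", json=payload, timeout=600)
print(resp.json()["run_id"])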

Deployment

CI/CD: Automated via GitHub Actions

git push → GitHub Actions → Docker build → Cloud Run deployment

Repository: https://github.com/susom/rag_scrape_pipeline

Production URL: https://rag-scrape-pipeline-974351967139.us-west1.run.app

Rollback: Revert commit or redeploy specific tag via Cloud Run console
