A modular pipeline for scraping, parsing, and processing content into RAG-ready JSON artifacts.
Production: https://rag-scrape-pipeline-974351967139.us-west1.run.app
Local: http://localhost:9090
Source: https://github.com/susom/rag_scrape_pipeline
- Web API with HTML UI for interactive processing
- URL scraping (HTML snapshots, main content extraction, PDF/DOCX attachment detection)
- Batch document upload (multiple PDF, DOCX, TXT files in one operation)
- Link following:
- Web URLs: Follow PDF/DOCX attachments in main content
- Uploaded docs: Extract and scrape web links - supports both HTML and PDF URLs (optional, 1 level deep, rate-limited)
- Source-aware AI extraction:
- Web pages: Remove structural cruft (nav, ads, scripts), preserve policy content
- Uploaded docs: Conservative preservation of all substantive content
- Critical: Preserves metadata labels and dry regulatory language
- AI-powered content filtering via SecureChatAI gateway
- Multi-model support (GPT-4.1, Claude, Gemini, Llama, DeepSeek, etc.)
- PDF parsing (via `pdfplumber`)
- Local caching (`cache/raw` for raw HTML/PDF text)
- Sliding window processing with deduplication
- Canonical JSON output (`cache/rag_ready/{run_id}.json`)
- GCS storage integration (optional)
- SharePoint integration (input/output storage, automation)
- CI/CD deployment (auto-deploy on git push)
flowchart LR
A[URLs] --> B[Scraper/PDF Parser] --> C[cache/raw]
C --> D[Sliding Window] --> E[cache/rag_ready JSON]
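The Sliding Window step in the diagram is where raw text from `cache/raw` is chunked and deduplicated before AI normalization. The snippet below is a minimal conceptual sketch of that idea, not the project's code: the real logic lives in `rag_pipeline/processing/sliding_window.py`, and the window size, overlap, and input path shown here are assumptions.

```python
# Conceptual sketch only; window size, overlap, and the input path are assumptions,
# and the real implementation is rag_pipeline/processing/sliding_window.py.
import hashlib

def sliding_window_chunks(text: str, window: int = 2000, overlap: int = 200):
    """Yield overlapping character windows over the raw text."""
    step = window - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        yield text[start:start + window]

def deduplicate(chunks):
    """Drop chunks whose whitespace-normalized content has already been seen."""
    seen = set()
    for chunk in chunks:
        key = hashlib.sha256(" ".join(chunk.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield chunk

raw_text = open("cache/raw/example.txt", encoding="utf-8").read()  # hypothetical file
sections = list(deduplicate(sliding_window_chunks(raw_text)))
```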
- Clone the repo
- Create `.env` with your credentials:

  REDCAP_API_URL=https://your-redcap-instance/api/
  REDCAP_API_TOKEN=your_token_here
  GCS_BUCKET=your-bucket-name  # optional
- Build and run:

  docker-compose build
  docker-compose up
- Open http://localhost:9090 in your browser
Start the server and visit http://localhost:9090:
docker-compose up

The web UI allows you to:
- Process URLs: Enter web URLs with optional PDF/DOCX attachment following
- Upload documents: PDF, DOCX, TXT files (batch upload supported)
- Follow web links in documents: Optional checkbox to extract and scrape URLs found in uploaded files (1 level deep)
⚠️ Large batches with link following can take 30-60+ minutes; the UI warns when uploading >3 files with link following enabled
- Server timeout: 2 hours (sufficient for very large batches)
- Configure prompts: Customize AI extraction behavior
- Select AI model: Choose from multiple available models
POST /run - Process URLs
curl -X POST http://localhost:9090/run \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "follow_links": true}'Response:
{
"status": "completed",
"run_id": "rpp_2026-01-06T18-30-00Z_a1b2c3d4",
"output_path": "cache/rag_ready/rpp_2026-01-06T18-30-00Z_a1b2c3d4.json",
"stats": {"documents_processed": 1, "total_sections": 5, ...},
"warnings": []
}

POST /upload - Upload and process documents
curl -X POST http://localhost:9090/upload \
-F "[email protected]" \
-F "follow_doc_links=true" \
-F "model=gpt-4.1"Parameters:
- `files`: One or more files (PDF, DOCX, TXT)
- `follow_doc_links`: Extract and scrape URLs found in documents (optional, default: false)
  - Max 20 URLs per document, 2-second delay between requests (rate limiting)
- `model`: AI model to use (optional)
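For scripting against the API rather than using the web UI, a small client can call the endpoints documented above. The sketch below assumes a local server on port 9090 and uses the `requests` package; the `policy.pdf` filename and the timeout values are placeholders.

```python
# Client sketch for the documented endpoints; base URL, filename, and timeouts
# are assumptions, not pipeline defaults.
import requests

BASE = "http://localhost:9090"

# POST /run - process a list of URLs with link following
run = requests.post(
    f"{BASE}/run",
    json={"urls": ["https://example.com"], "follow_links": True},
    timeout=7200,  # large batches with link following can run for a long time
)
run_id = run.json()["run_id"]

# POST /upload - upload a document and follow links found inside it
with open("policy.pdf", "rb") as fh:  # hypothetical file
    requests.post(
        f"{BASE}/upload",
        files={"files": fh},
        data={"follow_doc_links": "true", "model": "gpt-4.1"},
        timeout=7200,
    )

# GET /download/{run_id} - save the canonical JSON artifact
artifact = requests.get(f"{BASE}/download/{run_id}", timeout=60)
with open(f"{run_id}.json", "wb") as out:
    out.write(artifact.content)
```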
GET /download/{run_id} - Download JSON output for a run
GET /health - Health check
Run the pipeline from the command line:

docker-compose run --rm scraper python -m rag_pipeline.main

RPP produces a single canonical JSON file per run at `cache/rag_ready/{run_id}.json`.
Schema version: rpp.v1
{
"schema_version": "rpp.v1",
"rpp_version": "0.2.0",
"run": {
"run_id": "rpp_2026-01-06T18-30-00Z_a1b2c3d4",
"timestamp_start": "2026-01-06T18:30:00Z",
"timestamp_end": "2026-01-06T18:32:15Z",
"triggered_by": "web_api",
"run_mode": "deterministic",
"follow_links": true,
"tags": []
},
"documents": [...],
"aggregate_stats": {...},
"warnings": [...]
}
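Downstream RAG ingestion can treat this artifact as the single source of truth for a run. The sketch below loads and sanity-checks one; it relies only on the top-level keys shown above, since the per-document structure is elided in this excerpt.

```python
# Uses only the top-level keys shown in the schema excerpt above.
import json
from pathlib import Path

path = Path("cache/rag_ready/rpp_2026-01-06T18-30-00Z_a1b2c3d4.json")
artifact = json.loads(path.read_text())

assert artifact["schema_version"] == "rpp.v1"
print("run_id:", artifact["run"]["run_id"])
print("documents:", len(artifact["documents"]))
for warning in artifact["warnings"]:
    print("warning:", warning)
```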
.
├── cache/
│   ├── raw/                        # raw scraped HTML/PDF text
│   └── rag_ready/                  # canonical JSON output
├── config/
│   ├── urls.txt                    # default URL list
│   └── sliding_window_prompts.json
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
├── CLAUDE.md
└── rag_pipeline/
    ├── web.py                      # FastAPI web interface (primary)
    ├── main.py                     # CLI entrypoint + run_pipeline()
    ├── cli.py                      # Interactive CLI
    ├── output_json.py              # Canonical JSON writer
    ├── scraping/
    │   ├── scraper.py
    │   └── pdf_parser.py
    ├── processing/
    │   ├── ai_client.py            # SecureChatAI proxy
    │   └── sliding_window.py
    ├── storage/
    │   └── storage.py
    └── utils/
        └── logger.py
| Variable | Required | Description |
|---|---|---|
| `REDCAP_API_URL` | Yes | REDCap API endpoint for SecureChatAI |
| `REDCAP_API_TOKEN` | Yes | REDCap API token |
| `GCS_BUCKET` | No | GCS bucket for artifact upload |
| `STORAGE_MODE` | No | `local` (default) or `gcs` |
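As a rough illustration of how these variables fit together (not the pipeline's actual configuration code), required values can be read with `os.environ` and optional ones with defaults:

```python
# Illustrative only; the pipeline may resolve configuration differently.
import os

REDCAP_API_URL = os.environ["REDCAP_API_URL"]        # required
REDCAP_API_TOKEN = os.environ["REDCAP_API_TOKEN"]    # required
GCS_BUCKET = os.getenv("GCS_BUCKET")                 # optional
STORAGE_MODE = os.getenv("STORAGE_MODE", "local")    # "local" (default) or "gcs"

if STORAGE_MODE == "gcs" and not GCS_BUCKET:
    raise RuntimeError("STORAGE_MODE=gcs requires GCS_BUCKET to be set")
```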
Via Google Cloud Console:
- Go to Cloud Run → Select your service
- "Edit & Deploy New Revision" → "Container" tab
- Set "Request timeout" to 3600 seconds
- Deploy
Via gcloud CLI:
gcloud run services update YOUR_SERVICE_NAME --timeout=3600 --region=YOUR_REGION

Without this change, link-following operations on large batches will time out and fail.
Problem: Followed web links were being skipped, causing content loss for Stanford policy pages.
Root Causes:
- Field name typo in validation logic (`content` vs `text`)
- Over-aggressive AI prompts removing dry regulatory language
- Missing metadata label preservation
Fixes:
- ✅ Fixed field name in content validation (web.py, main.py)
- ✅ Updated AI prompts to explicitly preserve policy content and metadata labels
- ✅ Migrated to source-type-specific prompts (WebPage, DOCX, PDF, default)
- ✅ Added web link following to core pipeline (main.py)
Impact: Followed URLs now correctly preserve Stanford policy content.
Files Modified:
- `rag_pipeline/web.py` - Fixed validation, preserved AI processing for followed links
- `rag_pipeline/processing/sliding_window.py` - Updated default prompts
- `rag_pipeline/main.py` - Added "web" follow mode
- `config/sliding_window_prompts.json` - Nested structure with source-specific prompts
The pipeline uses source-type-aware extraction to apply the right level of filtering:
Web Pages (URL scraping):
- ✅ PRESERVE: All policy content (even if dry/formal), metadata labels
- ❌ REMOVE: Navigation, menus, headers, footers, ads, scripts, exact duplicates
- Note: "Boilerplate" terminology removed - regulatory language is NOT boilerplate
Uploaded Documents (DOCX, PDF, TXT):
- ✅ PRESERVE: ALL substantive content, references, citations, links, tables, metadata labels
- ❌ REMOVE ONLY: Format artifacts, OCR errors, page numbers, corrupted characters
Followed Web Links:
- Process with "WebPage" prompts (same as URL scraping)
- Rate limited: max 20 URLs per document, 2-second delay
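The rate limiting described above amounts to capping the link count and pausing between requests. A minimal sketch, with `scrape_fn` standing in for whatever fetch/parse function is used:

```python
# Sketch of the documented limits (max 20 URLs per document, 2-second delay);
# scrape_fn is a placeholder, not a function from the pipeline.
import time

MAX_URLS_PER_DOC = 20
DELAY_SECONDS = 2

def follow_links(urls, scrape_fn):
    results = []
    for url in urls[:MAX_URLS_PER_DOC]:
        results.append(scrape_fn(url))
        time.sleep(DELAY_SECONDS)
    return results
```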
Prompts: Configured in `config/sliding_window_prompts.json` with a nested, source-type-specific structure.
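The exact key names in that file are not reproduced here, so the snippet below is illustrative only: it assumes one prompt block per source type (WebPage, DOCX, PDF) plus a default, matching the source-type-specific behavior described above.

```python
# Assumes the nested prompts file is keyed by source type with a "default"
# fallback; the real key names may differ.
import json
from pathlib import Path

prompts = json.loads(Path("config/sliding_window_prompts.json").read_text())

def prompt_for(source_type: str):
    """Pick the prompt block for a source type, falling back to the default."""
    return prompts.get(source_type, prompts.get("default"))

web_prompt = prompt_for("WebPage")
```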
The pipeline integrates with SharePoint for input/output storage and automation:
Input Sources:
- Source Documents library: DOCX, PDF, TXT files to process
- Source URLs library: `.txt` files with URL lists (one per line)
Outputs:
- Pipeline Outputs library: Generated JSON files organized by date
- Processing Log list: Metadata tracking (run_id, timestamp, status, files processed)
Automation Strategies:
- Power Automate: Monitor SharePoint libraries for changes, trigger processing
- Delta Detection: Track file modification dates and content hashes to skip unchanged content (see the sketch after this list)
- Scheduled Runs: Weekly/daily processing of URL lists with delta checking
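Delta detection can be as simple as storing a content hash per file and skipping files whose hash has not changed since the last run. A minimal sketch, with the manifest path and helper names as assumptions:

```python
# Minimal delta-detection sketch; the manifest path and names are assumptions.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("cache/processed_manifest.json")  # hypothetical location

def content_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_process(paths):
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = [p for p in paths if seen.get(str(p)) != content_hash(p)]
    seen.update({str(p): content_hash(p) for p in changed})
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return changed
```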
Benefits:
- Centralized document storage
- Audit trail and version control via SharePoint
- Automated processing on file updates
- 70-90% reduction in redundant processing
See the SharePoint wiki page for detailed integration workflows.
| Mode | Description |
|---|---|
| `ai_always` | Every chunk passes through AI normalization (recommended) |
| `deterministic` | Pure text extraction, no AI calls |
| `ai_auto` | AI triggered by noise detection heuristics |
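As a rough mental model (not the pipeline's actual code), the mode decides whether a chunk is sent to the AI client at all; `ai_normalize` and `looks_noisy` below are placeholders:

```python
# Placeholder functions; this only illustrates how the three modes differ.
def process_chunk(chunk: str, run_mode: str, ai_normalize, looks_noisy) -> str:
    if run_mode == "deterministic":
        return chunk                                    # no AI calls
    if run_mode == "ai_always":
        return ai_normalize(chunk)                      # every chunk through AI
    if run_mode == "ai_auto":
        return ai_normalize(chunk) if looks_noisy(chunk) else chunk
    raise ValueError(f"unknown run_mode: {run_mode}")
```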
CI/CD: Automated via GitHub Actions
git push → GitHub Actions → Docker build → Cloud Run deployment
Repository: https://github.com/susom/rag_scrape_pipeline
Production URL: https://rag-scrape-pipeline-974351967139.us-west1.run.app
Rollback: Revert commit or redeploy specific tag via Cloud Run console