
πŸ“„ DocExtract AI

An intelligent agentic document extraction solution built with Streamlit, FastAPI, and LangGraph. Extract structured data from PDF documents using AI, with human-in-the-loop feedback for refinement.


✨ Features

  • πŸ€– Agentic Extraction: LangGraph-powered workflow for intelligent document understanding
  • πŸ“‹ Schema-Driven: Define extraction fields via JSON schema
  • πŸ‘οΈ Side-by-Side View: PDF viewer alongside extracted data
  • πŸ’¬ Human-in-the-Loop: Provide feedback to refine extractions
  • πŸ”„ Iterative Refinement: Re-run extraction with corrections
  • πŸ“Š Confidence Scores: See extraction confidence for each field
  • 🎨 Modern UI: Beautiful, responsive Streamlit interface

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Streamlit Frontend                        β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚   PDF Viewer     β”‚      β”‚   Extraction Results     β”‚     β”‚
β”‚  β”‚   - Page nav     β”‚      β”‚   - Field values         β”‚     β”‚
β”‚  β”‚   - Zoom         β”‚      β”‚   - Confidence bars      β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚   - Feedback input       β”‚     β”‚
β”‚                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚ HTTP/REST
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    FastAPI Backend                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  Upload    β”‚  β”‚  Extract   β”‚  β”‚     Feedback         β”‚   β”‚
β”‚  β”‚  /upload   β”‚  β”‚  /extract  β”‚  β”‚     /feedback        β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   LangGraph Agent                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Extract  │───▢│ Validate │───▢│ Apply Feedback       β”‚   β”‚
β”‚  β”‚   Node   β”‚    β”‚   Node   β”‚    β”‚      Node            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚        β”‚                                    β”‚               β”‚
β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β—€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚                    (Re-extraction loop)                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Prerequisites

  • Python 3.9+
  • OpenAI API key

Installation

  1. Clone or navigate to the project:

    cd doc_extraction
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables:

    # Create .env file
    echo "OPENAI_API_KEY=your-openai-api-key-here" > .env

Running the Application

Option 1: Using the startup script

chmod +x start.sh
./start.sh

Option 2: Manual startup

Terminal 1 - Start the API:

python run_api.py

Terminal 2 - Start the frontend:

streamlit run streamlit_app.py
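
run_api.py is a thin launcher. A plausible sketch of what such a script contains, assuming uvicorn (the usual server for FastAPI apps) and hypothetical names exported by config.py:

import uvicorn

from config import FASTAPI_HOST, FASTAPI_PORT  # hypothetical names in config.py

if __name__ == "__main__":
    # Serve the FastAPI app defined in api/main.py.
    uvicorn.run("api.main:app", host=FASTAPI_HOST, port=FASTAPI_PORT, reload=True)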

Access the Application

Once both processes are running, open:

  • Frontend (Streamlit UI): http://localhost:8501
  • API (FastAPI): http://localhost:8000
  • Interactive API docs: http://localhost:8000/docs

These are the defaults: Streamlit serves on 8501, and the API port is configurable via FASTAPI_PORT (default 8000).

πŸ“– Usage Guide

1. Define Your Schema

In the sidebar, define a JSON schema for the fields you want to extract:

{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Invoice or document number"
    },
    "date": {
      "type": "string",
      "description": "Document date"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount"
    }
  },
  "required": ["invoice_number", "date"]
}

2. Upload a PDF

Drag and drop or click to upload your PDF document.

3. Extract Fields

Click "πŸš€ Extract Fields" to run the AI extraction.

4. Review Results

  • View extracted values alongside their confidence scores
  • Fields with low confidence are highlighted
  • Navigate PDF pages to verify extraction
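
Conceptually, each extracted field pairs a value with a confidence score. An illustrative shape (hypothetical, not the API's exact wire format):

{
  "invoice_number": {"value": "INV-2024-001", "confidence": 0.97},
  "date": {"value": "2024-03-15", "confidence": 0.88},
  "total_amount": {"value": 1250.0, "confidence": 0.42}
}

Here total_amount would be highlighted as low-confidence and is worth verifying against the PDF.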

5. Provide Feedback (Optional)

If results need correction:

  1. Enter feedback in the text area (e.g., "The date should be in MM/DD/YYYY format")
  2. Click "πŸ”„ Re-extract with Feedback"
  3. The agent re-processes the document, taking your feedback into account

6. Export Results

Click "πŸ“₯ Export Results" to download the extracted data as JSON.

πŸ”§ Configuration

Environment Variables

| Variable       | Description                | Default   |
|----------------|----------------------------|-----------|
| OPENAI_API_KEY | OpenAI API key (required)  | -         |
| OPENAI_MODEL   | Model to use               | gpt-4o    |
| FASTAPI_HOST   | API host                   | localhost |
| FASTAPI_PORT   | API port                   | 8000      |
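
A minimal sketch of how config.py might load these; the variable names match the table above, but the loading code itself is an assumption:

import os

from dotenv import load_dotenv  # python-dotenv, assumed since a .env file is used

load_dotenv()  # read .env from the project root

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]          # required, no default
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")     # model to use
FASTAPI_HOST = os.getenv("FASTAPI_HOST", "localhost")  # API host
FASTAPI_PORT = int(os.getenv("FASTAPI_PORT", "8000"))  # API port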

Customizing the Schema

The JSON schema supports:

  • String fields: Text values
  • Number fields: Numeric values
  • Array fields: Lists of items
  • Nested objects: Complex structures
  • Required fields: Mark fields that must be extracted
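
For example, a single schema can combine all of these (the vendor and line_items fields below are hypothetical, not from the project's samples):

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["vendor"]
}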

πŸ“ Project Structure

doc_extraction/
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── extraction_agent.py    # LangGraph extraction workflow
β”œβ”€β”€ api/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py               # FastAPI endpoints
β”‚   └── models.py             # Pydantic schemas
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── pdf_processor.py      # PDF text/image extraction
β”œβ”€β”€ uploads/                   # Uploaded PDFs (auto-created)
β”œβ”€β”€ config.py                 # Configuration settings
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ run_api.py               # API startup script
β”œβ”€β”€ start.sh                 # Combined startup script
β”œβ”€β”€ streamlit_app.py         # Streamlit frontend
└── README.md

πŸ› οΈ API Reference

Endpoints

| Method | Endpoint                          | Description                    |
|--------|-----------------------------------|--------------------------------|
| POST   | /upload                           | Upload a PDF file              |
| POST   | /extract/{session_id}             | Extract fields from PDF        |
| POST   | /feedback                         | Submit feedback and re-extract |
| GET    | /pdf/{session_id}/page/{page_num} | Get PDF page as image          |
| GET    | /pdf/{session_id}/info            | Get PDF metadata               |
| GET    | /session/{session_id}             | Get session status             |
| DELETE | /session/{session_id}             | Delete session                 |

Example API Usage

import httpx
import json

# Upload PDF
with open("document.pdf", "rb") as f:
    response = httpx.post(
        "http://localhost:8000/upload",
        files={"file": f}
    )
    session_id = response.json()["session_id"]

# Extract fields
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"}
    }
}

response = httpx.post(
    f"http://localhost:8000/extract/{session_id}",
    data={"json_schema": json.dumps(schema)}
)
print(response.json())
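
Submitting feedback and re-extracting follows the same pattern. The exact request body for /feedback isn't documented above, so the field names below (session_id, feedback) are assumptions:

# Submit feedback and trigger a re-extraction (field names are assumed)
response = httpx.post(
    "http://localhost:8000/feedback",
    data={
        "session_id": session_id,
        "feedback": "The date should be in MM/DD/YYYY format",
    },
)
print(response.json())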

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the MIT License.
