An intelligent agentic document extraction solution built with Streamlit, FastAPI, and LangGraph. Extract structured data from PDF documents using AI, with human-in-the-loop feedback for refinement.
## Features

- **Agentic Extraction**: LangGraph-powered workflow for intelligent document understanding
- **Schema-Driven**: Define extraction fields via a JSON schema
- **Side-by-Side View**: PDF viewer alongside extracted data
- **Human-in-the-Loop**: Provide feedback to refine extractions
- **Iterative Refinement**: Re-run extraction with corrections
- **Confidence Scores**: See extraction confidence for each field
- **Modern UI**: Clean, responsive Streamlit interface
## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      Streamlit Frontend                       │
│  ┌──────────────────┐   ┌────────────────────────────┐        │
│  │    PDF Viewer    │   │    Extraction Results      │        │
│  │    - Page nav    │   │    - Field values          │        │
│  │    - Zoom        │   │    - Confidence bars       │        │
│  └──────────────────┘   │    - Feedback input        │        │
│                         └────────────────────────────┘        │
└───────────────────────────────┬───────────────────────────────┘
                                │ HTTP/REST
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                        FastAPI Backend                        │
│  ┌────────────┐  ┌────────────┐  ┌──────────────────────┐     │
│  │   Upload   │  │  Extract   │  │       Feedback       │     │
│  │  /upload   │  │  /extract  │  │      /feedback       │     │
│  └────────────┘  └────────────┘  └──────────────────────┘     │
└───────────────────────────────┬───────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                        LangGraph Agent                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐     │
│  │ Extract  │───▶│ Validate │───▶│    Apply Feedback    │     │
│  │   Node   │    │   Node   │    │         Node         │     │
│  └──────────┘    └──────────┘    └──────────────────────┘     │
│       ▲                                     │                 │
│       └─────────────────────────────────────┘                 │
│                  (Re-extraction loop)                         │
└───────────────────────────────────────────────────────────────┘
```
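The re-extraction loop maps naturally onto a LangGraph state graph. Below is a minimal sketch of how such a graph can be wired; it is illustrative, not the project's actual `extraction_agent.py`, and the state fields and node bodies are assumptions:

```python
from typing import Optional, TypedDict

from langgraph.graph import END, StateGraph


class ExtractionState(TypedDict):
    document_text: str
    schema: dict
    extracted: dict
    feedback: Optional[str]
    needs_retry: bool


def extract(state: ExtractionState) -> dict:
    # The real node would prompt the LLM with the schema and document text.
    return {"extracted": {}, "needs_retry": False}


def validate(state: ExtractionState) -> dict:
    # The real node would check required fields and confidence scores;
    # here we retry only while user feedback is pending.
    return {"needs_retry": state.get("feedback") is not None}


def apply_feedback(state: ExtractionState) -> dict:
    # Fold the user's feedback into the next extraction pass, then clear it.
    return {"feedback": None}


graph = StateGraph(ExtractionState)
graph.add_node("extract", extract)
graph.add_node("validate", validate)
graph.add_node("apply_feedback", apply_feedback)
graph.set_entry_point("extract")
graph.add_edge("extract", "validate")
graph.add_conditional_edges(
    "validate",
    lambda s: "apply_feedback" if s["needs_retry"] else END,
)
graph.add_edge("apply_feedback", "extract")  # the re-extraction loop
agent = graph.compile()
```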
## Prerequisites

- Python 3.9+
- An OpenAI API key
## Installation

1. Clone or navigate to the project:

   ```bash
   cd doc_extraction
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables:

   ```bash
   # Create .env file
   echo "OPENAI_API_KEY=your-openai-api-key-here" > .env
   ```
## Running the App

**Option 1: Using the startup script**

```bash
chmod +x start.sh
./start.sh
```

**Option 2: Manual startup**

Terminal 1 - start the API:

```bash
python run_api.py
```

Terminal 2 - start the frontend:

```bash
streamlit run streamlit_app.py
```

Once both are running, open:

- Frontend: http://localhost:8501
- API Docs: http://localhost:8000/docs
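If you want to confirm the backend is reachable before opening the UI, a minimal check (assuming the default host and port above) is:

```python
# Quick smoke test: the FastAPI docs page should return HTTP 200.
import httpx

print(httpx.get("http://localhost:8000/docs").status_code)
```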
## Usage

**1. Define a schema.** In the sidebar, define a JSON schema for the fields you want to extract:

```json
{
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "Invoice or document number"
},
"date": {
"type": "string",
"description": "Document date"
},
"total_amount": {
"type": "number",
"description": "Total amount"
}
},
"required": ["invoice_number", "date"]
}
```

**2. Upload a document.** Drag and drop or click to upload your PDF document.
Click "π Extract Fields" to run the AI extraction.
- View extracted values alongside their confidence scores
- Fields with low confidence are highlighted
- Navigate PDF pages to verify extraction
**5. Provide feedback.** If the results need correction:
- Enter feedback in the text area (e.g., "The date should be in MM/DD/YYYY format")
- Click "π Re-extract with Feedback"
- The agent will re-process the document, taking your feedback into account (the same step via the API is sketched below)
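For programmatic use, the feedback step can go through the backend directly. This is a sketch only: the `/feedback` endpoint exists (see the API table below), but the request body fields used here are assumptions, so check http://localhost:8000/docs for the actual model:

```python
# Submit feedback and trigger re-extraction (field names are assumed;
# see the interactive /docs page for the real request schema).
import httpx

session_id = "your-session-id"  # returned by the /upload endpoint

response = httpx.post(
    "http://localhost:8000/feedback",
    json={
        "session_id": session_id,
        "feedback": "The date should be in MM/DD/YYYY format",
    },
)
print(response.json())
```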
Click "π₯ Export Results" to download the extracted data as JSON.
## Configuration

| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key (required) | - |
| `OPENAI_MODEL` | Model to use | `gpt-4o` |
| `FASTAPI_HOST` | API host | `localhost` |
| `FASTAPI_PORT` | API port | `8000` |
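These variables are read by `config.py`. As a rough sketch of how that might look (the actual file may differ; `load_dotenv` assumes the `python-dotenv` package is available):

```python
# Hypothetical config.py-style settings loader; defaults match the table.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up the .env file created during installation

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")          # required
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o")
FASTAPI_HOST = os.getenv("FASTAPI_HOST", "localhost")
FASTAPI_PORT = int(os.getenv("FASTAPI_PORT", "8000"))
```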
## Schema Support

The JSON schema supports the following (a combined example follows the list):

- String fields: text values
- Number fields: numeric values
- Array fields: lists of items
- Nested objects: complex structures
- Required fields: mark fields that must be extracted
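For instance, a single schema can mix all of these; the `vendor` and `line_items` fields below are purely illustrative, not fields the app expects:

```json
{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string", "description": "Vendor name"},
        "address": {"type": "string", "description": "Vendor address"}
      }
    },
    "line_items": {
      "type": "array",
      "description": "One entry per invoice line",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["vendor"]
}
```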
## Project Structure

```
doc_extraction/
├── agent/
│   ├── __init__.py
│   └── extraction_agent.py   # LangGraph extraction workflow
├── api/
│   ├── __init__.py
│   ├── main.py               # FastAPI endpoints
│   └── models.py             # Pydantic schemas
├── utils/
│   ├── __init__.py
│   └── pdf_processor.py      # PDF text/image extraction
├── uploads/                  # Uploaded PDFs (auto-created)
├── config.py                 # Configuration settings
├── requirements.txt          # Python dependencies
├── run_api.py                # API startup script
├── start.sh                  # Combined startup script
├── streamlit_app.py          # Streamlit frontend
└── README.md
```
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/upload` | Upload a PDF file |
| `POST` | `/extract/{session_id}` | Extract fields from PDF |
| `POST` | `/feedback` | Submit feedback and re-extract |
| `GET` | `/pdf/{session_id}/page/{page_num}` | Get PDF page as image |
| `GET` | `/pdf/{session_id}/info` | Get PDF metadata |
| `GET` | `/session/{session_id}` | Get session status |
| `DELETE` | `/session/{session_id}` | Delete session |
## API Usage Example

```python
import httpx
import json

# Upload PDF
with open("document.pdf", "rb") as f:
    response = httpx.post(
        "http://localhost:8000/upload",
        files={"file": f},
    )
session_id = response.json()["session_id"]

# Extract fields
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"}
    }
}
response = httpx.post(
    f"http://localhost:8000/extract/{session_id}",
    data={"json_schema": json.dumps(schema)},
)
print(response.json())
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License.