A GPU-accelerated document extraction pipeline built on NVIDIA nv-ingest (~2.8K stars). Extract text, tables, charts, and metadata from any PDF corpus and convert the output into chat-format JSONL training data for fine-tuning LLMs on your domain.
This repo is a starter / shell. The architecture is generic. You bring the documents, you choose the system prompt, you decide what your domain-specific question framing looks like — the pipeline handles the heavy lifting of extraction, chunking, formatting, and benchmarking.
- One-command nv-ingest setup via docker compose — no manual model downloads, no CUDA-version headaches
- A clean Python API to extract any PDF corpus
- A converter that turns extracted content into chat-format JSONL ready to feed into training pipelines (the format used by
axolotl,unsloth, and most QLoRA / SFT toolchains) - A benchmark harness that compares nv-ingest against a CPU PyPDF2 baseline, so you can quantify the speedup on your own documents
- 61 unit tests that all run without a GPU, so CI is fast and you can iterate on the converter without spinning up nv-ingest
- Extension points documented in docs/CUSTOMIZATION.md — subclass the converter to add domain-specific topic extraction, ID detection, or question framing
your PDFs (any domain)
|
nv-ingest (GPU extraction)
- Text with paragraph/heading classification
- Tables with row/column structure
- Charts and diagrams
|
DocumentConverter (this repo)
- Chunks long text on paragraph + sentence boundaries
- Emits chat-format training examples
- Optional: subclass to add your own topic extraction
|
chat-format JSONL
{"messages": [system, user, assistant], "source": "nv_ingest_..."}
|
feed into your training pipeline
(axolotl, unsloth, transformers, etc.)
The full step-by-step is in docs/SETUP.md. The 60-second version:
git clone https://github.com/NathanMaine/nv-ingest-document-pipeline.git
cd nv-ingest-document-pipeline
# 1. Install Python deps (no GPU needed for the converter or tests)
pip install -r requirements.txt
# 2. Run the test suite (all 51 tests pass without GPU or nv-ingest)
python -m pytest tests/ -v
# 3. (When ready for real extraction) start nv-ingest
docker login nvcr.io # free NGC account required
docker compose up -d # ~10-15 min first run for model loading
docker compose logs -f # watch readiness
# 4. Drop a PDF in data/sample_pdfs/, then:
python -c "
from src.extractor import DocumentExtractor
from src.converter import DocumentConverter
ex = DocumentExtractor()
results = ex.extract(['data/sample_pdfs/your.pdf'])
conv = DocumentConverter()
conv.convert_to_jsonl(results, 'output/training_data.jsonl')
"| Decision | You choose | Pipeline provides |
|---|---|---|
| Document corpus | Your PDFs | Extraction at scale |
| System prompt | Domain-specific role | Generic default works for any topic |
| Question framing | Subclass _generate_question() |
Generic question templates that work for "what / how / why" |
| Topic extraction | Override _extract_topic() to detect your domain's identifiers |
Falls back to first-sentence heuristic |
| Chunk size | min_length and max_length constructor args |
Defaults of 100 / 4000 chars work for most corpora |
| Output format | Anywhere you like (the JSONL is standard chat format) | Chat format compatible with axolotl, unsloth, transformers |
| Doc | What it covers |
|---|---|
| docs/SETUP.md | Complete install walkthrough — Python, Docker, NGC login, GPU drivers, first extraction |
| docs/HARDWARE.md | GPU recommendations for different workload sizes, RAM/disk requirements, NVIDIA driver compatibility |
| docs/CUSTOMIZATION.md | How to subclass DocumentConverter for your domain — topic extraction, identifier detection, custom question framing, custom chunking |
| docs/EXAMPLES.md | Worked examples — academic papers, legal contracts, technical documentation, scientific datasets |
| docs/TROUBLESHOOTING.md | Common errors and fixes — Docker daemon, NGC auth, OOM during model load, slow extraction |
| examples/custom_converter_example.py | Working code that subclasses DocumentConverter to add custom domain logic |
nv-ingest-document-pipeline/
├── src/
│ ├── extractor.py # nv-ingest API wrapper (DocumentExtractor)
│ ├── converter.py # nv-ingest JSON -> chat training format (DocumentConverter)
│ └── benchmark.py # PyPDF2 vs nv-ingest side-by-side comparison
├── tests/
│ ├── test_extractor.py # 18 tests (mock data, no GPU)
│ ├── test_converter.py # 22 tests (mock data, no GPU)
│ └── test_benchmark.py # 11 tests (mock data, no GPU)
├── data/
│ ├── sample_pdfs/ # Drop your test PDFs here
│ └── expected_output/ # Ground truth JSONL for validation (you create this)
├── docs/
│ ├── SETUP.md
│ ├── HARDWARE.md
│ ├── CUSTOMIZATION.md
│ ├── EXAMPLES.md
│ └── TROUBLESHOOTING.md
├── examples/
│ └── custom_converter_example.py
├── docker-compose.yml # nv-ingest service definition
├── requirements.txt # Pinned Python dependencies
├── LICENSE # Apache 2.0
└── README.md # This file
| Use case | GPU | Reason |
|---|---|---|
| Development, testing, converter only | None | All 51 tests run on CPU, converter is pure Python |
| Small workload (<20 PDFs/day) | RTX 3090 / 4090 (24 GB) | Fits nv-ingest's models with headroom |
| Medium workload (20-200 PDFs/day) | A40 / L40S (48 GB) | Faster inference, supports larger batches |
| Production / heavy batch jobs | A100 80GB / H100 / DGX Spark | Best throughput per PDF, room for additional models |
Full details in docs/HARDWARE.md.
You can extract PDFs with PyPDF2, pdfminer.six, pdfplumber, Marker, MinerU, Docling, and a dozen other tools. The benchmark in this repo lets you compare nv-ingest against the PyPDF2 baseline on your own documents, but the broad story is:
- GPU-accelerated — extraction time scales with VRAM, not CPU cores
- Table-aware — preserves row/column structure that PyPDF2 flattens
- Chart-aware — extracts diagrams and charts as structured text
- Maintained by NVIDIA — production-grade with regular updates
- Microservice architecture — easy to scale horizontally with Kubernetes
- OCR fallback — handles scanned PDFs, not just native text PDFs
If your corpus is mostly clean native-text PDFs and you don't need tables, PyPDF2 is fine. If you have any combination of dense tables, scanned pages, or charts you need preserved as text — nv-ingest pays for itself.
Apache 2.0 — see LICENSE.
Built on NVIDIA nv-ingest, a GPU-accelerated document extraction microservice. Thanks to the nv-ingest team — including Edward Kim, Jeremy Dyer, and Devin Robison — for building and maintaining the project.
The chat-format JSONL output is compatible with most modern fine-tuning toolchains including axolotl, unsloth, and the Hugging Face transformers library's TRL trainer.