nv-ingest Document Pipeline

A GPU-accelerated document extraction pipeline built on NVIDIA nv-ingest (~2.8K stars). Extract text, tables, charts, and metadata from any PDF corpus and convert the output into chat-format JSONL training data for fine-tuning LLMs on your domain.

This repo is a starter / shell. The architecture is generic. You bring the documents, you choose the system prompt, you decide what your domain-specific question framing looks like — the pipeline handles the heavy lifting of extraction, chunking, formatting, and benchmarking.

What this gives you

One-command nv-ingest setup via docker compose — no manual model downloads, no CUDA-version headaches
A clean Python API to extract any PDF corpus
A converter that turns extracted content into chat-format JSONL ready to feed into training pipelines (the format used by axolotl, unsloth, and most QLoRA / SFT toolchains)
A benchmark harness that compares nv-ingest against a CPU PyPDF2 baseline, so you can quantify the speedup on your own documents
61 unit tests that all run without a GPU, so CI is fast and you can iterate on the converter without spinning up nv-ingest
Extension points documented in docs/CUSTOMIZATION.md — subclass the converter to add domain-specific topic extraction, ID detection, or question framing

Pipeline flow

your PDFs (any domain)
        |
   nv-ingest (GPU extraction)
   - Text with paragraph/heading classification
   - Tables with row/column structure
   - Charts and diagrams
        |
   DocumentConverter (this repo)
   - Chunks long text on paragraph + sentence boundaries
   - Emits chat-format training examples
   - Optional: subclass to add your own topic extraction
        |
   chat-format JSONL
   {"messages": [system, user, assistant], "source": "nv_ingest_..."}
        |
   feed into your training pipeline
   (axolotl, unsloth, transformers, etc.)

Quick start

The full step-by-step is in docs/SETUP.md. The 60-second version:

git clone https://github.com/NathanMaine/nv-ingest-document-pipeline.git
cd nv-ingest-document-pipeline

# 1. Install Python deps (no GPU needed for the converter or tests)
pip install -r requirements.txt

# 2. Run the test suite (all 51 tests pass without GPU or nv-ingest)
python -m pytest tests/ -v

# 3. (When ready for real extraction) start nv-ingest
docker login nvcr.io          # free NGC account required
docker compose up -d           # ~10-15 min first run for model loading
docker compose logs -f         # watch readiness

# 4. Drop a PDF in data/sample_pdfs/, then:
python -c "
from src.extractor import DocumentExtractor
from src.converter import DocumentConverter

ex = DocumentExtractor()
results = ex.extract(['data/sample_pdfs/your.pdf'])

conv = DocumentConverter()
conv.convert_to_jsonl(results, 'output/training_data.jsonl')
"

What you choose, what the pipeline gives you

Decision	You choose	Pipeline provides
Document corpus	Your PDFs	Extraction at scale
System prompt	Domain-specific role	Generic default works for any topic
Question framing	Subclass `_generate_question()`	Generic question templates that work for "what / how / why"
Topic extraction	Override `_extract_topic()` to detect your domain's identifiers	Falls back to first-sentence heuristic
Chunk size	`min_length` and `max_length` constructor args	Defaults of 100 / 4000 chars work for most corpora
Output format	Anywhere you like (the JSONL is standard chat format)	Chat format compatible with axolotl, unsloth, transformers

Documentation

Doc	What it covers
docs/SETUP.md	Complete install walkthrough — Python, Docker, NGC login, GPU drivers, first extraction
docs/HARDWARE.md	GPU recommendations for different workload sizes, RAM/disk requirements, NVIDIA driver compatibility
docs/CUSTOMIZATION.md	How to subclass `DocumentConverter` for your domain — topic extraction, identifier detection, custom question framing, custom chunking
docs/EXAMPLES.md	Worked examples — academic papers, legal contracts, technical documentation, scientific datasets
docs/TROUBLESHOOTING.md	Common errors and fixes — Docker daemon, NGC auth, OOM during model load, slow extraction
examples/custom_converter_example.py	Working code that subclasses `DocumentConverter` to add custom domain logic

Project structure

nv-ingest-document-pipeline/
├── src/
│   ├── extractor.py      # nv-ingest API wrapper (DocumentExtractor)
│   ├── converter.py       # nv-ingest JSON -> chat training format (DocumentConverter)
│   └── benchmark.py       # PyPDF2 vs nv-ingest side-by-side comparison
├── tests/
│   ├── test_extractor.py  # 18 tests (mock data, no GPU)
│   ├── test_converter.py  # 22 tests (mock data, no GPU)
│   └── test_benchmark.py  # 11 tests (mock data, no GPU)
├── data/
│   ├── sample_pdfs/       # Drop your test PDFs here
│   └── expected_output/   # Ground truth JSONL for validation (you create this)
├── docs/
│   ├── SETUP.md
│   ├── HARDWARE.md
│   ├── CUSTOMIZATION.md
│   ├── EXAMPLES.md
│   └── TROUBLESHOOTING.md
├── examples/
│   └── custom_converter_example.py
├── docker-compose.yml     # nv-ingest service definition
├── requirements.txt       # Pinned Python dependencies
├── LICENSE                # Apache 2.0
└── README.md              # This file

Hardware requirements (TL;DR)

Use case	GPU	Reason
Development, testing, converter only	None	All 51 tests run on CPU, converter is pure Python
Small workload (<20 PDFs/day)	RTX 3090 / 4090 (24 GB)	Fits nv-ingest's models with headroom
Medium workload (20-200 PDFs/day)	A40 / L40S (48 GB)	Faster inference, supports larger batches
Production / heavy batch jobs	A100 80GB / H100 / DGX Spark	Best throughput per PDF, room for additional models

Full details in docs/HARDWARE.md.

Why nv-ingest specifically

You can extract PDFs with PyPDF2, pdfminer.six, pdfplumber, Marker, MinerU, Docling, and a dozen other tools. The benchmark in this repo lets you compare nv-ingest against the PyPDF2 baseline on your own documents, but the broad story is:

GPU-accelerated — extraction time scales with VRAM, not CPU cores
Table-aware — preserves row/column structure that PyPDF2 flattens
Chart-aware — extracts diagrams and charts as structured text
Maintained by NVIDIA — production-grade with regular updates
Microservice architecture — easy to scale horizontally with Kubernetes
OCR fallback — handles scanned PDFs, not just native text PDFs

If your corpus is mostly clean native-text PDFs and you don't need tables, PyPDF2 is fine. If you have any combination of dense tables, scanned pages, or charts you need preserved as text — nv-ingest pays for itself.

License

Apache 2.0 — see LICENSE.

Acknowledgments

Built on NVIDIA nv-ingest, a GPU-accelerated document extraction microservice. Thanks to the nv-ingest team — including Edward Kim, Jeremy Dyer, and Devin Robison — for building and maintaining the project.

The chat-format JSONL output is compatible with most modern fine-tuning toolchains including axolotl, unsloth, and the Hugging Face transformers library's TRL trainer.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nv-ingest Document Pipeline

What this gives you

Pipeline flow

Quick start

What you choose, what the pipeline gives you

Documentation

Project structure

Hardware requirements (TL;DR)

Why nv-ingest specifically

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
data		data
docs		docs
examples		examples
src		src
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

nv-ingest Document Pipeline

What this gives you

Pipeline flow

Quick start

What you choose, what the pipeline gives you

Documentation

Project structure

Hardware requirements (TL;DR)

Why nv-ingest specifically

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages