Hi There,
To improve pdf_oxide further, I would like to suggest benchmarking against a range of documents with known ground truth (GT) markdown outputs, measuring both TF1 (text extracted) and SF1 (structure extracted). You can see how it's done in Kreuzberg (/tools/benchmark_harness).
Alternatives I'd measure against:
- Docling-parse. Docling has its own C++ (iirc) engine for text extraction. It doesn't handle all documents, but when it does, it has superior TF1/SF1 results.
- PDFium - this one is the most battle-tested solution. It can handle a huge variety of PDFs. Its lower-level APIs are hard to work with. I would probably focus first on measuring TF1 and what it is able to process.
- pdftotext - based on Poppler. This is hacky and GPL, but it's the fastest engine out there, and it has decent TF1.
The goal of these benchmarks would be to guide pdf_oxide to have the best TF1 / SF1 vis-a-vis the alternatives.
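To make the two metrics concrete, here is a rough sketch of how they could be scored against a GT markdown file. This is only my assumption of reasonable definitions: TF1 as a bag-of-words F1 over lowercased word tokens, and SF1 as an F1 over (element type, text) pairs for headings and list items. Kreuzberg's harness may well define them differently, and all function names below are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercased word tokens are a good enough unit for TF1.
    return re.findall(r"\w+", text.lower())


def f1(pred_items, gt_items) -> float:
    # Multiset F1: each item counts at most as often as it appears in both sides.
    pred, gt = Counter(pred_items), Counter(gt_items)
    if not pred and not gt:
        return 1.0  # both empty: treat as a perfect match
    overlap = sum((pred & gt).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gt.values())
    return 2 * precision * recall / (precision + recall)


def text_f1(extracted: str, ground_truth: str) -> float:
    # Hypothetical TF1: token-level F1 between extracted text and GT.
    return f1(tokenize(extracted), tokenize(ground_truth))


def structure_items(markdown: str) -> list[tuple[str, str]]:
    # Assumption: only headings and list items count as "structure" here.
    items = []
    for line in markdown.splitlines():
        s = line.strip()
        heading = re.match(r"(#{1,6})\s+(.*)", s)
        if heading:
            level = len(heading.group(1))
            items.append((f"h{level}", heading.group(2).strip().lower()))
        elif re.match(r"([-*+]|\d+\.)\s+", s):
            body = re.sub(r"^([-*+]|\d+\.)\s+", "", s)
            items.append(("li", body.lower()))
    return items


def structure_f1(extracted: str, ground_truth: str) -> float:
    # Hypothetical SF1: F1 over (element type, text) pairs.
    return f1(structure_items(extracted), structure_items(ground_truth))
```

Each engine (pdf_oxide, Docling-parse, PDFium, pdftotext) would be run over the same documents, its output scored with `text_f1` and `structure_f1` against the GT markdown, and the scores averaged per engine. Documents an engine fails to process should be recorded separately rather than silently skipped, so that coverage and quality can be compared side by side.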