Benchmark TF1 and SF1 extraction against alternatives #320

Hi there,

To improve pdf_oxide further, I would like to suggest benchmarking it on a range of documents with known ground-truth (GT) markdown outputs, measuring both TF1 (text extracted) and SF1 (structure extracted). You can see how it's done in Kreuzberg (`/tools/benchmark_harness`).
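
To make the two metrics concrete, here is a rough sketch of how they could be scored against a GT markdown file. It is not Kreuzberg's implementation. It assumes TF1 is an F1 over lowercased word tokens and SF1 is an F1 over structural items (headings and list items only), and the names `tf1`, `sf1`, `text_tokens`, and `structure_items` are made up for illustration.

```python
import re
from collections import Counter


def f1(pred: Counter, gold: Counter) -> float:
    """Multiset F1: harmonic mean of precision and recall over counted items."""
    if not pred and not gold:
        return 1.0  # nothing expected, nothing produced
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def text_tokens(markdown: str) -> Counter:
    # Assumption: text is compared as a bag of lowercased word tokens,
    # so markdown markup characters such as '#' or '-' are ignored.
    return Counter(re.findall(r"\w+", markdown.lower()))


def structure_items(markdown: str) -> Counter:
    # Assumption: "structure" means headings (with level) and list items.
    # A real harness would likely also cover tables, code blocks, etc.
    items = Counter()
    for line in markdown.splitlines():
        stripped = line.strip()
        heading = re.match(r"(#{1,6})\s+(.*)", stripped)
        bullet = re.match(r"(?:[-*+]|\d+\.)\s+(.*)", stripped)
        if heading:
            level, title = len(heading.group(1)), heading.group(2)
            items[("heading", level, title.strip().lower())] += 1
        elif bullet:
            items[("list_item", 0, bullet.group(1).strip().lower())] += 1
    return items


def tf1(extracted: str, ground_truth: str) -> float:
    return f1(text_tokens(extracted), text_tokens(ground_truth))


def sf1(extracted: str, ground_truth: str) -> float:
    return f1(structure_items(extracted), structure_items(ground_truth))
```

With this scoring, an extractor that recovers every word but flattens the headings and lists gets a perfect TF1 and a zero SF1, which is why the two numbers are worth tracking separately.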

Alternatives I'd measure against:

1. Docling-parse. Docling have their own C++ (IIRC) engine for text extraction. It doesn't handle all documents, but when it does it has superior TF1/SF1 results.
2. PDFium - this one is the most battle-tested solution. It can handle a huge variety of PDFs. Its lower-level APIs are hard to work with. I would probably focus first on measuring TF1 and what it is able to process.
3. pdftotext - based on Poppler. This is hacky and GPL. But it's the fastest engine out there, and it has decent TF1.

The goal of these benchmarks would be to guide pdf_oxide toward the best TF1 / SF1 vis-a-vis the alternatives.
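
A comparison loop along these lines might look like the sketch below. Everything in it is hypothetical: `run_benchmark`, the `Extractor` and `Scorer` types, and the result keys are illustrative names, and each engine would need its own wrapper that takes a file path and returns text. Since some engines cannot process every document, the sketch reports the failure count and two means, one over processed documents only and one that counts failures as zero.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: an extractor maps a PDF path to extracted text
# (raising on failure); a scorer maps (extracted, ground_truth) to [0, 1].
Extractor = Callable[[str], str]
Scorer = Callable[[str, str], float]


def run_benchmark(
    extractors: Dict[str, Extractor],
    corpus: List[Tuple[str, str]],  # (pdf_path, ground_truth_markdown)
    scorer: Scorer,
) -> Dict[str, Dict[str, float]]:
    results = {}
    for name, extract in extractors.items():
        scores = []
        failed = 0
        for path, ground_truth in corpus:
            try:
                output = extract(path)
            except Exception:
                # The engine could not process this document at all.
                failed += 1
                continue
            scores.append(scorer(output, ground_truth))
        total = len(scores) + failed
        results[name] = {
            "processed": len(scores),
            "failed": failed,
            # Quality on the documents the engine could handle.
            "mean_processed": mean(scores) if scores else 0.0,
            # Quality over the whole corpus, counting failures as 0.
            "mean_overall": sum(scores) / total if total else 0.0,
        }
    return results
```

Reporting both means keeps an engine that handles fewer documents but scores higher on them from being directly ranked against one that processes everything with lower scores.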
