Benchmark TF1 and SF1 extraction against alternatives #320

@Goldziher

Hi there,

To improve pdf_oxide further, I would like to suggest benchmarking it on a range of documents with known ground-truth (GT) markdown outputs, measuring both TF1 (text extracted) and SF1 (structure extracted). You can see how it's done in Kreuzberg (/tools/benchmark_harness).
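To make the two metrics concrete, here is a minimal sketch of how they could be computed. It is not the Kreuzberg harness's implementation. It assumes TF1 is an F1 score over lowercased word tokens and SF1 is an F1 score over markdown headings and list items. The function names (`text_f1`, `structure_f1`, and the helpers) are hypothetical.

```python
import re
from collections import Counter


def f1_from_counts(predicted: Counter, expected: Counter) -> float:
    """F1 over two multisets; the overlap is the multiset intersection."""
    if not predicted and not expected:
        return 1.0  # assumption: two empty inputs count as perfect agreement
    overlap = sum((predicted & expected).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(expected.values())
    return 2 * precision * recall / (precision + recall)


def word_counts(text: str) -> Counter:
    """Lowercased word tokens; markdown punctuation such as '#' is ignored."""
    return Counter(re.findall(r"\w+", text.lower()))


def text_f1(extracted: str, ground_truth: str) -> float:
    """Assumed TF1: token-level F1 between extracted and ground-truth text."""
    return f1_from_counts(word_counts(extracted), word_counts(ground_truth))


def structure_items(markdown: str) -> Counter:
    """Collect (kind, text) pairs for headings and list items only."""
    items = Counter()
    for line in markdown.splitlines():
        stripped = line.strip()
        heading = re.match(r"(#{1,6})\s+(.*)", stripped)
        bullet = re.match(r"(?:[-*+]|\d+\.)\s+(.*)", stripped)
        if heading:
            level = len(heading.group(1))
            items[(f"h{level}", heading.group(2).strip().lower())] += 1
        elif bullet:
            items[("li", bullet.group(1).strip().lower())] += 1
    return items


def structure_f1(extracted_md: str, ground_truth_md: str) -> float:
    """Assumed SF1: F1 over the structural elements found in each document."""
    return f1_from_counts(
        structure_items(extracted_md), structure_items(ground_truth_md)
    )
```

With this definition, an extractor that recovers every word but loses a heading marker keeps a perfect TF1 while its SF1 drops, which is the distinction the two scores are meant to capture.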

Alternatives I'd measure against:

  1. Docling-parse. Docling has its own C++ (iirc) engine for text extraction. It doesn't handle all documents, but when it does, it has superior TF1/SF1 results.
  2. Pdfium. This one is the most battle-tested solution. It can handle a huge variety of PDFs. Its lower-level APIs are hard to work with. I would probably focus first on measuring TF1 and which documents it is able to process.
  3. pdftotext, based on Poppler. This is hacky and GPL, but it's the fastest engine out there, and it has decent TF1.

The goal of these benchmarks would be to guide pdf_oxide to have the best TF1 / SF1 vis-a-vis the alternatives.
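A comparison along these lines could be driven by a small loop such as the sketch below. It assumes each engine is wrapped in a callable that takes a document path and returns markdown, and that a raised exception means the engine could not process the document. `run_benchmark`, `Extractor`, and `Metric` are hypothetical names, and the metric is passed in rather than fixed.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], str]      # document path -> extracted markdown
Metric = Callable[[str, str], float]  # (extracted, ground truth) -> score


def run_benchmark(
    extractors: Dict[str, Extractor],
    corpus: List[Tuple[str, str]],    # (document path, ground-truth markdown)
    metric: Metric,
) -> Dict[str, Dict[str, float]]:
    """Score every extractor on every document in the corpus."""
    report = {}
    for name, extract in extractors.items():
        scores = []
        for path, ground_truth in corpus:
            try:
                scores.append(metric(extract(path), ground_truth))
            except Exception:
                continue  # assumption: an exception means "could not process"
        report[name] = {
            # Share of documents the engine handled without raising.
            "processed": len(scores) / len(corpus) if corpus else 0.0,
            # Mean score over the processed documents only.
            "mean_score": mean(scores) if scores else 0.0,
        }
    return report
```

Reporting the processed share separately from the mean score keeps an engine that handles few documents well distinguishable from one that handles many documents moderately well.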
