[Improvement] PDF parsing fails when file exceeds page limit - add automatic splitting support #1230

@earayu

Description

Background

PDF parsing fails for large documents that exceed the Mineru parser's page limit. The parser raises the error "Number of pages exceeds limit, please split the file and try again", which causes the entire parsing task to fail. This prevents users from processing legitimate large documents such as technical books.

Error occurs in:

  • File: /app/aperag/index/document_parser.py line 267
  • Method: process_document_parsing()
  • Parser: Mineru parsing engine

Example failure:

Exception: Document parsing failed for /tmp/Designing Data-Intensive Applications...pdf: 
Mineru parsing failed: Number of pages exceeds limit, please split the file and try again

Proposal

Add automatic PDF splitting functionality to handle large documents:

  1. Pre-processing check: Detect page count before parsing
  2. Auto-split: Automatically divide large PDFs into chunks within page limits
  3. Batch processing: Process chunks sequentially and merge results
  4. Progress tracking: Show splitting and parsing progress to users
  5. Configurable limits: Allow administrators to adjust page limits per environment
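
Steps 1 and 2 mostly come down to turning a page count into page ranges that each fit under the limit. A minimal sketch, assuming the page count has already been read (for example with a PDF library) and the limit is passed in; `plan_chunks` and the limit of 200 used in the usage note are hypothetical, not part of ApeRAG or Mineru:

```python
from typing import List, Tuple


def plan_chunks(total_pages: int, page_limit: int) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) page ranges.

    Each range covers at most `page_limit` pages, and together the
    ranges cover every page exactly once, in order.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if page_limit <= 0:
        raise ValueError("page_limit must be positive")
    return [
        (start, min(start + page_limit, total_pages))
        for start in range(0, total_pages, page_limit)
    ]
```

For a 450-page file and a limit of 200, `plan_chunks(450, 200)` returns `[(0, 200), (200, 400), (400, 450)]`. A file already within the limit yields a single range, so the same code path can serve small and large documents.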

This would eliminate manual file preparation while maintaining parsing reliability for large documents.
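
Batch processing, progress tracking and a configurable limit could look roughly like the sketch below. It is an illustration under stated assumptions, not ApeRAG's code: `parse_in_chunks`, the `parse_range` callback, the `PDF_PAGE_LIMIT` environment variable and the default of 200 pages are all hypothetical, and the real Mineru limit is not stated in this issue.

```python
import os
from typing import Callable, List, Optional

# Hypothetical default; the actual Mineru page limit is not given in this issue.
DEFAULT_PAGE_LIMIT = 200


def get_page_limit() -> int:
    """Read the per-environment limit from PDF_PAGE_LIMIT (assumed name)."""
    return int(os.environ.get("PDF_PAGE_LIMIT", DEFAULT_PAGE_LIMIT))


def parse_in_chunks(
    total_pages: int,
    parse_range: Callable[[int, int], str],
    page_limit: Optional[int] = None,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> str:
    """Parse pages chunk by chunk and join the results in page order.

    `parse_range(start, end)` stands in for the real parser call and
    receives a 0-based, half-open page range. `on_progress(done, total)`
    is called after each chunk finishes.
    """
    limit = page_limit if page_limit is not None else get_page_limit()
    if limit <= 0:
        raise ValueError("page limit must be positive")

    starts = list(range(0, total_pages, limit))
    parts: List[str] = []
    for index, start in enumerate(starts, start=1):
        end = min(start + limit, total_pages)
        parts.append(parse_range(start, end))
        if on_progress is not None:
            on_progress(index, len(starts))
    return "\n".join(parts)
```

Passing the parser as a callback keeps the chunking logic independent of the parsing engine, so it can be tested without a real PDF. Joining with a newline is the simplest merge; content that spans a chunk boundary (a table split across pages, for instance) would need extra handling that this sketch does not attempt.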
