Background
PDF parsing fails for documents that exceed the Mineru parser's page limit. The parser raises the error "Number of pages exceeds limit, please split the file and try again", and the entire parsing task fails. As a result, users cannot process legitimate large documents such as technical books.
Error occurs in:
- File: /app/aperag/index/document_parser.py, line 267
- Method: process_document_parsing()
- Parser: Mineru parsing engine
Example failure:
Exception: Document parsing failed for /tmp/Designing Data-Intensive Applications...pdf:
Mineru parsing failed: Number of pages exceeds limit, please split the file and try again
Proposal
Add automatic PDF splitting functionality to handle large documents:
- Pre-processing check: Detect page count before parsing
- Auto-split: Automatically divide large PDFs into chunks within page limits
- Batch processing: Process chunks sequentially and merge results
- Progress tracking: Show splitting and parsing progress to users
- Configurable limits: Allow administrators to adjust page limits per environment
This would eliminate manual file preparation while maintaining parsing reliability for large documents.
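
The pre-processing check, auto-split, and batch steps could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `get_page_limit`, `plan_chunks`, `parse_in_chunks`, the `MINERU_MAX_PAGES` environment variable, and the default of 100 pages are all assumptions made for the example (the actual Mineru limit is not stated in this issue). Reading the page count and writing each chunk to disk would need a PDF library such as pypdf, so that part is left to a caller-supplied function.

```python
import os
from typing import Callable, List, Optional, Tuple

# Hypothetical default; the real Mineru page limit is not given in this issue.
DEFAULT_PAGE_LIMIT = 100


def get_page_limit() -> int:
    """Read the per-environment page limit from a hypothetical env var."""
    return int(os.environ.get("MINERU_MAX_PAGES", DEFAULT_PAGE_LIMIT))


def plan_chunks(total_pages: int, page_limit: int) -> List[Tuple[int, int]]:
    """Return zero-based, half-open (start, end) page ranges.

    Every range covers at most page_limit pages, and together the
    ranges cover pages 0..total_pages-1 exactly once, in order.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if page_limit < 1:
        raise ValueError("page_limit must be at least 1")
    return [
        (start, min(start + page_limit, total_pages))
        for start in range(0, total_pages, page_limit)
    ]


def parse_in_chunks(
    total_pages: int,
    page_limit: int,
    parse_range: Callable[[int, int], str],
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[str]:
    """Parse each chunk sequentially and collect results in page order.

    parse_range(start, end) is supplied by the caller; it would extract
    the given pages into a temporary PDF and hand that file to the parser.
    on_progress(done, total) is called after each chunk completes.
    """
    chunks = plan_chunks(total_pages, page_limit)
    results: List[str] = []
    for index, (start, end) in enumerate(chunks, start=1):
        results.append(parse_range(start, end))
        if on_progress is not None:
            on_progress(index, len(chunks))
    return results
```

With these assumptions, a 250-page document and a limit of 100 would be planned as the ranges (0, 100), (100, 200), and (200, 250). A document at or under the limit yields a single range, so the same code path can serve small and large files. How the per-chunk results are merged (for example, fixing up page numbers and cross-chunk references) depends on the parser's output format and is not covered here.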