Closed
Description
Environment
- Tesseract Version: 5.1.0 (compiled from source)
- Commit Number:
- Platform: Windows 10 32-bit (gcc 11.2.0 via MSYS2)
Current Behavior:
TessPDFRenderer uses long int to store PDF offsets, and on my platform long int is 32 bits wide. As a result, TessPDFRenderer can only create valid PDF files if all offsets are below 2 GiB. Offsets between 2 GiB and 4 GiB are written as negative numbers in the Xref table.
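The negative values can be reproduced outside Tesseract. The following is a minimal sketch, not the renderer's actual code: `NarrowTo32` and `XrefEntry32` are hypothetical helpers, and the `%010ld` format is an assumption about how a 10-digit offset field would typically be printed from a long int.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: keeps only the low 32 bits of a byte offset and
// reinterprets them as a signed value, mimicking storage in a 32-bit long int.
std::int32_t NarrowTo32(std::int64_t offset) {
  std::uint32_t bits = static_cast<std::uint32_t>(offset);  // modular, well-defined
  std::int32_t narrowed;
  std::memcpy(&narrowed, &bits, sizeof(narrowed));  // two's complement reinterpretation
  return narrowed;
}

// Hypothetical helper: prints an offset as a zero-padded 10-digit Xref field
// followed by a generation number of 0 and the in-use marker 'n'.
std::string XrefEntry32(std::int32_t offset) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%010ld 00000 n ", static_cast<long>(offset));
  return buf;
}
```

With these helpers, an object at 3 GiB (3221225472 bytes) narrows to -1073741824, so the formatted field starts with a minus sign instead of ten digits, while offsets below 2 GiB are formatted as expected.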
Expected Behavior:
Valid PDF Xref offsets can be as high as 9999999999 (just under 10 GB), because each Xref entry stores the offset as a 10-digit decimal number. In order to support PDF output of that size, TessPDFRenderer should store offsets using a data type guaranteed to be at least 64 bits wide.
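As an illustration of the expected behavior, here is a sketch of formatting an Xref entry from a 64-bit offset. `FormatXrefEntry` and `kMaxXrefOffset` are hypothetical names, not part of Tesseract, and the choice of `std::int64_t` with `PRId64` (rather than long long int with `%lld`) is an assumption made to sidestep platform-specific format specifiers.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Largest offset that fits in the 10-digit Xref offset field.
constexpr std::int64_t kMaxXrefOffset = 9999999999LL;

// Hypothetical helper: returns the 20-byte Xref entry for an in-use object,
// or an empty string if the offset cannot be represented in 10 digits.
std::string FormatXrefEntry(std::int64_t offset) {
  if (offset < 0 || offset > kMaxXrefOffset) {
    return {};
  }
  char buf[32];
  // 10-digit offset, space, 5-digit generation, space, 'n', 2-byte line ending.
  std::snprintf(buf, sizeof(buf), "%010" PRId64 " 00000 n \n", offset);
  return buf;
}
```

Under these assumptions an offset of 3 GiB is emitted as `3221225472 00000 n `, and anything above 9999999999 is rejected instead of silently producing a malformed table.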
Suggested Fix:
Make TessPDFRenderer::offsets_ a vector of long long int rather than long int:
--- a/include/tesseract/renderer.h
+++ b/include/tesseract/renderer.h
@@ -229,11 +229,11 @@ private:
// PDFs one page at a time. At the end, that metadata is
// used to make everything that isn't easily handled in a
// streaming fashion.
- long int obj_; // counter for PDF objects
- std::vector<long int> offsets_; // offset of every PDF object in bytes
- std::vector<long int> pages_; // object number for every /Page object
- std::string datadir_; // where to find the custom font
- bool textonly_; // skip images if set
+ long int obj_; // counter for PDF objects
+ std::vector<long long int> offsets_; // offset of every PDF object in bytes
+ std::vector<long int> pages_; // object number for every /Page object
+ std::string datadir_; // where to find the custom font
+ bool textonly_; // skip images if set
// Bookkeeping only. DIY = Do It Yourself.
void AppendPDFObjectDIY(size_t objectsize);
// Bookkeeping + emit data.