TessPDFRenderer uses "long int" to store PDF offsets, which is too small on some platforms #3805

@mcsjosh

Environment

  • Tesseract Version: 5.1.0 (compiled from source)
  • Commit Number:
  • Platform: Windows 10 32-bit (gcc 11.2.0 via MSYS2)

Current Behavior:

TessPDFRenderer uses long int to store PDF offsets, and on my platform long int is 32 bits wide. As a result, TessPDFRenderer can only create valid PDF files if all offsets are below 2 GiB. Offsets between 2 GiB and 4 GiB wrap around and are written to the Xref table as negative numbers.
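
To illustrate the failure mode, here is a minimal standalone sketch. It is not the Tesseract code: `FormatXrefEntry` and `StoreIn32Bits` are hypothetical helpers, and the entry layout (10-digit zero-padded offset, 5-digit generation number, `n`) is the standard Xref entry layout rather than a copy of what TessPDFRenderer emits. The 32-bit `long int` is simulated with `std::int32_t` so the effect is visible on any platform.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats one in-use Xref entry (without the line
// ending) as a 10-digit zero-padded byte offset, a 5-digit generation
// number, and the letter 'n'.
template <typename OffsetT>
std::string FormatXrefEntry(OffsetT offset) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%010lld 00000 n ",
                static_cast<long long>(offset));
  return buf;
}

// Hypothetical helper: simulates storing a byte offset in a 32-bit signed
// integer, as happens when long int is 32 bits wide. The narrowing
// conversion wraps modulo 2^32 (guaranteed since C++20; before that it is
// implementation-defined, but gcc, clang and MSVC all wrap).
std::int32_t StoreIn32Bits(std::int64_t offset) {
  return static_cast<std::int32_t>(offset);
}
```

With an object located at 3 GiB (3221225472 bytes), `FormatXrefEntry(StoreIn32Bits(3221225472LL))` yields `-1073741824 00000 n `, which is not a valid Xref entry, while `FormatXrefEntry<std::int64_t>(3221225472LL)` yields the expected `3221225472 00000 n `.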

Expected Behavior:

Valid PDF Xref offsets can be as high as 9999999999 (just under 10 GB), since each Xref entry has a 10-digit offset field. In order to support PDF output of that size, TessPDFRenderer should store offsets using a data type guaranteed to be at least 64 bits wide.
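
As a rough sketch of that requirement, the check below tests whether an integer type can represent the largest Xref offset. The names `kMaxXrefOffset`, `CanHoldMaxXrefOffset` and `FitsInXrefField` are made up for this example and do not exist in Tesseract. The `static_assert` documents that `long long` is guaranteed by the C++ standard to be at least 64 bits wide, which is why it is a safe choice here.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// Largest byte offset that fits in the 10-digit Xref offset field.
constexpr unsigned long long kMaxXrefOffset = 9999999999ULL;

// long long is at least 64 bits on every conforming implementation.
static_assert(sizeof(long long) * CHAR_BIT >= 64,
              "long long must be at least 64 bits wide");

// Hypothetical check: true if integer type T can represent every valid
// Xref offset, i.e. its maximum value is at least kMaxXrefOffset.
template <typename T>
constexpr bool CanHoldMaxXrefOffset() {
  return static_cast<unsigned long long>(std::numeric_limits<T>::max()) >=
         kMaxXrefOffset;
}

// Hypothetical check: true if a given offset is non-negative and fits in
// the 10-digit field.
constexpr bool FitsInXrefField(long long offset) {
  return offset >= 0 &&
         static_cast<unsigned long long>(offset) <= kMaxXrefOffset;
}
```

`CanHoldMaxXrefOffset<std::int32_t>()` is false, which matches the behavior described above for a 32-bit long int, while `CanHoldMaxXrefOffset<long long>()` is true on every platform.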

Suggested Fix:

Make TessPDFRenderer::offsets_ a vector of long long int rather than long int:

--- a/include/tesseract/renderer.h
+++ b/include/tesseract/renderer.h
@@ -229,11 +229,11 @@ private:
   // PDFs one page at a time. At the end, that metadata is
   // used to make everything that isn't easily handled in a
   // streaming fashion.
-  long int obj_;                  // counter for PDF objects
-  std::vector<long int> offsets_; // offset of every PDF object in bytes
-  std::vector<long int> pages_;   // object number for every /Page object
-  std::string datadir_;           // where to find the custom font
-  bool textonly_;                 // skip images if set
+  long int obj_;                       // counter for PDF objects
+  std::vector<long long int> offsets_; // offset of every PDF object in bytes
+  std::vector<long int> pages_;        // object number for every /Page object
+  std::string datadir_;                // where to find the custom font
+  bool textonly_;                      // skip images if set
   // Bookkeeping only. DIY = Do It Yourself.
   void AppendPDFObjectDIY(size_t objectsize);
   // Bookkeeping + emit data.
