
Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler #3769

@krzysiekj94

Environment

  • Tesseract Version: 5.1.0
  • Platform: Windows 32-bit

Current Behavior:

I have the following problem:

  1. I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in a 32-bit .exe application.
  2. I prepared the following dependencies with CMake 3.23 (without SW build):
    a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37.
    b. Links to src:
  3. After generating the dependencies, I used them in a wrapper that uses the C API and built a DLL (also 32-bit), which I used in the application. The list of all dependencies is as follows:
    [image]
  4. Next, I ran an OCR test in the application with the German model (deu.traineddata) from tessdata_fast: https://github.com/tesseract-ocr/tessdata_fast.
  5. At this point, I noticed inferior recognition quality compared to Tesseract 4.1.1, which I used earlier.
    [image]
    a. Test file:
    [image: test_file]
  6. I noticed that there is also a problem with the slash character: it is changed to "jj". See:
    [image]
  7. I would like to add that I have also built Tesseract 4.1.1 with the dependencies as in point 2b. The OCR quality did not change in that build.
  8. I am using tessdata_best as a temporary workaround (and it works), but the OCR speed with this model is not satisfactory for me.

Expected Behavior:

I expect Tesseract 5.1.0 to recognize characters correctly with the tessdata_fast model, i.e. not to convert "l" or "m" to "j", or "i" to "j", for example. I would like character recognition to work similarly to Tesseract 4.1.1.
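
To make substitutions like these easier to quantify when comparing builds, one option is to diff a known-good transcription against the OCR output. The following is only a minimal sketch: the function name `char_substitutions` and the sample strings are hypothetical and not taken from the attached test file, and it assumes both texts are already available as plain Python strings.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_substitutions(expected: str, actual: str) -> Counter:
    """Count (expected_char, actual_char) pairs where the OCR output differs.

    Only 'replace' spans of equal length are counted, so insertions,
    deletions, and one-to-many changes are ignored.
    """
    counts = Counter()
    # autojunk=False keeps frequent characters from being treated as junk.
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for e, a in zip(expected[i1:i2], actual[j1:j2]):
                counts[(e, a)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample: a reference word and a misrecognized version of it.
    for pair, n in char_substitutions("Himmel", "Hijjej").most_common():
        print(pair, n)
```

Because only equal-length replacements are counted, a one-to-two change such as a slash becoming "jj" would not show up here and would need separate handling.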

Suggested Fix:

Consider updating the deu.traineddata model in the tessdata_fast repository:
https://github.com/tesseract-ocr/tessdata_fast
