Closed
Description
Environment
- Tesseract Version: 5.1.0
- Platform: Windows 32-bit
Current Behavior:
I have the following problem:
- I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in the project of a 32-bit .exe application.
- I prepared the following dependencies with CMake 3.23 (without SW build):
a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37.
b. Links to src:
- tesseract 5.1.0 (https://github.com/tesseract-ocr/tesseract/releases) – 01.03.2022
- leptonica 1.82.0 (http://www.leptonica.org/download.html) – 22.09.2021
- libtiff 4.3.0 (http://download.osgeo.org/libtiff) – 20.04.2021
- libjpeg-turbo 2.1.3 (https://github.com/libjpeg-turbo/libjpeg-turbo/releases ) – 25.02.2022
- zlib 1.2.11 (https://github.com/madler/zlib/tags) – 15.01.2017
- libpng 1.6.37 (https://github.com/glennrp/libpng/releases/tag/v1.6.37) – 14.04.2019
c. I also modified the tesseract CMakeLists.txt slightly so that it can generate .dll files; see:
CMakeLists.txt
- After building the dependencies, I used them in a wrapper that uses the C API and built a DLL (also 32-bit) that I use in the application. The list of all dependencies is as follows:

- In the next step, I ran an OCR test in the application with the German tessdata model, deu.traineddata, from tessdata_fast: https://github.com/tesseract-ocr/tessdata_fast.
- At this point, I noticed worse recognition quality than with Tesseract 4.1.1, which I used earlier.

a. test file:

- I noticed that there is also a problem with the slash character, which is changed to "jj"; see:

- I also built Tesseract 4.1.1 with the same dependencies as in point 2b. OCR quality with that build did not change.
- I use tessdata_best as a temporary workaround (and it works), but the OCR speed of this model is not satisfactory for me.
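For anyone trying to reproduce this outside the application, the same image can be run through the tesseract command-line tool once per model set. Below is a minimal sketch, not the setup used above: it assumes a `tesseract` executable is on the PATH, and the helper names, the image name `test.png`, and the directories `tessdata_fast` and `tessdata_best` are hypothetical placeholders.

```python
import subprocess


def build_ocr_command(image, out_base, tessdata_dir, lang="deu", exe="tesseract"):
    """Return the argument list for one Tesseract CLI run.

    The CLI writes the recognized text to '<out_base>.txt'.
    """
    return [exe, image, out_base, "--tessdata-dir", tessdata_dir, "-l", lang]


def run_ocr(image, out_base, tessdata_dir, lang="deu", exe="tesseract"):
    """Run Tesseract once and return the recognized text.

    Assumes the executable named by 'exe' is installed (hypothetical setup).
    """
    cmd = build_ocr_command(image, out_base, tessdata_dir, lang, exe)
    subprocess.run(cmd, check=True)
    with open(out_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `run_ocr("test.png", "out_fast", "tessdata_fast")` and `run_ocr("test.png", "out_best", "tessdata_best")` with each Tesseract version would show whether the difference follows the model or the build.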
Expected Behavior:
I expect Tesseract 5.1.0 to recognize characters correctly, i.e. not to convert "l" or "m" to "j", or "i" to "j", for example, when using the tessdata_fast model. I would like character recognition to work similarly to Tesseract 4.1.1.
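To make substitutions such as "l" to "j" easier to quantify, the text output of the two versions can be compared character by character. This is a rough sketch using Python's standard `difflib`; the function name `count_substitutions` is hypothetical, and it assumes both outputs come from the same input image.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_substitutions(reference, candidate):
    """Count character substitutions between two OCR outputs of the same text.

    Keys are (reference_text, candidate_text) pairs; values are occurrence counts.
    Insertions and deletions are ignored.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        ref_span = reference[i1:i2]
        cand_span = candidate[j1:j2]
        if len(ref_span) == len(cand_span):
            # Same length: pair the characters one to one.
            for ref_ch, cand_ch in zip(ref_span, cand_span):
                counts[(ref_ch, cand_ch)] += 1
        else:
            # Different length: record the whole spans as a single pair.
            counts[(ref_span, cand_span)] += 1
    return counts
```

For example, `count_substitutions("Halle", "Hajje")` counts the pair `("l", "j")` twice. Sorting the result with `most_common()` gives a quick list of the most frequent confusions to attach to a report.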
Suggested Fix:
Consider updating the deu.traineddata model in the repository:
https://github.com/tesseract-ocr/tessdata_fast