Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler #3769

### Environment

* **Tesseract Version**: 5.1.0
* **Platform**: Windows 32-bit 

### Current Behavior:
I have the following problem:
1. I prepared a custom build of Tesseract 5.1.0 to generate DLLs, which I then use in a project for a 32-bit .exe application.
2. I built the following dependencies with CMake 3.23 (without the SW build system):
a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37.
b. Links to the sources:
- tesseract 5.1.0 (https://github.com/tesseract-ocr/tesseract/releases) – 01.03.2022
- leptonica 1.82.0 (http://www.leptonica.org/download.html) – 22.09.2021
- libtiff 4.3.0 (http://download.osgeo.org/libtiff) – 20.04.2021
- libjpeg-turbo 2.1.3 (https://github.com/libjpeg-turbo/libjpeg-turbo/releases) – 25.02.2022
- zlib 1.2.11 (https://github.com/madler/zlib/tags) – 15.01.2017
- libpng 1.6.37 (https://github.com/glennrp/libpng/releases/tag/v1.6.37) – 14.04.2019
c. I also modified the Tesseract CMakeLists.txt slightly so that it generates .dll files; see:
[CMakeLists.txt](https://github.com/tesseract-ocr/tesseract/files/8284026/CMakeLists.txt)
3. After building the dependencies, I used them in a wrapper around the C API and built it as a DLL (also 32-bit), which I use in the application. The full list of dependencies is as follows:
![image](https://user-images.githubusercontent.com/12548678/158794280-5d048760-5419-49e3-bd8a-110fa0dc0640.png)
4. Next, I ran an OCR test in the application with the German model (deu.traineddata) from tessdata_fast: https://github.com/tesseract-ocr/tessdata_fast.
5. At this point, I noticed worse recognition quality than with Tesseract 4.1.1, which I used previously.
![image](https://user-images.githubusercontent.com/12548678/158797464-585c773a-f264-467b-9e56-a7dfefd25d3a.png)
a. test file: 
![test_file](https://user-images.githubusercontent.com/12548678/158796308-0e0e8e57-ad24-4eb5-b70a-0c6b99722663.png)
6. I also noticed a problem with the slash character: for example, it is converted to "jj"; see:
![image](https://user-images.githubusercontent.com/12548678/158798504-3bae659a-d0cd-434b-83ac-2de69ddb87dd.png)
7. I also built Tesseract 4.1.1 with the same dependencies as in point 2. With that build, the OCR quality of 4.1.1 did not change.
8. I use tessdata_best as a temporary workaround (and it works), but the OCR speed with this model is not satisfactory for me.
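
To help separate a model problem from a problem in the custom 32-bit build, the same image could also be run through the stock `tesseract` command-line tool with each model directory. The following is only a sketch: the directory names and the image file name in the usage note are hypothetical, and it assumes a `tesseract` executable is on `PATH`. It uses the standard `-l` and `--tessdata-dir` options and `stdout` as the output base.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_command(image: Path, tessdata_dir: Path, lang: str = "deu") -> List[str]:
    """Build a tesseract CLI invocation that prints the recognized text to stdout."""
    return [
        "tesseract",
        str(image),
        "stdout",                       # write the result to standard output
        "-l", lang,                     # language model, here German
        "--tessdata-dir", str(tessdata_dir),
    ]


def run_ocr(image: Path, tessdata_dir: Path, lang: str = "deu") -> str:
    """Run tesseract and return its text output (assumes tesseract is installed)."""
    result = subprocess.run(
        tesseract_command(image, tessdata_dir, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

For example, calling `run_ocr(Path("test_file.png"), Path("tessdata_fast"))` and `run_ocr(Path("test_file.png"), Path("tessdata_best"))` (both paths are placeholders) would give two text outputs from the same binary that can be compared directly.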

### Expected Behavior:
I expect Tesseract 5.1.0 to recognize characters correctly with the tessdata_fast model, i.e. not to convert, for example, "l" or "m" to "j", or "i" to "j". I would like character recognition to work as it does in Tesseract 4.1.1.
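
The substitutions described above could be quantified by diffing the text produced by 4.1.1 against the text produced by 5.1.0. This is a minimal sketch using Python's standard `difflib`; the function name `substitution_counts` is an illustrative assumption, not part of Tesseract.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(expected: str, actual: str) -> Counter:
    """Count replacements between a reference text and an OCR result.

    Equal-length replaced runs are counted character by character
    (e.g. "ll" -> "jj" counts ("l", "j") twice). Runs of different
    length are counted once as a whole (e.g. ("/", "jj")).
    """
    counts = Counter()
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # ignore matches, pure insertions and pure deletions
        want, got = expected[i1:i2], actual[j1:j2]
        if len(want) == len(got):
            counts.update(zip(want, got))
        else:
            counts[(want, got)] += 1
    return counts
```

Passing the 4.1.1 output as `expected` and the 5.1.0 output as `actual`, then calling `.most_common()` on the result, would list the most frequent confusions first.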

### Suggested Fix:
Consider updating the deu.traineddata model in the tessdata_fast repository:
https://github.com/tesseract-ocr/tessdata_fast
