Commit b2832c5

Merge pull request #9 from Shreeshrii/patch-2
Add sections, better formatting
2 parents 8203e55 + 86db1f4 commit b2832c5

1 file changed

+37
-16
lines changed

README.md

Lines changed: 37 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -1,35 +1,56 @@
11
# tessdata_fast – Fast integer versions of trained models
22

3-
This repository contains fast integer versions of trained models for the
4-
[Tesseract Open Source OCR Engine](https://github.com/tesseract-ocr/tesseract).
3+
This repository contains fast integer versions of trained models for the [Tesseract Open Source OCR Engine](https://github.com/tesseract-ocr/tesseract).
54

6-
Most users will want to use these traineddata files and this is what is planned to be shipped as part of Linux distributions. Fine tuning/incremental training will **NOT** be possible from these `fast` models, as they are 8-bit integer. It will be possible to convert a tuned `best` to integer to make it faster, but some of the speed in `fast` will be from the smaller model.
5+
- Most users will want to use these traineddata files to do OCR and these will be shipped as part of Linux distributions.
6+
- Fine tuning/incremental training will **NOT** be possible from these `fast` models, as they are 8-bit integer.
7+
- It will be possible to convert a tuned `best` to integer to make it faster, but some of the speed in `fast` will be from the smaller model.
8+
- When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy `tesseract` engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
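The engine-mode restriction above can be sketched as a small command builder. This is a minimal illustration rather than part of the repository: the function name `build_tesseract_command` and the example paths are hypothetical, while `--oem`, `-l`, and `--tessdata-dir` are standard Tesseract command-line options.

```python
# Sketch only: build a tesseract command line that is safe to use with
# LSTM-only traineddata such as the files in this repository.
# The function name and the example paths are hypothetical.

LSTM_ONLY_OEM = 1           # --oem 1: LSTM engine only
UNSUPPORTED_OEMS = {0, 2}   # 0: legacy only, 2: legacy + LSTM


def build_tesseract_command(image_path, output_base, lang, tessdata_dir,
                            oem=LSTM_ONLY_OEM):
    """Return an argument list for the tesseract CLI, rejecting legacy modes."""
    if oem in UNSUPPORTED_OEMS:
        raise ValueError(
            f"--oem {oem} needs the legacy engine, which these models lack"
        )
    return [
        "tesseract", image_path, output_base,
        "--tessdata-dir", tessdata_dir,
        "--oem", str(oem),
        "-l", lang,
    ]


if __name__ == "__main__":
    print(" ".join(build_tesseract_command(
        "page.png", "out", "eng", "./tessdata_fast")))
```

The resulting list can be passed to `subprocess.run` when Tesseract is installed and the traineddata directory exists.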
79

8-
When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy `tesseract` engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
10+
## Two types of models
911

10-
Initial capitals indicate the one model for all languages in that script.
12+
The repository contains two types of models:
13+
- those for a single language and
14+
- those for a single script supporting one or more languages.
1115

12-
**Latin** is all latin-based languages,
13-
except vie, which has its own **Vietnamese**.
16+
Most of the script models include English training data as well as the script, but not **Cyrillic**, as that would have a major ambiguity problem.
1417

15-
**Devanagari** is hin+san+mar+nep+eng
18+
On Linux, the language-based traineddata packages are named `tesseract-ocr-LANG`, where LANG is the three-letter language code, e.g. `tesseract-ocr-eng` (English language), `tesseract-ocr-hin` (Hindi language), etc.
1619

17-
**Fraktur** is basically a combination of all the latin-based languages that have an 'old' variant.
20+
On Linux, the script-based traineddata packages are named `tesseract-ocr-script-SCRIPT`, where SCRIPT is the four-letter script code, e.g. `tesseract-ocr-script-latn` (Latin script), `tesseract-ocr-script-deva` (Devanagari script), etc.
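The two naming patterns can be written out as a short helper. This is a sketch that applies the patterns exactly as stated above; the function name `package_name` is hypothetical, and codes that are not three or four letters long (such as `jpn_vert`) are deliberately rejected because the text does not say how they are packaged.

```python
# Sketch only: derive a Linux package name from a language or script code,
# following the two patterns described in the README text.
# `package_name` is a hypothetical helper, not an existing tool.

def package_name(code):
    """Map a 3-letter language code or 4-letter script code to a package name."""
    if not code.isalpha():
        raise ValueError(f"unsupported code: {code!r}")
    code = code.lower()
    if len(code) == 3:   # language code, e.g. "eng", "hin"
        return f"tesseract-ocr-{code}"
    if len(code) == 4:   # script code, e.g. "latn", "deva"
        return f"tesseract-ocr-script-{code}"
    raise ValueError(f"unsupported code: {code!r}")


if __name__ == "__main__":
    for code in ("eng", "hin", "latn", "deva"):
        print(code, "->", package_name(code))
```

Actual package names can differ between distributions, so the output is best checked against the package index of the distribution in use.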
1821

19-
Most of the script models include English training data as well as the script, but not for **Cyrillic**, as that would have a major ambiguity problem.
22+
### Data files for a particular script
2023

21-
For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines.
24+
Initial capitals in the filename indicate a single model covering all languages in that script.
2225

23-
For Latin, I have ~4500 fonts to train with. For Devanagari ~50, and for Kannada 15. With a theory that poor accuracy on test data and over-fitting on training data was caused by the lack of fonts, I tried mixing the training data with English, thinking that English is often mixed in anyway, and some of the font diversity might generalize to the other script. The overall effect was slightly positive, so I left it that way.
26+
- **Latin** covers all Latin-based languages except `vie`.
27+
- **Vietnamese** is for the Latin-based Vietnamese language.
28+
- **Fraktur** is basically a combination of all the latin-based languages that have an 'old' variant.
29+
- **Devanagari** is for hin+san+mar+nep+eng.
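The filename convention above (initial capital means script model) is easy to check programmatically. The following is a minimal sketch, assuming the convention holds for every file; the function name `model_kind` is hypothetical.

```python
# Sketch only: classify a traineddata file as a script model or a
# single-language model using the initial-capital filename convention.
from pathlib import PurePath


def model_kind(filename):
    """Return "script" for names like Latin.traineddata, else "language"."""
    stem = PurePath(filename).stem   # "Latin.traineddata" -> "Latin"
    if not stem:
        raise ValueError("empty filename")
    return "script" if stem[0].isupper() else "language"


if __name__ == "__main__":
    for name in ("Latin.traineddata", "Devanagari.traineddata",
                 "jpn.traineddata", "jpn_vert.traineddata"):
        print(name, "->", model_kind(name))
```

Only the first character of the base name is inspected, so the check works the same whether or not the file sits in a subdirectory.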
2430

25-
'jpn' contains whatever appears on the www that is labelled as the language, trained only with fonts that can render Japanese. As with most of the other Script traineddatas, **Japanese** contains all the languages that use that script (in this case just the one) PLUS English.The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization to 4500 English training fonts will also apply to the other script that has a lot less.
31+
### LSTM training details for different languages and scripts
2632

27-
'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).
33+
For Latin-based languages, the existing model data has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, fewer fonts are available, but the models have still been trained on a similar number of textlines. For example:
34+
35+
- Latin ~4500 fonts
36+
- Devanagari ~50 fonts
37+
- Kannada 15 fonts
38+
39+
On the theory that poor accuracy on test data and over-fitting on training data were caused by the lack of fonts, the training data has been mixed with English, so that some of the font diversity might generalize to the other script. The overall effect was slightly positive, hence the script models also include English.
40+
41+
### Example - jpn and Japanese
42+
43+
**'jpn'** contains whatever appears on the web that is labelled as the language, trained only with fonts that can render Japanese.
44+
45+
**Japanese** contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization from the 4500 English training fonts will also apply to the other script, which has far fewer fonts.
46+
47+
**'jpn_vert'** is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).
2848

2949
'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
3050

31-
See the [Tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files)
32-
for additional information.
51+
--------------------------------
52+
53+
See the [Tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files) for additional information.
3354

3455
All data in the repository are licensed under the
3556
Apache-2.0 License, see file [COPYING](COPYING).
