Stefan Weil edited this page Feb 10, 2020 · 5 revisions

Tesseract standard Fraktur Models

Tesseract comes with several models which are specialized for Fraktur texts.

Those models also support historic and modern Antiqua scripts.

Some other models are primarily made for modern Antiqua scripts, but can also recognize Fraktur and historic Antiqua to a very limited extent.

None of the above models is really good as a general model for Fraktur and historic Antiqua texts because each of them has specific problems.

dan_frak, deu_frak and slk_frak are language-specific, so they only support a limited set of characters. They can only be used with the old legacy recognizer, not with the newer LSTM (neural network) recognizer. Typically (but not always!) the results from the legacy recognizer are worse than those from the LSTM recognizer.

frk supports the German character set, but important characters such as § are missing and will never be recognized. In addition, some ligatures like ch and ck were trained incorrectly and will therefore be recognized as < and >. script/Fraktur supports a larger international character set, but otherwise has the same issues as frk.
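
The pairing of models and recognizers described above can be sketched in Python as a small helper that builds a tesseract command line. This is only an illustration: the helper name and the file names are hypothetical, and it assumes that frk and script/Fraktur are run with the LSTM recognizer (--oem 1) while the *_frak models need the legacy recognizer (--oem 0).

```python
import shlex

# Models named in the text above, grouped by the recognizer they are used with.
# Assumption: frk and script/Fraktur are run with the LSTM recognizer.
LEGACY_ONLY = {"dan_frak", "deu_frak", "slk_frak"}
LSTM_MODELS = {"frk", "script/Fraktur"}


def tesseract_command(image, output_base, model):
    """Build (but do not run) a tesseract command for a standard Fraktur model."""
    if model in LEGACY_ONLY:
        oem = "0"  # legacy recognizer only
    elif model in LSTM_MODELS:
        oem = "1"  # LSTM recognizer only
    else:
        raise ValueError(f"unknown model: {model}")
    return ["tesseract", image, output_base, "-l", model, "--oem", oem]


# Hypothetical input image and output base name.
print(shlex.join(tesseract_command("page.png", "page", "deu_frak")))
```

The resulting list can be passed to subprocess.run once the corresponding traineddata file is installed.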

To summarize, other models are needed for Fraktur and historic Antiqua. Such models can be trained either from scratch or based on one of the existing standard models.

Training Fraktur

This is a collection of sources for training OCR models which can be used to recognize Fraktur. A more complete list which is not restricted to Fraktur only can be found at https://github.com/cneud/ocr-gt.

Austrian Newspapers

Austrian Newspapers is a ground truth data set created from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See https://github.com/tesseract-ocr/tesstrain/wiki/AustrianNewspapers for more information.

GT4HistOCR

GT4HistOCR is ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for more information.

Ocropus Fraktur

https://github.com/jze/ocropus-model_fraktur provides ground truth data, 3852 lines for training and 414 lines for testing, both of good quality.

Open issues

  • Some umlauts might be replaced by aͤ, oͤ, uͤ.
  • It uses the minus sign instead of ⸗.
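
When this data is combined with ground truth that follows a different transcription convention, one option is to normalize all lines to a single form before training. The following is a minimal sketch, assuming the target convention is modern umlauts and the ASCII hyphen-minus; the function name is hypothetical, and the opposite direction (towards aͤ and ⸗) may be preferable depending on the model being trained.

```python
# Combining Latin small letter e (U+0364), printed above a, o, u in historic texts.
SMALL_E_ABOVE = "\u0364"
# Double oblique hyphen (U+2E17), the Fraktur hyphen.
DOUBLE_OBLIQUE_HYPHEN = "\u2e17"

# Base letter -> precomposed modern umlaut.
MODERN_UMLAUTS = {
    "a": "\u00e4", "o": "\u00f6", "u": "\u00fc",
    "A": "\u00c4", "O": "\u00d6", "U": "\u00dc",
}


def to_modern(line):
    """Map historic umlaut and hyphen forms in one text line to modern ones."""
    for base, umlaut in MODERN_UMLAUTS.items():
        line = line.replace(base + SMALL_E_ABOVE, umlaut)
    return line.replace(DOUBLE_OBLIQUE_HYPHEN, "-")
```

Applying the same function to every ground truth line keeps the character set consistent across data sets.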
