Stefan Weil edited this page Apr 30, 2023 · 14 revisions

Training Fraktur with Austrian Newspapers

About the data set

Austrian Newspapers is a ground truth data set created with Transkribus from Austrian newspapers by the Library Labs of the Austrian National Library (Österreichische Nationalbibliothek). See this publication for details:

Günter Mühlberger, & Günter Hackl. (2019). NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3387369

The transcription uses level 1 according to the OCR-D transcription guidelines (German).

The dataset is also available on https://github.com/UB-Mannheim/AustrianNewspapers with some fixes and enhancements which upgrade the transcription to level 2.

Preparing the data for training

The files needed for training with tesstrain can be extracted from the data set with page2img.py, which is part of OCR-D/format-converters.

# Get the data set.
wget https://zenodo.org/record/3387369/files/TrainingSet_ONB_Newseye_GT_M1+.zip
wget https://zenodo.org/record/3387369/files/ValidationSet_ONB_Newseye_GT_M1+.zip

# Unzip the data.
unzip TrainingSet_ONB_Newseye_GT_M1+.zip
unzip ValidationSet_ONB_Newseye_GT_M1+.zip

# Optionally remove the zip files which are no longer needed.
rm TrainingSet_ONB_Newseye_GT_M1+.zip ValidationSet_ONB_Newseye_GT_M1+.zip

# Get the data needed for tesstrain.
# PATH stands for the directory which contains the format-converters checkout.
for set in TrainingSet_ONB_Newseye_GT_M1+ ValidationSet_ONB_Newseye_GT_M1+; do
  cd "$set"
  for xml in *.xml; do
    # Each page gets its own output directory below gt/.
    dir="gt/$(basename "$xml" .xml)"
    echo "$xml"
    mkdir -p "$dir"
    python3 PATH/format-converters/page2img.py --text --out-dir "$dir" --page-version 2013-07-15 "$xml"
  done
  cd ..
done

# Fix names of text files (skips files which already end in .gt.txt).
find . -name "*.txt" ! -name "*.gt.txt" -exec sh -c 'mv -v "$1" "${1%.txt}.gt.txt"' _ {} \;

# ...
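
Before training, it can help to confirm that every transcription has a matching line image. The following is only a sketch and not part of the original workflow. The function name list_unpaired_gt is hypothetical, and the code assumes that each line image sits next to its text file and has the same name with .png or .tif in place of .gt.txt, which may not match the actual output of page2img.py.

```shell
# Hypothetical helper: print every *.gt.txt below a directory which has
# no line image with the same stem next to it.
# Assumption: line images are named <stem>.png or <stem>.tif.
list_unpaired_gt() {
  find "$1" -type f -name '*.gt.txt' | while IFS= read -r txt; do
    stem=${txt%.gt.txt}
    if [ ! -e "$stem.png" ] && [ ! -e "$stem.tif" ]; then
      printf '%s\n' "$txt"
    fi
  done
}
```

It could be called as list_unpaired_gt TrainingSet_ONB_Newseye_GT_M1+/gt. An empty result means that every text file has a partner under the stated naming assumption.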

Known problems

PAGE files

The PAGE files were exported from Transkribus, which does not generate valid XML. Tools which require valid XML, for example the PRIMA PageViewer, therefore cannot process them directly. See this report: https://github.com/Transkribus/TranskribusCore/issues/38.

Image files

The data set includes a mixture of image files in JPEG and TIFF format in various resolutions:

TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_18950706_8.jpg:   JPEG image data, JFIF standard 1.02, resolution (DPI), density 300x300, segment length 16, baseline, precision 8, 2479x3508, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_001.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 458x458, segment length 16, baseline, precision 8, 3598x5367, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_005.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 460x460, segment length 16, baseline, precision 8, 3607x5387, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_008.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 332x332, segment length 16, baseline, precision 8, 2746x3589, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_013.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 334x334, segment length 16, baseline, precision 8, 2763x3576, components 3

The JPEG files are surprisingly small for the densities they declare, so the real resolution might be only 150 x 150 DPI.

The TIFF files are even smaller. While the JPEG files use 8-bit grayscale, the TIFF files are binarized images.
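
The listing above has the format printed by the file utility (an assumption, the tool is not named here). To get an overview of the declared densities across all images, such lines can be summarized with a small filter. This is a sketch only, and the function name density_summary is hypothetical.

```shell
# Hypothetical helper: read `file`-style lines on stdin and count how
# often each declared "density WxH" value occurs, most frequent first.
density_summary() {
  sed -n 's/.*density \([0-9]*x[0-9]*\).*/\1/p' | sort | uniq -c | sort -rn
}
```

A possible call is file TrainingSet_ONB_Newseye_GT_M1+/*.jpg | density_summary. Note that this only reports the density stored in the file header, not the real scan resolution.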

Line segmentation

A significant number of line images have problems.

Some images are too short. To work around this, the baseline length can be used.

Other line images show large parts of neighboring lines or even two complete lines.

Transcriptions

The ground truth files contain 80 Latin characters (not only from the German alphabet) and 67 other characters (including some rare ones), but important characters like ² (superscript two) or ſ (long s) are missing.
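
Whether a given character occurs in the extracted transcriptions can be checked with standard tools. The following sketch assumes that the ground truth text files end in .gt.txt and are encoded as UTF-8. The function name report_chars is hypothetical.

```shell
# Hypothetical helper: for each character given after the directory,
# report whether any *.gt.txt file below that directory contains it.
report_chars() {
  dir=$1
  shift
  for ch in "$@"; do
    # -F searches for the character as a fixed string, not as a regex.
    if find "$dir" -type f -name '*.gt.txt' -exec cat {} + | grep -F -- "$ch" >/dev/null; then
      printf '%s present\n' "$ch"
    else
      printf '%s missing\n' "$ch"
    fi
  done
}
```

For example, report_chars TrainingSet_ONB_Newseye_GT_M1+/gt 'ſ' '²' prints one line per character.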

The transcription uses placeholders for characters which could not be identified, for example bi###g (for billig). This affects about 257 lines.
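
Lines with placeholders can be counted, for example to exclude them from training. This is a minimal sketch which assumes that the character # occurs in the transcriptions only as a placeholder and that the text files end in .gt.txt. The function name count_placeholder_lines is hypothetical.

```shell
# Hypothetical helper: print the number of text lines in *.gt.txt files
# below a directory which contain at least one '#'.
# Assumption: '#' is only used as a placeholder for unreadable characters.
count_placeholder_lines() {
  find "$1" -type f -name '*.gt.txt' -exec grep -hF -- '#' {} + | wc -l | tr -d ' '
}
```

Replacing grep -hF with grep -lF lists the names of the affected files instead of the matching lines.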

As expected for human-made transcriptions, there is also a certain error rate in the ground truth. Typical Fraktur confusions like f / long s or u / n occur often. Another example is Sparberde (should be Sparherde).

Here are more examples:

diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
index 5a7a60a4..ee390841 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
@@ -1 +1 @@
-— Wir haben gestern non dem merkwürdigen Tadel
+— Wir haben gestern von dem merkwürdigen Tadel
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
index 7c4e23d4..d4e2c3ec 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
@@ -1 +1 @@
-erzahlt, den der Colberger Bürgermeister van dem vorgesezten
+erzählt, den der Colberger Bürgermeister von dem vorgesetzten
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
index 0f94eb05..f5bc489e 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
@@ -1 +1 @@
-Stadt aufs tieffte. In der Stadtverordnetensitzung vom 1. Juli
+Stadt aufs tiefste. In der Stadtverordnetensitzung vom 1. Juli
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
index 6a55c966..09cec4ff 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
@@ -1 +1 @@
-Volksversammlung gestattet wurden, die von den Freisinnigen
+Volksversammlung gestattet wvrden, die von den Freisinnigen
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
index ec589592..19edbefe 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
@@ -1 +1 @@
-als Konservativer anftretender Schuhmachenneister. Es ist in
+als Konservativer auftretender Schuhmachermeister. Es ist in
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
index dac61928..8271d184 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
@@ -1 +1 @@
-die Versammlung im Strandschloffe unangenehm gewesen ist.
+die Versammlung im Strandschlosse unangenehm gewesen ist.

Training results

The latest Tesseract models were trained from the enhanced ONB ground truth dataset.
