Stefan Weil edited this page Oct 22, 2019 · 14 revisions

Training Fraktur with Austrian Newspapers

About the data set

Austrian Newspapers is a ground truth data set created with Transkribus from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See this publication for details:

Guenter Muehlberger, & Guenter Hackl. (2019). NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3387369

The transcription uses level 1 according to the OCR-D transcription guidelines (German).

Preparing the data for training

The files needed for training with tesstrain can be extracted from the data set with page2img.py, which is part of OCR-D/format-converters. The current code needs a patch before it can handle PAGE files from Transkribus such as those in this data set.

# Get the data set.
wget https://zenodo.org/record/3387369/files/TrainingSet_ONB_Newseye_GT_M1+.zip
wget https://zenodo.org/record/3387369/files/ValidationSet_ONB_Newseye_GT_M1+.zip

# Unzip the data.
unzip TrainingSet_ONB_Newseye_GT_M1+.zip
unzip ValidationSet_ONB_Newseye_GT_M1+.zip

# Optionally remove the zip files which are no longer needed.
rm TrainingSet_ONB_Newseye_GT_M1+.zip ValidationSet_ONB_Newseye_GT_M1+.zip

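# Get the data needed for tesstrain.
# The directory names keep the "+" of the zip file names (see the file listing under "Image files").
# Replace PATH with the directory that contains the format-converters checkout.
for set in TrainingSet_ONB_Newseye_GT_M1+ ValidationSet_ONB_Newseye_GT_M1+; do
  # Skip a set whose directory is missing instead of converting files in the wrong place.
  cd "$set" || continue
  for xml in *.xml; do
    dir="gt/$(basename "$xml" .xml)"
    echo "$xml"
    mkdir -p "$dir"
    python PATH/format-converters/page2img.py --text --out-dir "$dir" --page-version 2013-07-15 "$xml"
  done
  cd ..
done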

# Fix names of text files.
# tbd.
# ...
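
Since the renaming step is still open, it can help to look at what the loop above actually produced. The following is a minimal sketch that counts the generated files by file name extension. The function name count_by_extension is made up for this example, and nothing is assumed about the names that page2img.py writes.

```shell
# Hypothetical helper: count the files below a directory, grouped by extension.
# Files without a dot in their name are ignored.
count_by_extension() {
  find "$1" -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c
}

# Usage (run inside one of the unpacked set directories):
#   count_by_extension gt
```

If the counts for images and text files differ, some lines were probably not extracted completely.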

Known problems

PAGE files

The PAGE files were exported from Transkribus, which does not generate valid XML. Tools that require valid XML, for example the PRIMA PageViewer, therefore cannot process them directly.
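
To see which files are affected, one option is to run them through an XML parser. The sketch below assumes that xmllint (from libxml2) is installed and only tests whether each file can be parsed at all. It is not stated here whether the files fail at that level or only violate the PAGE schema. For the latter, xmllint would need its --schema option together with the matching PAGE XSD. The function name list_unparsable_xml is made up for this example.

```shell
# Hypothetical helper: print the XML files in a directory that xmllint cannot parse.
list_unparsable_xml() {
  for xml in "$1"/*.xml; do
    # Skip the unexpanded pattern if the directory contains no XML files.
    [ -e "$xml" ] || continue
    # --noout suppresses the parsed document; the exit status reports errors.
    xmllint --noout "$xml" 2>/dev/null || echo "$xml"
  done
}

# Usage:
#   list_unparsable_xml TrainingSet_ONB_Newseye_GT_M1+
```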

Image files

The data set includes a mixture of image files in JPEG and TIFF format in various resolutions:

TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_18950706_8.jpg:   JPEG image data, JFIF standard 1.02, resolution (DPI), density 300x300, segment length 16, baseline, precision 8, 2479x3508, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_001.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 458x458, segment length 16, baseline, precision 8, 3598x5367, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_005.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 460x460, segment length 16, baseline, precision 8, 3607x5387, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_008.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 332x332, segment length 16, baseline, precision 8, 2746x3589, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_013.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 334x334, segment length 16, baseline, precision 8, 2763x3576, components 3

The JPEG files are surprisingly small, so the real resolution might be 150 x 150 DPI.
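
A simple plausibility check is to convert the pixel dimensions into a physical page size for a given resolution. The sketch below does this for the first file in the listing (2479x3508 pixels), once with the declared 300 DPI and once with the suspected 150 DPI. The function name page_size_inches is made up for this example. The physical page size is not given here, so the numbers only show what each resolution would imply.

```shell
# Hypothetical helper: page size in inches for a pixel width, pixel height and DPI value.
page_size_inches() {
  awk -v w="$1" -v h="$2" -v dpi="$3" \
    'BEGIN { printf "%.2f x %.2f in\n", w / dpi, h / dpi }'
}

page_size_inches 2479 3508 300   # prints "8.26 x 11.69 in"
page_size_inches 2479 3508 150   # prints "16.53 x 23.39 in"
```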

The TIFF files are even smaller. While the JPEG files use 8 bit grayscale, the TIFF files are binarized images.
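
To get an overview of the mixture of formats, the files in a set can be counted by type. The sketch below uses file(1), which also appears to have produced the listing above, and groups the files by the MIME type it reports. The function name count_by_mime_type is made up for this example, and the -maxdepth option assumes GNU or BSD find.

```shell
# Hypothetical helper: count the files in a directory by MIME type as reported by file(1).
count_by_mime_type() {
  find "$1" -maxdepth 1 -type f -exec file --brief --mime-type {} + | sort | uniq -c
}

# Usage:
#   count_by_mime_type TrainingSet_ONB_Newseye_GT_M1+
```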
