AustrianNewspapers
Austrian Newspapers is a ground truth data set created with Transkribus from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See this publication for details:
Günter Muehlberger, & Günter Hackl. (2019). NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3387369
The transcription uses level 1 according to the OCR-D transcription guidelines (German).
The files needed for training with tesstrain can be extracted from the data set with page2img.py, which is part of OCR-D/format-converters. The current code needs a patch for PAGE files from Transkribus like those in the data set.
# Get the data set.
wget https://zenodo.org/record/3387369/files/TrainingSet_ONB_Newseye_GT_M1+.zip
wget https://zenodo.org/record/3387369/files/ValidationSet_ONB_Newseye_GT_M1+.zip
# Unzip the data.
unzip TrainingSet_ONB_Newseye_GT_M1+.zip
unzip ValidationSet_ONB_Newseye_GT_M1+.zip
# Optionally remove the zip files which are no longer needed.
rm TrainingSet_ONB_Newseye_GT_M1+.zip ValidationSet_ONB_Newseye_GT_M1+.zip
# Get the data needed for tesstrain.
for set in TrainingSet_ONB_Newseye_GT_M1+ ValidationSet_ONB_Newseye_GT_M1+; do
  cd "$set"
  for xml in *.xml; do
    dir="gt/$(basename "$xml" .xml)"
    echo "$xml"
    mkdir -p "$dir"
    # Replace PATH with the location of the format-converters checkout.
    python PATH/format-converters/page2img.py --text --out-dir "$dir" --page-version 2013-07-15 "$xml"
  done
  cd ..
done
# Remove unneeded files.
find TrainingSet_ONB_Newseye_GT_M1+ ValidationSet_ONB_Newseye_GT_M1+ -name "*line*" -exec rm {} +
# Fix names of text files.
find . -name "*.txt" ! -name "*.gt.txt" -exec sh -c 'mv -v "$1" "${1%.txt}.gt.txt"' _ {} \;
# ...
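After the text files have been renamed, a quick consistency check can reveal ground truth lines without a matching line image. The following is only a minimal sketch and not part of the original workflow. The function name find_unpaired_gt is hypothetical, and the sketch assumes that each line image sits next to its text file with the same base name and a known extension (png by default, which may differ from what page2img.py actually writes).

```shell
# List ground truth text files that have no line image next to them.
# Assumption: image and text share the base name, e.g. x.png and x.gt.txt.
find_unpaired_gt() {
    local dir="${1:-.}"   # directory to scan
    local ext="${2:-png}" # assumed image extension
    find "$dir" -name '*.gt.txt' | sort | while IFS= read -r txt; do
        # Strip the .gt.txt suffix and look for the image with that base name.
        [ -e "${txt%.gt.txt}.$ext" ] || printf '%s\n' "$txt"
    done
}
```

Calling `find_unpaired_gt TrainingSet_ONB_Newseye_GT_M1+/gt png` prints one path per unpaired text file and nothing if every text file has an image.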
The PAGE files were exported from Transkribus, which does not generate valid XML. They therefore cannot be processed directly by tools that require valid XML, for example the PRIMA PageViewer. See this report: https://github.com/Transkribus/TranskribusCore/issues/38.
The data set includes a mixture of image files in JPEG and TIFF format in various resolutions:
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_18950706_8.jpg: JPEG image data, JFIF standard 1.02, resolution (DPI), density 300x300, segment length 16, baseline, precision 8, 2479x3508, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_001.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 458x458, segment length 16, baseline, precision 8, 3598x5367, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_005.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 460x460, segment length 16, baseline, precision 8, 3607x5387, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_008.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 332x332, segment length 16, baseline, precision 8, 2746x3589, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_013.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 334x334, segment length 16, baseline, precision 8, 2763x3576, components 3
The JPEG files are surprisingly small, so the real resolution might be only 150 x 150 DPI.
The TIFF files are even smaller. While the JPEG files use 8 bit grayscale, the TIFF files are binarized images.
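One way to judge the density values in the metadata is to compute the page size they imply: pixels divided by DPI gives inches, and one inch is 25.4 mm. The helper below is a hypothetical sketch of that arithmetic (the name implied_size_mm is not from the data set or any tool). Whether the resulting size matches the physical newspaper page is not stated here and would have to be checked against the originals.

```shell
# Print the page size in millimetres implied by pixel size and DPI metadata.
implied_size_mm() {
    local width_px="$1" height_px="$2" dpi="$3"
    awk -v w="$width_px" -v h="$height_px" -v d="$dpi" \
        'BEGIN { printf "%.0f x %.0f mm\n", w / d * 25.4, h / d * 25.4 }'
}
```

For the first file listed above, `implied_size_mm 2479 3508 300` prints `210 x 297 mm`, while `implied_size_mm 2479 3508 150` prints `420 x 594 mm`, so halving the assumed resolution doubles both page dimensions.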
The ground truth files contain 80 Latin characters (not only from the German alphabet) and 67 other characters (even rare ones like ⅜ or ∆), but important characters like ² (superscript two) or ſ (long s) are missing.
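A character inventory like this can be obtained by splitting all ground truth text into single characters and counting them. The sketch below shows one possible way and is not necessarily how the numbers above were produced. The function names are hypothetical, and it assumes a UTF-8 locale (so that grep treats multi-byte characters as one character) and text files named *.gt.txt. Note that the space character is counted as well.

```shell
# Print each distinct character with its frequency, most frequent first.
char_inventory() {
    local dir="${1:-.}"
    find "$dir" -name '*.gt.txt' -exec cat {} + |
        grep -o . |          # one character per line (needs a UTF-8 locale)
        LC_ALL=C sort |      # byte-wise order, so distinct characters never collate as equal
        LC_ALL=C uniq -c |
        sort -rn
}

# Print the number of distinct characters.
count_distinct_chars() {
    char_inventory "${1:-.}" | wc -l | tr -d ' '
}
```

Running `char_inventory gt | grep -F 'ſ'` prints nothing if the long s does not occur in the transcriptions.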
The transcription uses placeholders for characters which could not be identified, for example bi###g (billig). This affects about 257 lines.
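Affected lines can be located with a recursive search for the placeholder. This is a minimal sketch with hypothetical function names. It assumes that the placeholder is always the character # as in the example, that # does not occur as regular text, and that each *.gt.txt file holds exactly one line.

```shell
# Count ground truth lines which contain the placeholder character.
count_placeholder_lines() {
    local dir="${1:-.}"
    # -r: recurse, -h: omit file names, --include: only ground truth files.
    grep -rh --include='*.gt.txt' '#' "$dir" | wc -l | tr -d ' '
}

# List the affected files so that they can be reviewed or excluded.
list_placeholder_files() {
    local dir="${1:-.}"
    grep -rl --include='*.gt.txt' '#' "$dir" | sort
}
```

For example, `list_placeholder_files gt > placeholder_files.txt` saves the list for later review.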
As expected for human-made transcriptions, there is also a certain error rate in the ground truth. Typical Fraktur confusions like f / long s and u / n occur often. Another example: Sparberde (should be Sparherde).
Here are more examples:
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
index 5a7a60a4..ee390841 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_126.gt.txt
@@ -1 +1 @@
-— Wir haben gestern non dem merkwürdigen Tadel
+— Wir haben gestern von dem merkwürdigen Tadel
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
index 7c4e23d4..d4e2c3ec 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_127.gt.txt
@@ -1 +1 @@
-erzahlt, den der Colberger Bürgermeister van dem vorgesezten
+erzählt, den der Colberger Bürgermeister von dem vorgesetzten
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
index 0f94eb05..f5bc489e 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_142.gt.txt
@@ -1 +1 @@
-Stadt aufs tieffte. In der Stadtverordnetensitzung vom 1. Juli
+Stadt aufs tiefste. In der Stadtverordnetensitzung vom 1. Juli
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
index 6a55c966..09cec4ff 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_156.gt.txt
@@ -1 +1 @@
-Volksversammlung gestattet wurden, die von den Freisinnigen
+Volksversammlung gestattet wvrden, die von den Freisinnigen
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
index ec589592..19edbefe 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_198.gt.txt
@@ -1 +1 @@
-als Konservativer anftretender Schuhmachenneister. Es ist in
+als Konservativer auftretender Schuhmachermeister. Es ist in
diff --git a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
index dac61928..8271d184 100644
--- a/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
+++ b/gt/eval/ONB_aze_18950706_4/ONB_aze_18950706_4.jpg_tl_217.gt.txt
@@ -1 +1 @@
-die Versammlung im Strandschloffe unangenehm gewesen ist.
+die Versammlung im Strandschlosse unangenehm gewesen ist.