AustrianNewspapers
Austrian Newspapers is a ground truth data set created with Transkribus from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See this publication for details:
Guenter Muehlberger, & Guenter Hackl. (2019). NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3387369
The transcription uses level 1 according to the OCR-D transcription guidelines (German).
The PAGE files were exported from Transkribus, which does not generate valid XML. Tools that require valid XML, for example the PRIMA PageViewer, therefore cannot process them directly.
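To get a first impression of which files a strict parser rejects, one can simply try to parse every XML file. The following is a minimal sketch, not part of the data set: the function name `find_unparseable_xml` is hypothetical, and the check only covers XML well-formedness. It does not validate against the PAGE schema, so files that are well-formed but violate the schema would pass unnoticed.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_unparseable_xml(root_dir):
    """Return (path, error message) pairs for XML files a strict parser rejects.

    Hypothetical helper: it checks well-formedness only, not PAGE schema validity.
    """
    failures = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            ET.parse(path)
        except ET.ParseError as err:
            failures.append((path, str(err)))
    return failures


if __name__ == "__main__":
    # Directory name taken from the file listing in this README;
    # adjust it to wherever the PAGE files are stored locally.
    for path, message in find_unparseable_xml("TrainingSet_ONB_Newseye_GT_M1+"):
        print(f"{path}: {message}")
```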
The data set includes a mixture of image files in JPEG and TIFF format in various resolutions:
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_18950706_8.jpg: JPEG image data, JFIF standard 1.02, resolution (DPI), density 300x300, segment length 16, baseline, precision 8, 2479x3508, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_001.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 458x458, segment length 16, baseline, precision 8, 3598x5367, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_aze_19110701_005.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 460x460, segment length 16, baseline, precision 8, 3607x5387, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_008.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 332x332, segment length 16, baseline, precision 8, 2746x3589, components 3
TrainingSet_ONB_Newseye_GT_M1+/ONB_krz_19110701_013.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 334x334, segment length 16, baseline, precision 8, 2763x3576, components 3
The JPEG files are surprisingly small, so the real resolution might be 150 x 150 DPI.
The TIFF files are even smaller. While the JPEG files use 8-bit grayscale, the TIFF files are binarized images.
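One way to put the declared JPEG densities in perspective is to convert the pixel dimensions into a physical page size. The sketch below does this for the five JPEG files listed above, once with the declared density and once with the suspected 150 DPI. The pixel sizes and densities are copied from that listing; the helper name `physical_size_mm` is hypothetical, and the 150 DPI value is only the guess stated above, not a measured value.

```python
MM_PER_INCH = 25.4
SUSPECTED_DPI = 150  # assumption from the text above, not a measured value

# (file name, width in pixels, height in pixels, declared DPI)
JPEG_SAMPLES = [
    ("ONB_aze_18950706_8.jpg", 2479, 3508, 300),
    ("ONB_aze_19110701_001.jpg", 3598, 5367, 458),
    ("ONB_aze_19110701_005.jpg", 3607, 5387, 460),
    ("ONB_krz_19110701_008.jpg", 2746, 3589, 332),
    ("ONB_krz_19110701_013.jpg", 2763, 3576, 334),
]


def physical_size_mm(width_px, height_px, dpi):
    """Convert pixel dimensions to millimetres for a given density in DPI."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


for name, width_px, height_px, dpi in JPEG_SAMPLES:
    declared = physical_size_mm(width_px, height_px, dpi)
    suspected = physical_size_mm(width_px, height_px, SUSPECTED_DPI)
    print(
        f"{name}: {declared[0]:.0f} x {declared[1]:.0f} mm at {dpi} DPI, "
        f"{suspected[0]:.0f} x {suspected[1]:.0f} mm at {SUSPECTED_DPI} DPI"
    )
```

For the first file, the declared 300 DPI corresponds to a page of about 210 x 297 mm (A4), while 150 DPI would correspond to about 420 x 594 mm (A2). Which of the two is closer to the original newspaper format is not stated here and would have to be checked against the physical pages.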