HisDoc1B Dataset

The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, exceeding existing datasets by more than 200 times in both the number of document images and the number of characters (see Table 1). It is also the only dataset with complete book-level annotations and punctuation annotations.

| Dataset | #Books | #Document images | #Characters | #Character categories | Text punctuation |
| --- | --- | --- | --- | --- | --- |
| MTHv1 [1] | - | 1,500 | 521,370 | 4,058 | × |
| MTHv2 [2] | - | 3,199 | 1,081,678 | 6,733 | × |
| IC19 HDRC [3] | - | 11,715 | 2,482,994 | 8,353 | × |
| M5HisDoc [4] | - | 8,000 | 4,367,360 | 16,151 | × |
| CASIA-AHCDB [5] | - | - | 2,276,740 | 10,350 | × |
| HisDoc1B (Ours) | 40,281 | 3,163,330 (270×) | 1,082,544,808 (248×) | 30,615 (1.9×) | ✓ |

Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
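
The multipliers in the HisDoc1B row appear to be ratios against the largest previously existing value in each column. The short sketch below recomputes them from the figures in Table 1; the pairing of each column with a specific prior dataset is our reading of the table, not something the README states explicitly.

```python
# Largest prior value per column of Table 1, paired with the HisDoc1B value.
# Which prior dataset is the baseline is an assumption read off the table.
columns = {
    "document images": (11_715, 3_163_330),          # IC19 HDRC
    "characters": (4_367_360, 1_082_544_808),        # M5HisDoc
    "character categories": (16_151, 30_615),        # M5HisDoc
}


def scale_factor(previous_best: int, ours: int) -> float:
    """Return how many times larger `ours` is than `previous_best`."""
    return ours / previous_best


for name, (previous_best, ours) in columns.items():
    print(f"{name}: {scale_factor(previous_best, ours):.1f}x")
```

Rounded to the precision used in the table, these ratios match the reported 270×, 248×, and 1.9×.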

Download

OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq

Directory Format

The dataset is organized in the following directory format:

├── HisDoc1B
    ├── books
    │   ├── xxx.pdf/.djvu
    │   └── ...
    ├── annos
    │   ├── xxx.json
    │   └── ...
    ├── readme.md
    ├── book2im.py
    └── read_anno.py
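
As a rough illustration of working with this layout, the sketch below pairs each book file with an annotation file. It assumes that a book and its annotation share the same base name (as the `xxx` placeholder suggests), which the README does not confirm. The function name `pair_books_with_annotations` is hypothetical and is not part of the dataset's own scripts.

```python
from pathlib import Path

# Book formats listed in the directory tree above.
BOOK_SUFFIXES = {".pdf", ".djvu"}


def pair_books_with_annotations(root):
    """Return (book_path, annotation_path_or_None) for each book under root/books.

    Assumption: books/xxx.pdf (or .djvu) is annotated by annos/xxx.json.
    """
    root = Path(root)
    # Index annotation files by base name for quick lookup.
    annos = {p.stem: p for p in (root / "annos").glob("*.json")}
    pairs = []
    for book in sorted((root / "books").iterdir()):
        if book.suffix.lower() in BOOK_SUFFIXES:
            pairs.append((book, annos.get(book.stem)))
    return pairs


if __name__ == "__main__":
    for book, anno in pair_books_with_annotations("HisDoc1B"):
        status = anno.name if anno is not None else "no annotation found"
        print(f"{book.name}: {status}")
```

The annotation schema is not described in this README, so the sketch stops at locating the files; the bundled `read_anno.py` and `book2im.py` are presumably the reference for reading annotations and converting books to images.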

Inference code to generate the dataset

Contact

For any questions about the dataset, please email the authors at yongxin_shi@foxmail.com.