The HisDoc1B dataset comprises 40,281 books, over 3 million document images, and over 1 billion characters spanning 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, exceeding existing datasets in scale by more than 200 times (as shown in the table below). It is also the only dataset with complete book-level annotations and punctuation annotations.
Dataset | #Books | #Document images | #Characters | #Character categories | Text punctuation |
---|---|---|---|---|---|
MTHv1[1] | - | 1,500 | 521,370 | 4,058 | × |
MTHv2[2] | - | 3,199 | 1,081,678 | 6,733 | × |
IC19 HDRC[3] | - | <u>11,715</u> | 2,482,994 | 8,353 | × |
M5HisDoc[4] | - | 8,000 | <u>4,367,360</u> | <u>16,151</u> | × |
CASIA-AHCDB[5] | - | - | 2,276,740 | 10,350 | × |
HisDoc1B (Ours) | **40,281** | **3,163,330 (270×)** | **1,082,544,808 (248×)** | **30,615 (1.9×)** | ✓ |
Table 1: Comparison of HisDoc1B with existing Chinese historical document datasets. The highest and second highest values within each column are denoted by bold and underline, respectively.
Download the dataset:
OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTdPghMv281sKYsq0?e=fIuK65
BaiduYun: https://pan.baidu.com/s/1CQnfmHwh6hGigyvHNlmPCQ?pwd=aziq
The dataset is organized in the following directory format:

```
HisDoc1B
├── books
│   ├── xxx.pdf/.djvu
│   └── ...
├── annos
│   ├── xxx.json
│   └── ...
├── readme.md
├── book2im.py
└── read_anno.py
```
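The annotation files are plain JSON and can be inspected with the standard json module. The snippet below is only a minimal sketch for getting oriented: the file name is a placeholder, and the authoritative description of the annotation format is the released read_anno.py.

```python
import json
from pathlib import Path

# Placeholder file name; substitute any file from the downloaded `annos` directory.
anno_path = Path("HisDoc1B/annos/example_book.json")

with anno_path.open("r", encoding="utf-8") as f:
    anno = json.load(f)

# The exact schema is defined by read_anno.py; here we only peek at the
# top-level structure of the loaded annotation.
if isinstance(anno, dict):
    print("Top-level keys:", list(anno.keys()))
elif isinstance(anno, list):
    print("Number of entries:", len(anno))
    print("First entry:", anno[0])
```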
Download the docker image:
OneDrive: https://1drv.ms/u/s!ApQfSeOP7LDTc0VErUM9NBjQLls?e=gVTcqw
BaiduYun: https://pan.baidu.com/s/1OV_RjZ8pf9QJlCrMDFvztA?pwd=wdlm
Initialize the docker container:
```
docker load -i hisdoc1b-docker.tar
docker run -it --gpus all --shm-size="2g" hisdoc1b-image:20240716
```
The code is located in /root/HisDoc1B_codes/ inside the container.
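For the book-to-image step (handled in the release by book2im.py), the sketch below is only a rough illustration, not the authors' implementation: it renders a PDF book to PNG pages with PyMuPDF, the paths are placeholders, and DjVu books would require a different converter.

```python
import fitz  # PyMuPDF, for rendering PDF pages; DjVu books need a separate tool
from pathlib import Path


def pdf_to_images(pdf_path: str, out_dir: str, zoom: float = 2.0) -> None:
    """Render every page of a PDF book to a PNG file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(zoom, zoom)  # zoom=2.0 gives roughly 144 dpi
    for i, page in enumerate(doc):
        pix = page.get_pixmap(matrix=matrix)
        pix.save(str(out / f"page_{i:04d}.png"))
    doc.close()


if __name__ == "__main__":
    # Placeholder paths; point them at a downloaded book and an output folder.
    pdf_to_images("HisDoc1B/books/example_book.pdf", "HisDoc1B/images/example_book")
```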
For any questions about the dataset, please contact the authors by sending an email to yongxin_shi@foxmail.com.