This page serves as a compilation of the performance metrics achieved by various visual information extraction algorithms on public benchmarks. The data presented here are collected from research papers as well as official code repositories.
Given the prediction of the model and the ground truth, if the predicted content is exactly consistent with the ground truth, it is recorded as a true positive (TP) sample. Let $N_{tp}$ denote the number of TP samples, $N_{pred}$ the total number of predictions, and $N_{gt}$ the total number of ground-truth samples. Then

$$\text{Precision} = \frac{N_{tp}}{N_{pred}}, \qquad \text{Recall} = \frac{N_{tp}}{N_{gt}}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The Entity F1 score is a metric used for the Entity Extraction task (also known as Semantic Entity Recognition, SER). It measures the accuracy of the predicted string and its corresponding category with respect to the ground-truth. When both the predicted string and category match the ground-truth, it is considered a TP sample.
If you are using BIO-tagging models, such as LayoutLM, LiLT, etc., you can utilize the seqeval library for metric calculation.
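A minimal sketch of entity-level scoring with seqeval (the BIO tag sequences below are made-up placeholders, not from any benchmark):

```python
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score

# Hypothetical BIO tag sequences: one inner list per document / text segment.
y_true = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER"]]
y_pred = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "O"]]

# seqeval groups BIO tags into entities and scores at the entity level,
# so a prediction counts only when both the span and the category match.
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.5
```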
Used as the metric for the Entity Linking (also known as Relation Extraction, RE) task. The model takes the ground-truth entities from Entity Extraction as input, then predicts the links between them. A predicted link is counted as a TP if and only if the pair exists in the ground-truth pairs.
Used as the metric for the end-to-end pair extraction task. A prediction is counted as a TP if and only if the predicted key-value pair exactly matches a ground-truth pair.
This metric is used specifically for LLM-based models, which are evaluated in a question-answering (QA) manner.
For Entity Extraction, two types of operations are employed:
- The model takes the text content of an entity as input and predicts its corresponding key category. Used for datasets like FUNSD, where each key category contains multiple entities.
- The model takes the key category name as input and predicts the corresponding text content. Used for datasets where each key category contains one or no entity.
For Entity Linking, the model takes the question entity as input, then predicts the corresponding answer entity.
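The exact QA scoring protocol varies across papers; a common choice is a SQuAD-style token-overlap F1 between the predicted answer string and the ground truth. A minimal sketch (the lowercasing/whitespace normalization below is an assumption, not a fixed standard):

```python
from collections import Counter

def qa_token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground-truth answer string."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_token_f1("EGG TART", "egg tart"))  # 1.0
```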
The edit distance between the prediction string and the ground-truth string of a key category can be calculated as a normalized edit distance:

$$\mathrm{nED} = \frac{\mathrm{ED}(s_{pred},\ s_{gt})}{\max\left(|s_{pred}|,\ |s_{gt}|\right)}$$

where $\mathrm{ED}(\cdot,\cdot)$ denotes the Levenshtein edit distance and $|s|$ denotes the length of string $s$.
The document structure parsing task in CORD employs this metric; the zhang-shasha (zss) library can be used to calculate the tree edit distance.
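A minimal sketch of computing a tree edit distance with the zss (zhang-shasha) package; the tree layout below is a made-up illustration of a parsed receipt, not the official evaluation script:

```python
# pip install zss
from zss import Node, simple_distance

# Ground-truth structure: menu -> item -> (name, count, price) leaves.
gt = Node("menu").addkid(
    Node("item").addkid(Node("EGG TART")).addkid(Node("1")).addkid(Node("13,000"))
)

# Prediction with one wrong leaf ("14,000" instead of "13,000").
pred = Node("menu").addkid(
    Node("item").addkid(Node("EGG TART")).addkid(Node("1")).addkid(Node("14,000"))
)

# simple_distance counts node insertions, deletions, and relabels (unit costs).
print(simple_distance(gt, pred))  # 1
```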
The SROIE dataset uses the entity micro F1-score as its evaluation metric. The dataset contains four key categories, each containing one or no entity. If the predicted string of a key category is consistent with the ground-truth string, it is recorded as a true positive (TP) sample. The total number of TP samples, the total number of predictions, and the total number of ground-truth strings are used to calculate the micro F1-score.
You can find the evaluation scripts for the SROIE dataset on the ICDAR2019 SROIE official page (Download tab, Task 3 Evaluation script).
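For reference, a minimal sketch of the micro F1 computation over per-document key-category predictions (the field names and values are made up; the official ICDAR script remains the authoritative implementation):

```python
def micro_f1(predictions, ground_truths):
    """predictions / ground_truths: list of dicts mapping key category -> string."""
    tp = n_pred = n_gt = 0
    for pred, gt in zip(predictions, ground_truths):
        n_pred += len(pred)
        n_gt += len(gt)
        tp += sum(1 for k, v in pred.items() if gt.get(k) == v)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = [{"company": "ABC MART", "total": "46,000"}]
gt = [{"company": "ABC MART", "total": "45,000", "date": "2018-01-01"}]
print(micro_f1(pred, gt))  # 0.4
```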
Type | Approach | Config | Precision | Recall | F1 | QA F1
---|---|---|---|---|---|---
Grid-based | ViBERTgrid | BERT-base | - | - | 96.25 | - |
RoBERTa-base | - | - | 96.40 | - | ||
GNN-based | PICK | - | - | 96.12 | - | |
MatchVIE | - | - | 96.57 | - | ||
GraphDoc | - | - | 98.45 | - | ||
FormNetV2 | - | - | 98.31 | - | ||
Large Scale Pre-trained | LayoutLM | base | 94.38 | 94.38 | 94.38 | - |
large | 95.24 | 95.24 | 95.24 | - | ||
LayoutLMv2 | base | 96.25 | 96.25 | 96.25 | - | |
large | 99.04 | 96.61 | 97.81 | - | ||
TILT | base | - | - | 97.65 | - | |
large | - | - | 98.10 | - | ||
BROS | base | - | - | 95.91 | - | |
large | - | - | 96.62 | - | ||
StrucTexT | eng-base | - | - | 96.88 | - | |
chn&eng-base | - | - | 98.27 | - | ||
chn&eng-large | - | - | 98.70 | - | ||
WUKONG-READER | base | - | - | 96.88 | - | |
large | - | - | 98.15 | - | ||
ERNIE-layout | large | - | - | 97.55 | - | |
QGN | - | - | 97.90 | - | ||
LayoutMask | base | - | - | 96.87 | - | |
large | - | - | 97.27 | - | ||
HGALayoutLM | base | 99.58 | 99.48 | 99.53 | - | |
large | 99.69 | 99.53 | 99.61 | - | ||
End-to-End | TRIE | ground-truth | - | - | 96.18 | - |
end-to-end | - | - | 82.06 | - | ||
VIES | ground-truth | - | - | 96.12 | - | |
end-to-end | - | - | 91.07 | - | ||
Kuang CFAM | end-to-end | - | - | 85.87 | - | |
OmniParser | - | - | 85.60 | - | ||
HIP | - | - | 87.60 | - | ||
LLM-based | HRVDA | - | - | - | 91.00 | |
Monkey | - | - | - | 41.90 | ||
TextMonkey | - | - | - | 47.00 | ||
MiniMonkey | - | - | - | 70.30 | ||
UniDoc | 224 | - | - | - | 1.40 | |
336 | - | - | - | 2.92 | ||
DocPedia | 224 | - | - | - | 17.01 | |
336 | - | - | - | 21.44 | ||
LayoutLLM | Llama2-7B-chat | - | - | - | 70.97 | |
Vicuna-1.5-7B | - | - | - | 72.12 | ||
Other Methods | TCPN | TextLattice | - | - | 96.54 | - |
Tag, ground-truth | - | - | 95.46 | - | ||
Tag, end-to-end | - | - | 91.21 | - | ||
Tag&Copy, end-to-end | - | - | 91.93 | - |
The authors of the CORD dataset, the Clova-AI team, have not explicitly specified the task type and evaluation metrics for this dataset. However, upon reviewing the source code of Donut, one of Clova-AI's works, it is apparent that they evaluate the model's performance in Document Structure Parsing. In a typical receipt, various details about the purchased items are provided, such as their names, quantities, and unit prices. These entities have a hierarchical relationship, and a receipt can be represented by a JSON-like structure as shown below:
{
  "menu": [
    {
      "nm": "EGG TART",
      "cnt": "1",
      "price": "13,000"
    },
    {
      "nm": "CHOCO CUS ARD PASTRY",
      "cnt": "2",
      "price": "24,000"
    },
    {
      "nm": "REDBEAN BREAD",
      "cnt": "1",
      "price": "9,000"
    }
  ],
  "total": {
    "total_price": "46,000",
    "cashprice": "50,000",
    "changeprice": "4,000"
  }
}
The evaluation metric used by Donut is the TED Acc (Tree Edit Distance Accuracy), which measures the similarity between the predicted JSON and the ground-truth.
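In the Donut paper, this accuracy is defined from the tree edit distance between the predicted tree $pr$ and the ground-truth tree $gt$:

$$\mathrm{TED\ Acc} = \max\left(0,\ 1 - \frac{\mathrm{TED}(pr,\ gt)}{\mathrm{TED}(\phi,\ gt)}\right)$$

where $\phi$ denotes the empty tree and $\mathrm{TED}(\cdot,\cdot)$ is the tree edit distance.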
In addition to Document Structure Parsing, Donut also evaluates the model's performance on the Entity Extraction task, using the Entity F1 score as the evaluation metric. Most SOTA models follow this evaluation pipeline.
Another work by Clova AI, SPADE, evaluates the model's performance on Document Structure Parsing through a relaxed structured field F1-score. This evaluation measures the accuracy of dependency parsing by computing the F1 score of predicted edges. The task is relaxed by not counting differences between predictions and ground truth in certain fields (such as store name, menu name, and item name) when the edit distance is less than 2 or when the ratio of the edit distance to the ground-truth string length is less than or equal to 0.4. Details can be found in their paper (Sections 5.3 and A.2).
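A minimal sketch of such a relaxed string match, assuming a Levenshtein helper from the third-party editdistance package (the thresholds follow the description above; the official SPADE code is authoritative):

```python
# pip install editdistance
import editdistance

def is_relaxed_match(pred: str, gt: str) -> bool:
    """Treat two field strings as equal under the relaxed matching rule."""
    dist = editdistance.eval(pred, gt)
    return dist < 2 or dist / max(len(gt), 1) <= 0.4

print(is_relaxed_match("EGG TARTS", "EGG TART"))     # True (edit distance 1)
print(is_relaxed_match("CHOCO PASTRY", "EGG TART"))  # False
```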
Some other works, such as BROS, evaluate the model's performance on Entity Linking using the Linking F1 score.
Type | Approach | Config | EE Precision | EE Recall | EE F1 | EE QA F1 | EL Precision | EL Recall | EL F1 | Parsing Precision | Parsing Recall | Parsing F1 | TED Acc
---|---|---|---|---|---|---|---|---|---|---|---|---|---
GNN-based | GraphDoc | - | - | 96.93 | - | - | - | - | - | - | - | - | |
FormNet | 98.02 | 96.55 | 97.28 | - | - | - | - | - | - | - | - | ||
FormNetV2 | - | - | 97.70 | - | - | - | - | - | - | - | - | |
Large Scale Pre-trained | LayoutLM | base | 94.37 | 95.08 | 94.72 | - | - | - | - | - | - | - | - |
large | 94.32 | 95.54 | 94.93 | - | - | - | - | - | - | - | - | ||
LayoutLMv2 | base | 94.53 | 95.39 | 94.95 | - | - | - | - | - | - | - | - | |
large | 95.65 | 96.37 | 96.01 | - | - | - | - | - | - | - | - | ||
LayoutLMv3 | base | - | - | 96.56 | - | - | - | - | - | - | - | - | |
large | - | - | 97.46 | - | - | - | - | - | - | - | - | ||
DocFormer | base | 96.52 | 96.14 | 96.33 | - | - | - | - | - | - | - | - | |
large | 97.25 | 96.74 | 96.99 | - | - | - | - | - | - | - | - | ||
TILT | base | - | - | 95.11 | - | - | - | - | - | - | - | - | |
large | - | - | 96.33 | - | - | - | - | - | - | - | - | ||
BROS | base | - | - | 96.50 | - | - | - | 95.73 | - | - | - | - | |
large | - | - | 97.28 | - | - | - | 97.40 | - | - | - | - | ||
UDoc | UDoc | - | - | 96.64 | - | - | - | - | - | - | - | - | |
UDoc* | - | - | 96.86 | - | - | - | - | - | - | - | - | ||
LiLT | [EN-RoBERTa]base | - | - | 96.07 | - | - | - | - | - | - | - | - | |
[InfoXLM]base | - | - | 95.77 | - | - | - | - | - | - | - | - | ||
DocReL | - | - | 97.00 | - | - | - | - | - | - | - | - | ||
WUKONG-READER | base | - | - | 96.54 | - | - | - | - | - | - | - | - | |
large | - | - | 97.27 | - | - | - | - | - | - | - | - | ||
ERNIE-layout | large | - | - | 96.99 | - | - | - | - | - | - | - | - | |
QGN | - | - | 96.84 | - | - | - | - | - | - | - | - | ||
GeoLayoutLM | - | - | 97.97 | - | - | - | 99.45 | - | - | - | - | ||
GraphLayoutLM | base | - | - | 97.28 | - | - | - | - | - | - | - | - | |
large | - | - | 97.75 | - | - | - | - | - | - | - | - | ||
HGALayoutLM | base | 97.89 | 97.16 | 97.52 | - | - | - | - | - | - | - | - | |
large | 97.97 | 97.38 | 97.67 | - | - | - | - | - | - | - | - | ||
DocFormerv2 | base | 97.51 | 96.10 | 96.80 | - | - | - | - | - | - | - | - | |
large | 97.71 | 97.70 | 97.70 | - | - | - | - | - | - | - | - | ||
DocTr | - | - | 98.20 | - | - | - | - | - | - | 94.40 | - | ||
LayoutMask | base | - | - | 96.99 | - | - | - | - | - | - | - | - | |
large | - | - | 97.19 | - | - | - | - | - | - | - | - | ||
End-to-End | Donut | - | - | 84.10 | - | - | - | - | - | - | - | 90.90 | |
ESP | - | - | 95.65 | - | - | - | - | - | - | - | - | ||
UDOP | - | - | 97.58 | - | - | - | - | - | - | - | - | ||
CREPE | - | - | 85.00 | - | - | - | - | - | - | - | - | ||
OmniParser | - | - | 84.80 | - | - | - | - | - | - | - | 88.00 | ||
HIP | - | - | 85.70 | - | - | - | - | - | - | - | - | ||
LLM-based | HRVDA | - | - | - | 89.30 | - | - | - | - | - | - | - | |
LayoutLLM | Llama2-7B-chat | - | - | - | 62.21 | - | - | - | - | - | - | - | |
Vicuna-1.5-7B | - | - | - | 63.10 | - | - | - | - | - | - | - | ||
Other Methods | SPADE | ♠ CORD, oracle input | - | - | - | - | - | - | - | - | - | 92.50 | - |
♠ CORD | - | - | - | - | - | - | - | - | - | 88.20 | - | ||
♠ CORD+ | - | - | - | - | - | - | - | - | - | 87.40 | - | ||
♠ CORD++ | - | - | - | - | - | - | - | - | - | 83.10 | - | ||
♠ w/o TCM, CORD, oracle input | - | - | - | - | - | - | - | - | - | 91.50 | - | ||
♠ w/o TCM, CORD | - | - | - | - | - | - | - | - | - | 87.40 | - | ||
♠ w/o TCM, CORD+ | - | - | - | - | - | - | - | - | - | 86.10 | - | ||
♠ w/o TCM, CORD++ | - | - | - | - | - | - | - | - | - | 82.60 | - |
FUNSD comprises two tasks: Entity Extraction and Entity Linking. The Entity Extraction task requires extracting header, question, and answer entities from the document, and employs the Entity F1 score as the evaluation metric. The Entity Linking task focuses on predicting links between question and answer entities, and uses the Linking F1 score as the evaluation metric.
It is worth noting that, in most mainstream approaches, these two subtasks are treated as independent. For instance, the official Entity Linking implementation of LayoutLM takes the ground-truth question and answer entities as input and predicts only the links, without considering the performance of Entity Extraction.
Real-world applications require extracting all key-value pairs from a document, which involves combining the EE and EL tasks to predict the entire key-value pair content. We term this task End-to-End Pair Extraction. It presents challenges such as error accumulation and text segment aggregation. Regrettably, only a few studies have recognized and addressed these challenges, while the majority of research continues to follow the conventional EE+EL setting. We hope to see more studies that delve into this particular case.
Type | Approach | Config | EE Precision | EE Recall | EE F1 | EE QA F1 | EL Precision | EL Recall | EL F1 | EL QA F1 | E2E Precision | E2E Recall | E2E F1
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Grid-based | MSAU-PAF | - | - | 83.00 | - | - | - | - | - | - | - | 75.00 | |||||
GNN-based | GraphDoc | - | - | 87.77 | - | - | - | - | - | - | - | - | |||||
MatchVIE | - | - | 81.33 | - | - | - | - | - | - | - | - | ||||||
FormNet | - | - | 84.69 | - | - | - | - | - | - | - | - | ||||||
FormNetV2 | - | - | 92.51 | - | - | - | - | - | - | - | - | ||||||
Large Scale Pre-trained | LayoutLM | base | 75.97 | 81.55 | 78.66 | - | - | - | - | - | - | - | - | ||||
large | 75.96 | 82.19 | 78.95 | - | - | - | - | - | - | - | - | ||||||
LayoutLMv2 | base | 80.29 | 85.39 | 82.76 | - | - | - | - | - | - | - | - | |||||
large | 83.24 | 85.19 | 84.20 | - | - | - | - | - | - | - | - | ||||||
LayoutXLM | base, Language Specific Fine-tuning | - | - | 79.40 | - | - | - | 54.83 | - | - | - | - | |||||
large, Language Specific Fine-tuning | - | - | 82.25 | - | - | - | 64.04 | - | - | - | - | ||||||
base, Multitask Fine-tuning | - | - | 79.24 | - | - | - | 66.71 | - | - | - | - | ||||||
large, Multitask Fine-tuning | - | - | 80.68 | - | - | - | 76.83 | - | - | - | - | ||||||
LayoutLMv3 | base | - | - | 90.29 | - | - | - | - | - | - | - | - | |||||
large | - | - | 92.08 | - | - | - | - | - | - | - | - | ||||||
XYLayoutLM | - | - | 83.35 | - | - | - | - | - | - | - | - | ||||||
SelfDoc | - | - | 83.36 | - | - | - | - | - | - | - | - | ||||||
DocFormer | base | 80.76 | 86.09 | 83.34 | - | - | - | - | - | - | - | - | |||||
large | 82.29 | 86.94 | 84.55 | - | - | - | - | - | - | - | - | ||||||
StructuralLM-large | 83.52 | - | 85.14 | - | - | - | - | - | - | - | - | ||||||
BROS | base | 81.16 | 85.02 | 83.05 | - | - | - | 71.46 | - | - | - | - | |||||
large | 82.81 | 86.31 | 84.52 | - | - | - | 77.01 | - | - | - | - | ||||||
StrucTexT | eng-base | - | - | 83.09 | - | - | - | 44.10 | - | - | - | - | |||||
chn&eng-base | - | - | 84.83 | - | - | - | 70.45 | - | - | - | - | ||||||
chn&eng-large | - | - | 87.56 | - | - | - | 74.21 | - | - | - | - | ||||||
UDoc | UDoc | - | - | 87.96 | - | - | - | - | - | - | - | - | |||||
UDoc* | - | - | 87.93 | - | - | - | - | - | - | - | - | ||||||
LiLT | [En RoBERTa]base | 87.21 | 89.65 | 88.41 | - | - | - | - | - | - | - | - | |||||
[InfoXLM]base | 84.67 | 87.09 | 85.86 | - | - | - | - | - | - | - | - | ||||||
[InfoXLM]base, Language Specific Fine-tuning | - | - | 84.15 | - | - | - | 62.76 | - | - | - | - | ||||||
[InfoXLM]base, Multitask Fine-tuning | - | - | 85.74 | - | - | - | 74.07 | - | - | - | - | ||||||
DocReL | - | - | - | - | - | - | 46.10 | - | - | - | - | ||||||
WUKONG-READER | base | - | - | 91.52 | - | - | - | - | - | - | - | - | |||||
large | - | - | 93.62 | - | - | - | - | - | - | - | - | ||||||
ERNIE-layout | large | - | - | 93.12 | - | - | - | - | - | - | - | - | |||||
GeoLayoutLM | - | - | 92.86 | - | - | - | 89.45 | - | - | - | - | ||||||
KVPFormer | - | - | - | - | - | - | 90.86 | - | - | - | - | ||||||
GraphLayoutLM | base | - | - | 93.15 | - | - | - | - | - | - | - | - | |||||
large | - | - | 94.39 | - | - | - | - | - | - | - | - | ||||||
HGALayoutLM | base | 94.84 | 93.80 | 94.32 | - | - | - | - | - | - | - | - | |||||
large | 95.67 | 94.95 | 95.31 | - | - | - | - | - | - | - | - | ||||||
DocFormerv2 | base | 89.15 | 87.60 | 88.37 | - | - | - | - | - | - | - | - | |||||
large | 89.88 | 87.92 | 88.89 | - | - | - | - | - | - | - | - | ||||||
DocTr | - | - | 84.00 | - | - | - | 73.90 | - | - | - | - | ||||||
LayoutMask | base | - | - | 92.91 | - | - | - | - | - | - | - | - | |||||
large | - | - | 93.20 | - | - | - | - | - | - | - | - | ||||||
End-to-End | ESP | - | - | 91.12 | - | - | - | 88.88 | - | - | - | - | |||||
UDOP | - | - | 91.62 | - | - | - | - | - | - | - | - | ||||||
HIP | - | - | 52.00 | - | - | - | - | - | - | - | - | ||||||
LLM-based | Monkey | - | - | - | - | - | - | - | 24.10 | - | - | - | |||||
TextMonkey | - | - | - | - | - | - | - | 32.30 | - | - | - | - | - | - | - | - | |
MiniMonkey | - | - | - | - | - | - | - | 42.90 | - | - | - | - | - | - | - | - | |
UniDoc | 224 | - | - | - | - | - | - | - | 1.19 | - | - | - | |||||
336 | - | - | - | - | - | - | - | 1.02 | - | - | - | ||||||
DocPedia | 224 | - | - | - | - | - | - | - | 18.75 | - | - | - | |||||
336 | - | - | - | - | - | - | - | 29.86 | - | - | - | ||||||
LayoutLLM | Llama2-7B-chat | - | - | - | - | - | - | - | 78.65 | - | - | - | |||||
Vicuna-1.5-7B | - | - | - | - | - | - | - | 79.98 | - | - | - | ||||||
Other Methods | SPADE | - | - | 71.60 | - | - | - | 41.30 | - | - | - | - |
XFUND is a multilingual extension of FUNSD, covering 7 languages: Chinese, Japanese, Spanish, French, Italian, German, and Portuguese. It contains 1,393 fully annotated forms, with 199 forms per language: the training set comprises 149 forms, while the test set includes 50 forms. XFUND also includes two subtasks, Entity Extraction and Entity Linking, and follows the same evaluation protocol as FUNSD.
Note: In the following chart, the term Avg. represents the average score of the 7 non-English subsets. Some methods include the English subset in their reported average scores; to ensure a fair comparison, we made adjustments accordingly.
Type | Approach | Config | EE ZH | EE JA | EE ES | EE FR | EE IT | EE DE | EE PT | EE Avg. | EL ZH | EL JA | EL ES | EL FR | EL IT | EL DE | EL PT | EL Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Large Scale Pre-trained | LayoutXLM | base, Language Specific Fine-tuning | 89.24 | 79.21 | 75.50 | 79.02 | 80.02 | 82.22 | 79.03 | 82.40 | 70.73 | 69.63 | 68.96 | 63.53 | 64.15 | 65.51 | 57.18 | 65.67 |
large, Language Specific Fine-tuning | 91.61 | 80.33 | 78.30 | 80.98 | 82.75 | 83.61 | 82.73 | 82.90 | 78.88 | 72.25 | 76.66 | 71.02 | 76.91 | 68.43 | 67.96 | 73.16 | ||
base, Zero-shot transfer | 60.19 | 47.15 | 45.65 | 57.57 | 48.46 | 52.52 | 53.90 | 52.21 | 44.94 | 44.08 | 47.08 | 44.16 | 40.90 | 38.20 | 36.85 | 42.31 | ||
large, Zero-shot transfer | 68.96 | 51.90 | 49.76 | 61.35 | 55.17 | 59.05 | 60.77 | 58.14 | 55.31 | 56.96 | 57.80 | 56.15 | 51.84 | 48.90 | 47.95 | 53.56 | ||
base, Multitask Fine-tuning | 89.73 | 79.64 | 77.98 | 81.73 | 82.10 | 83.22 | 82.41 | 82.40 | 82.41 | 81.42 | 81.04 | 82.21 | 83.10 | 78.54 | 70.44 | 79.88 | ||
large, Multitask Fine-tuning | 91.55 | 82.16 | 80.55 | 83.84 | 83.72 | 85.30 | 86.50 | 84.80 | 90.00 | 86.21 | 85.92 | 86.69 | 86.75 | 82.63 | 81.60 | 85.69 | ||
XYLayoutLM | 91.76 | 80.57 | 76.87 | 79.97 | 81.75 | 83.35 | 80.01 | 82.04 | 74.45 | 70.59 | 72.59 | 65.21 | 65.72 | 67.03 | 58.98 | 67.79 | ||
LiLT | [InfoXLM] base, Language Specific Fine-tuning | 89.38 | 79.64 | 79.11 | 79.53 | 83.76 | 82.31 | 82.20 | 82.27 | 72.97 | 70.37 | 71.95 | 69.65 | 70.43 | 65.58 | 58.74 | 68.53 | |
[InfoXLM] base, Zero-shot transfer | 61.52 | 51.84 | 51.01 | 59.23 | 53.71 | 60.13 | 63.25 | 57.24 | 47.64 | 50.81 | 49.68 | 52.09 | 46.97 | 41.69 | 42.72 | 47.37 | ||
[InfoXLM] base, Multi-task Fine-tuning | 90.47 | 80.88 | 83.40 | 85.77 | 87.92 | 87.69 | 84.93 | 85.86 | 84.71 | 83.45 | 83.35 | 84.66 | 84.58 | 78.78 | 76.43 | 82.28 | ||
KVPFormer | - | - | - | - | - | - | - | - | 94.27 | 94.23 | 95.23 | 97.19 | 94.11 | 92.41 | 92.19 | 94.23 | ||
HGALayoutLM | 94.22 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | ||
End-to-End | ESP | Language Specific Fine-tuning | 90.30 | 81.10 | 85.40 | 90.50 | 88.90 | 87.20 | 87.50 | 87.30 | 90.80 | 88.30 | 85.20 | 90.90 | 90.00 | 85.20 | 86.20 | 88.10 |
Multitask Fine-tuning | - | - | - | - | - | - | - | 89.13 | - | - | - | - | - | - | - | 92.31 |
EPHOIE consists of 11 key categories for Entity Extraction and takes the Entity F1 as the evaluation metric. If the predicted string of a key category is consistent with the ground-truth string and not empty, it will be recorded as a TP sample.
Type | Approach | Config | Precision | Recall | F1
---|---|---|---|---|---
Grid-based | MatchVIE | - | - | 96.87 |
Large-Scale Pre-trained | StrucTexT | chn&eng-base | - | - | 98.84 |
chn&eng-large | - | - | 99.30 | ||
LiLT | [InfoXLM]base | 96.99 | 98.20 | 97.59 | |
[ZH-RoBERTa]base | 97.62 | 98.33 | 97.97 | ||
QGN | - | - | 98.49 | ||
End-to-End | VIES | ground-truth | - | - | 95.23 |
end-to-end | - | - | 83.81 | ||
Other Methods | TCPN | TextLattice | - | - | 98.06 |
Copy Mode, end-to-end | - | - | 84.67 | ||
Tag Mode, end-to-end | - | - | 86.19 | ||
Tag Mode, ground-truth | - | - | 97.59 |
Type | Approach | Config | QA F1
---|---|---|---|
End-to-end | Donut | 61.60 | |
LLM-based | Qwen-VL | 4.10 | |
Monkey | 40.60 | ||
mPLUG-DocOwl | 42.60 | ||
mPLUG-DocOwl 1.5 | DocOwl-1.5 | 68.80 | |
DocOwl-1.5 chat | 68.80 | ||
UReader | 49.50 |
Kleister Charity (KLC) contains 8 key categories. It consists of 2,788 financial reports with 61,643 pages in total. This benchmark is commonly used by LLM-based approaches in a QA manner.
Type | Approach | Config | QA F1
---|---|---|---|
End-to-end | Donut | 30.00 | |
LLM-based | Qwen-VL | 15.90 | |
Monkey | 32.80 | ||
mPLUG-DocOwl | 30.30 | ||
mPLUG-DocOwl 1.5 | DocOwl-1.5 | 37.90 | |
DocOwl-1.5 chat | 38.70 | ||
UReader | 32.80 | ||
DoCo | Qwen-VL-Chat | 33.80 | |
mPLUG-Owl | 32.90 |