Visual Information Extraction

SOTAs

This page serves as a compilation of the performance metrics achieved by various visual information extraction algorithms on public benchmarks. The data presented here are collected from research papers as well as official code repositories.

🎖️Commonly Used Metrics

F1-score

Given the predictions of the model and the ground-truths, a prediction whose content is exactly consistent with the ground-truth is recorded as a true positive (TP) sample. Let $N_p$ denote the number of predictions, $N_g$ the number of ground-truths, and $N_t$ the number of TP samples; then we have

$$ precision = \frac{N_t}{N_p} $$

$$ recall = \frac{N_t}{N_g} $$

$$ F1 = \frac{2 \times precision \times recall}{precision + recall} $$
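
In code, the computation is only a few lines. A minimal sketch, assuming the three counts have already been accumulated over the whole test set:

```python
def f1_score(num_tp: int, num_pred: int, num_gt: int) -> float:
    """Compute F1 from the counts of TP samples, predictions, and ground-truths."""
    precision = num_tp / num_pred if num_pred else 0.0
    recall = num_tp / num_gt if num_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```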

Entity F1 score

The Entity F1 score is a metric used for the Entity Extraction task (also known as Semantic Entity Recognition, SER). It measures the accuracy of the predicted string and its corresponding category with respect to the ground-truth. When both the predicted string and category match the ground-truth, it is considered a TP sample.

If you are using BIO-tagging models, such as LayoutLM, LiLT, etc., you can utilize the seqeval library for metric calculation.
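
A minimal usage sketch with seqeval; the BIO tag sequences below are made up for illustration and should be replaced with your model's token-level outputs:

```python
from seqeval.metrics import classification_report, f1_score

# One list of BIO tags per document (hypothetical example).
y_true = [["B-QUESTION", "I-QUESTION", "B-ANSWER", "O", "B-HEADER"]]
y_pred = [["B-QUESTION", "I-QUESTION", "B-ANSWER", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level micro F1
print(classification_report(y_true, y_pred))  # per-category precision/recall/F1
```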

Linking F1 score

Used as the metric for the Entity Linking (or Relation Extraction, RE) task. The model takes the ground-truth entities from Entity Extraction as input and predicts the links between them. A link is counted as a TP if and only if the predicted pair exists among the ground-truth pairs.

Pair F1 score

Used as the metric for the end-to-end Pair Extraction task. A prediction is counted as a TP if and only if the predicted key-value pair exactly matches a ground-truth pair.
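
Both the Linking F1 and the Pair F1 score reduce to exact matching over sets of tuples: entity-index pairs for linking, and key-value string pairs for end-to-end pair extraction. A minimal sketch, assuming predictions and ground-truths are already collected as sets:

```python
def set_f1(pred: set, gold: set) -> float:
    """Exact-match F1 over sets of (head, tail) links or (key, value) pairs."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Linking F1: pairs of entity indices.
print(set_f1({(0, 1), (2, 3)}, {(0, 1), (2, 4)}))  # 0.5
# Pair F1: (key, value) string pairs.
print(set_f1({("total", "46,000")}, {("total", "46,000")}))  # 1.0
```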

QA F1 score

This metric is used specifically for LLM-based models, which are evaluated in a question-answering manner.

For Entity Extraction, two types of operations are employed:

  1. The model takes the text content of an entity as input and predicts its corresponding key category. Used for datasets like FUNSD, where each key category contains multiple entities.
  2. The model takes the key category name as input and predicts the corresponding text content. Used for datasets where each key category contains one or no entity.

For Entity Linking, the model takes the question entity as input, then predicts the corresponding answer entity.

⚠️ It is worth noting that the QA F1 score is a more relaxed metric than the conventional settings, since prior information such as the entity span is provided to the model. Therefore, scores obtained through the QA pipeline cannot be directly compared with scores obtained under the conventional settings. In the following tables, these QA scores are listed separately.

Edit Distance Score

The edit distance score between the prediction string and the ground-truth string of a key category is calculated as follows

$$ score = 1 - \frac{i + d + m}{N} $$

where $i$, $d$, $m$, and $N$ denote the number of insertions, the number of deletions, the number of modifications (substitutions), and the total number of instances occurring in the ground truth, respectively.

The document parsing task in CORD employs a tree-level variant of this metric (TED accuracy, see the CORD section below); the zhang-shasha library can be used to calculate the tree edit distance.
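
As a reference, here is a minimal sketch of the string-level score for a single key category, using a plain Levenshtein distance (unit cost for insertions, deletions, and substitutions) and assuming $N$ is the length of the ground-truth string:

```python
def levenshtein(pred: str, gt: str) -> int:
    """Dynamic-programming edit distance with unit insert/delete/substitute costs."""
    dp = list(range(len(gt) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, g in enumerate(gt, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (p != g))  # substitution
    return dp[-1]

def edit_distance_score(pred: str, gt: str) -> float:
    """score = 1 - (i + d + m) / N; some implementations clamp negative values to 0."""
    return 1 - levenshtein(pred, gt) / max(len(gt), 1)
```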


🗒️List of Benchmarks


SROIE

The SROIE dataset takes the entity micro F1-score as the evaluation metric. The dataset contains four key categories, each containing one or no entity. Under this metric, if the predicted string of a key category is consistent with the ground-truth string, it is recorded as a true positive (TP) sample. The total numbers of TP samples, predictions, and ground-truth strings over all categories are then used to compute the micro F1-score.

You can find the evaluation scripts for the SROIE dataset on the ICDAR2019 SROIE official page (Download tab, Task 3 Evaluation script).

| Type | Approach | Precision | Recall | F1 | QA F1 |
| --- | --- | --- | --- | --- | --- |
| Grid-based | ViBERTgrid (BERT-base) | - | - | 96.25 | - |
| | ViBERTgrid (RoBERTa-base) | - | - | 96.40 | - |
| GNN-based | PICK | - | - | 96.12 | - |
| | MatchVIE | - | - | 96.57 | - |
| | GraphDoc | - | - | 98.45 | - |
| | FormNetV2 | - | - | 98.31 | - |
| Large Scale Pre-trained | LayoutLM (base) | 94.38 | 94.38 | 94.38 | - |
| | LayoutLM (large) | 95.24 | 95.24 | 95.24 | - |
| | LayoutLMv2 (base) | 96.25 | 96.25 | 96.25 | - |
| | LayoutLMv2 (large) | 99.04 | 96.61 | 97.81 | - |
| | TILT (base) | - | - | 97.65 | - |
| | TILT (large) | - | - | 98.10 | - |
| | BROS (base) | - | - | 95.91 | - |
| | BROS (large) | - | - | 96.62 | - |
| | StrucTexT (eng-base) | - | - | 96.88 | - |
| | StrucTexT (chn&eng-base) | - | - | 98.27 | - |
| | StrucTexT (chn&eng-large) | - | - | 98.70 | - |
| | WUKONG-READER (base) | - | - | 96.88 | - |
| | WUKONG-READER (large) | - | - | 98.15 | - |
| | ERNIE-layout (large) | - | - | 97.55 | - |
| | QGN | - | - | 97.90 | - |
| | LayoutMask (base) | - | - | 96.87 | - |
| | LayoutMask (large) | - | - | 97.27 | - |
| | HGALayoutLM (base) | 99.58 | 99.48 | 99.53 | - |
| | HGALayoutLM (large) | 99.69 | 99.53 | 99.61 | - |
| End-to-End | TRIE (ground-truth) | - | - | 96.18 | - |
| | TRIE (end-to-end) | - | - | 82.06 | - |
| | VIES (ground-truth) | - | - | 96.12 | - |
| | VIES (end-to-end) | - | - | 91.07 | - |
| | Kuang CFAM (end-to-end) | - | - | 85.87 | - |
| | OmniParser | - | - | 85.60 | - |
| | HIP | - | - | 87.60 | - |
| LLM-based | HRVDA | - | - | - | 91.00 |
| | Monkey | - | - | - | 41.90 |
| | TextMonkey | - | - | - | 47.00 |
| | MiniMonkey | - | - | - | 70.30 |
| | UniDoc (224) | - | - | - | 1.40 |
| | UniDoc (336) | - | - | - | 2.92 |
| | DocPedia (224) | - | - | - | 17.01 |
| | DocPedia (336) | - | - | - | 21.44 |
| | LayoutLLM (Llama2-7B-chat) | - | - | - | 70.97 |
| | LayoutLLM (Vicuna-1.5-7B) | - | - | - | 72.12 |
| Other Methods | TCPN (TextLattice) | - | - | 96.54 | - |
| | TCPN (Tag, ground-truth) | - | - | 95.46 | - |
| | TCPN (Tag, end-to-end) | - | - | 91.21 | - |
| | TCPN (Tag&Copy, end-to-end) | - | - | 91.93 | - |

CORD

The authors of the CORD dataset, the Clova-AI team, have not explicitly specified the task type and evaluation metrics for this dataset. However, upon reviewing the source code of Donut, one of Clova-AI's works, it is apparent that they evaluate the model's performance in Document Structure Parsing. In a typical receipt, various details about the purchased items are provided, such as their names, quantities, and unit prices. These entities have a hierarchical relationship, and a receipt can be represented by a JSON-like structure as shown below:

```json
{
    "menu": [
        {
            "nm": "EGG TART",
            "cnt": "1",
            "price": "13,000"
        },
        {
            "nm": "CHOCO CUS ARD PASTRY",
            "cnt": "2",
            "price": "24,000"
        },
        {
            "nm": "REDBEAN BREAD",
            "cnt": "1",
            "price": "9,000"
        }
    ],
    "total": {
        "total_price": "46,000",
        "cashprice": "50,000",
        "changeprice": "4,000"
    }
}
```

The evaluation metric used by Donut is the TED Acc (Tree Edit Distance Accuracy), which measures the similarity between the predicted JSON and the ground-truth.
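
As an illustration, tree edit distance can be computed with the zss (Zhang-Shasha) library; the tiny trees below are made up, and the exact tree construction and normalization used in Donut's evaluation code may differ:

```python
from zss import Node, simple_distance

# Hypothetical ground-truth and predicted trees built from the parsed JSON.
gt = Node("menu", [Node("nm", [Node("EGG TART")]), Node("cnt", [Node("1")])])
pred = Node("menu", [Node("nm", [Node("EGG TART")]), Node("cnt", [Node("2")])])

ted = simple_distance(gt, pred)        # number of node edit operations
norm = simple_distance(Node(""), gt)   # distance from an empty tree to the ground-truth
print(ted, max(0.0, 1 - ted / norm))   # raw distance and a normalized accuracy
```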

In addition to Document Structure Parsing, Donut also evaluates the model's performance on the Entity Extraction task, using the Entity F1 score as the evaluation metric. Most SOTA models follow this evaluation pipeline.

Another Clova-AI work, SPADE, evaluates the model's performance on Document Structure Parsing through a relaxed structured field F1-score. This evaluation measures the accuracy of dependency parsing by computing the F1 score of predicted edges. The task is simplified by ignoring differences between prediction and ground truth in certain fields (such as store name, menu name, and item name) when the edit distance is less than 2 or when the ratio of the edit distance to the ground-truth string length is at most 0.4. Details can be found in their paper (Section 5.3 and A.2).

Some other works, such as BROS, evaluate the model's performance on Entity Linking using the Linking F1 score.

EE = Entity Extraction, EL = Entity Linking, DSP = Document Structure Parsing; P = precision, R = recall.

| Type | Approach | EE P | EE R | EE F1 | EE QA F1 | EL P | EL R | EL F1 | DSP P | DSP R | DSP F1 | TED Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GNN-based | GraphDoc | - | - | 96.93 | - | - | - | - | - | - | - | - |
| | FormNet | 98.02 | 96.55 | 97.28 | - | - | - | - | - | - | - | - |
| | FormNetV2 | - | - | 97.70 | - | - | - | - | - | - | - | - |
| Large Scale Pre-trained | LayoutLM (base) | 94.37 | 95.08 | 94.72 | - | - | - | - | - | - | - | - |
| | LayoutLM (large) | 94.32 | 95.54 | 94.93 | - | - | - | - | - | - | - | - |
| | LayoutLMv2 (base) | 94.53 | 95.39 | 94.95 | - | - | - | - | - | - | - | - |
| | LayoutLMv2 (large) | 95.65 | 96.37 | 96.01 | - | - | - | - | - | - | - | - |
| | LayoutLMv3 (base) | - | - | 96.56 | - | - | - | - | - | - | - | - |
| | LayoutLMv3 (large) | - | - | 97.46 | - | - | - | - | - | - | - | - |
| | DocFormer (base) | 96.52 | 96.14 | 96.33 | - | - | - | - | - | - | - | - |
| | DocFormer (large) | 97.25 | 96.74 | 96.99 | - | - | - | - | - | - | - | - |
| | TILT (base) | - | - | 95.11 | - | - | - | - | - | - | - | - |
| | TILT (large) | - | - | 96.33 | - | - | - | - | - | - | - | - |
| | BROS (base) | - | - | 96.50 | - | - | - | 95.73 | - | - | - | - |
| | BROS (large) | - | - | 97.28 | - | - | - | 97.40 | - | - | - | - |
| | UDoc | - | - | 96.64 | - | - | - | - | - | - | - | - |
| | UDoc* | - | - | 96.86 | - | - | - | - | - | - | - | - |
| | LiLT ([EN-RoBERTa]base) | - | - | 96.07 | - | - | - | - | - | - | - | - |
| | LiLT ([InfoXLM]base) | - | - | 95.77 | - | - | - | - | - | - | - | - |
| | DocReL | - | - | 97.00 | - | - | - | - | - | - | - | - |
| | WUKONG-READER (base) | - | - | 96.54 | - | - | - | - | - | - | - | - |
| | WUKONG-READER (large) | - | - | 97.27 | - | - | - | - | - | - | - | - |
| | ERNIE-layout (large) | - | - | 96.99 | - | - | - | - | - | - | - | - |
| | QGN | - | - | 96.84 | - | - | - | - | - | - | - | - |
| | GeoLayoutLM | - | - | 97.97 | - | - | - | 99.45 | - | - | - | - |
| | GraphLayoutLM (base) | - | - | 97.28 | - | - | - | - | - | - | - | - |
| | GraphLayoutLM (large) | - | - | 97.75 | - | - | - | - | - | - | - | - |
| | HGALayoutLM (base) | 97.89 | 97.16 | 97.52 | - | - | - | - | - | - | - | - |
| | HGALayoutLM (large) | 97.97 | 97.38 | 97.67 | - | - | - | - | - | - | - | - |
| | DocFormerv2 (base) | 97.51 | 96.10 | 96.80 | - | - | - | - | - | - | - | - |
| | DocFormerv2 (large) | 97.71 | 97.70 | 97.70 | - | - | - | - | - | - | - | - |
| | DocTr | - | - | 98.20 | - | - | - | - | - | - | 94.40 | - |
| | LayoutMask (base) | - | - | 96.99 | - | - | - | - | - | - | - | - |
| | LayoutMask (large) | - | - | 97.19 | - | - | - | - | - | - | - | - |
| End-to-End | Donut | - | - | 84.10 | - | - | - | - | - | - | - | 90.90 |
| | ESP | - | - | 95.65 | - | - | - | - | - | - | - | - |
| | UDOP | - | - | 97.58 | - | - | - | - | - | - | - | - |
| | CREPE | - | - | 85.00 | - | - | - | - | - | - | - | - |
| | OmniParser | - | - | 84.80 | - | - | - | - | - | - | - | 88.00 |
| | HIP | - | - | 85.70 | - | - | - | - | - | - | - | - |
| LLM-based | HRVDA | - | - | - | 89.30 | - | - | - | - | - | - | - |
| | LayoutLLM (Llama2-7B-chat) | - | - | - | 62.21 | - | - | - | - | - | - | - |
| | LayoutLLM (Vicuna-1.5-7B) | - | - | - | 63.10 | - | - | - | - | - | - | - |
| Other Methods | SPADE (♠ CORD, oracle input) | - | - | - | - | - | - | - | - | - | 92.50 | - |
| | SPADE (♠ CORD) | - | - | - | - | - | - | - | - | - | 88.20 | - |
| | SPADE (♠ CORD+) | - | - | - | - | - | - | - | - | - | 87.40 | - |
| | SPADE (♠ CORD++) | - | - | - | - | - | - | - | - | - | 83.10 | - |
| | SPADE (♠ w/o TCM, CORD, oracle input) | - | - | - | - | - | - | - | - | - | 91.50 | - |
| | SPADE (♠ w/o TCM, CORD) | - | - | - | - | - | - | - | - | - | 87.40 | - |
| | SPADE (♠ w/o TCM, CORD+) | - | - | - | - | - | - | - | - | - | 86.10 | - |
| | SPADE (♠ w/o TCM, CORD++) | - | - | - | - | - | - | - | - | - | 82.60 | - |

FUNSD

FUNSD comprises two tasks: Entity Extraction and Entity Linking. The Entity Extraction task requires extracting header, question, and answer entities from the document, and employs Entity F1 Score as the evaluation metric. The Entity Linking task focuses on linking predictions between question and answer entities, and uses Linking F1 Score as the evaluation metric.

It is worth noting that, in most mainstream approaches, these two subtasks are considered independent. For instance, the official Entity Linking implementation of LayoutLM takes the ground-truth question and answer entities as input and predicts the links only, without considering the performance of Entity Extraction.

Real-world applications require extracting all key-value pairs from a document, which involves combining the EE and EL tasks to predict the entire key-value pair content. We term this task End-to-End Pair Extraction. It presents challenges such as error accumulation and text segment aggregation. Regrettably, only a few studies have recognized and addressed these challenges, while the majority of research continues to follow the conventional EE+EL setting. We hope to see more studies that delve into this particular case.

EE = Entity Extraction, EL = Entity Linking, E2E = End-to-End Pair Extraction; P = precision, R = recall.

| Type | Approach | EE P | EE R | EE F1 | EE QA F1 | EL P | EL R | EL F1 | EL QA F1 | E2E P | E2E R | E2E F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grid-based | MSAU-PAF | - | - | 83.00 | - | - | - | - | - | - | - | 75.00 |
| GNN-based | GraphDoc | - | - | 87.77 | - | - | - | - | - | - | - | - |
| | MatchVIE | - | - | 81.33 | - | - | - | - | - | - | - | - |
| | FormNet | - | - | 84.69 | - | - | - | - | - | - | - | - |
| | FormNetV2 | - | - | 92.51 | - | - | - | - | - | - | - | - |
| Large Scale Pre-trained | LayoutLM (base) | 75.97 | 81.55 | 78.66 | - | - | - | - | - | - | - | - |
| | LayoutLM (large) | 75.96 | 82.19 | 78.95 | - | - | - | - | - | - | - | - |
| | LayoutLMv2 (base) | 80.29 | 85.39 | 82.76 | - | - | - | - | - | - | - | - |
| | LayoutLMv2 (large) | 83.24 | 85.19 | 84.20 | - | - | - | - | - | - | - | - |
| | LayoutXLM (base, Language Specific Fine-tuning) | - | - | 79.40 | - | - | - | 54.83 | - | - | - | - |
| | LayoutXLM (large, Language Specific Fine-tuning) | - | - | 82.25 | - | - | - | 64.04 | - | - | - | - |
| | LayoutXLM (base, Multitask Fine-tuning) | - | - | 79.24 | - | - | - | 66.71 | - | - | - | - |
| | LayoutXLM (large, Multitask Fine-tuning) | - | - | 80.68 | - | - | - | 76.83 | - | - | - | - |
| | LayoutLMv3 (base) | - | - | 90.29 | - | - | - | - | - | - | - | - |
| | LayoutLMv3 (large) | - | - | 92.08 | - | - | - | - | - | - | - | - |
| | XYLayoutLM | - | - | 83.35 | - | - | - | - | - | - | - | - |
| | SelfDoc | - | - | 83.36 | - | - | - | - | - | - | - | - |
| | DocFormer (base) | 80.76 | 86.09 | 83.34 | - | - | - | - | - | - | - | - |
| | DocFormer (large) | 82.29 | 86.94 | 84.55 | - | - | - | - | - | - | - | - |
| | StructuralLM-large | 83.52 | - | 85.14 | - | - | - | - | - | - | - | - |
| | BROS (base) | 81.16 | 85.02 | 83.05 | - | - | - | 71.46 | - | - | - | - |
| | BROS (large) | 82.81 | 86.31 | 84.52 | - | - | - | 77.01 | - | - | - | - |
| | StrucTexT (eng-base) | - | - | 83.09 | - | - | - | 44.10 | - | - | - | - |
| | StrucTexT (chn&eng-base) | - | - | 84.83 | - | - | - | 70.45 | - | - | - | - |
| | StrucTexT (chn&eng-large) | - | - | 87.56 | - | - | - | 74.21 | - | - | - | - |
| | UDoc | - | - | 87.96 | - | - | - | - | - | - | - | - |
| | UDoc* | - | - | 87.93 | - | - | - | - | - | - | - | - |
| | LiLT ([EN-RoBERTa]base) | 87.21 | 89.65 | 88.41 | - | - | - | - | - | - | - | - |
| | LiLT ([InfoXLM]base) | 84.67 | 87.09 | 85.86 | - | - | - | - | - | - | - | - |
| | LiLT ([InfoXLM]base, Language Specific Fine-tuning) | - | - | 84.15 | - | - | - | 62.76 | - | - | - | - |
| | LiLT ([InfoXLM]base, Multitask Fine-tuning) | - | - | 85.74 | - | - | - | 74.07 | - | - | - | - |
| | DocReL | - | - | - | - | - | - | 46.10 | - | - | - | - |
| | WUKONG-READER (base) | - | - | 91.52 | - | - | - | - | - | - | - | - |
| | WUKONG-READER (large) | - | - | 93.62 | - | - | - | - | - | - | - | - |
| | ERNIE-layout (large) | - | - | 93.12 | - | - | - | - | - | - | - | - |
| | GeoLayoutLM | - | - | 92.86 | - | - | - | 89.45 | - | - | - | - |
| | KVPFormer | - | - | - | - | - | - | 90.86 | - | - | - | - |
| | GraphLayoutLM (base) | - | - | 93.15 | - | - | - | - | - | - | - | - |
| | GraphLayoutLM (large) | - | - | 94.39 | - | - | - | - | - | - | - | - |
| | HGALayoutLM (base) | 94.84 | 93.80 | 94.32 | - | - | - | - | - | - | - | - |
| | HGALayoutLM (large) | 95.67 | 94.95 | 95.31 | - | - | - | - | - | - | - | - |
| | DocFormerv2 (base) | 89.15 | 87.60 | 88.37 | - | - | - | - | - | - | - | - |
| | DocFormerv2 (large) | 89.88 | 87.92 | 88.89 | - | - | - | - | - | - | - | - |
| | DocTr | - | - | 84.00 | - | - | - | 73.90 | - | - | - | - |
| | LayoutMask (base) | - | - | 92.91 | - | - | - | - | - | - | - | - |
| | LayoutMask (large) | - | - | 93.20 | - | - | - | - | - | - | - | - |
| End-to-End | ESP | - | - | 91.12 | - | - | - | 88.88 | - | - | - | - |
| | UDOP | - | - | 91.62 | - | - | - | - | - | - | - | - |
| | HIP | - | - | 52.00 | - | - | - | - | - | - | - | - |
| LLM-based | Monkey | - | - | - | - | - | - | - | 24.10 | - | - | - |
| | TextMonkey | - | - | - | - | - | - | - | 32.30 | - | - | - |
| | MiniMonkey | - | - | - | - | - | - | - | 42.90 | - | - | - |
| | UniDoc (224) | - | - | - | - | - | - | - | 1.19 | - | - | - |
| | UniDoc (336) | - | - | - | - | - | - | - | 1.02 | - | - | - |
| | DocPedia (224) | - | - | - | - | - | - | - | 18.75 | - | - | - |
| | DocPedia (336) | - | - | - | - | - | - | - | 29.86 | - | - | - |
| | LayoutLLM (Llama2-7B-chat) | - | - | - | - | - | - | - | 78.65 | - | - | - |
| | LayoutLLM (Vicuna-1.5-7B) | - | - | - | - | - | - | - | 79.98 | - | - | - |
| Other Methods | SPADE | - | - | 71.60 | - | - | - | 41.30 | - | - | - | - |

XFUND

XFUND is a multilingual extension of FUNSD, covering 7 languages: Chinese, Japanese, Spanish, French, Italian, German, and Portuguese. It contains 1,393 fully annotated forms, with 199 forms per language: 149 for training and 50 for testing. XFUND also includes two subtasks, Entity Extraction and Entity Linking, and follows the same evaluation protocol as FUNSD.

Note: In the following table, Avg. represents the average score over the 7 non-English subsets. Some methods include the English subset in their reported average scores; to ensure a fair comparison, we adjusted these averages accordingly.

EE = Entity Extraction, EL = Entity Linking.

| Type | Approach | EE ZH | EE JA | EE ES | EE FR | EE IT | EE DE | EE PT | EE Avg. | EL ZH | EL JA | EL ES | EL FR | EL IT | EL DE | EL PT | EL Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Large Scale Pre-trained | LayoutXLM (base, Language Specific Fine-tuning) | 89.24 | 79.21 | 75.50 | 79.02 | 80.02 | 82.22 | 79.03 | 82.40 | 70.73 | 69.63 | 68.96 | 63.53 | 64.15 | 65.51 | 57.18 | 65.67 |
| | LayoutXLM (large, Language Specific Fine-tuning) | 91.61 | 80.33 | 78.30 | 80.98 | 82.75 | 83.61 | 82.73 | 82.90 | 78.88 | 72.25 | 76.66 | 71.02 | 76.91 | 68.43 | 67.96 | 73.16 |
| | LayoutXLM (base, Zero-shot transfer) | 60.19 | 47.15 | 45.65 | 57.57 | 48.46 | 52.52 | 53.90 | 52.21 | 44.94 | 44.08 | 47.08 | 44.16 | 40.90 | 38.20 | 36.85 | 42.31 |
| | LayoutXLM (large, Zero-shot transfer) | 68.96 | 51.90 | 49.76 | 61.35 | 55.17 | 59.05 | 60.77 | 58.14 | 55.31 | 56.96 | 57.80 | 56.15 | 51.84 | 48.90 | 47.95 | 53.56 |
| | LayoutXLM (base, Multitask Fine-tuning) | 89.73 | 79.64 | 77.98 | 81.73 | 82.10 | 83.22 | 82.41 | 82.40 | 82.41 | 81.42 | 81.04 | 82.21 | 83.10 | 78.54 | 70.44 | 79.88 |
| | LayoutXLM (large, Multitask Fine-tuning) | 91.55 | 82.16 | 80.55 | 83.84 | 83.72 | 85.30 | 86.50 | 84.80 | 90.00 | 86.21 | 85.92 | 86.69 | 86.75 | 82.63 | 81.60 | 85.69 |
| | XYLayoutLM | 91.76 | 80.57 | 76.87 | 79.97 | 81.75 | 83.35 | 80.01 | 82.04 | 74.45 | 70.59 | 72.59 | 65.21 | 65.72 | 67.03 | 58.98 | 67.79 |
| | LiLT ([InfoXLM]base, Language Specific Fine-tuning) | 89.38 | 79.64 | 79.11 | 79.53 | 83.76 | 82.31 | 82.20 | 82.27 | 72.97 | 70.37 | 71.95 | 69.65 | 70.43 | 65.58 | 58.74 | 68.53 |
| | LiLT ([InfoXLM]base, Zero-shot transfer) | 61.52 | 51.84 | 51.01 | 59.23 | 53.71 | 60.13 | 63.25 | 57.24 | 47.64 | 50.81 | 49.68 | 52.09 | 46.97 | 41.69 | 42.72 | 47.37 |
| | LiLT ([InfoXLM]base, Multi-task Fine-tuning) | 90.47 | 80.88 | 83.40 | 85.77 | 87.92 | 87.69 | 84.93 | 85.86 | 84.71 | 83.45 | 83.35 | 84.66 | 84.58 | 78.78 | 76.43 | 82.28 |
| | KVPFormer | - | - | - | - | - | - | - | - | 94.27 | 94.23 | 95.23 | 97.19 | 94.11 | 92.41 | 92.19 | 94.23 |
| | HGALayoutLM | 94.22 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| End-to-End | ESP (Language Specific Fine-tuning) | 90.30 | 81.10 | 85.40 | 90.50 | 88.90 | 87.20 | 87.50 | 87.30 | 90.80 | 88.30 | 85.20 | 90.90 | 90.00 | 85.20 | 86.20 | 88.10 |
| | ESP (Multitask Fine-tuning) | - | - | - | - | - | - | - | 89.13 | - | - | - | - | - | - | - | 92.31 |

EPHOIE

EPHOIE consists of 11 key categories for Entity Extraction and takes the Entity F1 as the evaluation metric. If the predicted string of a key category is consistent with the ground-truth string and not empty, it will be recorded as a TP sample.

| Type | Approach | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Grid-based | MatchVIE | - | - | 96.87 |
| Large-Scale Pre-trained | StrucTexT (chn&eng-base) | - | - | 98.84 |
| | StrucTexT (chn&eng-large) | - | - | 99.30 |
| | LiLT ([InfoXLM]base) | 96.99 | 98.20 | 97.59 |
| | LiLT ([ZH-RoBERTa]base) | 97.62 | 98.33 | 97.97 |
| | QGN | - | - | 98.49 |
| End-to-End | VIES (ground-truth) | - | - | 95.23 |
| | VIES (end-to-end) | - | - | 83.81 |
| Other Methods | TCPN (TextLattice) | - | - | 98.06 |
| | TCPN (Copy Mode, end-to-end) | - | - | 84.67 |
| | TCPN (Tag Mode, end-to-end) | - | - | 86.19 |
| | TCPN (Tag Mode, ground-truth) | - | - | 97.59 |

DeepForm

| Type | Approach | QA F1 |
| --- | --- | --- |
| End-to-End | Donut | 61.60 |
| LLM-based | Qwen-VL | 4.10 |
| | Monkey | 40.60 |
| | mPLUG-DocOwl | 42.60 |
| | mPLUG-DocOwl 1.5 (DocOwl-1.5) | 68.80 |
| | mPLUG-DocOwl 1.5 (DocOwl-1.5 chat) | 68.80 |
| | UReader | 49.50 |

Kleister Charity

Kleister Charity (KLC) contains 8 key categories. It comprises 2,788 financial reports with 61,643 pages in total. This benchmark is commonly used by LLM-based approaches in a QA manner.

| Type | Approach | QA F1 |
| --- | --- | --- |
| End-to-End | Donut | 30.00 |
| LLM-based | Qwen-VL | 15.90 |
| | Monkey | 32.80 |
| | mPLUG-DocOwl | 30.30 |
| | mPLUG-DocOwl 1.5 (DocOwl-1.5) | 37.90 |
| | mPLUG-DocOwl 1.5 (DocOwl-1.5 chat) | 38.70 |
| | UReader | 32.80 |
| | DoCo (Qwen-VL-Chat) | 33.80 |
| | DoCo (mPLUG-Owl) | 32.90 |