[Competition Organizer] [Problem] [Demo]
The model was built for the AI/ML modeling competition hosted by KISTI (Korea Institute of Science and Technology Information). The main task is to classify sentences from research papers written in Korean, which have been tagged based on their rhetorical meaning.
- Problem Serving
- Paper Reviews
- Prerequisites
- Command Line Interface
- Performance
- Acknowledgement
- Notes
- Citation
- References
The problem has the following hierarchical categories.
- Research purpose
  - Problem definition
  - Hypothesis
  - Technology definition
- Research method
  - Suggestion
  - The target data
  - Data processing
  - Theory / models
- Research result
  - Performance / effects
  - Follow-up research
To solve the problem effectively, I decided to train the model in a contrastive learning manner. You can use the following pre-trained models: KorSci-BERT, KorSci-ELECTRA, and other BERT-, ELECTRA- and RoBERTa-based models from Hugging Face.
I have used klue/roberta-base as an additional pre-trained model.
[arXiv] - Supervised Contrastive Learning
Classic contrastive learning is self-supervised: the model learns to distinguish between different objects, but struggles to group objects that share the same label.
The paper proposes a supervised variant for when labels are available.
I use the contrastive loss from the paper, together with a separation of pre-training and fine-tuning:
- Perform pre-training in a representation learning manner.
- To perform fine-tuning, cut off the representation projection layer and attach new classifier layers.
This gave me a significant improvement in performance and convergence speed.
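For reference, the following is a minimal TensorFlow sketch of the batch-wise supervised contrastive loss described in the paper. It is illustrative only; the function name and details are my own and it is not the exact implementation used in run_pretraining.py.

import tensorflow as tf

def supervised_contrastive_loss(features, labels, temperature=0.07):
    # features: (batch, dim) representation vectors, labels: (batch,) integer class labels.
    features = tf.math.l2_normalize(features, axis=1)
    batch_size = tf.shape(features)[0]
    # Pairwise cosine similarities scaled by temperature.
    logits = tf.matmul(features, features, transpose_b=True) / temperature
    # Mask of positive pairs (same label), excluding self-comparisons.
    labels = tf.reshape(labels, (-1, 1))
    positive_mask = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)
    self_mask = tf.eye(batch_size)
    positive_mask = positive_mask * (1.0 - self_mask)
    # Log-softmax over all other samples in the batch.
    exp_logits = tf.exp(logits) * (1.0 - self_mask)
    log_prob = logits - tf.math.log(tf.reduce_sum(exp_logits, axis=1, keepdims=True) + 1e-12)
    # Average log-probability over each anchor's positives, then over anchors.
    num_positives = tf.maximum(tf.reduce_sum(positive_mask, axis=1), 1.0)
    mean_log_prob_pos = tf.reduce_sum(positive_mask * log_prob, axis=1) / num_positives
    return -tf.reduce_mean(mean_log_prob_pos)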
[arXiv] - Contrastive Out-of-Distribution Detection for Pretrained Transformers
Contrastive representation learning is powerful, but pushing every label away from every other label is not that easy.
Maximizing the margin between representations is very helpful for clarifying the decision boundaries between them.
Although the paper proposes this for out-of-distribution detection, experimenting with clearer decision boundaries on other tasks is reasonable.
[arXiv] - Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework
Since the dataset has sub-level categories, the model can learn about relationships between top-level categories and between sub-level categories at the same time.
The paper suggests a training strategy that pulls together samples within the same category at each level, pulling more strongly the lower (finer) the level is. A rough illustration of the idea is sketched below.
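As a rough illustration (assuming the supervised_contrastive_loss sketch above and hypothetical per-level label tensors), the same loss can be applied at every hierarchy level with larger weights at finer levels. The exact penalty schedule in the paper differs; this is only a sketch.

def hierarchical_contrastive_loss(features, labels_per_level):
    # labels_per_level: list of (batch,) integer tensors, index 0 = top-level categories,
    # index 1 = sub-level categories. Finer levels receive larger weights so that
    # samples sharing a subcategory are pulled together more strongly.
    total = 0.0
    for level, level_labels in enumerate(labels_per_level):
        weight = float(level + 1)  # illustrative weighting, not the paper's exact schedule
        total += weight * supervised_contrastive_loss(features, level_labels)
    return total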
All prerequisites must be up to date. W&B is always required to run the pre-training and fine-tuning scripts. Python 3.8 or above and CUDA 11.4 or above are required.
Install the following main packages manually, or use requirements.txt. You must install MeCab on your system before installing the Python packages. Visit the official KoNLPy documentation for the installation guide.
- tensorflow
- tensorflow_addons
- tensorflow-serving-api
- torch
- transformers
- scikit-learn
- pandas
- wandb # for parameter tracking
- konlpy # for mecab
- soynlp # for text normalization
- rich # for text highlighting
- flask # for middleware api
pip install -r requirements.txt
W&B Sweeps configurations are available in the ./sweeps directory.
Run automatic hyperparameter tuning by (for example):
wandb sweep ./sweeps/pretraining_supervised.yaml
Then run the agent:
wandb agent "{entity_name}/CoRT Pre-training/{sweep_id}"
To find out how to prepare pre-trained backbones for pre-training, read the Pre-trained Backbones README.
Use build_pretraining_data.py to create a pre-training dataset from raw texts. It has the following arguments:
- --filepath: Location of the raw texts dump that is available at KISTI.
- --model_name: Model name to be used as the pre-trained backbone. korscibert and korscielectra are available by default.
- --output_dir: Destination directory path to write out the tfrecords.
- --korscibert_vocab: Location of the KorSci-BERT vocabulary file. (optional)
- --korscielectra_vocab: Location of the KorSci-ELECTRA vocabulary file. (optional)
- --num_processes: Parallelize tokenization across multiple processes. (4 as default)
- --num_k_fold: Number of K-Fold splits. (10 as default)
- --test_size: Rate of the testing dataset. (0.0 as default)
- --seed: Seed of the random state. (42 as default)
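For example (a hypothetical invocation; the raw texts path is a placeholder):
python build_pretraining_data.py --filepath {raw_texts_path} --model_name korscibert --output_dir ./data/tfrecords --num_processes 4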
Use run_pretraining.py to pre-train the backbone model in a representation learning manner. It has the following arguments:
- --gpu: GPU to be utilized for training. ('all' as default; must be an int otherwise)
- --batch_size: Size of the mini-batch. (64 as default)
- --learning_rate: Learning rate. (1e-3 as default)
- --lr_fn: Learning rate scheduler type. ('cosine_decay' as default; 'constant', 'cosine_decay', 'polynomial_decay' and 'linear_decay' are available)
- --weight_decay: Rate of weight decay. (1e-6 as default)
- --warmup_rate: Rate of learning rate warmup at the beginning. (0.06 as default; the total number of warmup steps is int(num_train_steps * warmup_rate))
- --repr_size: Size of the representation projection layer units. (1024 as default)
- --gradient_accumulation_steps: Multiplier for gradient accumulation. (1 as default)
- --model_name: Model name to be used as the pre-trained backbone.
- --num_train_steps: Total number of training steps. (10000 as default)
- --loss_base: Name of the loss function for contrastive learning. ('margin' as default; 'margin', 'supervised' and 'hierarchical' are available)
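For example (one plausible configuration, not a prescribed one):
python run_pretraining.py --model_name korscibert --batch_size 64 --learning_rate 1e-3 --num_train_steps 10000 --loss_base supervised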
Pre-training takes 3 ~ 4 hours to complete on an NVIDIA A100.
When pre-training is completed, all checkpoints will be located in pretraining-checkpoints/{wandb_run_id}.
Use run_finetuning.py to fine-tune the pre-trained models. It has the following arguments:
- --gpu: GPU to be utilized for training. ('all' as default; must be an int otherwise)
- --batch_size: Size of the mini-batch. (64 as default)
- --learning_rate: Learning rate. (1e-3 as default)
- --lr_fn: Learning rate scheduler type. ('cosine_decay' as default; 'constant', 'cosine_decay', 'polynomial_decay' and 'linear_decay' are available)
- --weight_decay: Rate of weight decay. (1e-6 as default; I recommend 0 when fine-tuning)
- --warmup_rate: Rate of learning rate warmup at the beginning. (0.06 as default; the total number of warmup steps is int(epochs * steps_per_epoch * warmup_rate))
- --repr_size: Size of the classifier dense layer. (1024 as default)
- --model_name: Model name to be used as the pre-trained backbone.
- --pretraining_run_name: W&B Run ID in pretraining-checkpoints. The pre-trained checkpoint must belong to the same model as --model_name.
- --epochs: Number of training epochs. (10 as default)
- --repr_act: Name of the activation function to be used after the classifier dense layer. ('tanh' as default; 'none' and other activation names supported by TensorFlow are available)
- --concat_hidden_states: Number of hidden states to concatenate. (1 as default)
- --loss_base: Name of the loss function for contrastive learning. ('margin' as default; 'margin', 'supervised' and 'hierarchical' are available)
- --restore_checkpoint: Name of the checkpoint file. (None as default; I recommend 'latest' when fine-tuning)
- --repr_classifier: Type of the classification head. ('seq_cls' as default; 'seq_cls' and 'bi_lstm' are available)
- --repr_preact: Boolean to use pre-activation when activating representation logits. (True as default)
- --train_at_once: Boolean to train the model from scratch without pre-training. (False as default)
- --repr_finetune: Boolean to fine-tune the model with additional representation learning. (False as default)
- --include_sections: Boolean to use the 'representation logits of sections' in the label representation logits. (False as default; --repr_finetune True is required for this)
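For example (again, a plausible configuration rather than a prescribed one; the run ID is a placeholder):
python run_finetuning.py --model_name korscibert --pretraining_run_name {wandb_run_id} --restore_checkpoint latest --weight_decay 0 --epochs 10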
Use run_inference.py to perform inference on fine-tuned models. It has the following arguments:
- --checkpoint_path: Location of the trained model checkpoint. (Required when a gRPC server is not provided)
- --model_name: Name of the pre-trained model. (One of korscibert, korscielectra, and Hugging Face models is allowed)
- --tfrecord_path: Location of the TFRecord file for inference. {model_name} is a placeholder.
- --repr_classifier: Name of the classification head for the classifier. (One of 'seq_cls' and 'bi_lstm' is allowed)
- --repr_act: Name of the activation function for the representation. (One of 'tanh' and 'gelu' is allowed)
- --concat_hidden_states: Number of hidden states to concatenate. (1 as default)
- --batch_size: Size of the mini-batch. (64 as default)
- --max_position_embeddings: Maximum number of position embeddings. (512 as default)
- --repr_size: Number of representation dense units. (1024 as default)
- --num_labels: Number of labels. (9 as default)
- --interactive: Interactive mode for real-time inference. (False as default)
- --grpc_server: Address of the TFServing gRPC API endpoint. Specify this argument when a gRPC API is available. (None as default)
- --model_spec_name: Name of the model spec. ('cort' as default)
- --signature_name: Name of the signature of the SavedModel. ('serving_default' as default)
Perform inference for metrics by (for example):
python run_inference.py --checkpoint_path ./finetuning-checkpoints/wandb_run_id/ckpt-0 --tfrecord_path ./data/tfrecords/{model_name}/valid.fold-1-of-10.tfrecord --concat_hidden_states 2 --repr_act tanh --repr_classifier bi_lstm --repr_size 1024
--concat_hidden_states, --repr_act, --repr_classifier and --repr_size must match the configuration used for the fine-tuned model's architecture.
CoRT supports TensorFlow Serving on Docker. Use configure_docker_image.py to prepare the components for a Docker container. It has the following arguments:
- --checkpoint_path: Location of the trained model checkpoint. (Required)
- --saved_model_dir: Location where the SavedModel will be stored. ('./models' as default)
- --model_spec_name: Name of the model spec. ('cort' as default)
- --model_spec_version: Version of the model spec. ('1' as default)
- --signature_name: Name of the signature of the SavedModel. ('serving_default' as default)
- --model_name: Name of the pre-trained model. (One of korscibert, korscielectra, and Hugging Face models is allowed)
- --tfrecord_path: Location of the TFRecord file for warmup requests. {model_name} is a placeholder.
- --num_warmup_requests: Number of warmup requests. Pass 0 to skip. (10 as default)
- --repr_classifier: Name of the classification head for the classifier. (One of 'seq_cls' and 'bi_lstm' is allowed)
- --repr_act: Name of the activation function for the representation. (One of 'tanh' and 'gelu' is allowed)
- --concat_hidden_states: Number of hidden states to concatenate. (1 as default)
- --repr_size: Number of representation dense units. (1024 as default)
- --num_labels: Number of labels. (9 as default)
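For example (a hypothetical invocation; the checkpoint path and run ID are placeholders):
python configure_docker_image.py --checkpoint_path ./finetuning-checkpoints/{wandb_run_id}/ckpt-0 --model_name korscibert --tfrecord_path ./data/tfrecords/{model_name}/valid.fold-1-of-10.tfrecord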
Once configuration is done, run the following commands to build and run the Docker container.
nvidia-docker build -t cort/serving:latest -f ./docker/serving/Dockerfile .
docker run -d -p 8500:8500 --name cort-grpc-server cort/serving
The intermediate API middleware is written in Flask. Use run_flask_middleware.py to open an HTTP server that communicates with the gRPC backend directly. It has the following arguments:
- --host: Listening address for the Flask server. ('0.0.0.0' as default)
- --port: Port for the Flask server. (8080 as default)
- --grpc_server: Address of the TFServing gRPC API endpoint. ('localhost:8500' as default)
- --model_name: Name of the pre-trained model. (One of korscibert, korscielectra, and Hugging Face models is allowed)
- --model_spec_name: Name of the model spec. ('cort' as default)
- --signature_name: Name of the signature of the SavedModel. ('serving_default' as default)
Use POST http://127.0.0.1:8080/predict to request a prediction over HTTP.
POST http://127.0.0.1:8080/predict
Content-Type: application/json
{"sentence": "<sentence>"}
The Flask server can also be run in a Docker container.
docker build -t cort/flask:latest -f ./docker/middleware/Dockerfile .
docker run -d -p 8080:8080 --name cort-flask-server cort/flask
To make these Docker containers communicate with each other, create a network and connect them via the following commands.
docker network create cort
docker run -d -p 8500:8500 --name cort-grpc-server --network cort cort/serving
docker run -d -p 8080:8080 --name cort-flask-server --network cort --env GRPC_SERVER=cort-grpc-server:8500 cort/flask
For people who are unfamiliar with this, the middleware also provides a static website. Visit http://127.0.0.1:8080/site to try it out easily.
LAN (Label Attention Network) was proposed at the 2021 KISTI AI/ML Competition.
Sentence Concat and Encoder Concat were proposed by Changwon National Univ. and KISTI researchers.
Model | Macro F1-score | Accuracy |
---|---|---|
W/o Contrastive Loss (KorSci-BERT) | 81.35 | 82.21 |
Sentence Concat (KLUE BERT base) | 70.85 | 88.77 |
Encoder Concat (KLUE BERT base) | 71.91 | 88.59 |
LAN (KorSci-BERT) | 89.95 | 89.76 |
LAN (KLUE RoBERTa base) | 90.00 | 89.85 |
CoRT (KLUE RoBERTa base) | 90.50 | 90.17 |
CoRT (KorSci-BERT) | 90.42 | 90.25 |
CoRT shows better overall scores than the baseline models despite its smaller model architecture.
CoRT was created with GPU support from the KISTI National Supercomputing Center (KSC) Neuron free trial. Two NVIDIA A100 GPUs were used for pre-training, and two NVIDIA V100 GPUs were used for fine-tuning.
I don't recommend using KorSci-ELECTRA because of its very high [UNK] token rate (about 85.2%).
Model | Number of [UNK] | Total Tokens | [UNK] Rate |
---|---|---|---|
klue/roberta-base | 2,734 | 9,269,131 | 0.000295 |
KorSci-BERT | 14,237 | 9,077,386 | 0.001568 |
KorSci-ELECTRA | 7,345,917 | 8,621,489 | 0.852047 |
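As a rough sketch of how such a rate can be measured (an assumed methodology, shown with a Hugging Face tokenizer; the KorSci backbones ship their own tokenizers and vocabulary files):

from transformers import AutoTokenizer

def unk_rate(model_name_or_path, sentences):
    # Count [UNK] tokens produced by a tokenizer over a list of sentences.
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    num_unk, num_total = 0, 0
    for sentence in sentences:
        ids = tokenizer(sentence, add_special_tokens=False)["input_ids"]
        num_unk += sum(1 for token_id in ids if token_id == tokenizer.unk_token_id)
        num_total += len(ids)
    return num_unk / max(num_total, 1)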
If you use this code for research, please cite:
@misc{CoRT2022,
author = {OrigamiDream},
title = {CoRT: Contrastive Rhetorical Tagging},
year = {2022},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/OrigamiDream/CoRT}}
}
- Khosla et al., "Supervised Contrastive Learning", 2020
- Zhou et al., "Contrastive Out-of-Distribution Detection for Pretrained Transformers", 2021
- Zhang et al., "Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework", 2022
- Seong et al., "Rhetorical Sentence Classification Using Context Information", 2021
- Kim et al., "Fine-grained Named Entity Recognition using Hierarchical Label Embedding", 2021