Modelzoo
With the help of UER, we pre-trained models of different properties (for example, models based on different corpora, encoders, and targets). All pre-trained weights introduced in this section are in UER format and can be loaded by UER directly. More pre-trained weights will be released in the near future. Unless otherwise noted, Chinese pre-trained models use the BERT tokenizer and models/google_zh_vocab.txt as the vocabulary (the vocabulary used in the original BERT project), and models/bert/base_config.json as the default configuration file. Commonly used vocabulary and configuration files are included in the models/ folder, so users do not need to download them. In addition, we use scripts/convert_xxx_from_uer_to_huggingface.py to convert pre-trained weights into the format supported by Huggingface Transformers and upload them to the Huggingface model hub (uer). In the rest of this section, we provide download links for the pre-trained weights and show how to use them. Note that, due to space constraints, more details of each pre-trained weight are given on its Huggingface model hub page; we provide the corresponding link when we introduce the weight.
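For example, one of the converted weights can be loaded with Huggingface Transformers as follows. This is a minimal sketch; the model id uer/chinese_roberta_L-2_H-128 is assumed to be the converted RoBERTa-Tiny weight introduced below, and should be replaced with the id of the weight you need:
# Minimal sketch: load a UER weight that has been converted to Huggingface format
# and uploaded to the uer organization on the Huggingface model hub.
# The model id below is an assumption; replace it with the weight you need.
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("uer/chinese_roberta_L-2_H-128")
model = BertModel.from_pretrained("uer/chinese_roberta_L-2_H-128")
inputs = tokenizer("北京是中国的首都。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)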
This is the set of 24 Chinese RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. We only provide configuration files for the Tiny, Mini, Small, Medium, Base, and Large models. To load the other models, we need to modify emb_size, feedforward_size, hidden_size, heads_num, and layers_num in the configuration file. Notice that emb_size = hidden_size, feedforward_size = 4 * hidden_size, and heads_num = hidden_size / 64; a minimal sketch of deriving such a configuration is shown below. More details of these pre-trained weights are discussed here.
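As a sketch (the output file name is illustrative), a configuration for a size without a provided file, such as L=10/H=256, can be derived from an existing configuration file like this:
# Sketch: derive a configuration for a size without a provided file (here L=10, H=256)
# by copying an existing configuration and overriding the five fields mentioned above.
# The output file name l10_h256_config.json is illustrative.
import json
layers_num, hidden_size = 10, 256
with open("models/bert/base_config.json") as f:
    config = json.load(f)
config["emb_size"] = hidden_size              # emb_size = hidden_size
config["hidden_size"] = hidden_size
config["feedforward_size"] = 4 * hidden_size  # feedforward_size = 4 * hidden_size
config["heads_num"] = hidden_size // 64       # heads_num = hidden_size / 64
config["layers_num"] = layers_num
with open("models/bert/l10_h256_config.json", "w") as f:
    json.dump(config, f, indent=4)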
The pre-trained Chinese weight links of different layers (L) and hidden sizes (H):
| | H=128 | H=256 | H=512 | H=768 |
---|---|---|---|---|
L=2 | 2/128 (Tiny) | 2/256 | 2/512 | 2/768 |
L=4 | 4/128 | 4/256 (Mini) | 4/512 (Small) | 4/768 |
L=6 | 6/128 | 6/256 | 6/512 | 6/768 |
L=8 | 8/128 | 8/256 | 8/512 (Medium) | 8/768 |
L=10 | 10/128 | 10/256 | 10/512 | 10/768 |
L=12 | 12/128 | 12/256 | 12/512 | 12/768 (Base) |
Take the Tiny weight as an example: we download it through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor mlm
python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
--output_model_path models/output_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
--data_processor mlm --target mlm
or fine-tune it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--test_path datasets/douban_book_review/test.tsv \
--learning_rate 3e-4 --epochs_num 8 --batch_size 64
In the fine-tuning stage, pre-trained models of different sizes usually require different hyper-parameters. An example of using grid search to find the best hyper-parameters:
python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/tiny_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64
We can reproduce the experimental results reported here through the above grid search script.
This is the set of 5 Chinese word-based RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. Google sentencepiece is used as the tokenizer, and models/cluecorpussmall_spm.model is used as the sentencepiece model. Most Chinese pre-trained weights are based on Chinese characters. Compared with character-based models, word-based models are faster (because of shorter sequence lengths) and achieve better performance according to our experimental results. More details of these pre-trained weights are discussed here.
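As a quick check, here is a minimal sketch of tokenizing a sentence with the sentencepiece model (the example sentence is arbitrary):
# Sketch: inspect the word-based segmentation produced by the sentencepiece model
# that the word-based RoBERTa weights rely on. The example sentence is arbitrary.
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("models/cluecorpussmall_spm.model")
pieces = sp.EncodeAsPieces("这本书的内容非常精彩。")
print(pieces)       # word-level pieces, usually fewer than the number of characters
print(len(pieces))  # shorter sequences are what make word-based models faster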
The pre-trained Chinese weight links of different sizes:
Link |
---|
L=2/H=128 (Tiny) |
L=4/H=256 (Mini) |
L=4/H=512 (Small) |
L=8/H=512 (Medium) |
L=12/H=768 (Base) |
Take the word-based Tiny weight as an example: we download it through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:
python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/cluecorpussmall_spm.model \
--dataset_path dataset.pt --processes_num 8 --data_processor mlm
python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model --config_path models/bert/tiny_config.json \
--output_model_path models/output_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
--data_processor mlm --target mlm
or fine-tune it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model \
--config_path models/bert/tiny_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--test_path datasets/douban_book_review/test.tsv \
--learning_rate 3e-4 --epochs_num 8 --batch_size 64
An example of using grid search to find the best hyper-parameters for the word-based model:
python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model \
--config_path models/bert/tiny_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64
We can reproduce the experimental results reported here through the above grid search script.
This is the set of Chinese GPT-2 pre-trained weights. Configuration files are in the models/gpt2/ folder.
The links and detailed descriptions (Huggingface model hub) of different pre-trained GPT-2 weights:
Notice that extended vocabularies (models/google_zh_poem_vocab.txt and models/google_zh_ancient_vocab.txt) are used in the Poem and Ancient GPT-2 models. The CLUECorpusSmall GPT-2-distil model uses the models/gpt2/distil_config.json configuration file; models/gpt2/config.json is used for the other weights.
Take the CLUECorpusSmall GPT-2-distil weight as an example: we download it through the above link and put it in the models/ folder. We can either conduct further pre-training upon it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 \
--seq_length 128 --data_processor lm
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--output_model_path models/book_review_gpt2_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-5 --batch_size 64
or fine-tune it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--train_path datasets/douban_book_review/train.tsv \
--dev_path datasets/douban_book_review/dev.tsv \
--test_path datasets/douban_book_review/test.tsv \
--learning_rate 3e-5 --epochs_num 8 --batch_size 64
The GPT-2 model can be used for text generation. First, we create story_beginning.txt and write the beginning of a text in it. Then we use scripts/generate_lm.py to perform text generation:
python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--test_path story_beginning.txt --prediction_path story_full.txt \
--seq_length 128
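Since the converted GPT-2 weights are also uploaded to the Huggingface model hub, text generation can alternatively be done with Transformers directly. This is a minimal sketch; the model id uer/gpt2-distil-chinese-cluecorpussmall is assumed to be the converted CLUECorpusSmall GPT-2-distil weight:
# Sketch: text generation with the converted GPT-2-distil weight through
# Huggingface Transformers. The model id below is an assumption; replace it
# with the converted weight you want to use.
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
text_generator = TextGenerationPipeline(model, tokenizer)
print(text_generator("这是很久之前的事情了", max_length=100, do_sample=True))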
This is the set of Chinese T5 pre-trained weights. Configuration files are in the models/t5/ folder.
The links and detailed descriptions (Huggingface model hub) of different pre-trained T5 weights:
Model link | Description link |
---|---|
CLUECorpusSmall T5-small | https://huggingface.co/uer/t5-small-chinese-cluecorpussmall |
CLUECorpusSmall T5-base | https://huggingface.co/uer/t5-base-chinese-cluecorpussmall |
Take the CLUECorpusSmall T5-small weight as an example: we download it through the above link and put it in the models/ folder. We can conduct further pre-training upon it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path dataset.pt \
--processes_num 8 --seq_length 128 \
--dynamic_masking --data_processor t5
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--output_model_path models/book_review_t5_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-4 --batch_size 64 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
or fine-tune it on a downstream dataset:
python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--train_path datasets/tnews_text2text/train.tsv \
--dev_path datasets/tnews_text2text/dev.tsv \
--seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32
python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--test_path datasets/tnews_text2text/test_nolabel.tsv \
--prediction_path datasets/tnews_text2text/prediction.tsv \
--seq_length 128 --tgt_seq_length 8 --batch_size 32
Users can download the tnews dataset in text2text format from here.
This is the set of Chinese T5-v1_1 pre-trained weights. Configuration files are in the models/t5-v1_1/ folder.
The links and detailed descriptions (Huggingface model hub) of different pre-trained T5-v1_1 weights:
Take the CLUECorpusSmall T5-v1_1-small weight as an example: we download it through the above link and put it in the models/ folder. We can conduct further pre-training upon it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path dataset.pt \
--processes_num 8 --seq_length 128 \
--dynamic_masking --data_processor t5
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--output_model_path models/book_review_t5-v1_1_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-4 --batch_size 64 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
or fine-tune it on a downstream dataset:
python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--train_path datasets/tnews_text2text/train.tsv \
--dev_path datasets/tnews_text2text/dev.tsv \
--seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32
python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--test_path datasets/tnews_text2text/test_nolabel.tsv \
--prediction_path datasets/tnews_text2text/prediction.tsv \
--seq_length 128 --tgt_seq_length 8 --batch_size 32
This is the set of fine-tuned Chinese RoBERTa weights. All of them use the models/bert/base_config.json configuration file.
The links and detailed descriptions (Huggingface model hub) of different fine-tuned RoBERTa weights:
One can load these weights for further pre-training, fine-tuning, and inference.
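For example, here is a minimal sketch of running inference with one of the converted fine-tuned weights through Huggingface Transformers; the model id uer/roberta-base-finetuned-dianping-chinese is an assumption and should be replaced with the weight you actually use:
# Sketch: classification with a fine-tuned RoBERTa weight loaded from the
# Huggingface model hub. The model id below is an assumption.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_id = "uer/roberta-base-finetuned-dianping-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("这家店的菜很好吃，服务也很周到。"))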
This is the set of pre-trained weights whose encoders are not Transformers.
The links and detailed descriptions of different pre-trained weights:
Model link | Configuration file | Model details | Training details |
---|---|---|---|
CLUECorpusSmall LSTM language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder lstm --target lm | Steps: 500,000; Learning rate: 1e-3; Batch size: 64*8 (the number of GPUs); Sequence length: 256 |
CLUECorpusSmall GRU language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder gru --target lm | Steps: 500,000; Learning rate: 1e-3; Batch size: 64*8 (the number of GPUs); Sequence length: 256 |
CLUECorpusSmall GatedCNN language model | models/gatedcnn_9_config.json | --embedding word --remove_embedding_layernorm --encoder gatedcnn --target lm | Steps: 500,000; Learning rate: 1e-4; Batch size: 64*8 (the number of GPUs); Sequence length: 256 |
CLUECorpusSmall ELMo | models/birnn_config.json | --embedding word --remove_embedding_layernorm --encoder bilstm --target bilm | Steps: 500,000; Learning rate: 5e-4; Batch size: 64*8 (the number of GPUs); Sequence length: 256 |
Models pre-trained by other organizations (in UER format):
Model link | Description | Description link |
---|---|---|
Google Chinese BERT-Base | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
Google Chinese ALBERT-Base | Configuration file: models/albert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Large | Configuration file: models/albert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Xlarge | Configuration file: models/albert/xlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Xxlarge | Configuration file: models/albert/xxlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
HFL Chinese BERT-wwm | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese BERT-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese RoBERTa-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese RoBERTa-wwm-large-ext | Configuration file: models/bert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
Models pre-trained by UER:
Pre-trained model | Link | Description |
---|---|---|
Wikizh(word-based)+BertEncoder+BertTarget | Model: https://share.weiyun.com/5s4HVMi Vocab: https://share.weiyun.com/5NWYbYn | Word-based BERT model pre-trained on Wikizh. Training steps: 500,000 |
RenMinRiBao+BertEncoder+BertTarget | https://share.weiyun.com/5JWVjSE | The training corpus is news data from People's Daily (1946-2017). |
Webqa2019+BertEncoder+BertTarget | https://share.weiyun.com/5HYbmBh | The training corpus is WebQA, which is suitable for datasets related to social media, e.g., LCQMC and XNLI. Training steps: 500,000 |
Weibo+BertEncoder+BertTarget | https://share.weiyun.com/5ZDZi4A | The training corpus is Weibo. |
Weibo+BertEncoder(large)+MlmTarget | https://share.weiyun.com/CFKyMkp3 | The training corpus is Weibo. The configuration file is bert_large_config.json |
Reviews+BertEncoder+MlmTarget | https://share.weiyun.com/tBgaSx77 | The training corpus is reviews. |
Reviews+BertEncoder(large)+MlmTarget | https://share.weiyun.com/hn7kp9bs | The training corpus is reviews. The configuration file is bert_large_config.json |
MixedCorpus+BertEncoder(xlarge)+MlmTarget | https://share.weiyun.com/J9rj9WRB | Pre-trained on mixed large Chinese corpus. The configuration file is bert_xlarge_config.json |
MixedCorpus+BertEncoder(xlarge)+BertTarget(WWM) | https://share.weiyun.com/UsI0OSeR | Pre-trained on mixed large Chinese corpus. The configuration file is bert_xlarge_config.json |
MixedCorpus+BertEncoder(large)+MlmTarget | https://share.weiyun.com/5G90sMJ | Pre-trained on mixed large Chinese corpus. The configuration file is bert_large_config.json |
MixedCorpus+BertEncoder(base)+BertTarget | https://share.weiyun.com/5QOzPqq | Pre-trained on mixed large Chinese corpus. The configuration file is bert_base_config.json |
MixedCorpus+BertEncoder(small)+BertTarget | https://share.weiyun.com/fhcUanfy | Pre-trained on mixed large Chinese corpus. The configuration file is bert_small_config.json |
MixedCorpus+BertEncoder(tiny)+BertTarget | https://share.weiyun.com/yXx0lfUg | Pre-trained on mixed large Chinese corpus. The configuration file is bert_tiny_config.json |
MixedCorpus+GptEncoder+LmTarget | https://share.weiyun.com/51nTP8V | Pre-trained on mixed large Chinese corpus. Training steps: 500,000 (with sequence length of 128) + 100,000 (with sequence length of 512) |
Reviews+LstmEncoder+LmTarget | https://share.weiyun.com/57dZhqo | The training corpus is Amazon reviews + JD binary reviews + Dianping reviews (11.4M reviews in total). The language model target is used. It is suitable for datasets related to reviews. It achieves over 5 percent improvement on some review datasets compared with random initialization. Set hidden_size in models/rnn_config.json to 512 before using it. Training steps: 200,000; Sequence length: 128 |
(MixedCorpus & Amazon reviews)+LstmEncoder+(LmTarget & ClsTarget) | https://share.weiyun.com/5B671Ik | First pre-trained on the mixed large Chinese corpus with the LM target, and then pre-trained on Amazon reviews with the LM and CLS targets. It is suitable for datasets related to reviews. It can achieve results comparable with BERT on some review datasets. Training steps: 500,000 + 100,000; Sequence length: 128 |
IfengNews+BertEncoder+BertTarget | https://share.weiyun.com/5HVcUWO | The training corpus is news data from the Ifeng website. We use the news title to predict the news abstract. Training steps: 100,000; Sequence length: 128 |
jdbinary+BertEncoder+ClsTarget | https://share.weiyun.com/596k2bu | The training corpus is review data from JD (Jingdong). The CLS target is used for pre-training. It is suitable for datasets related to shopping reviews. Training steps: 50,000; Sequence length: 128 |
jdfull+BertEncoder+MlmTarget | https://share.weiyun.com/5L6EkUF | The training corpus is review data from JD (Jingdong). The MLM target is used for pre-training. Training steps: 50,000; Sequence length: 128 |
Amazonreview+BertEncoder+ClsTarget | https://share.weiyun.com/5XuxtFA | The training corpus is review data from Amazon (including book reviews, movie reviews, etc.). The classification target is used for pre-training. It is suitable for datasets related to reviews; e.g., accuracy on the Douban book review dataset is improved from 87.6 to 88.5 (compared with Google BERT). Training steps: 20,000; Sequence length: 128 |
XNLI+BertEncoder+ClsTarget | https://share.weiyun.com/5oXPugA | Infersent with BertEncoder |