emBERT is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's `transformers` package. emBERT defines the following tools:
| Name(s) | Task | Training corpus | F1 score |
|---|---|---|---|
| `bert-ner` | NER | Szeged NER corpus | **97.08%** |
| `bert-basenp` | base NP chunking | Szeged TreeBank 2.0 | **95.58%** |
| `bert-np` (or `bert-chunk`) | maximal NP chunking | Szeged TreeBank 2.0 | **95.05%** |
(The results in bold are state-of-the-art for Hungarian.)
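Once emtsv has been installed with the emBERT module enabled, the tools above can be called as part of an emtsv pipeline. The sketch below queries a locally running emtsv REST service with Python's `requests`; the port (5000), the `file` form field and the `tok/bert-ner` module path are assumptions based on emtsv's usual interface, so check the emtsv documentation for the exact details.

```python
# Minimal sketch: send raw text through a locally running emtsv service,
# tokenizing first and then tagging named entities with bert-ner.
# The URL, module path and form field name are assumptions; adjust them
# to match your emtsv setup.
import requests

TEXT = "Nemeskey Dávid a SZTAKI munkatársa."

response = requests.post(
    "http://localhost:5000/tok/bert-ner",                 # tok, then NER
    files={"file": ("input.txt", TEXT.encode("utf-8"))},  # text as a file upload
)
response.raise_for_status()

# emtsv returns the annotated text in its TSV format: one token per line,
# with the NER label in its own column.
print(response.text)
```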
Due to their size (a little over 700MB apiece), the models are stored in a separate repository, emBERT-models. It is a submodule of this repository, so if the repository is cloned recursively (with git LFS installed), the models will be downloaded as well:
```
git clone --recursive https://github.com/DavidNemeskey/emBERT.git
```
Alternatively, the models can be obtained via emtsv's `download_models.py` script.
If you use emBERT in your work, please cite the following paper (see the link for the BibTeX entry; the paper is in Hungarian):

Nemeskey Dávid Márk 2020. Egy emBERT próbáló feladat. In: Proceedings of the 16th Conference on Hungarian Computational Linguistics (MSZNY 2020), pp. 409-418.
Should the need arise to train a better model, or to build one for a different domain, the `train_embert.py` script can be used to fine-tune a BERT model on a token classification task. An example run that reproduces the chunking results (given the same train-validation-test split and a GPU with at least 11GB of memory) is:
```
train_embert.py --data_dir ~/data/chunking/szeged_max_bioe1_100/ \
    --bert_model SZTAKI-HLT/hubert-base-cc --task_name szeged_bioes_chunk \
    --data_format tsv --output_dir bert_np --do_train --max_seq_length 384 \
    --num_train_epochs=4 --train_batch_size 10 --learning_rate "1e-5" \
    --do_eval --eval_batch_size 1 --use_viterbi --seed 42
```
Note that if the model is trained on a new tag set, the tag set has to be added to `embert/processors.py`.
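Once fine-tuning has finished, the model in `--output_dir` can be tried out directly with the `transformers` API. The sketch below assumes that `train_embert.py` saves the model and tokenizer in the standard `transformers` format under `bert_np`; if the tokenizer is not saved there, load it from the base model (`SZTAKI-HLT/hubert-base-cc`) instead. Note also that this bypasses emBERT's own decoding (e.g. the `--use_viterbi` option) and simply takes the highest-scoring label per word piece.

```python
# Minimal inference sketch for a model fine-tuned with train_embert.py.
# Assumption: the checkpoint in "bert_np" follows the standard
# transformers layout (config.json, model weights, tokenizer files).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_dir = "bert_np"                      # the --output_dir used above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)
model.eval()

sentence = "A kormány tegnap új rendeletet fogadott el."
encoding = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits      # shape: (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
for token, label_id in zip(tokens, predictions):
    # id2label comes from the saved config; special tokens ([CLS], [SEP])
    # and word-piece continuations should be filtered out in real use.
    print(token, model.config.id2label[label_id])
```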