Berserker

Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's BERT model.

Installation

pip install basaka

Usage

import berserker

berserker.load_model() # A one-off installation
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']

Benchmark

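The table below compares Berserker's F1 scores (%) on the SIGHAN 2005 datasets with previously published results. Berserker achieves the highest F1 on MSR and AS and is within 0.2 points of the best reported score on PKU and CITYU.

Berserker was trained for 15 epochs on each dataset with a batch size of 64.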

Model	PKU	CITYU	MSR	AS
Liu et al. (2016)	96.8	--	97.3	--
Yang et al. (2017)	96.3	96.9	97.5	95.7
Zhou et al. (2017)	96.0	--	97.8	--
Cai et al. (2017)	95.8	95.6	97.1	--
Chen et al. (2017)	94.3	95.6	96.0	94.6
Wang and Xu (2017)	96.5	--	98.0	--
Ma et al. (2018)	96.1	97.2	98.1	96.2
--------------------	----------	----------	----------	----------
Berserker	96.6	97.1	98.4	96.5

Reference: Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs
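
For readers unfamiliar with how word segmentation is scored, here is a minimal sketch of a word-level F1 computation. It is not the official SIGHAN scoring script, and the function names `to_spans` and `segmentation_f1` are made up for illustration. A predicted word counts as correct only if both of its boundaries match a gold word.

```python
def to_spans(tokens):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold, pred):
    """Word-level F1 between a gold and a predicted segmentation of one sentence."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)  # words with both boundaries right
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
# Hypothetical prediction that wrongly splits '過兒' into two words.
pred = ['姑姑', '想', '過', '過', '過', '兒', '過過', '的', '生活', '。']
print(round(segmentation_f1(gold, pred), 3))  # 0.842
```

A single wrong split costs one gold word and adds two wrong predictions, which is why the score drops noticeably on a short sentence. Benchmark figures are normally computed over a whole test set rather than per sentence.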

Limitation

Since Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.

Currently, the default model is trained on the SIGHAN 2005 PKU dataset. We plan to release more pretrained models in the future.

Architecture

Berserker is fine-tuned on TPU from the pretrained Chinese BERT model. A single dense layer is added on top of BERT and applied to every token, producing one output in [0, 1] per token, where 1 denotes a split.
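
To make the output format concrete, here is a minimal sketch of how per-token split scores could be turned back into words. It is not taken from the Berserker source. It assumes that each token is a single character, that a split falls after the marked character, and that scores are thresholded at 0.5. The function `decode_splits` and the scores below are made up for illustration.

```python
def decode_splits(text, split_probs, threshold=0.5):
    """Cut `text` after every character whose split score reaches the threshold."""
    if len(text) != len(split_probs):
        raise ValueError("expected one score per character")
    words, current = [], ""
    for char, prob in zip(text, split_probs):
        current += char
        if prob >= threshold:  # assumed convention: 1 means "split after this character"
            words.append(current)
            current = ""
    if current:  # flush a trailing word that was not closed by a split
        words.append(current)
    return words


text = "想過的生活。"
scores = [0.9, 0.8, 0.95, 0.1, 0.9, 0.99]  # made-up values, not real model output
print(decode_splits(text, scores))  # ['想', '過', '的', '生活', '。']
```

Only the character 生 has a score below the threshold, so it is joined with the following character to form 生活.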

Training

The source code for training is provided under the trainer subdirectory. Feel free to contact me if you need any help reproducing the results.
