
Details of preprocessing #1

Closed · himkt opened this issue Aug 1, 2020 · 4 comments

himkt commented Aug 1, 2020

Hello, I'm the owner of awesome-bert-japanese, a list of pretrained BERT models available online.
@kkadowa told me about your BERT model in the PR.
I'm really interested in adding your model to our list.

If you don't mind my asking a question and adding your model to awesome-bert-japanese,
could you please tell me how you combine different morphological analyzers in preprocessing?

My questions are:

  1. How do you combine the two analyzers MeCab+unidic-cwj and MeCab+unidic-qkana?
  2. How do you combine the two analyzers SudachiPy+SudachiDict_core-20191224 and MeCab+UniDic-qkana_1603?
  3. In SudachiPy, which tokenization mode did you use? (See "Multi-granular Tokenization" in the official documentation.)

I'd really appreciate your response.
Thanks!

akirakubo (Owner) commented:

Thank you for your interest. I think awesome-bert-japanese is really good work.

For each document, we identify the kana spelling and then pre-tokenize the text using a morphological analyzer with the dictionary associated with that spelling: unidic-cwj or SudachiDict-core is used for contemporary kana spelling (現代仮名遣), and unidic-qkana is used for classical kana spelling (歴史的仮名遣).
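As an illustration of this per-document routing, here is a minimal sketch using mecab-python3. The dictionary paths and the `is_classical_kana` flag are hypothetical placeholders, not details of the actual pipeline.

```python
import MeCab

# Hypothetical dictionary install locations; adjust to your environment.
CWJ_DIC = "/usr/local/lib/mecab/dic/unidic-cwj"      # contemporary kana spelling
QKANA_DIC = "/usr/local/lib/mecab/dic/unidic-qkana"  # classical kana spelling

cwj_tagger = MeCab.Tagger(f"-Owakati -d {CWJ_DIC}")
qkana_tagger = MeCab.Tagger(f"-Owakati -d {QKANA_DIC}")

def pre_tokenize(text: str, is_classical_kana: bool) -> list[str]:
    """Pre-tokenize one document with the dictionary matching its kana spelling."""
    tagger = qkana_tagger if is_classical_kana else cwj_tagger
    return tagger.parse(text).split()
```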

We use a simple method to identify the kana spelling of each document.

In SudachiPy, we use split mode A (`$ sudachipy -m A -a file`) because it is equivalent to the short unit word (SUW) segmentation in UniDic, and unidic-cwj and unidic-qkana provide only SUW mode.
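For reference, the same split mode A segmentation can be reproduced from Python with the SudachiPy API. This is only a sketch assuming SudachiDict-core is installed; the sample sentence is illustrative.

```python
from sudachipy import dictionary, tokenizer

# Split mode A yields short-unit-word (SUW) level tokens,
# matching the granularity of unidic-cwj / unidic-qkana.
mode = tokenizer.Tokenizer.SplitMode.A
tok = dictionary.Dictionary().create()  # loads the installed SudachiDict (core edition by default)

def tokenize_suw(text: str) -> list[str]:
    return [m.surface() for m in tok.tokenize(text, mode)]

print(tokenize_suw("吾輩は猫である。"))
```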

After pre-tokenization, we concatenate the texts from Aozora Bunko and randomly sampled Wikipedia (or Aozora Bunko only), and build the vocabulary using subword-nmt.
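A minimal sketch of this vocabulary step with the subword-nmt Python API is below. The file names and the number of merge operations are placeholders; the actual settings are not stated in this thread.

```python
from subword_nmt import learn_bpe, apply_bpe

# Learn BPE merge operations over the concatenated, pre-tokenized corpus.
with open("pretokenized_corpus.txt", encoding="utf-8") as corpus, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe.learn_bpe(corpus, codes_out, num_symbols=32000)  # placeholder merge count

# Apply the learned codes to split pre-tokenized words into subwords.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = apply_bpe.BPE(codes_in)

print(bpe.process_line("吾輩 は 猫 で ある 。"))
```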

I will update the README soon.

akirakubo added a commit that referenced this issue Aug 1, 2020
himkt (Author) commented Aug 1, 2020

I'm really happy to receive such a kind response. 😸
Let me ask a few additional questions to make sure I understand the steps precisely. 🙇

In the list of documents, I can see the 文字遣い種別 (character-usage type) column.
As you said, this column contains two kinds of information: kanji spelling and kana spelling.

If I understand correctly, each document in Aozora Bunko could be written in one of
{新漢字+新仮名, 新漢字+旧仮名, 旧漢字+新仮名, 旧漢字+旧仮名}.

Did you consider kanji spelling when deciding which analyzer+dictionary to use?
(Or is the analyzer+dictionary determined by kana spelling alone?)

Here is my current understanding of your configurations. Does it describe your preprocessing accurately?

* akirakubo (`Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603`)

  * sentence -> words (1): MeCab (unidic-cwj) for Wikipedia and Aozora Bunko written in `新仮名`
  * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
  * word -> subword: WordPiece
  * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

* akirakubo (`Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603`)

  * sentence -> words (1): SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko written in `新仮名`
  * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
  * word -> subword: WordPiece
  * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

akirakubo (Owner) commented:

> If I understand correctly, each document in Aozora Bunko could be written in one of
> {新漢字+新仮名, 新漢字+旧仮名, 旧漢字+新仮名, 旧漢字+旧仮名}.
>
> Did you consider kanji spelling when deciding which analyzer+dictionary to use?
> (Or is the analyzer+dictionary determined by kana spelling alone?)

No, we use only the kana spelling information.

> Here is my current understanding of your configurations. Does it describe your preprocessing accurately?
>
> * akirakubo (`Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603`)
>
>   * sentence -> words (1): MeCab (unidic-cwj) for Wikipedia and Aozora Bunko written in `新仮名`
>   * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
>   * word -> subword: WordPiece
>   * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)
>
> * akirakubo (`Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603`)
>
>   * sentence -> words (1): SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko written in `新仮名`
>   * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
>   * word -> subword: WordPiece
>   * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

Looks good. Thank you!

himkt (Author) commented Aug 1, 2020

Now it makes sense. 👍
Let me merge the PR by kkadowa (with small fixes based on your response) and add your published models to the list.

I'll close the issue.
Again, thank you so much for the quick and kind response!
