
Details of preprocessing #1

Closed · himkt opened this issue Aug 1, 2020 · 4 comments

himkt commented Aug 1, 2020

Hello, I'm the owner of awesome-bert-japanese, a list of pretrained BERT models available online.
@kkadowa told me about your BERT model in the PR.
I'm really interested in adding your model to our list.

If you don't mind my asking a question and adding your model to awesome-bert-japanese,
could you please tell me how you combine different morphological analyzers in preprocessing?

My questions are:

  1. How do you combine the two analyzers MeCab+unidic-cwj and MeCab+unidic-qkana?
  2. How do you combine the two analyzers SudachiPy+SudachiDict_core-20191224 and MeCab+UniDic-qkana_1603?
  3. In SudachiPy, which tokenization mode did you use? (See "Multi-granular Tokenization" in the official documentation.)

I'd really appreciate your response.
Thanks!

akirakubo (Owner) commented:

Thank you for your interest. I think awesome-bert-japanese is really good work.

For each document, we identify the kana spelling and then pre-tokenize the text using a morphological analyzer with the dictionary associated with that spelling: unidic-cwj or SudachiDict-core is used for contemporary kana spelling (現代仮名遣), and unidic-qkana is used for classical kana spelling (歴史的仮名遣).
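As an illustration of this per-document routing, here is a minimal sketch using mecab-python3. The dictionary paths and the `is_classical_kana` flag are hypothetical placeholders, not details of the actual pipeline.

```python
import MeCab

# Hypothetical dictionary install locations; adjust to your environment.
CWJ_DIC = "/usr/local/lib/mecab/dic/unidic-cwj"      # contemporary kana spelling
QKANA_DIC = "/usr/local/lib/mecab/dic/unidic-qkana"  # classical kana spelling

cwj_tagger = MeCab.Tagger(f"-Owakati -d {CWJ_DIC}")
qkana_tagger = MeCab.Tagger(f"-Owakati -d {QKANA_DIC}")

def pre_tokenize(text: str, is_classical_kana: bool) -> list[str]:
    """Pre-tokenize one document with the dictionary matching its kana spelling."""
    tagger = qkana_tagger if is_classical_kana else cwj_tagger
    return tagger.parse(text).split()
```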

We use a simple method to identify the kana spelling of each document.

In SudachiPy, we use split mode A (`$ sudachipy -m A -a file`) because it is equivalent to the short unit word (SUW) segmentation in UniDic, and unidic-cwj and unidic-qkana provide only SUW mode.
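For reference, the same split mode A segmentation can be reproduced from Python with the SudachiPy API. This is only a sketch assuming SudachiDict-core is installed; the sample sentence is illustrative.

```python
from sudachipy import dictionary, tokenizer

# Split mode A yields short-unit-word (SUW) level tokens,
# matching the granularity of unidic-cwj / unidic-qkana.
mode = tokenizer.Tokenizer.SplitMode.A
tok = dictionary.Dictionary().create()  # loads the installed SudachiDict (core edition by default)

def tokenize_suw(text: str) -> list[str]:
    return [m.surface() for m in tok.tokenize(text, mode)]

print(tokenize_suw("吾輩は猫である。"))
```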

After pre-tokenization, we concatenate the texts from Aozora Bunko and randomly sampled Wikipedia (or Aozora Bunko only), and build the vocabulary using subword-nmt.
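A minimal sketch of this vocabulary step with the subword-nmt Python API is below. The file names and the number of merge operations are placeholders; the actual settings are not stated in this thread.

```python
from subword_nmt import learn_bpe, apply_bpe

# Learn BPE merge operations over the concatenated, pre-tokenized corpus.
with open("pretokenized_corpus.txt", encoding="utf-8") as corpus, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe.learn_bpe(corpus, codes_out, num_symbols=32000)  # placeholder merge count

# Apply the learned codes to split pre-tokenized words into subwords.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = apply_bpe.BPE(codes_in)

print(bpe.process_line("吾輩 は 猫 で ある 。"))
```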

I will update the README soon.

akirakubo added a commit that referenced this issue Aug 1, 2020
himkt (Author) commented Aug 1, 2020

I'm really happy to receive such a kind response. 😸
Let me ask a few additional questions to make sure I understand the steps precisely. 🙇

In the list of documents, I can see the 文字遣い種別 (character-usage type) column.
As you said, this column contains two kinds of information: kanji spelling and kana spelling.

If I understand correctly, each document in Aozora Bunko could be written in one of
{新漢字+新仮名, 新漢字+旧仮名, 旧漢字+新仮名, 旧漢字+旧仮名}.

Did you consider kanji spelling when deciding which analyzer+dictionary to use?
(Or is the analyzer+dictionary determined by kana spelling alone?)

Here is my current understanding of your configurations. Does it describe your preprocessing accurately?

* akirakubo (`Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603`)

  * sentence -> words (1): MeCab (unidic-cwj) for Wikipedia and Aozora Bunko written in `新仮名`
  * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
  * word -> subword: WordPiece
  * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

* akirakubo (`Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603`)

  * sentence -> words (1): SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko written in `新仮名`
  * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
  * word -> subword: WordPiece
  * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

akirakubo (Owner) commented:

> If I understand correctly, each document in Aozora Bunko could be written in one of
> {新漢字+新仮名, 新漢字+旧仮名, 旧漢字+新仮名, 旧漢字+旧仮名}.
>
> Did you consider kanji spelling when deciding which analyzer+dictionary to use?
> (Or is the analyzer+dictionary determined by kana spelling alone?)

No, we use only the kana spelling information.

> Here is my current understanding of your configurations. Does it describe your preprocessing accurately?
>
> * akirakubo (`Pre-tokenized by MeCab with unidic-cwj-2.3.0 and UniDic-qkana_1603`)
>
>   * sentence -> words (1): MeCab (unidic-cwj) for Wikipedia and Aozora Bunko written in `新仮名`
>   * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
>   * word -> subword: WordPiece
>   * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)
>
> * akirakubo (`Pre-tokenized by SudachiPy with SudachiDict_core-20191224 and MeCab with UniDic-qkana_1603`)
>
>   * sentence -> words (1): SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko written in `新仮名`
>   * sentence -> words (2): MeCab (unidic-qkana) for Aozora Bunko written in `旧仮名`
>   * word -> subword: WordPiece
>   * algorithm for constructing the vocabulary for subword tokenization: subword-nmt (BPE)

Looks good. Thank you!

himkt (Author) commented Aug 1, 2020

Now it makes sense. 👍
Let me merge the PR by kkadowa (with small fixes based on your response) and add your published models to the list.

I'll close the issue.
Again, thank you so much for the quick and kind response!
