Details of preprocessing #1
Thank you for your interest. I think awesome-bert-japanese is really good work. For each document, we identify the kana spelling and then pre-tokenize with a morphological analyzer using the dictionary associated with that spelling: unidic-cwj or SudachiDict-core for contemporary kana spelling (現代仮名遣), and unidic-qkana for classical kana spelling (歴史的仮名遣). We use a simple method to identify the kana spelling.
In SudachiPy, we use split mode A. After pre-tokenization, we concatenate the texts of Aozora Bunko and randomly sampled Wikipedia (or Aozora Bunko alone), and build the vocabulary with subword-nmt. I will update the README soon.
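For readers of this thread, here is a minimal sketch of the pipeline as described above; it is not the authors' actual code. It assumes each document already carries a kana-spelling label (the detection method isn't shown in this thread); the unidic-qkana path, the file names, and the BPE merge count are hypothetical, and it uses SudachiDict-core for contemporary spelling (the comment names unidic-cwj as the other option).

```python
from fugashi import Tagger                    # MeCab wrapper, used here for unidic-qkana
from sudachipy import dictionary, tokenizer   # requires the sudachidict_core package
from subword_nmt.learn_bpe import learn_bpe   # subword-nmt's Python entry point

# Hypothetical install location of the unidic-qkana dictionary.
QKANA_DICT_DIR = "/path/to/unidic-qkana"

# SudachiPy with SudachiDict-core, using split mode A as stated above.
sudachi = dictionary.Dictionary().create()
MODE_A = tokenizer.Tokenizer.SplitMode.A

# MeCab tagger with unidic-qkana for classical kana spelling.
qkana = Tagger(f"-d {QKANA_DICT_DIR}")

def pre_tokenize(text: str, spelling: str) -> list[str]:
    """Route one document to the analyzer matching its kana spelling.

    `spelling` is an assumed input label ("contemporary" or "classical");
    how the authors actually detect it is not preserved in this thread.
    """
    if spelling == "contemporary":   # 現代仮名遣 -> SudachiDict-core (or unidic-cwj)
        return [m.surface() for m in sudachi.tokenize(text, MODE_A)]
    else:                            # 歴史的仮名遣 -> unidic-qkana
        return [w.surface for w in qkana(text)]

# After pre-tokenizing everything into one whitespace-separated corpus file
# (Aozora Bunko plus sampled Wikipedia, or Aozora Bunko alone), learn the
# subword vocabulary with subword-nmt. 32000 merges is an illustrative number.
with open("pretokenized_corpus.txt", encoding="utf-8") as infile, \
     open("bpe_codes.txt", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)
```

The point is that "combining" the analyzers reduces to a per-document dispatch on the spelling label; everything downstream sees a single pre-tokenized corpus.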
I'm really happy to hear your kind response. 😸 In the list of documents, I can see the kana-spelling information for each work. If I understand correctly, each document in Aozora Bunko could be written in one of the two kana spellings (contemporary or classical). Did you consider using any information other than the kana spelling? Let me share my current understanding of your configuration. Does it sufficiently describe your preprocessing?
No, we only use kana spelling information.
Looks good. Thank you!
Now it makes sense. 👍 I'll close the issue.
Hello, I'm the owner of awesome-bert-japanese, a list of pretrained BERT models available online.
@kkadowa told me about your BERT model in the PR.
I'm really interested in adding your model to our list.
If you don't mind my asking a question and adding your model to awesome-bert-japanese, could you please tell me how you combine different morphological analyzers in your preprocessing?
My questions here:
- Which split mode do you use in SudachiPy? (See Multi-granular Tokenization in the official documentation.)

I'd really appreciate your response.
Thanks!
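For the split-mode question above: SudachiPy offers three tokenization granularities, A, B, and C (the Multi-granular Tokenization section of its documentation). A quick illustration, assuming the sudachidict_core package is installed; the sample word follows SudachiPy's own README:

```python
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()

# The three split modes produce increasingly long units.
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print([m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)])
# A -> ['国家', '公務', '員']   (shortest units)
# B -> ['国家', '公務員']
# C -> ['国家公務員']          (longest unit)
```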