Add entries to pos map #22
Hello, thank you for the pull request. Where did you actually find the POS tags you're trying to add here? The |
Some of them exist only in the paper/tagging scheme and not in the corpus (the morpheme-related ones, e.g. "Bg", "g", "Qg"). I think it makes sense to include them for the sake of coverage, although the changes are minimal in practice. The remaining two are really just edge cases, possibly introduced in v2 of the corpus:
This is based on the corpus hosted on GitHub, and I found the same entries in the data downloaded directly from the website.
Thank you for the pointers. I wasn't aware of differences between the HKCanCor included in PyCantonese and that from the fcbond/hkcancor repo, and therefore the {N1, XJA, XO} tags were unknown to me. I got the HKCanCor data about 6 years ago and did a lot of heavy scripting to transform it into the CHAT data format currently used in PyCantonese. Unfortunately, I'm unable to locate whatever I used to do the transformation, and so probably won't be able to update the HKCanCor data here for any inconsistencies.
For all these tags you're adding (both the new tags only found in the "upstream" HKCanCor data, as well as those documented in its paper/website but actually unused in the data), practically they're unlikely to have any effect in part-of-speech tagging, since the POS tagger would never see them in the data anyway.
This being said, precisely because these tags have no effect and just sit there in the LGTM. Thank you for your pull request again.
Acknowledge Jenny Chim's contribution in readme. Also update the hkcancor-to-ud map for where some of the edge-case tags come from in the upstream data. Update the X-initial tags to better match what the data would suggest for a more meaningful POS tag rather than the catch-all X.
Hi, I was working with the hkcancor dataset and saw the mapping here. Many thanks for providing this resource!
This PR slightly modifies the _MAP to include edge-case POS tags (mostly morpheme-related ones) that I think should have explicit entries, rather than being bucketed into "X" by default.
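To illustrate the fallback behaviour described above, here is a minimal sketch. It is not the actual PyCantonese code: `EXAMPLE_MAP`, `to_ud`, and `unmapped` are hypothetical names, and the tag-to-UD values shown are illustrative assumptions rather than the entries proposed in this PR.

```python
# Hypothetical sketch: a tag map with a catch-all default.
# The entries below are illustrative assumptions, not the real _MAP contents.
EXAMPLE_MAP = {
    "V": "VERB",
    "N": "NOUN",
    "Bg": "ADJ",   # assumed target for a morpheme-related tag
    "Qg": "NOUN",  # assumed target for a morpheme-related tag
}


def to_ud(tag, mapping=EXAMPLE_MAP, default="X"):
    """Return the mapped tag, or the catch-all default if there is no entry."""
    return mapping.get(tag, default)


def unmapped(tags, mapping=EXAMPLE_MAP):
    """List the distinct tags that would silently fall back to the default."""
    return sorted({t for t in tags if t not in mapping})


if __name__ == "__main__":
    print(to_ud("Bg"))                      # explicit entry is used
    print(to_ud("ZZZ"))                     # no entry, falls back to "X"
    print(unmapped(["V", "Bg", "ZZZ", "ZZZ"]))
```

Running a helper like `unmapped` over the tags observed in a given copy of the corpus is one way to spot tags that are currently bucketed into "X" only because no explicit entry exists.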