Add entries to pos map #22

Merged (1 commit) on Dec 9, 2020

Conversation

j-chim
Contributor

@j-chim j-chim commented Dec 8, 2020

Hi, I was working with the hkcancor dataset and saw the mapping here. Many thanks for providing this resource!

This PR slightly modifies the _MAP to include edge-case POS tags (mostly morpheme-related ones) that I think should have explicit entries, rather than being bucketed to "X" by default.
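To illustrate the difference between an explicit entry and the default bucket, here is a minimal sketch. It is not the actual _MAP from the repository: `EXAMPLE_MAP`, `to_ud`, and the target chosen for "Qg" are hypothetical and only show the lookup pattern being discussed.

```python
# Hypothetical, heavily abbreviated stand-in for the real _MAP dict.
# The entries and their targets are assumptions for illustration only.
EXAMPLE_MAP = {
    "N": "NOUN",
    "V": "VERB",
    "Qg": "NOUN",  # assumed target, not taken from the PR
}


def to_ud(tag, default="X"):
    """Return the mapped tag, falling back to the catch-all default."""
    return EXAMPLE_MAP.get(tag, default)


print(to_ud("Qg"))   # explicit entry is used
print(to_ud("ZZZ"))  # unknown tag falls into the default bucket
```

With a lookup like this, a tag without an explicit entry is silently treated as "X", which is why listing documented tags explicitly makes the coverage visible.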

@jacksonllee
Owner

Hello, thank you for the pull request. Where did you actually find the POS tags you're trying to add here? The _MAP dict already contains all and only the POS tags actually used in the HKCanCor data. None of your added ones are found in the data.

@j-chim
Contributor Author

j-chim commented Dec 8, 2020

Some of them only exist in the paper/tagging scheme but not in the corpus (the morpheme-related ones, e.g. "Bg", "g", "Qg"). I think it would make sense to include them for the sake of coverage, although the changes are minimal in practice.

The remaining two are really just edge cases, possibly introduced in v2 of the corpus:

  • n1: found in "FC-035_v2" (only appears once)
  • xo: found in "FC-R006_v2" (I can see this as just "X").

This is based on the corpus hosted on GitHub, and I found the same entries in the data downloaded directly from the website.
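A check of this kind can be sketched as follows. This is a rough illustration rather than the script actually used: `uncovered_tags` is a hypothetical helper, and the tag lists are made-up sample data, not read from the corpus files.

```python
from collections import Counter


def uncovered_tags(observed_tags, known_tags):
    """Count how often each observed tag is missing from the known tag set."""
    counts = Counter(observed_tags)
    return {tag: n for tag, n in counts.items() if tag not in known_tags}


# Made-up sample data standing in for tags extracted from corpus files.
observed = ["n", "v", "n1", "n", "xo", "v"]
known = {"n", "v"}

print(uncovered_tags(observed, known))
```

Keeping the counts alongside the tags makes it easy to tell one-off cases apart from tags that occur regularly.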

@jacksonllee
Owner

Thank you for the pointers. I wasn't aware of differences between the HKCanCor included in PyCantonese and that from the fcbond/hkcancor repo, and therefore the {N1, XJA, XO} tags were unknown to me. I got the HKCanCor data about 6 years ago and did a lot of heavy scripting to transform it into the CHAT data format currently used in PyCantonese. Unfortunately, I'm unable to locate whatever I used to do the transformation, and so probably won't be able to update the HKCanCor data here for any inconsistencies.

For all these tags you're adding (both the new tags only found in the "upstream" HKCanCor data and those documented in its paper/website but actually unused in the data), they're unlikely to have any practical effect on part-of-speech tagging, since the POS tagger would never see them in the data anyway. That being said, precisely because these tags have no effect and just sit there in the _MAP dict, I don't see any harm in including them for completeness!

LGTM. Thank you for your pull request again.

@jacksonllee jacksonllee merged commit 4a95250 into jacksonllee:master Dec 9, 2020
jacksonllee added a commit that referenced this pull request Dec 9, 2020
Acknowledge Jenny Chim's contribution in readme.
Also update the hkcancor-to-ud map for where some of the edge-case
tags come from in the upstream data.
Update the X-initial tags to better match what the data would
suggest for a more meaningful POS tag rather than the catch-all X.
@jacksonllee
Owner

@j-chim Just wanted to let you know that your contribution has been acknowledged in the readme now (I also updated the new X-initial edge-case tags to better match what the data would suggest rather than a generic catch-all X):

85a2a82

Thanks again!

ZhanruiLiang pushed a commit to ZhanruiLiang/pycantonese that referenced this pull request Oct 4, 2022
2 participants