Add entries to pos map #22

j-chim · 2020-12-08T13:23:52Z

Hi, I was working with the hkcancor dataset and saw the mapping here. Many thanks for providing this resource!

This PR slightly modifies _MAP to include edge-case POS tags (mostly morpheme-related ones) that I think deserve explicit entries rather than being bucketed into "X" by default.
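For readers unfamiliar with the default bucketing mentioned above, here is a minimal sketch of the idea. The three map entries, the `to_ud` helper name, and the upper-casing step are assumptions made for illustration; they are not copied from the PyCantonese source.

```python
# Hypothetical, abbreviated stand-in for the real _MAP dict.
# The entries below are illustrative only.
_MAP = {
    "N": "NOUN",
    "V": "VERB",
    "A": "ADJ",
}


def to_ud(tag, default="X"):
    """Return the mapped tag, or `default` when the tag has no explicit entry."""
    # Assumption: tags are compared case-insensitively via upper-casing.
    return _MAP.get(tag.upper(), default)


print(to_ud("v"))   # has an explicit entry in the sketch map
print(to_ud("Qg"))  # no explicit entry, so it falls back to the default
```

Adding an explicit entry for a tag such as "Qg" would change only the lookups for that tag; every tag already in the map keeps its current result.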

jacksonllee · 2020-12-08T14:26:30Z

Hello, thank you for the pull request. Where did you actually find the POS tags you're trying to add here? The _MAP dict already contains all and only the POS tags actually used in the HKCanCor data. None of your added ones are found in the data.

j-chim · 2020-12-08T14:43:03Z

Some tags exist only in the paper/tagging scheme and not in the corpus (the morpheme-related ones, e.g. "Bg", "g", "Qg"). I think it makes sense to include them for the sake of coverage, although the changes are minimal in practice.

The remaining two are really just edge cases, possibly introduced in v2 of the corpus:

n1: found in "FC-035_v2" (only appears once)
xo: found in "FC-R006_v2" (I can see this as just "X").

This is based on the corpus hosted on GitHub, and I found the same entries in the data downloaded directly from the website.
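A coverage check of the kind described above could be sketched as follows. The `map_keys` set and the `observed` list are made-up stand-ins (not the real map or corpus data), and upper-casing the tags is an assumption.

```python
from collections import Counter

# Hypothetical keys of the tag map (not the real _MAP contents).
map_keys = {"N", "V", "A", "BG", "G", "QG"}

# Hypothetical tags as they might be read from corpus files.
observed = ["n", "v", "n1", "a", "xo", "v"]

# Normalize case (an assumption) and count how often each tag occurs.
counts = Counter(tag.upper() for tag in observed)

# Tags seen in the data that have no explicit map entry.
unmapped = sorted(set(counts) - map_keys)

# Map entries that never occur in the data.
unused = sorted(map_keys - set(counts))

print("unmapped:", unmapped)
print("unused:", unused)
```

The first list corresponds to edge cases like "n1" and "xo"; the second to tags that are documented in the scheme but absent from the data.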

jacksonllee · 2020-12-09T17:42:41Z

Thank you for the pointers. I wasn't aware of differences between the HKCanCor included in PyCantonese and that from the fcbond/hkcancor repo, and therefore the {N1, XJA, XO} tags were unknown to me. I got the HKCanCor data about 6 years ago and did a lot of heavy scripting to transform it into the CHAT data format currently used in PyCantonese. Unfortunately, I'm unable to locate whatever I used to do the transformation, and so probably won't be able to update the HKCanCor data here for any inconsistencies.

All the tags you're adding (both the new tags found only in the "upstream" HKCanCor data and those documented in its paper/website but unused in the data) are unlikely to have any practical effect on part-of-speech tagging, since the POS tagger would never see them in the data anyway. That said, precisely because these tags have no effect and just sit there in the _MAP dict, I don't see any harm in including them for completeness!

LGTM. Thank you for your pull request again.

Acknowledge Jenny Chim's contribution in readme. Also update the hkcancor-to-ud map for where some of the edge-case tags come from in the upstream data. Update the X-initial tags to better match what the data would suggest for a more meaningful POS tag rather than the catch-all X.

jacksonllee · 2020-12-09T18:09:57Z

@j-chim Just wanted to note that your contribution has been noted in the readme now (I also updated the new X-initial edge case tags to better match what the data would suggest rather than a generic catch-all X):

85a2a82

Thanks again!

Add entries to pos map

6598ac8

jacksonllee merged commit 4a95250 into jacksonllee:master Dec 9, 2020

ZhanruiLiang pushed a commit to ZhanruiLiang/pycantonese that referenced this pull request Oct 4, 2022

ENH add entries to HKCanCor-to-UD map (jacksonllee#22)

a643b2d

jacksonllee mentioned this pull request Aug 7, 2024

Undocumented differences between the HKCanCor corpus on HuggingFace vs PyCantonese #50

Open
