Use all char points when generating indexes instead of only the first #41

Open
wants to merge 1 commit into
base: master

TonyStew

When generating the index for a given label, we currently take into account only the first code point of each char. This breaks for chars that are code point sequences. For example, the common LGR currently provided by ICANN contains:

    <char cp="0073 0073" ref="118" comment="Sequence added for variant mapping">
      <var cp="00DF" type="blocked" ref="118" comment="IDNA2003 Compatibility" />
      <var cp="03B2" type="blocked" ref="118" />
    </char>

This is meant to mark "ß" as a variant of "ss", but because of the current index generation logic, any label containing "ss" returns an index containing only "s", i.e. "sharpness" -> "sharpnes". I don't think this is intended behavior, and it will cause a lot of collisions between otherwise unrelated labels.

This PR seems to resolve the issue for me ("sharpness" -> "sharpness", "teßt" -> "tesst"), but more rigorous testing may be needed.
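
To illustrate the difference, here is a minimal standalone sketch of the two behaviors. It is not the library's actual code: `VARIANT_SETS`, `index_label`, and the `first_cp_only` flag are hypothetical names, and the sketch assumes the index is built by replacing each char with the smallest member of its variant set.

```python
# Toy model, not the library's data structures: each char is a tuple of code points.
# One variant set taken from the LGR excerpt above: "ss", "ß", "β".
VARIANT_SETS = [
    [(0x0073, 0x0073), (0x00DF,), (0x03B2,)],
]


def build_table(variant_sets):
    """Map every member of a variant set to the set's smallest member."""
    table = {}
    for members in variant_sets:
        representative = min(members)  # tuples compare element by element
        for member in members:
            table[member] = representative
    return table


def index_label(label, first_cp_only=False):
    """Return a toy index for `label`.

    first_cp_only=True mimics the reported behavior (only the first code
    point of the representative is kept); False keeps the whole sequence.
    """
    table = build_table(VARIANT_SETS)
    longest = max(len(key) for key in table)
    cps = [ord(ch) for ch in label]
    out = []
    i = 0
    while i < len(cps):
        # Try the longest sequence first so "ss" matches before a single "s".
        for n in range(longest, 0, -1):
            key = tuple(cps[i:i + n])
            if len(key) == n and key in table:
                rep = table[key]
                out.extend(rep[:1] if first_cp_only else rep)
                i += n
                break
        else:
            # Code point without variants: copy it through unchanged.
            out.append(cps[i])
            i += 1
    return "".join(chr(cp) for cp in out)


if __name__ == "__main__":
    for label in ("sharpness", "teßt"):
        print(label, index_label(label, first_cp_only=True), index_label(label))
```

With `first_cp_only=True`, "teßt" and "test" end up with the same index, which is the kind of collision described above.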
