
Cannot process Supplementary Character in Java #1564

Closed

hurui200320 opened this issue Sep 29, 2020 · 6 comments

@hurui200320

Describe the bug
A clear and concise description of what the bug is.

When handling a supplementary character, HanLP (I tested pinyin annotation and word segmentation) cannot process it properly. In short, Java represents a Unicode character above 0xFFFF as two separate chars (a surrogate pair), so HanLP treats it as two separate Chinese characters when annotating pinyin. However, those surrogate char values are deliberately not assigned to any valid character, so the pinyin result is two 'none's rather than one 'none'.

Word segmentation cannot recognize such a character either, but it does always keep the pair together as one word, so the output never contains a broken lone char.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

println(HanLP.convertToPinyinList("鼖").map { it.pinyinWithToneMark })

Describe the current behavior
A clear and concise description of what happened.

Got: [none, none]
鼖 is represented as \uD87E\uDE1B in Java; neither char alone is a valid Chinese character, so the result is two 'none's.
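
To make the representation concrete, here is a small Java snippet of my own (not part of the original report) showing the surrogate pair and how the single code point can be recovered from it:

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD87E\uDE1B";                                    // the single character 鼖 (U+2FA1B)
        System.out.println(s.length());                               // 2 -> two UTF-16 chars
        System.out.println(s.codePointCount(0, s.length()));          // 1 -> but only one code point
        System.out.println(Integer.toHexString(s.codePointAt(0)));    // 2fa1b
        System.out.println(Character.isHighSurrogate(s.charAt(0)));   // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));    // true
    }
}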

Expected behavior
A clear and concise description of what you expected to happen.

Should get: [fén], or at least a single [none]

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 2004
  • Python version: Not available
  • HanLP version: 1.7.8
  • Java Version: Java 11

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

  • I've completed this form and searched the web for solutions.
@hankcs
Owner

hankcs commented Sep 29, 2020

Thanks for the feedback. 1.x indeed does not account for 32-bit supplementary characters. 1.x is built on the double array trie data structure, whose transition function operates on 16-bit chars as its basic unit. To support them, every char would need a check like if (!Character.isLowSurrogate(ch) && !Character.isHighSurrogate(ch)) to decide whether to make one transition or two. That is a considerable burden in both design and runtime efficiency. Any good ideas?
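
For concreteness, a minimal sketch of what that per-position check might look like when scanning a string (my own illustration, not HanLP internals; the class and method names are hypothetical):

// Hypothetical sketch of the per-position check, not HanLP code.
public class SurrogateScan {
    static void scan(String text) {
        for (int i = 0; i < text.length(); ) {
            char ch = text.charAt(i);
            if (Character.isHighSurrogate(ch) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // supplementary character: two chars, one code point, would need two trie transitions
                System.out.printf("U+%X (surrogate pair)%n", Character.toCodePoint(ch, text.charAt(i + 1)));
                i += 2;
            } else {
                // ordinary BMP character: one char, one trie transition
                System.out.printf("U+%X%n", (int) ch);
                i += 1;
            }
        }
    }

    public static void main(String[] args) {
        scan("汉\uD87E\uDE1B");  // prints U+6C49, then U+2FA1B (surrogate pair)
    }
}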

As for this particular character, its code point in CJK Unified Ideographs is 9F16, which HanLP does support:

System.out.println(HanLP.convertToPinyinList("\u9F16"));

If the part of CJK Compatibility Ideographs Supplement that overlaps with CJK Unified Ideographs were converted to the latter, and the remaining characters replaced with whitespace, Chinese text would basically be fine. Such a conversion could be built as a mapping from the two code charts and applied on demand; that seems like the better approach?
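
A rough sketch of that idea (my own illustration; CompatNormalizer and the single map entry are hypothetical, and a real table would be generated from the Unicode code charts):

import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-normalization: map code points from CJK Compatibility Ideographs Supplement
// to their CJK Unified Ideographs equivalents, and blank out anything else above the BMP.
public class CompatNormalizer {
    private static final Map<Integer, Integer> COMPAT_TO_UNIFIED = new HashMap<>();
    static {
        COMPAT_TO_UNIFIED.put(0x2FA1B, 0x9F16);  // 鼖; the real table would cover the whole block
    }

    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFFFF) {
                sb.appendCodePoint(cp);                                   // BMP characters pass through
            } else {
                sb.appendCodePoint(COMPAT_TO_UNIFIED.getOrDefault(cp, (int) ' '));  // map or blank out
            }
        });
        return sb.toString();
    }
}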

@hurui200320
Author

I ran into this supplementary character myself tonight while parsing an IDS glyph description file. The first half of my project uses HanLP to process the Chinese Wikipedia dump into a sentence-level corpus, mainly relying on simplified/traditional conversion, perceptron word segmentation, perceptron named entity recognition, and pinyin annotation. After hitting this problem today, I went back to check whether the earlier processing, which did not account for this case, might cause issues. So I tested how HanLP reacts, and on the whole it does not seem like a big problem, because characters in these extension blocks are unlikely to form meaningful words or entity names. Such characters matter little for semantics, named entity recognition, or other Chinese NLP tasks, and treating them as foreign text or unknown is perfectly reasonable.

Also, looking more closely just now, although the Unicode values differ, the glyphs are indeed the same character. In the IDS descriptions, the only difference between U+9F16 and U+2FA1B is ⿱卉鼓 versus ⿱𠦄鼓. Beyond that there are quite a few similar duplicate characters. I also just verified that the character typed with Microsoft Pinyin is U+9F16, so I think it is unlikely that extension-block characters get typed into Chinese Wikipedia. If there really is a need, it could be left to the user to ensure that every input character avoids supplementary characters as far as possible (by converting or removing them).

Finally, thank you for taking the time out of your busy schedule to reply.

That said, the IDS documentation does say that when two characters share the same description, the one with the smaller Unicode value should be kept, so maybe it is a bug on their side; I will go open an issue there (runs away).

@hurui200320
Author

hurui200320 commented Oct 2, 2020

It just occurred to me that code points might be a way to solve this. As mentioned earlier, the basic unit of the trie transitions is char; the link about supplementary characters above mentions that the JSR 204 expert group ultimately decided to keep char as it is, and instead added a codePoints() method to String, which converts a string into an IntStream where each int is in fact a 32-bit Unicode code point. Converting external Strings is somewhat cumbersome, but internally a single-char character is simply handled as an int, while a surrogate pair is automatically combined into one int. For the reverse direction, Java's low-level APIs can also convert a code point back into the corresponding one or two chars.

So it would only take changing the basic type of the internal implementation from char to int and adding a String-to-code-point conversion at the entry points. I feel the change is not large and is most likely feasible.
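
A minimal illustration of that round trip using only standard JDK calls (my own sketch, not HanLP code):

public class CodePointRoundTrip {
    public static void main(String[] args) {
        String text = "汉\uD87E\uDE1B";                  // one BMP char plus one supplementary char
        int[] codePoints = text.codePoints().toArray();  // [0x6C49, 0x2FA1B] -> length 2, not 3
        System.out.println(codePoints.length);

        // Converting back: Character.toChars yields one or two chars per code point.
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.append(Character.toChars(cp));
        }
        System.out.println(sb.toString().equals(text));  // true
    }
}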

The duplicate-character problem would still have to be handled by the user, as discussed before. Switching to code points as the basic unit of internal processing merely adds better support for rare characters; judging from yesterday's filtering of the Wikipedia corpus, those rare characters are pretty much limited to ancient personal names, place names, or particularly hard-to-write element names. If HanLP intends to support them in the future, or to let users train models such as named entity recognition for ancient texts, it might be worth a try.

Also, I never expected the two characters 𩽾𩾌 to be supplementary characters.

@hurui200320 hurui200320 reopened this Oct 2, 2020
@hanlpbot
Collaborator

hanlpbot commented Oct 2, 2020

This issue has been mentioned on 蝴蝶效应. There might be relevant details there:

https://bbs.hankcs.com/t/topic/2777/1

@hankcs
Owner

hankcs commented Oct 2, 2020

codePoints is indeed a good approach, and it clearly has internal efficiency optimizations; I measured it at roughly 100 million characters per second. Stacking this overhead onto HanLP's fastest tokenizer, which runs at 40 million characters per second, brings the final speed down to about 20 million per second:

// Benchmark: measure the overhead of text.codePoints() on top of SpeedTokenizer.
String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
HanLP.Config.ShowTermNature = false;
System.out.println(SpeedTokenizer.segment(text));
long start = System.currentTimeMillis();
int pressure = 1000000;
for (int i = 0; i < pressure; ++i)
{
    text.codePoints().toArray();   // the extra code point conversion
    SpeedTokenizer.segment(text);  // the segmentation itself
}
double costTime = (System.currentTimeMillis() - start) / (double) 1000;
System.out.printf("SpeedTokenizer speed: %.2f chars per second\n", text.length() * pressure / costTime);

It is certainly feasible, but the following costs need to be considered:

  1. codePoints was introduced in Java 8, while HanLP supports Java 6; adopting it would drop support for Java 6 and Java 7.
  2. In the worst case, speed drops by half.

As for whether to adopt it, I have started a poll on the forum to gather the opinions of the wider user base.

@hankcs
Owner

hankcs commented Jan 31, 2021

Good news: I realized that, in principle, HanLP's underlying data structures (the double array trie and so on) do support characters encoded as multiple chars. As long as the words containing such characters are in the dictionary, they are matched normally; the only change needed is that the business logic must treat the length of such a character as 1. With the patch above, this bug has been fully resolved, with no loss of speed or of Java 6 support.
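
In other words (my own illustration of the principle, not the actual patch), the length used by the business logic is counted in code points rather than chars:

public class LengthDemo {
    public static void main(String[] args) {
        String word = "\uD87E\uDE1B";  // 鼖 as it might appear in a dictionary entry
        System.out.println(word.length());                          // 2 chars
        System.out.println(word.codePointCount(0, word.length()));  // 1, the length the business logic should use
    }
}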

@hankcs hankcs closed this as completed Jan 31, 2021