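After adding English words to the custom dictionary, segmentation throws an error when a dictionary English word is directly followed by other English text. #2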

Closed
timlincool opened this issue Apr 2, 2015 · 1 comment

@timlincool

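After adding English words to the custom dictionary, segmentation throws an error when a dictionary English word is directly followed by other English text.
I first assumed my custom part-of-speech tag was the cause, but the error persists even after changing the tag to n.

The dictionary contains: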
BENQ n 1024
BENTLEY n 1024

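Input: "BENQphone"
Segmented with the standard tokenizer, HanLP.segment(text).

With debug enabled, the output is as follows (粗分词网 is the coarse word net):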

粗分词网:
0:[ ]
1:[BENQ]
2:[ENQphone]
3:[]
4:[]
5:[]
6:[]
7:[]
8:[]
9:[]
10:[ ]
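
The word net above can be checked for a complete path with a few lines of plain Java. This is a minimal sketch and not HanLP code: the class `WordNetCheck` and its methods are hypothetical, and it assumes that each word in row i leads to row i + word length, with the sentinels in rows 0 and 10 counting as one character each.

```java
import java.util.ArrayList;
import java.util.List;

final class WordNetCheck {
    /** Rebuilds the coarse word net printed in the debug output for "BENQphone". */
    static List<List<String>> issueWordNet() {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i <= 10; i++) {
            rows.add(new ArrayList<>());
        }
        rows.get(0).add(" ");         // start sentinel (assumed length 1)
        rows.get(1).add("BENQ");      // dictionary word
        rows.get(2).add("ENQphone");  // stray entry seen in the debug output
        rows.get(10).add(" ");        // end sentinel
        return rows;
    }

    /** Marks every row that can be reached from row 0 by following words. */
    static boolean[] reachableRows(List<List<String>> rows) {
        boolean[] reachable = new boolean[rows.size()];
        reachable[0] = true;
        for (int i = 0; i < rows.size(); i++) {
            if (!reachable[i]) {
                continue;
            }
            for (String word : rows.get(i)) {
                int next = i + word.length();
                if (next < rows.size()) {
                    reachable[next] = true;
                }
            }
        }
        return reachable;
    }

    /** Returns the first reachable row (before the end row) with no words, or -1. */
    static int firstDeadEnd(List<List<String>> rows) {
        boolean[] reachable = reachableRows(rows);
        for (int i = 0; i < rows.size() - 1; i++) {
            if (reachable[i] && rows.get(i).isEmpty()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<List<String>> rows = issueWordNet();
        System.out.println("end reachable: " + reachableRows(rows)[10]);
        System.out.println("first dead end: row " + firstDeadEnd(rows));
    }
}
```

Under these assumptions, BENQ in row 1 leads to row 5, which is empty, and ENQphone in row 2 is never reached from the start, so no path connects row 0 to row 10.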

The following error is thrown:

Exception in thread "main" java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:142)
at com.hankcs.hanlp.seg.HiddenMarkovModelSegment.convert(HiddenMarkovModelSegment.java:238)
at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:50)
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:144)
at com.hankcs.hanlp.tokenizer.StandardTokenizer.segment(StandardTokenizer.java:39)
at com.hankcs.hanlp.HanLP.segment(HanLP.java:354)

The cause is at line 47 of com.hankcs.hanlp.seg.Viterbi.ViterbiSegment:
List vertexList = viterbi(wordNetAll);

which returns:
vertexList =[ ]
vertexList.size() = 1
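
The exception message is consistent with a result list being allocated with a capacity of size minus 2, that is, the path without its two sentinels. The source of convert is not shown here, so that is an assumption. The sketch below uses hypothetical names (`CapacityDemo`, `stripSentinelsUnsafe`, `stripSentinelsGuarded`) to reproduce the same message with a one-element list and to show one possible guard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CapacityDemo {
    /**
     * Hypothetical stand-in for the conversion step: drops the first and last
     * sentinel. Assumes the path has at least two elements, which is exactly
     * what fails when the path has size 1.
     */
    static List<String> stripSentinelsUnsafe(List<String> path) {
        // With path.size() == 1 this requests a capacity of -1.
        List<String> result = new ArrayList<>(path.size() - 2);
        for (int i = 1; i < path.size() - 1; i++) {
            result.add(path.get(i));
        }
        return result;
    }

    /** Same step, but fails with a descriptive message on an incomplete path. */
    static List<String> stripSentinelsGuarded(List<String> path) {
        if (path.size() < 2) {
            throw new IllegalStateException("word net has no complete path: " + path);
        }
        return new ArrayList<>(path.subList(1, path.size() - 1));
    }

    public static void main(String[] args) {
        try {
            stripSentinelsUnsafe(Arrays.asList(" "));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Illegal Capacity: -1"
        }
        System.out.println(stripSentinelsGuarded(Arrays.asList(" ", "BENQ", " ")));
    }
}
```

A guard like this only makes the failure easier to read. The underlying problem is that the path search returned a path of size 1 in the first place.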

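However, with the input "BENQBENTLEYphone"

no error is thrown, but the result is not what I want (in the debug output below, 人名角色观察 is the person-name role observation and 人名角色标注 is the person-name role tagging):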

人名角色观察:[ A 42634591 ][BENQ A 42634591 ][B L 3 ][ENTLEYphone A 42634591 ][ A 42634591 ]
人名角色标注:[ /A ,BENQ/A ,B/L ,ENTLEYphone/A , /A]
[BENQ/n, B/nx, ENTLEYphone/nx]

How should I change things so that English words can be added to the dictionary?

@hankcs
Owner

hankcs commented Apr 2, 2015

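Thanks for the report. This triggered a deeply hidden bug, which is now fixed.

A brief explanation of the underlying cause:
GenerateWordNet in com/hankcs/hanlp/seg/HiddenMarkovModelSegment.java is the method that builds the word net from the dictionary and from atom segmentation. I am not currently satisfied with the efficiency of AtomSegment, so as an optimization I try to call it as little as possible. At one point I over-optimized, and as a result the word net could no longer form a connected graph.

After adjusting and testing, I believe this problem is fixed.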
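
For readers who want a concrete picture of what a connected word net requires, here is a minimal sketch of one way to fill gaps. It is not the actual fix committed to HanLP. The class `WordNetRepair` and the method `fillGaps` are hypothetical, the fallback text stands in for what an atom segmenter would produce, and the rows are assumed to have exactly text.length() + 2 entries with one-character sentinels at both ends.

```java
import java.util.ArrayList;
import java.util.List;

final class WordNetRepair {
    /** Rebuilds the word net from the issue's debug output for "BENQphone". */
    static List<List<String>> issueWordNet() {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i <= 10; i++) {
            rows.add(new ArrayList<>());
        }
        rows.get(0).add(" ");         // start sentinel (assumed length 1)
        rows.get(1).add("BENQ");
        rows.get(2).add("ENQphone");
        rows.get(10).add(" ");        // end sentinel
        return rows;
    }

    /**
     * Walks the rows in order. Whenever a reachable row has no words, it adds
     * the text up to the next non-empty row as a fallback word, so that every
     * reachable row has a way forward. Returns true if the end row is reachable.
     */
    static boolean fillGaps(String text, List<List<String>> rows) {
        int last = rows.size() - 1;
        boolean[] reachable = new boolean[rows.size()];
        reachable[0] = true;
        for (int i = 0; i < last; i++) {
            if (!reachable[i]) {
                continue;
            }
            if (rows.get(i).isEmpty()) {
                int j = i + 1;
                while (j < last && rows.get(j).isEmpty()) {
                    j++;
                }
                // Row i corresponds to character index i - 1 of the text.
                rows.get(i).add(text.substring(i - 1, j - 1));
            }
            for (String word : rows.get(i)) {
                int next = i + word.length();
                if (next <= last) {
                    reachable[next] = true;
                }
            }
        }
        return reachable[last];
    }

    public static void main(String[] args) {
        List<List<String>> rows = issueWordNet();
        System.out.println("connected: " + fillGaps("BENQphone", rows));
        System.out.println("row 5: " + rows.get(5));
    }
}
```

On the word net from this issue, the sketch adds "phone" to row 5, which gives the path BENQ, phone from the start sentinel to the end sentinel. It only calls the fallback on reachable empty rows, which keeps the number of fallback calls small without leaving dead ends.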

@hankcs hankcs closed this as completed Apr 2, 2015
@hankcs hankcs added the bug label Apr 2, 2015
hankcs pushed a commit that referenced this issue Oct 3, 2017