Question about custom dictionary paths #419

Closed
420672771 opened this issue Mar 6, 2017 · 7 comments

@420672771

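The configuration format that hankcs gives for custom dictionaries looks like this:
data/dictionary/custom/CustomDictionary.txt;CompanyName.txt;school.txt
In practice, though, the extra dictionaries are not read with this setting: at runtime the program looks for CompanyName.txt directly under the root path.
After changing it to:
data/dictionary/custom/CustomDictionary.txt;data/dictionary/custom/CompanyName.txt;data/dictionary/custom/school.txt
with the full path spelled out for every file, the dictionaries are read. I can't tell whether this is a bug or whether I have misunderstood something, so any guidance is appreciated.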

@yesseecity

  • data/dictionary/custom/CustomDictionary.txt;CompanyName.txt;school.txt

  • data/dictionary/custom/CustomDictionary.txt; CompanyName.txt; school.txt
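    The second form has a space after each ;, which refers to a file in the same directory as data/dictionary/custom/CustomDictionary.txt.

The first form has no space after the ;, so each entry is treated as a full path in its own right.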

@hankcs hankcs added the question label Mar 9, 2017
@cicido

cicido commented Mar 13, 2017

The source code is as follows:
String[] pathArray = p.getProperty("CustomDictionaryPath", "data/dictionary/custom/CustomDictionary.txt").split(";");
String prePath = root;
for (int i = 0; i < pathArray.length; ++i)
{
    if (pathArray[i].startsWith(" "))
    {
        // Leading space: reuse the directory of the most recent full path
        pathArray[i] = prePath + pathArray[i].trim();
    }
    else
    {
        // No leading space: resolve against root and remember the directory
        pathArray[i] = root + pathArray[i];
        int lastSplash = pathArray[i].lastIndexOf('/');
        if (lastSplash != -1)
        {
            prePath = pathArray[i].substring(0, lastSplash + 1);
        }
    }
}
CustomDictionaryPath = pathArray;
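
To see what the quoted logic does with the two configurations from this thread, here is a minimal self-contained sketch. It copies the loop into a standalone method; the class name `CustomDictionaryPathDemo`, the method `resolve`, and the root `/opt/hanlp/` are made up for illustration and are not part of HanLP.

```java
import java.util.Arrays;

public class CustomDictionaryPathDemo {

    /**
     * Expands a ';'-separated path list using the same rule as the quoted
     * snippet: an entry with a leading space reuses the directory of the
     * most recent entry without one; any other entry is appended to root.
     * (Hypothetical helper, not a HanLP API.)
     */
    static String[] resolve(String root, String configValue) {
        String[] pathArray = configValue.split(";");
        String prePath = root;
        for (int i = 0; i < pathArray.length; ++i) {
            if (pathArray[i].startsWith(" ")) {
                pathArray[i] = prePath + pathArray[i].trim();
            } else {
                pathArray[i] = root + pathArray[i];
                int lastSlash = pathArray[i].lastIndexOf('/');
                if (lastSlash != -1) {
                    prePath = pathArray[i].substring(0, lastSlash + 1);
                }
            }
        }
        return pathArray;
    }

    public static void main(String[] args) {
        String root = "/opt/hanlp/"; // assumed root, for illustration only

        // No space after ';': each entry is resolved against root.
        System.out.println(Arrays.toString(resolve(root,
                "data/dictionary/custom/CustomDictionary.txt;CompanyName.txt;school.txt")));

        // Space after ';': entries inherit the previous file's directory.
        System.out.println(Arrays.toString(resolve(root,
                "data/dictionary/custom/CustomDictionary.txt; CompanyName.txt; school.txt")));
    }
}
```

With the first value, CompanyName.txt resolves to /opt/hanlp/CompanyName.txt, which matches the behavior reported at the top of the issue. With the second, it resolves to /opt/hanlp/data/dictionary/custom/CompanyName.txt.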

@cicido

cicido commented Mar 13, 2017

I don't quite understand why the space needs special handling here. Why make a single configuration item so complicated?

@hankcs
Owner

hankcs commented Mar 20, 2017

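@cicido It isn't complicated; it is just several paths in one configuration item. A leading space means the file is in the same directory as the previous file. Without this handling, the path list would get extremely long. In fact, if you find the configuration file a hassle, you can do without it entirely and set your own configuration directly with HanLP.Config.key = value. That flexibility was built in from the very beginning; people may simply not have noticed it.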
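
As a rough illustration of the HanLP.Config.key = value approach, the sketch below builds the array of dictionary paths in plain Java. The helper `inDirectory` and the class name are hypothetical. The assignment to `HanLP.Config.CustomDictionaryPath` is left as a comment so the example runs without HanLP on the classpath. It assumes that directly assigned paths must already be complete, because the snippet quoted earlier prepends root only while parsing the properties value; the thread does not confirm this.

```java
import java.util.Arrays;

public class ProgrammaticDictionaryPaths {

    /**
     * Prefixes each file name with the given directory, so no path has to
     * be typed out in full by hand. (Hypothetical helper, not a HanLP API.)
     */
    static String[] inDirectory(String directory, String... fileNames) {
        String[] paths = new String[fileNames.length];
        for (int i = 0; i < fileNames.length; ++i) {
            paths[i] = directory + fileNames[i];
        }
        return paths;
    }

    public static void main(String[] args) {
        String[] paths = inDirectory("data/dictionary/custom/",
                "CustomDictionary.txt", "CompanyName.txt", "school.txt");

        // Assumption: with HanLP 1.x available, the array would be assigned
        // before the custom dictionary is first used, for example:
        // HanLP.Config.CustomDictionaryPath = paths;

        System.out.println(Arrays.toString(paths));
    }
}
```

Building the array in code avoids both the long semicolon-separated string and the leading-space convention.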

@cicido

cicido commented Mar 20, 2017

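I originally assumed there wouldn't be many custom dictionaries. On reflection, there will probably be more and more of them. So far I have added the segmentation dictionaries from jieba and scws; their part-of-speech tags are consistent, and I gave every entry a default frequency of 2. I expect to add more custom dictionaries later. Configured this way, the setting really does stay shorter.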

@420672771
Author

Understood.

@hankcs
Owner

hankcs commented Jan 1, 2020

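Thank you for supporting HanLP1.x. I have always been sorry that I don't have time to reply to every issue, and I hope the problem you raised has been resolved. Alternatively, you may find the answer in 《自然语言处理入门》 (Introduction to Natural Language Processing).

Time flies, and HanLP1.x thanks you for your company along the way. On December 31, 2019 (Eastern Standard Time), I released the last HanLP1.x version of the past decade, codenamed 最后的武士 (The Last Samurai). From here on, the 1.x branch will receive stability maintenance, but it is not the focus of future development.

On the occasion of New Year 2020, I am pleased to announce the release of HanLP2.0. The vision for HanLP2.0 is cutting-edge NLP technology for the next decade. To that end, HanLP2.0 implements state-of-the-art deep learning models in TensorFlow2.0, supports downstream NLP tasks through a carefully designed framework, and achieves state-of-the-art accuracy on large corpora. As the first alpha release, HanLP 2.0.0a0 supports word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic dependency parsing, and text classification. These features are not limited to Chinese; they are designed for all human languages. HanLP2.0 provides many pretrained models, and end users need only two lines of code to deploy them, so putting deep learning into production is no longer difficult. For more details, please watch the HanLP2.0 introduction video or join the discussion on the forum.

Looking ahead, HanLP2.0 will carry forward the efficient, pragmatic style inherited from the 1.x era while pushing toward cutting-edge research, serving as an amphibious warship for both industry and academia. I look forward to your continued feedback. Thank you.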

@hankcs hankcs closed this as completed Jan 1, 2020
@hankcs hankcs added ignored and removed question labels Jan 1, 2020