Stopword dictionary loading problem #530

cicido · 2017-05-12T02:44:16Z

I am using HanLP 1.3.0. While reading CoreStopWordDictionary.java, I found the following dictionary-loading statement:
dictionary = new StopWordDictionary(new File(HanLP.Config.CoreStopWordDictionaryPath));

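The core dictionary, the user-defined dictionaries, and the others are all loaded in the following way. Taking the core dictionary as an example, in CoreDictionary.java:
br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8"));
This goes through the unified IOUtil interface.
StopWordDictionary, however, uses File directly, which is inconsistent. Would you consider making CoreStopWordDictionary consistent with the rest?
The reason I ask is that I defined my own JarIOAdapter.java: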
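import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import com.hankcs.hanlp.corpus.io.IIOAdapter;

public class JarIOAdapter implements IIOAdapter
{
    @Override
    public InputStream open(String path) throws FileNotFoundException
    {
        /*
         Loading resources with the first line below fails in a distributed environment,
         so the second line is used instead.
         */
        //return ClassLoader.getSystemClassLoader().getResourceAsStream(path);
        return JarIOAdapter.class.getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws FileNotFoundException
    {
        return new FileOutputStream(path);
    }
}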
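The goal is to separate the code from the dictionary data by packaging hanlp.properties and the data directory into a jar of their own. Because CoreStopWordDictionary.java does not read files through the unified interface, the stopword dictionary file cannot be found.
Does the author intend to split the code and the dictionary data into two jars? I have nearly finished this on my side and can submit the code.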
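
To illustrate why a stream-based interface matters here, the following is a minimal sketch, not HanLP's actual code. `StreamOpener` is a hypothetical stand-in for an adapter interface, and the one-word-per-line file format is an assumption. The loader only ever sees an `InputStream`, so it does not care whether the bytes come from the filesystem, a jar on the classpath, or memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWordLoadingSketch {
    /** Hypothetical stand-in for an adapter interface; not HanLP's actual type. */
    interface StreamOpener {
        InputStream open(String path) throws IOException;
    }

    /** Reads one stop word per line (assumed format) from the stream the opener provides. */
    static Set<String> loadStopWords(StreamOpener opener, String path) throws IOException {
        InputStream in = opener.open(path);
        if (in == null) {
            // A class-loader lookup returns null rather than throwing when the resource is missing.
            throw new IOException("Resource not found: " + path);
        }
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line); // skip blank lines
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // In-memory opener used only for demonstration; a jar-based opener could
        // instead return SomeClass.class.getClassLoader().getResourceAsStream(path).
        StreamOpener inMemory = path ->
                new ByteArrayInputStream("the\nof\n\nand\n".getBytes(StandardCharsets.UTF_8));
        // The path below is a hypothetical example.
        System.out.println(loadStopWords(inMemory, "data/dictionary/stopwords.txt"));
    }
}
```

A loader that takes a `java.io.File` instead cannot be redirected this way, which is the inconsistency described above.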


cicido · 2017-05-12T03:25:58Z

After analyzing the code, I found that the real problem is in MDAG.java:
public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));

Changing the original IOAdapter.open(dataFile.getAbsolutePath()) to IOAdapter.open(dataFile.getPath()) fixes it.
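
The difference between the two calls can be seen with plain `java.io.File`. The sketch below uses a hypothetical relative path and a hypothetical helper, `resourceName`. `getPath()` returns the path as it was given, while `getAbsolutePath()` resolves it against the current working directory. A class-loader lookup expects a `/`-separated name relative to the classpath root, so an absolute filesystem path will not match a resource inside a jar.

```java
import java.io.File;

public class PathDemo {
    /** Hypothetical helper: the name a class-loader lookup would receive for the given file. */
    static String resourceName(File dataFile) {
        // getPath() keeps the path as given; normalize separators because
        // class-loader resource names always use '/'.
        return dataFile.getPath().replace(File.separatorChar, '/');
    }

    public static void main(String[] args) {
        File dataFile = new File("data/dictionary/stopwords.txt"); // hypothetical path

        // Relative, as configured: suitable as a classpath resource name.
        System.out.println(resourceName(dataFile));

        // Resolved against the working directory: a filesystem location,
        // which a jar-based adapter cannot look up.
        System.out.println(dataFile.getAbsolutePath());
    }
}
```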

hankcs · 2017-05-14T03:01:10Z

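Thanks for the suggestion.

The version you are using is too old; the latest version already handles this correctly: https://github.com/hankcs/HanLP/blob/master/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAG.java#L171
Splitting into two jars is a good proposal, but it needs more thought. The purpose of the portable version is to let beginners get started quickly and Maven users deploy quickly. Experienced users, however, usually customize the configuration file to implement their own features, and because configuration files differ from person to person, they are not suitable for bundling into the Maven jar. Switching to two jars could also cause trouble for beginners: the data and program versions can easily get out of sync and cause problems.
You could do it as a plugin; many users do like putting the data inside a jar. I will actively support it, including recommending it in the wiki.
Any further comments are welcome.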

cicido · 2017-05-15T03:11:23Z

My version is 1.3.2; I wrote 1.3.0 above by mistake.
Also, what I posted earlier for MDAG.java is exactly this:
public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));
To load the dictionary data from a jar, I changed IOAdapter.open(dataFile.getAbsolutePath()) to IOAdapter.open(dataFile.getPath()).
I wrote up the whole process on oschina:
https://my.oschina.net/u/940663/blog/898790

hankcs · 2017-05-16T02:54:04Z

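Thanks for the suggestion. Constructing an MDAG from a File parameter is indeed incompatible with InputStream. It has now been changed to read directly from the InputStream opened by the IOAdapter; please test it.
If there are still problems, feel free to reopen the issue.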

hankcs added the improvement label May 14, 2017

hankcs added a commit that referenced this issue May 16, 2017

Adapt CoreStopWordDictionary to IOAdapter in cluster environments: #530

06686dd

hankcs closed this as completed May 16, 2017

hankcs added the bug label May 16, 2017

hankcs mentioned this issue May 18, 2017

About the data path defined in hanlp.properties #380

Closed
