Tokenizer splitHyphenated regression #1289
Comments
My man I do not see any issue here
|
That is what happens if I use v4.4.0 via |
Well that's strange. Maybe some library interference? I've tried isolating the error as best I can, and still get it:
# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar
$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP
$ cat foo.java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.*;
import java.util.stream.*;
public class foo {
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}
$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
$ "$JAVA_HOME/bin/java" foo
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]
Maybe it's an Antlr version issue? We have Antlr Runtime 4.7.2 |
One step closer: apparently if I remove |
Blarg, the decompiler is barfing on Consider the block of code in the
    Properties prop = StringUtils.stringToProperties(options);
    Set<Map.Entry<Object,Object>> props = prop.entrySet();
    for (Map.Entry<Object,Object> item : props) {
      String key = (String) item.getKey();
      String value = (String) item.getValue();
      boolean val = Boolean.parseBoolean(value);
      if ("".equals(key)) {
        // allow an empty item
        //...
      } else if ("ptb3Escaping".equals(key)) {
        //...
        splitHyphenated = ! val;
        //...
      } else if ("ud".equals(key)) {
        //...
        splitHyphenated = val;
        //...
      } else if ("splitHyphenated".equals(key)) {
        splitHyphenated = val;
      }
If I inspect Are |
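To make the order dependence in that loop concrete, here is a minimal sketch. It is not CoreNLP code: `OptionOrderDemo` and `resolveSplitHyphenated` are hypothetical names, only the three branches quoted above are modeled, and the initial value of `splitHyphenated` is assumed to be `false`. A `LinkedHashMap` stands in for the `Properties` object so the iteration order can be controlled by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOrderDemo {
    // Hypothetical stand-in for the option loop quoted above.
    // Only the effect of each key on splitHyphenated is modeled.
    public static boolean resolveSplitHyphenated(Map<String, String> options) {
        boolean splitHyphenated = false; // assumed default, for illustration only
        for (Map.Entry<String, String> item : options.entrySet()) {
            String key = item.getKey();
            boolean val = Boolean.parseBoolean(item.getValue());
            if ("ptb3Escaping".equals(key)) {
                splitHyphenated = !val;   // ptb3Escaping=true switches splitting off
            } else if ("ud".equals(key)) {
                splitHyphenated = val;
            } else if ("splitHyphenated".equals(key)) {
                splitHyphenated = val;
            }
        }
        return splitHyphenated;
    }

    public static void main(String[] args) {
        // Same two keys, visited in two different orders.
        Map<String, String> splitFirst = new LinkedHashMap<>();
        splitFirst.put("splitHyphenated", "true");
        splitFirst.put("ptb3Escaping", "true");

        Map<String, String> escapingFirst = new LinkedHashMap<>();
        escapingFirst.put("ptb3Escaping", "true");
        escapingFirst.put("splitHyphenated", "true");

        System.out.println(resolveSplitHyphenated(splitFirst));    // false
        System.out.println(resolveSplitHyphenated(escapingFirst)); // true
    }
}
```

Whichever key the loop visits last wins, so the same option string can give either result when the container does not guarantee an iteration order.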
Well this might wind up being horrible. I tried on a couple different Java
8 installs and got the desired behavior in both, but with a Java 11 and a
Java 14 install I got the same error you did. What Java version are you
running?
Maybe the string hash function changed between versions, and thus the keys
are iterated in a different order? I guess the simplest fix in that case
would be to make the later keys override the earlier ones in a
deterministic order.
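One way to get a deterministic order is to parse the option string directly into an insertion-ordered map, so that keys are visited exactly as written. The sketch below is only an illustration of that idea, not the fix that went into CoreNLP: `OrderedOptions` and `parse` are hypothetical names, and treating a bare key such as `invertible` as `true` is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedOptions {
    // Hypothetical parser: turns "k1=v1,k2,k3=v3" into a map whose
    // iteration order matches the order in which the keys were written.
    public static Map<String, String> parse(String options) {
        Map<String, String> ordered = new LinkedHashMap<>();
        for (String item : options.split(",")) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue; // allow an empty item
            }
            int eq = trimmed.indexOf('=');
            String key = eq < 0 ? trimmed : trimmed.substring(0, eq).trim();
            // Assumption: a key without "=value" means "true".
            String value = eq < 0 ? "true" : trimmed.substring(eq + 1).trim();
            // Remove first so a repeated key moves to its latest position.
            ordered.remove(key);
            ordered.put(key, value);
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, String> opts =
            parse("splitHyphenated=true,invertible,ptb3Escaping=true");
        System.out.println(opts.keySet()); // [splitHyphenated, invertible, ptb3Escaping]
    }
}
```

Note that under a strict "later key wins" rule, the option string from the report would end with `ptb3Escaping=true` and so would still turn splitting off; `splitHyphenated=true` would have to be written after `ptb3Escaping=true` to take effect.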
|
I am certain now that it is the key order in the Properties object causing this problem. While we come up with some sort of fix, in the meantime you could always set the |
So, to what extent is this an issue where you would need a quick fix, versus being able to work around it (such as by setting the appropriate option in the Lexer after creating it) until the next release is made? |
The fix for the tokenizer is now in |
The following snippet of code seems to correctly split on the hyphen in "year-end" in 3.9.2, but no longer in 4.4.0. Is this expected behavior?
Old output:
[year, -, end]
New output:
[year-end]