Add encoding detection callback #2788

Egor18 · 2018-11-27T14:46:20Z

This PR provides ability to specify user-defined callback, which could be used to detect encoding for each file separately. See #2781.

For example I use juniversalchardet lib:

private Charset detectEncoding(byte[] fileBytes) {
   final int maxLength = 4096;
   UniversalDetector detector = new UniversalDetector(null);
   detector.handleData(fileBytes, 0, Math.min(fileBytes.length, maxLength));
   detector.dataEnd();
   String encoding = detector.getDetectedCharset();
   detector.reset();
   if (encoding != null) {
       return Charset.forName(encoding);
   }
   return Charset.forName("UTF-8");
}

Spoon api:

final Launcher launcher = new Launcher();
launcher.getEnvironment().setEncodingDetectionCallback(bytes -> detectEncoding(bytes));
// ...
CtModel model = launcher.buildModel();

It helps to have correct AST in projects with mixed encodings.

monperrus · 2018-11-27T15:23:57Z

src/test/java/spoon/test/compilationunit/TestCompilationUnit.java

+				.findFirst()
+				.get();
+
+		assertEquals("\"Привет мир\"", utf8Type.getField("s1").getAssignment().toString());


I love this expected value :-)

If you don't read Cyrillic, it means "Hello World" :)

pvojtechovsky · 2018-11-27T17:31:58Z

I would like to have available file name+path in encoding detection callback.
Because for example in Eclipse IDE the encoding of source files can be defined individually using property file with properties like:

encoding//src/main/java/spoon/support/visitor/replace/ReplacementVisitor.java=ISO-8859-1

So I can imagine that many clients might be able to select encoding depending on the file name or file path. Of course bytes of file might be good input too!

WDYT?

But before we design the final API for encoding detection, please note this bug in Spoon:

class VirtualFile implements SpoonFile {
...
	@Override
	public InputStream getContent() {
		return new ByteArrayInputStream(content.getBytes());
	}
...
}

here we convert String value to bytes using system dependent encoding ... and later we convert bytes back to chars using another encoding.
It would be nicer to be able to directly return String or char[] from the SpoonFile or at least from VirtualFile.

... but how to mix it together with encoding detector?? ;-)

@surli @monperrus Do you agree to refactor the related file API here?

pvojtechovsky · 2018-11-27T18:15:32Z

interface SpoonFile {
  default char[] getContentChars(Environment env) {
    ... move loading code from `FileCompilerConfig#initializeCompiler` here...
  }
}
class VirtualFile implements SpoonFile {
...
	@Override
	public char[] getContentChars(Environment env) {
		return content.toCharArray();
	}
...
}

WDYT?

Egor18 · 2018-11-28T06:28:06Z

I would like to have available file name+path in encoding detection callback.

We can use BiFunction<byte[], File, Charset>.

pvojtechovsky · 2018-11-28T07:40:10Z

The File should be replaced by SpoonFile, because there is no File in case of accessing of jar content.

And I personally would prefer new interface FileEncodingProvider or FileEncodingProvider or ... instead of BiFunction.

Egor18 · 2018-11-28T13:58:39Z

So, as far as I understand you want me to add char[] getContentChars(Environment env) method to the SpoonFile interface and then implement it in VirtualFile, ZipFile and FileSystemFile. Then we could use getContentChars instead of getContent in initializeCompiler like this:

for (SpoonFile f : getFiles(compiler)) {

    if (compiler.filesToBeIgnored.contains(f.getPath())) {
        continue;
    }

    String fName = f.isActualFile() ? f.getPath() : f.getName();
    Environment env = jdtCompiler.getEnvironment();
    cuList.add(new CompilationUnit(f.getContentChars(env), fName, f.getDetectedEncoding(env)));
}

Am I right?

As you can see CompilationUnit constructor takes encoding as an argument too, so I have to add something like Charset getDetectedEncoding(env) to the SpoonFile interface as well. So detectedEncoding would memorize the encoding, which was detected during the last callbak call.

WDYT?

pvojtechovsky · 2018-11-28T14:48:04Z

and then implement it in VirtualFile, ZipFile and FileSystemFile

if you add "default" implementation in SpoonFile, then it is enough to implement special code into VirtualFile only.

am I right

yes

CompilationUnit constructor takes encoding as an argument too

I believe that this encoding is ignored (The compilation unit has no access to origin byte[] so encoding is useless.), so we can simply provide default encoding from environment.

Egor18 · 2018-11-29T08:30:41Z

@pvojtechovsky, I refactored the API as you suggested.
Note: I had to make a small change in FileSystemFolderTest.java.

pvojtechovsky · 2018-11-29T08:36:55Z

src/main/java/spoon/compiler/SpoonFile.java

+		byte[] bytes;
+		try {
+			InputStream contentStream = getContent();
+			bytes = new byte[contentStream.available()];


There is no way how to get size of the input stream.
Use ByteArrayOutputStream, then IOUtils.copy to copy input to output and then ByteArrayOutputStream#toByteArray

src/main/java/spoon/support/compiler/VirtualFile.java

pvojtechovsky · 2018-11-29T09:12:40Z

src/main/java/spoon/compiler/SpoonFile.java

@@ -54,8 +56,9 @@
 		byte[] bytes;
 		try {
 			InputStream contentStream = getContent();


The contentStream must be closed. The best is:

try (InputStream contentStream = getContent()) {

Ooops, my bad, thanks.

pvojtechovsky · 2018-11-29T09:39:44Z

Looks got to me. Thank You Egor!
@monperrus will you merge?

add encoding detection callback

539d60c

monperrus reviewed Nov 27, 2018

View reviewed changes

refactor api

0ce2e1c

pvojtechovsky reviewed Nov 29, 2018

View reviewed changes

src/main/java/spoon/support/compiler/VirtualFile.java Show resolved Hide resolved

fix

6eb5039

pvojtechovsky reviewed Nov 29, 2018

View reviewed changes

close contentStream

7b417f3

monperrus merged commit e6b1b3e into INRIA:master Nov 29, 2018

Egor18 deleted the encodingDetectionCallback branch June 15, 2019 19:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add encoding detection callback #2788

Add encoding detection callback #2788

Egor18 commented Nov 27, 2018 •

edited

Loading

monperrus Nov 27, 2018

monperrus Nov 29, 2018

pvojtechovsky commented Nov 27, 2018

pvojtechovsky commented Nov 27, 2018

Egor18 commented Nov 28, 2018 •

edited

Loading

pvojtechovsky commented Nov 28, 2018

Egor18 commented Nov 28, 2018

pvojtechovsky commented Nov 28, 2018

Egor18 commented Nov 29, 2018

pvojtechovsky Nov 29, 2018

pvojtechovsky Nov 29, 2018

Egor18 Nov 29, 2018

pvojtechovsky commented Nov 29, 2018

Add encoding detection callback #2788

Add encoding detection callback #2788

Conversation

Egor18 commented Nov 27, 2018 • edited Loading

monperrus Nov 27, 2018

Choose a reason for hiding this comment

monperrus Nov 29, 2018

Choose a reason for hiding this comment

pvojtechovsky commented Nov 27, 2018

pvojtechovsky commented Nov 27, 2018

Egor18 commented Nov 28, 2018 • edited Loading

pvojtechovsky commented Nov 28, 2018

Egor18 commented Nov 28, 2018

pvojtechovsky commented Nov 28, 2018

Egor18 commented Nov 29, 2018

pvojtechovsky Nov 29, 2018

Choose a reason for hiding this comment

pvojtechovsky Nov 29, 2018

Choose a reason for hiding this comment

Egor18 Nov 29, 2018

Choose a reason for hiding this comment

pvojtechovsky commented Nov 29, 2018

Egor18 commented Nov 27, 2018 •

edited

Loading

Egor18 commented Nov 28, 2018 •

edited

Loading