Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add encoding detection callback #2788

Merged
merged 4 commits into from
Nov 29, 2018

Conversation

Egor18
Copy link
Contributor

@Egor18 Egor18 commented Nov 27, 2018

This PR provides ability to specify user-defined callback, which could be used to detect encoding for each file separately. See #2781.

For example I use juniversalchardet lib:

private Charset detectEncoding(byte[] fileBytes) {
   final int maxLength = 4096;
   UniversalDetector detector = new UniversalDetector(null);
   detector.handleData(fileBytes, 0, Math.min(fileBytes.length, maxLength));
   detector.dataEnd();
   String encoding = detector.getDetectedCharset();
   detector.reset();
   if (encoding != null) {
       return Charset.forName(encoding);
   }
   return Charset.forName("UTF-8");
}

Spoon api:

final Launcher launcher = new Launcher();
launcher.getEnvironment().setEncodingDetectionCallback(bytes -> detectEncoding(bytes));
// ...
CtModel model = launcher.buildModel();

It helps to have correct AST in projects with mixed encodings.

.findFirst()
.get();

assertEquals("\"Привет мир\"", utf8Type.getField("s1").getAssignment().toString());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this expected value :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't read Cyrillic, it means "Hello World" :)

@pvojtechovsky
Copy link
Collaborator

I would like to have available file name+path in encoding detection callback.
Because for example in Eclipse IDE the encoding of source files can be defined individually using property file with properties like:

encoding//src/main/java/spoon/support/visitor/replace/ReplacementVisitor.java=ISO-8859-1

So I can imagine that many clients might be able to select encoding depending on the file name or file path. Of course bytes of file might be good input too!

WDYT?

But before we design the final API for encoding detection, please note this bug in Spoon:

class VirtualFile implements SpoonFile {
...
	@Override
	public InputStream getContent() {
		return new ByteArrayInputStream(content.getBytes());
	}
...
}

here we convert String value to bytes using system dependent encoding ... and later we convert bytes back to chars using another encoding.
It would be nicer to be able to directly return String or char[] from the SpoonFile or at least from VirtualFile.

... but how to mix it together with encoding detector?? ;-)

@surli @monperrus Do you agree to refactor the related file API here?

@pvojtechovsky
Copy link
Collaborator

interface SpoonFile {
  default char[] getContentChars(Environment env) {
    ... move loading code from `FileCompilerConfig#initializeCompiler` here...
  }
}
class VirtualFile implements SpoonFile {
...
	@Override
	public char[] getContentChars(Environment env) {
		return content.toCharArray();
	}
...
}

WDYT?

@Egor18
Copy link
Contributor Author

Egor18 commented Nov 28, 2018

I would like to have available file name+path in encoding detection callback.

We can use BiFunction<byte[], File, Charset>.

@pvojtechovsky
Copy link
Collaborator

The File should be replaced by SpoonFile, because there is no File in case of accessing of jar content.

And I personally would prefer new interface FileEncodingProvider or FileEncodingProvider or ... instead of BiFunction.

@Egor18
Copy link
Contributor Author

Egor18 commented Nov 28, 2018

So, as far as I understand you want me to add char[] getContentChars(Environment env) method to the SpoonFile interface and then implement it in VirtualFile, ZipFile and FileSystemFile. Then we could use getContentChars instead of getContent in initializeCompiler like this:

for (SpoonFile f : getFiles(compiler)) {

    if (compiler.filesToBeIgnored.contains(f.getPath())) {
        continue;
    }

    String fName = f.isActualFile() ? f.getPath() : f.getName();
    Environment env = jdtCompiler.getEnvironment();
    cuList.add(new CompilationUnit(f.getContentChars(env), fName, f.getDetectedEncoding(env)));
}

Am I right?

As you can see CompilationUnit constructor takes encoding as an argument too, so I have to add something like Charset getDetectedEncoding(env) to the SpoonFile interface as well. So detectedEncoding would memorize the encoding, which was detected during the last callbak call.

WDYT?

@pvojtechovsky
Copy link
Collaborator

and then implement it in VirtualFile, ZipFile and FileSystemFile

if you add "default" implementation in SpoonFile, then it is enough to implement special code into VirtualFile only.

am I right

yes

CompilationUnit constructor takes encoding as an argument too

I believe that this encoding is ignored (The compilation unit has no access to origin byte[] so encoding is useless.), so we can simply provide default encoding from environment.

@Egor18
Copy link
Contributor Author

Egor18 commented Nov 29, 2018

@pvojtechovsky, I refactored the API as you suggested.
Note: I had to make a small change in FileSystemFolderTest.java.

byte[] bytes;
try {
InputStream contentStream = getContent();
bytes = new byte[contentStream.available()];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way how to get size of the input stream.
Use ByteArrayOutputStream, then IOUtils.copy to copy input to output and then ByteArrayOutputStream#toByteArray

@@ -54,8 +56,9 @@
byte[] bytes;
try {
InputStream contentStream = getContent();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contentStream must be closed. The best is:

try (InputStream contentStream = getContent()) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, my bad, thanks.

@pvojtechovsky
Copy link
Collaborator

Looks got to me. Thank You Egor!
@monperrus will you merge?

@monperrus monperrus merged commit e6b1b3e into INRIA:master Nov 29, 2018
@Egor18 Egor18 deleted the encodingDetectionCallback branch June 15, 2019 19:33
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants