Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

UTF-8 BOM not accounted for in JsonLocation.getByteOffset() #533

Closed
fabienrenaud opened this issue Apr 23, 2019 · 5 comments
Closed

UTF-8 BOM not accounted for in JsonLocation.getByteOffset() #533

fabienrenaud opened this issue Apr 23, 2019 · 5 comments
Milestone

Comments

@fabienrenaud
Copy link
Contributor

Version: Jackson 2.9.8

parser.getCurrentLocation().getByteOffset() returns the wrong byte offset for the underlying byte array when the payload start with a BOM.
The json parser processes well such json payloads but the JsonLocation it returns ignores the offset introduced by the BOM.

Full standalone repro:

package test;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.Arrays;

public class Test {

    private static final JsonFactory JSON_FACTORY;

    static {
        JSON_FACTORY = new JsonFactory();
        JSON_FACTORY.setCharacterEscapes(null);
        JSON_FACTORY.disable(JsonFactory.Feature.INTERN_FIELD_NAMES);
    }

    static byte[] extract(byte[] bytes, JsonParser parser, JsonToken token) throws IOException {
        switch (token) {
            case START_OBJECT:
                int startIndex = (int) parser.getCurrentLocation().getByteOffset() - 1;
                parser.skipChildren();
                int endIndex = (int) parser.getCurrentLocation().getByteOffset();
                return Arrays.copyOfRange(bytes, startIndex, endIndex);
        }
        throw new RuntimeException();
    }

    static byte[] parseAndExtract(byte[] bytes) throws IOException {
        JsonParser parser = JSON_FACTORY.createParser(bytes);
        System.out.println("parser type: " + parser.getClass().getCanonicalName());

        parser.nextToken(); // skip start object
        switch (parser.nextToken()) {
            case FIELD_NAME:
                if ("payload".equalsIgnoreCase(parser.getCurrentName())) {
                    return extract(bytes, parser, parser.nextToken());
                }
                break;
        }
        throw new RuntimeException();
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"payload\":{\"name\":\"foo\"}}";
        byte[] bytes = json.getBytes();

        System.out.println("UTF-8 no BOM:");
        byte[] result = parseAndExtract(bytes);
        System.out.println(new String(result));
        System.out.println();

        System.out.println("UTF-8 no BOM, with leading characters:");
        result = parseAndExtract(("\r\n\t\n\t\t" + json).getBytes());
        System.out.println(new String(result));
        System.out.println();

        byte[] newBytes = new byte[bytes.length + 3];
        // write BOM
        newBytes[0] = (byte) 0xEF;
        newBytes[1] = (byte) 0xBB;
        newBytes[2] = (byte) 0xBF;
        System.arraycopy(bytes, 0, newBytes, 3, bytes.length);

        System.out.println("UTF-8 BOM:");
        result = parseAndExtract(newBytes);
        System.out.println(new String(result));
        System.out.println();
    }
}

Output:

UTF-8 no BOM:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
{"name":"foo"}

UTF-8 no BOM, with leading characters:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
{"name":"foo"}

UTF-8 BOM:
parser type: com.fasterxml.jackson.core.json.UTF8StreamJsonParser
d":{"name":"fo

You can see the result for the BOM payload gets shifted to the left by exactly 3 bytes while other payloads with or without padding characters are handled as expected.

@fabienrenaud fabienrenaud changed the title UTF-8 BOM not accounted for in JsonLocation.getByteOffset() UTF-8 BOM not accounted for in JsonLocation.getByteOffset() Apr 23, 2019
@cowtowncoder
Copy link
Member

Thank you for reporting this and doing troubleshooting. Sounds like a bug.

fabienrenaud pushed a commit to fabienrenaud/jackson-core that referenced this issue Apr 24, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue Apr 24, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
@fabienrenaud
Copy link
Contributor Author

I submitted a PR a couple of weeks ago. Please let me know of any feedback or necessary changes.

cowtowncoder added a commit that referenced this issue May 16, 2019
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue May 21, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
fabienrenaud added a commit to fabienrenaud/jackson-core that referenced this issue May 21, 2019
UTF8StreamJsonParser tracks read pointer (offset) and bytes processed
separately and uses those to generate JsonLocation. When the byte
payload starts with a UTF BOM, ByteSourceJsonBootstrapper processes a
few bytes ahead of the parser, moves/increases the offset and passes the
newly computed offset to the parser without telling it some bytes have
been pre-processed.
With this change, the number of bytes pre-processed for encoding
detection is passed to the parser. JsonLocation instances returned by
the parser now point to the correct byte offset when payload has a BOM.

Issue: FasterXML#533
@cowtowncoder
Copy link
Member

Looks good. I can merge this in master, backport in 2.10.

The only thing we now need is the Contributor License Agreement, found from https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf
I think we don't yet have one for you? Only needs to be sent once for any and all contributions.
The simplest way is usually to print, fill & sign, scan, email to info at fasterxml dot com.
Once I receive it I will merge this patch.

Thank you in advance!

@fabienrenaud
Copy link
Contributor Author

Just submitted contributor agreement.
Will this make it to 2.9 as well?

@cowtowncoder cowtowncoder added 2.10 and removed 2.9 labels May 29, 2019
@cowtowncoder cowtowncoder added this to the 2.10.0 milestone May 29, 2019
cowtowncoder added a commit that referenced this issue May 29, 2019
@cowtowncoder
Copy link
Member

@fabienrenaud I think I'd rather backport it only in 2.10 (had to do that manually). But 2.10.0 should be out within month.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants