Skip to content

utf8encode #80

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
darrachequesne opened this issue Dec 16, 2016 · 8 comments
Closed

utf8encode #80

darrachequesne opened this issue Dec 16, 2016 · 8 comments
Milestone

Comments

@darrachequesne
Copy link
Member

A few questions:

  • why are there so many public methods? Shouldn't encodePayload and decodePayload be enough?

Currently, it seems that:
- encodePayload/decodePayload is used for polling transport
- encodePacket/decodePacket is used for websocket transport

Quoting socketio/engine.io-client#140 (comment):

encodePacket is more efficient in this case so we can leverage the WS framing

Is that really more efficient? Shouldn't we rather concatenate two packets instead of sending them one by one? (I think that's why, currently, sending a buffer with socket.io results in two frames: socketio/socket.io#2444)

  • according to the following comment, I understand that utf8 is needed for mixing binary/string content

The reason we are using utf8.js is that when encoding mixed binary and string contents, we need to support multibyte strings. It is easiest to encode the multibyte characters so that we can do a transformation to a binary representation one byte at a time.

Yet it seems that, since encodePayload always call encodePacket with utf8encode to true, the strings transmitted over polling transport are UTF8-encoded twice (socketio/engine.io#315):

https://github.com/socketio/engine.io-parser/blob/1.3.2/lib/index.js#L229

exports.encodePayload = function (packets, supportsBinary, callback) {
  ...
  if (supportsBinary) {
    return exports.encodePayloadAsBinary(packets, callback);
    // which call encodePacket with utf8encode = true => OK
  }
  ...
  function encodeOne(packet, doneCallback) {
    exports.encodePacket(packet, supportsBinary, true, function(message) {
      doneCallback(null, setLengthHeader(message));
    });
    // since the payload is not encoded as binary, shouldn't utf8encode be set to false here?
  }
}

cc @calzoneman @rase- @rauchg

@calzoneman
Copy link

Here are a couple interesting commits from the parser:

It looks like at first, utf8 was always used, "to support multibyte strings". Later on, it was realized that ws is already encoding the payload, so it was made optional, but still defaults to true for polling transports. This is the part that doesn't make sense to me, because with the appropriate headers set, the http server should already be handling this encoding. I don't have the full context on the issue that prompted the addition of utf8 "to support multibyte strings", so maybe that would be a good place to start.

@darrachequesne
Copy link
Member Author

@calzoneman thanks for taking the time to look into that! If I understand correctly, I think that makes sense when mixing binary/string content in the same payload, since in that case the content type will be application/octet-stream (and not text/plain; charset=UTF-8, see ref).

@calzoneman
Copy link

That's the part that I was missing (that you can send binary data and UTF-8 string data in the same polling payload). In that case, I guess what's happening is that for string-only payloads, they are being encoded twice (once by engine.io, and again when going over the wire).

It might get a bit messy to try to selectively encode strings depending on the payload's content-type. This is also going to be a backwards-incompatible change for any client code using the polling transport. Not sure what the best solution is here.

@darrachequesne
Copy link
Member Author

What do you think of #81? Does it make sense to skip utf8 encoding when dealing with payloads without binary content?

@calzoneman
Copy link

That seems reasonable to me. I'm not sure how extensive socket.io's test coverage is; it may be worth re-running the test cases from the issues that prompted the utf8 changes to ensure this doesn't break them.

@calzoneman
Copy link

  • @nuclearace since he was actually the first person to point out the double encoding to me.

@nuclearace
Copy link
Member

Yeah, I'll need to change some stuff on the Swift client if things stop getting double encoded.

@darrachequesne
Copy link
Member Author

Closed by #81 (will trigger a major release)

@darrachequesne darrachequesne added this to the 2.0.0 milestone Mar 21, 2017
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants