utf8encode #80

darrachequesne · 2016-12-16T08:14:35Z

A few questions:

why are there so many public methods? Shouldn't encodePayload and decodePayload be enough?

Currently, it seems that:
- encodePayload/decodePayload is used for polling transport
- encodePacket/decodePacket is used for websocket transport

Quoting socketio/engine.io-client#140 (comment):

encodePacket is more efficient in this case so we can leverage the WS framing

Is that really more efficient? Shouldn't we rather concatenate two packets instead of sending them one by one? (I think that's why, currently, sending a buffer with socket.io results in two frames: socketio/socket.io#2444)

according to the following comment, I understand that utf8 is needed for mixing binary/string content

The reason we are using utf8.js is that when encoding mixed binary and string contents, we need to support multibyte strings. It is easiest to encode the multibyte characters so that we can do a transformation to a binary representation one byte at a time.

Yet it seems that, since encodePayload always call encodePacket with utf8encode to true, the strings transmitted over polling transport are UTF8-encoded twice (socketio/engine.io#315):

https://github.com/socketio/engine.io-parser/blob/1.3.2/lib/index.js#L229

exports.encodePayload = function (packets, supportsBinary, callback) {
  ...
  if (supportsBinary) {
    return exports.encodePayloadAsBinary(packets, callback);
    // which call encodePacket with utf8encode = true => OK
  }
  ...
  function encodeOne(packet, doneCallback) {
    exports.encodePacket(packet, supportsBinary, true, function(message) {
      doneCallback(null, setLengthHeader(message));
    });
    // since the payload is not encoded as binary, shouldn't utf8encode be set to false here?
  }
}

cc @calzoneman @rase- @rauchg

The text was updated successfully, but these errors were encountered:

calzoneman · 2016-12-17T18:36:53Z

Here are a couple interesting commits from the parser:

It looks like at first, utf8 was always used, "to support multibyte strings". Later on, it was realized that ws is already encoding the payload, so it was made optional, but still defaults to true for polling transports. This is the part that doesn't make sense to me, because with the appropriate headers set, the http server should already be handling this encoding. I don't have the full context on the issue that prompted the addition of utf8 "to support multibyte strings", so maybe that would be a good place to start.

darrachequesne · 2016-12-17T21:50:36Z

@calzoneman thanks for taking the time to look into that! If I understand correctly, I think that makes sense when mixing binary/string content in the same payload, since in that case the content type will be application/octet-stream (and not text/plain; charset=UTF-8, see ref).

calzoneman · 2016-12-17T23:44:58Z

That's the part that I was missing (that you can send binary data and UTF-8 string data in the same polling payload). In that case, I guess what's happening is that for string-only payloads, they are being encoded twice (once by engine.io, and again when going over the wire).

It might get a bit messy to try to selectively encode strings depending on the payload's content-type. This is also going to be a backwards-incompatible change for any client code using the polling transport. Not sure what the best solution is here.

darrachequesne · 2016-12-17T23:58:08Z

What do you think of #81? Does it make sense to skip utf8 encoding when dealing with payloads without binary content?

calzoneman · 2016-12-18T01:22:42Z

That seems reasonable to me. I'm not sure how extensive socket.io's test coverage is; it may be worth re-running the test cases from the issues that prompted the utf8 changes to ensure this doesn't break them.

calzoneman · 2016-12-18T01:24:40Z

@nuclearace since he was actually the first person to point out the double encoding to me.

nuclearace · 2016-12-18T01:41:33Z

Yeah, I'll need to change some stuff on the Swift client if things stop getting double encoded.

darrachequesne · 2016-12-21T08:22:21Z

Closed by #81 (will trigger a major release)

darrachequesne mentioned this issue Dec 17, 2016

Fix double utf8 encoding for payloads #81

Merged

darrachequesne closed this as completed Dec 21, 2016

darrachequesne added this to the 2.0.0 milestone Mar 21, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utf8encode #80

utf8encode #80

darrachequesne commented Dec 16, 2016

calzoneman commented Dec 17, 2016

darrachequesne commented Dec 17, 2016

calzoneman commented Dec 17, 2016

darrachequesne commented Dec 17, 2016

calzoneman commented Dec 18, 2016

calzoneman commented Dec 18, 2016

nuclearace commented Dec 18, 2016

darrachequesne commented Dec 21, 2016

utf8encode #80

utf8encode #80

Comments

darrachequesne commented Dec 16, 2016

calzoneman commented Dec 17, 2016

darrachequesne commented Dec 17, 2016

calzoneman commented Dec 17, 2016

darrachequesne commented Dec 17, 2016

calzoneman commented Dec 18, 2016

calzoneman commented Dec 18, 2016

nuclearace commented Dec 18, 2016

darrachequesne commented Dec 21, 2016