Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

utf8.read function producing wrong strings #1473

Closed
masad-frost opened this issue Aug 11, 2020 · 4 comments · Fixed by #1486
Closed

utf8.read function producing wrong strings #1473

masad-frost opened this issue Aug 11, 2020 · 4 comments · Fixed by #1486

Comments

@masad-frost
Copy link

masad-frost commented Aug 11, 2020

protobuf.js version: 6.10.1

The utf8.read function seems to inserting extra unicode characters sometimes.

Here's a repro (https://repl.it/@masfrost/pbjs-bad-decode) where I check utf8.read against WHATWG TextEncoder. The repro file is not very minimal but this issue seems to be pretty common for us when decoding strings.

FYI for future readers, we monkey patched the library and forced it to use TextDecoder/TextEncoder here https://github.com/replit/crosis/blob/v5.0.3/src/fixUtf8.ts

I think maybe using the standard TextEncoder/Decoder might be the best thing to do here, encoding is just too complicated and I'm sure these standard libraries are faster. Happy to put up a PR if that's an option, otherwise, I don't really have enough time to go splunking into utf8 land.

turbio added a commit to turbio/protobuf.js that referenced this issue Sep 10, 2020
This fixes protobufjs#1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.

It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly.

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js
alexander-fenster added a commit that referenced this issue Oct 9, 2020
* fix utf8 -> utf16 decoding bug on surrogate pairs

This fixes #1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.

It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly.

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js

* fix lint

* add test case for surrogate pair bug

Co-authored-by: Alexander Fenster <fenster@google.com>
@daviderenger
Copy link

This is still not fixed with #1486. We got a very mysterious problem where extra � was added at almost even 8000 places sometimes in some of our messages. Turns out we only got it when we had emojis in the strings.

We went the way of using TextDecoder too and it works every time.

@masad-frost
Copy link
Author

@daviderenger do you have a test case?

@maximelkin
Copy link

maximelkin commented Jul 28, 2022

@protobufjs/utf8 is not published in npm, maybe this is why fix not working?
@alexander-fenster can you please release @protobufjs/* packages too (if any changed too)?

renawolford6 added a commit to renawolford6/protobuf-script-build-javascript that referenced this issue Nov 10, 2022
* fix utf8 -> utf16 decoding bug on surrogate pairs

This fixes protobufjs/protobuf.js#1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.

It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly.

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js

* fix lint

* add test case for surrogate pair bug

Co-authored-by: Alexander Fenster <fenster@google.com>
@Cubelrti
Copy link

Cubelrti commented Mar 20, 2023

Encountered this too, @protobufjs/utf8 is still v1.1.0 and not applying the patch.

Any update?

EDIT: for anyone encountering this, apply the following pnpm patch or patch-package to fix this temporarily.

diff --git a/index.js b/index.js
index e4ff8df9b7e83854df3ed09eb1cee8253af7c497..9dc6c05970cdf1734dddcea73ce261325f22abcc 100644
--- a/index.js
+++ b/index.js
@@ -38,36 +38,27 @@ utf8.length = function utf8_length(string) {
  * @returns {string} String read
  */
 utf8.read = function utf8_read(buffer, start, end) {
-    var len = end - start;
-    if (len < 1)
+    if (end - start < 1) {
         return "";
-    var parts = null,
-        chunk = [],
-        i = 0, // char offset
-        t;     // temporary
-    while (start < end) {
-        t = buffer[start++];
-        if (t < 128)
-            chunk[i++] = t;
-        else if (t > 191 && t < 224)
-            chunk[i++] = (t & 31) << 6 | buffer[start++] & 63;
-        else if (t > 239 && t < 365) {
-            t = ((t & 7) << 18 | (buffer[start++] & 63) << 12 | (buffer[start++] & 63) << 6 | buffer[start++] & 63) - 0x10000;
-            chunk[i++] = 0xD800 + (t >> 10);
-            chunk[i++] = 0xDC00 + (t & 1023);
-        } else
-            chunk[i++] = (t & 15) << 12 | (buffer[start++] & 63) << 6 | buffer[start++] & 63;
-        if (i > 8191) {
-            (parts || (parts = [])).push(String.fromCharCode.apply(String, chunk));
-            i = 0;
-        }
     }
-    if (parts) {
-        if (i)
-            parts.push(String.fromCharCode.apply(String, chunk.slice(0, i)));
-        return parts.join("");
+
+    var str = "";
+    for (var i = start; i < end;) {
+        var t = buffer[i++];
+        if (t <= 0x7F) {
+            str += String.fromCharCode(t);
+        } else if (t >= 0xC0 && t < 0xE0) {
+            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
+        } else if (t >= 0xE0 && t < 0xF0) {
+            str += String.fromCharCode((t & 0xF) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
+        } else if (t >= 0xF0) {
+            var t2 = ((t & 7) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
+            str += String.fromCharCode(0xD800 + (t2 >> 10));
+            str += String.fromCharCode(0xDC00 + (t2 & 0x3FF));
+        }
     }
-    return String.fromCharCode.apply(String, chunk.slice(0, i));
+
+    return str;
 };
 
 /**

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants