Skip to content

lib: ensure TextDecoder only removes utf8 BOM on utf8 encoding #42779

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
wants to merge 5 commits into from

Conversation

phated
Copy link
Contributor

@phated phated commented Apr 18, 2022

While working on gulpjs/remove-bom-stream#8, I noticed that our test case to ensure a UTF-8 BOM at the beginning of a UTF-16 file wouldn't be removed; however, it was.

I dug into the PR at #30132 and noticed that it lost the utf-8 and utf-16-le checks in the refactor.

This should limit the BOM removal to just the UTF-8 encoding. There should also be a follow-up PR that adds the utf-16-le BOM removal back to the code.

@nodejs-github-bot nodejs-github-bot added encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. needs-ci PRs that need a full CI run. labels Apr 18, 2022
@Trott Trott requested a review from addaleax April 18, 2022 21:30
@phated phated force-pushed the phated/textdecoder-bom branch from 3cc2e04 to d84a4eb Compare April 18, 2022 21:32
@Trott
Copy link
Member

Trott commented Apr 18, 2022

I'm guessing lib: would be the generic choice for the start of the commit message.

@phated phated force-pushed the phated/textdecoder-bom branch from d84a4eb to 96f2c21 Compare April 18, 2022 21:45
@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

Updated! I also attempted to add a test, let me know if it is wrong.

@phated phated changed the title utils: ensure TextDecoder only removes utf-8 BOM on utf-8 encoding lib: ensure TextDecoder only removes utf8 BOM on utf8 encoding Apr 18, 2022
@Trott

This comment was marked as resolved.

@phated phated force-pushed the phated/textdecoder-bom branch from 96f2c21 to 4b0cb91 Compare April 18, 2022 21:47
@Trott
Copy link
Member

Trott commented Apr 18, 2022

I also attempted to add a test, let me know if it is wrong.

Looks right to me. Anyone more encoding-knowledgable want to review?

@Trott Trott added the request-ci Add this label to start a Jenkins CI on a PR. label Apr 18, 2022
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Apr 18, 2022
@nodejs-github-bot
Copy link
Collaborator

@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

@Trott can you help me with these failures so I don't need to approve the Jenkins CI to my entire GH account?

@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented Apr 18, 2022

@Trott can you help me with these failures so I don't need to approve the Jenkins CI to my entire GH account?

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ 'utf-16le'
- 'utf-16-le'
         ^
    at /home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-whatwg-encoding-custom-textdecoder.js:46:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 'utf-16le',
  expected: 'utf-16-le',
  operator: 'strictEqual'
}

@phated
Copy link
Contributor Author

phated commented Apr 18, 2022

🤦 I totally looked at those conversions too. Thanks!

@Trott Trott added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Apr 18, 2022
@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented Apr 19, 2022

@phated It appears that the change here causes a failure in test/parallel/test-whatwg-encoding-custom-textdecoder.js. Is the bug in the test or in the change here?

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ '믯璿攀猀琀가�'
- 'test€'
    at /home/iojs/build/workspace/node-test-commit-linux/test/parallel/test-whatwg-encoding-custom-textdecoder.js:48:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '믯璿攀猀琀가�',
  expected: 'test€',
  operator: 'strictEqual'
}

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

@Trott thanks. It looks like a bug in my test specifically.

const dec = new TextDecoder(i);
assert.strictEqual(dec.encoding, 'utf-16le');
const res = dec.decode(buf);
assert.strictEqual(res, '믯璿攀猀琀가');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried decoding this on current node 16, the bug shows itself in an invalid character at the end: '믯璿攀猀琀가�'

@Trott Trott requested a review from lpinca April 19, 2022 21:28
@nodejs-github-bot
Copy link
Collaborator

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

If my last commit doesn't work. I might need to copy the gulp test case into here (even though it is a larger buffer)

@Trott
Copy link
Member

Trott commented Apr 19, 2022

node:assert:123
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
+ actual - expected

+ '믯璿攀猀琀가�'
- '믯璿攀猀琀가'
         ^
    at /Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-whatwg-encoding-custom-textdecoder.js:48:12
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-whatwg-encoding-custom-textdecoder.js:39:26)
    at Module._compile (node:internal/modules/cjs/loader:1105:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1159:10)
    at Module.load (node:internal/modules/cjs/loader:981:32)
    at Module._load (node:internal/modules/cjs/loader:827:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:77:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: '믯璿攀猀琀가�',
  expected: '믯璿攀猀琀가',
  operator: 'strictEqual'
}

Node.js v19.0.0-pre

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

I don't know how to test this. In the gulp tests, we save files with different encodings and I have no idea how to craft those buffers inside a Buffer.

@phated phated closed this Apr 19, 2022
@Trott
Copy link
Member

Trott commented Apr 19, 2022

Did you mean to close this? I'd like to keep chipping away at it unless you're opening a new PR or something?

@Trott
Copy link
Member

Trott commented Apr 19, 2022

You can create a file and save it in the test temp directory. (I can write that code or point you to a sample, so you don't have to learn that API unless you think you're going to be writing more Node.js tests.)

We could also put a file permanently in the test/fixtures directory.

@phated
Copy link
Contributor Author

phated commented Apr 19, 2022

Did you mean to close this?

Yeah, I meant to close it. At this point, I can't determine if I'm crazy or not because reading the file into the repl seems to give me the expected results, but my tests fail. The furthest I got before giving up was possibly a re-encoding issue (back into a buffer).

One thing that I noticed is that \ufeff is actually the <Buffer ff fe> buffer, which is actually the utf16 BOM (not the utf8 BOM), so I'm not sure.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants