Improve UTF-8 decoding and encoding functions #410

Merged (1 commit) on May 21, 2024

Collaborator

@chqrlie chqrlie commented May 19, 2024

Ensure proper UTF-8 encoding (1 to 4 bytes).
Handle invalid encodings (return 0xFFFD and consume a single byte).
Individually encoded surrogate code points are accepted.
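
The decoding rules above can be illustrated with a minimal sketch. The name `sketch_utf8_decode` and its signature are hypothetical and are not taken from this PR; the code only assumes the behavior described here: invalid input yields 0xFFFD with a single byte consumed, and individually encoded surrogates are accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not the utf8_decode() from this PR.
 * Decodes one code point from p[0..len-1], stores the number of bytes
 * consumed in *consumed, and returns 0xFFFD for any invalid sequence. */
static uint32_t sketch_utf8_decode(const uint8_t *p, size_t len, size_t *consumed)
{
    /* smallest code point that legitimately needs 1, 2 or 3 continuation bytes */
    static const uint32_t min_code[4] = { 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t n, i;

    assert(p != NULL && consumed != NULL && len > 0);
    c = p[0];
    *consumed = 1;              /* default: consume a single byte */
    if (c < 0x80)
        return c;               /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        n = 1; c &= 0x1F;
    } else if (c >= 0xE0 && c <= 0xEF) {
        n = 2; c &= 0x0F;
    } else if (c >= 0xF0 && c <= 0xF4) {
        n = 3; c &= 0x07;
    } else {
        return 0xFFFD;          /* stray continuation or invalid lead byte */
    }
    if (len < n + 1)
        return 0xFFFD;          /* truncated sequence */
    for (i = 1; i <= n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0xFFFD;      /* missing continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_code[n] || c > 0x10FFFF)
        return 0xFFFD;          /* overlong encoding or out of range */
    *consumed = n + 1;
    return c;                   /* surrogates 0xD800..0xDFFF pass through */
}
```

Because at least one byte is always consumed, a caller looping over a buffer with this kind of decoder makes progress even on malformed input.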

  • add utf8_scan() to analyze a byte array for UTF-8 contents: it detects invalid encodings, computes the number of codepoints, and determines the content kind (plain ASCII, 8-bit, 16-bit or larger codepoints).
  • add utf8_encode_len(c) to compute the number of bytes to encode c
  • rename unicode_to_utf8 as utf8_encode
  • rename unicode_from_utf8 as utf8_decode
  • add utf8_decode_buf8(dest, size, src, len) to decode a UTF-8 encoded byte array known to contain only ASCII and 8-bit codepoints.
  • add utf8_decode_buf16(dest, size, src, len) to decode a UTF-8 encoded byte array into an array of 16-bit codepoints, using UTF-16 surrogate pairs for non-BMP codepoints.
  • add utf8_encode_buf8(dest, size, src, len) to encode an array of 8-bit codepoints as a UTF-8 encoded null-terminated string
  • add utf16_encode_buf8(dest, size, src, len) to encode an array of 16-bit codepoints (including surrogate pairs) as a UTF-8 encoded null-terminated string
  • detect invalid UTF-8 encoding in RegExp parser
  • simplify JS_AtomGetStrRT, JS_NewStringLen using the above functions
  • simplify UTF-8 decoding and error testing
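
As a rough illustration of the encoding side of the list above, the following sketch shows how an encoded length, a 1 to 4 byte encoder, and a UTF-16 surrogate split could fit together. All three function names are hypothetical stand-ins, not the functions added by this PR, and returning 0 for code points above 0x10FFFF is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of bytes needed to encode c as UTF-8.
 * Returning 0 for out-of-range values is an assumption of this sketch. */
static size_t sketch_utf8_encode_len(uint32_t c)
{
    if (c < 0x80)     return 1;
    if (c < 0x800)    return 2;
    if (c < 0x10000)  return 3;
    if (c < 0x110000) return 4;
    return 0;
}

/* Hypothetical: write c into buf (at least 4 bytes), return bytes written. */
static size_t sketch_utf8_encode(uint8_t *buf, uint32_t c)
{
    size_t n = sketch_utf8_encode_len(c);

    assert(buf != NULL);
    switch (n) {
    case 1:
        buf[0] = (uint8_t)c;
        break;
    case 2:
        buf[0] = (uint8_t)(0xC0 | (c >> 6));
        buf[1] = (uint8_t)(0x80 | (c & 0x3F));
        break;
    case 3:
        buf[0] = (uint8_t)(0xE0 | (c >> 12));
        buf[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (uint8_t)(0x80 | (c & 0x3F));
        break;
    case 4:
        buf[0] = (uint8_t)(0xF0 | (c >> 18));
        buf[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = (uint8_t)(0x80 | (c & 0x3F));
        break;
    default:
        break;                  /* out of range: nothing written */
    }
    return n;
}

/* Hypothetical: split a non-BMP code point into a UTF-16 surrogate pair,
 * as a 16-bit destination buffer would need to store it. */
static void sketch_to_surrogates(uint32_t c, uint16_t *hi, uint16_t *lo)
{
    assert(hi != NULL && lo != NULL);
    assert(c >= 0x10000 && c <= 0x10FFFF);
    c -= 0x10000;
    *hi = (uint16_t)(0xD800 | (c >> 10));
    *lo = (uint16_t)(0xDC00 | (c & 0x3FF));
}
```

Knowing the encoded length up front lets a caller size the destination buffer before writing, which is presumably why a separate length function is useful alongside the encoder.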

This commit lays the groundwork for a follow-up PR that fixes some JSAtom creation inconsistencies and inefficiencies.

@chqrlie chqrlie force-pushed the improve-utf8-functions branch from 50da583 to 1c6a98a on May 19, 2024 12:50
Contributor

@saghul saghul left a comment

I only did a shallow review, but I trust you and the tests are happy :-)

@chqrlie chqrlie merged commit 1baa676 into quickjs-ng:master May 21, 2024
47 checks passed