Commit 50da583

committed

Improve UTF-8 decoding and encoding functions

Ensure proper UTF-8 encoding (1 to 4 bytes). Handle invalid encodings (return 0xFFFD and consume a single byte) Individually encoded surrogate code points are accepted. - add `utf8_scan()` to analyze a byte array for UTF-8 contents detects invalid encoding, computes number of codepoints and content kind: plain ASCII, 8-bit, 16-bit or larger codepoints. - add `utf8_encode_len(c)` to compute the number of bytes to encode `c` - rename `unicode_to_utf8` as `utf8_encode` - rename `unicode_from_utf8` as `utf8_decode` - add `utf8_decode_buf8(dest, size, src, len)` to decode a UTF-8 encoded byte array known to contain only ASCII and 8-bit codepoints. - add `utf8_decode_buf16(dest, size, src, len)` to decode a UTF-8 encoded byte array into an array of 16-bit codepoints using UTF-16 surrogate pairs for non-BMP1 codepoints. - add `utf8_encode_buf8(dest, size, src, len)` to encode an array of 8-bit codepoints as a UTF-8 encoded null terminated string - add `utf16_encode_buf8(dest, size, src, len)` to decode an array of 16-bit codepoints (including surrogate pairs) as a UTF-8 encoded null terminated string - detect invalid UTF-8 encoding in RegExp parser - simplify `JS_AtomGetStrRT`, `JS_NewStringLen` using the above functions - simplify UTF-8 decoding and error testing

1 parent f588210 commit 50da583Copy full SHA for 50da583

5 files changed

+490

-269

lines changed

cutils.c
cutils.h
libregexp.c
quickjs-libc.c
quickjs.c

5 files changed

+490

-269

lines changed

Comments

(0)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit 50da583

5 files changed

5 files changed

File tree

5 files changed

5 files changed

0 commit comments