En- and decoding of characters outside the Basic Multilingual Plane #12

sternenseemann · 2022-06-18T13:10:25Z

RFC4627 prescribes that everything has to be in some Unicode
encoding (which we comply with by emitting only ASCII, which is valid
UTF-8, and escaping everything else) and that any character may be
escaped. When escaping, however, we need to take care that only
characters in the Basic Multilingual Plane (BMP), which is U+0000 to
U+FFFF, are escaped directly as a single \uXXXX sequence:

> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits that
> encode the character's code point. [...]
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> - RFC4627, p. 3
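
The arithmetic behind the quoted rule can be sketched as follows. This is an illustration in Python rather than the Common Lisp code from this PR, and `escape_code_point` is a hypothetical name chosen for the example; it does not exist in the library.

```python
def escape_code_point(cp: int) -> str:
    """Return the JSON \\u escape for one code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if cp <= 0xFFFF:
        # BMP: a single six-character escape.
        # Note: surrogate code points are deliberately not checked here.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return f"\\u{high:04X}\\u{low:04X}"


print(escape_code_point(0x1D11E))  # G clef from the RFC example: \uD834\uDD1E
```

The printed result matches the twelve-character sequence given in the RFC excerpt.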

This commit implements en- and decoding of UTF-16 surrogate pairs,
together with the error handling required by the ordering requirements
(a high surrogate must be followed directly by a low surrogate) and by
the fact that a lone surrogate code unit/point may never be decoded or
encoded.

Test cases partially taken from #3.

BREAKING CHANGES:

Note that the broken behavior can all be considered a bug insofar as it
violates the JSON spec.

* A Unicode code point outside the BMP will now always be encoded as a
  UTF-16 surrogate pair.
* A valid UTF-16 surrogate pair will now always be decoded to a single
  Unicode code point.
* When *use-strict-json-rules* is non-NIL, encoding a surrogate code
  point or decoding a lone surrogate code unit will result in an error.
  If *use-strict-json-rules* is NIL, it'll behave as before.
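
A rough sketch of the decoding side, again in Python and not taken from this PR. The function name and the `strict` parameter (standing in for *use-strict-json-rules*) are hypothetical. The PR only says that non-strict mode behaves as before, so the pass-through in the lenient branch is an assumption made for the example.

```python
def decode_code_units(units, strict=True):
    """Turn 16-bit code units (from \\uXXXX escapes) into code points.

    Hypothetical helper; `strict` stands in for *use-strict-json-rules*.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            # Valid pair: a high surrogate followed directly by a low one.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
            continue
        if (is_high or is_low) and strict:
            # Wrong order or missing partner: a lone surrogate.
            raise ValueError(f"lone surrogate code unit: {unit:#06x}")
        # Ordinary BMP unit, or (ASSUMPTION) lenient pass-through of a
        # lone surrogate as its own code point.
        code_points.append(unit)
        i += 1
    return code_points
```

With this sketch, `decode_code_units([0xD834, 0xDD1E])` yields the single code point U+1D11E, while the reversed order `[0xDD1E, 0xD834]` raises an error in strict mode because the low surrogate comes first.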

Overall this does the same as #3, but

* Has more sophisticated error handling; the more lenient behavior is tied to *use-strict-json-rules*.
* I tried to annotate the encoding/decoding functions so CL implementations can generate better code.
* read-json-string-char is unchanged; instead, the surrogate decoding logic is handled in decode-json-string.

Co-Authored-By: Chaitanya Gupta <mail@chaitanyagupta.com>

CCL's Unicode implementation doesn't allow a lone surrogate code point in a string, which prevents us from ever creating a string that would trigger the behavior tested here. Other CL implementations are more lenient, whereas CCL follows the Unicode standard strictly.
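
For context, the strict rule is that a surrogate code point is not a Unicode scalar value. The following Python sketch illustrates that rule only; it says nothing about CCL's internals, and `is_scalar_value` is a hypothetical helper name. Python strings happen to tolerate lone surrogates, but its UTF-8 encoder rejects them, which serves as a loose analogy for the strict behavior.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a code point outside the surrogate range U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


print(is_scalar_value(0x1D11E))  # True: a regular character outside the BMP
print(is_scalar_value(0xD834))   # False: a high surrogate on its own

try:
    "\ud834".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected by the UTF-8 encoder")
```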

sternenseemann and others added 2 commits June 18, 2022 15:08

Skip test for known bug

b2c91fa

sternenseemann force-pushed the depot branch from 4528fc3 to 4796850 Compare June 18, 2022 13:11
