En- and decoding of characters outside the Basic Multilingual Plane #12

Open
wants to merge 3 commits into
base: master

@sternenseemann sternenseemann commented Jun 18, 2022

RFC4627 prescribes that everything has to be in some Unicode
encoding (which we comply with by emitting ASCII, which is valid UTF-8,
and escaping everything else) and that any character may be escaped.
When escaping, however, we need to take care to use a single \uXXXX
escape only for characters in the Basic Multilingual Plane (BMP), which
is U+0000 to U+FFFF:

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. [...]

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

  • RFC4627, p. 3
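
The arithmetic behind the quoted G clef example can be sketched as follows. This is an illustration in Python, not the Common Lisp code of this PR, and the function name `escape_code_point` is hypothetical. It does not reject surrogate code points; that check is a separate concern.

```python
def escape_code_point(cp: int) -> str:
    """Return the JSON \\u escape for a Unicode code point (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if cp <= 0xFFFF:
        # BMP: one six-character escape.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return f"\\u{high:04X}\\u{low:04X}"


print(escape_code_point(0x1D11E))  # \uD834\uDD1E
```

For U+1D11E the offset is 0xD11E, whose upper ten bits are 0x34 and lower ten bits are 0x11E, which yields the pair D834/DD1E from the RFC text.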

This commit implements en- and decoding of UTF-16 surrogate pairs and
the error handling logic required by the ordering of a pair (high
surrogate first, then low surrogate) and by the fact that a lone
surrogate code unit/point may never be decoded nor encoded.
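
As a rough illustration of the ordering requirement, the following Python sketch combines two code units into one code point and rejects anything that is not a high surrogate followed by a low surrogate. The name `combine_surrogates` and the use of `ValueError` are assumptions made for this example; the PR itself implements the logic in Common Lisp.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point (illustrative only)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"expected a high surrogate, got {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"expected a low surrogate, got {low:#06x}")
    # Each surrogate carries ten bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD834, 0xDD1E)))  # 0x1d11e
```

Swapping the arguments, as in `combine_surrogates(0xDD1E, 0xD834)`, raises an error because the pair is in the wrong order.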

Test cases partially taken from #3.

BREAKING CHANGES:

Note that the previous behavior can in all cases be considered a bug
insofar as it violates the JSON spec.

  • A Unicode code point outside the BMP will now always be encoded as a
    UTF-16 surrogate pair.
  • A valid UTF-16 surrogate pair will now always be decoded to a single
    Unicode code point.
  • When *use-strict-json-rules* is non-NIL, encoding a surrogate code
    point or decoding a lone surrogate code unit will result in an error.
    If *use-strict-json-rules* is NIL, it'll behave as before.
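
The decoding side of these rules can be sketched in Python as below. The function `decode_units` and its `strict` parameter are hypothetical stand-ins for the Lisp code and `*use-strict-json-rules*`. In particular, passing a lone surrogate through unchanged in the lenient case is only an assumption about what "as before" means, not something this description states.

```python
def decode_units(units, strict=True):
    """Turn 16-bit code units from \\uXXXX escapes into code points (illustrative only)."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid pair: always decoded to a single code point.
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if (is_high or is_low) and strict:
            raise ValueError(f"lone surrogate code unit {unit:#06x}")
        # Non-surrogate unit, or (assumed lenient behavior) a lone surrogate kept as is.
        out.append(unit)
        i += 1
    return out


print([hex(cp) for cp in decode_units([0x41, 0xD834, 0xDD1E])])  # ['0x41', '0x1d11e']
```

With `strict=False`, a call such as `decode_units([0xD834], strict=False)` returns the unit unchanged instead of raising.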

Overall this does the same as #3, but

  • Has more sophisticated error handling; the more lenient behavior is tied to *use-strict-json-rules*.
  • I tried to annotate encoding/decoding functions, so CL implementations can generate better code.
  • read-json-string-char is unchanged; instead, the surrogate decoding logic is handled in decode-json-string.

sternenseemann and others added 2 commits June 18, 2022 15:08
Co-Authored-By: Chaitanya Gupta <mail@chaitanyagupta.com>
CCL's Unicode implementation doesn't allow a lone surrogate code point
in a string, preventing us from ever creating a string that would
trigger the tested behavior here. Other CL implementations are more
lenient here, whereas CCL follows the Unicode standard strictly.