Added ICU charset conversion implementation #64

Open · wants to merge 10 commits into master

Conversation

@GreyCat GreyCat (Member) commented Jul 27, 2023

This adds another option to use the ICU library for character set conversions in the C++ runtime.

Known caveats so far:

  • By default, ICU substitutes illegal sequences with placeholder codepoints rather than actively raising an alarm (which we could surface as an exception), so all illegal-sequence tests fail for now (see the callback sketch after this list).
  • There's a debug print to cout that has yet to be cleaned up.
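
For reference, a minimal sketch of one way to make ICU stop on illegal sequences instead of substituting them, so the failure can be surfaced as an exception. The converter name and the thrown exception type are placeholders, not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <unicode/ucnv_err.h>
    #include <stdexcept>

    // Sketch only: configure an ICU converter to fail on illegal byte sequences
    // instead of silently emitting replacement characters. The charset name and
    // the exception type are placeholders.
    void open_strict_converter_sketch() {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open("SHIFT_JIS", &err);
        if (U_FAILURE(err))
            throw std::runtime_error(u_errorName(err));

        // UCNV_TO_U_CALLBACK_STOP makes ucnv_toUChars() report errors such as
        // U_ILLEGAL_CHAR_FOUND instead of substituting U+FFFD, so the caller
        // can turn the failure into an exception.
        ucnv_setToUCallBack(conv, UCNV_TO_U_CALLBACK_STOP, NULL, NULL, NULL, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw std::runtime_error(u_errorName(err));
        }

        ucnv_close(conv);
    }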

@GreyCat GreyCat marked this pull request as ready for review April 6, 2025 14:07
@GreyCat GreyCat (Member, Author) commented Apr 6, 2025

@generalmimon If you want to take a closer look, this is now ready for review.

Should be a relatively contained change. I plan to squash everything on merge.

@generalmimon generalmimon (Member)

It would be good to check if we can apply something from Recommended Build Options. The How To Use ICU section states:

For C++, note that there are Recommended Build Options (both for normal use and for ICU as system-level libraries) which are not default simply for compatibility with older ICU-using code.

@GreyCat GreyCat (Member, Author) commented Apr 6, 2025

It would be good to check if we can apply something from Recommended Build Options.

As far as I can tell, this speaks to setting up some defines to make use of namespaces. Namespaces per se are only used in the C++ APIs (which themselves need to be activated by the U_SHOW_CPLUSPLUS_API define), and we only use the C APIs, which don't seem to be affected by those.

Can you clarify what exactly you want to see in the build options for these?

}

// Allocate buffer for UTF-16 intermediate representation
const int32_t uniStrCapacity = UCNV_GET_MAX_BYTES_FOR_STRING(src.length(), ucnv_getMaxCharSize(conv));
@generalmimon generalmimon (Member) Apr 6, 2025

Is ucnv_getMaxCharSize(conv) guaranteed to be sufficient here? I'm not sure, because the documentation of ucnv_getMaxCharSize() says the following:

Returns the maximum number of bytes that are output per UChar in conversion from Unicode using this converter.

The returned number can be used with UCNV_GET_MAX_BYTES_FOR_STRING to calculate the size of a target buffer for conversion from Unicode.

Whereas here we use it for a target buffer for conversion to Unicode (UTF-16), which is the opposite.

The examples of returned values also illustrate the possible problem nicely:

Examples for returned values:

  • SBCS charsets: 1
  • Shift-JIS: 2
  • (...)

SBCS apparently stands for single-byte character set (I've never heard this acronym before). This makes sense: if we take ISO-8859-1 as an example, then any UChar (UTF-16 code unit, also an unsigned 16-bit integer) valid in this encoding will be converted to just a single byte. However, no character in this charset maps to just 1 byte in UTF-16; in fact, every character takes 2 bytes there.

It seems that this mismatch in meaning wasn't caught by our tests because you're also using UCNV_GET_MAX_BYTES_FOR_STRING incorrectly (in a way that doesn't agree with the documentation):

Calculates the size of a buffer for conversion from Unicode to a charset.

The calculated size is guaranteed to be sufficient for this conversion.

Parameters
    length         Number of UChars to be converted.
    maxCharSize    Return value from ucnv_getMaxCharSize() for the converter that will be used.

So the first parameter of the macro is length, which should be the number of UChars, i.e. UTF-16 code units or unsigned 16-bit integers. However, you pass src.length(), which is the number of bytes.
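
For contrast, a minimal sketch of the direction the documentation actually describes, i.e. conversion from UTF-16 to a target charset, where both macro arguments match their documented meaning. Function and variable names are placeholders, and the error handling is simplified, so this is not the PR's code:

    #include <unicode/ucnv.h>
    #include <string>
    #include <vector>

    // Sketch only: UCNV_GET_MAX_BYTES_FOR_STRING used in the documented
    // direction (UTF-16 -> charset). The length argument counts UChars and
    // maxCharSize comes from the converter that ucnv_fromUChars() writes
    // through. Error handling is simplified to returning an empty string.
    std::string utf16_to_charset_sketch(const std::u16string &uSrc, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            return std::string();

        const int32_t uSrcLen = static_cast<int32_t>(uSrc.length()); // UChars, not bytes
        const int32_t dstCapacity =
            UCNV_GET_MAX_BYTES_FOR_STRING(uSrcLen, ucnv_getMaxCharSize(conv));
        std::vector<char> dst(static_cast<size_t>(dstCapacity));

        const int32_t dstLen = ucnv_fromUChars(
            conv, dst.data(), dstCapacity,
            reinterpret_cast<const UChar *>(uSrc.data()), uSrcLen, &err);
        ucnv_close(conv);
        if (U_FAILURE(err))
            return std::string();
        return std::string(dst.data(), static_cast<size_t>(dstLen));
    }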

Member

According to the docs for ucnv_toUChars():

The maximum output buffer capacity required (barring output from callbacks) will be 2*srcLength (each char may be converted into a surrogate pair).

So we should probably use this instead.
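
Concretely, a minimal sketch of that sizing, assuming the usual ucnv_open()/ucnv_toUChars() flow; names are placeholders and the error handling is simplified, so this is not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <string>
    #include <vector>

    // Sketch only: size the intermediate UTF-16 buffer as 2 * srcLength UChars,
    // which the ucnv_toUChars() docs guarantee to be sufficient (each input
    // byte can expand to at most a surrogate pair). Error handling is
    // simplified to returning an empty string.
    std::u16string bytes_to_utf16_sketch(const std::string &src, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            return std::u16string();

        const int32_t srcLen = static_cast<int32_t>(src.length());
        const int32_t uniStrCapacity = 2 * srcLen; // per the ucnv_toUChars() docs
        std::vector<UChar> uniStr(static_cast<size_t>(uniStrCapacity));

        const int32_t uniLen = ucnv_toUChars(
            conv, uniStr.data(), uniStrCapacity, src.data(), srcLen, &err);
        ucnv_close(conv);
        if (U_FAILURE(err))
            return std::u16string();
        return std::u16string(reinterpret_cast<const char16_t *>(uniStr.data()),
                              static_cast<size_t>(uniLen));
    }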

@generalmimon generalmimon (Member) Apr 13, 2025

Hmm, after thinking about it for a while, I've convinced myself that this calculated capacity should (coincidentally) always be enough, even if the calculation is logically incorrect and should definitely be fixed.

In theory, the only scenario in which it could underallocate memory would be for an SBCS (single-byte character set) with some character outside the BMP (Basic Multilingual Plane) - then src.length() would be 1 (assuming that a 1-character string with that one character is converted) and ucnv_getMaxCharSize(conv) would also be 1 (since it is an SBCS), so uniStrCapacity would be 1 and thus the uniStr array would only have space for 1 UChar (16-bit code unit). But since all Unicode characters outside the BMP (i.e. supplementary characters, U+10000 and greater) require 2 UChars to be represented in UTF-16 (see https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF), this would lead to an error.

However, such an SBCS seemingly doesn't exist, because the BMP pretty much covers all the common characters that any sane (real-world) SBCS would use. The most popular supplementary characters (U+10000 and above) are emojis, but of course there is no real-world SBCS with emojis, since all SBCSs predate emojis by a lot.

So underallocation isn't a problem here, but overallocation by a certain factor is quite likely. For example, if we're decoding a UTF-8 string, then ucnv_getMaxCharSize(conv) returns 3 - see ucnv_getMaxCharSize() docs:

Examples for returned values:

  • (...)
  • UTF-8: 3 (3 per BMP, 4 per surrogate pair)

So uniStr will be allocated with a capacity of at least 3 * src.length(), while 2 * src.length() is already guaranteed to be enough. As mentioned above, each character of any encoding could be converted into a surrogate pair (two 16-bit code units, i.e. 2 UChars) in the worst case, but it doesn't get any worse.

}

// Configure source converter to stop on illegal sequences
err = U_ZERO_ERROR;
@generalmimon generalmimon (Member) Apr 14, 2025

I don't think these err = U_ZERO_ERROR assignments are necessary, because we check the value of err after every call to an ICU function and throw an exception if (U_FAILURE(err)). Therefore, if we get past any if (U_FAILURE(err)) { ... } block, we know that err indicates success.

Strictly speaking, it might not be equal to U_ZERO_ERROR = 0, because negative err values are used for warnings and U_ZERO_ERROR is only used when there are no warnings. But warnings don't represent errors, so they will not cause ICU functions to exit early when they see one - only errors do, see the UErrorCode API docs:

Note: By convention, ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test:

if (U_FAILURE(errorCode)) { return immediately; }

so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operations.

Note that U_FAILURE is defined like this - icu4c/source/common/unicode/utypes.h:713-717:

    /**
     * Does the error code indicate a failure?
     * @stable ICU 2.0
     */
#   define U_FAILURE(x) ((x)>U_ZERO_ERROR)

Bottom line, I'd remove them because they are unnecessary (to be clear, we still need to initialize the variable as UErrorCode err = U_ZERO_ERROR; at the beginning, but then we don't have to worry about it anymore).
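
A minimal sketch of the resulting pattern, with err initialized exactly once; the converter names and the thrown exception type are placeholders, not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <stdexcept>

    // Sketch only: UErrorCode is initialized once; each ICU call is followed by
    // a U_FAILURE() check, so err never needs to be reset in between.
    void error_chain_sketch() {
        UErrorCode err = U_ZERO_ERROR; // initialized exactly once

        UConverter *conv = ucnv_open("SHIFT_JIS", &err);
        if (U_FAILURE(err))
            throw std::runtime_error(u_errorName(err));

        // No `err = U_ZERO_ERROR;` here: reaching this point already implies
        // the previous call succeeded (possibly with a warning, which
        // U_FAILURE() does not treat as a failure).
        UConverter *utf8Conv = ucnv_open("UTF-8", &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw std::runtime_error(u_errorName(err));
        }

        ucnv_close(utf8Conv);
        ucnv_close(conv);
    }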

delete[] uniStr;
ucnv_close(conv);
ucnv_close(utf8Conv);
throw illegal_seq_in_encoding(u_errorName(err));
Member

This is the only place where illegal_seq_in_encoding is thrown when running our test suite, and the only place where it actually makes sense to throw it. It makes no sense when handling potential errors from ucnv_setToUCallBack, ucnv_setFromUCallBack or ucnv_fromUChars - if these fail, it doesn't indicate an illegal sequence in the input string (instead, it would indicate a bug in our bytes_to_str implementation, as none of these should normally fail regardless of the user input).

Therefore, throw illegal_seq_in_encoding should only remain here and throw bytes_to_str_error should be used everywhere else.
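
To illustrate the split, a minimal self-contained sketch; the two exception classes below are hypothetical stand-ins for the runtime's real definitions, and all other names and the control flow are placeholders rather than the PR's actual code:

    #include <unicode/ucnv.h>
    #include <unicode/ucnv_err.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the runtime's exception classes mentioned in
    // this thread; the real definitions live in the C++ runtime and may differ.
    struct bytes_to_str_error : std::runtime_error {
        explicit bytes_to_str_error(const std::string &what_arg) : std::runtime_error(what_arg) {}
    };
    struct illegal_seq_in_encoding : bytes_to_str_error {
        explicit illegal_seq_in_encoding(const std::string &what_arg) : bytes_to_str_error(what_arg) {}
    };

    // Sketch only: setup failures (ucnv_open, ucnv_setToUCallBack) become
    // bytes_to_str_error, since they would indicate a bug rather than bad
    // input; only the actual decode step raises illegal_seq_in_encoding.
    std::u16string decode_sketch(const std::string &src, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            throw bytes_to_str_error(u_errorName(err));

        ucnv_setToUCallBack(conv, UCNV_TO_U_CALLBACK_STOP, NULL, NULL, NULL, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw bytes_to_str_error(u_errorName(err)); // our bug, not bad input
        }

        const int32_t srcLen = static_cast<int32_t>(src.length());
        std::vector<UChar> uniStr(static_cast<size_t>(2 * srcLen));
        const int32_t uniLen = ucnv_toUChars(
            conv, uniStr.data(), 2 * srcLen, src.data(), srcLen, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw illegal_seq_in_encoding(u_errorName(err)); // genuinely bad input
        }

        ucnv_close(conv);
        return std::u16string(reinterpret_cast<const char16_t *>(uniStr.data()),
                              static_cast<size_t>(uniLen));
    }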
