Added ICU charset conversion implementation #64

Open · wants to merge 10 commits into master

Conversation

@GreyCat GreyCat (Member) commented Jul 27, 2023

This adds another option to use the ICU library for character set conversions in the C++ runtime.

Known caveats so far:

  • By default, ICU substitutes illegal sequences with placeholder codepoints rather than actively raising an alarm (which we could surface as an exception), so all illegal-sequence tests fail for now (see the callback sketch after this list).
  • There's a debug print to cout that has yet to be cleaned up.
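
For reference, a minimal sketch of one way to make ICU stop on illegal sequences instead of substituting them, so the failure can be surfaced as an exception. The converter name and the thrown exception type are placeholders, not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <unicode/ucnv_err.h>
    #include <stdexcept>

    // Sketch only: configure an ICU converter to fail on illegal byte sequences
    // instead of silently emitting replacement characters. The charset name and
    // the exception type are placeholders.
    void open_strict_converter_sketch() {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open("SHIFT_JIS", &err);
        if (U_FAILURE(err))
            throw std::runtime_error(u_errorName(err));

        // UCNV_TO_U_CALLBACK_STOP makes ucnv_toUChars() report errors such as
        // U_ILLEGAL_CHAR_FOUND instead of substituting U+FFFD, so the caller
        // can turn the failure into an exception.
        ucnv_setToUCallBack(conv, UCNV_TO_U_CALLBACK_STOP, NULL, NULL, NULL, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw std::runtime_error(u_errorName(err));
        }

        ucnv_close(conv);
    }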

@GreyCat GreyCat marked this pull request as ready for review April 6, 2025 14:07
@GreyCat GreyCat (Member, Author) commented Apr 6, 2025

@generalmimon If you want to take a closer look, this is now ready for review.

Should be a relatively contained change. I plan to squash everything on merge.

@generalmimon generalmimon (Member)

It would be good to check if we can apply something from Recommended Build Options. The How To Use ICU section states:

For C++, note that there are Recommended Build Options (both for normal use and for ICU as system-level libraries) which are not default simply for compatibility with older ICU-using code.

@GreyCat GreyCat (Member, Author) commented Apr 6, 2025

It would be good to check if we can apply something from Recommended Build Options.

As far as I can tell, this speaks to setting up some defines to make use of namespaces. Namespaces per se are only used in the C++ APIs (which themselves need to be activated by the U_SHOW_CPLUSPLUS_API define), and we only use the C APIs, which don't seem to be affected by those.

Can you clarify what exactly you want to see in the build options for these?

}

// Allocate buffer for UTF-16 intermediate representation
const int32_t uniStrCapacity = UCNV_GET_MAX_BYTES_FOR_STRING(src.length(), ucnv_getMaxCharSize(conv));
@generalmimon generalmimon (Member) Apr 6, 2025

Is ucnv_getMaxCharSize(conv) guaranteed to be sufficient here? I'm not sure, because the documentation of ucnv_getMaxCharSize() says the following:

Returns the maximum number of bytes that are output per UChar in conversion from Unicode using this converter.

The returned number can be used with UCNV_GET_MAX_BYTES_FOR_STRING to calculate the size of a target buffer for conversion from Unicode.

Whereas here we use it for a target buffer for conversion to Unicode (UTF-16), which is the opposite.

The examples of returned values also illustrate the possible problem nicely:

Examples for returned values:

  • SBCS charsets: 1
  • Shift-JIS: 2
  • (...)

SBCS apparently stands for single-byte character set (I've never heard this acronym before). This makes sense: if we take ISO-8859-1 as an example, then any UChar (UTF-16 code unit, also an unsigned 16-bit integer) valid in this encoding will be converted to just a single byte. However, no character in this charset maps to just 1 byte in UTF-16; in fact, every character takes 2 bytes there.

It seems that this mismatch in meaning wasn't caught by our tests because you're also using UCNV_GET_MAX_BYTES_FOR_STRING incorrectly (in a way that doesn't agree with the documentation):

Calculates the size of a buffer for conversion from Unicode to a charset.

The calculated size is guaranteed to be sufficient for this conversion.

Parameters
    length         Number of UChars to be converted.
    maxCharSize    Return value from ucnv_getMaxCharSize() for the converter that will be used.

So the first parameter of the macro is length, which should be the number of UChars, i.e. UTF-16 code units or unsigned 16-bit integers. However, you pass src.length(), which is the number of bytes.
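
For contrast, a minimal sketch of the direction the documentation actually describes, i.e. conversion from UTF-16 to a target charset, where both macro arguments match their documented meaning. Function and variable names are placeholders, and the error handling is simplified, so this is not the PR's code:

    #include <unicode/ucnv.h>
    #include <string>
    #include <vector>

    // Sketch only: UCNV_GET_MAX_BYTES_FOR_STRING used in the documented
    // direction (UTF-16 -> charset). The length argument counts UChars and
    // maxCharSize comes from the converter that ucnv_fromUChars() writes
    // through. Error handling is simplified to returning an empty string.
    std::string utf16_to_charset_sketch(const std::u16string &uSrc, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            return std::string();

        const int32_t uSrcLen = static_cast<int32_t>(uSrc.length()); // UChars, not bytes
        const int32_t dstCapacity =
            UCNV_GET_MAX_BYTES_FOR_STRING(uSrcLen, ucnv_getMaxCharSize(conv));
        std::vector<char> dst(static_cast<size_t>(dstCapacity));

        const int32_t dstLen = ucnv_fromUChars(
            conv, dst.data(), dstCapacity,
            reinterpret_cast<const UChar *>(uSrc.data()), uSrcLen, &err);
        ucnv_close(conv);
        if (U_FAILURE(err))
            return std::string();
        return std::string(dst.data(), static_cast<size_t>(dstLen));
    }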

Member

According to the docs for ucnv_toUChars():

The maximum output buffer capacity required (barring output from callbacks) will be 2*srcLength (each char may be converted into a surrogate pair).

So we should probably use this instead.
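
Concretely, a minimal sketch of that sizing, assuming the usual ucnv_open()/ucnv_toUChars() flow; names are placeholders and the error handling is simplified, so this is not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <string>
    #include <vector>

    // Sketch only: size the intermediate UTF-16 buffer as 2 * srcLength UChars,
    // which the ucnv_toUChars() docs guarantee to be sufficient (each input
    // byte can expand to at most a surrogate pair). Error handling is
    // simplified to returning an empty string.
    std::u16string bytes_to_utf16_sketch(const std::string &src, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            return std::u16string();

        const int32_t srcLen = static_cast<int32_t>(src.length());
        const int32_t uniStrCapacity = 2 * srcLen; // per the ucnv_toUChars() docs
        std::vector<UChar> uniStr(static_cast<size_t>(uniStrCapacity));

        const int32_t uniLen = ucnv_toUChars(
            conv, uniStr.data(), uniStrCapacity, src.data(), srcLen, &err);
        ucnv_close(conv);
        if (U_FAILURE(err))
            return std::u16string();
        return std::u16string(reinterpret_cast<const char16_t *>(uniStr.data()),
                              static_cast<size_t>(uniLen));
    }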

@generalmimon generalmimon (Member) Apr 13, 2025

Hmm, after thinking about it for a while, I've convinced myself that this calculated capacity should (coincidentally) always be enough, even if the calculation is logically incorrect and should definitely be fixed.

In theory, the only scenario in which it could underallocate memory would be for an SBCS (single-byte character set) with some character outside the BMP (Basic Multilingual Plane) - then src.length() would be 1 (assuming that a 1-character string with that one character is converted) and ucnv_getMaxCharSize(conv) would also be 1 (since it is an SBCS), so uniStrCapacity would be 1 and thus the uniStr array would only have space for 1 UChar (16-bit code unit). But since all Unicode characters outside the BMP (i.e. supplementary characters, U+10000 and greater) require 2 UChars to be represented in UTF-16 (see https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF), this would lead to an error.

However, such an SBCS seemingly doesn't exist, because the BMP pretty much covers all the common characters that any sane (real-world) SBCS would use. The most popular supplementary characters (U+10000 and above) are emojis, but of course there is no real-world SBCS with emojis, since all SBCSs predate emojis by a lot.

So underallocation isn't a problem here, but overallocation by a certain factor is quite likely. For example, if we're decoding a UTF-8 string, then ucnv_getMaxCharSize(conv) returns 3 - see ucnv_getMaxCharSize() docs:

Examples for returned values:

  • (...)
  • UTF-8: 3 (3 per BMP, 4 per surrogate pair)

So uniStr will be allocated with a capacity of at least 3 * src.length(), while 2 * src.length() is already guaranteed to be enough. As mentioned above, each character of any encoding could be converted into a surrogate pair (two 16-bit code units, i.e. 2 UChars) in the worst case, but it doesn't get any worse.

}

// Configure source converter to stop on illegal sequences
err = U_ZERO_ERROR;
@generalmimon generalmimon (Member) Apr 14, 2025

I don't think these err = U_ZERO_ERROR assignments are necessary, because we check the value of err after every call to an ICU function and throw an exception if (U_FAILURE(err)). Therefore, if we get past any if (U_FAILURE(err)) { ... } block, we know that err indicates success.

Strictly speaking, it might not be equal to U_ZERO_ERROR = 0, because negative err values are used for warnings and U_ZERO_ERROR is only used when there are no warnings. But warnings don't represent errors, so they will not cause ICU functions to exit early when they see one - only errors do, see the UErrorCode API docs:

Note: By convention, ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test:

if (U_FAILURE(errorCode)) { return immediately; }

so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operations.

Note that U_FAILURE is defined like this - icu4c/source/common/unicode/utypes.h:713-717:

    /**
     * Does the error code indicate a failure?
     * @stable ICU 2.0
     */
#   define U_FAILURE(x) ((x)>U_ZERO_ERROR)

Bottom line, I'd remove them because they are unnecessary (to be clear, we still need to initialize the variable as UErrorCode err = U_ZERO_ERROR; at the beginning, but then we don't have to worry about it anymore).
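
A minimal sketch of the resulting pattern, with err initialized exactly once; the converter names and the thrown exception type are placeholders, not the PR's actual code:

    #include <unicode/ucnv.h>
    #include <stdexcept>

    // Sketch only: UErrorCode is initialized once; each ICU call is followed by
    // a U_FAILURE() check, so err never needs to be reset in between.
    void error_chain_sketch() {
        UErrorCode err = U_ZERO_ERROR; // initialized exactly once

        UConverter *conv = ucnv_open("SHIFT_JIS", &err);
        if (U_FAILURE(err))
            throw std::runtime_error(u_errorName(err));

        // No `err = U_ZERO_ERROR;` here: reaching this point already implies
        // the previous call succeeded (possibly with a warning, which
        // U_FAILURE() does not treat as a failure).
        UConverter *utf8Conv = ucnv_open("UTF-8", &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw std::runtime_error(u_errorName(err));
        }

        ucnv_close(utf8Conv);
        ucnv_close(conv);
    }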

delete[] uniStr;
ucnv_close(conv);
ucnv_close(utf8Conv);
throw illegal_seq_in_encoding(u_errorName(err));
Member

This is the only place where illegal_seq_in_encoding is thrown when running our test suite, and the only place where it actually makes sense to throw it. It makes no sense when handling potential errors from ucnv_setToUCallBack, ucnv_setFromUCallBack or ucnv_fromUChars - if these fail, it doesn't indicate an illegal sequence in the input string (instead, it would indicate a bug in our bytes_to_str implementation, as none of these should normally fail regardless of the user input).

Therefore, throw illegal_seq_in_encoding should only remain here and throw bytes_to_str_error should be used everywhere else.
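
To illustrate the split, a minimal self-contained sketch; the two exception classes below are hypothetical stand-ins for the runtime's real definitions, and all other names and the control flow are placeholders rather than the PR's actual code:

    #include <unicode/ucnv.h>
    #include <unicode/ucnv_err.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the runtime's exception classes mentioned in
    // this thread; the real definitions live in the C++ runtime and may differ.
    struct bytes_to_str_error : std::runtime_error {
        explicit bytes_to_str_error(const std::string &what_arg) : std::runtime_error(what_arg) {}
    };
    struct illegal_seq_in_encoding : bytes_to_str_error {
        explicit illegal_seq_in_encoding(const std::string &what_arg) : bytes_to_str_error(what_arg) {}
    };

    // Sketch only: setup failures (ucnv_open, ucnv_setToUCallBack) become
    // bytes_to_str_error, since they would indicate a bug rather than bad
    // input; only the actual decode step raises illegal_seq_in_encoding.
    std::u16string decode_sketch(const std::string &src, const char *charsetName) {
        UErrorCode err = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charsetName, &err);
        if (U_FAILURE(err))
            throw bytes_to_str_error(u_errorName(err));

        ucnv_setToUCallBack(conv, UCNV_TO_U_CALLBACK_STOP, NULL, NULL, NULL, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw bytes_to_str_error(u_errorName(err)); // our bug, not bad input
        }

        const int32_t srcLen = static_cast<int32_t>(src.length());
        std::vector<UChar> uniStr(static_cast<size_t>(2 * srcLen));
        const int32_t uniLen = ucnv_toUChars(
            conv, uniStr.data(), 2 * srcLen, src.data(), srcLen, &err);
        if (U_FAILURE(err)) {
            ucnv_close(conv);
            throw illegal_seq_in_encoding(u_errorName(err)); // genuinely bad input
        }

        ucnv_close(conv);
        return std::u16string(reinterpret_cast<const char16_t *>(uniStr.data()),
                              static_cast<size_t>(uniLen));
    }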
