Added ICU charset conversion implementation #64
Conversation
@generalmimon If you want to take a closer look, this is now ready for review. It should be a relatively contained change. I plan to squash everything on merge.
It would be good to check if we can apply something from Recommended Build Options. The How To Use ICU section states:

As far as I can tell, this speaks to setting up some defines to make use of namespaces. Namespaces per se are only used in the C++ APIs (which themselves need to be activated). Can you clarify what exactly you want to see in build options for these?
```cpp
// Allocate buffer for UTF-16 intermediate representation
const int32_t uniStrCapacity = UCNV_GET_MAX_BYTES_FOR_STRING(src.length(), ucnv_getMaxCharSize(conv));
```
Is `ucnv_getMaxCharSize(conv)` guaranteed to be sufficient here? I'm not sure, because the documentation of `ucnv_getMaxCharSize()` says the following:

> Returns the maximum number of bytes that are output per UChar in conversion from Unicode using this converter. The returned number can be used with UCNV_GET_MAX_BYTES_FOR_STRING to calculate the size of a target buffer for conversion from Unicode.

Whereas here we use it for a target buffer for conversion *to* Unicode (UTF-16), which is the opposite direction.

The examples of returned values also illustrate the possible problem nicely:

> Examples for returned values:
>
> - SBCS charsets: 1
> - Shift-JIS: 2
> - (...)

SBCS apparently stands for single-byte character set (I've never heard this acronym before). This makes sense: if we take ISO-8859-1 as an example, then any `UChar` (a UTF-16 code unit, i.e. an unsigned 16-bit integer) valid in this encoding will be converted to just a single byte. However, obviously no character in this charset maps to just 1 byte in UTF-16 (in fact, none does).

It seems that this mismatch in meaning wasn't caught by our tests because you're also using `UCNV_GET_MAX_BYTES_FOR_STRING` incorrectly (in a way that doesn't agree with the documentation):

> Calculates the size of a buffer for conversion from Unicode to a charset. The calculated size is guaranteed to be sufficient for this conversion.
>
> Parameters:
> - `length` — Number of UChars to be converted.
> - `maxCharSize` — Return value from `ucnv_getMaxCharSize()` for the converter that will be used.

So the first parameter of the macro is `length`, which should be the number of `UChar`s, i.e. UTF-16 code units (unsigned 16-bit integers). However, you pass `src.length()`, which is the number of bytes.
According to the docs for `ucnv_toUChars()`:

> The maximum output buffer capacity required (barring output from callbacks) will be 2*srcLength (each char may be converted into a surrogate pair).

So we should probably use this instead.
Hmm, after thinking about it for a while, I've convinced myself that this calculated capacity should (coincidentally) always be enough, even though the calculation is logically incorrect and should definitely be fixed.

In theory, the only scenario in which it could underallocate memory would be an SBCS (single-byte character set) containing some character outside the BMP (Basic Multilingual Plane). Then `src.length()` would be `1` (assuming that a 1-character string with that one character is converted) and `ucnv_getMaxCharSize(conv)` would also be `1` (since it is an SBCS), so `uniStrCapacity` would be `1` and thus the `uniStr` array would only have space for 1 `UChar` (16-bit code unit). But since all Unicode characters outside the BMP (i.e. supplementary characters, U+10000 and greater) require 2 `UChar`s to be represented in UTF-16 (see https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF), this would lead to an error.
However, such an SBCS seemingly doesn't exist, because the BMP pretty much covers all the common characters that any sane (real-world) SBCS would use. The most popular supplementary characters (U+10000 and above) are emojis, but of course there is no real-world SBCS with emojis, since all SBCSs predate emojis by a lot.

So underallocation isn't a problem here, but overallocation by a certain factor is quite likely. For example, if we're decoding a UTF-8 string, then `ucnv_getMaxCharSize(conv)` returns 3 - see the `ucnv_getMaxCharSize()` docs:

> Examples for returned values:
>
> - (...)
> - UTF-8: 3 (3 per BMP, 4 per surrogate pair)

So `uniStr` will be allocated with a capacity of at least `3 * src.length()`, while `2 * src.length()` is already guaranteed to be enough. As mentioned above, each character of any encoding could be converted into a surrogate pair (a pair of two 16-bit code units, i.e. 2 `UChar`s) in the worst case, but it doesn't get any worse.
```cpp
// Configure source converter to stop on illegal sequences
err = U_ZERO_ERROR;
```
I don't think these `err = U_ZERO_ERROR` assignments are necessary, because we check the value of `err` after every call to an ICU function and throw an exception `if (U_FAILURE(err))`. Therefore, if we get past any `if (U_FAILURE(err)) { ... }` block, we know that `err` indicates success.

Strictly speaking, it might not be equal to `U_ZERO_ERROR = 0`, because negative `err` values are used for warnings and `U_ZERO_ERROR` is only used when there are no warnings. But warnings don't represent errors, so they will not cause ICU functions to exit early - only errors do, see the `UErrorCode` API docs:

> Note: By convention, ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test:
>
> if (U_FAILURE(errorCode)) { return immediately; }
>
> so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operations.

Note that `U_FAILURE` is defined like this - icu4c/source/common/unicode/utypes.h:713-717:

```cpp
/**
 * Does the error code indicate a failure?
 * @stable ICU 2.0
 */
#define U_FAILURE(x) ((x)>U_ZERO_ERROR)
```

Bottom line, I'd remove them because they are unnecessary (to be clear, we still need to initialize the variable as `UErrorCode err = U_ZERO_ERROR;` at the beginning, but then we don't have to worry about it anymore).
```cpp
delete[] uniStr;
ucnv_close(conv);
ucnv_close(utf8Conv);
throw illegal_seq_in_encoding(u_errorName(err));
```
This is the only place where `illegal_seq_in_encoding` is thrown when running our test suite, and the only place where it actually makes sense to throw it. It makes no sense when handling potential errors from `ucnv_setToUCallBack`, `ucnv_setFromUCallBack` or `ucnv_fromUChars` - if these fail, it doesn't indicate an illegal sequence in the input string (instead, it would indicate a bug in our `bytes_to_str` implementation, as none of these should normally fail regardless of the user input).

Therefore, `throw illegal_seq_in_encoding` should only remain here, and `throw bytes_to_str_error` should be used everywhere else.
This adds another option to use the ICU library for character set conversions in the C++ runtime.

Known caveats so far:
- By default, ICU substitutes illegal sequences with placeholder codepoints rather than actively raising an alarm (which we could surface as an exception), so all illegal-sequence tests fail for now.
- There's a debug print to cout that has yet to be cleaned up.