UTF8Encoding drops bytes during decoding some input sequences #29017
Comments
Further debugging shows that the problem is with the replacement decoder fallback, not UTF-8 decoding per se: the fallback receives all three invalid bytes, but in two batches: first 0xED, 0xA0, and then, in a separate call, 0x90. Consequently, only two replacement characters are produced.
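Which bytes reach the fallback, and in how many calls, can be observed with a logging error handler. Python's codecs module provides an analogous hook to .NET's DecoderFallback; this is a sketch for illustration (the handler name log_fffd is made up), not the .NET code under discussion:

```python
import codecs

seen = []

def log_fffd(exc):
    # Record the exact invalid bytes handed to the fallback,
    # then substitute one U+FFFD and resume after them.
    seen.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)

codecs.register_error("log_fffd", log_fffd)

out = b"abc\xed\xa0\x90xyz".decode("utf-8", errors="log_fffd")
print(seen)  # each maximal invalid subsequence arrives in its own call
print(out)
```

In a conformant decoder, each of the three invalid bytes arrives in its own call, so three replacement characters are produced; the two-batch behavior described above yields only two.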
The replacement decoder fallback logic is correct; the bug appears to be in the decoder implementation that invokes it. I mention the replacement decoder fallback logic is correct because it should be replacing each sequence fed to it with a single � character, regardless of how many bytes were in the sequence. For example, had the entire three-byte sample input been fed to the fallback in a single call, a single � would be the right output.

The good news is that I confirmed that this is already fixed in the feature branch. The bad news is that once the PR comes through it'll be a breaking change from Full Framework and .NET Core 2.1. Oh what joy. :)
In particular, for the specific part of the Unicode Standard that is relevant to this scenario, see Chapter 3 (PDF link), then scroll down to Table 3-9.
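Python's built-in UTF-8 decoder follows the Table 3-9 recommendation (one U+FFFD per maximal subpart of an ill-formed subsequence), so it can illustrate what the standard prescribes for this input:

```python
# Table 3-9 recommends one U+FFFD per "maximal subpart" of an
# ill-formed subsequence. For ED A0 90 (the would-be UTF-8 encoding
# of the surrogate U+D800), the maximal subparts are ED, A0, and 90,
# so a conformant decoder emits three replacement characters.
data = b"\xed\xa0\x90"
decoded = data.decode("utf-8", errors="replace")
print(decoded)       # three replacement characters
print(len(decoded))  # 3
```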
Isn't that issue simply a result of treating characters above the ASCII range as multi-byte characters in UTF-8? In your example, you have encoded a leading byte for the Hangul region, for which some continuation sequences are valid, followed by continuation bytes that would decode to a surrogate code point, which is not valid UTF-8.
Oh, so... correct behavior, then. Everything is fine.
Not quite. The sequence should produce three replacement characters, not two. As @GrabYourPitchforks points out, in this particular case the Unicode Standard (Table 3-9) recommends three replacement characters. One replacement character for two bytes is appropriate only for truncated valid sequences (Table 3-11). I am happy to read that this has already been patched.
@BCSharp I don't have the commit in a publicly accessible location right now, but you can verify that this is fixed in the feature branch. Its test suite specifically covers the condition that you care about: if we see two UTF-8 bytes that indicate they're about to encode a UTF-16 surrogate code point, we report a maximal invalid subsequence of length 1. That means that we'll try decoding again from the very next byte (rather than skip a byte, as the current implementation does).
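The length-1 subsequence rule described here can also be seen in Python's decoder, which already behaves this way; a sketch for illustration, not the .NET fix itself:

```python
# 0xED 0xA0 looks like the start of a UTF-16 surrogate encoding,
# which is ill-formed in UTF-8. A conformant decoder reports 0xED
# alone as a maximal invalid subsequence of length 1, then resumes
# at 0xA0, so each byte yields its own U+FFFD.
two_bytes = b"\xed\xa0".decode("utf-8", errors="replace")
print(two_bytes)  # two replacement characters, not one

# Resuming at the next byte also means a following valid byte survives:
mixed = b"\xed\xa0A".decode("utf-8", errors="replace")
print(mixed)
```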
For some input byte sequences, System.Text.UTF8Encoding loses, or silently drops, some bytes. That is, the bytes are neither decoded by the internal decoder nor passed to the installed DecoderFallback.

Example: the encoded input is 3 valid ASCII characters, 3 bytes encoding a surrogate character, and again 3 valid ASCII characters. The default encoding singleton instance uses a decoder replacement fallback, which converts every invalid byte to U+FFFD ('�').

Produced output:

Expected output:

The produced output is only 8 characters long. Although it is not visible in the example above, further debugging with a custom DecoderFallback implementation reveals that the first two invalid bytes (0xED, 0xA0) are passed to the fallback, but the byte 0x90 is skipped.

Also, continuing the example, compare to the correct behaviour of ASCIIEncoding, also with the default replacement fallback.

Produced correct output (9 characters):
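The expected output for such an input can be sketched with Python's conformant UTF-8 decoder. The concrete byte values ("abc", ED A0 90, "xyz") are assumptions for illustration; the issue's actual sample bytes are not shown above:

```python
# Assumed sample input: 3 ASCII bytes, the 3-byte encoding of the
# surrogate U+D800 (ED A0 90, ill-formed in UTF-8), 3 more ASCII bytes.
data = b"abc" + b"\xed\xa0\x90" + b"xyz"

# A conformant decoder replaces each of the three invalid bytes with
# its own U+FFFD and drops nothing, yielding 9 characters; the buggy
# UTF8Encoding behavior described in this issue yields only 8.
decoded = data.decode("utf-8", errors="replace")
print(decoded)
print(len(decoded))  # 9
```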
Related issue: #14785