UTF8Encoding drops bytes during decoding some input sequences #29017

Closed
BCSharp opened this issue Mar 19, 2019 · 7 comments · Fixed by dotnet/coreclr#21948

Assignees: GrabYourPitchforks
Labels: area-System.Text.Encoding, bug, tenet-compatibility (Incompatibility with previous versions or .NET Framework)
Milestone: 3.0

Comments


BCSharp commented Mar 19, 2019

For some input byte sequences, System.Text.UTF8Encoding loses, i.e. silently drops, some bytes. That is, the bytes are neither decoded by the internal decoder nor passed to the installed DecoderFallback.

Example: the encoded input is 3 valid ASCII characters, 3 bytes encoding a surrogate code point (invalid in UTF-8), and again 3 valid ASCII characters. The default Encoding.UTF8 singleton instance uses a decoder replacement fallback, which converts every invalid byte to U+FFFD ('�').

using System;
using System.Text;

byte[] encoded = new byte[] {
    (byte)'a', (byte)'b', (byte)'c',
    0xED, 0xA0, 0x90,                 // would encode surrogate U+D810: invalid in UTF-8
    (byte)'x', (byte)'y', (byte)'z'
};
char[] decoded = Encoding.UTF8.GetChars(encoded);
Console.WriteLine(decoded);

Produced output:

abc��xyz

Expected output:

abc���xyz

The produced output is only 8 characters long. Although it is not visible in the example above, further debugging with a custom DecoderFallback implementation (a sketch of one follows) reveals that the first two invalid bytes (0xED, 0xA0) are passed to the fallback, but the byte 0x90 is skipped.
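
For reference, a minimal sketch of such a diagnostic fallback (the type names are hypothetical; it mimics the default replacement fallback while printing each batch of invalid bytes the decoder hands off):

using System;
using System.Text;

// Hypothetical diagnostic fallback: prints each batch of invalid bytes it
// receives, then emits a single U+FFFD for that batch, mimicking the default
// replacement fallback.
class LoggingDecoderFallback : DecoderFallback
{
    public override int MaxCharCount => 1;

    public override DecoderFallbackBuffer CreateFallbackBuffer()
        => new LoggingDecoderFallbackBuffer();

    private sealed class LoggingDecoderFallbackBuffer : DecoderFallbackBuffer
    {
        private int _remaining;

        public override int Remaining => _remaining;

        public override bool Fallback(byte[] bytesUnknown, int index)
        {
            // Log exactly which invalid bytes were passed in this call.
            Console.WriteLine("Fallback received: " + BitConverter.ToString(bytesUnknown));
            _remaining = 1;   // one replacement character per batch
            return true;
        }

        public override char GetNextChar()
        {
            if (_remaining <= 0) return '\0';
            _remaining--;
            return '\uFFFD';
        }

        public override bool MovePrevious() => false;
    }
}

Hooked up via Encoding.GetEncoding("utf-8", EncoderFallback.ReplacementFallback, new LoggingDecoderFallback()), decoding the example input prints the batches the decoder actually hands to the fallback.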

Also, continuing the example, compare this with the correct behaviour of ASCIIEncoding, likewise using the default replacement fallback.

decoded = Encoding.ASCII.GetChars(encoded);
Console.WriteLine(decoded);

Produced correct output (9 characters):

abc???xyz

Related issue: #14785

BCSharp changed the title from "UTF8Encoding drops bytes during decoding for some types of input" to "UTF8Encoding drops bytes during decoding some input sequences" Mar 19, 2019

BCSharp commented Mar 20, 2019

Further debugging shows that the problem is with the replacement decoder fallback, not UTF-8 per se: the fallback receives all invalid bytes in two batches, first 0xED, 0xA0, then, on a separate call, 0x90. Nevertheless, only two replacement characters are produced.


jkotas commented Mar 20, 2019

cc @GrabYourPitchforks

GrabYourPitchforks self-assigned this Mar 21, 2019
@GrabYourPitchforks

The replacement decoder fallback logic is correct, but it looks like the implementation of UTF8Encoding is incorrect. The implementation should follow the Unicode best practice of maximal subpart replacement, which means that [ ED ], [ A0 ], and [ 90 ] should each be sent to the fallback mechanism separately, since each needs to be replaced separately.

I mention the replacement decoder fallback logic is correct because it should be replacing each sequence fed to it with a single � character, regardless of how many bytes were in the sequence. For example, were the sample input the three-byte sequence [ ED 9F C0 ], I'd expect the two sequences [ ED 9F ] and [ C0 ] to be fed to the replacement decoder fallback separately, resulting in the two-character output "��".
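
To make that concrete, a minimal sketch of the expected behaviour for that input (the output comment assumes the fixed, maximal-subpart-compliant decoder, not the buggy one reported here):

using System;
using System.Text;

byte[] input = { 0xED, 0x9F, 0xC0 };
// Under maximal subpart replacement, [ ED 9F ] is a truncated (but otherwise
// plausible) sequence and [ C0 ] is a separate invalid byte, so each batch is
// fed to the fallback on its own and replaced with one U+FFFD.
Console.WriteLine(Encoding.UTF8.GetChars(input));   // expected: "��"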

The good news is that I confirmed that this is already fixed in the feature branch. The bad news is that once the PR comes through it'll be a breaking change from Full Framework and .NET Core 2.1. Oh what joy. :)

@GrabYourPitchforks

In particular, for the specific part of the Unicode Standard that is relevant to this scenario, see Chapter 3 (PDF link), then scroll down to Table 3-9.


dori4n commented Mar 21, 2019

Isn't this issue simply a result of treating characters above the ASCII range as multi-byte characters in UTF-8? In your example, you have encoded a leading byte for the Hangul region, for which some continuation sequences are valid, followed by a continuation byte for +20, and then a continuation byte for +10. This would result in one invalid code point in the Hangul table, since the first continuation byte is already out of the valid range (it points to the range reserved for UTF-16 surrogate halves), plus one unexpected continuation byte. That, then, should produce at most two replacement characters for a byte sequence treated as UTF-8, not three. Treating the entire three-byte sequence as an overlong invalid code point and replacing it with a single replacement character would also appear to be valid behavior to me.
Please note: I have not tried to debug the decoder myself to check whether the byte array passed to the fallback is a multi-byte sequence where appropriate, so that this interpretation makes sense, but I see nothing wrong with the result in your particular test case for the bytes chosen to decode as UTF-8.

> Further debugging shows that the problem is with the replacement decoder fallback, not UTF-8 per se: the fallback receives all invalid bytes in two batches, first 0xED, 0xA0, then, on a separate call, 0x90. Nevertheless, only two replacement characters are produced.

Oh, so... correct behavior, then. Everything is fine.


BCSharp commented Mar 21, 2019

> Oh, so... correct behavior, then. Everything is fine.

Not quite. The sequence [ ED 9F C0 ] is a single, though invalid, UTF-8 character: U+D810, which is a high surrogate. So I was expecting either three "���", one for each invalid byte, or one "�" for the invalid character. But producing two replacement characters seems merely to expose the internal workings of UTF8Encoding and the replacement decoder fallback, as the output does not correspond to anything in the input byte string.

As @GrabYourPitchforks points out, in this particular case the Unicode standard (Table 3-9) recommends three replacement characters. One replacement character for two bytes is only for truncated valid sequences (Table 3-11).
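
For contrast, a short sketch of the distinction between the two tables (the output comments assume the fixed, maximal-subpart-compliant decoder):

using System;
using System.Text;

// E1 80 is a truncated but otherwise well-formed 3-byte prefix (Table 3-11):
// the whole maximal subpart is replaced by a single U+FFFD.
Console.WriteLine(Encoding.UTF8.GetChars(new byte[] { 0xE1, 0x80 }).Length);   // 1
// ED A0 would encode a UTF-16 surrogate (Table 3-9): each byte is its own
// maximal invalid subsequence, so two U+FFFD are produced.
Console.WriteLine(Encoding.UTF8.GetChars(new byte[] { 0xED, 0xA0 }).Length);   // 2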

I am happy to read that this has already been patched.
@GrabYourPitchforks I would appreciate a link to the commit(s) in the feature branch that implement the fix.

@GrabYourPitchforks

@BCSharp I don't have the commit in a publicly accessible location right now. But you can verify that this is fixed in the Rune type (which will serve as the basis for transcoding fallback logic) by checking this specific unit test:

https://github.com/dotnet/corefx/blob/a1027e9c932a716da65825223e5dce635dcba30f/src/System.Runtime/tests/System/Text/RuneTests.netcoreapp.cs#L215

That specifically tests the condition that you care about: if we see two UTF-8 bytes that indicate they're about to encode a UTF-16 surrogate code point, we report this as a maximal invalid subsequence of length 1. That means that we'll try decoding again from the very next byte (rather than skip a byte, as UTF8Encoding currently does), resulting in the correct number of U+FFFD replacements.
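
For illustration, a minimal sketch of observing that behaviour with the .NET Core 3.0 Rune API, run against the original input from this issue:

using System;
using System.Buffers;
using System.Text;

ReadOnlySpan<byte> remaining = new byte[] { 0xED, 0xA0, 0x90 };
while (!remaining.IsEmpty)
{
    // On InvalidData, 'consumed' is the length of the maximal invalid
    // subsequence, so the loop resumes at the very next undecoded byte.
    OperationStatus status =
        Rune.DecodeFromUtf8(remaining, out Rune value, out int consumed);
    Console.WriteLine($"{status}: consumed {consumed} byte(s)");
    remaining = remaining.Slice(consumed);
}
// Expected: three lines of "InvalidData: consumed 1 byte(s)", i.e. a maximal
// invalid subsequence of length 1 per byte, mapping to three U+FFFD replacements.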

msftgits transferred this issue from dotnet/corefx Feb 1, 2020
msftgits added this to the 3.0 milestone Feb 1, 2020
ghost locked as resolved and limited conversation to collaborators Dec 13, 2020