UTF8Encoding drops bytes during decoding some input sequences #29017

Closed
BCSharp opened this issue Mar 19, 2019 · 7 comments · Fixed by dotnet/coreclr#21948

Assignees: GrabYourPitchforks
Labels: area-System.Text.Encoding, bug, tenet-compatibility (Incompatibility with previous versions or .NET Framework)
Milestone: 3.0

Comments


BCSharp commented Mar 19, 2019

For some input byte sequences, System.Text.UTF8Encoding loses, i.e. silently drops, some bytes. That is, the bytes are neither decoded by the internal decoder nor passed to the installed DecoderFallback.

Example: the encoded input is 3 valid ASCII characters, 3 bytes encoding a surrogate code point (invalid in UTF-8), and again 3 valid ASCII characters. The default Encoding.UTF8 singleton instance uses a decoder replacement fallback, which converts every invalid byte to U+FFFD ('�').

using System;
using System.Text;

byte[] encoded = new byte[] {
    (byte)'a', (byte)'b', (byte)'c',
    0xED, 0xA0, 0x90,                 // would encode surrogate U+D810: invalid in UTF-8
    (byte)'x', (byte)'y', (byte)'z'
};
char[] decoded = Encoding.UTF8.GetChars(encoded);
Console.WriteLine(decoded);

Produced output:

abc��xyz

Expected output:

abc���xyz

The produced output is only 8 characters long. Although it is not visible in the example above, further debugging with a custom DecoderFallback implementation (a sketch of one follows) reveals that the first two invalid bytes (0xED, 0xA0) are passed to the fallback, but the byte 0x90 is skipped.
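
For reference, a minimal sketch of such a diagnostic fallback (the type names are hypothetical; it mimics the default replacement fallback while printing each batch of invalid bytes the decoder hands off):

using System;
using System.Text;

// Hypothetical diagnostic fallback: prints each batch of invalid bytes it
// receives, then emits a single U+FFFD for that batch, mimicking the default
// replacement fallback.
class LoggingDecoderFallback : DecoderFallback
{
    public override int MaxCharCount => 1;

    public override DecoderFallbackBuffer CreateFallbackBuffer()
        => new LoggingDecoderFallbackBuffer();

    private sealed class LoggingDecoderFallbackBuffer : DecoderFallbackBuffer
    {
        private int _remaining;

        public override int Remaining => _remaining;

        public override bool Fallback(byte[] bytesUnknown, int index)
        {
            // Log exactly which invalid bytes were passed in this call.
            Console.WriteLine("Fallback received: " + BitConverter.ToString(bytesUnknown));
            _remaining = 1;   // one replacement character per batch
            return true;
        }

        public override char GetNextChar()
        {
            if (_remaining <= 0) return '\0';
            _remaining--;
            return '\uFFFD';
        }

        public override bool MovePrevious() => false;
    }
}

Hooked up via Encoding.GetEncoding("utf-8", EncoderFallback.ReplacementFallback, new LoggingDecoderFallback()), decoding the example input prints the batches the decoder actually hands to the fallback.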

Also, continuing the example, compare this with the correct behaviour of ASCIIEncoding, likewise using the default replacement fallback.

decoded = Encoding.ASCII.GetChars(encoded);
Console.WriteLine(decoded);

Produced correct output (9 characters):

abc???xyz

Related issue: #14785

BCSharp changed the title from "UTF8Encoding drops bytes during decoding for some types of input" to "UTF8Encoding drops bytes during decoding some input sequences" Mar 19, 2019

BCSharp commented Mar 20, 2019

Further debugging shows that the problem is with the replacement decoder fallback, not UTF-8 per se: the fallback receives all invalid bytes in two batches, first 0xED, 0xA0, then, on a separate call, 0x90. Nevertheless, only two replacement characters are produced.


jkotas commented Mar 20, 2019

cc @GrabYourPitchforks

GrabYourPitchforks self-assigned this Mar 21, 2019
@GrabYourPitchforks

The replacement decoder fallback logic is correct, but it looks like the implementation of UTF8Encoding is incorrect. The implementation should follow the Unicode best practice of maximal subpart replacement, which means that [ ED ], [ A0 ], and [ 90 ] should each be sent to the fallback mechanism separately, since each needs to be replaced separately.

I mention the replacement decoder fallback logic is correct because it should be replacing each sequence fed to it with a single � character, regardless of how many bytes were in the sequence. For example, were the sample input the three-byte sequence [ ED 9F C0 ], I'd expect the two sequences [ ED 9F ] and [ C0 ] to be fed to the replacement decoder fallback separately, resulting in the two-character output "��".
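
To make that concrete, a minimal sketch of the expected behaviour for that input (the output comment assumes the fixed, maximal-subpart-compliant decoder, not the buggy one reported here):

using System;
using System.Text;

byte[] input = { 0xED, 0x9F, 0xC0 };
// Under maximal subpart replacement, [ ED 9F ] is a truncated (but otherwise
// plausible) sequence and [ C0 ] is a separate invalid byte, so each batch is
// fed to the fallback on its own and replaced with one U+FFFD.
Console.WriteLine(Encoding.UTF8.GetChars(input));   // expected: "��"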

The good news is that I confirmed that this is already fixed in the feature branch. The bad news is that once the PR comes through it'll be a breaking change from Full Framework and .NET Core 2.1. Oh what joy. :)

@GrabYourPitchforks

In particular, for the specific part of the Unicode Standard that is relevant to this scenario, see Chapter 3 (PDF link), then scroll down to Table 3-9.


dori4n commented Mar 21, 2019

Isn't this issue simply a result of treating characters above the ASCII range as multi-byte characters in UTF-8? In your example, you have encoded a leading byte for the Hangul region, for which some continuation sequences are valid, followed by a continuation byte for +20, and then a continuation byte for +10. This would result in one invalid code point in the Hangul table, since the first continuation byte is already out of the valid range (it points to the range reserved for UTF-16 surrogate halves), plus one unexpected continuation byte. That, then, should produce at most two replacement characters for a byte sequence treated as UTF-8, not three. Treating the entire three-byte sequence as an overlong invalid code point and replacing it with a single replacement character would also appear to be valid behavior to me.
Please note: I have not tried to debug the decoder myself to check whether the byte array passed to the fallback is a multi-byte sequence where appropriate, so that this interpretation makes sense, but I see nothing wrong with the result in your particular test case for the bytes chosen to decode as UTF-8.

> Further debugging shows that the problem is with the replacement decoder fallback, not UTF-8 per se: the fallback receives all invalid bytes in two batches, first 0xED, 0xA0, then, on a separate call, 0x90. Nevertheless, only two replacement characters are produced.

Oh, so... correct behavior, then. Everything is fine.


BCSharp commented Mar 21, 2019

> Oh, so... correct behavior, then. Everything is fine.

Not quite. The sequence [ ED 9F C0 ] is a single, though invalid, UTF-8 character: U+D810, which is a high surrogate. So I was expecting either three "���", one for each invalid byte, or one "�" for the invalid character. But producing two replacement characters seems merely to expose the internal workings of UTF8Encoding and the replacement decoder fallback, as the output does not correspond to anything in the input byte string.

As @GrabYourPitchforks points out, in this particular case the Unicode standard (Table 3-9) recommends three replacement characters. One replacement character for two bytes is only for truncated valid sequences (Table 3-11).
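
For contrast, a short sketch of the distinction between the two tables (the output comments assume the fixed, maximal-subpart-compliant decoder):

using System;
using System.Text;

// E1 80 is a truncated but otherwise well-formed 3-byte prefix (Table 3-11):
// the whole maximal subpart is replaced by a single U+FFFD.
Console.WriteLine(Encoding.UTF8.GetChars(new byte[] { 0xE1, 0x80 }).Length);   // 1
// ED A0 would encode a UTF-16 surrogate (Table 3-9): each byte is its own
// maximal invalid subsequence, so two U+FFFD are produced.
Console.WriteLine(Encoding.UTF8.GetChars(new byte[] { 0xED, 0xA0 }).Length);   // 2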

I am happy to read that this has already been patched.
@GrabYourPitchforks I would appreciate a link to the commit(s) in the feature branch that implement the fix.

@GrabYourPitchforks

@BCSharp I don't have the commit in a publicly accessible location right now. But you can verify that this is fixed in the Rune type (which will serve as the basis for transcoding fallback logic) by checking this specific unit test:

https://github.com/dotnet/corefx/blob/a1027e9c932a716da65825223e5dce635dcba30f/src/System.Runtime/tests/System/Text/RuneTests.netcoreapp.cs#L215

That specifically tests the condition that you care about: if we see two UTF-8 bytes that indicate they're about to encode a UTF-16 surrogate code point, we report this as a maximal invalid subsequence of length 1. That means that we'll try decoding again from the very next byte (rather than skip a byte, as UTF8Encoding currently does), resulting in the correct number of U+FFFD replacements.
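
For illustration, a minimal sketch of observing that behaviour with the .NET Core 3.0 Rune API, run against the original input from this issue:

using System;
using System.Buffers;
using System.Text;

ReadOnlySpan<byte> remaining = new byte[] { 0xED, 0xA0, 0x90 };
while (!remaining.IsEmpty)
{
    // On InvalidData, 'consumed' is the length of the maximal invalid
    // subsequence, so the loop resumes at the very next undecoded byte.
    OperationStatus status =
        Rune.DecodeFromUtf8(remaining, out Rune value, out int consumed);
    Console.WriteLine($"{status}: consumed {consumed} byte(s)");
    remaining = remaining.Slice(consumed);
}
// Expected: three lines of "InvalidData: consumed 1 byte(s)", i.e. a maximal
// invalid subsequence of length 1 per byte, mapping to three U+FFFD replacements.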

msftgits transferred this issue from dotnet/corefx Feb 1, 2020
msftgits added this to the 3.0 milestone Feb 1, 2020
ghost locked as resolved and limited conversation to collaborators Dec 13, 2020