Add Utf8String type #23209

GrabYourPitchforks · 2019-03-12T16:37:42Z

This PR contains the type itself and enlightens the VM, the GC, reflection, and certain APIs (Memory / Span) about the new type. Functionality like culture-aware or case-insensitive comparisons are not yet hooked up. Those are implemented in the feature branch, but I didn't include them here because I'm still working on increasing the test coverage.

All Utf8String-related code is currently surrounded by #ifdef statements to allow easy removal from release branches.

All Utf8String-related code is currently surrounded by ifdefs to allow easy removal from release branches

GrabYourPitchforks · 2019-03-12T16:40:05Z

Experimental package, ref asm, and unit tests at dotnet/corefx#35989.

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8.cs

jkotas · 2019-03-12T18:32:15Z

clr.defines.targets

@@ -2,6 +2,7 @@
    <!-- Features we're currently flighting, but don't intend to ship in officially supported releases -->
    <PropertyGroup Condition="'$(IsPrerelease)' == 'true'">


Could you please make sure that this is not set in the Release branch? (It is not the case today.)

It is still true in the release/3.0 branch currently, but that's because there's not yet an officially shipping release coming from that branch. This was set correctly to false in the release/2.x branches (see 5b5dca3).

The previews shipping out of release/3.0 are proxies for officially shipping releases. We should flip that in release/3.0 now. It has not been flipped because of we do not have anything really depending on it for 3.0.

I also kicked off a local build with IsPrerelease=false and ensured that no part of the feature inadvertently makes its way in to System.Private.CoreLib.dll.

src/System.Private.CoreLib/src/System/Utf8String.Construction.cs

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8.cs

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8Utility.cs

GrabYourPitchforks · 2019-03-13T01:30:02Z

I reran the perf benchmarks for the Memory<byte>.Span property getter since that code path was modified. Results are below.

Method	Toolchain	Mean	Error	StdDev	Ratio	RatioSD
GetSpan_MemOfBytesFromArray	2.2-release	4.244 ms	0.0882 ms	0.1474 ms	1.00	0.00
GetSpan_MemOfBytesFromArray	3.0-master	1.933 ms	0.0373 ms	0.0366 ms	0.45	0.02
GetSpan_MemOfBytesFromArray	dev-utf8string	2.268 ms	0.0280 ms	0.0262 ms	0.53	0.03

GetSpan_MemOfBytesFromMemMgr	2.2-release	5.306 ms	0.0996 ms	0.0978 ms	1.00	0.00
GetSpan_MemOfBytesFromMemMgr	3.0-master	3.150 ms	0.0719 ms	0.0769 ms	0.59	0.02
GetSpan_MemOfBytesFromMemMgr	dev-utf8string	3.101 ms	0.0270 ms	0.0239 ms	0.58	0.01

GetSpan_MemOfBytesEmpty	2.2-release	1.393 ms	0.0163 ms	0.0153 ms	1.00	0.00
GetSpan_MemOfBytesEmpty	3.0-master	1.108 ms	0.0208 ms	0.0240 ms	0.79	0.02
GetSpan_MemOfBytesEmpty	dev-utf8string	1.141 ms	0.0227 ms	0.0213 ms	0.82	0.02

Overall the results are about what I'd expect. The byte[] case sees the largest regression because there's now an additional "are you Utf8String?" check where there was none before. However, due to the optimizations introduced earlier in the 3.0 timeframe (see #20386) we're still considerably better than 2.2. The MemoryManager<byte> case sees essentially no impact because that scenario was already expensive to begin with, and the extra few instructions are within the realm of noise.

src/System.Private.CoreLib/src/System/Char8.cs

src/System.Private.CoreLib/src/System/Utf8String.cs

src/System.Private.CoreLib/shared/System/Text/UnicodeUtility.cs

src/System.Private.CoreLib/src/System/Utf8Extensions.cs

src/System.Private.CoreLib/src/System/Utf8String.Manipulation.cs

GrabYourPitchforks · 2019-03-13T23:32:43Z

(In the latest iteration I merged master manually because GitHub was having trouble auto-resolving merge conflicts.)

yahorsi · 2019-03-14T06:11:29Z

May I ask silly question, are there any perf tests showing benefits of having special Utf8String string?

GrabYourPitchforks · 2019-03-14T06:33:04Z

@yahorsi - Not a silly question at all. :) I have some rough numbers showing this, but they're from an earlier iteration of this prototype. Off the top of my head here are the quick vitals:

Given a binary buffer that represents UTF-8 data, creating a Utf8String instance from that buffer is approx. 5x faster than creating a String instance.
For most payloads, UTF-8 consumes less memory than UTF-16. Our data show that most strings still consist of mainly ASCII data, even in programs used primarily by East Asian users. (East Asian is particularly interesting here because UTF-8 representation tends to expand it, not compact it, when compared to UTF-16.) Assuming you could flow Utf8String instances instead of String instances throughout your applications, you should expect to see a double-digit percentage decrease in overall memory utilization.
Utf8String is slightly more compact than using byte[] for the same data. It's only 4 bytes or so on average, but these add up quickly, especially when that represents 10 - 15% of the total object size of a small object.
Projecting from Utf8String back to ROS<byte> or ROM<byte> so that the data can be sent across the wire is an O(1) operation, compared to an O(n) operation for regular String.
Given ASCII Utf8String instances, certain text operations (ToUpperInvariant and friends) perform at around 2x the throughput of those same operations on String. This makes sense: they contain the same amount of information in half the memory.

There are also some drawbacks.

ToUpperInvariant on non-ASCII inputs (or ToUpper in locales like tr-TR, even over ASCII input) are much more expensive on Utf8String than they are on String. This is because the localization tables use UTF-16 representation natively, and calling into them requires temporarily transcoding the input data.
If you find yourself switching between Utf8String and String frequently, there will be considerable overhead to changing the representation. The best course of action is to stick with one representation for as long as you can, only switching between the two if / when absolutely necessary. This could make the type a bit rough to use until there's a healthy ecosystem of APIs built up around it. (And most APIs aren't going to be enlightened about this type since there's no real need to do so.)

In short, the best performance and memory utilization is going to be seen when reading data from the wire, keeping it in UTF-8 representation (instead of converting to UTF-16), performing operations on it while remaining in the UTF-8 world, then sending it back out again to wherever it needs to go.

yahorsi · 2019-03-14T06:49:05Z

Am I right ASP.NET/Kestrel guys didn't try to utilise it yet?

benaadams · 2019-03-14T19:28:46Z

Am I right ASP.NET/Kestrel guys didn't try to utilise it yet?

Its not been added yet so ASP.NET/Kestrel can't currently use it

jkotas · 2019-03-14T21:07:30Z

ASP.NET/Kestrel

It is unclear whether this type would help Kestrel much. If you are just using UTF8 as an internal implementation detail, you can get all the performance benefits by simply using byte arrays and Spans.

We do not have data and agreement that it is a good idea to add this type. New string exchange type comes with extremely large cost to the ecosystem. It is why this is added as experimental feature under ifdef that is not going to ship in 3.0.

We do have an agreement that it is a good idea to add various Utf8 helpers that operate on Spans and support using UTF8 as an internal implementation detail. Those helpers are gradually being added.

yahorsi · 2019-03-14T21:20:06Z

If we think about applications of the Utf8 there are a lot in the WEB world. But, currently, ASP.NET Core is defined using strings everywhere in public interfaces and that cannot be changed. So my bet is that unfortunately we won't be able to get enough benefits as in the final end we'll have to convert Utf8String anyway to just String. So, do you think it make sense to run another research to have same type String but have an option to still store Utf8 internally? Compiler could do something when he sees fixed & string, like call something to convert to utf16 first, or even additionally.

jkotas · 2019-03-14T21:43:19Z

ASP.NET Core is defined using strings everywhere in public interfaces

It is not just ASP.NET Core. It is everything in .NET.

it make sense to run another research to have same type String but have an option to still store Utf8 internally

https://github.com/dotnet/coreclr/issues/7083 has discussions about potential designs and tradeoffs. It is a ton of complex work to make something like this work and the estimated returns are not that big (can get better returns on investment elsewhere)...

GrabYourPitchforks · 2019-03-14T22:33:55Z

CI was clogged due to some outages from yesterday. I'm hoping that the latest iteration will make it through unscathed.

GrabYourPitchforks · 2019-03-14T22:42:27Z

@yahorsi This is an area of active investigation runtime-wide and Framework-wide. The Utf8String type is one potential experiment, and it's possible this won't pan out. As @jkotas said this is all being done under an ifdef so that it can be ripped out easily if necessary.

There are discussions about the merits of the various proposals in existing tracking issues (see https://github.com/dotnet/coreclr/issues/7083, https://github.com/dotnet/corefx/issues/34094, https://github.com/dotnet/corefx/issues/30503, dotnet/corefxlab#2350, and dotnet/corefxlab#2368 for various discussions). It would help keep things more focused and organized if we reuse those existing issues for feature discussion and instead limit the conversation here on the actual code under review.

gfoidl · 2019-03-15T12:58:26Z

src/System.Private.CoreLib/src/System/Utf8String.Searching.cs

+            // One which operates on a single-byte (ASCII) search value,
+            // the other which operates on a multi-byte (non-ASCII) search value.
+
+            Span<byte> runeBytes = stackalloc byte[Utf8Utility.MaxBytesPerScalar];


As MaxBytesPerScalar is constant this method could be written without stackalloc as:

public unsafe bool Contains(Rune value) { Debug.Assert(sizeof(int) == Utf8Utility.MaxBytesPerScalar); int tmp; var runeBytes = new Span<byte>(&tmp, Utf8Utility.MaxBytesPerScalar); value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten); return SpanHelpers.IndexOf( ref DangerousGetMutableReference(), Length, ref MemoryMarshal.GetReference(runeBytes), runeBytesWritten) >= 0; }

Results in better codegen, and ~5% perf-improvement in a benchmark based on the example-code below.

Contrieved example

using System; using System.Diagnostics; using System.Runtime.CompilerServices; using System.Runtime.InteropServices; namespace ConsoleApp3 { class Program { static unsafe void Main(string[] args) { var rune = new Rune('b'); var utf8String = new Utf8String(); bool b1 = utf8String.Contains1(rune); bool b2 = utf8String.Contains2(rune); if (b1 != b2) Environment.Exit(1); } } internal static class Utf8Utility { internal const int MaxBytesPerScalar = 4; } public sealed class Utf8String { private byte _reference = (byte)'a'; public bool Contains1(Rune value) { Span<byte> runeBytes = stackalloc byte[Utf8Utility.MaxBytesPerScalar]; value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten); return runeBytes.IndexOf(_reference) >= 0; } public unsafe bool Contains2(Rune value) { Debug.Assert(sizeof(int) == Utf8Utility.MaxBytesPerScalar); int tmp; var runeBytes = new Span<byte>(&tmp, Utf8Utility.MaxBytesPerScalar); value.TryEncodeToUtf8Bytes(runeBytes, out int runeBytesWritten); return runeBytes.IndexOf(_reference) >= 0; } } public readonly struct Rune { private readonly char _value; public Rune(char value) => _value = value; [MethodImpl(MethodImplOptions.NoInlining)] public bool TryEncodeToUtf8Bytes(Span<byte> runeBytes, out int runeBytesWritten) { if (runeBytes.Length < sizeof(char)) { runeBytesWritten = 0; return false; } runeBytesWritten = sizeof(char); char value = _value; MemoryMarshal.Write(runeBytes, ref value); return true; } } }

; Assembly listing for method Utf8String:Contains1(struct):bool:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix ; optimized code ; rbp based frame ; partially interruptible ; Final local variable assignments ; ; V00 this [V00,T01] ( 3, 3 ) ref -> rdi this class-hnd ; V01 arg1 [V01 ] ( 3, 3 ) struct ( 8) [rbp-0x30] do-not-enreg[XS] addr-exposed ld-addr-op ptr ;* V02 loc0 [V02 ] ( 0, 0 ) struct (16) zero-ref ; V03 loc1 [V03 ] ( 1, 1 ) int -> [rbp-0x38] do-not-enreg[X] must-init addr-exposed ld-addr-op ;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; V05 tmp1 [V05 ] ( 1, 1 ) blk ( 8) [rbp-0x28] do-not-enreg[X] must-init addr-exposed ptr unsafe-buffer "stackallocLocal" ;* V06 tmp2 [V06 ] ( 0, 0 ) struct (16) zero-ref "NewObj constructor temp" ;* V07 tmp3 [V07 ] ( 0, 0 ) struct ( 8) zero-ref "NewObj constructor temp" ; V08 tmp4 [V08,T11] ( 2, 2 ) int -> rax "Inline return value spill temp" ;* V09 tmp5 [V09 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inlining Arg" ; V10 tmp6 [V10,T03] ( 2, 4 ) ubyte -> rsi ld-addr-op "Inlining Arg" ;* V11 tmp7 [V11 ] ( 0, 0 ) byref -> zero-ref ptr "impAppendStmt" ; V12 tmp8 [V12,T04] ( 2, 4 ) int -> rsi "impAppendStmt" ;* V13 tmp9 [V13 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inlining Arg" ;* V14 tmp10 [V14 ] ( 0, 0 ) byref -> zero-ref ptr "Inlining Arg" ; V15 tmp11 [V15,T05] ( 3, 3 ) byref -> r14 V02._pointer(offs=0x00) P-INDEP "field V02._pointer (fldOffset=0x0)" ; V16 tmp12 [V16,T06] ( 2, 2 ) int -> rdi V02._length(offs=0x08) P-INDEP "field V02._length (fldOffset=0x8)" ; V17 tmp13 [V17,T07] ( 2, 2 ) byref -> r14 V06._pointer(offs=0x00) P-INDEP "field V06._pointer (fldOffset=0x0)" ; V18 tmp14 [V18,T12] ( 2, 2 ) int -> rdi V06._length(offs=0x08) P-INDEP "field V06._length (fldOffset=0x8)" ; V19 tmp15 [V19,T08] ( 2, 2 ) byref -> r14 V07._value(offs=0x00) P-INDEP "field V07._value (fldOffset=0x0)" ; V20 tmp16 [V20,T09] ( 2, 2 ) byref -> rdi V09._pointer(offs=0x00) P-INDEP "field V09._pointer (fldOffset=0x0)" ;* V21 tmp17 [V21,T13] ( 0, 0 ) int -> zero-ref ptr V09._length(offs=0x08) P-INDEP "field V09._length (fldOffset=0x8)" ;* V22 tmp18 [V22 ] ( 0, 0 ) byref -> zero-ref ptr V13._pointer(offs=0x00) P-INDEP "field V13._pointer (fldOffset=0x0)" ;* V23 tmp19 [V23 ] ( 0, 0 ) int -> zero-ref ptr V13._length(offs=0x08) P-INDEP "field V13._length (fldOffset=0x8)" ; V24 tmp20 [V24 ] ( 3, 6 ) struct (16) [rbp-0x48] do-not-enreg[XSFBA] multireg-arg must-init addr-exposed ptr "by-value struct argument" ; V25 tmp21 [V25,T00] ( 3, 6 ) byref -> rdx stack-byref "BlockOp address local" ; V26 tmp22 [V26,T02] ( 2, 4 ) byref -> rsi "argument with side effect" ;* V27 GsCookie [V27 ] ( 0, 0 ) long -> zero-ref do-not-enreg[X] must-init addr-exposed "GSSecurityCookie" ; V28 tmp24 [V28,T10] ( 2, 2 ) ref -> rbx this "shadowVar" ; V29 tmp25 [V29 ] ( 2, 2 ) struct ( 8) [rbp-0x50] do-not-enreg[XSF] must-init addr-exposed ptr "shadowVar" ; ; Lcl frame size = 56 G_M25080_IG01: 55 push rbp 4156 push r14 4155 push r13 53 push rbx 4883EC38 sub rsp, 56 488D6C2450 lea rbp, [rsp+50H] 4C8BEF mov r13, rdi 488D7DB0 lea rdi, [rbp-50H] B90E000000 mov ecx, 14 33C0 xor rax, rax F3AB rep stosd 498BFD mov rdi, r13 B8DC125B00 mov eax, 0x5B12DC 488945E0 mov qword ptr [rbp-20H], rax 8975D0 mov dword ptr [rbp-30H], esi G_M25080_IG02: 480FBF75D0 movsx rsi, word ptr [rbp-30H] 668975B0 mov word ptr [rbp-50H], si 488BDF mov rbx, rdi G_M25080_IG03: 4C8D75D8 lea r14, bword ptr [rbp-28H] BF04000000 mov edi, 4 488D75B0 lea rsi, bword ptr [rbp-50H] 488D55B8 lea rdx, bword ptr [rbp-48H] 4C8932 mov bword ptr [rdx], r14 897A08 mov dword ptr [rdx+8], edi 488BFE mov rdi, rsi 488B75B8 mov rsi, bword ptr [rbp-48H] 488B55C0 mov rdx, qword ptr [rbp-40H] 488D4DC8 lea rcx, bword ptr [rbp-38H] E8F3E8FFFF call Rune:TryEncodeToUtf8Bytes(struct,byref):bool:this 498BFE mov rdi, r14 400FB67308 movzx rsi, byte ptr [rbx+8] BA04000000 mov edx, 4 E8B170FFFF call SpanHelpers:IndexOf(byref,ubyte,int):int 85C0 test eax, eax 0F9DC0 setge al 0FB6C0 movzx rax, al 48817DE0DC125B00 cmp qword ptr [rbp-20H], 0x5B12DC 7405 je SHORT G_M25080_IG04 E8F248B778 call CORINFO_HELP_FAIL_FAST G_M25080_IG04: 90 nop G_M25080_IG05: 488D65E8 lea rsp, [rbp-18H] 5B pop rbx 415D pop r13 415E pop r14 5D pop rbp C3 ret ; Total bytes of code 154, prolog size 43 for method Utf8String:Contains1(struct):bool:this ; ============================================================ ; Assembly listing for method Utf8String:Contains2(struct):bool:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix ; optimized code ; rbp based frame ; partially interruptible ; Final local variable assignments ; ; V00 this [V00,T01] ( 3, 3 ) ref -> rbx this class-hnd ; V01 arg1 [V01 ] ( 4, 4 ) struct ( 8) [rbp-0x20] do-not-enreg[XS] addr-exposed ld-addr-op ; V02 loc0 [V02 ] ( 1, 1 ) int -> [rbp-0x24] do-not-enreg[X] must-init addr-exposed ld-addr-op ;* V03 loc1 [V03 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op ; V04 loc2 [V04 ] ( 1, 1 ) int -> [rbp-0x30] do-not-enreg[X] must-init addr-exposed ld-addr-op ;# V05 OutArgs [V05 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ;* V06 tmp1 [V06 ] ( 0, 0 ) struct ( 8) zero-ref "NewObj constructor temp" ; V07 tmp2 [V07,T09] ( 2, 2 ) int -> rax "Inline return value spill temp" ;* V08 tmp3 [V08 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inlining Arg" ; V09 tmp4 [V09,T03] ( 2, 4 ) ubyte -> rsi ld-addr-op "Inlining Arg" ;* V10 tmp5 [V10 ] ( 0, 0 ) byref -> zero-ref "impAppendStmt" ; V11 tmp6 [V11,T04] ( 2, 4 ) int -> rsi "impAppendStmt" ;* V12 tmp7 [V12 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "Inlining Arg" ;* V13 tmp8 [V13 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ; V14 tmp9 [V14 ] ( 4, 4 ) ushort -> [rbp-0x20] do-not-enreg[X] addr-exposed V01._value(offs=0x00) P-DEP "field V01._value (fldOffset=0x0)" ; V15 tmp10 [V15,T05] ( 3, 3 ) byref -> r14 V03._pointer(offs=0x00) P-INDEP "field V03._pointer (fldOffset=0x0)" ; V16 tmp11 [V16,T06] ( 2, 2 ) int -> rdi V03._length(offs=0x08) P-INDEP "field V03._length (fldOffset=0x8)" ; V17 tmp12 [V17,T07] ( 2, 2 ) byref -> r14 V06._value(offs=0x00) P-INDEP "field V06._value (fldOffset=0x0)" ; V18 tmp13 [V18,T08] ( 2, 2 ) byref -> rdi V08._pointer(offs=0x00) P-INDEP "field V08._pointer (fldOffset=0x0)" ;* V19 tmp14 [V19,T10] ( 0, 0 ) int -> zero-ref V08._length(offs=0x08) P-INDEP "field V08._length (fldOffset=0x8)" ;* V20 tmp15 [V20 ] ( 0, 0 ) byref -> zero-ref V12._pointer(offs=0x00) P-INDEP "field V12._pointer (fldOffset=0x0)" ;* V21 tmp16 [V21 ] ( 0, 0 ) int -> zero-ref V12._length(offs=0x08) P-INDEP "field V12._length (fldOffset=0x8)" ; V22 tmp17 [V22 ] ( 3, 6 ) struct (16) [rbp-0x40] do-not-enreg[XSFBA] multireg-arg must-init addr-exposed "by-value struct argument" ; V23 tmp18 [V23,T00] ( 3, 6 ) byref -> rdx stack-byref "BlockOp address local" ; V24 tmp19 [V24,T02] ( 2, 4 ) byref -> rsi "argument with side effect" ; ; Lcl frame size = 40 G_M25083_IG01: 55 push rbp 4156 push r14 4155 push r13 53 push rbx 4883EC28 sub rsp, 40 488D6C2440 lea rbp, [rsp+40H] 4C8BEF mov r13, rdi 488D7DC0 lea rdi, [rbp-40H] B908000000 mov ecx, 8 33C0 xor rax, rax F3AB rep stosd 498BFD mov rdi, r13 8975E0 mov dword ptr [rbp-20H], esi 488BDF mov rbx, rdi G_M25083_IG02: 4C8D75DC lea r14, bword ptr [rbp-24H] BF04000000 mov edi, 4 488D75E0 lea rsi, bword ptr [rbp-20H] 488D55C0 lea rdx, bword ptr [rbp-40H] 4C8932 mov bword ptr [rdx], r14 897A08 mov dword ptr [rdx+8], edi 488BFE mov rdi, rsi 488B75C0 mov rsi, bword ptr [rbp-40H] 488B55C8 mov rdx, qword ptr [rbp-38H] 488D4DD0 lea rcx, bword ptr [rbp-30H] E87DC7FFFF call Rune:TryEncodeToUtf8Bytes(struct,byref):bool:this 498BFE mov rdi, r14 400FB67308 movzx rsi, byte ptr [rbx+8] BA04000000 mov edx, 4 E8BBFCFFFF call SpanHelpers:IndexOf(byref,ubyte,int):int 85C0 test eax, eax 0F9DC0 setge al 0FB6C0 movzx rax, al G_M25083_IG03: 488D65E8 lea rsp, [rbp-18H] 5B pop rbx 415D pop r13 415E pop r14 5D pop rbp C3 ret ; Total bytes of code 120, prolog size 34 for method Utf8String:Contains2(struct):bool:this ; ============================================================

But I'd like to see the JIT recognizing this and doing this optimization for us.

@gfoidl I was experimenting in a separate branch with moving some of this into the JIT if possible, using a technique very similar to what you had done. But it'll be a while before I get around to performing these sorts of optimizations. Pseudocode:

public bool Contains(Rune value) { if (JitHelpers.TryConstRuneToAscii(value, out byte valueAscii) { // if the JIT knows that 'value' is const and ASCII, go down this code path // (valueAscii would've gotten baked into the codegen as a constant) return SpanHelpers.IndexOf(..., valueAscii) >= 0 } else if (JitHelpers.TryConstRuneToUtf8(value, out int data, out int dataLength)) { // if the JIT knows that 'value' is const and not ASCII, go down this code path // (data and dataLength would've gotten baked into the codegen as constants) return SpanHelpers.IndexOf(..., ref Unsafe.As<int, byte>(ref data), dataLength) >= 0; } else { // if the JIT can't reason about the const-ness of 'value', go down this code path // stackalloc and convert at runtime } }

(In the above example it would be perfectly valid for the TryConstXyz methods to be hardcoded to always return false, and the first two branches would be folded away by the JIT. The end result is that the code would still work; it just wouldn't be as optimized.)

Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16. This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Utf8String is an experimental type that is string-like (heap-allocated, immutable, variable-length, null-terminated) but whose inner representation is UTF-8, not UTF-16. This is a skeleton implementation of the basic API shape. The ecosystem of APIs has not yet been built around it. All Utf8String-related code is currently surrounded by ifdefs to allow easy identification and removal from release branches. Commit migrated from dotnet/coreclr@1f3f474

Add Utf8String type

aeb0d49

All Utf8String-related code is currently surrounded by ifdefs to allow easy removal from release branches

GrabYourPitchforks mentioned this pull request Mar 12, 2019

Add Utf8String package, reference APIs, and unit tests dotnet/corefx#35989

Merged

jkotas reviewed Mar 12, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8.cs Outdated Show resolved Hide resolved

jkotas reviewed Mar 12, 2019

View reviewed changes

src/System.Private.CoreLib/src/System/Utf8String.Construction.cs Show resolved Hide resolved

AaronRobinsonMSFT reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8.cs Outdated Show resolved Hide resolved

src/System.Private.CoreLib/shared/System/Text/Unicode/Utf8Utility.cs Outdated Show resolved Hide resolved

justinvp reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/src/System/Char8.cs Outdated Show resolved Hide resolved

justinvp reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/src/System/Char8.cs Outdated Show resolved Hide resolved

justinvp reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/src/System/Utf8String.cs Outdated Show resolved Hide resolved

gfoidl reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Text/UnicodeUtility.cs Outdated Show resolved Hide resolved

src/System.Private.CoreLib/src/System/Utf8Extensions.cs Show resolved Hide resolved

pentp reviewed Mar 13, 2019

View reviewed changes

src/System.Private.CoreLib/src/System/Utf8String.Manipulation.cs Outdated Show resolved Hide resolved

gfoidl mentioned this pull request Mar 13, 2019

Fewer instructions for check if either start or length != 0 #23239

Closed

GrabYourPitchforks added 5 commits March 13, 2019 14:13

Address PR feedback for Utf8String

36f5c93

Move instructions for how to add new String ctors

2166200

Revert PR changes to Utf8Extensions

192de76

PR feedback for ThrowHelper

f900858

Merge remote-tracking branch 'origin/master' into dev_utf8string_2

d356d07

Add Utf8String.Contains and related methods

61b427e

Merge remote-tracking branch 'origin/master' into dev_utf8string_2

19e79be

gfoidl reviewed Mar 15, 2019

View reviewed changes

GrabYourPitchforks added 3 commits March 18, 2019 09:55

Merge remote-tracking branch 'origin/master' into dev/utf8string

13c7094

Enlighten RuntimeHelpers.IsBitwiseEquatable about Char8

1a46000

Merge remote-tracking branch 'origin/master' into dev/utf8string

80ef69f

GrabYourPitchforks merged commit 1f3f474 into master Mar 19, 2019

jkotas deleted the dev/utf8string branch March 19, 2019 19:27

jkotas restored the dev/utf8string branch March 19, 2019 19:27

roji mentioned this pull request Mar 26, 2019

Support for the new Utf8String npgsql/npgsql#2305

Closed

stephentoub deleted the dev/utf8string branch June 12, 2019 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Utf8String type #23209

Add Utf8String type #23209

GrabYourPitchforks commented Mar 12, 2019

GrabYourPitchforks commented Mar 12, 2019

jkotas Mar 12, 2019

GrabYourPitchforks Mar 12, 2019

jkotas Mar 12, 2019

GrabYourPitchforks Mar 14, 2019

GrabYourPitchforks commented Mar 13, 2019

GrabYourPitchforks commented Mar 13, 2019

yahorsi commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

yahorsi commented Mar 14, 2019

benaadams commented Mar 14, 2019

jkotas commented Mar 14, 2019

yahorsi commented Mar 14, 2019

jkotas commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

gfoidl Mar 15, 2019

GrabYourPitchforks Mar 15, 2019

GrabYourPitchforks Mar 15, 2019

		@@ -2,6 +2,7 @@
		<!-- Features we're currently flighting, but don't intend to ship in officially supported releases -->
		<PropertyGroup Condition="'$(IsPrerelease)' == 'true'">

Add Utf8String type #23209

Add Utf8String type #23209

Conversation

GrabYourPitchforks commented Mar 12, 2019

GrabYourPitchforks commented Mar 12, 2019

jkotas Mar 12, 2019

Choose a reason for hiding this comment

GrabYourPitchforks Mar 12, 2019

Choose a reason for hiding this comment

jkotas Mar 12, 2019

Choose a reason for hiding this comment

GrabYourPitchforks Mar 14, 2019

Choose a reason for hiding this comment

GrabYourPitchforks commented Mar 13, 2019

GrabYourPitchforks commented Mar 13, 2019

yahorsi commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

yahorsi commented Mar 14, 2019

benaadams commented Mar 14, 2019

jkotas commented Mar 14, 2019

yahorsi commented Mar 14, 2019

jkotas commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

GrabYourPitchforks commented Mar 14, 2019

gfoidl Mar 15, 2019

Choose a reason for hiding this comment

GrabYourPitchforks Mar 15, 2019

Choose a reason for hiding this comment

GrabYourPitchforks Mar 15, 2019

Choose a reason for hiding this comment