Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277

Closed
adrian17 opened this issue Dec 21, 2020 · 3 comments
Labels
A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@adrian17

adrian17 commented Dec 21, 2020

This basically started with investigating byteorder's read_u16-style extensions. They are more or less meant to be used like this:

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().unwrap());
assert_eq!(768, rdr.read_u16::<BigEndian>().unwrap());

You can see that they are implemented in terms of Read's read_exact method:
https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7
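
For reference, a minimal sketch of that pattern using only the standard library (the helper name is illustrative, not byteorder's actual API):

use std::io::{self, Read};

// Sketch of the read_exact-based pattern used by the crates linked above:
// read into a small stack buffer, then decode the integer from it.
fn read_u16_le<R: Read>(reader: &mut R) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    reader.read_exact(&mut buf)?;
    Ok(u16::from_le_bytes(buf))
}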

However, it seems like when used with Cursor<&[u8]> or just &[u8], they are way slower than I'd expect them to be.

When I build a test program, I run into #47321, which is pretty bad, especially as read_exact ends up calling memcpy for all sizes > 1 byte. This causes noticeable overhead, in particular on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact, and the only calls remaining in the inlined code are to io::Error::new. However, even when inlined, the generated code still looks suboptimal.
Normally, I'd expect cursor.read_u16() to do a range check, a 2-byte read, and a cursor increment, and that's it. But here more work appears to be done, probably redundant slice range checks.
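
In other words, I'd hope it compiles down to roughly this hand-written version (illustrative only - note that I use UnexpectedEof here and ignore position overflow):

use std::io::{self, Cursor};

fn hand_written_read_u16_le(cursor: &mut Cursor<&[u8]>) -> io::Result<u16> {
    let pos = cursor.position() as usize;
    let buf = *cursor.get_ref();
    // One range check covering both bytes...
    if buf.len() < pos + 2 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "failed to fill whole buffer"));
    }
    // ...then a plain 2-byte load and a cursor increment.
    let value = u16::from_le_bytes([buf[pos], buf[pos + 1]]);
    cursor.set_position((pos + 2) as u64);
    Ok(value)
}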

Here is my base test case:

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
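
(The benchmark harness, roughly, keeps calling this over one large buffer - something like the hypothetical driver below; the real code is in the repository linked at the end.)

// Hypothetical driver, not taken verbatim from the benchmark repo:
// decode consecutive 8-byte records until the buffer is exhausted.
fn sum_all(data: &[u8]) -> io::Result<u64> {
    let mut cursor = Cursor::new(data);
    let mut total = 0u64;
    while (cursor.position() as usize) < data.len() {
        total += sum_byteorder_cursor_impl(&mut cursor)?;
    }
    Ok(total)
}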

In my benchmarks, it takes 7.5 ns/call.

But when I manually hoist the slice range check out of the read, so that the compiler knows the slice contains the required number of bytes, for example:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = &cursor.get_ref()[cursor.position() as usize..cursor.position() as usize + 2];
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improves to 5.3 ns/call, and as far as I can tell the behavior is identical (aside from the different error kind).

This still generated range checks that seemed redundant (I may be wrong here), so I additionally introduced unsafe:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = unsafe { cursor.get_ref().get_unchecked(cursor.position() as usize..cursor.position() as usize + 2) };
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improved even more, to 4.1 ns/call. (This is also similar to the time I got from an equivalent C++ program with a simple range check on every read_u* call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:

// data: &mut &[u8]
ret += data.read_u16::<LittleEndian>()? as u64;

->

if data.len() < 2 {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
ret += data.read_u16::<LittleEndian>()? as u64;

The iteration speed improves by nearly 2x.

Now, I don't really know how much of this has to do with #47321 and how much with other missed optimizations. It's also a bit awkward that #47321 triggers in the production code, so until that bug is fixed I won't really see the benefits of fixing this one anyway.

I posted my benchmark on https://github.com/adrian17/rust-byteorder-benchmarks .

@adrian17 adrian17 added the C-bug Category: This is a bug. label Dec 21, 2020
@jonas-schievink jonas-schievink added A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. and removed C-bug Category: This is a bug. labels Dec 21, 2020
@wecing
Contributor

wecing commented Dec 22, 2020

Hi! I left a comment in the other issue -- it seems that issue does not appear when building with --release.

I tried your benchmark and here are the results I got:

test tests::bench_byteorder_cursor           ... bench:      47,294 ns/iter (+/- 1,410)
test tests::bench_byteorder_cursor_v2        ... bench:      39,374 ns/iter (+/- 2,657)
test tests::bench_byteorder_cursor_v2_unsafe ... bench:      36,669 ns/iter (+/- 1,620)
test tests::bench_byteorder_slice            ... bench:      53,106 ns/iter (+/- 3,186)
test tests::bench_byteorder_slice_v2         ... bench:      42,819 ns/iter (+/- 1,803)

However, after changing the three sum_byteorder_cursor_impl* functions so that they always read u8 four times, e.g.:

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    // ret += cursor.read_u16::<LittleEndian>()? as u64;
    // ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}

I got:

test tests::bench_byteorder_cursor           ... bench:      42,126 ns/iter (+/- 2,335)
test tests::bench_byteorder_cursor_v2        ... bench:      36,841 ns/iter (+/- 1,431)
test tests::bench_byteorder_cursor_v2_unsafe ... bench:      29,907 ns/iter (+/- 1,424)
test tests::bench_byteorder_slice            ... bench:      54,108 ns/iter (+/- 16,327)
test tests::bench_byteorder_slice_v2         ... bench:      42,847 ns/iter (+/- 1,505)

I think the improvement is pretty significant compared to the two tests I did not change.

> When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact

I got different results with cargo +nightly rustc --release -- --emit asm using cargo 1.50.0-nightly (bfca1cd22 2020-11-24); the generated assembly does contain 4 calls to read_exact:

        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzbl  1(%rsp), %r12d
        movb    $0, 1(%rsp)
        leaq    1(%rsp), %rsi
        movl    $1, %r15d
        movl    $1, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzbl  1(%rsp), %r13d
        movw    $0, 2(%rsp)
        leaq    2(%rsp), %rsi
        movl    $2, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzwl  2(%rsp), %ebp
        movl    $0, 4(%rsp)
        leaq    4(%rsp), %rsi
        movl    $4, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        addq    %r12, %r13
        addq    %r13, %rbp
        movl    4(%rsp), %eax
        addq    %rbp, %rax
        movq    %rax, 8(%rbx)
        xorl    %r15d, %r15d
        jmp     .LBB12_6

but they are gone after making the above change (doing 4 u8 reads).

So, in short: I think this may be related to the inlining strategy that LLVM takes; reducing the number of different functions called helps.

@adrian17
Author

> I got different results with cargo +nightly rustc --release -- --emit asm using cargo 1.50.0-nightly (bfca1cd22 2020-11-24); the generated assembly does contain 4 calls to read_exact:

Weird... the calls were indeed appearing in the .s output, but not when I objdump'd the built rlib. After I deleted the target directory and rebuilt, the calls are in the built rlib too. Anyway, that is #47321.

But note that even after your change made the compiler inline them, the code is still slow, which is what this issue is about.

@adrian17
Author

adrian17 commented Jul 4, 2021

Half a year later, the generated code looks much better, and most of the benchmarks I mentioned above are now nearly equal - in particular, my "hand-optimizations" no longer appear to be necessary.

The idiomatic Cursor code is still ~20% slower than the slice code in my microbenchmark, but I'm guessing that's unavoidable with the extra abstraction layer, so I'm going to close this issue.

@adrian17 adrian17 closed this as completed Jul 4, 2021