Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277
Comments
Hi! I left a comment in the other issue -- it seems like that issue will not appear here. I tried your benchmark, and here are the results I got:
However, after changing the three `#[inline(never)]` benchmark functions to do four `u8` reads instead, like this:

```rust
use std::io::{self, Cursor};
use byteorder::ReadBytesExt;

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    // ret += cursor.read_u16::<LittleEndian>()? as u64;
    // ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```

I got:
I think the change is pretty significant compared to the two tests I did not change.
I got some different results as well, but they are gone after making the above change (doing 4 `u8` reads). So, in short: I think this may be related to the inlining strategy that LLVM takes; reducing the number of different functions called helps.
Weird... the calls were indeed appearing for me. But you can notice that even after you made a change that got the compiler to inline them, the code is still slow, which is what this issue is about.
After half a year, the generated code looks much better, and most benchmarks I mentioned above are nearly equal - in particular, my "hand-optimizations" no longer appear to be necessary. The idiomatic Cursor code is still ~20% slower than the Slice code in my microbenchmark, but I'm guessing that's just unavoidable with an extra abstraction layer, so I'm gonna close this issue.
This basically started with investigating byteorder's `read_u16`-style extensions. They are more or less meant to be used like this:
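A minimal sketch of that usage pattern, assuming the standard byteorder API (the values and variable names here are illustrative, not from the original snippet):

```rust
use std::io::Cursor;
use byteorder::{LittleEndian, ReadBytesExt};

fn main() -> std::io::Result<()> {
    let data = [0x01u8, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07];
    let mut cursor = Cursor::new(&data[..]);

    // Each call bounds-checks, reads N bytes, and advances the cursor.
    let a = cursor.read_u8()?;                  // 1 byte
    let b = cursor.read_u16::<LittleEndian>()?; // 2 bytes
    let c = cursor.read_u32::<LittleEndian>()?; // 4 bytes
    println!("{a} {b} {c}");
    Ok(())
}
```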
You can see that they are implemented in terms of `Read`'s `read_exact` method: https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7
However, it seems like when used with `Cursor<&[u8]>` or just `&[u8]`, they are way slower than I'd expect them to be.

When I build a test program, I encounter this bug: #47321, which is pretty bad, especially as `read_exact` will call `memcpy` for all sizes > 1 byte. This causes, in particular, noticeable overhead on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined, as I can't find any calls to `read_exact`, and there are no calls in the inlined code except to `io::Error::new`. However, even when inlined, the code is still suboptimal.

Normally, I'd assume that `cursor.read_u16()` does a range check, then a 2-byte read and a cursor increment, and that's it. But here it appears that more work is being done, probably redundant slice range checks.

Here is my base test case:
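A sketch of what the base test case presumably looks like, based on the modified version quoted in the comments above (the mixed u8/u16/u32 reads are inferred from the commented-out lines there):

```rust
use std::io::{self, Cursor};
use byteorder::{LittleEndian, ReadBytesExt};

// Idiomatic byteorder-on-Cursor code: each read_* goes through Read::read_exact.
#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```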
In my benchmarks, it spends 7.5ns/call.
But when I manually extract the slice range check out of the function, so that the compiler knows the slices will contain the required number of bytes, for example:
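A sketch of what that change could look like, assuming the range check is hoisted to the top of the function (the function name and the 7-byte total are my assumptions, not the exact original):

```rust
use std::io::{self, Cursor};
use byteorder::{LittleEndian, ReadBytesExt};

#[inline(never)]
pub fn sum_checked_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    // One range check up front covering all 7 bytes, so the compiler can
    // (ideally) elide the per-read bounds checks inside read_exact.
    let pos = cursor.position() as usize;
    if cursor.get_ref().len() < pos + 7 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "buffer too short"));
    }
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```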
The time improves to 5.3ns/call, and AFAIK the behavior is identical (aside from a different error kind).
This still generated range checks that seemed redundant (I may be wrong here), so I tried introducing `unsafe`:
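One plausible shape for the `unsafe` version, doing raw little-endian reads after the manual length check (a reconstruction; the name and exact layout are assumptions):

```rust
use std::io::{self, Cursor};

#[inline(never)]
pub fn sum_unsafe_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let pos = cursor.position() as usize;
    let buf = *cursor.get_ref();
    if buf.len() < pos + 7 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "buffer too short"));
    }
    // SAFETY: the check above guarantees pos + 7 <= buf.len().
    let (a, b, c) = unsafe {
        let p = buf.as_ptr().add(pos);
        (
            *p as u64,
            u16::from_le_bytes([*p.add(1), *p.add(2)]) as u64,
            u32::from_le_bytes([*p.add(3), *p.add(4), *p.add(5), *p.add(6)]) as u64,
        )
    };
    cursor.set_position((pos + 7) as u64);
    Ok(a + b + c)
}
```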
Time improved even more, to 4.1ns/call. (This is also similar to the time I got in an equivalent C++ program with a simple range check on every `read_u*` call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:
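A sketch of the slice variant with the added check (a reconstruction; note that `ReadBytesExt` works on `&[u8]` directly, since `&[u8]` implements `Read` and reading advances the slice):

```rust
use std::io;
use byteorder::{LittleEndian, ReadBytesExt};

#[inline(never)]
pub fn sum_byteorder_slice_impl(mut data: &[u8]) -> io::Result<u64> {
    // A single up-front range check; the per-read checks below should then
    // be recognizable as redundant by the optimizer.
    if data.len() < 7 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "buffer too short"));
    }
    let mut ret = 0;
    ret += data.read_u8()? as u64;
    ret += data.read_u16::<LittleEndian>()? as u64;
    ret += data.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```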
The iteration speed improves by nearly 2x.
Now I don't really know how much this has to do with #47321, and how much with other optimizations. It's also a bit awkward that #47321 triggers in production code, so until that bug is fixed, I won't really see the benefits from fixing this one anyway.
I posted my benchmark at https://github.com/adrian17/rust-byteorder-benchmarks.