Optimize zero reads #34

nirs · 2024-10-17T19:29:38Z

Use the range-loop clearing idiom[1], which the compiler optimizes into a
single memclr call. This dramatically speeds up zero reads.
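
For readers unfamiliar with the idiom, here is a minimal sketch of the loop shape the compiler recognizes. The helper name `zeroFill` is hypothetical and is not taken from `image/qcow2/qcow2.go`; the actual change may look different.

```go
package main

import "fmt"

// zeroFill clears p in place. The Go compiler recognizes this loop
// shape (range over a slice, assigning the zero value to each element)
// and replaces it with a single memclr call.
// NOTE: zeroFill is a hypothetical name used only for illustration.
func zeroFill(p []byte) {
	for i := range p {
		p[i] = 0
	}
}

func main() {
	buf := []byte{1, 2, 3, 4}
	zeroFill(buf)
	fmt.Println(buf) // prints [0 0 0 0]
}
```

No temporary zero buffer is needed with this approach, since the caller's buffer is cleared directly.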

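| format | compression | utilization | speedup |
|--------|-------------|-------------|---------|
| qcow2  | -           | 0%          | 28.72   |
| qcow2  | zlib        | 0%          | 28.04   |
| qcow2  | -           | 50%         | 4.54    |
| qcow2  | zlib        | 50%         | 1.03    |
| qcow2  | -           | 100%        | 1.01    |
| qcow2  | zlib        | 100%        | 1.00    |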

Before:

% go test -bench Read
BenchmarkRead0p/qcow2-12           14      77515735 ns/op     3462.98 MB/s      1050518 B/op        39 allocs/op
BenchmarkRead0p/qcow2_zlib-12      14      77823402 ns/op     3449.29 MB/s      1050504 B/op        39 allocs/op
BenchmarkRead50p/qcow2-12          24      48812158 ns/op     5499.36 MB/s      1181856 B/op        45 allocs/op
BenchmarkRead50p/qcow2_zlib-12      2     899659187 ns/op      298.37 MB/s    184996316 B/op     43247 allocs/op
BenchmarkRead100p/qcow2-12         61      19306020 ns/op    13904.24 MB/s      1181854 B/op        45 allocs/op
BenchmarkRead100p/qcow2_zlib-12     1    1732168542 ns/op      154.97 MB/s    368850952 B/op     86460 allocs/op

After:

% go test -bench Read
BenchmarkRead0p/qcow2-12          471       2698377 ns/op    99480.34 MB/s      1050514 B/op        39 allocs/op
BenchmarkRead0p/qcow2_zlib-12     468       2774952 ns/op    96735.15 MB/s      1050511 B/op        39 allocs/op
BenchmarkRead50p/qcow2-12         100      10735870 ns/op    25003.61 MB/s      1181854 B/op        45 allocs/op
BenchmarkRead50p/qcow2_zlib-12      2     868310583 ns/op      309.15 MB/s    185038456 B/op     43263 allocs/op
BenchmarkRead100p/qcow2-12         63      18977718 ns/op    14144.77 MB/s      1181851 B/op        45 allocs/op
BenchmarkRead100p/qcow2_zlib-12     1    1727832917 ns/op      155.36 MB/s    368886656 B/op     86471 allocs/op
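
The speedup column appears to be the ratio of the before and after ns/op values. A small sketch to reproduce it, using the numbers copied from the two runs above (the `speedup` helper is hypothetical). Note that `%.2f` rounds, so a few values differ by 0.01 from the table, which seems to truncate.

```go
package main

import "fmt"

// speedup returns how many times faster the "after" run is,
// given ns/op from both runs. Hypothetical helper for illustration.
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}

func main() {
	// ns/op values copied from the Before and After benchmark runs.
	runs := []struct {
		name          string
		before, after float64
	}{
		{"Read0p/qcow2", 77515735, 2698377},
		{"Read0p/qcow2_zlib", 77823402, 2774952},
		{"Read50p/qcow2", 48812158, 10735870},
		{"Read50p/qcow2_zlib", 899659187, 868310583},
		{"Read100p/qcow2", 19306020, 18977718},
		{"Read100p/qcow2_zlib", 1732168542, 1727832917},
	}
	for _, r := range runs {
		fmt.Printf("%-20s %.2f\n", r.name, speedup(r.before, r.after))
	}
}
```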

Comparing with qemu-img shows that we roughly match qemu-img performance
for the uncompressed version of the lima default image:

% time ./go-qcow2reader-example /tmp/test.qcow2 > /tmp/tmp.img
./go-qcow2reader-example /tmp/test.qcow2 > /tmp/tmp.img  0.06s user 0.73s system 93% cpu 0.854 total

% time qemu-img convert -O raw /tmp/test.qcow2 /tmp/tmp.img
qemu-img convert -O raw /tmp/test.qcow2 /tmp/tmp.img  0.04s user 0.70s system 98% cpu 0.756 total

[1] https://go-review.googlesource.com/c/go/+/2520

Part-of #32

Based on #35 for more correct benchmarks.

image/qcow2/qcow2.go

nirs · 2024-10-19T21:02:20Z

@AkihiroSuda the current version is simpler, needs no temporary buffer, and is much faster.

Signed-off-by: Nir Soffer <nirsof@gmail.com>

AkihiroSuda

Thanks

nirs mentioned this pull request Oct 17, 2024

Poor performance compared with qemu-img convert #32

Closed

6 tasks

AkihiroSuda reviewed Oct 18, 2024

image/qcow2/qcow2.go (outdated)

AkihiroSuda reviewed Oct 18, 2024

image/qcow2/qcow2.go (outdated)

nirs force-pushed the zero-reads branch 2 times, most recently from d8933ae to 82f0931 Compare October 18, 2024 06:05

nirs requested a review from AkihiroSuda October 18, 2024 06:10

AkihiroSuda requested a review from a team October 18, 2024 06:43

nirs force-pushed the zero-reads branch from 82f0931 to 13ea80d Compare October 19, 2024 20:58

nirs force-pushed the zero-reads branch from 13ea80d to 5f0db7b Compare October 20, 2024 16:36

AkihiroSuda approved these changes Oct 20, 2024

AkihiroSuda merged commit 5f33c4a into lima-vm:master Oct 20, 2024
2 checks passed

nirs deleted the zero-reads branch October 20, 2024 21:38
