-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Finish removing the BigInts from * for FD{Int128}! #94
base: master
Are you sure you want to change the base?
Conversation
We do not here explicitly introduce support for FD{BitIntegers.Int256}, though that should work out of the box both before and after this PR. Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations from Int128 widening to BigInt in FD operations.
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (thus * for See here, for my note on FixedDecimal{Int64} support: |
Now that there's julia's effects system, this doesn't need |
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
82eb32d
to
f2958ba
Compare
Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined, and called through a function call. julia> function bench(fd)
for _ in 1:10000
fd = fd * fd
end
fd
end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
┌ Warning: /Users/nathandaly/.julia/dev/FixedPointDecimals/src/fldmod-by-const.jl no longer exists, deleted all methods
└ @ Revise ~/.julia/packages/Revise/bAgL0/src/packagedef.jl:666
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols After: julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, it's still just as much faster in the new code: julia> @noinline function bench(fd, N)
for _ in 1:N
fd = fd * fd
end
fd
end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another surface level review. I can do a full review if no one else will. I will need to block off some time though as there's a bit going on in this PR.
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
I think Tomas should be able to review when he's back from vacation. He's out til thursday. :) |
…change Tests all FD{(U)Int16} values. Tests most corner cases for FD{(U)Int128} values.
44686f3
to
ac3302e
Compare
We must use the precise value of 2^nbits(T) in order to get the correct division in all cases. ....... UGH except now the Int64 tests aren't passing.
ac3302e
to
e4cb73b
Compare
LLVM _does_ do this automatically if you directly call `fldmod`. Improve the comments as well.
… not fld(x,y),mod(x,y), even if y is a constant! :) Improved that for julia versions 1.8 - 1.9
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, i think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks! |
I also updated the PR comment with the final perf numbers, which look 👌 |
By the way, |
Oh wow, very nice! Thanks @EdsterG. 💪 This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪 |
Finally implements the fast-multiplication optimization from #45! :)
This PR optimizes multiplication for
FixedDecimal{Int64}
andFixedDecimal{Int128}
. In the process, we also undid an earlier optimization which is no longer needed after julia 1.8, and that makes multiplication about 2x fast for the smaller int types as well! 🎉This is a follow-up to
#93, which introduces an Int256 type for widemul. However after that PR, the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiply performance scales linear with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.
master:
After PR #93:
After this PR: