Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Finish removing the BigInts from * for FD{Int128}! #94

Open
wants to merge 33 commits into
base: master
Choose a base branch
from

Conversation

NHDaly
Copy link
Member

@NHDaly NHDaly commented Jun 13, 2024

Finally implements the fast-multiplication optimization from #45! :)

This PR optimizes multiplication for FixedDecimal{Int64} and FixedDecimal{Int128}. In the process, we also undid an earlier optimization which is no longer needed after julia 1.8, and that makes multiplication about 2x fast for the smaller int types as well! 🎉

This is a follow-up to
#93, which introduces an Int256 type for widemul. However after that PR, the fldmod still required 2 BigInt allocations.

Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.

After this PR, FD multiply performance scales linear with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.

master:

julia> using FixedPointDecimals, BenchmarkTools

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  84.916 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  249.083 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.660 ms (248829 allocations: 4.70 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  172.958 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  5.408 ms (308621 allocations: 6.14 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After PR #93:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  85.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  248.958 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.673 ms (160798 allocations: 3.22 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.041 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  173.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  4.750 ms (190708 allocations: 4.82 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After this PR:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  48.458 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  57.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  90.708 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  180.125 μs (0 allocations: 0 bytes)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  42.708 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  51.250 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  80.042 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  162.417 μs (0 allocations: 0 bytes)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

NHDaly added 6 commits June 12, 2024 13:11
We do not here explicitly introduce support for FD{BitIntegers.Int256},
though that should work out of the box both before and after this PR.

Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations
from Int128 widening to BigInt in FD operations.
Finally implements the fast-multiplication optimization from
#45, but this
time for 128-bit FixedDecimals! :)

This is a follow-up to
#93, which
introduces an Int256 type for widemul. However, the fldmod still
required 2 BigInt allocations.

Now, this PR uses a custom implementation of the LLVM div-by-const
optimization for (U)Int256, which briefly widens to Int512 (😅) to
perform the fldmod by the constant 10^f coefficient.

This brings 128-bit FD multiply to the same performance as 64-bit. :)
@NHDaly NHDaly requested review from Drvi and TotalVerb June 13, 2024 02:28
@NHDaly
Copy link
Member Author

NHDaly commented Jun 13, 2024

Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (thus * for FixedDecimal{Int64} is already fast). But this PR adds support for FixedDecimal{Int128}, which needed a custom function for Int256 div by const. :)

See here, for my note on FixedDecimal{Int64} support:
#45 (comment)

@NHDaly
Copy link
Member Author

NHDaly commented Jun 13, 2024

Now that there's julia's effects system, this doesn't need @pure, so it's actually possible to do this!
And now that we've introduced BitIntegers.jl, the code is way simpler. 😊 woohoo!

test/runtests.jl Outdated Show resolved Hide resolved
src/FixedPointDecimals.jl Outdated Show resolved Hide resolved
src/fldmod-by-const.jl Outdated Show resolved Hide resolved
@NHDaly NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 82eb32d to f2958ba Compare June 13, 2024 19:59
@NHDaly NHDaly requested a review from omus June 13, 2024 20:06
@NHDaly
Copy link
Member Author

NHDaly commented Jun 13, 2024

Update 1: It turned out that (of course) * is too fast for @btime to get a good measurement of on its own. I've adjusted the benchmark to perform repeated multiplies, and the true performance is now included in the original comment.

Update 2: It turns out that this is actually faster for all int types (>2x faster for FD{Int64}), even though LLVM was doing a version of this optimization on its own. I'm honestly pretty surprised by that... From what I can tell, for FD{Int32}, the code is identical except for two instruction changes: one changing the order of registers as inputs to add (shouldn't matter), and one changing from comparing hi to comparing gt:
Screenshot 2024-06-13 at 2 10 20 PM
And yet somehow it's 25% faster (80µs -> 60µs).

@NHDaly
Copy link
Member Author

NHDaly commented Jun 13, 2024

Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined, and called through a function call.
Now, it's inlined, and i think the whole loop is constant-folded away, so it's actually sort of cheating. 😅
But still, allowing that optimization seems like a positive thing!
. The inlining is the main change, i think, for FD{Int32}.

julia> function bench(fd) 
           for _ in 1:10000
               fd = fd * fd
           end
           fd
       end
bench (generic function with 1 method)

# Before:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
┌ Warning: /Users/nathandaly/.julia/dev/FixedPointDecimals/src/fldmod-by-const.jl no longer exists, deleted all methods
└ @ Revise ~/.julia/packages/Revise/bAgL0/src/packagedef.jl:666
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1163               ; -- Begin function julia_bench_1163
        .p2align        2
_julia_bench_1163:                      ; @julia_bench_1163
; %bb.0:                                ; %guard_exit4
        sub     sp, sp, #48
        stp     x20, x19, [sp, #16]             ; 16-byte Folded Spill
        stp     x29, x30, [sp, #32]             ; 16-byte Folded Spill
        ldr     w0, [x0]
        mov     w19, #10000
Lloh0:
        adrp    x20, "_j_*_1165"@GOTPAGE
Lloh1:
        ldr     x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        str     w0, [sp, #12]
        add     x0, sp, #12
        add     x1, sp, #12
        blr     x20
        subs    x19, x19, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit16
        ldp     x29, x30, [sp, #32]             ; 16-byte Folded Reload
        ldp     x20, x19, [sp, #16]             ; 16-byte Folded Reload
        add     sp, sp, #48
        ret
        .loh AdrpLdrGot Lloh0, Lloh1
                                        ; -- End function
.subsections_via_symbols

After:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1161               ; -- Begin function julia_bench_1161
        .p2align        2
_julia_bench_1161:                      ; @julia_bench_1161
; %bb.0:                                ; %guard_exit7
        ldr     w0, [x0]
        mov     w8, #10000
        mov     x9, #57148
        movk    x9, #36175, lsl #16
        movk    x9, #28311, lsl #32
        movk    x9, #33554, lsl #48
        mov     x10, #-1000
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        umull   x11, w0, w0
        umulh   x11, x11, x9
        lsr     x12, x11, #9
        mul     x13, x12, x10
        umaddl  x13, w0, w0, x13
        ubfx    x11, x11, #9, #1
        cmp     x13, #500
        cset    w13, hi
        csel    w11, w11, w13, eq
        add     w0, w11, w12
        subs    x8, x8, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit19
        ret
                                        ; -- End function
.subsections_via_symbols

EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, it's still just as much faster in the new code:

julia> @noinline function bench(fd, N) 
           for _ in 1:N
               fd = fd * fd
           end
           fd
       end
bench (generic function with 2 methods)

julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
  48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

Copy link
Contributor

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another surface level review. I can do a full review if no one else will. I will need to block off some time though as there's a bit going on in this PR.

src/FixedPointDecimals.jl Outdated Show resolved Hide resolved
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
@NHDaly
Copy link
Member Author

NHDaly commented Jun 17, 2024

I think Tomas should be able to review when he's back from vacation. He's out til thursday. :)
Thanks for the offer, @omus! :)

NHDaly added 2 commits June 19, 2024 11:51
…change

Tests all FD{(U)Int16} values.
Tests most corner cases for FD{(U)Int128} values.
@NHDaly NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 44686f3 to ac3302e Compare June 19, 2024 19:16
We must use the precise value of 2^nbits(T) in order to get the correct
division in all cases.

....... UGH except now the Int64 tests aren't passing.
@NHDaly NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from ac3302e to e4cb73b Compare June 19, 2024 19:16
@NHDaly
Copy link
Member Author

NHDaly commented Jul 10, 2024

Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, i think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!

@NHDaly
Copy link
Member Author

NHDaly commented Jul 10, 2024

I also updated the PR comment with the final perf numbers, which look 👌

Base automatically changed from nhd-overflow-Int128 to master August 12, 2024 17:49
@EdsterG
Copy link

EdsterG commented Sep 16, 2024

By the way, div/mod allocations are fixed in the most recent version of BitIntegers when running on 1.11+, see #48.

@NHDaly
Copy link
Member Author

NHDaly commented Sep 26, 2024

Oh wow, very nice! Thanks @EdsterG. 💪

This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants