Finish removing the BigInts from * for FD{Int128}! #94

NHDaly · 2024-06-13T02:27:53Z

Finally implements the fast-multiplication optimization from #45! :)

This PR optimizes multiplication for FixedDecimal{Int64} and FixedDecimal{Int128}. In the process, we also undid an earlier optimization that is no longer needed after Julia 1.8, which makes multiplication about 2x faster for the smaller int types as well! 🎉

This is a follow-up to #93, which introduces an Int256 type for widemul. However, after that PR, the fldmod still required 2 BigInt allocations.

Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
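For readers unfamiliar with the div-by-const trick, here is a minimal sketch of the general idea for UInt32 inputs. It is not the code in src/fldmod-by-const.jl; `magic_u32` and `div_by_const` are hypothetical names, and the multiplier is chosen with the textbook "round up to the next power of two" rule, which may differ from what LLVM or this PR actually picks. Note that the product `x * m` needs more bits than `x` itself (here about 65 bits, so the sketch uses UInt128), which is the same reason a wider intermediate type is needed for larger inputs.

```julia
# Hypothetical helper names; an illustration only, not the PR's implementation.

# Precompute a multiplier m and shift k such that
# (x * m) >> k == x ÷ d for every UInt32 x (requires d >= 1).
function magic_u32(d::UInt32)
    @assert d >= 1
    L = 32 - leading_zeros(d - one(UInt32))   # ceil(log2(d))
    k = 32 + L
    m = cld(UInt128(1) << k, UInt128(d))      # ceil(2^k / d), at most 33 bits
    return m, k
end

# Divide by the constant with one wide multiply and one shift, no division.
div_by_const(x::UInt32, m::UInt128, k::Int) = ((UInt128(x) * m) >> k) % UInt32

m, k = magic_u32(UInt32(1000))
@assert div_by_const(UInt32(1_522_756), m, k) == UInt32(1_522_756) ÷ UInt32(1000)
```

Because `m` and `k` depend only on the divisor, they can be computed once per `10^f` and reused for every multiplication.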

After this PR, FD multiply performance scales linearly with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.

master:

julia> using FixedPointDecimals, BenchmarkTools

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  84.916 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  249.083 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.660 ms (248829 allocations: 4.70 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  172.958 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  5.408 ms (308621 allocations: 6.14 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After PR #93:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  83.750 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  85.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  248.958 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  4.673 ms (160798 allocations: 3.22 MiB)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  56.791 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  60.041 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  173.000 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  4.750 ms (190708 allocations: 4.82 MiB)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

After this PR:

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int16,3}(1.234))
  48.458 μs (0 allocations: 0 bytes)
FixedDecimal{Int16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int32,3}(1.234))
  57.000 μs (0 allocations: 0 bytes)
FixedDecimal{Int32,3}(1700943.280)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int64,3}(1.234))
  90.708 μs (0 allocations: 0 bytes)
FixedDecimal{Int64,3}(4230510070790917.029)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{Int128,3}(1.234))
  180.125 μs (0 allocations: 0 bytes)
FixedDecimal{Int128,3}(-66726338547984585007169386718143307.324)

# unsigned

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt16,3}(1.234))
  42.708 μs (0 allocations: 0 bytes)
FixedDecimal{UInt16,3}(0.000)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt32,3}(1.234))
  51.250 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt64,3}(1.234))
  80.042 μs (0 allocations: 0 bytes)
FixedDecimal{UInt64,3}(16576189118051436.703)

julia> @btime for _ in 1:10000 fd = fd * fd end setup = (fd = FixedDecimal{UInt128,3}(1.234))
  162.417 μs (0 allocations: 0 bytes)
FixedDecimal{UInt128,3}(303384805088638153637410092093845905.434)

This PR does not explicitly introduce support for FD{BitIntegers.Int256}, though that should work out of the box both before and after this PR. Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations from Int128 widening to BigInt in FD operations.
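The widening behavior referred to here can be checked with Base alone. The short sketch below only shows the default Base behavior that causes the allocations; it does not use BitIntegers, and it makes no claim about how the package wires in its own 256-bit type.

```julia
# In Base, each fixed-width integer widens to the next size up...
@assert widen(Int32) === Int64
@assert widen(Int64) === Int128

# ...but Base has no 256-bit integer, so Int128 widens to BigInt.
@assert widen(Int128) === BigInt

x = typemax(Int128)
p = widemul(x, x)        # exact product, but stored as a heap-allocated BigInt
@assert p isa BigInt
@assert p == big(x)^2
```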


NHDaly · 2024-06-13T02:29:53Z

Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (thus * for FixedDecimal{Int64} is already fast). But this PR adds support for FixedDecimal{Int128}, which needed a custom function for Int256 div by const. :)
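As context for the discussion, the multiply can be sketched on raw integers as follows. This is a minimal sketch under stated assumptions: round-half-to-even on the remainder and wrap-around on overflow. `mul_raw` and its arguments are hypothetical names and are not the package's internals.

```julia
# Hypothetical illustration: x and y are the raw integers behind two
# fixed-point values with f decimal places (value = raw / 10^f).
function mul_raw(x::Int64, y::Int64, f::Int)
    d = Int128(10)^f                    # the constant divisor 10^f
    q, r = fldmod(widemul(x, y), d)     # exact Int128 product, floored divide
    # Round half to even: bump the quotient when the remainder is more than
    # half the divisor, or exactly half and the quotient is odd.
    if 2r > d || (2r == d && isodd(q))
        q += 1
    end
    return q % Int64                    # assumption: wrap on overflow
end

# 1.234 * 1.234 = 1.522756, which rounds to 1.523 at three decimal places.
@assert mul_raw(1234, 1234, 3) == 1523
```

The division by `d` is the step that benefits from the div-by-const treatment, since `d` is fixed by the type parameter `f`.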

See here, for my note on FixedDecimal{Int64} support:
#45 (comment)

NHDaly · 2024-06-13T02:39:38Z

Now that Julia has an effects system, this doesn't need @pure, so it's actually possible to do this!
And now that we've introduced BitIntegers.jl, the code is way simpler. 😊 woohoo!


NHDaly · 2024-06-13T20:12:20Z

Update 1: It turned out that (of course) * is too fast for @btime to get a good measurement of on its own. I've adjusted the benchmark to perform repeated multiplies, and the true performance is now included in the original comment.

Update 2: It turns out that this is actually faster for all int types (>2x faster for FD{Int64}), even though LLVM was doing a version of this optimization on its own. I'm honestly pretty surprised by that... From what I can tell, for FD{Int32}, the code is identical except for two instruction changes: one changing the order of registers as inputs to add (shouldn't matter), and one changing from comparing hi to comparing gt:

And yet somehow it's 25% faster (80µs -> 60µs).

NHDaly · 2024-06-13T20:18:42Z

Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined and called through a function call.
Now it's inlined, and I think the whole loop is constant-folded away, so it's actually sort of cheating. 😅
But still, allowing that optimization seems like a positive thing! The inlining is the main change, I think, for FD{Int32}.

julia> function bench(fd) 
           for _ in 1:10000
               fd = fd * fd
           end
           fd
       end
bench (generic function with 1 method)

Before:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
┌ Warning: /Users/nathandaly/.julia/dev/FixedPointDecimals/src/fldmod-by-const.jl no longer exists, deleted all methods
└ @ Revise ~/.julia/packages/Revise/bAgL0/src/packagedef.jl:666
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1163               ; -- Begin function julia_bench_1163
        .p2align        2
_julia_bench_1163:                      ; @julia_bench_1163
; %bb.0:                                ; %guard_exit4
        sub     sp, sp, #48
        stp     x20, x19, [sp, #16]             ; 16-byte Folded Spill
        stp     x29, x30, [sp, #32]             ; 16-byte Folded Spill
        ldr     w0, [x0]
        mov     w19, #10000
Lloh0:
        adrp    x20, "_j_*_1165"@GOTPAGE
Lloh1:
        ldr     x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        str     w0, [sp, #12]
        add     x0, sp, #12
        add     x1, sp, #12
        blr     x20
        subs    x19, x19, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit16
        ldp     x29, x30, [sp, #32]             ; 16-byte Folded Reload
        ldp     x20, x19, [sp, #16]             ; 16-byte Folded Reload
        add     sp, sp, #48
        ret
        .loh AdrpLdrGot Lloh0, Lloh1
                                        ; -- End function
.subsections_via_symbols

After:

julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0
        .globl  _julia_bench_1161               ; -- Begin function julia_bench_1161
        .p2align        2
_julia_bench_1161:                      ; @julia_bench_1161
; %bb.0:                                ; %guard_exit7
        ldr     w0, [x0]
        mov     w8, #10000
        mov     x9, #57148
        movk    x9, #36175, lsl #16
        movk    x9, #28311, lsl #32
        movk    x9, #33554, lsl #48
        mov     x10, #-1000
LBB0_1:                                 ; %L2
                                        ; =>This Inner Loop Header: Depth=1
        umull   x11, w0, w0
        umulh   x11, x11, x9
        lsr     x12, x11, #9
        mul     x13, x12, x10
        umaddl  x13, w0, w0, x13
        ubfx    x11, x11, #9, #1
        cmp     x13, #500
        cset    w13, hi
        csel    w11, w11, w13, eq
        add     w0, w11, w12
        subs    x8, x8, #1
        b.ne    LBB0_1
; %bb.2:                                ; %guard_exit19
        ret
                                        ; -- End function
.subsections_via_symbols

EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, it's still just as much faster in the new code:

julia> @noinline function bench(fd, N) 
           for _ in 1:N
               fd = fd * fd
           end
           fd
       end
bench (generic function with 2 methods)

julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
  48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283)

omus

Did another surface-level review. I can do a full review if no one else will. I will need to block off some time, though, as there's a bit going on in this PR.


NHDaly · 2024-06-17T03:49:53Z

I think Tomas should be able to review when he's back from vacation. He's out until Thursday. :)
Thanks for the offer, @omus! :)


NHDaly · 2024-07-10T01:31:34Z

Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, I think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks!

NHDaly · 2024-07-10T01:40:01Z

I also updated the PR comment with the final perf numbers, which look 👌

EdsterG · 2024-09-16T22:31:41Z

By the way, div/mod allocations are fixed in the most recent version of BitIntegers when running on 1.11+, see #48.

NHDaly · 2024-09-26T20:57:43Z

Oh wow, very nice! Thanks @EdsterG. 💪

This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪

NHDaly added 6 commits June 12, 2024 13:11

Further reduce BigInts by skipping a rem() in iseven

7756238

Fix ambiguity in _widemul(Int256, UInt256)

78e45dc

Bump patch version number

879c602

Add compat for BitIntegers

a245651

NHDaly requested review from Drvi and TotalVerb June 13, 2024 02:28

Support older versions of julia

4e53f3d

NHDaly added 4 commits June 12, 2024 20:42

Comments

dfd41b1

Disable fldmod-by-const tests on older julia

efee91b

Fix one other case of iseven allocating a BigInt

a03d754

Apply this optimization to FD{Int64} as well.

4ed8ebf

omus reviewed Jun 13, 2024

test/runtests.jl (outdated)

src/FixedPointDecimals.jl (outdated)

src/fldmod-by-const.jl (outdated)

NHDaly and others added 3 commits June 13, 2024 13:49

Adjust to run for all integer types!

3f39b8a

Clarify the _unsigned(x) methods with comments

20c66f2

Apply suggestions from code review

f2958ba

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 82eb32d to f2958ba Compare June 13, 2024 19:59

NHDaly requested a review from omus June 13, 2024 20:06

omus reviewed Jun 14, 2024

src/FixedPointDecimals.jl (outdated)

Update src/FixedPointDecimals.jl

0019bb0

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

NHDaly added 2 commits June 19, 2024 11:51

Add extensive tests for multiplication correctness, to cover the new change

4a53703

Tests all FD{(U)Int16} values. Tests most corner cases for FD{(U)Int128} values.

Named testsets to make failures easier to identify

27a3e0f

NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from 44686f3 to ac3302e Compare June 19, 2024 19:16

Fix off-by-one error in rounding truncation in calculate_inverse_coeff()

e4cb73b

We must use the precise value of 2^nbits(T) in order to get the correct division in all cases. ....... UGH except now the Int64 tests aren't passing.
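To illustrate why an exact power of two matters in this kind of computation, and why exhaustive 16-bit tests are useful, here is a small sketch. It does not reproduce the bug fixed in calculate_inverse_coeff(), whose details are not shown here; `magic_u16`, `all_correct`, and the `exact` flag are hypothetical, and the divisor 3 is chosen only because it makes the effect visible.

```julia
# Hypothetical illustration: multiplier and shift for dividing UInt16 values
# by d. With exact=false the multiplier is derived from 2^k - 1 instead of 2^k.
function magic_u16(d::UInt16; exact::Bool=true)
    L = 16 - leading_zeros(d - one(UInt16))   # ceil(log2(d))
    k = 16 + L
    pow = exact ? (UInt64(1) << k) : (UInt64(1) << k) - 1
    return cld(pow, UInt64(d)), k
end

# Exhaustively compare the multiply-and-shift result against real division
# for every possible 16-bit input.
function all_correct(d::UInt16; exact::Bool=true)
    m, k = magic_u16(d; exact=exact)
    return all(x -> (UInt64(x) * m) >> k == UInt64(x) ÷ d, 0:65535)
end

@assert all_correct(UInt16(3))                  # exact 2^k: always right
@assert !all_correct(UInt16(3); exact=false)    # 2^k - 1: wrong for x = 3
@assert all_correct(UInt16(1000))
```

An exhaustive sweep like this is cheap at 16 bits and catches off-by-one mistakes that spot checks can miss.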

NHDaly force-pushed the nhd-Int128-fastmul-noallocs branch from ac3302e to e4cb73b Compare June 19, 2024 19:16

NHDaly and others added 10 commits June 19, 2024 14:46

Add some comments and requires

73f6547

Copy/pasted definition for unsigned numbers straight from the book

4e9fdd6

Have magicgu support arbitrary integer sizes

188933d

Use the formulas from Hacker's delight for both signed and unsigned Ints

f6d375c

.

b79c873

.

4f4d17a

Restrict back to just Int128 & Int256 for custom div_by_const

eaeaddf

LLVM _does_ do this automatically if you directly call `fldmod`. Improve the comments as well.

It turns out that in newer versions of julia, you should call fldmod, not fld(x,y),mod(x,y), even if y is a constant! :)

a2dcf56

Improved that for julia versions 1.8 - 1.9
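The two spellings mentioned in this commit compute the same values; the commit's point is about which one the compiler optimizes well. The sketch below only checks the equivalence using Base functions and makes no performance claim of its own.

```julia
# fldmod(x, y) returns the same pair as calling fld and mod separately,
# including for negative dividends (floored division, non-negative remainder).
for (x, y) in ((1_522_756, 1000), (-1_522_756, 1000), (999, 1000), (-1, 1000))
    @assert fldmod(x, y) == (fld(x, y), mod(x, y))
end

q, r = fldmod(-1_522_756, 1000)
@assert (q, r) == (-1523, 244)
@assert q * 1000 + r == -1_522_756
```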

Reorganize the functions to be top-down

ef578e9

More thorough tests for fldmod_by_const

8df68b5

Base automatically changed from nhd-overflow-Int128 to master August 12, 2024 17:49

NHDaly added 5 commits August 12, 2024 11:51

Merge branch 'master' into nhd-Int128-fastmul-noallocs

66e1ecb

Bump patch version number

fd096ac

Add _widemul unit test

e147f5c

Change "unreachable" comment to an @assert false

f830c3c

add test comment

71ba82e
