-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Speed up * for FD{Int64} via custom fldmod_by_const
#45
Speed up * for FD{Int64} via custom fldmod_by_const
#45
Conversation
Add a custom implementation of `fldmod(x::T,y)` for when `y` is a const, and call it whenever T >= Int128. This improves performance for `*(a::FD{Int64}, b::FD{Int64})`, b/c `*` passes the result of `widemul(a,b)` into `fldmod` -- and LLVM isn't able to optimize `fldmodinline(::Int128,::Const{Int128})` as well as we'd like. Here, we opt to use a custom implementation of `fldmod` in that case, which runs significantly faster, making the multiply faster. -------- In this commit, Int128 still widens to BigInt, and we do not use the custom implementation in that case. In a follow-up commit, we'll change Int128 to widen to a simple 256-bit integer implementation, so that we can use the custom fldmod.
Unfortunately I still haven't found a good solution for the automated benchmarks (i'm working on it!), but in the meantime, i've re-run the benchmark locally. Here's the updated timings for just the FixedDecimal multiplication:
Compared to master (after merging #43), this PR reduces the time for |
Codecov Report
@@ Coverage Diff @@
## master #45 +/- ##
==========================================
- Coverage 96.49% 93.36% -3.13%
==========================================
Files 1 2 +1
Lines 171 226 +55
==========================================
+ Hits 165 211 +46
- Misses 6 15 +9
Continue to review full report at Codecov.
|
In this PR, we used the following check so that we'll only use the custom implementation of It might look a little weird as it's currently phrased, because only In the next PR (RelationalAI-oss#7), though, |
src/fldmod_by_const.jl
Outdated
return invcoeff, toshift | ||
end | ||
# These are needed to handle Int128, which widens to BigInt, since BigInt doesn't have typemax | ||
twoToTheSizeOf(::Type{T}) where {T} = typemax(widen(unsigned(T))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns 2^n - 1
. If f
is a power of 2, wouldn't twoToTheSizeOf(UT)÷f
be incorrect? I will take a more detailed look at the math later this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so, it never will be, because this argument isn't actually f
, it's 10^f
. Sorry, I did a bad job naming it. I've just renamed it to C
, to match the rest of the comment.
So in practice, the inputs will always be 1
, 10
, 100
, etc. That said, of course it's nice to be correct for any input if possible....
And yeah, you're right, it's wrong if the second argument is a power of 2.. :( Very good catch, @TotalVerb!!
Okay, so, i thought about this for a while and experimented a bit. So in general, you're right that it returns 2^n - 1
. But it ends up not mattering, i think, because we only end up returning the upper half of the invcoeff
(and we round when we truncate), so that off-by-one disappears. I've tried to make that clearer in the comments, and I've pushed up a commit now that attempts to clarify this, lemme know what you think.
That said, your intuition was right: the result is still wrong for powers of 2. I tried changing this to always calculate exactly 2^N by always using BigInts, but that didn't change the answers at all.
Instead, the problem was with the shift: If you shift away all the leading zeros, the answer ends up looking like 0xffff..
, and so the _round_to_even
at the end ends up rounding to 0x1
. It's sort of kludgy, but I was able to fix this by just subtracting 1
from toshift
whenever ispow2(C)
. This keeps a leading zero, and prevents it from looking like -1 to the rounding function.
That got me to be off-by-one as you predicted, i think, and so combining this with the BigInt change makes it work i think. So I think that fixes this function, but then the div_by_const
still isn't returning the right values.. :/
Hmm, so what I've done for now is just add a REQUIRES: C must not be a power of two
, and then added a check into div_by_const
like this:
elseif ispow2(C)
return div(x,C)
Because LLVM can handle dividing by a constant power of 2 just fine (it's just a bitshift)! But i'm down to keep trying to work on this if you want calculate_inv_coeff
to work when C
is a power-of-2.
:/ Sorry this is confusing.
This whole thing was rather tricky, which of course is bad, but the performance gains or so gooood, so ideally we can massage this into the least tricky version we can! :) I'm super happy to keep re-working it with any suggestions you might have. :)
Thanks!
test/fldmod_by_const.jl
Outdated
end | ||
|
||
@testset "div_by_const" begin | ||
vals = [2432, 100, 0x1, Int32(10000), typemax(Int64), typemax(Int16)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to my concern above, could we test some powers of two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happily! Done.
You were right of course. I've added a big note about fixing this above!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responded to most of the comments but not all. (In particular, I'm waiting on your feedback regarding renaming custom_fldmod
to fldmodinline
or whatever, so i didn't change that part.)
src/fldmod_by_const.jl
Outdated
return invcoeff, toshift | ||
end | ||
# These are needed to handle Int128, which widens to BigInt, since BigInt doesn't have typemax | ||
twoToTheSizeOf(::Type{T}) where {T} = typemax(widen(unsigned(T))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so, it never will be, because this argument isn't actually f
, it's 10^f
. Sorry, I did a bad job naming it. I've just renamed it to C
, to match the rest of the comment.
So in practice, the inputs will always be 1
, 10
, 100
, etc. That said, of course it's nice to be correct for any input if possible....
And yeah, you're right, it's wrong if the second argument is a power of 2.. :( Very good catch, @TotalVerb!!
Okay, so, i thought about this for a while and experimented a bit. So in general, you're right that it returns 2^n - 1
. But it ends up not mattering, i think, because we only end up returning the upper half of the invcoeff
(and we round when we truncate), so that off-by-one disappears. I've tried to make that clearer in the comments, and I've pushed up a commit now that attempts to clarify this, lemme know what you think.
That said, your intuition was right: the result is still wrong for powers of 2. I tried changing this to always calculate exactly 2^N by always using BigInts, but that didn't change the answers at all.
Instead, the problem was with the shift: If you shift away all the leading zeros, the answer ends up looking like 0xffff..
, and so the _round_to_even
at the end ends up rounding to 0x1
. It's sort of kludgy, but I was able to fix this by just subtracting 1
from toshift
whenever ispow2(C)
. This keeps a leading zero, and prevents it from looking like -1 to the rounding function.
That got me to be off-by-one as you predicted, i think, and so combining this with the BigInt change makes it work i think. So I think that fixes this function, but then the div_by_const
still isn't returning the right values.. :/
Hmm, so what I've done for now is just add a REQUIRES: C must not be a power of two
, and then added a check into div_by_const
like this:
elseif ispow2(C)
return div(x,C)
Because LLVM can handle dividing by a constant power of 2 just fine (it's just a bitshift)! But i'm down to keep trying to work on this if you want calculate_inv_coeff
to work when C
is a power-of-2.
:/ Sorry this is confusing.
This whole thing was rather tricky, which of course is bad, but the performance gains or so gooood, so ideally we can massage this into the least tricky version we can! :) I'm super happy to keep re-working it with any suggestions you might have. :)
Thanks!
test/fldmod_by_const.jl
Outdated
end | ||
|
||
@testset "div_by_const" begin | ||
vals = [2432, 100, 0x1, Int32(10000), typemax(Int64), typemax(Int16)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happily! Done.
You were right of course. I've added a big note about fixing this above!
Fix tests (`using Compat.Test`).
Renamed custom_fldmod to fldmod_by_const, and merged with fldmodinline. Moved that definition into the helper file `fldmod_by_const.jl` and merged with the existing fldmod_by_const.
Reorganized the fldmod_by_const.jl helper file: reordering all the functions top-down so that the interface function, `fldmod_by_const`, is at the top, and then the definitions of the functions it calls come after. I think this is clearer and easier to read.
Happy New Year you all! :) Hope you had a nice holiday. :) I've cleaned this up a bit after some time away and I think i've addressed all the outstanding comments! :) Please take another look when you get time. Cheers! |
Converted the docstrings on `fld_by_const` and `manual_mod` to regular function comments since A) they're not intended to be used outside this file anyway, and B) they're quite simple. Made spacing consistent: `(a, b)` not `(a,b)`.
I'll try to re-review this week. I'll note the CI is failing on Julia 0.6. |
# The implementation for fld_by_const was lifted directly from Base.fld(x,y), except that | ||
# it uses `div_by_const` instead of `div`. | ||
fld_by_const(x::T, y::Val{C}) where {T<:Unsigned, C} = div_by_const(x, y) | ||
function fld_by_const(x::T, y::Val{C}) where {T<:Integer, C} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T <: Base.BitInteger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do you think so? I didn't want to prevent anyone from using a custom integer type, such as, for example, a custom Int256
if they were feeling crazy. Or like maybe some has like an Int20
type or something for some reason.
But I do agree this optimization is pretty specific to the bit layout of the integer types, so it's reasonable to want to limit it to "any integer type," because i can imagine other integer types (BigInt
) that it wouldn't work for.
What do you think?
# BigInt doesn't have a concept of "leading zeros", but since we _know_ the value being | ||
# passed here will fit in 256-bits (per two_to_the_size_of), we can pretend this is a | ||
# 256-bit integer, take just the upper half as a UInt128, and count the leading zeros there. | ||
_leading_zeros(x::BigInt) = leading_zeros((x >> 128) % UInt128) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternative to having a leading_zeros
for BigInt
? I don't like that this could give an incorrect result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe this is a case where control-flow is preferable to dispatch, and i should just replace the call to _leading_zeros(invcoeff)
with a check like this:
if invcoeff isa BigInt
# BigInt doesn't have a concept of "leading zeros", but since we _know_ the value being
# passed here will fit in 256-bits (per two_to_the_size_of), we can pretend this is a
# 256-bit integer, take just the upper half as a UInt128, and count the leading zeros there.
leading_zeros((x >> 128) % UInt128)
else
leading_zeros(x)
end
? What do you think about that? I think that's actually decent because it prevents this function from being misused in the future as you rightly point out.
If you agree, I'll make this change!
narrow(::Type{UInt128}) = UInt64 | ||
narrow(::Type{UInt64}) = UInt32 | ||
narrow(::Type{UInt32}) = UInt16 | ||
narrow(::Type{UInt16}) = UInt8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add narrow(::Type{Int8}) = Int8
and make sure to add tests for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:/ unless i make some other changes, I don't think that's a good idea.
Maybe I should change the name of this to not be narrow
, because what I'm really relying on is that narrow(T)
will be exactly half the size of T
, so that splitint
will split an integer into the upper and lower halves.
I like the fact that as is, splitint(Int8(3))
throws MethodError: no method matching narrow(::Type{Int8})
, because yeah, you can't split it into two halves since there's no smaller type to hold the result! But maybe if this isn't clear I need to either rename narrow
or add better comments?
# Implemenation based on umul32hi, from https://stackoverflow.com/a/22847373/751061 | ||
# Compute the upper half of the widened product of two unsigned integers. | ||
# Example: `widemul(0x0020,0x2002) == 0x0004_0040` vs | ||
# `unsigned_splitmul_upper(0x0020,0x2002) == 0x0004` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment should go into the purpose of this function. I found the link to be more insightful than this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this was good feedback.
I've hopefully improved the comments in this last commit: dd78a51
Test the expected result directly, instead of checking `narrow(widen(T)) == T`
- Deleted promotion rule for `splitmul_upper` - Cleaned tests - Renamed `unsigned_splitmul_upper` to be the `T<:Unsigned` method of `splitmul_upper` - Removed semicolons & parentheses around if condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note the CI is failing on Julia 0.6.
Huh, apparently in julia 0.6
, widen(Int8)
returns Int32
. Weird. I've fixed this by changing the tests to just directly compare the result of narrow(T)
to the expected output.
# The implementation for fld_by_const was lifted directly from Base.fld(x,y), except that | ||
# it uses `div_by_const` instead of `div`. | ||
fld_by_const(x::T, y::Val{C}) where {T<:Unsigned, C} = div_by_const(x, y) | ||
function fld_by_const(x::T, y::Val{C}) where {T<:Integer, C} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do you think so? I didn't want to prevent anyone from using a custom integer type, such as, for example, a custom Int256
if they were feeling crazy. Or like maybe some has like an Int20
type or something for some reason.
But I do agree this optimization is pretty specific to the bit layout of the integer types, so it's reasonable to want to limit it to "any integer type," because i can imagine other integer types (BigInt
) that it wouldn't work for.
What do you think?
# BigInt doesn't have a concept of "leading zeros", but since we _know_ the value being | ||
# passed here will fit in 256-bits (per two_to_the_size_of), we can pretend this is a | ||
# 256-bit integer, take just the upper half as a UInt128, and count the leading zeros there. | ||
_leading_zeros(x::BigInt) = leading_zeros((x >> 128) % UInt128) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe this is a case where control-flow is preferable to dispatch, and i should just replace the call to _leading_zeros(invcoeff)
with a check like this:
if invcoeff isa BigInt
# BigInt doesn't have a concept of "leading zeros", but since we _know_ the value being
# passed here will fit in 256-bits (per two_to_the_size_of), we can pretend this is a
# 256-bit integer, take just the upper half as a UInt128, and count the leading zeros there.
leading_zeros((x >> 128) % UInt128)
else
leading_zeros(x)
end
? What do you think about that? I think that's actually decent because it prevents this function from being misused in the future as you rightly point out.
If you agree, I'll make this change!
narrow(::Type{UInt128}) = UInt64 | ||
narrow(::Type{UInt64}) = UInt32 | ||
narrow(::Type{UInt32}) = UInt16 | ||
narrow(::Type{UInt16}) = UInt8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:/ unless i make some other changes, I don't think that's a good idea.
Maybe I should change the name of this to not be narrow
, because what I'm really relying on is that narrow(T)
will be exactly half the size of T
, so that splitint
will split an integer into the upper and lower halves.
I like the fact that as is, splitint(Int8(3))
throws MethodError: no method matching narrow(::Type{Int8})
, because yeah, you can't split it into two halves since there's no smaller type to hold the result! But maybe if this isn't clear I need to either rename narrow
or add better comments?
Added much more explanation to the definition of splitmul_upper and its usage site. Improved comments in div_by_const to clarify the logic, and explain why splitmul_upper is necessary.
2b8d47b
to
dd78a51
Compare
(note: i've Resolved the threads I've addressed. Any remaining unresolved comments have questions that I'm waiting for feedback on. Thanks again for this thorough review!) |
…dmod_by_const` Currently `fldmod_by_const` is assuming that the coefficient is always computed as a runtime value, but for sufficiently large values of `f` (>=3 from what i've seen), this turns out not to be true. Marking it `Base.@pure` asks the compiler to _always_ peform this computation at compile time.
(One other change: I marked Without this, I was seeing type instability for |
Adds @inferred tests of several possible built-in FD representations that multiplication (even w/ two different types) is type stable. Before the last commit, this PR caused these tests to fail. For example: ``` (f1, f2) = (2, 4): Error During Test at /Users/nathan.daly/.julia/dev/FixedPointDecimals/test/runtests.jl:402 Test threw exception Expression: #= /Users/nathan.daly/.julia/dev/FixedPointDecimals/test/runtests.jl:402 =# @inferred(FD{T, f1}(1.1) * FD{T, f2}(1)) == FD{T, fmax}(1.1) return type FixedDecimal{UInt64,4} does not match inferred return type Any ``` After the last commit, these tests all pass.
8ee96e2
to
11ee5e1
Compare
Excellent news! We can now close this PR because Julia+LLVM now do this optimization themselves, automatically! 🎉 Since Julia 1.6, when (The image is scaled to fit the browser width, but the total time is 10x less) |
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Add a custom implementation of
fldmod(x::T,y)
for wheny
is a const,and call it whenever T >= Int128.
This improves performance for
*(a::FD{Int64}, b::FD{Int64})
, b/c*
passesthe result of
widemul(a,b)
intofldmod
-- and LLVM isn't able tooptimize
fldmodinline(::Int128,::Const{Int128})
nearly as well as we'd like.Here, we opt to use a custom implementation of
fldmod
in that case,which runs significantly faster, making the multiply faster.
Specifically, the custom
fldmod_by_const
implementation is based off of the assembly generated by LLVM forfld(x,100)
. One big reason it can't emit similarly optimized code whenx
is anInt128
is because there isn't a singleop
to efficiently get the upper-half of the "split-widemul". This requires adding a function that does it, which I've calledsplitmul_upper
.Note that this PR only directly improves performance for
FD{Int64}
:FD{Int128}
still widens toBigInt
, and we do not use the custom implementation in that case.I'm following this up with RelationalAI-oss#7, which changes
Int128
to widen to a 256-bit integer implementation, so that we can use the customfldmod
forFD{Int128}
as well! :)