
Support for Fused Multiply-Add (FMA) #102


Closed
dylanede opened this issue Apr 21, 2021 · 16 comments · Fixed by #138
Labels
C-feature-request Category: a feature request, i.e. not implemented / a PR

Comments

@dylanede

dylanede commented Apr 21, 2021

From my limited testing, multiplying and then adding SimdF32/SimdF64 does not result in FMA instructions (and rightly so). However, it would be useful to have a method on these types, and a corresponding platform intrinsic, to generate calls to the llvm.fma.* intrinsics, which support vector types.

Edit: In fact, it looks like the simd_fma platform intrinsic is already available; it's just not used by the crate?

@dylanede dylanede added the C-feature-request Category: a feature request, i.e. not implemented / a PR label Apr 21, 2021
@calebzulawski
Member

Yes, it's simply not implemented yet, but it is planned. It's easy enough to add, so I'll do it when I get a chance.

@dylanede
Author

Well, I'm looking at making a PR for it right now.

Currently it looks like this:

/// Performs `self * b + c` as a single operation.
#[inline]
pub fn fma(self, b: Self, c: Self) -> Self {
    unsafe { crate::intrinsics::simd_fma(self, b, c) }
}

But maybe it should be something like c.add_mul(a, b), which computes c + a*b? With the code above, the call would be a.fma(b, c).
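For context, the scalar f32::mul_add already in std uses the same self * a + b argument order as the fma sketch above; a minimal scalar illustration (my own example, not from the thread):

```rust
fn main() {
    // std's scalar mul_add computes self * a + b with a single rounding,
    // matching the argument order of the proposed a.fma(b, c).
    let a = 2.0f32;
    assert_eq!(a.mul_add(3.0, 1.0), 7.0); // 2 * 3 + 1
    // An alternative c.add_mul(a, b) would instead read as c + a * b.
}
```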

@programmerjake
Member

We will also want something like llvm.fmuladd.*, where LLVM picks whichever is more efficient: a separate mul/add or an fma.

@dylanede
Author

dylanede commented Apr 21, 2021

Yeah, I've just tested out the llvm.fma.* version. Although it improves performance in some cases when the fma target feature is available on x86_64, without that feature it falls back to a call to the fma library function, which completely tanks performance.

Presumably both options should be made available: as I understand it, fused multiply-add normally comes with the guarantee of a single rounding for the whole operation, whereas llvm.fmuladd.* does not provide that guarantee.

It doesn't look like a platform intrinsic for fmuladd is exposed yet.
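The single-rounding guarantee is observable even in scalar code; a small sketch (my own example, using std's f32::mul_add, which is documented to round only once):

```rust
fn main() {
    // a = 1 + 2^-12 is exactly representable; a*a = 1 + 2^-11 + 2^-24 exactly.
    let a = 1.0f32 + 2.0f32.powi(-12);
    // Separate multiply then subtract: rounding a * a to f32 drops the
    // 2^-24 term (a tie, rounded to even), so the result is exactly 2^-11.
    let separate = a * a - 1.0;
    // A fused multiply-add rounds only once, so the 2^-24 term survives.
    let fused = a.mul_add(a, -1.0);
    assert_eq!(separate, 2.0f32.powi(-11));
    assert_ne!(separate, fused);
}
```

This is exactly the precision difference that makes llvm.fma.* (always one rounding) distinct from llvm.fmuladd.* (either one or two roundings, at the compiler's discretion).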

@calebzulawski
Member

I believe the function should be called mul_add to match the equivalent scalar functions in std.

Exposing fast versions of intrinsics is probably a separate issue, and it definitely affects other functions too.

@calebzulawski
Member

@dylanede regarding the procedure call: this is related to #76; we have a path forward to fixing it, though it's non-trivial.

@workingjubilee
Member

> /// Performs `self * b + c` as a single operation.
> #[inline]
> pub fn fma(self, b: Self, c: Self) -> Self {
>     unsafe { crate::intrinsics::simd_fma(self, b, c) }
> }
>
> But maybe it should be something like c.add_mul(a, b) which is c + a*b? With the above code it would be a.fma(b, c).

Sounds good, with appropriate tests!
I agree with Caleb that it should be mul_add. Unfortunately it needs to be gated under the std feature due to the current fallback to a procedure call.

I would rather not have two functions that do the same thing, nor another f32 type (SimdFastF32? what?). Resolving this is beyond the scope of this issue, however.

@Lokathor
Contributor

Yeah, people know the term "multiply add", and f32 already has a mul_add method, so the SIMD types should stick with mul_add as the method name.

@programmerjake
Member

maybe:

impl<const N: usize> f32x<N> {
    pub fn mul_add<const STRICT: bool = true>(self, multiplier: Self, addend: Self) -> Self {
        if STRICT {
            // llvm.fma.*
        } else {
            // llvm.fmuladd.*
        }
    }
}

@Lokathor
Contributor

Since we're not tied to being an exact hardware impl, I think we should just define it from the start to always let llvm pick what's best. If people absolutely want strict ops they can transmute the value to a platform version and use stdarch.

In particular, I think const generics on methods and functions, when they aren't simply passed in from the type, are very bad ergonomically at this time and should be avoided if possible.

@calebzulawski
Member

IEEE 754-2008 specifies FMA to have a single rounding step.

@Lokathor
Contributor

I can't tell whether that means you're for or against having a strict fma.

@programmerjake
Member

I'm for having strict fma by default, alongside another function with these semantics: use separate mul & add (each correctly rounded) or use fma (correctly rounded), at the compiler's choice -- no other options (in particular, the afn fast-math flag is not allowed). The non-strict function matches the semantics of llvm.fmuladd.*.
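A sketch of that non-strict contract (the helper below is hypothetical, not part of the crate): a conforming result must be one of exactly two values, the twice-rounded mul/add or the once-rounded fma:

```rust
// Hypothetical check mirroring the llvm.fmuladd.* contract: the result
// must equal either the separately rounded a * b + c (two roundings)
// or the correctly rounded fused result (one rounding) -- nothing else.
fn fmuladd_contract_holds(a: f32, b: f32, c: f32, result: f32) -> bool {
    let separate = a * b + c;
    let fused = a.mul_add(b, c);
    result == separate || result == fused
}

fn main() {
    // Inputs where the two allowed answers actually differ.
    let a = 1.0f32 + 2.0f32.powi(-12);
    let (b, c) = (a, -1.0f32);
    assert!(fmuladd_contract_holds(a, b, c, a * b + c));
    assert!(fmuladd_contract_holds(a, b, c, a.mul_add(b, c)));
    // A fast-math (afn) transformation could produce some third value;
    // that is exactly what this contract rules out.
}
```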

@gilescope

@dylanede happy to test your PR out if you've got something that hangs together. I am currently using _mm_maddubs_epi16 - I'm assuming that mul_add will be able to replace this.

@workingjubilee
Member

ping @dylanede to check if you're still interested in followup or have any questions ❤️

@gilescope

@dylanede if you have something halfway but don't have time to finish it feel free to pop the branch on your fork and I can have a crack at finishing it.

workingjubilee added a commit that referenced this issue Jun 23, 2021
Add various fns
- Sum/Product traits
- recip/to_degrees/to_radians/min/max/clamp/signum/copysign; #14
- mul_add: #14, fixes #102
6 participants