Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Vectorisation: gcc errors for float vector times int scalar (or for double vector) #95

Closed
valassi opened this issue Dec 8, 2020 · 11 comments

Comments

@valassi
Copy link
Member

valassi commented Dec 8, 2020

This is a spinoff of #71

I just tried to switch to float and I get some errors like

CPPProcess.cc:164:32: error: conversion of scalar ‘int’ to vector ‘mgOnGpu::fptype_v’ {aka ‘__vector(16) float’} involves truncation
       fi_0 = cxmake( p, -pvec3 * nsf );
                         ~~~~~~~^~~~~

I made a simple reproducer. This fails with

g++ -O3 -march=core-avx2 local/truncate1.cc
local/truncate1.cc: In function ‘float4 timesN(int, const float4&)’:
local/truncate1.cc:29:57: error: conversion of scalar ‘int’ to vector ‘float4’ {aka ‘__vector(4) float’} involves truncation
 float4 timesN( const int n, const float4& v ){ return n * v; }
                                                       ~~^~~

where

cat local/truncate1.cc
#include <iostream>

typedef double double4 __attribute__ (( vector_size ( 4 * sizeof(double) ) ));
typedef float float4 __attribute__ (( vector_size ( 4 * sizeof(float) ) ));

inline std::ostream& operator<<( std::ostream& out, const double4& v )
{
  out << "{ " << v[0];
  for ( int i=1; i<4; i++ ) std::cout << ", " << v[i];
  out << " }";
  return out;
}

inline std::ostream& operator<<( std::ostream& out, const float4& v )
{
  out << "{ " << v[0];
  for ( int i=1; i<4; i++ ) std::cout << ", " << v[i];
  out << " }";
  return out;
}

double4 times2( const double4& v ){ return 2 * v; }

float4 times2( const float4& v ){ return 2 * v; }

double4 timesN( const int n, const double4& v ){ return n * v; }

// error: conversion of scalar ‘int’ to vector ‘float4’ {aka ‘__vector(4) float’} involves truncation
float4 timesN( const int n, const float4& v ){ return n * v; }
//float4 timesN( const int n, const float4& v ){ return (float)n * v; } // this is ok instead...

int main( int argc, char **argv )
{
  double4 f4 = {0.1, 1.1, 2.1, 3.1};
  float4 d4 = {4.1, 5.1, 6.1, 7.1};
  std::cout << d4 << std::endl;
  std::cout << f4 << std::endl;

  std::cout << std::endl;
  std::cout << times2(d4) << std::endl;
  std::cout << times2(f4) << std::endl;

  std::cout << std::endl;
  std::cout << 3*d4 << std::endl;
  std::cout << 3*f4 << std::endl;

  std::cout << std::endl;
  std::cout << timesN(3,d4) << std::endl;
  std::cout << timesN(3,f4) << std::endl;
}

This only happens with floats, not doubles, and only with ints that I pass from a function (not hardcoded).

Some links mention this, but I do not see a solution.

I am not sure about

  • what is actually causing this? should it not simply convert the scalar int to scalar float an dmultiply each element of the float vector?
  • is this a gcc bug?

@sponce and @hageboeck do you have an idea?
Thanks Andrea

@valassi
Copy link
Member Author

valassi commented Dec 8, 2020

PS I forgot to mention, I tried gcc8.3, 9.2 and 10.1, all of them have the same issue

@sponce
Copy link

sponce commented Dec 8, 2020

I believe it is as simple as the error says : "conversion of scalar ‘int’ to vector ‘float4’ {aka ‘__vector(4) float’} involves truncation". The important thing being at the end : the problem is that an int does not fit in a float, you will lose digits as an int has 31 bits and the mantissa of the float only 23. On the other hand the mantissa of a double is 52 bits, so it fits. Nothing related to vectors I believe. Although it may be a warning in scalar ?

@sponce
Copy link

sponce commented Dec 8, 2020

Ah, and of course if you explicitly cast (like (float)n), you're telling the compiler you know what you are doing and in this case that you want to but precision, so it does it

@sponce
Copy link

sponce commented Dec 8, 2020

And last point : with a hardcoded value, the compiler can check whether this particular value fits, and if it does, no problem. While with a parameter it has to suppose it won't fit

@valassi
Copy link
Member Author

valassi commented Dec 8, 2020

Very good, thanks a lot! 👍
I had not done my maths, ehm...

I am still a bit surprised that this only shows up as an error for vector types however:

  • if you multiply an int and a float, the int is converted/narrowed to float with no warning or error; if you multiply an int and a vector of floats, you get an error
  • similarly, I hav ethe impression, if you build a scalar float from an int, you get no warning; if you build a vector float from an int, you get a narrowing warning

Anyway I found a simple solution to my problem here: in Madgraph I can simply replace the "int" by a "short". Here these are only helicities, mainly +1 or -1. Until we find a particle with spin +32769, that should do it...

Debugging another unrelated issue, but I have a patch that should fix this, then I will close this. Thanks again Sebastien!

@sponce
Copy link

sponce commented Dec 8, 2020

Your weird points are surely explain by the automatic casting that happens behind the scene. I would have to check the references but basically float * int is probably resulting in 4 steps : converting float to double, converting int to double, computing product, converting back to float. Now take care, this would mean a slow down of a factor ~10 for the operation ! I've seen codes where you gain a factor 2 by fixing types properly to avoid conversions between floats and doubles. And I mean real life reconstruction code in LHCb, not toy examples !

For all these kind of things the key tool is godbolt. As an example, let's check I'm right on the float * int -> https://godbolt.org/z/73nq7c and I'm totally wrong ! :-D Ok, one would have to dig further !

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 8, 2020
…t multiplied by float vector)

Results ok in no-simd float. Throughput 1.84E5 (lower than no-simd double?!)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Internal loops fptype_sv   = SCALAR (no SIMD)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.973374e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.945622e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.775213e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.001029e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.845519e+00                 )  sec
MeanTimeInMatrixElems      = ( 2.845519e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.845519e+00 ,  2.845519e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.763276e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.779889e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.842504e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372328e-02 +- 1.131740e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 5.445383e-03 ,  7.884562e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194670e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
@valassi
Copy link
Member Author

valassi commented Dec 8, 2020

Thanks a lot Sebastien! Do not worry too much about my question. It's just peculiar that gcc does two different things in places where I would expect it to do the same thing, that's all.

Anyway, the issue is fixed, I moved to short int for helicities and I checked that all behave ok for floats also with simd (I had to fix a few more issues down the line): valassi@327b5c3

Closing this one, thanks again Sebastien!
Andrea

@valassi valassi closed this as completed Dec 8, 2020
@hageboeck
Copy link
Member

hageboeck commented Dec 8, 2020

Very good, thanks a lot! 👍
I had not done my maths, ehm...

I am still a bit surprised that this only shows up as an error for vector types however:

  • if you multiply an int and a float, the int is converted/narrowed to float with no warning or error; if you multiply an int and a vector of floats, you get an error
  • similarly, I hav ethe impression, if you build a scalar float from an int, you get no warning; if you build a vector float from an int, you get a narrowing warning

Anyway I found a simple solution to my problem here: in Madgraph I can simply replace the "int" by a "short". Here these are only helicities, mainly +1 or -1. Until we find a particle with spin +32769, that should do it...

Debugging another unrelated issue, but I have a patch that should fix this, then I will close this. Thanks again Sebastien!

@valassi
That's because you have ways to do error handling with scalar floating-point arithmetic. You can set traps that check for overflows nans, infinities etc. You can make the library throw floating-point exceptions, you can make it generate the SIGFPE signal and install signal handlers. Many C functions have the side-effect of setting the errno when something goes wrong. In short, you have lots of ways to "do stuff" if something goes wrong.

Now, if you are playing with vector types, things might go wrong in only a single lane of the vector register. What should be done? Send signals? What if you stored all 4 values already in memory, and entry 1 generates an exception? Are entries 2 and 3 valid or should they be wiped away? (A regular for-loop would not have computed them)

That is to say:
gcc is suuuuper conservative when you vectorise, because you might get different results as compared to the scalar implementation.

  • When doing auto vectorisation, it just doesn't vectorise the loop.
  • When doing explicit vectorisation, it doesn't let you shoot yourself in the foot.

So Sebastiaen is right that getting your types right can improve a lot of things.

@valassi
Copy link
Member Author

valassi commented Dec 8, 2020

Hi @hageboeck thanks, useful to know. But I think implementing some of the things you mention would be too complicated and unnecessary. Anyway, using short int (which I am quite sure is a good type for helicities) resolves the issues (without any hacks, it les gcc do what it finds right). So, issue solved :-)

@valassi
Copy link
Member Author

valassi commented Dec 9, 2020

For the record, here's a very useful table (by user "Z boson": thanks!): https://stackoverflow.com/a/43778723

This may be extremely useful to port this implementation beyond gcc to clang, icc or other compilers. It seems that clang/icc do not support most scalar-operator-vector operations, so some extra boilerplate additions is probably needed to port there (or we could use the LHCb SIMD wrapper described by @sponce).

@hageboeck
Copy link
Member

hageboeck commented Dec 9, 2020

I explained why gcc complains. I actually didn't Edit: mean to propose to implement anything.

I'm all for writing a nice simple for loop and letting the compiler do the rest. Too much of those vector extensions looks like a step in the opposite direction of portability.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants