Bad matrix-vector multiplication performance in Cuda #257
Very interesting. It's possible that the strategies in place are a better fit for SSE register-size allocations than for scalar GPUs. Actually, the implementation is really based on SSE registers. It seems very possible to come up with a better strategy for scalar GPUs. I'll investigate when I find the time for this. Thanks,
Could you give a rough estimate so that I can decide whether to use glm for my project? Thanks, adam
Sadly I can't.
OK :) I made one additional test by replacing operator*(..) with my own function:
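A minimal sketch of what such a replacement might look like, assuming row-major storage as described below; the struct and function names are illustrative, not the original code:

```cpp
#include <cuda_runtime.h> // float4, make_float4

// Illustrative stand-in for the custom multiply mentioned above.
// Assumes a row-major 4x4 matrix; all names are made up for this sketch.
struct MyMat4 {
    float4 row[4]; // row[i] holds row i of the matrix
};

__host__ __device__ inline float4 mul(const MyMat4& m, const float4& v)
{
    return make_float4(
        m.row[0].x * v.x + m.row[0].y * v.y + m.row[0].z * v.z + m.row[0].w * v.w,
        m.row[1].x * v.x + m.row[1].y * v.y + m.row[1].z * v.z + m.row[1].w * v.w,
        m.row[2].x * v.x + m.row[2].y * v.y + m.row[2].z * v.z + m.row[2].w * v.w,
        m.row[3].x * v.x + m.row[3].y * v.y + m.row[3].z * v.z + m.row[3].w * v.w);
}
```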
The performance didn't change; it's still 4 times slower. Therefore I think it might have something to do with how the matrices are stored: in my custom code row-major is used, while GLM uses column-major as far as I know. On the other hand, I also tried storing row-major and flipping my custom mul, with no success. Bottom line: I think it's neither row-major storage nor the multiplication function, but something else.
Looking in glm/detail/type_vec4.hpp I saw the storage type used for vec4, and that it has no explicit alignment.
Yey :D These are the new benchmarks 8)
The matrix multiplication is not totally there yet, but very close. The other two are even faster than before. This is a diff against version 0.9.5.4:
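The gist of the change, as far as the rest of the thread indicates, is adding 16-byte alignment to the vec4 storage; a sketch under that assumption, not the exact patch:

```cpp
// Sketch only: the idea behind the diff, assuming it adds 16-byte alignment
// to the 4-float storage so the GPU can issue a single 128-bit load.
#if defined(__CUDACC__)
#   define ALIGN16 __align__(16)  // nvcc's alignment attribute
#else
#   define ALIGN16 alignas(16)    // standard C++11 equivalent
#endif

struct ALIGN16 vec4_storage {
    float x, y, z, w;
};
```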
But I'm afraid it can't go into master like that. Anyway, it fixes GLM for me now, and I can use it until the change goes in. Thanks, I'm happy.
Interesting results. fvec4SIMD does have this alignment code in place, and yes, it's pretty important for CPU SSE code paths. Do you actually need:
Hm, it seems that I don't need them. I put them in originally because QtCreator showed me an error. But at the moment I'm not sure of anything, because after messing around with GLM's mat4*vec4 operator, suddenly CUDA's helper-math matrix operations take 333 milliseconds (100 more). So I'll reboot and recheck. (Edit: after the reboot it's back to 225.) Is it possible to use some define or something to enable it, so that I can use vanilla GLM? One more piece of info:
a plain alignment attribute also seems to work instead of the code with builtin_align(16).
Use GLM_ALIGN(16) instead of builtin_align(16) and you might be good to go... But that's probably all that is needed here. |
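For illustration, a small sketch of using GLM_ALIGN on a struct; the struct name is made up:

```cpp
#include <glm/glm.hpp> // GLM_ALIGN comes from GLM's setup headers

// GLM_ALIGN expands to the compiler-appropriate attribute
// (__declspec(align(16)), __attribute__((aligned(16))), ...), which keeps
// the change portable across MSVC, GCC/Clang, and nvcc.
struct GLM_ALIGN(16) aligned_storage_sketch {
    float x, y, z, w;
};
```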
This works for me.
These changes made it to master for the GLM 0.9.6 release. Thanks for contributing!
Do you think it would also be possible to add the alignment rules for vec2 and vec3? It would be align(8) for vec2 and align(16) for vec3. I did some tests with the cross kernel using float3. I don't understand why vec3 aligned to 16 bytes is slower than vec4; it might be the testing method. But the thing is that the CUDA docs recommend align 16, so it should be better.
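A sketch of what those rules would look like in CUDA terms; the types here are illustrative stand-ins, not GLM's:

```cpp
// Illustrative only: the alignments proposed above, in nvcc syntax.
struct __align__(8)  my_vec2 { float x, y; };    // one 64-bit load/store
struct __align__(16) my_vec3 { float x, y, z; }; // padded to 16 bytes, one 128-bit load
```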
The fix for this bug turned out not to be valid, as it would force all vec4 types to be aligned to 16 bytes, including i8vec4 for example. The fix has been reverted for the moment.
I created a new extension exposing aligned types. Alignment is definitely not what we always want, even if it's extremely useful. If you want an aligned flavor of vec4, include <glm/gtc/type_aligned.hpp> and you can use aligned_vec4. You can also define your own aligned type in a cross-platform manner, where my_vec3 is a vec3 aligned to 16 bytes, as sketched below. Thanks for contributing,
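Presumably this refers to GLM's GLM_ALIGNED_TYPEDEF macro; a sketch under that assumption:

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/type_aligned.hpp>

// Assumed usage: declare my_vec3 as a glm::vec3 typedef aligned to 16 bytes.
GLM_ALIGNED_TYPEDEF(glm::vec3, my_vec3, 16);
```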
Hi, What's left to fix here? |
During development I noticed that glm slowed down my project quite a bit. So I made some simple tests with cuda + glm.
I basically do a mat4 * vec4 multiplication several times in a CUDA kernel, once using GLM and once using CUDA's float4 plus a very simple custom mat4.
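An illustrative reconstruction of what the GLM side of the test could look like; the actual benchmark is in the Bitbucket repository linked below:

```cpp
#define GLM_FORCE_CUDA // enable GLM's CUDA support before including it
#include <glm/glm.hpp>

// Sketch of a benchmark kernel: repeat the multiply so arithmetic, not
// memory traffic, dominates the timing. The loop count is arbitrary.
__global__ void mulBench(glm::mat4 m, const glm::vec4* in, glm::vec4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    glm::vec4 v = in[i];
    for (int k = 0; k < 100; ++k)
        v = m * v;
    out[i] = v;
}
```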
Interestingly, dot and cross product are faster with GLM in a similar test.
I used a GeForce GTX 550 Ti, CUDA 6.5, and GLM 0.9.5.4 on Linux for the test.
I've put the full test onto Bitbucket (https://bitbucket.org/adamce/cuda-glm-performance-test/src/).