Optimise OpenCL kernels #7

Open
hjabird opened this issue May 23, 2019 · 4 comments

Comments


hjabird commented May 23, 2019

Surely the code can be faster?

hjabird added the enhancement (New feature or request) label May 23, 2019
hjabird self-assigned this May 23, 2019

hjabird commented May 23, 2019

OpenCL particle induced velocity:

See commit ae683f3

Approximately 7.5% faster:

  • Move 1/4pi term to host
  • Precompute 1 / regularisation_radius
  • Winckelmans kernel: g = ... / powr(rho, 2.5) -> ... * rsqrt(pown(rho, 5))
  • Neaten with radd = length(rad)
  • Remove the if(all(isequal(mespnt, particlepnt))) branch and replace it with isnormal(..) ? ret : 0.0, so any result that is not a normal float is zeroed
  • Use knowledge that get_global_id(0) == get_local_id(0)

Not used:

  • Using fast_length instead of length is ~7% faster, but gives far lower precision.

Verified good:

  • Workgroup size seems good.
  • Reduction method seems good. I tried a scheme where the reduction was performed in registers rather than in the local data store. This slowed the GPU down considerably.
  • Host memory prep only uses a few % of runtime.

To do:

  • GPU loop unrolling?

NB:

  • Benchmarking was performed using a Julia script. The times are for ind_vel only. The Julia script adds overhead, which dominates the low-end results and adds a couple of seconds at around a million particles.


hjabird commented May 23, 2019

commit d1a77dc

ind_dvort is approximately 27% faster.

  • Move 1/(4 pi reg_rad^3) to host
  • Precompute 1 / reg_rad
  • Winckelmans kernel: same rsqrt(pown(..)) trick as above for both f and g terms.
  • Same removal of if with ret = isnormal(ret) ? ret : (float3)(0.f, 0.f, 0.f) as above
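
The branch-free zeroing in the last bullet can be illustrated for a single component in plain C, where isnormal from <math.h> plays the role of the OpenCL built-in. The function name is hypothetical, and the actual kernel applies the test to a float3 rather than a scalar.

```c
#include <math.h>

/* Return x unchanged if it is a normal float; otherwise (NaN, infinity,
   zero, or subnormal) return 0. This mirrors, for one component,
   ret = isnormal(ret) ? ret : 0. */
float zero_if_not_normal(float x) {
    return isnormal(x) ? x : 0.0f;
}
```

Note that this test also flushes subnormal results to zero, which the original equality check on the two points would not have done.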

As above, there is some overhead in the Julia script used for benchmarking: 3% in the worst case.


hjabird commented May 23, 2019

commit 6d962b7

ind_dvort is 1.5% faster

  • Use knowledge that get_global_id(0) == get_local_id(0)
  • Reuse the repeated 1/(rho^3) term


hjabird commented May 24, 2019

commit 4db32f2

Changing to transpose method for dvort calculations allows reusing a cross product. 7% faster.
