Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

WIP: Split the EMHD kernel #122

Merged
merged 3 commits into from
Sep 6, 2024
Merged

WIP: Split the EMHD kernel #122

merged 3 commits into from
Sep 6, 2024

Conversation

bprather
Copy link
Contributor

@bprather bprather commented Sep 3, 2024

Backport of previous PR, which had merged the ISMR branch.

This is a bigger patch than I'd remembered. It eliminates the booleans from emhd_params, adds and modifies a few overloads to make the implicit kernel functions work on all globally-indexed variables, and only then can split the kernel.

Turns out these changes still make the CPU performance substantially slower (~2x?). If it's faster on both Nvidia and AMD GPUs that might be justifiable, but we shouldn't toss this in just yet.

Backport of previous patch which merged the ISMR branch.

This is a bigger patch than I'd remembered.  It eliminates the
booleans from `emhd_params`, adds and modifies a few
overloads to make the implicit kernel functions work on all
globally-indexed variables, and only *then* can split the kernel
@bprather bprather changed the title Split the EMHD kernel WIP: Split the EMHD kernel Sep 3, 2024
@vedantdhruv96
Copy link
Contributor

vedantdhruv96 commented Sep 5, 2024

Results from running on a single A100 on Delta (perf reported in total ZCPS):

Problem: Extended MHD linear modes 1024x1024 (nlim=150)

dev: 1.8 x 10^6
unsplit: 1.75 x 10^6
split: 1.90 x 10^6

Problem: Extended MHD torus 288x128x128 (nlim=102)

dev: 3.64 x 10^6
unsplit: 3.90 x 10^6
split: 3.56 x 10^6

I'm not sure if I see a clear preference; at most there is ~10% gain for any type of solver split. I believe the problem size, especially for the torus problem, is large enough that we are deep into the compute-bound regime.

@bprather
Copy link
Contributor Author

bprather commented Sep 6, 2024

On Frontier, one full node, using a slightly larger EMHD torus (256x128x256) (parameters in sane_perf_emhd.par):

  • dev: 4.03e+06
  • merged: 7.64e+06
  • split: 1.22e+07
    The old PR (which didn't have an option) and new PR using split have the same performance. Informally testing on a single large CPU, I see 5-8e5 (!) ZCPS, both in dev and in this branch, split or not -- there's perhaps a 10-30% decrease in performance in the new code, but not a 2x hit. Since CPUs aren't our primary target, especially for EMHD, this seems like an acceptable trade.

So, this PR improves performance drastically on MI250x, bringing it back in line with other platforms so that EMHD is about 8-10x slower than ideal GRMHD everywhere. It mildly slows performance on CPUs and seems to be completely within measurement error on A100s. Seems like we should merge it.

@bprather bprather merged commit 16f3871 into dev Sep 6, 2024
1 of 2 checks passed
@bprather bprather mentioned this pull request Sep 12, 2024
@bprather bprather deleted the feature/split-emhd branch September 13, 2024 19:00
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants