WIP: Split the EMHD kernel #122

bprather · 2024-09-03T16:39:38Z

Backport of previous PR, which had merged the ISMR branch.

This is a bigger patch than I'd remembered. It eliminates the booleans from `emhd_params`, adds and modifies a few overloads so the implicit kernel functions work on all globally-indexed variables, and only then splits the kernel.

Turns out these changes still make the CPU performance substantially slower (~2x?). If it's faster on both Nvidia and AMD GPUs that might be justifiable, but we shouldn't toss this in just yet.
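To make the merged-versus-split distinction concrete, here is a minimal sketch in plain C++ (no Kokkos). It is not KHARMA's code: `kUseSplitKernel`, `compute_residual`, `apply_update`, and the `implicit_step*` functions are hypothetical names, and the per-zone math is a placeholder for the real implicit solve. The merged form does all the work for a zone in one loop body, while the split form runs two loops that communicate through a buffer, which on a GPU would correspond to separate, smaller kernel launches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical build-time switch standing in for the merged/split option.
constexpr bool kUseSplitKernel = true;

// Placeholder for the expensive per-zone work (not the real EMHD residual).
inline double compute_residual(double prim, double dt) {
    return prim * prim * dt;
}

// Placeholder update that consumes a precomputed residual.
inline double apply_update(double prim, double residual) {
    return prim - residual;
}

// Merged: a single loop body does everything for each zone.
inline void implicit_step_merged(std::vector<double>& prims, double dt) {
    assert(dt > 0.0);
    for (std::size_t i = 0; i < prims.size(); ++i)
        prims[i] = apply_update(prims[i], compute_residual(prims[i], dt));
}

// Split: two loops ("kernels") with an intermediate buffer between them.
inline void implicit_step_split(std::vector<double>& prims, double dt) {
    assert(dt > 0.0);
    std::vector<double> residuals(prims.size());
    for (std::size_t i = 0; i < prims.size(); ++i)
        residuals[i] = compute_residual(prims[i], dt);
    for (std::size_t i = 0; i < prims.size(); ++i)
        prims[i] = apply_update(prims[i], residuals[i]);
}

// Dispatch on the build-time switch; both paths give the same answer.
inline void implicit_step(std::vector<double>& prims, double dt) {
    if (kUseSplitKernel) implicit_step_split(prims, dt);
    else                 implicit_step_merged(prims, dt);
}
```

The split form trades extra memory traffic (the `residuals` buffer) for smaller loop bodies. Which side of that trade wins depends on the hardware, which is why the comparison has to be measured per platform.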


vedantdhruv96 · 2024-09-05T06:04:56Z

Results from running on a single A100 on Delta (perf reported in total ZCPS):

Problem: Extended MHD linear modes 1024x1024 (nlim=150)

dev: 1.8 x 10^6
unsplit: 1.75 x 10^6
split: 1.90 x 10^6

Problem: Extended MHD torus 288x128x128 (nlim=102)

dev: 3.64 x 10^6
unsplit: 3.90 x 10^6
split: 3.56 x 10^6

I'm not sure if I see a clear preference; at most there is ~10% gain for any type of solver split. I believe the problem size, especially for the torus problem, is large enough that we are deep into the compute-bound regime.
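As a quick check on the "~10%" figure, the relative changes can be worked out from the numbers above. The sketch below is only arithmetic on the quoted ZCPS values; the `Result` struct and `percent_change` helper are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// ZCPS figures for one problem, as quoted in the A100 results above.
struct Result {
    double dev;
    double unsplit;
    double split;
};

constexpr Result linear_modes{1.80e6, 1.75e6, 1.90e6};  // 1024x1024, nlim=150
constexpr Result torus{3.64e6, 3.90e6, 3.56e6};         // 288x128x128, nlim=102

// Percent change of `value` relative to `baseline` (positive means faster).
inline double percent_change(double value, double baseline) {
    assert(baseline > 0.0);
    return 100.0 * (value - baseline) / baseline;
}
```

Relative to dev, the split build comes out about 5.6% faster on linear modes and about 2.2% slower on the torus, while the unsplit build is about 2.8% slower and 7.1% faster, respectively. Every difference stays under 10%, and the sign flips between the two problems.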

bprather · 2024-09-06T17:34:29Z

On Frontier, one full node, using a slightly larger EMHD torus (256x128x256) (parameters in sane_perf_emhd.par):

dev: 4.03e+06
merged: 7.64e+06
split: 1.22e+07

The old PR (which didn't have an option) and new PR using split have the same performance. Informally testing on a single large CPU, I see 5-8e5 (!) ZCPS, both in dev and in this branch, split or not -- there's perhaps a 10-30% decrease in performance in the new code, but not a 2x hit. Since CPUs aren't our primary target, especially for EMHD, this seems like an acceptable trade.

So, this PR improves performance drastically on MI250x (about 3x over dev in the Frontier test above), bringing it back in line with other platforms so that EMHD is about 8-10x slower than ideal GRMHD everywhere. It mildly slows performance on CPUs and seems to be within measurement error on A100s. Seems like we should merge it.
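The Frontier gain can be broken into two steps using the three ZCPS values quoted above. This is a small arithmetic sketch; the constant names and the `speedup` helper are hypothetical.

```cpp
#include <cassert>

// Frontier full-node ZCPS figures quoted above (EMHD torus, 256x128x256).
constexpr double frontier_dev = 4.03e6;
constexpr double frontier_merged = 7.64e6;
constexpr double frontier_split = 1.22e7;

// Throughput ratio: values above 1 mean `new_zcps` is faster.
inline double speedup(double new_zcps, double old_zcps) {
    assert(old_zcps > 0.0);
    return new_zcps / old_zcps;
}
```

With these inputs, the reworked kernel in merged form is about 1.9x faster than dev, splitting adds a further factor of about 1.6x, and the combined gain is about 3.0x.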

bprather added 2 commits September 3, 2024 10:37

Allow compiling new implicit kernel as merged or split. Not clear it makes much difference in this form.

f2e8960

bprather changed the title ~~Split the EMHD kernel~~ WIP: Split the EMHD kernel Sep 3, 2024

Merge branch 'dev' into feature/split-emhd

890fb6e

bprather merged commit 16f3871 into dev Sep 6, 2024
1 of 2 checks passed

bprather mentioned this pull request Sep 12, 2024

KHARMA 2024.9 #98

Merged

bprather deleted the feature/split-emhd branch September 13, 2024 19:00
