
NVHPC takes >10 minutes to compile params.cpp #922

Open
bprather opened this issue Aug 15, 2023 · 6 comments
Comments

@bprather
Collaborator

When compiling Parthenon with NVHPC (every version, machine, and host & target architecture I tested), the compiler takes a very long time on one specific file, params.cpp. In my testing it adds roughly 10 minutes to the compile on both desktop and HPC machines, all of it spent compiling this one file on one core, mostly in the cpp2 subprogram.

The compiler is also pretty slow on some other later-stage Parthenon files (e.g. metadata.cpp), but params.cpp is by far the worst. This appears to be due to a change within the last ~8 months, and likely on the more recent side of that window, since previous versions I've used with KHARMA do not show this behavior with any NVHPC version. Now that I've tested enough to know it's present in all versions of NVHPC, I'll bisect locally and hopefully isolate the relevant commit.

Just wanted to check whether others have seen similar behavior, and if so whether it's worth surfacing to Nvidia.

@Yurlungur
Collaborator

I have not observed this, but I suspect I know what it is. params.cpp now contains a lot of template machinery to support writing params of different non-scalar types to the restart file. It could probably be split into several implementation files to let the build parallelize, if that would be useful.

@bprather
Collaborator Author

I can verify it's present in Parthenon proper when using the Kokkos nvcc_wrapper with nvc++ to compile for device (incidentally, the process also uses up to ~11 GB of RAM by itself, which could be an issue for even getting the code compiled on smaller machines). However, the issue isn't present when compiling with nvc++ for the host, which is interesting. The offending process, cpp2, definitely seems to be processing CUDA front-end output for the file, so if we could find a way to avoid compiling (or attempting to compile) its functions for device, we'd avoid the issue entirely.

That said, it's definitely a compiler issue at its core. Going to try splitting up the file next as @Yurlungur suggested, but maybe we should also think about simplifying it so we don't have two processes both trying to chew on these templates?

@Yurlungur
Collaborator

> That said, it's definitely a compiler issue at its core. Going to try splitting up the file next as @Yurlungur suggested, but maybe we should also think about simplifying it so we don't have two processes both trying to chew on these templates?

If you can figure out a way to simplify that templated code, by all means. I think there's some necessary complexity here, though: we're basically compiling the same function many times for different input/output data types for file IO, so there's simply a lot of code for the compiler to get through. It might be possible to move some of the code out of the lowest-level functions in the header file into implementation files, but I don't see any obvious wins there.

@bprather
Collaborator Author

bprather commented Oct 6, 2023

Just tried to compile with NVHPC/NVC++ 23.9 and... it segfaults. The crash is in cpp2 as you might expect, when processing params.cudafe1.cpp. So, if we can't figure out how to avoid compiling this particular code with CUDA, this might be a good time to bump this up to Nvidia.

@pgrete
Collaborator

pgrete commented Oct 6, 2023

Looks like nvc++ is now officially a supported compiler for Kokkos 4 (https://kokkos.github.io/kokkos-core-wiki/requirements.html#kokkos-4-x), though I'm not sure whether that covers host or device code.
Last time I checked, nvc++ was not supported for device builds.
So if this also happens for host builds, then it's definitely time to raise this with Nvidia.

@bprather
Collaborator Author

bprather commented Dec 1, 2023

This issue has gotten worse (>30-minute compiles with NVHPC 22.11 and above, which might as well be a compile error). It doesn't disappear even if I place

#ifndef __CUDA_ARCH__
...
#endif

around the entire file to prevent device compilation (though it seems to help a bit), so the slowdown is at least partially host-side as long as nvcc is the wrapper doing the processing. (FWIW, guarding on __CUDACC__ instead results in a linker error, so nvcc must be processing the host code too.) Thus, I don't think selectively avoiding compilation will help us here.

It's also starting to be a problem in GCC-based compiles with recent versions (on the now-quaint order of 8-10-minute compiles), and it seems to be the long pole in all Parthenon compiles, even with LLVM-based compilers.

So, I'm bumping this: if anyone has simple modifications to params.cpp that would make it any faster to compile and would be easy to test, I'd be eager to try them out. Otherwise I'll take a look next week, though I think my knowledge of templates is insufficient to produce a quick or light-handed fix; I'd likely default to breaking up all the instantiations into different translation units.
