OpenMP Offload to GPUs
======================
Steven K. Baum
0.1, Oct. 10, 2021: It begins.
:doctype: book
:toc:
:icons:
:numbered!:
== Overview
Codes that are already instrumented with OpenMP pragmas can also be offloaded to GPUs
without further coding.
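For orientation, here is a minimal, hypothetical sketch of what such offloading looks like: a standard OpenMP 4.5 combined `target` construct around an ordinary loop (the compiler flags needed to actually reach a GPU vary by vendor, as the sections below describe).
-----
// saxpy_offload.c: minimal illustrative sketch (not from the references below).
#include <stdio.h>

#define N 1000000

int main(void) {
  static float x[N], y[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Offload the loop to the device; map() controls host<->device copies.
  #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
  for (int i = 0; i < N; i++)
    y[i] = 2.0f * x[i] + y[i];

  printf("y[0] = %f\n", y[0]); // expect 4.000000
  return 0;
}
-----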
*Introduction to OpenMP GPU Offloading* - https://www.olcf.ornl.gov/calendar/introduction-to-openmp-gpu-offloading/[`https://www.olcf.ornl.gov/calendar/introduction-to-openmp-gpu-offloading/`]
*OpenMP Accelerator Support for GPUs* - https://www.openmp.org/updates/openmp-accelerator-support-gpus/[`https://www.openmp.org/updates/openmp-accelerator-support-gpus/`]
*Analysis of OpenMP 4.5 Offloading in Implementations* - https://www.osti.gov/servlets/purl/1648853[`https://www.osti.gov/servlets/purl/1648853`]
*OpenMP on GPUs, First Experiences and Best Practices* (2018) - https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf[`https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf`]
=====
* Always use the `teams` and `distribute` directives to expose all available parallelism
* Aggressively collapse loops to increase available parallelism
* Use the `target data` directive and `map` clauses to reduce data movement between
CPU and GPU (see the sketch after this block)
* Use accelerated libraries whenever possible
* Use OpenMP tasks to go asynchronous and better utilize the whole system
* Use host fallback to generate host and device code
=====
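A hedged sketch of the `target data` item above (illustrative code, not from the presentation): the enclosing data region keeps both arrays resident on the device, so two consecutive kernels trigger only one round of host-device transfers.
-----
// Hypothetical sketch: one target data region enclosing two offloaded kernels,
// so A and B are transferred once rather than once per kernel.
void scale_and_add(float *A, float *B, int n) {
  #pragma omp target data map(to: A[0:n*n]) map(tofrom: B[0:n*n])
  {
    // collapse(2) merges the loop nest into one larger iteration space.
    #pragma omp target teams distribute parallel for collapse(2)
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        B[i*n + j] += A[i*n + j];

    // The second kernel reuses the device copies: no extra transfers.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n * n; i++)
      B[i] *= 2.0f;
  }
}
-----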
*Best Practices for OpenMP on NVIDIA GPUs* - https://www.youtube.com/watch?v=9w_2tj2uD4M[`https://www.youtube.com/watch?v=9w_2tj2uD4M`]
*Is OpenMP 4.5 Target Off-Load Ready for Real Life?* - https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Jost.pdf[`https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Jost.pdf`]
*Improved Automatic GPU Offloading Method* - https://www.tandfonline.com/doi/full/10.1080/17445760.2021.1941010[`https://www.tandfonline.com/doi/full/10.1080/17445760.2021.1941010`]
*Writing A Portable GPU Runtime with OpenMP 5.1* - https://arxiv.org/abs/2106.03219[`https://arxiv.org/abs/2106.03219`]
== GCC
*Offloading Support in GCC* - https://gcc.gnu.org/wiki/Offloading[`https://gcc.gnu.org/wiki/Offloading`]
*Gfortran GPU Offload with OpenMP* - https://fortran-lang.discourse.group/t/gfortran-gpu-offload-with-openmp/314[`https://fortran-lang.discourse.group/t/gfortran-gpu-offload-with-openmp/314`]
GCC 11 supports OpenMP offloading to Intel MIC, Nvidia PTX and AMD GCN targets.
=====
The main option to control offloading is:
* `-foffload=<targets>=<options>`
By default, GCC builds offload images for all offload targets specified at configure time, with non-target-specific options passed to the host compiler. (However, in most Linux distributions offloading is disabled by default, i.e., target regions are executed on the host, and `-foffload=<targets>` is required to enable offloading to accelerators.) The `-foffload` option controls the offload targets and the options passed to them. It can be used in a few ways:
* `-foffload=disable` - Tells GCC to disable offload support. Target regions will be run in host fallback mode.
* `-foffload=<targets>` - Tells GCC to build offload images for `<targets>`. They will be built with non-target-specific options passed to the host compiler.
* `-foffload=<options>` - Tells GCC to build offload images for all targets specified in configure. They will be built with non-target-specific options passed to the host compiler plus `<options>`.
* `-foffload=<targets>=<options>` - Tells GCC to build offload images for `<targets>`. They will be built with non-target-specific options passed to the host compiler plus `<options>`.
`<targets>` are separated by commas. Several `<options>` can be specified by separating them with spaces. Options specified by `-foffload` are appended to the end of the option set, so in case of conflicts they take priority. The `-foffload` flag can be specified several times; this is necessary to specify different `<options>` for different `<targets>`.
*Note*: You may need to specify `-foffload=-lm` (and, for Fortran, `-foffload=-lgfortran`) if the offloaded code uses math functions or Fortran library procedures.
*Note*: If you use atomics directly or indirectly, you may need to specify `-foffload=-latomic` or, if only one target needs it, e.g., `-foffload=nvptx-none=-latomic`.
For AMD GCN devices, you must additionally specify the GPU to be used via `-march=<name>`, where `<name>` is either `carrizo` or `fiji` (both third generation) or one of the fifth-generation Vega GPUs (`gfx900` or `gfx906`). To apply this setting only to the AMD GCN offloading target, and not to the host (`-march=…`) or to all offloading targets (as with `-foffload=-march=…`), use `-foffload=amdgcn-amdhsa=<options>`, for instance `-foffload=amdgcn-amdhsa="-march=gfx906"`. [NOTE: the target triplet is set when building the compiler and may differ between vendors; it can also be, e.g., `amdgcn-unknown-amdhsa`.]
=====
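Assembling the options above, a hedged example of GCC compile lines for offloading (assuming a GCC build configured with the corresponding offload targets; the source file name is illustrative):
-----
# Offload to an NVIDIA GPU, linking the device math library as noted above.
gcc -O2 -fopenmp -foffload=nvptx-none -foffload=-lm saxpy_offload.c -o saxpy

# Offload to an AMD Vega GPU, passing -march only to the GCN target.
gcc -O2 -fopenmp -foffload=amdgcn-amdhsa="-march=gfx906" saxpy_offload.c -o saxpy
-----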
== Intel
*Get Started with OpenMP Offload to GPU* - https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-cpp-fortran-compiler-openmp/top.html[`https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-cpp-fortran-compiler-openmp/top.html`]
=====
The OpenMP* Offload to GPU feature of the Intel® oneAPI DPC++/C++ Compiler and the Intel® Fortran Compiler
(Beta) compiles OpenMP source files for a wide range of accelerators. Only the `icx`
and `ifx` compilers support the OpenMP Offload feature.
...
Intel supports two new options:
* `-qopenmp`
* `-fopenmp-targets=spir64`
which enable OpenMP and offload execution on CPU and GPU. The `-qopenmp`
option enables a middle-end that supports the transformation of OpenMP
in LLVM* (but not in the Clang* front-end). The `-fopenmp-targets=spir64`
option directs the compiler to generate an x86 + SPIR64 fat binary,
i.e., a host binary that also contains the GPU device binary.
=====
=== Compiling and Running an Example
The following code is instrumented for OpenMP offloading via a combined `target` pragma:
-----
// matmul_offload.cpp: Matrix Multiplication Example using OpenMP Offloading
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define MAX 128

int A[MAX][MAX], B[MAX][MAX], C[MAX][MAX], C_SERIAL[MAX][MAX];

typedef int BOOL;
typedef int TYPE;

// Compare the offloaded result against a serially computed reference.
BOOL check_result(TYPE *actual, TYPE *expected, unsigned n) {
  for (unsigned i = 0; i < n; i++) {
    if (actual[i] != expected[i]) {
      printf("Value mismatch at index = %u. Expected: %d"
             ", Actual: %d.\n", i, expected[i], actual[i]);
      return 0;
    }
  }
  return 1;
}

void __attribute__ ((noinline)) Compute()
{
  // The combined construct must be followed directly by the loop nest,
  // not by a compound statement in braces.
  #pragma omp target teams distribute parallel for map(to: A, B) map(tofrom: C) \
          thread_limit(128)
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C[i][j] += A[i][k] * B[k][j];
}

int main() {
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++) {
      A[i][j] = i + j - 1;
      B[i][j] = i - j + 1;
    }

  // Serial reference for verification.
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C_SERIAL[i][j] += A[i][k] * B[k][j];

  Compute();

  if (!check_result((int*) &C[0][0], (int*) &C_SERIAL[0][0], MAX * MAX)) {
    printf("FAILED\n");
    return 1;
  }
  printf("PASSED\n");
  return 0;
}
-----
==== Compiling the Code
This code `matmul_offload.cpp` is compiled for GPU offloading via:
-----
icpx -qopenmp -fopenmp-targets=spir64 matmul_offload.cpp -o matmul
-----
C code is compiled with `icx`, C++ code with `icpx`, and Fortran code with `ifx`.
==== Set the Environment Variable
The `OMP_TARGET_OFFLOAD` environment variable can be set to `MANDATORY` to force offloading to the GPU (and fail if no device is available):
-----
export OMP_TARGET_OFFLOAD=MANDATORY
-----
==== Run the Compiled Code
The compiled binary `matmul` is run via:
-----
./matmul
PASSED
-----
and will produce the output `PASSED` if successful.
=== Integration of GPU-Optimized LIBM Functions
The compilers provide a way to choose some GPU-optimized math functions as an alternative
to the standard math libraries.
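As a hedged illustration (not from the Intel documentation): math functions called inside a target region are resolved against the compiler's device-side math library, which is where such GPU-optimized variants apply.
-----
// Hypothetical sketch: libm calls inside a target region are resolved
// against the compiler's device math library rather than the host libm.
#include <math.h>

void device_sqrt(float *x, int n) {
  #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
  for (int i = 0; i < n; i++)
    x[i] = sqrtf(fabsf(x[i])); // device-side math calls
}
-----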
*OpenMP GPU Offload Introduction* (2021) - https://www.hlrn.de/doc/download/attachments/13730484/Steyer_OpenMP_GPU_Offload.pdf[`https://www.hlrn.de/doc/download/attachments/13730484/Steyer_OpenMP_GPU_Offload.pdf`]
=====
* OpenMP offload is supported by the Intel® C++ Compiler and Intel® Fortran Compiler
as part of the Intel® oneAPI HPC Toolkit
* Use the `target` directive to offload
* Use the `map` clause with the `target`, `target data`, and `target enter/exit data`
directives to improve data transfer efficiency
* Use the `teams`/`distribute` directives to fully utilize multiple GPU subslices
* Use the `parallel`/`for`/`do` directives to use the threads within a GPU subslice
* Use the `simd` directive for optimal SIMD execution on GPU execution units
(several of these are combined in the sketch after this block)
=====
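A hedged sketch combining several of the directives listed above (illustrative only): the unstructured `target enter data`/`target exit data` pair keeps the array resident between kernels, and the combined construct layers `teams`, `distribute`, `parallel for`, and `simd`.
-----
// Hypothetical sketch: unstructured data mapping plus a combined construct.
void process(float *x, int n) {
  #pragma omp target enter data map(to: x[0:n])    // copy once, keep resident

  // teams/distribute spread work across subslices; parallel for uses the
  // threads within each subslice; simd vectorizes across execution units.
  #pragma omp target teams distribute parallel for simd
  for (int i = 0; i < n; i++)
    x[i] = 2.0f * x[i] + 1.0f;

  #pragma omp target exit data map(from: x[0:n])   // copy back and release
}
-----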
*Three Quick, Practical Examples of OpenMP Offload to GPUs* - https://www.intel.com/content/www/us/en/developer/videos/three-quick-practical-examples-openmp-offload-gpus.html#gs.drs907[`https://www.intel.com/content/www/us/en/developer/videos/three-quick-practical-examples-openmp-offload-gpus.html#gs.drs907`]
*Profiling an OpenMP Offload Application Running on a GPU* - https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/profiling-openmp-offload-application.html[`https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/profiling-openmp-offload-application.html`]
== NVIDIA
*HPC Compilers Users Guide: Using OpenMP* - https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-use[`https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#openmp-use`]
*Thinking OpenMP with NVIDIA HPC Compilers* - https://www.nas.nasa.gov/assets/nas/pdf/ams/2021/AMS_20210504_Ozen.pdf[`https://www.nas.nasa.gov/assets/nas/pdf/ams/2021/AMS_20210504_Ozen.pdf`]
The NVIDIA `nvc++`, `nvc` and `nvfortran` compilers can enable OpenMP and target GPUs and
multicore CPUs via:
-----
nvc -mp=gpu -gpu=[target] -Minfo=mp
-----
The `NVCOMPILER_ACC_NOTIFY` environment variable can also be set to print runtime diagnostics (a bitmask: 1 reports kernel launches, 2 reports data transfers, 3 reports both):
-----
export NVCOMPILER_ACC_NOTIFY=3
-----
*NVIDIA HPC SDK for C/C++/Fortran* - https://www.alcf.anl.gov/support-center/theta-gpu-nodes/openmp-thetagpu[`https://www.alcf.anl.gov/support-center/theta-gpu-nodes/openmp-thetagpu`]
-----
module load nvhpc-sdk/nvhpc/21.3
nvfortran -mp=gpu -gpu=cc80 your_source.f90
-----
*Easybuild Easyconfig* - https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/n/NVHPC[`https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/n/NVHPC`]
== Clang/LLVM
*LLVM/OpenMP Support and FAQ* - https://openmp.llvm.org/SupportAndFAQ.html[`https://openmp.llvm.org/SupportAndFAQ.html`]
*Building LLVM/Clang with OpenMP Offloading to NVIDIA GPUs* - https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs[`https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs`]
https://github.com/pc2/OMP-Offloading[`https://github.com/pc2/OMP-Offloading`]
*Compiling CUDA with Clang* - https://llvm.org/docs/CompileCudaWithLLVM.html[`https://llvm.org/docs/CompileCudaWithLLVM.html`]
*LLVM Clang for C/C++* - https://www.alcf.anl.gov/support-center/theta-gpu-nodes/openmp-thetagpu[`https://www.alcf.anl.gov/support-center/theta-gpu-nodes/openmp-thetagpu`]
-----
module load llvm/release-12.0.0
clang++ -fopenmp -fopenmp-targets=nvptx64 your_source.cpp
-----
*SOLLVE: OpenMP for HPC and Exascale* - https://www.exascaleproject.org/highlight/sollve-openmp-for-hpc-and-exascale/[`https://www.exascaleproject.org/highlight/sollve-openmp-for-hpc-and-exascale/`]
=====
The SOLLVE project is deeply engaged in efforts to provide a high-quality and complete OpenMP implementation in the LLVM Compiler Infrastructure Project. LLVM, an open-source collection of compiler and toolchain technologies, serves as a test bed for proposed OpenMP extensions (e.g., the interop directive in OpenMP 5.1) and as a vehicle to provide production-quality implementations of OpenMP features that could be exploited in vendor compilers built with LLVM technology.
=====
*FOTV: A Generic Device Offloading Framework for OpenMP* - https://link.springer.com/chapter/10.1007/978-3-030-85262-7_12[`https://link.springer.com/chapter/10.1007/978-3-030-85262-7_12`]
=====
We have extended the last development version (v13) of Clang to support the FOTV device.
=====
*DiscoPoP* - https://github.com/discopop-project/discopop[`https://github.com/discopop-project/discopop`]
=====
DiscoPoP is an open-source tool that helps software developers parallelize their programs with threads. It is a joint project of Technical University of Darmstadt and Iowa State University.
In a nutshell, DiscoPoP performs the following steps:
* detect parts of the code (computational units or CUs) with little to no internal parallelization potential,
* find data dependences among them,
* identify parallel patterns that can be used to parallelize a code region,
* and finally suggest corresponding OpenMP parallelization constructs and clauses to programmers.
DiscoPoP is built on top of LLVM. Therefore, DiscoPoP can perform the above-mentioned steps on any source code which can be transferred into the LLVM IR.
=====
*Easybuild Easyconfig* - https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/c/Clang[`https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/c/Clang`]
== AOCC (AMD)
https://developer.amd.com/amd-aocc/[`https://developer.amd.com/amd-aocc/`]
=====
The AOCC system is a high performance, production quality code generation tool. The AOCC environment provides various options to developers while building and optimizing C, C++, and Fortran applications targeting 32-bit and 64-bit Linux® platforms. It simplifies and accelerates the development and tuning for x86 applications.
The features include:
* Leveraging the robust LLVM Compiler Development Infrastructure
* Optimized throughout for AMD EPYC™, Ryzen™, and Ryzen™ Threadripper™ processors
* Offering Flang as the default Fortran front-end supporting F2008 and Real 128 features
* Support of Spack for flexible package management, including use by AMD optimized application recipes
=====
*Install Guide* - https://developer.amd.com/wp-content/resources/AOCC_57223_Install_Guide_Rev_3.1.pdf[`https://developer.amd.com/wp-content/resources/AOCC_57223_Install_Guide_Rev_3.1.pdf`]
*User Guide* - https://developer.amd.com/wp-content/resources/AOCC_57222_User_Guide_Rev_3.1.pdf[`https://developer.amd.com/wp-content/resources/AOCC_57222_User_Guide_Rev_3.1.pdf`]
*AMD Clang Guide* - https://developer.amd.com/wp-content/resources/AOCC_57224_Clang_C_C++_Compiler_Rev_3.1.pdf[`https://developer.amd.com/wp-content/resources/AOCC_57224_Clang_C_C++_Compiler_Rev_3.1.pdf`]
*AMD Flang Guide* - https://developer.amd.com/wp-content/resources/AOCC_57225_Flang_Fortran_Compiler_Rev_3.1.pdf[`https://developer.amd.com/wp-content/resources/AOCC_57225_Flang_Fortran_Compiler_Rev_3.1.pdf`]
*Easybuild Easyconfig* - https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/a/AOCC[`https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/a/AOCC`]
== AOMP (AMD)
https://github.com/ROCm-Developer-Tools/aomp[`https://github.com/ROCm-Developer-Tools/aomp`]
=====
AOMP is a scripted build of LLVM and supporting software. It has support for OpenMP target offload on AMD GPUs. Since AOMP is a clang/llvm compiler, it also supports GPU offloading with HIP, CUDA, and OpenCL.
=====
*LLVM Support for GPU Offloading for AMD Radeon GPUs* - https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-LLVM-OpenMP-Merged[`https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-LLVM-OpenMP-Merged`]
=====
AMD has been maintaining AOMP as their downstream of LLVM Clang for supporting OpenMP offloading to Radeon hardware.
...
At the moment it looks like AOMP still has some patches not yet upstreamed into LLVM but great seeing definite progress in this area. We'll see though if AOMP is still going to be maintained moving forward for new/premature code for AMD support prior to it being mature enough for upstreaming or if AOMP will ultimately be phased out from their ROCm compute stack.
...
Similarly, GCC 10 added OpenMP and OpenACC offloading to AMD Radeon GPUs. However, in the case of the GNU Compiler Collection, the support and the range of compatible GPUs are not nearly as far along as with AMDGPU LLVM.
=====
*Easybuild Easyconfig* - https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/a/AOMP[`https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/a/AOMP`]
=====
The `AOMP-13.0-2-gcccuda-2020a.eb` easyconfig file was modified with the addition of:
`cuda_compute_capabilities = ['7.5', '8.0', '8.6']`
to account for the CUDA compute capabilities of the NVIDIA A100, RTX 6000 and T4 GPUs on grace, which are,
respectively, 8.0, 8.6 and 7.5.
=====
== Mercurium
https://github.com/bsc-pm/mcxx[`https://github.com/bsc-pm/mcxx`]
=====
Mercurium is a C/C++/Fortran source-to-source compilation infrastructure aimed at fast prototyping developed by the Programming Models group at the Barcelona Supercomputing Center.
Mercurium is used, together with the Nanos++ Runtime Library, to implement the OmpSs programming model. Both tools also provide an implementation of OpenMP 3.1. More recently, Mercurium has also been used to implement the OmpSs-2 programming model together with the Nanos6 Runtime Library.
=====