%%
%% AcceProgModel.tex
%% Login : <hoang-trong@hoang-trong-laptop>
%% Started on Tue Nov 10 10:18:34 2009 Hoang-Trong Minh Tuan
%% $Id$
%%
%% Copyright (C) 2009 Hoang-Trong Minh Tuan
%%
\chapter{Accelerator Programming Model}
\label{chap:accel-progr-model}
References:
\begin{enumerate}
\item \url{http://www.pgroup.com/lit/articles/insider/v1n1a1.htm}
\end{enumerate}
\section{Introduction: Accelerator Programming Model (APM)}
\label{sec:introduction-1}
With advances in computer hardware, especially special-purpose
high-throughput computing devices such as GPUs, a number of different
parallel programming models have emerged. In this chapter, we discuss the {\bf
Accelerator Programming Model} (APM) developed by PGI for the Fortran programming
language (available since version 9.0-4; version 10.1 is the first to support the full
v1.0 specification). The special-purpose computing device is called an {\bf
Accelerator} (Sect.\ref{sec:Accelerator}); the CPU can offload data to it and
have it carry out the computation.
\subsection{What is an Accelerator?}
\label{sec:Accelerator}
In the context of APM, an Accelerator is defined as a device to which
the CPU can offload work for processing.
\textcolor{red}{In essence, an Accelerator is typically a
{\it co-processor} to the host (the CPU); it has its own instruction
set, and usually (but not always) its own memory}.
So, an Accelerator can be another CPU, either homogeneous or
heterogeneous, or a GPU. Nowadays, an Accelerator fits on a single chip,
like a CPU, e.g. the Sony/Toshiba/IBM Cell Broadband Engine
or GPUs. The current APM (v1.1) supports CUDA-capable NVIDIA cards only.
\subsection{What is a kernel?}
The code that runs on the Accelerator is called the {\bf kernel}.
In the current implementation of APM, parallelism is exploited across
multiple iterations of a kernel, not across multiple kernels.
\subsection{How does APM work?}
PGI APM uses a set of OpenMP-like directives to direct the compilation process,
letting the compiler know which parts of the code are to be compiled to run
on the accelerator and which to run on the CPU. This is a precursor
to OpenACC (Chap.\ref{chap:openACC}), which is an open standard.
By using directives, APM allows programmers to maintain a single source code
which can be compiled to run on either the CPU or the GPU. In essence, APM defines
directives with which programmers mark which code runs on the CPU and which
runs on the GPU. Via a preprocessing step, the appropriate code is extracted
and compiled for the CPU (if the user wants the program to run on the CPU), or
for the GPU (if the user wants the program to run on the GPU).
NOTE: APM does not make parallel programming easy, but it does reduce
the cost of making a program parallel, i.e. we don't have to be heroic
programming experts.
\subsection{Hardware/Software requirements}
\label{sec:hardw-requ-1}
Currently, APM v1.0 supports x64 systems only, with specific OSs (check
the documentation). There are further restrictions when you program;
for now, the hardware and software requirements are:
\begin{enumerate}
\item A 64-bit x64 system (single- or multicore) with a Linux
  distribution supported by both PGI and NVIDIA, e.g. RedHat, OpenSUSE,
  Fedora, Ubuntu, SLES.
\item A CUDA-capable card, e.g. NVIDIA Tesla.
\item The appropriate CUDA software (CUDA driver, CUDA
  toolkit, CUDA SDK) and the PGI compiler installed.
\end{enumerate}
\subsection{Limitations}
\label{sec:limitations}
Tesla second generation (Tesla with Compute Capability 1.3) has some
limitations\footnote{\url{http://www.pgroup.com/resources/accel.htm}}.
\begin{enumerate}
\item Function calls inside the kernel are not supported, except for
  {\it inlined} functions.
\item Rounding modes and some operations are not well supported.
% (square root, exponential, logarithm, transcendental functions).
  Here is the list of supported intrinsic functions:
\begin{verbatim}
ABS, ACOS, AINT (truncation to the whole number),
ANINT (nearest whole number), ASIN, ATAN, ATAN2,
COS, COSH, DBLE (convert to double precision),
DPROD(double precision real product), EXP,
IAND (bit-by-bit logical AND), IEOR (bit-by-bit XOR),
INT, IOR, LOG10, MAX, MIN, MOD, NINT, NOT, REAL,
SIGN, SIN, SQRT, TAN, TANH
\end{verbatim}
\end{enumerate}
{\bf Something about Compute Capability}:
\label{sec:compute-capability}
G80 has CC 1.0, G8x/G92 have
CC 1.1, and GT200 has CC 1.3 (which supports FP64 ALUs). There are no
devices with CC 1.2; it was probably intentionally reserved for future
low-end parts that do not support FP64 ALUs. CC is all about the
hardware's capability. The CUDA version, on the other hand, is about the
software (including the driver, toolkit...). We may have CUDA version 2.0
or 2.3. The latest one is 3.0.
A good reference for all CUDA programmers is
here\footnote{\url{http://www.beyond3d.com/content/reviews/51}}.
\subsection{How Accelerate works}
\label{sec:how-accelerate-works}
The compiler translates your code into a portable intermediate format. This
format is the same across different applications (graphics,
games...). The code is then dynamically translated and reoptimized at
runtime by the drivers supplied by the vendor for the particular model
of GPU.
The list of Accelerate-enabled GPU cards is available
\href{http://www.nvidia.com/object/cuda_learn_products.html}{here}. Currently,
with APM v1.0, the Accelerator must be a CUDA-capable GPU card (or cards).
The host must coordinate the execution:
\begin{enumerate}
\item allocate memory on the accelerator
\item initiate data transfer
\item send the kernel code to the accelerator
\item pass kernel arguments
\item queue the kernel
\item wait for the completion of the kernel
\item transfer the necessary results back to the host
\item deallocate the memory
\end{enumerate}
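All of these steps are generated by the compiler when a loop is enclosed
in an accelerator region; a minimal sketch (the array names are
hypothetical):
\begin{lstlisting}
!$acc region
  ! memory allocation, transfers, kernel launch, and cleanup
  ! are all handled behind this one directive pair
  do i = 1, n
     b(i) = 2.0 * a(i)
  enddo
!$acc end region
\end{lstlisting}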
\section{Enable APM support}
\label{sec:enable-accelerator}
\subsection{In code}
\label{sec:code}
These are the accelerator runtime libraries:
\begin{enumerate}
\item In C: include \verb!accel.h!
\item In Fortran: interface declarations are provided in the file
  \verb!accel_lib.h! and in a Fortran module named \verb!accel_lib!
\end{enumerate}
% \section{Makefile}
% \label{sec:makefile}
In your code, you need to use the \verb!accel_lib! module:
\begin{lstlisting}
#ifdef _ACCEL
  use accel_lib
#endif
...
#ifdef _ACCEL
  ! set the device for each process
  numdevices = acc_get_num_devices(acc_device_nvidia)
  mydevice = mod(myRank,numdevices)
  call acc_set_device_num(mydevice,acc_device_nvidia)
#endif
\end{lstlisting}
To use preprocessor directives (\verb!#ifdef...#endif!) in Fortran
code, you need to compile the program with the \verb!-Mpreprocess!
compiler option.
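For example, a hedged invocation combining preprocessing with
accelerator code generation (the source file name is hypothetical):
\begin{lstlisting}
pgf95 -Mpreprocess -ta=nvidia -Minfo=accel main.f90 -o main
\end{lstlisting}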
\subsection{In linking + compiling}
\label{sec:linking-+-compiling}
If your code has APM directives and you want it compiled
into a PGI Unified Binary for the accelerator, link it against the accelerator
libraries using the \verb!-ta! flag. Currently, only NVIDIA
GPU cards are supported.
\begin{lstlisting}
pgf95 -ta=nvidia -Minfo=accel
\end{lstlisting}
There are several other options worth knowing to tune
Accelerate to your needs. The best way is to play with them and
test the performance of the compiled program.
\begin{itemize}
\item \verb!-ta=nvidia[,suboption],host!, with suboption for NVIDIA
CUDA-capable GPU can be either
\begin{enumerate}
\item \verb!analysis!: analysis only, no code generation
\item \verb!cc10!, \verb!cc11!, \verb!cc13!, (since PGI 10.4)
\verb!cc20!: select the appropriate
\hyperref[sec:compute-capability]{compute capability}.
\item \verb!fastmath!: use routines from the fast math library
(faster, yet less precision)
\item \verb!keepgpu!: keep the kernel source file
\item \verb!keepptx!: keep the portable assembly file (.PTX) for the
GPU code.
\item \verb!maxregcount:n! (maximum number of registers to be used),
system managed by default.
\item \verb!mul24! (use only the low 24 bits of the two integers X,Y for
  multiplication, \verb!mul24(X,Y)!; this can save time if X,Y in
  the kernel are in the range $[-2^{23},2^{23}-1]$)
\item \verb!nofma! (no use of fused-multiply-add instruction - FMA
can increase the precision of the computation)
\item \verb!time! (tells the compiler to link in a timer
  library, so that when the program runs, it prints out simple timing
  information for profiling the accelerated kernel)
  NOTE: Until PGI Fortran 10.1, you need to either (1) link to
  ``/opt/pgi/linux86-64/10.1/lib/kvprint.o'', or (2) use ``pgf90 -Mmpi''
\end{enumerate}
\item \verb!-fast! : enable a set of common optimizations for the host
  (CPU) code.
\item \verb!-O3!: for optimized code.
\item \verb!-tp=...! : target the binary code to a specific
  processor, e.g. for AMD Opteron revision E,
  \verb!-tp=k8-64e!. As APM can detect this automatically, use this
  only when you want the compiled program to run on a different
  machine with a different architecture.
\item \verb!-Minfo=accel!: display compilation
  information for Accelerate. You can also limit the information
  displayed by using ``-Minfo=...'' where ... can be ``loop'', ...
\item \verb!-Minfo! is equivalent to
\begin{lstlisting}
-Minfo=accel,inline,ipa,loop,lre,mp,opt,par,unified,vect
\end{lstlisting}
\end{itemize}
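A hedged example combining several of the options above (the file name
is hypothetical):
\begin{lstlisting}
pgf95 -fast -ta=nvidia,cc13,fastmath,time -Minfo=accel myprog.f90 -o myprog
\end{lstlisting}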
\subsection{Tips and Tricks}
\label{sec:tips-tricks}
\begin{enumerate}
\item If the NVIDIA GPU is off, the first accelerator region
  takes 0.5 to 1.5 seconds to warm up the GPU (from a powered-off
  state). To avoid this, run the \verb.pgcudainit. program in the
  background.
\item You can generate a single binary file that includes two versions
  (one to run on the CPU, the other on the GPU) by compiling with the
  option \verb!-ta=nvidia,host!. Then you can use \verb.ACC_DEVICE. to
  specify whether the compiled program uses the GPU or the CPU (see below).
\item If the compiled program contains two versions of the code, you
can control whether to use GPU or CPU by setting the environment
variable \verb.ACC_DEVICE. to {\bf NVIDIA} (run on GPU) or
{\bf HOST} (run on CPU). NOTE: NVIDIA or HOST is case insensitive.
\item If the program uses the GPU and your machine has more than one
  NVIDIA CUDA-capable GPU, you can select which device to use via the
  environment variable \verb!ACC_DEVICE_NUM! (its value is a
  non-negative integer)
\item Regardless of the environment variable settings described above,
  the programmer can explicitly select the device in the code using the
  following functions:
\begin{enumerate}
\item \verb!acc_get_device!
\item \verb!acc_get_num_devices!
\item \verb!acc_init!
\item \verb!acc_on_device!
\item \verb!acc_set_device!
\item \verb!acc_set_device_num!
\item \verb!acc_get_device_num! (new in PGI 10.5): return the ID of
the device being used to execute an accelerator region.
\end{enumerate}
\item After finishing with the Accelerator, the programmer should shut down
  the connection and free up memory using \verb!acc_shutdown!
\item To print out a message whenever the program launches a kernel,
we set the environment variable \verb.ACC_NOTIFY. to a non-zero
integer value.
\item It is recommended to fully specify the dimension of the array in
all APM clauses.
\end{enumerate}
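For example, to run the GPU version of a unified binary on a chosen
device and report each kernel launch (the program name is hypothetical):
\begin{lstlisting}
export ACC_DEVICE=NVIDIA   # or HOST to run the CPU version
export ACC_DEVICE_NUM=1    # select device number 1
export ACC_NOTIFY=1        # print a message per kernel launch
./myprog
\end{lstlisting}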
\section{Fortran version}
\label{sec:fortran-version}
\subsection{9.04}
\label{sec:9.04}
At version 9, only a few directives have been implemented.
\begin{enumerate}
\item \verb.!$acc region.
\end{enumerate}
\subsection{10.0}
\label{sec:10.0}
At version 10, many other features have been added
\begin{enumerate}
\item \verb.!$acc data region.
\item \verb.!$acc update.
\item loop scheduling: unroll, shortloop, cache
\item runtime routines: \verb.acc_shutdown., \verb.acc_on_device.
\end{enumerate}
\subsection{10.1}
\label{sec:10.1}
At version 10.1, the full APM v.1.0 standard is now supported.
\subsection{10.3}
\label{sec:10.3}
Supports the APM v1.1 standard.
\subsection{10.4}
\label{sec:10.4}
Supports the APM v1.2 standard.
\section{Directives}
\label{sec:directives-1}
Similar to OpenMP, the PGI Accelerator Programming Model (APM) contains a
set of directives. The model can be used with either the Fortran or C
language. Directives in Fortran are {\it sentinels} (stylized comments)
while those in C follow the \verb!#pragma! syntax.
\textcolor{red}{In this chapter, we will discuss directives for Fortran
only.}
Directives mark the regions of code or program units to be converted
to {\bf kernels} which run on the Accelerator, e.g. a GPU or CPU. Using
directives, APM provides {\it fine-grained
control}\footnote{control
the execution of loops}
over the mapping of loops, allocation of memory, and optimization for
the GPU memory hierarchy.
\textcolor{red}{It's important to know that not all regions of code
can be converted to kernels.}
Such restrictions are described in subsequent sections.
To be parallelized to run on an accelerator, a segment of code
(containing loops) must be surrounded by a pair of directives,
except when an
\hyperref[sec:define-decl-data]{implicit data region} is used
(\verb.!$acc.)\footnote{This is not available until PGI Fortran 10.1}.
Normally, the code contains one or more DO loops which can contain
other DO loops; each loop body is intended to map onto a kernel and
the loop iterations map to the kernel domain (i.e. threads). In other
words, each thread executes a single loop iteration. Thus, the
result must not depend upon the order of execution of the iterations.
If the source file is free-form, you can use \verb.!$acc.. In
fixed-form source, you can use three different sentinels:
\verb.!$acc., \verb.c$acc., or \verb!*$acc!. Another important point is that
if the source file is in {\bf fixed-form format}, the sentinel must
begin in columns 1--5.
\begin{verbatim}
!$acc directive-name [clause [[,] clause] ... &
!$acc continuation-from-the-previous-line
the code to be parallelized
!$acc end directive-name
\end{verbatim}
Each directive has exactly one directive-name and, optionally, one or
more clauses. The directive-name can be one of
\begin{enumerate}
\item \hyperref[sec:do-loop-mapping]{do}: tells how to handle
  the single DO loop coming right after the directive.
\item \hyperref[sec:define-region]{region} : an accelerator compute
  region to be parallelized.
% \verb!region!
\item \verb!data region!: a region in which data is kept resident on the
  device (Sect.\ref{sec:define-data-regions}).
\item \verb!region do!: the combined directive
  (Sect.\ref{sec:define-comb-direct}).
\end{enumerate}
In Fortran, by default, the directives are case-insensitive. If
case sensitivity is enabled, all keywords must be in lower case. Also,
directives must not be embedded within a continued statement.
{\bf TERMS}:
\begin{enumerate}
\item A structured block is a statement or a compound statement with a
  single entry at the top and a single exit at the bottom. In C, a
  structured block is enclosed by \{ and \}.
\end{enumerate}
% There are three types:
% \begin{enumerate}
% \item A region to be parallelized
% \item A
% \end{enumerate}
\subsection{DO loop mapping}
\label{sec:do-loop-mapping}
DO loop mapping is done via the \verb.!$acc do. directive, which must
be within an Accelerator compute region
(Sect.~\ref{sec:define-region}). Otherwise, it is ignored and the loop
runs on the host.
When a single DO loop is inside a compute region, the APM model
provides MIMD parallelism ({\it doparallel}) and SIMD parallelism
({\it dovector}). Thus, it allows the programmer to map a single DO
loop to parallel mode or vector mode, or to specify that a loop should be
strip-mined\footnote{strip-mining, or loop sectioning, is a technique
  that splits a single loop into a nested loop. The number of
  iterations of the inner loop is known as the loop strip's length or
  width}
(using either {\it unroll} or {\it ...}). Or you can specify that the loop
be executed sequentially.
NOTE: The DO loop must follow immediately after the directive. The syntax
is
% \begin{lstlisting}
% !$acc do [clause [, clause]...]
% DO loop
% \end{lstlisting}
\begin{lstlisting}
!$acc do [clause [, clause] ...]
DO i=start, end [,step]
....
END DO
\end{lstlisting}
where {\it clause} can be one of (note: [ ] means optional)
\begin{enumerate}
\item \verb!host! [(width)] : execute the loop sequentially on the
CPU. If {\it width} is used, the loop is strip-mined.
Example\footnote{\url{http://software.intel.com/en-us/articles/strip-mining-to-optimize-memory-use-on-32-bit-intel-architecture/}}:
\begin{lstlisting}
for (i=0; i<Num; i++) {
Transform(v[i]);
}
for (i=0; i<Num; i++) {
Lighting(v[i]);
}
\end{lstlisting}
Then, after strip-mining with width = \verb!strip_size!:
\begin{lstlisting}
for (i=0; i < Num; i+=strip_size) {
for (j=i; j < min(Num, i+strip_size); j++) {
Transform(v[j]);
}
for (j=i; j < min(Num, i+strip_size); j++) {
Lighting(v[j]);
}
}
\end{lstlisting}
\item \verb!seq! [(width)]: execute the loop sequentially on the
  Accelerator; if {\it width} is used, the loop is strip-mined.
\item \verb!vector! [(width)]: strip-mine the loop to run on the
  Accelerator in vector mode (synchronous parallelism) with the best
  possible \verb!strip_size! value, which is
  accelerator-dependent. Normally, it's the maximum possible
  value. Some accelerators limit this value, e.g. to a maximum of 256.
\begin{lstlisting}
!$acc do vector
do i = 1,n
!!! the compiler changes this to
do is = 1,n,256
  !$acc do vector
  do i = is,min(is+255,n)
\end{lstlisting}
You may also specify the strip length you want via \verb!width!.
\item \verb!parallel! [(width)]: the loop is executed in parallel mode
  on the Accelerator (\verb!doall! parallelism) with the maximum number of
  parallel iterations allowed. You can also limit the number of iterations
  run in parallel via the width option.
\begin{lstlisting}
!$acc do parallel
do i = 1,n
\end{lstlisting}
\item \verb!unroll! (width) : similar to strip-mining, yet no nested
  loop is created
\begin{lstlisting}
DO i = 1,n
  a(i) = b(i) + c(i)
ENDDO
\end{lstlisting}
becomes
\begin{lstlisting}
DO i = 1, n, 2
  a(i) = b(i) + c(i)
  a(i+1) = b(i+1) + c(i+1)
ENDDO
\end{lstlisting}
NOTE: If there are other clauses as well, all but one must have a
\verb!width! argument. Some require the value of the \verb!width!
argument to be a power of 2 or a multiple of a power of 2.
\item \verb!independent! : if we know the loop iterations are
  independent, we can explicitly tell the compiler to generate
  parallel code.
  NOTE: It is a programming error if any iteration writes to a variable
  or array element that any other iteration also writes or
  reads. E.g. this is an ERROR:
\begin{lstlisting}
!$acc do independent
DO i = 2,n
  a(i) = a(i) + a(i-1)
ENDDO
\end{lstlisting}
\item \verb!kernel! : tells that this loop is the body of the
  computational kernel. Any nested loop will be executed sequentially
  on the accelerator.
  NOTE: Nested loops cannot have any other \verb.!$acc. directives.
\item \verb!shortloop! : use this clause if the number of iterations is
  known to be less than the maximum value supported by the \verb!parallel!
  and \verb!vector! clauses.
\item \verb!private! (list) : tells which variables, arrays, or
  subarrays are to be allocated on the device for each iteration of the
  loop. Though the dimensions can be detected by the compiler, it's
  recommended to state them explicitly.
  NOTE: The compiler may pad dimensions to improve memory alignment and
  program performance.
\item \verb!cache! (list): even though CUDA decides which variables to
  put in the cache, the \verb!cache! clause can help the compiler
  choose what data to keep in the fast memory during the loop.
  NOTE: This does not guarantee that all the variables, arrays, or subarrays
  in the list will be cached.
\end{enumerate}
You can also use more than one clause:
\begin{lstlisting}
!$acc do host(16), parallel
do i = 1,n
!!! the compiler strip-mines the loop into 16 host iterations,
!!! with the parallel clause applied to the inner loop
ns = ceil(n/16)
!$acc do host
do is = 1, n, ns
  !$acc do parallel
  do i = is, min(n,is+ns-1)
\end{lstlisting}
Example:
\begin{lstlisting}
!$acc region copyin(s(1:n,1:m)) copyout(r)
!$acc do parallel,vector(8)
do j = 2,m-1
!$acc do parallel,vector(8)
do i = 2,n-1
r(i-1,j-1) = 0.25*(s(i-1,j) + s(i+1,j) &
+ s(i,j-1) + s(i,j+1))
enddo
enddo
!$acc end region
\end{lstlisting}
\begin{lstlisting}
!$acc region
!$acc do parallel
do j = 2,m-1
!$acc do vector
do i = 2,n-1
r(i-1,j-1) = 0.25*(s(i-1,j) + s(i+1,j) &
+ s(i,j-1) + s(i,j+1))
enddo
enddo
!$acc end region
\end{lstlisting}
% {\bf REMARK}: Without a clause, the compiler will automatically an
% appropriate schedule.
% \begin{enumerate}
% \item If you don't want the loop to be executed on the accelerator,
% you \verb!host! clause.
% \begin{lstlisting}
% !$acc do host
% do = .....
% enddo
% \end{lstlisting}
% \end{enumerate}
% {\bf Strip-mining}:
% \begin{lstlisting}
% DO I = 1, 10000
% A(I) = A(I) * B(I)
% ENDDO
% !!! strip length of 1000
% DO IOUTER = 1, 10000, 1000
% DO ISTRIP = IOUTER, IOUTER+999
% A(ISTRIP) = A(ISTRIP) * B(ISTRIP)
% ENDDO
% ENDDO
% \end{lstlisting}
% The common program unit that can be parallelized is the DO loop. This
% directive applies to a single DO loop only, and the DO loop must
% precede right after the directive. The syntax of the directive is
% The clauses can be
% \begin{enumerate}
% \item host [(width)] :
% \item parallel [(width)]
% \item seq [(width)]
% \item vector [(width)]
% \item unroll (width)
% \item kernel
% \item shortloop
% \item private( list )
% \item cache( list )
% \end{enumerate}
\subsection{Define a compute region}
\label{sec:define-region}
A compute region is a more general case than the DO loop
mapping.
\textcolor{red}{A compute region is a {\bf structured block} which
contains loops that can be compiled to run on the Accelerator}.
\begin{lstlisting}
!$acc region [clause, [clause]...]
! loops code here
!$acc end region
\end{lstlisting}
The compiler analyzes the code to determine which data must be
copied to the device on entry, and which must be copied back on exit from
the region. If the compiler cannot determine this, it may fail to
generate the equivalent code for the accelerator, and thus the code
will run on the host.
\begin{lstlisting}
!!! multiline directive
!$acc region copyin(b(:m,:m), c(:m,:m)), &
!$acc& copyout(a(start:end,:m))
do j = 1, m
do i = start,end
a(i,j) = 0.0
enddo
do i = start,end
do k = 1,m
a(i,j) = a(i,j) + b(i,k) * c(k,j)
enddo
enddo
enddo
!$acc end region
\end{lstlisting}
For explicit control of how the compiler handles the data, APM
provides some additional {\it clauses} to describe which data is to be
copied to the device, and which is to be copied back to the host. Here
is the list of {\it clauses}:
\begin{enumerate}
\item \verb!if!(condition) : specify the condition under which the code
  runs on the Accelerator. Two versions are created - one to run
  on the host, one to run on the accelerator - and the appropriate one
  runs depending on the condition checked at runtime.
  E.g. test whether the number of iterations is large enough to justify
  running on the Accelerator.
\item \verb!copy!(list) : which variables, arrays, subarrays are to be
  copied in and out
\item \verb!copyin! (list) : which variables, arrays, subarrays are to be
  copied in only
\item \verb!copyout! (list) : which variables, arrays, subarrays are to be
  copied back on exit
\item \verb!local! (list) : which variables, arrays, subarrays are to be
  allocated and used in device memory only; they are not copied back to
  the host.
\item \verb!update device (list)! : which variables, arrays, or
  subarrays are to be copied to the device after each iteration
\item \verb!update host (list)! : which variables, arrays, or
  subarrays are to be copied back to the host after each iteration
\end{enumerate}
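For instance, a scratch array that never needs to return to the host can
be declared \verb!local! (a minimal sketch; the array names are
hypothetical):
\begin{lstlisting}
!$acc region copyin(a(1:n)) copyout(b(1:n)) local(tmp(1:n))
do i = 1,n
   tmp(i) = 2.0*a(i)
enddo
do i = 1,n
   b(i) = tmp(i)*tmp(i)
enddo
!$acc end region
\end{lstlisting}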
For more information, read \hyperref[sec:clauses]{clauses}.
{\bf SHOULD WE PARALLELIZE THE CODE?}
\begin{itemize}
\item If there are operations that cannot be performed on the
  accelerator, e.g. because the accelerator does not support them, they
  will be executed on the host. This may require data to be copied back
  and forth between the host and device, which can degrade
  performance. Thus it is critical to examine the code to see whether this
  happens, to decide whether it's better to leave the code to
  run on the CPU.
\item Copying data back and forth between the device and the host adds
  overhead, so use a kernel only when the data is used
  intensively. Otherwise, you can run the code on the host with
  OpenMP, if the processor is multicore.
\end{itemize}
{\bf Example:}
\begin{lstlisting}
!$acc region
do i=1,n
r(i) = a(i) * 2.0
enddo
!$acc end region
\end{lstlisting}
\begin{lstlisting}
!$acc region if(n .gt. 200)
do i=1,n
r(i) = a(i) * 2.0
enddo
!$acc end region
!$acc region copyin(s(1:n,1:m)) copyout(r)
do j = 2,m-1
do i = 2,n-1
r(i-1,j-1) = 0.25*(s(i-1,j) + s(i+1,j) &
+ s(i,j-1) + s(i,j+1))
enddo
enddo
!$acc end region
\end{lstlisting}
{\bf RESTRICTIONS}:
\begin{enumerate}
\item A compute region CANNOT contain another compute region or a
data region (Sect.\ref{sec:define-data-regions}).
\begin{lstlisting}
!!! THIS IS ERROR (no nested accelerator region)
!$acc region
....
!$acc region
...
!$acc end region
!$acc end region
\end{lstlisting}
% \begin{verbatim}
% ERROR
% !$acc region
% !$acc region
% ...
% !$acc end region
% !$acc end region
% \end{verbatim}
\item There is at most ONE \verb!if! clause for the compute region
  (the condition is a LOGICAL in Fortran, and an INTEGER in C)
\item The order of execution of the loop iterations must not affect the
  result of the code region, i.e. there must be no data dependencies
  between the iterations of the loops.
\item There is no branching into or out of the code region, e.g. no
  GOTO statements.
\end{enumerate}
\subsection{Define combined directive}
\label{sec:define-comb-direct}
When you want to specify a loop directive nested immediately inside an
accelerator compute region, you can combine them:
\begin{lstlisting}
!$acc region do [clause [,clause]...]
DO loop
\end{lstlisting}
The restrictions of both the compute region and the DO loop mapping
apply to the combined directive.
\subsection{Define data regions}
\label{sec:define-data-regions}
If different loops share the same data, it's better to keep
the data on the device rather than copy it back to the host after the
first loop and copy it to the Accelerator memory again at the entry of the
second loop. This can be done by wrapping the two loops inside a data
region.
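The idea above can be sketched as two compute regions nested in one data
region (a hedged example; the array names are hypothetical):
\begin{lstlisting}
!$acc data region copyin(a(1:n)) copyout(c(1:n)) local(b(1:n))
!$acc region
do i = 1,n
   b(i) = a(i) + 1.0
enddo
!$acc end region
!$acc region
do i = 1,n
   c(i) = 2.0*b(i)   ! b stays on the device between the two loops
enddo
!$acc end region
!$acc end data region
\end{lstlisting}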
\subsubsection{Explicit data region}
\label{sec:explicit-data-region}
This feature is available only since version 1.0 of APM. The
directive below defines which data ({\bf mainly arrays}) is to be
allocated in device memory; depending on the clauses being used,
the data can be copied from the host, or copied back, if required.
\begin{lstlisting}
!$acc data region [clause, [clause]...]
structured block or data region
!$acc end data region
\end{lstlisting}
with clauses can be
\begin{enumerate}
\item \verb!copy! (list)
\item \verb!copyin! (list)
\item \verb!copyout! (list)
\item \verb!local! (list)
\item \verb!mirror! (list) : (valid only in Fortran) the allocation
  state of the data (e.g. unallocated, allocated) on the device should
  match that on the host.
\item \verb!update device! (list)
\item \verb!update host! (list)
\end{enumerate}
{\bf RECOMMENDED}: It's recommended to include the dimensions of the
arrays in the clause as well. If either the lower bound or the upper
bound is missing, the declared or allocated bounds of the array, if
available, are used.
\begin{lstlisting}
!$acc data region copyin(a(1:n, 1:m))
!$acc end data region
\end{lstlisting}
{\bf RESTRICTIONS}:
\begin{enumerate}
\item A data region CAN contain other data regions or
compute region (Sect.\ref{sec:define-region}).
\item A variable or array can appear in only one clause. However, the
  same subarray of an array is allowed to appear in more than one data
  clause:
\begin{lstlisting}
!!! ERROR
!$acc data region copyin(a(1:5,:)), copyout(a(2:4,1:3))
...
!$acc end data region
!!! OK
!$acc data region copyin(a(1:5,:)), copyout(a(1:5,:))
...
!$acc end data region
\end{lstlisting}
\item If a variable, array, subarray appears in any clause of the data
region, it CANNOT appear in any enclosed regions, e.g. data regions,
compute region, DO loop mapping.
\begin{lstlisting}
!!! ERROR
!$acc data region copyin(a(1:10,1:20))
...
!$acc region copy(a(1:4,1:5))
...
!$acc end region
!$acc end data region
\end{lstlisting}
\item With an \hyperref[sec:array]{assumed-size dummy array}, the upper
bound must be specified.
\item Pointer arrays can be used in clauses; however, pointer
association is not guaranteed to be preserved in device memory
$\rightarrow$
\textcolor{red}{it is recommended not to use pointer arrays}.
\item The user does not have to pad data; the compiler automatically
does that to improve memory alignment and performance.
\end{enumerate}
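Putting the rules above together, the following sketch (array names
and bounds are illustrative) keeps two arrays resident on the device
across two compute regions; note that \verb!a! and \verb!b! do not
reappear in any clause of the enclosed regions:
\begin{lstlisting}
!$acc data region copyin(a(1:n,1:m)), copyout(b(1:n,1:m))
!$acc region
      do j = 1, m
         do i = 1, n
            b(i,j) = 2.0 * a(i,j)
         end do
      end do
!$acc end region
!$acc region
      do j = 1, m
         do i = 1, n
            b(i,j) = b(i,j) + a(i,j)
         end do
      end do
!$acc end region
!$acc end data region
\end{lstlisting}
Because the data region owns the device copies, \verb!a! is
transferred to the device once and \verb!b! back to the host once,
instead of once per compute region.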
\subsubsection{Implicit data region (declarative data)}
\label{sec:define-decl-data}
In some cases, you want an array or arrays to be allocated in
device memory and stay there
{\it during the whole execution of the subprogram unit (SUBROUTINE,
FUNCTION)}.
As this applies to the whole body of the subprogram unit, the user
does not have to delimit the region explicitly, hence the name
{\bf implicit data region}. Programmers can also choose whether the
data should be copied from the host upon entry and/or copied back
to the host upon exit.
\textcolor{red}{Where to put the directive?} In Fortran, put it in
the declaration section. In C, put it following the array
declaration.
\begin{lstlisting}
!!! within a subroutine/function/module/...
!!! put in the declaration section
integer :: i
!$acc declclause [, declclause ...]
!!! your code here
\end{lstlisting}
where {\it declclause} can be one of
\begin{enumerate}
\item \verb!copy! (list) : copyin + copyout
\item \verb!copyin! (list)
\item \verb!copyout! (list)
\item \verb!local! (list) : allocated in device memory only (the
values in host memory are irrelevant)
\item \verb!mirror! (list) : valid only in Fortran (subroutine,
function, module)
\item \verb!reflected! (list) : valid only in Fortran subroutine or
function
\end{enumerate}
{\bf RESTRICTIONS}:
\begin{enumerate}
\item subarray specifications are not allowed in {\it list} for
this case.
\item assumed-size dummy arrays (arguments) cannot appear in the
declarative clauses, so it is better to use fixed-size dummy
arrays.
\item within the range of the implicit data region, the variables
defined in the {\it list} cannot appear in a data clause for any
other explicit data region.
\item we use \verb!reflected! when we want the actual argument arrays
to be bound to the dummy argument arrays in the {\it list}
\item Only \verb!mirror! can be used in a MODULE subprogram. All
others MAY NOT be used.
\item Pointers can be used in the list; however, pointer association
is not preserved in the device memory.
\end{enumerate}
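As a sketch (the subroutine name \verb!smooth! and the array names
are illustrative), \verb!reflected! in the callee pairs with an
explicit data region around the call site, which provides the device
copy that the dummy argument binds to:
\begin{lstlisting}
      subroutine smooth(a, n)
      integer :: n
      real :: a(n)
!$acc reflected(a)
      ! ... a is assumed to already live in device memory ...
      end subroutine smooth

      ! caller side: the data region creates the device copy
!$acc data region copy(a(1:n))
      call smooth(a, n)
!$acc end data region
\end{lstlisting}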
\subsection{UPDATE device-data}
\label{sec:update-device-data}
In a \hyperref[sec:define-data-regions]{data region}, at some point
you may want to copy data from device memory to host memory, or
from the host to the accelerator, i.e. update the data. This can be
done via the UPDATE directive.
\begin{lstlisting}
!$acc update updateclause [, updateclause]...
\end{lstlisting}
where {\it updateclause} can be
\begin{enumerate}
\item \verb!host! (list) : copy data from device to host
\item \verb!device! (list) : copy data from host to device
\end{enumerate}
where {\it list} is a comma-separated list of variables, subarrays,
or arrays. Multiple subarrays of the same array can appear in the list.
{\bf RESTRICTIONS}: As this is an executable directive,
\begin{enumerate}
\item it must not appear in a statement \verb!if!, \verb!while!,
\verb!do!, \verb!switch!, \verb!label!.
\item the items in the list must have a visible device copy.
\end{enumerate}
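For example (a sketch; \verb!exchange_boundary! is a hypothetical
host-side routine), inside a long-lived data region the UPDATE
directive can refresh just one boundary column between iterations
instead of copying the whole array:
\begin{lstlisting}
!$acc data region copy(a(1:n,1:m))
      do iter = 1, niter
!$acc region
         ! ... compute on a in device memory ...
!$acc end region
!$acc update host(a(1:n,1))
         call exchange_boundary(a)
!$acc update device(a(1:n,1))
      end do
!$acc end data region
\end{lstlisting}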