%%
%% FCUDA.tex
%% Login : <hoang-trong@hoang-trong-laptop>
%% Started on Tue Nov 10 10:19:01 2009 Hoang-Trong Minh Tuan
%% $Id$
%%
%% Copyright (C) 2009 Hoang-Trong Minh Tuan
%% This program is free software; you can redistribute it and/or modify
%% it under the terms of the GNU General Public License as published by
%% the Free Software Foundation; either version 2 of the License, or
%% (at your option) any later version.
%%
%% This program is distributed in the hope that it will be useful,
%% but WITHOUT ANY WARRANTY; without even the implied warranty of
%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
%% GNU General Public License for more details.
%%
%% You should have received a copy of the GNU General Public License
%% along with this program; if not, write to the Free Software
%% Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
%%
\lstset{language={[90]Fortran}, numbers=left, numberstyle=\tiny,
stepnumber = 5, numbersep=5pt, keywordstyle=\color{blue}}
\chapter{CUDA Fortran}
\label{chap:cuda-fortran}
% \chapter{GPU Programming model}
% \label{chap:gpu-progr-model}
It is recommended that you read Chap.~\ref{chap:gpu-programming} first. As
CUDA Fortran is derived from NVIDIA's CUDA C, it is also helpful to read
Sects.~\ref{sec:introduction-6}, \ref{sec:compute-capability},
\ref{sec:nvcc}, \ref{sec:interf-cuda-with}, and \ref{sec:kern-exec-conf}.
\section{Introduction}
\label{sec:introduction-4}
CUDA Fortran allows programmers to define subprograms that execute on
the GPU (the device).
\begin{enumerate}
\item If the subprogram, running on the GPU, is invoked from a host
  subprogram, it is called a {\bf kernel subroutine}
  (\textcolor{red}{only a subroutine is allowed, not a function (checked
    in PGI Fortran 10.4)}).
\item If the subprogram (subroutine/function), running on the GPU, is
  invoked from a kernel subroutine, it is called a
  {\bf device subprogram}. This can be a function or a subroutine. It is
  important to use ``volatile'' for the actual argument if it is intended to
  receive an output value.
\end{enumerate}
Kernel subroutines and device subprograms are collectively called
{\bf kernels}. How do we tell whether a subroutine/function runs on the host
or on an accelerator? Each subroutine/function is given a specific
qualifier; we will discover this in Sect.~\ref{sec:subroutinefunctions}.
\subsection{Built-in variables}
\label{sec:built-variables-1}
\begin{lstlisting}
type(dim3) :: threadIdx, blockDim, blockIdx, gridDim
integer(4) :: warpsize
\end{lstlisting}
with
\begin{lstlisting}
type dim3
   integer(kind=4) :: x, y, z
end type
\end{lstlisting}
All array dimensions, as well as the thread and block indices, are one-based
in CUDA Fortran, not zero-based as in C. It is important to keep this
difference in mind when computing the global thread index.
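As a minimal sketch (the kernel and array names are illustrative, not from a
specific library), the usual mapping from the one-based built-in variables to
a global thread index looks like this:
\begin{lstlisting}
attributes(global) subroutine saxpy(y, x, a, n)
  real, device :: y(n), x(n)
  real, value :: a
  integer, value :: n
  integer :: i
  ! one-based global index: blockIdx%x and threadIdx%x both start at 1
  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) y(i) = y(i) + a * x(i)
end subroutine
\end{lstlisting}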
\subsection{History}
\begin{enumerate}
\item CUDA Fortran 2013: supports texture memory
  (Sect.~\ref{sec:cudafortran_texture}), CUDA 5.0, CUDA dynamic parallelism,
  and creating and linking static device libraries and objects.
\end{enumerate}
\section{Subroutine/Functions}
\label{sec:subroutinefunctions}
\subsection{Qualifiers}
\label{sec:qualifiers}
\begin{enumerate}
\item A {\bf host subprogram}: a subroutine/function that runs on the CPU and
  can only be invoked from another host subprogram.
\begin{lstlisting}
attributes(host) subroutine foosub(arguments)
  .........
end subroutine
attributes(host) integer function foofunc(arguments)
  .........
  foofunc = x
end function
\end{lstlisting}
  The \verb!host! attribute is used by default if no other attribute is
  given. If it is specified explicitly, it can be preceded or followed
  by any allowable subroutine or function prefix,
  e.g. \verb!recursive!, \verb!pure!, \verb!elemental!, or the function
  return datatype.
\item A {\bf kernel subroutine}: a subroutine (NOT a function) that
  runs on the GPU (device) and can only be called from a
  {\it host subprogram}.
\begin{lstlisting}
attributes(global) subroutine foosub(arguments)
end subroutine
\end{lstlisting}
  A kernel subroutine cannot use any of the prefixes mentioned for host
  subprograms. A kernel subroutine must NOT be defined within any other
  subprogram, and must not contain any other subprogram.
  \textcolor{red}{Anything that applies to kernel subroutines also applies to
    {\it device subprograms}; the reverse is not true.}
\item A {\bf device subprogram}: a subroutine/function that runs on the device
  and can only be called from a kernel (i.e. a kernel subroutine or another
  device subprogram).
\begin{lstlisting}
attributes(device) subroutine foosub(arguments)
attributes(device) integer function foofunc(arguments)
\end{lstlisting}
  A device subprogram MUST appear within a module, and can only be
  called from device subprograms in the same module. The only
  allowable prefix is the function return datatype.
\item A subprogram (subroutine/function) may have both the \verb!host! and
  \verb!device! attributes (the order is unimportant). In that case,
  the compiler generates two versions, one to run on the CPU and one to
  run on the GPU. Note that such a subprogram must satisfy all the
  requirements for a device subprogram.
  \textcolor{red}{However, current PGI Fortran compilers may have
    problems combining the two attributes, so you are advised to
    create two different subprograms, one for the host and one for the
    device}.
\begin{lstlisting}
attributes(host, device) subroutine foofunc(arguments)
end subroutine
\end{lstlisting}
  The restrictions on this kind of subprogram are the combination of those
  that apply to host subprograms and device subprograms, e.g. no use of the
  built-in variables \verb!threadIdx! and \verb!blockIdx!.
  It is compiled for execution on both the GPU and the CPU. It may then
  be called from another device subprogram in the same module (to run
  on the GPU), or from any host subprogram in the same module, any
  subprogram that uses the module, or any subprogram contained in a
  subprogram that uses the module (to run on the CPU).
\end{enumerate}
\begin{framed}
If none of the attributes is used, \verb!attributes(host)! is used
by default.
\end{framed}
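As a minimal sketch of how these qualifiers fit together (the module, kernel,
and variable names are illustrative, not from a specific library), a kernel
subroutine and a device function might be organized as follows:
\begin{lstlisting}
module kernels_mod
  use cudafor
contains
  ! device function: callable only from device code in this module
  attributes(device) real function square(x)
    real, value :: x
    square = x * x
  end function

  ! kernel subroutine: launched from the host with the chevron syntax
  attributes(global) subroutine square_all(a, n)
    real, device :: a(n)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = square(a(i))
  end subroutine
end module
\end{lstlisting}
From a host subprogram that uses this module, the kernel would be launched
as, e.g., \verb!call square_all<<<grid, block>>>(a_d, n)!, where \verb!a_d!
is a device array.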
\subsection{Restriction}
\label{sec:restriction}
\subsubsection{To kernel subroutine}
\label{sec:kernel-subroutine}
% The restrictions on device subprogram apply to kernel subroutine
% also. In addition, there are also further restrictions for kernel
% subroutines with {\bf global} attribute.
Here are the restrictions that apply to kernel subroutines with the {\bf global}
attribute.
\begin{enumerate}
\item A kernel subroutine cannot also have the {\bf host} attribute. This means
  that we need to create two versions, one to run on the GPU (e.g. when the
  data is large) and one on the CPU (e.g. when the data is small).
\item A call to a kernel subroutine must specify the execution
  configuration via the \hyperref[sec:call-kernel]{chevron syntax}.
\item It CANNOT be pure or elemental, i.e. no PURE or ELEMENTAL prefix
  on the subprogram definition.
\end{enumerate}
\subsubsection{To device subprogram}
\label{sec:device-subprogram}
These are the restrictions that apply to subprograms with the {\bf device} attribute.
% (besides the limits described in the previous section).
\begin{enumerate}
\item It CANNOT be recursive, i.e. no RECURSIVE prefix on the
  subprogram definition.
\item It CANNOT contain variables with the SAVE attribute or with data
  initialization.
\item It CANNOT have OPTIONAL arguments.
\item Dummy arguments CANNOT be assumed-shape arrays or have the POINTER
  attribute.
\item Arguments to a kernel subroutine are currently limited to 256
  bytes.
\item It CANNOT contain another subprogram (i.e. no internal subprograms).
\item It CANNOT be contained in a host subprogram, and it cannot be called
  by a host subprogram.
\item STOP and PAUSE statements are NOT allowed.
\item Arrays with the POINTER or ALLOCATABLE attribute are NOT allowed.
\item CHARACTER strings are NOT supported; a CHARACTER object must have LEN=1.
\item Assumed-size arrays are NOT allowed $\rightarrow$ arrays must be
  fixed-size.
\item I/O statements are NOT allowed, e.g. READ, WRITE, PRINT, FORMAT,
  NAMELIST, OPEN, CLOSE, BACKSPACE, REWIND, ENDFILE, INQUIRE.
\item ENTRY statements are NOT allowed.
\item Floating-point exception handling is not supported.
\item Subroutine and function calls are supported ONLY when they can be
  inlined.
\item Cray pointers are NOT supported.
\item Alternate return specifications are NOT allowed.
\end{enumerate}
\section{Memory hierarchy}
\label{sec:memory-hierarchy}
\subsection{From CPU}
\label{sec:from-cpu}
The main memory - DRAM - is organized in a virtual space with
2-component addressing scheme. In other words, the memory is organized
in the form of pages, each pages is a segment of memory. So,
addressing a true physical memory requires knowing the page number and
the index in the page.
To provide quick access to a certain amount of memory, the system
reserved an amount of DRAM that is called {\bf paged-lock} memory,
i.e. we only need a single address to identify the true memory
location. This is aka {\bf pinned memory}.
\subsection{From GPU}
\label{sec:from-gpu}
The GPU organizes its memory into a number of memory spaces with
different access latencies.
\begin{itemize}
\item On-chip memory [low-latency]:
\begin{enumerate}
\item register files
\item shared memory
\item read-only constant memory
\item read-only texture cache memory
\end{enumerate}
\item Off-chip memory [high-latency]:
\begin{enumerate}
\item device global memory
\item local memory
\end{enumerate}
\end{itemize}
NOTE: Accessing data in {\it device constant memory} is faster than
accessing {\it device global memory}.
\begin{framed}
  \textcolor{red}{PGI Fortran 2010 does not yet support access to
    texture memory}.
\end{framed}
\subsection{Memory access (from host side)}
\label{sec:memory-access-from-host}
From the host side, you can
\begin{itemize}
\item access data in host memory;
\item copy data to/from device global memory, using DMA access to the
  device, which is slow relative to accessing host memory.
\end{itemize}
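As a hedged sketch (the array names are illustrative), the typical way such a
copy is expressed in CUDA Fortran is a simple assignment between a host array
and a \verb!device! array, which the compiler turns into the appropriate
memory transfer:
\begin{lstlisting}
real :: a(1024)              ! host memory
real, device :: a_d(1024)    ! device global memory
a = 1.0
a_d = a        ! host-to-device copy over PCIe (DMA)
! ... launch kernels that read/write a_d ...
a = a_d        ! device-to-host copy
\end{lstlisting}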
\section{Variables}
\label{sec:variables-1}
The GPU has a hierarchical structure of different memory spaces
(discussed in Sect.~\ref{sec:memory-hierarchy}). The question now is
how we can tell in which kind of memory space a piece of data resides.
To designate where variables are stored and the scope of memory
access, CUDA Fortran adds five new attributes (\verb!pinned!, \verb.device.,
\verb.shared., \verb.local., \verb.constant.) that indicate whether a
variable is allocated in host page-locked memory, device global
memory, thread-block shared
memory\footnote{scope: block, location: shared memory}, thread local
memory\footnote{scope: thread, location: global memory}, or device
constant memory\footnote{scope: program, location: cached global memory}.
If the variable is declared
\begin{enumerate}
\item in a device subprogram (including a kernel subroutine): it may
  have any one of the four attributes other than \verb!pinned!.
\item in a module: it may only have the \verb.device. or
  \verb.constant. attribute.
\end{enumerate}
\begin{lstlisting}
real :: a(100)
attributes(device) :: a
real, device :: b(100)
\end{lstlisting}
There are restrictions when using these new attributes, discussed in
Sec.~\ref{sec:restriction-1}.
\subsection{pinned}
\label{sec:pinned}
\subsubsection{How to define}
\label{sec:how-define}
Data declared with the \verb!pinned! attribute
resides in page-locked CPU DRAM, IF such memory is available; otherwise, the
data resides in regular pageable CPU memory. Because some OSs restrict the
use of host page-locked memory (e.g. limit its size), the variable is
automatically allocated in normal pageable host memory when the page-locked
allocation fails.
As the availability of page-locked memory is only known at runtime,
data with the \verb!pinned! attribute must be declared with the ALLOCATABLE
attribute. To find out whether the allocation was successful, we use the
PINNED option (an INTENT(OUT) logical) of the ALLOCATE statement.
\begin{lstlisting}
logical :: plog
integer, pinned, allocatable :: p(:)
allocate(p(4000), pinned=plog)
if (.not. plog) then
print *, "not successful"
endif
\end{lstlisting}
\begin{itemize}
\item It must be an ALLOCATABLE array or an ALLOCATABLE scalar (a new
  feature of Fortran 2003). If you use the {\it pinned} attribute with a
  non-allocatable array or scalar, the compiler simply ignores the
  {\it pinned} attribute.
\begin{lstlisting}
real :: p(:), x
allocatable :: p, x
attributes(pinned) :: p, x
! or simply
real, allocatable, pinned :: p(:), x
allocate(p(n), stat=istat, pinned=plog)
if (.not. plog) then
...
else !success
...
endif
allocate(x)
x = 100.4
! free memory
deallocate(x)
deallocate(p)
\end{lstlisting}
\item We can also use CUDA runtime API
\begin{lstlisting}
! to allocate the pinned data
integer function cudaMallocHost(hostptr, size)
type(C_PTR) :: hostptr ! the address of the
! page-locked allocation
integer :: size ! in bytes
! to convert from type(C_PTR) to a Fortran pointer, use
! c_f_pointer() from the iso_c_binding module
! and to free
integer function cudaFreeHost(hostptr)
type(C_PTR) :: hostptr
\end{lstlisting}
\end{itemize}
\subsubsection{Where to use}
\label{sec:where-use}
Data with \verb!pinned! attribute can be defined in a module or a host
subprogram.
% At most one attribute can be used for a single variable. If none of
% these 4 attributes are used, by default, variables declared in modules
% or host subprograms are allocated on host memory.
\subsubsection{How to use}
\label{sec:how-use}
It can be passed as an actual argument to a host subprogram regardless of
whether the interface of the host subprogram is explicit or the
corresponding dummy argument has the \verb!allocatable! or \verb!pinned!
attribute.
However, if you want to deallocate it inside the host subprogram, the
dummy argument must be declared with at least the \verb!pinned!
attribute, or the deallocation may fail.
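A minimal, hypothetical sketch of that last point (the subroutine name is
illustrative): a host subprogram that deallocates a pinned array would
declare its dummy argument like this.
\begin{lstlisting}
subroutine release_pinned(p)
  ! the dummy needs both attributes so that deallocation of the
  ! page-locked allocation is handled correctly
  real, pinned, allocatable :: p(:)
  if (allocated(p)) deallocate(p)
end subroutine
\end{lstlisting}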
\subsubsection{Why}
\label{sec:why}
Memory is organized into pages of fixed size, so addressing a memory
location requires the page number and the offset within the page. By fixing
the page (page-locking), we only need the offset information, thus
reducing the time needed for addressing.
\begin{table}[hbt]
\begin{center}
\caption{Benchmark with 1024x1024 mesh, 400 RK4 steps on Windows, 2D
isotropic turbulence }
\begin{tabular}{ccc}
\hline
& Opteron 250 & Opteron 2210 \\
PCI-e bandwidth & 1135 MB/s & 1483 MB/s \\
host-2-device & 1003 MB/s & 1223 MB/s \\
\hline\hline
\end{tabular}
\end{center}
\label{tab:pinneddata}
\end{table}
In addition, copies to and from pinned memory can be {\it asynchronous}. This
is the main purpose of using {\bf pinned data}: it allows the data copy and
the kernel execution to overlap, which can potentially enhance the
performance of functions like \verb!cudaMemcpy!, as shown in Table
\ref{tab:pinneddata} (for data allocated via \verb!allocate()!, or in C,
\verb!malloc()!). However, we need to make sure the kernel does not
modify the data while it is being copied.
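A hedged sketch of this overlap, assuming an illustrative kernel named
\verb!mykernel! and using the CUDA Fortran stream API (buffer sizes and names
are made up for the example):
\begin{lstlisting}
use cudafor
real, pinned, allocatable :: a(:)      ! page-locked host buffer
real, device, allocatable :: a_d(:)
integer(kind=cuda_stream_kind) :: stream
integer :: istat, n
n = 1024*1024
allocate(a(n), a_d(n))
istat = cudaStreamCreate(stream)
! asynchronous host-to-device copy: returns immediately because a is pinned
istat = cudaMemcpyAsync(a_d, a, n, cudaMemcpyHostToDevice, stream)
! a kernel launched into the same stream runs after the copy completes
call mykernel<<<n/256, 256, 0, stream>>>(a_d, n)
istat = cudaStreamSynchronize(stream)
\end{lstlisting}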
% NOTE: using too much page-locked memory may degrade the performance of
% the system. Thus, this is best used sparingly to allocate staging
% areas for data exchange between host and device.
\subsection{device }
\label{sec:device-}
Access to data in global device memory is very expensive (400--600
clock cycles). Thus, it is recommended
\begin{itemize}
\item to access device data in a pattern that allows coalescing;
\item to store there only data used by all threads, rather than data
  local to some threads. For data local to a thread (or threads), try to put
  it into registers or shared memory.
\end{itemize}
\subsubsection{How to define}
\label{sec:how-define-1}
Data residing in global device memory is declared with the
\verb.device. attribute.
\begin{enumerate}
\item A device array can be an explicit-shape array, an ALLOCATABLE array,
  or (in a host subprogram) an assumed-shape dummy array.
\begin{lstlisting}
subroutine vfunc(a,c,n)
real, device :: adev(n)
real, device :: atmp(4)
end subroutine ! adev and atmp are automatically deallocated
\end{lstlisting}
\item
  \textcolor{red}{Since Fortran 2003, it can also be a scalar, but this
    requires the ALLOCATABLE attribute}.
\begin{lstlisting}
integer, allocatable, device :: ndev
allocate(ndev)
ndev = 100
\end{lstlisting}
\item If declared in a module, it can be (1) fixed-size device data
  or (2) an ALLOCATABLE
  array\footnote{automatic arrays are not allowed}; it can be accessed
  by any (host, device) subprogram defined in that module and by those
  that use the module
  [\textcolor{red}{option (2) is available since PGI Fortran 10.4}].
\begin{verbatim}
PGF90-S-0310-Automatic arrays are not allowed in a MODULE
\end{verbatim}
\item If declared in a host subprogram, \verb!device! scalars and
  automatic arrays CANNOT have the SAVE attribute; they can only be accessed
  by that host subprogram and any subprograms contained in that
  host subprogram. However, a device ALLOCATABLE array can have the SAVE
  attribute.
  By default, a device ALLOCATABLE array is deallocated in device
  memory after the host subprogram finishes. With the SAVE attribute,
  it is not deallocated automatically and requires a manual call to
  \verb!deallocate()!.
\item If declared in a kernel subprogram, the data is
  ``automatically'' allocated in device memory and
  ``automatically'' freed after the kernel subprogram completes.
\begin{lstlisting}
subroutine sub1(a,n)
real, device :: adev(n)
real, device :: bdev(10)
real, device :: cscalar
\end{lstlisting}
\item It CANNOT have POINTER or TARGET attribute
\item It CANNOT be declared in a COMMON block or EQUIVALENCE
statement.
\item An ALLOCATABLE device array has a lifetime from when it is
  allocated until it is deallocated. Other device variables have the
  lifetime of the program unit where they are declared.
\begin{lstlisting}
subroutine vfunc(a,c,n)
real, device :: adev(n)
real, device :: atmp(4)
...
end subroutine vfunc ! adev and atmp are deallocated
\end{lstlisting}
\item Member of a derived type CANNOT have `DEVICE' attribute.
% \item It can be passed as an actual argument to a (device, host)
% subprogram, with the condition that the interface of the
% subprogram must be explicit and the matching dummy argument must
% also have `device' attribute.
\end{enumerate}
We can also use the CUDA runtime API to allocate/deallocate a DEVICE
ALLOCATABLE array. IMPORTANT: mixing cudaMalloc()/cudaFree()
with the Fortran standard allocate()/deallocate() for a single array is not
supported. If you allocate an array with cudaMalloc(),
you must free it with cudaFree(), and likewise for
allocate()/deallocate().
\begin{lstlisting}
real, ALLOCATABLE, DEVICE :: v(:)
istat = cudaMalloc(v, 100)
...
istat = cudaFree(v)
\end{lstlisting}
\subsubsection{How to use}
\label{sec:how-use-1}
It can be passed as an actual argument to a host or device
subprogram. In that case, the matching dummy argument must have the
DEVICE attribute and the (host or device) subprogram must have an
explicit interface (e.g. be in the same module).
\subsubsection{Why}
\label{sec:why-1}
The only way to pass data from host memory to the GPU is to copy it to
global device memory. However, accessing it extensively from threads
is not recommended, as global memory has high latency. A better approach
is to make a copy in shared memory or the texture cache. Since Fermi, with
the availability of L1 and L2 caches, this is mostly handled by the
hardware.
\subsection{constant}
\label{sec:cudaf_constant}
Constant memory is a region of global device memory that is cached, so
access to it is faster than to ordinary global device memory
(Sect.~\ref{sec:const-shar-cache}, \ref{sec:constant-memory}).
Data in the constant memory space is shared by all cores and cannot be
modified by any thread running on the GPU. The current NVIDIA CUDA
architecture limits the size of data in ``constant'' memory to 64KB. There is
another cache called the {\bf texture cache}; however, it is not accessible
from CUDA Fortran.
\subsubsection{How to define}
\label{sec:how-define-2}
To threads, ``constant'' data is data that resides in read-only global
device memory but is cached, so access to it is faster than to ordinary
global device memory. Constant device data is declared with the
\verb.constant. attribute.
\begin{enumerate}
\item It can be a scalar variable or an array.
\item If declared in a module (PGI Fortran v10.2 supports declaring it in
  the region before the CONTAINS line, but not inside any contained
  subroutine), it can be accessed by any (host,
  device) subprogram defined in that module and by those that use the
  module.
\item It CANNOT be declared in a COMMON block or EQUIVALENCE
  statement.
\item It CANNOT have the POINTER, TARGET, or ALLOCATABLE attribute.
\item A member of a derived type CANNOT have the \verb!constant! attribute.
\item An array with the \verb!constant! attribute must have a fixed size.
\end{enumerate}
\subsubsection{Where to use}
\label{sec:where-use-1}
In a host subprogram, you can use {\it constant} data in any of
these situations:
\begin{enumerate}
\item in declaration statements, i.e. as local variables;
\item as the source or destination in a data transfer assignment statement,
  i.e. it can be on the left-hand or right-hand side;
\item as an actual argument to another host subprogram;
\item as a dummy argument in a host subprogram.
\end{enumerate}
\subsubsection{How to use}
\label{sec:how-use-2}
It has the lifetime of the entire application.
It may NOT be assigned or modified in any
{\it device subprogram}; it can only be initialized or modified
from a host subprogram.
In a host subprogram, it can be the source or destination of an
assignment statement. In addition, it can be passed as an actual
argument to another host subprogram, with the condition that the
interface of the subprogram is explicit and the matching dummy
argument also has the \verb!constant! attribute. This means that it
CANNOT be passed to a device subprogram as an argument.
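A minimal sketch, assuming illustrative names, of constant data declared in a
module, set from the host by simple assignment, and read (never written)
inside a kernel:
\begin{lstlisting}
module coeffs_mod
  use cudafor
  real, constant :: c(5)             ! resides in constant memory
contains
  attributes(global) subroutine scale(a, n)
    real, device :: a(n)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = c(1) * a(i)   ! read-only access from device code
  end subroutine
end module
! In a host subprogram that uses coeffs_mod, the assignment
!   c = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
! copies the values into constant memory before launching the kernel.
\end{lstlisting}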
\subsection{shared}
\label{sec:shared}
Access to shared memory is much faster than access to device
memory; it is considered as fast as accessing registers. Data
in {\it shared memory} is shared by the threads in the same thread block
and has the lifetime of the thread block.
You can define a scalar shared variable,
\begin{lstlisting}
INTEGER, shared :: begin, end
\end{lstlisting}
or a shared array, which is then often called {\bf scratchpad memory}:
\begin{lstlisting}
INTEGER, shared, dimension(blocksize) :: scratch
! load into scratch
scratch(threadIdx%x) = data(threadIdx%x)
! compute on scratch values...
! then save it back at the end of the kernel
data(threadIdx%x) = scratch(threadIdx%x)
\end{lstlisting}
We need to synchronize if one thread uses scratchpad data loaded by
another thread:
\begin{lstlisting}
INTEGER, shared, dimension(blocksize) :: scratch
INTEGER :: left
! load into scratch
scratch(threadIdx%x) = data(threadIdx%x)
call syncthreads()
left = scratch(threadIdx%x - 1)
! compute on scratch values...
! then save it back at the end of the kernel
data(threadIdx%x) = scratch(threadIdx%x)
\end{lstlisting}
\subsubsection{How to define}
\label{sec:how-define-3}
Data residing in shared memory is declared with the
\verb.shared. attribute.
\begin{enumerate}
\item It CANNOT be declared in a COMMON block or EQUIVALENCE
  statement.
\item It CANNOT have the POINTER, TARGET, or ALLOCATABLE attribute.
\item It CANNOT be a member of a derived type.
\item It can ONLY be declared inside a device subprogram.
\item If it is an array, in most cases it must be of fixed size.
\begin{lstlisting}
attributes(device) subroutine foosub(a, b)
integer :: a
real, dimension(a) :: b
real, shared, dimension(10) :: c
....
\end{lstlisting}
or
\begin{lstlisting}
attributes(global) subroutine sub(A, n, nb)
integer, value :: n, nb
real, shared :: s(nb*blockDim%x, nb)
\end{lstlisting}
  \label{lab:assa} The only exception is when the shared array is
  NOT a dummy argument, i.e. it is not passed from another device
  subprogram but is locally declared; in that case it can be declared as an
  assumed-size array
\begin{lstlisting}
attributes(global) subroutine foosub(a, b)
integer, value :: a
real, dimension(a) :: b
real, shared, dimension(a,*) :: c
.....
\end{lstlisting}
  The true size of the array \verb!c!, whose last dimension is
  left blank, is specified at runtime by the number of bytes of
  shared memory to be dynamically allocated for each thread block,
  given in the execution configuration when you invoke the kernel.
\begin{lstlisting}
call foosub<<<grid, block, bytes, streamId>>> (a, b)
\end{lstlisting}
  If you declare more than one assumed-size shared array in the device
  subprogram, they share the same starting address, i.e. they are
  implicitly equivalenced. Programmers therefore have to handle
  them manually to retrieve the data they want.
\begin{lstlisting}
attributes(global) subroutine foosub(a, b)
integer, value :: a
real, dimension(a) :: b
! local
real, shared, dimension(a,*) :: c
real, shared, dimension(*) :: d
.....
\end{lstlisting}
  Then \verb!c! and \verb!d! actually start from the same memory
  address.
\item It may NOT be initialized in its declaration inside the device subprogram.
\begin{lstlisting}
attributes(device) subroutine foosub(a, b)
integer :: a
real, dimension(a) :: b
! THIS IS WRONG
real, shared, dimension(2) :: d = (/ 2, 4.4 /)
\end{lstlisting}
\item It can be written or read by any thread within that thread
  block. To guarantee that the latest value of a shared variable is
  visible to all threads in the block, a thread needs to call
  \verb.SYNCTHREADS(). before reading it.
\begin{lstlisting}
attributes(device) subroutine foosub(a, b)
integer :: a
real, dimension(a) :: b
real, shared, dimension(16,16) :: d
!do copy
d(:,:) = ...
call syncthreads()
! process here
...
\end{enumerate}
The amount of shared memory for a block can be increased by a
dynamically allocated block of memory whose size is specified via
the third argument of the chevron syntax. This memory is
used by arrays declared as external arrays (i.e. arrays whose
size is determined at runtime)
\begin{lstlisting}
! in C
extern __shared__ float shared[];
\end{lstlisting}
All arrays in the same kernel declared as external arrays start at the
same memory address, so their contents overlap; handling
them must be done with care.
\begin{lstlisting}
attributes(global) subroutine mmul_kernel(A,B,C, N, M, L)
  real, device :: A(N,M), B(M,L), C(N,L)
  integer, value :: N, M, L
  integer :: i, j, kb, tx, ty
  real, shared :: Ab(16, 16), Bb(16, 16)
  tx = threadidx%x
  ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx
  j = (blockidx%y-1) * 16 + ty
  ! kb is the loop index over 16-wide blocks of the k dimension (loop omitted)
  Ab(tx,ty) = A(i, kb+ty-1)
  Bb(tx,ty) = B(kb+tx-1,j)
  ! wait until all elements of Ab, Bb are filled
  call syncthreads()
  ....
end subroutine
\end{lstlisting}
\subsubsection{How to use}
\label{sec:how-use-3}
It must be used inside the device subprogram where it is defined. It
can also be passed to another device subprogram as an actual argument, as
long as the matching dummy argument has the \verb!shared! attribute.
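As a hedged sketch of the dynamically sized (assumed-size) shared array
described above, assuming an illustrative kernel that reverses a block of
data in place:
\begin{lstlisting}
attributes(global) subroutine reverse_block(d, n)
  real, device :: d(n)
  integer, value :: n
  ! assumed-size shared array: its actual size (in bytes) is the third
  ! argument of the chevron syntax at launch time
  real, shared :: tile(*)
  integer :: t, r
  t = threadIdx%x
  r = blockDim%x - t + 1
  tile(t) = d(t)
  call syncthreads()          ! all loads complete before any reads
  d(t) = tile(r)
end subroutine
! host-side launch, reserving blockDim*4 bytes of shared memory per block:
!   call reverse_block<<<1, nthreads, nthreads*4>>>(d_dev, nthreads)
\end{lstlisting}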
\subsection{local}
\label{sec:local}
CUDA Fortran does not require an explicit \verb!local! attribute: local
memory holds variables that are private to each thread. Local memory is not
cached, so reading data from local memory is as expensive as reading from
global memory if the data is not kept in registers.
Local variables in a kernel are normally mapped to registers. However,
if a local variable is an array indexed by ``variables'' instead of
``constants'', or if the number of local variables exceeds the limited
number of registers per thread, then the compiler (nvcc) spills this
array to local memory, which is in fact global device memory. This is
extremely SLOW and may kill performance. A better choice for an
array indexed by variables is to declare it with the \verb!shared! attribute.
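A hedged illustration of this spilling issue (the kernel is hypothetical): a
thread-local array indexed with a runtime variable may be placed in slow
local memory rather than registers.
\begin{lstlisting}
attributes(global) subroutine spill_demo(out, n)
  real, device :: out(n)
  integer, value :: n
  real :: tmp(32)          ! thread-local array
  integer :: i, k
  k = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  do i = 1, 32
     tmp(i) = real(i) * k
  end do
  ! indexing tmp with a value known only at runtime encourages the
  ! compiler to spill tmp to local (i.e. global) memory
  if (k <= n) out(k) = tmp(mod(k, 32) + 1)
end subroutine
\end{lstlisting}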
\subsection{value}
\label{sec:value}
This is a new feature. By default, data is passed by reference in
Fortran. Since Fortran 2003, a dummy argument can be declared to be passed
by value using the \verb!value! attribute.
Normally, a variable passed to a kernel must refer to data in
device memory. However, for scalar variables we do not need to do
so: we can pass them by value, and the compiler will automatically
keep them in registers.
\begin{lstlisting}
! matA, matB in device memory
! n in host memory
call madd<<<grid, block>>>(matA, matB, n)
...
attributes(global) subroutine madd( a, b, n)
  real, dimension(n, n) :: a, b
  integer, value :: n
  ...
end subroutine
\end{lstlisting}
\subsection{Texture memory}
\label{sec:cudafortran_texture}
% Texture memory is planned to be available in CUDA Fortran in Nov, 2012 (version
% 12.0).
Texture memory is supported since CUDA Fortran 2013.
\subsection{zero-copy pinned memory}
\label{sec:cudaf_zero-copy}
\url{http://cudamusing.blogspot.com/2010_07_01_archive.html}
\section{GPU Data allocation}
\label{sec:cudafortran_data_malloc}
\subsection{Static}
\label{sec:statistic}
When static data is declared with the \verb!device! attribute, the data
is allocated locally and automatically in device memory without
using an \verb.allocate(). statement. Such device array data has the
lifetime of the subprogram where it is declared.
\begin{lstlisting}
subroutine vfunc(a,c,n)
real, device :: adev(n)
real, device :: atmp(4)
...
end subroutine vfunc ! adev and atmp are freed automatically
\end{lstlisting}
\subsection{Dynamic}
\label{sec:dynamic}
First, the data needs to have the ALLOCATABLE attribute. There are
three ways to dynamically allocate device data (scalars or arrays):
\begin{enumerate}
\item (1D, 2D, 3D ...) Using \verb!allocate()!
\begin{lstlisting}
real, device, dimension(:), allocatable :: adev
allocate(adev(10))
\end{lstlisting}
The data is freed using \verb!deallocate()!.
\begin{framed}
  Ultimately, \verb!allocate()! calls \verb!cudaMalloc()!. However,
  there are some subtle differences.
Fortran has some semantics associated with allocatable arrays that
are not implemented when using \verb!cudaMalloc!. For instance,
the ALLOCATED() intrinsic will work with the former but not with
arrays allocated via cudaMalloc. Programmers should not mix
ALLOCATE() with \verb!cudaFree!, or \verb!cudaMalloc! with
DEALLOCATE().
\end{framed}
\item (1D only) Using the CUDA Fortran interface to the CUDA C runtime API
  \verb!cudaMalloc!, in which the first argument can be
  \begin{itemize}
  \item \verb!adev!, an ALLOCATABLE, one-dimensional array; the
    second argument is then the number of elements
\begin{lstlisting}
integer :: istat
integer :: num_elements
num_elements = 10
istat = cudaMalloc(adev, num_elements)
\end{lstlisting}
\item a variable of \verb!TYPE(C_DEVPTR)!, and the second argument
is the number of bytes
\begin{lstlisting}
integer :: istat
integer :: num_elements
TYPE(C_DEVPTR) :: bdev
num_elements = 10
num_bytes = num_elements * 4 !assume 4-byte elements
istat = cudaMalloc(bdev, num_bytes)
\end{lstlisting}
\end{itemize}
The data must be freed using \verb!cudaFree()!.
\item \textcolor{red}{IMPORTANT FOR 2D or 3D arrays} Using the CUDA
  Fortran interfaces for multidimensional arrays,
  \verb!cudaMallocPitch()! (2D) and \verb!cudaMalloc3D()! (3D).
  These functions differ from the two previous ones in that they
  automatically pad the array dimensions to maximize aligned memory
  references on the device, whereas the two previous approaches
  do not, i.e. padding is left to the programmer. This is
  particularly helpful.
\end{enumerate}
% There are differences between device data allocated using the two
% methods. Using \verb!allocate()!, the data has semantics associated
% with allocatable arrays that are not implemented when using
% cudaMalloc. Thus, we cannot mix the using of allocation and
% deallocation between the two methods.
\begin{framed}
  For allocations of 2D arrays, it is recommended that programmers
  consider performing pitched allocations with \verb!cudaMallocPitch()!.
\end{framed}
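A hedged sketch of such a pitched 2D allocation (the kinds and the returned
pitch semantics follow the CUDA Fortran runtime interface; the sizes are
illustrative):
\begin{lstlisting}
use cudafor
real, allocatable, device :: a_d(:,:)
integer(kind=cuda_count_kind) :: pitch
integer :: istat, width, height
width  = 500        ! requested number of elements per row
height = 300
! the first dimension may be padded; the padded width (in elements)
! is returned in pitch
istat = cudaMallocPitch(a_d, pitch, width, height)
! ... use a_d in kernels ...
istat = cudaFree(a_d)
\end{lstlisting}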
There is a minor note when using a new Fortran 2003 feature,
{\bf allocatable scalar data}:
\begin{enumerate}
\item device array
\begin{lstlisting}
real, allocatable, device :: b(:)
allocate(b(5024),stat=istat)
...
if(allocated(b)) deallocate(b)
\end{lstlisting}
  dynamically allocated in the host subprogram using \verb!allocate()!
  and dynamically deallocated using the \verb!deallocate()! statement.
  If the device allocatable array has the SAVE attribute, you need to
  explicitly deallocate the data; otherwise it continues to be
  available after the subprogram returns.
\item allocatable scalar data in device memory (only in Fortran 2003)
\begin{lstlisting}
integer, allocatable, device :: ndev