-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathOpenMP.tex
516 lines (423 loc) · 15.8 KB
/
OpenMP.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
%%
%% OpenMP.tex
%% Login : <hoang-trong@hoang-trong-laptop>
%% Started on Wed Jul 29 23:28:50 2009 Hoang-Trong Minh Tuan
%% $Id$
%%
%% Copyright (C) 2009 Hoang-Trong Minh Tuan
%%
\chapter{OpenMP}
\label{chap:openmp}
\href{http://www.open-mpi.org/}{OpenMP} is a new MPI project that was
specifically designed for shared-memory-space (multicore or SMP
machine). OpenMP particularly fits ``high-end workstations rather than
supercomputer for complex computational applications''[NYT 28Oct,
1997]. OpenMP has
\begin{enumerate}
\item some compiler directives $\rightarrow$ help compiler generate
more efficient code
\begin{itemize}
\item Fortran directives, start with \verb.!$OPENMP.
\item C/C++ pragmas
\end{itemize}
\item some libraries routines
\item environment variables
\end{enumerate}
that can be used to specify the region of code that can be run in
parallel, aka {\bf parallel region}, when OpenMP is enabled. OpenMP
is not designed to run across multiple machines like MPI. If you want
your application to be portable on both clusters and SMP machines, MPI
is the solution. As multicore continues to grow, the number of
processors on an SMP will continue to grow.
OpenMP is a highler-level abstraction for native Pthreads
programing. At its core, OpenMP uses Pthreads, yet the details are
hidden from the programmer. Like all other higher level approaches,
OpenMP sacrifices flexibility for the ease of writing code. OpenMP
shared-memory programming model is a collection of compiler
directives, library routines, and environment variables that can be
used to specify the shared-memory parallelism in Fortran, C, C++
programs. In our chapter, we discuss OpenMP Fortran only.
Important terminologies:
\begin{itemize}
\item {\bf Thread}:
\item {\bf Region}:
\item {\bf Parallel region}:
\item {\bf Task}:
\end{itemize}
The definition of an {\it active parallel region} has changed between
OpenMP2.5 and 3.0. Since 3.0, an active parallel region is a parallel
region that is being executed by more than one thread; otherwise, it
is an {\it inactive parallel region}.
Before the release of GCC 4.2, OpenMP was only available in commercial
compilers. In gcc and gfortran, OpenMP programs can be compiled using
{\bf -fopenmp} flag.
{\bf Example}: Ubiquitous matrix multiplication
\begin{verbatim}
! OpenMP version
$ gfortran -fopenmp -o matmult_omp matmult.f
! sequential version
$ gfortran -o matmult matmult.f
\end{verbatim}
To test the running time
\begin{verbatim}
$time ./matmult
$ time ./matmult_omp
\end{verbatim}
PGI compiler currently supports only ONE level of parallel, e.g. NO
support for nested parallel region. Thus, \verb!OMP_NESTED! should not
be changed.
There is an environment variable called OMP\_NUM\_THREADS that will tell
OpenMP binaries how many threads to use. If this is not defined, one
thread per CPU (core) is used. Using more threads than the available
CPUs or cores, can make the program run slower.
\section{Directives}
\label{sec:directives}
OpenMP directives mark the region of code that should be run in
parallel. The program should be compiled with
\begin{itemize}
\item PGI Fortran: ``mpif90'', or ``mpf95 -mp''
\item GCC:
\item Intel Fortran: ...
\end{itemize}
OpenMP Fortran directives (or Fortran parallelization directives) has
the form\footnote{\url{https://computing.llnl.gov/tutorials/openMP/}}
\begin{verbatim}
sentinel directive-name [clause]
\end{verbatim}
The \textcolor{red}{sentinel} should be separated with the
directive-name by at least one white space. In fixed-form source,
\verb!sentinel! can be either
\begin{verbatim}
!$OMP , C$OMP, or *$OMP
\end{verbatim}
and \textcolor{red}{starting at column 1}. In free-form source,
\verb!sentinel! must be \verb.!$OMP..
The directive can extend multiple line using an ampersand (\&) at the
end of the line (free-source form) or a non-blank character on the
fifth column of the continuous line (fixed-source form).
\begin{lstlisting}
!!! free-source form
!$OMP PARALLEL DEFAULT(NONE) SHARED(A, B) PRIVATE(C, D) &
!$OMP REDUCTION(+:A)
!!! fixed-source form
!$OMP PARALLEL DEFAULT(NONE) SHARED(A, B) PRIVATE(C, D)
!$OMPx REDUCTION(+:A)
\end{lstlisting}
The \textcolor{red}{directive-name} can be any of the following
directives, as shown in Table \ref{tab:OpenMP_Fortran_directive}. The
valid clause depend on the directive-name being used.
\begin{table}[hbt]
\begin{center}
\caption{Fortran directive-names for OpenMP}
\begin{tabular}{cl}
\hline
Directive-name & Explanation \\
\hline\hline
BARRIER & specify the point at which all thread should be \\
& synchronized, i.e. all threads have to complete \\
& their works before any of them can continue \\
CRITICAL ... END CRITICAL & define the region, inside a parallel
region, \\
& that should be run by a single thread only \\
MASTER ... END MASTER & specify the code that should be \\
& on the master thread only \\
PARALLEL DO & a short cut for PARALLEL ... END PARALLEL region \\
& that contains only a single DO directive right after \\
& the line of PARALLEL directive \\
DO ... END DO & divide and distribute this region to all available \\
& threads to run in parallel \\
PARALLEL ... END PARALLEL & define a {\it parallel region} within which \\
& you can define a parallel directives. Then, the region is
executed\\
& by only a single thread, until it reachs this parallel
directive\\
\end{tabular}
\end{center}
\label{tab:OpenMP_Fortran_directive}
\end{table}
As you can see, several OpenMP Fortran directives come in pairs and
have the form below
\begin{verbatim}
!$OMP directive
[ structured block of code ]
!$OMP END directive
\end{verbatim}
Such directives allow programmers to
\begin{itemize}
\item create task,
\item create loop, and
\item create parallel section work-sharing
constructs and synchronization constructs.
\item define how data is shared or copied between parallel threads of
execution.
\end{itemize}
Among all directives, the most important directives is the one in
charge of defining the so-called {\bf parallel region}; that is
\verb.!$OMP PARALLLEL/!$OMP END PARALLEL. The equivalent form is the
single \verb.!$OMP PARALLEL DO. that must be followed by a DO loop.
\begin{itemize}
\item To learn how to use OpenMP directives with your codes, read
\hyperref[sec:tasks]{Task}.
\item To learn how many threads are being used, read the Section
\ref{sec:openmp-statements}
%\ref{sec:envir-vari}.
\end{itemize}
References:
\begin{itemize}
\item \url{http://www.atnf.csiro.au/computing/software/sol2docs/mr/READMEs/openmp.html}
\end{itemize}
\section{OpenMP statements}
\label{sec:openmp-statements}
OpenMP also define a set of routines. To specify that a statement that
use these routine being compiled only for OpenMP-compliant compilers,
you can put \verb.!$. in front of the statement.
\begin{lstlisting}
!$ interval = L * OMP_get_thread_num() / &
!$ (OMP_get_num_threads() - 1)
\end{lstlisting}
% \section{Libraries Routines}
% \label{sec:libraries-routines}
There are rountines that can query the runtime enviroment, e.g. to
determine the number of thread are participating in the execution of a
parallel region of code.
% \section{Environment variables}
% \label{sec:envir-vari}
The performance of a parallel program is heavily determined by the
number of threads. This and many other information can be given to the
program via using an appropriate routines, or by default, are
specified in environment variables.
In Fortran, this can be done via
\begin{lstlisting}
CALL OMP_SET_NUM_THREADS(4)
\end{lstlisting}
To set any of the below environment variable, use
\begin{verbatim}
setenv OMP_SCHEDULE "STATIC, 5"
\end{verbatim}
or read section~\ref{sec:set-envir-vari}.
In general, we need to set \verb!OMP_NUM_THREADS! to a value no larger
than the available number of processors on the target platform. Set
\verb!OMP_DYNAMIC! to FALSE to use the number of threads specified by
\verb!OMP_NUM_THREADS!.
\section{Mechanism}
\label{sec:mechanism}
Before or after a parallel region, the code is executed by only one
thread. This is the way that traditional program executes, aka serial
program. Thus, these regions are called {\bf serial regions}. The
thread that execute the serial regions is called the
{\bf master thread}. When a parallel region is detected, the master
thread create a set of threads, called {\bf slave threads}, to run the
parallel region concurrently. The master thread can also join into the
computation like any slave threads.
Each slave thread has a unique {\bf thread number} ranging from zero
to $N_p-1$ with $N_p$ is the number of threads (probably including the
master thread if it join in the computation).
There are many information that a programmer can specify how the
parallel region is executed
\begin{enumerate}
\item the scopes of variables
\item the number of threads
\item special treatments of some variables
\item ...
\end{enumerate}
These are specified in a set of clauses
\begin{lstlisting}
!$OMP directive clause1 clause2 ...
\end{lstlisting}
Depending on the directive, we can have different clauses.
\section{PARALLEL region}
\label{sec:parallel-region}
There can be only ONE work-sharing construct only for a PARALLEL
region. It is aka {\bf combined parallel work-sharing
constructs}. This region can be defined by one of the two directives
\begin{lstlisting}
!(1)
!$OMP PARALLEL DO clause1 clause2 ...
DO ...
!(2)
!$OMP PARALLEL clause1 clause2
DO ...
!$OMP END PARALLEL
\end{lstlisting}
After PARALLEL DO must be a DO loop, yet within PARALLEL/END PARALLEL
pair, it can be anything. \verb.!$OMP END PARALLEL. has an implicit
synchronization, i.e. wait until all thread complete before returning
the control to the master thread.
The clauses that we can use for PARALLEL region are
\begin{enumerate}
\item PRIVATE(list): declare a list of variables that are local to
each thread.
\item SHARE(list)
\item DEFAULT(PRIVATE | SHARED | NONE)
\item FIRSTPRIVATE(list)
\item COPYIN(list)
\item REDUCTION(operator:list)
\item IF(\verb.scalar_logical_expression.)
\item \verb.NUM_THREADS(scalar_integer_expression).: specify the
number of threads to process the parallel region.
\end{enumerate}
It is possible to nest parallel regions into parallel regions.
\begin{lstlisting}
!$OMP PARALLEL
...
!$OMP PARALLEL
...
!$OMP END PARALLEL
...
!$OMP END PARALLEL
\end{lstlisting}
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/nested_parallel_region.eps}}
\caption{Execution of nested parallel regions}
\label{fig:nested_parallel_reg}
\end{figure}
\section{Clauses}
\label{sec:clauses-1}
There are two types of clauses: {\bf data scope attribute clauses}
(specify how each variable is handled and which thread is allowed to
see its value and change it), the second type involves all clauses
that don't fit to the first type.
\subsection{PRIVATE}
\label{sec:private}
A variable should have different values in each thread, then you
specify it in the PRIVATE clauses
\begin{lstlisting}
!$OMP PARALLEL PRIVATE(a, b)
\end{lstlisting}
with $a, b$ are two variables. Variables declared as private have an
undefined value at the peginning of the parallel region. The variables
like counters for DO-loop, FORALL command, or implicit DO-loop or any
one that are specified to be THREADPRIVATE become private
automatically to each thread.
\section{Tasks}
\label{sec:tasks}
Let's make yourself familiar with this concept: task. Any part of your
code can be considered as a task. Here are the properties of a task,
it has
\begin{enumerate}
\item some code to execute
\item its own data
\item an assigned thread to execute that code and use the data
\end{enumerate}
\subsection{Task construct}
\label{sec:task-construct}
A task construct is a {\it task directive} plus a structured block
\begin{verbatim}
#pragma omp task [clause[[,]clause] ...]
{
structured-block
}
\end{verbatim}
you can have one or more \verb!clause! which can be one of the
following
\begin{verbatim}
if (expression)
untied
shared (list)
private (list)
firstprivate (list)
default( shared | none )
\end{verbatim}
\subsection{Task scheduling}
\label{sec:task-scheduling}
\section{Write to shared-data}
\label{sec:write-shared-data}
Simultaneous write to a shared containers must be protected by
\begin{verbatim}
#pragma omp critical
\end{verbatim}
section.
\section{Parallel a DO loop}
\label{sec:parallel-do-loop}
A DO loop is very common. Then, the question is ``do we always be able
to parallelize a DO loop?'' - The answer is NO.
This is an example of DO loop in which the execution of one loop is
independent from the other.
\begin{lstlisting}
DO i=1, 10
A(i) = i
END
\end{lstlisting}
Here, the computation of A(i) and A(i+1) are independent.
However, let's learn all possible cases. Here is a classification of
data dependencies\footnote{\url{http://www.ibiblio.org/pub/languages/fortran/ch2-17.html}}
\begin{verbatim}
------------------------------
The dependencies we will discuss are:
Store before load Competing stores Load before store
================= ================ ===============
DO I=2,4 DO I=2,4 DO I=2,4
A(I-1) = B(I) A(I-1) = B(I) B(I) = A(I-1)
C(I) = A(I) A(I) = C(I) A(I) = C(I)
ENDDO ENDDO ENDDO
Recurrence Sum reduction
=============== ==============
DO I=2,4 SUM = 0.0
A(I) = A(I-1) DO I=2,4
ENDDO SUM = SUM + A(I)
ENDDO
\end{verbatim}
\subsection{Store before hand}
\label{sec:store-before-hand}
A technique called PRE-LOADING can make the code parallelable.
\begin{lstlisting}
DO I=2,4
TEMP = A(I)
A(I-1) = B(I)
C(I) = TEMP
ENDDO
\end{lstlisting}
\subsection{Recurrence}
\label{sec:recurrence}
This situation, the dependency is essential and cannot be removable.
\begin{lstlisting}
DO I=2,4
A(I) = A(I-1)
ENDDO
\end{lstlisting}
\section{Choice of number of threads}
\label{sec:choice-numb-thre}
When you set \verb!OMP_NUM_THREADS!, you over-ride any relationship
between the number of processors and the number of threads chosen by
OpenMP. If you set more threads than processors, you have threads
waiting for an opportunity to run. You would expect performance to
degrade when you choose too many
threads\footnote{\url{http://software.intel.com/en-us/forums/threading-on-intel-parallel-architectures/topic/47535/}}.
If you do any reading about threading or OpenMP, you will see there
are several common problems which could block speedup by threaded
parallelism. If it were not so, there wouldn't be justification for
supporting tools such as Intel Thread Profiler.
OpenMP compilers generally don't pay any attention to false sharing,
even in simple cases where it might be possible to diagnose. On modern
processors which have significant Instruction Level Parallelism when
running a single thread, you could easily find ways to defeat ILP with
threaded parallelism. These ideas hardly make a dent in the list of
possibilities.
\section{How to compile a program}
\label{sec:how-compile-program}
\begin{enumerate}
\item PGI: pgf95 (Fortran 90/95), pgf77 (FORTRAN 77)
\end{enumerate}
\subsection{PGI compiler}
\label{sec:pgi-compiler}
In the makefile, add the \verb!-mp! switch to FFLAGS variable to
enable processing of OMP pragmas
\section{How to run a program}
\label{sec:how-run-program}
code, see ''Parallel Programming Using PGI Compilers'' in the PGI User's
Guide.
\section{In C/C++}
\label{sec:cc++}
\subsection{Pragmas}
\label{sec:pragmas}
Pragmas is a dialect to directives and is a technical term in C/C++
programming language. Pragmas is not covered here. However, you should
know a pragmas for OpenMP begins with
\begin{verbatim}
#pragmas omp
\end{verbatim}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "gpucomputing"
%%% End: