\chapter{Multiple GPU Programming}
\label{chap:multiple_GPU}
\section{Introduction}
\label{sec:multi-GPU_intro}
The main purpose of using multiple GPUs is domain decomposition, in which
each GPU processes one subdomain (one chunk) of the data. This is helpful
when
\begin{enumerate}
\item the domain data is too large to fit on a single GPU (and halo exchange is needed between subdomains), or
\item there is abundant data parallelism between subdomains.
\end{enumerate}
There are two scenarios: (1) multiple GPUs within a single node, and (2) GPUs
spread across network nodes. The latter is used in very large scale
simulations; the former is used in simulations where 2 or at most 4 GPUs
are installed in a single machine.
\begin{framed}
Even though you can give different GPUs subdomains of different sizes,
you typically want the subdomains to be the same size: the code is easier to
write and the load is better balanced.
\end{framed}
\subsection{Important concepts (context, stream)}
\label{sec:context_stream}
Before we learn the different techniques for multi-GPU programming, there are
a few important CUDA concepts that we need to understand. Anything created in
device memory must belong to a certain context (see
Sect.~\ref{sec:cuda-context}). A {\bf CUDA context} is bound to one physical
device and has its own virtual address space. As a result, any CUDA application
requires at least one CUDA context; this context is known as the primary
context. Depending on the API we use, it is created as follows:
\begin{enumerate}
\item CUDA driver API (or OpenCL): an explicit call to
\verb!cuCtxCreate()! (or \verb!clCreateContext()!) creates a
context. A CUDA context is an object of type \verb!CUcontext!, and must be
assigned to a given GPU:
\begin{verbatim}
cuCtxCreate(CUcontext *, int flags, CUdevice gpuID)
\end{verbatim}
IMPORTANT: To assign a context to the calling host thread (in the case of
OpenMP) or process (in the case of MPI), we use \verb!cuCtxSetCurrent()!.
\item CUDA runtime API: first, you select the device with \verb!cudaSetDevice()!. At this point the context has not been created and initialized yet.
The context is created and initialized at the first call that changes the
device's state, e.g. \verb!cudaMalloc!, \verb!cudaMemcpy!, \verb!cudaFree!, or a kernel launch.
Before calling any of these runtime APIs, we can configure the device,
i.e. set the flags that would otherwise be passed to \verb!cuCtxCreate()!, using
cudaSetDeviceFlags(), cudaD3D9SetDirect3DDevice(),
cudaD3D10SetDirect3DDevice(), cudaD3D11SetDirect3DDevice(),
cudaGLSetGLDevice(), and cudaVDPAUSetVDPAUDevice()
(Sect.\ref{sec:initialize_GPU}). \end{enumerate}
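The following is a minimal sketch of the runtime-API path (the device ordinal 0 and the blocking-sync flag are arbitrary choices, and the dummy \verb!cudaFree(0)! is only a common idiom to force context creation at a controlled point, not a requirement):
\begin{lstlisting}
#include <cuda_runtime.h>

int main(void) {
    // Configure the device BEFORE the context is created
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);   // select the device; no context yet
    cudaFree(0);        // first state-changing call: the primary context
                        // is created and initialized here
    // ... cudaMalloc, kernel launches, etc. now run in this context
    cudaDeviceReset();  // explicitly tear the context down at the end
    return 0;
}
\end{lstlisting}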
\begin{framed}
The primary context remains active until it is explicitly deinitialized using
\verb!cudaDeviceReset()! (or the old name \verb!cudaThreadExit()!) or until the host
thread terminates. The time to initialize a device context can be a
critical factor in small problems, so it is suggested not to deinitialize the
context if you don't have to (except before exit or to recover from an unspecified
launch failure). {\bf CUDA setup time} is the time necessary to
initialize the CUDA context on the GPU, allocate memory, and release the
CUDA
context\footnote{\url{http://forums.nvidia.com/index.php?showtopic=158779}}.
\begin{verbatim}
Times in milliseconds
CreateContext         15.8
GetDeviceProperties    0
Malloc                29.5
Memset                 0
ThreadSynchronize      0   (without waiting for any real synchronization)
Free                   0.3
ThreadExit             4.2
\end{verbatim}
\end{framed}
Interoperation between CUDA contexts has been enabled in CUDA 4.0, yet with some
restrictions\footnote{\url{http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/online/group__CUDART__DRIVER.html}}.
With CUDA 3.2 and earlier, data allocated by one host thread cannot be shared with
another host thread, even though both may be working on the same GPU. As a result,
sharing data between GPUs in CUDA 3.2 and earlier is impossible. This has been
relaxed in CUDA 4.0 (to be discussed shortly).
\begin{framed}
Using multiple contexts for a single GPU within a single host thread is not
recommended (it degrades performance). It is highly recommended to use a one-to-one
device-to-context
mapping.
\end{framed}
Each CUDA context has a CUDA stream bound to it, and CUDA runtime
APIs implicitly use stream 0 (the NULL stream) by default. \textcolor{red}{In
essence, any CUDA operation requires a context and a stream.}
\begin{framed}
\textcolor{red}{In CUDA, one context maps to one GPU and cannot be changed; in
OpenCL, one context can be mapped to different GPUs, though only one at a time}.
\textcolor{red}{A context (associated with the calling host thread) is
destroyed either} by calling \verb!cudaThreadExit()! or \textcolor{red}{when the
host thread completes} (which implicitly calls cudaThreadExit()). The
\verb!cudaThreadExit()! API is deprecated in CUDA 4.0.
Since CUDA 4.0, we should use \verb!cudaDeviceReset()!, whose name reflects
better the purpose of the API: it resets the device
immediately, regardless of the context. It is the caller's responsibility to ensure that the device is not being used by any other host
threads of the process when this function is called.
\end{framed}
\subsection{Using multiple GPUs}
\label{sec:using_multi-GPUs}
CUDA APIs:
\begin{enumerate}
\item cudaGetDeviceCount, cudaGetDeviceProperties: enumerate devices
\item cudaSetDevice(i): selects the $i$-th device.
\end{enumerate}
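As a quick sketch (the loop and the printed fields are our choices, not from a particular sample), enumeration and selection with the runtime API look like this:
\begin{lstlisting}
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);          // how many GPUs are visible
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);     // query each device
        printf("Device %d: %s (compute %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    cudaSetDevice(0);  // all subsequent CUDA calls in this thread use device 0
    return 0;
}
\end{lstlisting}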
With the CUDA runtime API 3.2 and earlier, a single host thread can call
\verb!cudaSetDevice()! only once, i.e.
all following CUDA calls in that thread concern this one device. The
ways to use multiple GPUs are therefore:
\begin{itemize}
\item CUDA 3.2 and earlier
\begin{enumerate}
\item single host thread: CUDA driver API 3.2 and earlier
(Sect.\ref{sec:one-a-time_CUDA3.2-driver}). Can return to previous device and
reuse the data.
\item multiple host threads (OpenMP) (Sect.\ref{sec:GPUs_openMPI}) or multiple
processes (MPI) (Sect.\ref{sec:GPUs_cuda32.MPI}): each thread (or process) uses a different device via calls to \verb!cudaSetDevice()!.
\item using pthreads: Sect.\ref{sec:GPUs_pthreads}.
% \item multiple host threads belonging to the same processes (e.g. OpenMP,
% pthreads)
NOTE: threads in the same process share the same address space.
\item Using XMP (Sect.\ref{sec:GPUs_XMP}).
% \item multiple CPU processes (e.g. MPI)
%
\item Using GMAC (Sect.\ref{sec:GPUs_gmac})
\item using GPUDirect 1.0 + Infiniband: across the network
(Sect.\ref{sec:GPUs_GPUDirect1}).
% \item GPUDirect 1.0 + Infiniband (IB) network (Sect.\ref{sec:GPUs_GPUDirect1}),
\end{enumerate}
In all of the approaches above, we cannot communicate data directly from one device to
another. So, they work well for applications where the work
processed on different GPUs requires no or only rare data exchange.
To make multi-GPU programming easier, and to make communication between
GPUs possible, CUDA 4.0 added some new features, targeting
Fermi-based devices.
\item CUDA 4.0 and later: Since CUDA 4.0, with UVA (unified virtual addressing), a
single host thread can easily work with multiple GPUs within a node
using the CUDA runtime API, and can coordinate work among multiple GPUs
(e.g. halo exchange) with peer-to-peer communication (data transfer and access).
There are three ways to control multiple GPUs
\begin{enumerate}
% \item single host thread: CUDA runtime API 3.2 and earlier
% (Sect.\ref{sec:one-a-time_CUDA3.2-runtime}). Cann't return to previous device.
\item a single host thread switches
from one device to another using the CUDA
runtime API (Sect.\ref{sec:GPUs_cuda4_singlethread}).
\item using the CUDA driver API, which is easier now (Sect.\ref{sec:GPUs_driverAPI4.0}).
\item using GPUDirect 2.0: enables peer-to-peer communication within a node
(Sect.\ref{sec:GPUs_GPUDirect2}).
\end{enumerate}
\end{itemize}
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/multipleGPU.eps}}
\caption{Summary of how to use multiple GPUs}
\label{fig:case_multipleGPU}
\end{figure}
A GPU device is part of its context and cannot be changed once the context is
established; thus, within one context a host thread cannot switch from one GPU to
another. You therefore have two options:
\begin{enumerate}
\item use more than one GPU, but only one at a time, or
\item use multiple GPUs at the same time.
\end{enumerate}
\section{CUDA + MPI}
\label{sec:CUDA-MPI}
MPI is a standard for exchanging data between processes via messages, with APIs
for two kinds of message exchange:
\begin{verbatim}
Point-to-point: e.g. MPI_Send, MPI_Recv
Collectives:    e.g. MPI_Reduce
\end{verbatim}
There are multiple implementations (open source and commercial) of this standard, with different language bindings:
\begin{verbatim}
Bindings for C/C++, Fortran, Python, ...
E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, ...
\end{verbatim}
Example: a minimal MPI program
\begin{lstlisting}
#include <mpi.h>
int main(int argc, char *argv[]) {
int rank,size;
/* Initialize the MPI library */
MPI_Init(&argc,&argv);
/* Determine the calling process rank and total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
/* Call MPI routines like MPI_Send, MPI_Recv, ... */
...
/* Shutdown MPI library */
MPI_Finalize();
return 0;
}
\end{lstlisting}
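When combining MPI with CUDA, each rank typically selects its own GPU. A minimal sketch, assuming the ranks on a node share its GPUs and using rank modulo the device count as the mapping (both choices are ours, not mandated by MPI or CUDA):
\begin{lstlisting}
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int rank, size, deviceCount;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(rank % deviceCount);   // simple rank-to-GPU mapping

    /* ... allocate device memory, launch kernels, copy results back
       to host buffers and exchange them with MPI_Send/MPI_Recv ... */

    MPI_Finalize();
    return 0;
}
\end{lstlisting}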
\section{Multi-GPU synchronization: cudaStreamWaitEvent}
\label{sec:cudaStreamWaitEvent}
Streams are associated with devices,
but \verb!cudaStreamWaitEvent()! can make a stream wait on an event recorded on
another GPU. This allows inter-GPU synchronization.
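A minimal sketch of the idea (the kernels \verb!kernelA!/\verb!kernelB!, the buffers, and the launch configuration are placeholders): GPU 1's stream is made to wait until GPU 0 has recorded an event, so the second kernel does not start before the first one has finished.
\begin{lstlisting}
cudaStream_t s0, s1;
cudaEvent_t  done0;

cudaSetDevice(0);
cudaStreamCreate(&s0);
cudaEventCreate(&done0);
kernelA<<<grid, block, 0, s0>>>(d_a);   // work on GPU 0
cudaEventRecord(done0, s0);             // mark completion in stream s0

cudaSetDevice(1);
cudaStreamCreate(&s1);
cudaStreamWaitEvent(s1, done0, 0);      // s1 (GPU 1) waits for GPU 0's event
kernelB<<<grid, block, 0, s1>>>(d_b);   // safe to consume GPU 0's result
\end{lstlisting}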
\section{Access data from a different GPU}
See Sect.\ref{sec:cudaHostAlloc}. Allocate portable pinned host memory with
\begin{verbatim}
cudaHostAlloc(..., cudaHostAllocPortable)
\end{verbatim}
Once such data is allocated, call \verb!cudaHostGetDevicePointer()! for each GPU that needs to access it.
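A minimal sketch (the flag combination and the buffer size are our choice; on pre-UVA systems each device also needs \verb!cudaSetDeviceFlags(cudaDeviceMapHost)! before its context is created): the buffer is allocated once as portable and mapped, and each GPU obtains its own device pointer to the same host memory.
\begin{lstlisting}
#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    float *h_buf;               // host pointer (pinned)
    float *d_buf0, *d_buf1;     // per-device views of the same buffer

    // Portable: usable by all CUDA contexts; Mapped: accessible from kernels
    cudaHostAlloc((void**)&h_buf, N * sizeof(float),
                  cudaHostAllocPortable | cudaHostAllocMapped);

    cudaSetDevice(0);
    cudaHostGetDevicePointer((void**)&d_buf0, h_buf, 0);  // view for GPU 0

    cudaSetDevice(1);
    cudaHostGetDevicePointer((void**)&d_buf1, h_buf, 0);  // view for GPU 1

    // Kernels launched on either GPU can now read/write h_buf through
    // d_buf0 / d_buf1 (zero-copy access over PCIe).
    cudaFreeHost(h_buf);
    return 0;
}
\end{lstlisting}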
% \section{One at a time}
% \label{sec:one-at-time}
\section{Intel Chipset Architecture}
It's important to know the CPU architecture when programming multiple GPUs.
Intel has three main categories of chipsets:
\begin{enumerate}
\item 4xx series: use the PCI bus for interconnection between components. These
include: 80486, Pentium, Pentium Pro/II/III, Southbridge 4xx chipsets (using
PIIX).
\item 8xx series: use IOH (Hub links) for interconnection between components.
These include: Pentium II/III, Pentium III-M (mobile), Pentium 4, Pentium
4-M/Pentium M/Celeron M, Southbridge 8xx chipsets.
\item 9xx and 3/4 series: use PCI Express for interconnection between
components.
These include: Pentium 4/Pentium D/Pentium EE, Pentium M/Celeron M, Core/Core
2 mobile, Core 2, Southbridge 9xx chipsets.
\end{enumerate}
Core 2 and these chipsets use the same design concept,
Fig.\ref{fig:Intel_series4}, in which the Front Side Bus (FSB) is used to transfer
data between the CPU and the Northbridge, which contains a memory controller and a
PCI-e graphics controller. In Core 2, the Northbridge connects to the Southbridge via
the Direct Media Interface (DMI).
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/Intel_series4.eps}}
\caption{Front Side Bus (FSB) is heavily used in the design of Intel series 4}
\label{fig:Intel_series4}
\end{figure}
In Northbridge/Southbridge designs, the CPU connects directly to the Northbridge and uses it
to interact with high-speed components (RAM, AGP cards). The Northbridge connects to
the Southbridge and uses it to connect with low-speed components (PCI, USB, ISA, IDE
(hard drive), ACPI, AC'97 (sound controller)). The Southbridge is known as the {\bf ICH}
(Input/Output Controller Hub), which was later renamed the {\it Legacy I/O
Controller Hub} when Intel introduced the Intel X58 I/O hub. PIIX is the predecessor
of the ICH, and connects to the Northbridge through an internal PCI bus (133 MB/sec). The ICH
uses a proprietary interface (known as the Hub Interface) to connect to the Northbridge
at 266 MB/sec.
With the ICH in the architecture, the Northbridge is called (1) Memory Controller Hub
(MCH), or (2) Graphics and Memory Controller Hub (GMCH) if it has integrated
graphics. There are different versions of the ICH.
\begin{enumerate}
\item ICH0 (82801AB):(year 1999) support Ultra ATA/33, 4 PCI slots
\item ICH (82801AA): support Ultra ATA/66, 6 PCI slots
\item ICH2 (82801BA) and ICH2-M (82801BAM mobile) : (year 2000) with
360 pins, support ATA/100, 4 USBs, AC'97 support 6 channel sound
\item ICH3-S (82801CA server) and ICH3-M (mobile): (year 2001) with 421
pins, use 6 USB 1.1 (NO version for desktop motherboards)
\item ICH4 (82801DB) and ICH4-M: (year 2002), support 6 USB 2.0, AC'97
specification version 2.3
\item ICH5 (82801EB base), 82801ER (RAID0), 6300ESB (Enterprise Southbridge):
(year 2003) with 460 pins, support SATA, optionally RAID0, 8 USB 2.0, ACPI 2.0
\item {\bf ICH6} (82801FB), ICH6-R (RAID), 6311ESB, 6321ESB (Enterprise
Southbridge with Integrated LAN): (year 2004) use 4 PCI Express x1 (first
time), PCI Express x4 (replace Hub Interface giving 1GB/sec), 2 SATA (first
time), remove PATA, optionally RAID (0,1,0+1, Intel Matrix RAID)
\item ICH7 (82801GB), ICH7-DH (Digital Home): (year mid-2005) with 2 PCI
Express x1 slots, SATA (300 MB/sec)
\item ICH8 (82801HB): (year 2006) with eSATA and Gigabit Ethernet (first time)
\item ICH9 : (year 2007) remove all PATA support.
\item {\bf ICH10} (82801JB): (year 2008) 10 Gbits/sec bidirectional DMI
(Direct Media Interface) to replace FSB, with 6 PCI Express 1.1, 6 SATA (3 Gbits/sec),
uses Intel High Definition Audio (better than AC'97).
\end{enumerate}
\begin{framed}
Intel chipset naming convention, e.g. Q55, Q57:
\begin{enumerate}
\item first character: P=, Q = business-oriented, X = eXtreme, H = home user,
G = with integrated graphics
\item the following digits: indicate the series
\end{enumerate}
\end{framed}
\begin{figure}[hbt]
\centerline{
% \includegraphics[height=5cm,
% angle=0]{./images/Intel_i5.eps},
\includegraphics[height=5cm,
angle=0]{./images/Intel_series5.eps},
\includegraphics[height=5cm,
angle=0]{./images/Intel_Core-i7.eps}}
\caption{(A) Intel 5 series design. (B) Intel Core-i7 single socket}
\label{fig:Intel_i5}
\end{figure}
The Intel Hub Architecture uses a Northbridge and a Southbridge. Since 2008, Intel has
made a significant change in the motherboard design concept,
Fig.\ref{fig:Intel_i5}. The memory controller is moved onto the CPU. The FSB is replaced
by the Intel QuickPath Interconnect (QPI); e.g. a 20-lane QPI link pair with a 3.2 GHz clock
can deliver 25.6 GB/sec (double the theoretical bandwidth of a 1600
MHz FSB). Intel Core i7, using the Nehalem microarchitecture, was the first
to use QPI, connecting to an X58. QPI can also be used to connect 2 CPUs, first
done in Intel Xeon (March 2009) and Itanium (Feb 2010). Some lower-end Nehalem
parts with an integrated graphics controller on the CPU do not use QPI, but
DMI instead (2 GB/sec or 5 GT/s).
\begin{enumerate}
\item 5/6/7 series: Core i7 was introduced in late 2008 with the Nehalem
microarchitecture as its basis. Core i5 was introduced in Sept
2009, with DMI at 2.5 GT/sec; Core i3 was introduced in Jan 2010 for low-end
performance processors. Nehalem uses 45nm technology (2008). Westmere is a 32nm
die shrink of Nehalem (2010), with support for huge pages 1GB in size.
\begin{itemize}
\item 5 series (Ibex Peak and Tylersburg): Ibex Peak is used by the Core i5 series,
with 32nm technology. The Northbridge and Southbridge are removed. A new
component, the Platform Controller Hub (PCH), provides peripheral
connectivity and graphics display with the Flexible Display Interface. The FSB is
replaced by DMI. The CPU interacts directly with the PCI Express GPU and DDR3 memory
for fast access. The ICH is integrated into the PCH. The Intel 5 series uses ICH10/ICH10R;
the Intel 5500 series (Intel 5520, Intel 5500 chipset) uses ICH9 or ICH10.
Intel X58 (Tylersburg) supports Core i7 and the Xeon 5500 series with 45nm technology.
It doesn't have a memory controller hub (MCH), thus is called an X58 I/O hub (IOH, not ICH) and
functions as a Northbridge to connect with the CPU via QPI. X58 uses 40 PCIe lanes.
The X58 QPI can deliver 12.5 GB/sec each direction, and X58 supports PCI-e 2.0. Dual-socket
Intel Xeon 5500 series systems use the Tylersburg hub, with QPI connecting the 2
CPUs\footnote{\url{http://www.avadirect.com/intel-nehalem-ep/intel-tylersburg-chipset-motherboards.htm}}.
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/Intel_X58.eps}, \includegraphics[height=5cm,
angle=0]{./images/Intel_Xeon_5500.eps}}
\caption{(A) Intel Core i7 series design. (B) Intel Xeon 5500 series}
\label{fig:Intel_X58}
\end{figure}
\item Core i 6 series (Sandy Bridge chipsets): uses 32nm technology, supports
DDR3-1333 (tested to work with DDR3-2133 also). The built-in GPU has 12
execution units (EUs).
Sandy Bridge-EN and Sandy Bridge-EP are the basis of Xeon E5.
\item Core i 7 series (Ivy Bridge chipsets). Sandy Bridge is the new
microarchitecture that replaces the previous Pentium families
(P4, P5, P6, Nehalem): a 64-bit, quad-core, dual-threaded,
4-issue, out-of-order microprocessor with the new AVX instructions
(Advanced Vector eXtensions, supporting 256-bit FP with 3 and 4 operands).
Ivy Bridge uses 22nm technology with tri-gate transistors,
supports DDR3-1600, and adds the new RdRand random number generator instruction. Its built-in
GPU has 16 EUs and supports OpenCL 1.1.
Ivy Bridge-E processors (2013) are to have 12 cores and a 30MB L3 cache. Xeon E5
V2 is to replace the Sandy Bridge based Xeon E5.
\end{itemize}
\item 8 series (Lynx Point): use LGA 1150 socket (Socket H3) for Haswell and
Broadwell microarchitecture, with 6 USB 3.0, 6 SATA 3.0. Intel Xeon C228
chipset uses LGA 1150 socket. New features: Intel's Transactional
Synchronization Extensions (TSX) to improve multi-core efficiency.
\end{enumerate}
\begin{framed}
The first generation of Intel i7 is based on Nehalem. The next generation of
Intel i7 and Xeon E3 (codename: Ivy Bridge) is based on Sandy Bridge with an
integrated GPU and uses 22nm technology (rather than 32nm).
The Intel Xeon 5600-series, starting with 6 cores (and a few 4-core parts), is based on
Westmere. The Intel Xeon 5500-series, with 4 cores, is based on Nehalem.
AMD has a new microprocessor architecture called Interlagos
(Bulldozer)\footnote{The Bulldozer architecture uses 2 integer units sharing an FP unit in a
single core.}, Fig.\ref{fig:SandyBridge_Westmere}, that uses HyperTransport
Technology (25.6 GB/s over a 16-bit wide link), an alternative to QPI.
\end{framed}
Modern devices put the Northbridge onto the CPU die, i.e. System-on-Chip
processors. Examples are the Intel Sandy Bridge architecture and AMD's Fusion
architecture. Intel Haswell (expected to be available in June 2013) will incorporate
both Northbridge and Southbridge on the CPU die for the Ultrabook platform. This is
known as the {\it Platform Controller Hub}.
\subsection{TSX}
\section{QPI and IOH (2 sockets on a mainboard)}
\label{sec:QPI_IOH}
Traditionally, on a machine with 2 CPUs, each CPU connects to its own set of RAM.
Programs running on one CPU cannot directly utilize memory attached to the other CPU.
Modern CPUs therefore include hardware support for ``unified address spaces,'' where
multiple CPUs can access one another's memory efficiently, e.g. Intel's QPI or
AMD's HyperTransport.
QPI is Intel's QuickPath Interconnect technology, announced in 2007 and debuted
in late 2008. QPI is an interface for point-to-point communication between 2
hardware components. It is the replacement for older technologies, e.g. the front-side bus
(FSB), CSI, etc., and is designed primarily for servers and workstations.
\begin{figure}[hbt]
\centerline{\includegraphics[height=3.5cm,
angle=0]{./images/QPI_DMI.eps}}
\caption{Xeon X6550 CPU}
\label{fig:QPI_DMI}
\end{figure}
\begin{mdframed}
DMI 2.0 links the CPU to peripherals (e.g. memory, graphics card), at a
much slower rate (5 Gbps, gigabits per second), Fig.\ref{fig:QPI_DMI}.
\footnote{\url{http://superuser.com/questions/692058/dmi-2-0-vs-8-0-gt-s-qpi}}
QPI is a point-to-point, high-speed link between processors only (CPU-CPU,
CPU-GPU). QPI can run at 6.4 GT/s (i.e. 25.6 GB/s using DDR3 1066 MHz) or 8 GT/s
(i.e. 32 GB/s using DDR3 1333 MHz), in gigabytes per second, depending on which RAM
it connects to.\footnote{\url{https://communities.intel.com/message/125528}}
QPI (Intel) and HyperTransport (AMD) are not buses, but point-to-point
connections. A bus is a set of wires (150 wires in total) that allows several
components to be connected, while a point-to-point link is a path connecting ONLY two devices.
\footnote{\url{http://www.hardwaresecrets.com/article/Everything-You-Need-to-Know-About-The-QuickPath-Interconnect-QPI/610}}
QPI provides two separate lanes (84 wires in total), one for reading and the other for
writing, so both can happen at the same time. This is impossible with buses, as reads and writes share the same
lane. Another advantage is that QPI uses fewer wires than the FSB.
QPI is faster than HyperTransport: the maximum transfer rate of HyperTransport is
10.4 GB/s (the Phenom processor uses HyperTransport at only 7.2 GB/s). This is even
lower due to the 4 GB/s CPU transfer limit of AMD Athlon (formerly known
as Athlon 64) and Athlon X2 (formerly known as Athlon 64 X2).
\end{mdframed}
QPI can be used to connect 2 processors on a multi-socket motherboard, a
processor and an I/O hub (IOH), or one IOH and another IOH,
Fig.\ref{fig:QPI_connect}. Here, an IOH has 2 QPI channels and 36 PCI-E Gen.2 lanes
plus ESI. In Fig.(B), where each CPU connects to its own IOH, the two IOHs need
to connect to each other. Traditionally, the front side bus (FSB) is the memory
bus shared by I/O requests and memory accesses. With the new generation of Intel
CPUs, there is another memory bus dedicated to memory access only, the IMC
(Integrated Memory Controller), which connects the CPU to the DDR3
memory (Ch0-Ch1-Ch2 in the figure). The bus that connects the CPU to the
external world is the QPI. NOTE: \textcolor{red}{As the memory connects directly
to the processor via the IMC, the IOH now has no memory channels, only a set of
PCI-e buses}.
% \begin{framed}
% I/O hub (IOH) is a new term from Intel which refers to Intel X58.
%
% \end{framed}
\begin{figure}[hbt]
\centerline{\includegraphics[height=3.5cm,
angle=0]{./images/QPI_connect.eps}}
\caption{(A) A 2-socket system with a single IOH, (B) A 2-socket system
with 2 IOHs. NOTE: C1-C4 = core index, Ch0-Ch2 = integrated memory
controller, support DDR3}
\label{fig:QPI_connect}
\end{figure}
Each QPI lane can transfer 20 bits at a time. Of these 20 bits, 16 bits are
used to transfer data, and the remaining 4 bits are for the correction code (CRC,
Cyclic Redundancy Check). In addition, QPI can treat each 20-bit lane as four
5-bit lanes, Fig.\ref{fig:QPI_5bit-lanes}. This division is meant to improve
reliability, especially in the server market (and is thus not
available on desktops). It allows the system to shut down the portion of the
lane that is physically damaged; the transfer rate is reduced, but the system
won't fail.
\begin{figure}[hbt]
\centerline{\includegraphics[height=3.5cm,
angle=0]{./images/QPI_5bit-lanes.eps}}
\caption{QPI can run at four 5-bit lanes }
\label{fig:QPI_5bit-lanes}
\end{figure}
The first version, QPI 1.0, works at a clock rate of
3.2 GHz, transferring two data items per clock cycle (DDR, double data rate),
making the bus work as if it had a 6.4 GHz clock rate. However, Intel uses a
different measure, called GT/s (gigatransfers per second). GigaTransfers refer to
the amount of data transferred (not only payload data, but also the bits that are lost
to interface overhead).
\begin{verbatim}
6.4 GHz x 16 bits / 8 = 12.8 GB/s (each QPI lane)
                      = 25.6 GB/s (two data-path QPI read+write)
\end{verbatim}
QPI provides 3 different power modes: L0, L0s and L1.
\begin{itemize}
\item L0: QPI fully functional
\item L0s: the data wires and the circuits that drive these wires are turned
off
\item L1: everything is turned off, in return for a higher wake-up time.
\end{itemize}
On a desktop system there is typically a single socket on the mainboard. Core
i7 (a Nehalem architecture processor) has a single QPI channel, while Xeon 5500
(April 2009) has 2 QPI channels, i.e. for a 2-socket mainboard. Each socket fits a single
processor. Processor families based on QPI are Nehalem, Westmere, and Tukwila.
\begin{verbatim}
The rate is calculated as
3.2 GHz
x 2 bits/Hz (double data rate)
x 20 (QPI link width)
x (64/80) (data bits/flit bits)
x 2 (unidirectional send and receive operating simultaneously)
/ 8 (bits/byte)
= 25.6 GB/s
\end{verbatim}
QPI bandwidth is 12.8 GB/s per direction, giving 25.6 GB/s bidirectionally. ``Each
PCI-E Gen 2 lane operates at 5Gbit/sec for a net bandwidth of 500MB/s per lane,
per direction. A x4 PCI-E Gen 2 channel is rated 2GB/s per direction, and 4GB/s
per direction for the x8 channel. So while the 36 PCI-E Gen2 lanes on the 5520
IOH are nominally rated for 18GB/s per direction, the maximum bandwidth per QPI
is still 12.8GB/s per direction''.
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/sandybridge_westmere.eps}}
\caption{Sandy Bridge and Westmere}
\label{fig:SandyBridge_Westmere}
\end{figure}
The first generation, QPI 1.0, runs at 6.4 GT/s, with 20 lanes of high-speed PCI-E
2.0 and 20 GB/s of bandwidth, of which 4 lanes are dedicated to DMI (Intel's Direct Media
Interface). GPUs and RAID controllers can be connected directly to the CPU. It
also has enough PCI-E lanes to handle a dual-GPU configuration (or a GPU and a
high-performance network card or storage controller). In this version, QPI also appears in
consumer parts, Fig.\ref{fig:QPI_1.0}. Nehalem-EP/EX, Westmere-EP/EX
and Tukwila use QPI 1.0.
The second generation, QPI 1.1, is an incremental enhancement at the physical layer (running at
higher frequencies) and the logical layer (using only the home snooping protocol, removing
source snooping), to run at 6.4 GT/sec, 7.2 GT/sec, or 8 GT/sec. QPI 1.1 is mainly
designed for 2-socket servers, and is backward compatible with QPI 1.0. Sandy
Bridge-EP and the Romley platform (the two-socket Sandy Bridge-EP based server line)
will be the first products to use QPI 1.1, followed by Ivy Bridge-EP/EX.
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/QPI_1.0.eps}}
\caption{QPI 1.0}
\label{fig:QPI_1.0}
\end{figure}
\begin{enumerate}
\item \url{http://www.realworldtech.com/qpi-evolved/}
\item \url{http://www.realworldtech.com/common-system-interface/2/}
\item
\url{http://www.qdpma.com/systemarchitecture/systemarchitecture_qpi.html}
\end{enumerate}
\section{SLI}
\label{sec:SLI}
Nvidia Scalable Link Interface (SLI) is a technology that links multiple GPUs on
the same machine to improve graphics performance, i.e. rendering, by dividing
the workload across multiple GPUs.
Two GPUs must be attached to two PCI-e x16 slots, and linked together using
external SLI bridge connectors. SLI is enabled via Nvidia control panel. Then,
the driver can treat both GPUs as a single logical device, and divide the
workload automatically depending on the selected mode. There are 5 modes:
\begin{enumerate}
\item AFR = alternate frame rendering, i.e. even frames are done by GPU0 and
odd frames by GPU1 (the preferred mode: a whole frame is done by a single GPU at a
time, and thus requires little inter-GPU communication).
\item SFR = split frame rendering, i.e. a frame is split into, say,
2 regions, and each region is processed by a single GPU. Boundary processing is
required, i.e. some of the work is duplicated, and the communication overhead is
higher.
\item Boost Performance Hybrid SLI: this mode is somewhat similar to AFR;
however, one GPU will do more frames before switching to the other. This scenario
works well when you have one powerful GPU and the other is a lower-end one.
\item SLIAA: improves antialiasing, but not the performance level. The setting is
in ``Antialiased settings'', which can be SLI8x or SLI16x (if two GPUs are
used), or SLI32x (if 4 GPUs are used).
\item Compatibility mode: the default setting for all applications that don't have
an SLI profile. Here one GPU is used for display; the other ones can be
either idle, or in use by other applications as a separate device.
\end{enumerate}
The choice of an appropriate mode is based on the SLI profile, which can be created
for a single application by sending the application to NVIDIA. Then, in the next release of
the driver, the profile of your application will be included, making it
available to end users. If your application doesn't have an SLI profile, a good
choice is to use SLIAA mode.
\textcolor{red}{An important point when using SLI is that data is duplicated on all GPUs
involved. So, using two GPUs, each with 512MB, we still have a total of 512MB for
video display}. Similarly, when using CUDA with SLI, an allocation made
on one CUDA device will also consume memory on the other GPUs that are
part of the SLI configuration.
Also, each GPU in the SLI configuration requires a separate CUDA context, and the
application needs to identify the CUDA device handle for the device being used for rendering in the
current and the next frame.
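For OpenGL applications, one way to do this (a sketch; the surrounding render loop and the array size are our choices, and an active OpenGL context is assumed) is the \verb!cudaGLGetDevices()! query, which reports the CUDA device(s) corresponding to the GPU rendering the current or next frame:
\begin{lstlisting}
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>   // requires an active OpenGL context

void pickRenderingDevice(void) {
    unsigned int count = 0;
    int cudaDevices[8];
    // Which CUDA device(s) will render the NEXT frame in this SLI group?
    cudaGLGetDevices(&count, cudaDevices, 8, cudaGLDeviceListNextFrame);
    if (count > 0)
        cudaSetDevice(cudaDevices[0]);  // run CUDA work on that same GPU
}
\end{lstlisting}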
\section{GPUDirect}
\label{sec:GPUDirect}
\begin{enumerate}
\item GPUDirect v1 (introduced with CUDA 3.1 for use with IB
HCAs - Sect.\ref{sec:GPUDirect-v1}).
\item GPUDirect v2 = GPUDirect peer-to-peer (introduced in CUDA 4.0, built on the
UVA feature of Fermi-class GPUs - Sect.\ref{sec:GPUDirect-v2}),
for transfer of data between two CUDA GPUs on the same PCIe fabric only. It
does not enable interoperability with any other kind of device.
\item GPUDirect v3 = GPUDirect RDMA (introduced for Kepler-class GPUs -
Sect.\ref{sec:GPUDirect-v3}):
transfers between a GPU and any third-party device, with the constraint that the two
devices must be on the same PCI Express root complex.
\end{enumerate}
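As an illustration of the peer-to-peer path (v2), the sketch below checks whether device 0 can address device 1's memory and then copies a buffer directly between the two GPUs; the buffer size is arbitrary and error checking is omitted.
\begin{lstlisting}
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;
    float *d0, *d1;
    int canAccess = 0;

    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 map GPU 1's memory?

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // allow GPU 0 to access GPU 1

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Direct GPU-to-GPU copy; falls back to staging through the host
    // if peer access is not possible on this PCIe topology.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
\end{lstlisting}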
\section{Multi-GPU CUDA 3.2 and earlier }
\label{sec:multi-GPU_3.2}
\subsection{CUDA runtime API 3.2 or earlier {\it per se}}
\label{sec:one-a-time_CUDA3.2-runtime}
Not possible: with the runtime API alone, a host thread is bound to a single device, and \verb!cudaSetDevice()! can be called only once per thread.
\subsection{CUDA driver API 3.2 or earlier}
\label{sec:one-a-time_CUDA3.2-driver}
Even though we can only use one GPU at a time, we can return to a device
after switching to another one via the push/pop context functions
\verb!cuCtxPushCurrent()/cuCtxPopCurrent()! (Sect.\ref{sec:GPUs_driverAPI4.0}).
With this approach only a single GPU is
used at a time, but at runtime the host thread can switch from one
GPU to another.
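A minimal sketch of this pattern with the driver API (kernel launches and error checking are omitted for brevity):
\begin{lstlisting}
#include <cuda.h>

int main(void) {
    CUdevice  dev0, dev1;
    CUcontext ctx0, ctx1, dummy;

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);
    cuCtxCreate(&ctx0, 0, dev0);   // ctx0 becomes current
    cuCtxPopCurrent(&dummy);       // detach it so we can switch freely
    cuCtxCreate(&ctx1, 0, dev1);   // ctx1 becomes current
    cuCtxPopCurrent(&dummy);

    cuCtxPushCurrent(ctx0);        // work on GPU 0 ...
    /* cuMemAlloc, cuLaunchKernel, ... */
    cuCtxPopCurrent(&dummy);

    cuCtxPushCurrent(ctx1);        // ... then switch to GPU 1
    /* cuMemAlloc, cuLaunchKernel, ... */
    cuCtxPopCurrent(&dummy);

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
    return 0;
}
\end{lstlisting}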
\subsection{CUDA 3.2 runtime with MPI}
\label{sec:GPUs_cuda32.MPI}
One CUDA context is created per CPU thread, and one CPU thread can use
only one CUDA context. Different CPU threads cannot share a CUDA context,
which is a huge limitation.
So, to use multiple GPUs, we need one CPU thread (or process) per device. As a
result, one thread cannot do anything with GPU data owned by another CPU
thread. The only way around this is a ``proxy pattern'' that allows
one thread to tell another thread to copy its data back to host memory
so that the first thread can use it.
To prevent data races, we need to use barriers/semaphores together with MPI. The
common model is to have a main host thread that does the I/O to
permanent storage (hard drive) and controls the other slave host
threads. The slave host threads (1) copy data to/from the device and (2)
launch kernels. The data is split into as many chunks as there are GPU devices.
Each slave thread is described by a structure
\begin{verbatim}
struct token {
    thread_t threadID;   /* handle of the slave host thread */
    int      gpuID;      /* GPU used by that thread          */
};
\end{verbatim}
which contains the thread ID and the id of the GPU that the thread uses. Example:
\verb!vecadd()! is the function run by the slave host thread.
The main limitation is that each host thread can only access
the memory objects that it allocates itself.
Semaphore: when a piece of data has already been copied to the device, the thread that
owns that memory can raise a semaphore, telling the other
threads that they may use the data; but it is still this thread that has to do
the copy back. Page-locked memory buffers may be used to store the
data copied back, and they can be very big.
QUESTION: How do we know the limit on page-locked memory on a
machine? (TUAN)
% \subsection{MPI}
% \label{sec:mpi}
\verb!data_server()! is the process (node) that does the data allocation.
In the stencil example, each node works on a subvolume of the
data. Suppose the data is not split along the z-dimension, only along the
xy-plane. There is overlap between the subvolumes, i.e. 2 slices on
the left and 2 on the right, so each node really processes 4 more slices
than it owns.
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/MPI_GPU.eps}}
\caption{MPI and CUDA}
\label{fig:MPI_GPU}
\end{figure}
To increase performance, compute the boundary slices that the neighbouring
process needs first; while that boundary data is being exchanged, the kernel
computes the rest of the subvolume.
By default, memory allocated on the host only supports synchronous data
transfers. So, we need to use \verb!cudaMallocHost()! to allocate host
memory for the ghost data, which can then be used for asynchronous data transfers.
Here, when Infiniband is used, the data from the GPU is copied internally (not seen
by the user) between CUDA buffers and Infiniband buffers.
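A sketch of this overlap pattern follows; every identifier in it (the \verb!stencil_boundary!/\verb!stencil_interior! kernels, the buffers, the neighbour ranks, the sizes, and the launch configurations) is a placeholder for the application's own variables, not a fixed API.
\begin{lstlisting}
// Assumed to exist elsewhere: stencil_boundary/stencil_interior kernels,
// d_sub, d_ghost, h_halo_recv (pinned), leftRank/rightRank, haloBytes,
// gridB/gridI/block.
float *h_halo;                               // pinned buffer for ghost slices
cudaMallocHost((void**)&h_halo, haloBytes);  // required for async copies

cudaStream_t sBoundary, sInterior;
cudaStreamCreate(&sBoundary);
cudaStreamCreate(&sInterior);

// 1. Compute the slices the neighbours need, in their own stream
stencil_boundary<<<gridB, block, 0, sBoundary>>>(d_sub);
cudaMemcpyAsync(h_halo, d_sub, haloBytes,
                cudaMemcpyDeviceToHost, sBoundary);

// 2. Meanwhile, compute the interior of the subvolume
stencil_interior<<<gridI, block, 0, sInterior>>>(d_sub);

// 3. When the boundary is on the host, exchange it with the neighbours
cudaStreamSynchronize(sBoundary);
MPI_Sendrecv(h_halo, haloBytes, MPI_BYTE, rightRank, 0,
             h_halo_recv, haloBytes, MPI_BYTE, leftRank, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// 4. Push the received ghost slices back to the GPU and finish the step
cudaMemcpyAsync(d_ghost, h_halo_recv, haloBytes,
                cudaMemcpyHostToDevice, sBoundary);
cudaStreamSynchronize(sInterior);
\end{lstlisting}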
\subsection{XMP}
\label{sec:GPUs_XMP}
\begin{verbatim}
#pragma xmp nodes p(*) // node declaration
#pragma xmp nodes gpu g(*) // GPU node declaration
...
#pragma xmp distribute AP() onto p(*) // data distribution
#pragma xmp distribute AG() onto g(*)
#pragma xmp align G[i] with AG[i] // data alignment
#pragma xmp align P[i] with AP[i]
int main(void) {
...
#pragma xmp gmove // data movement by gmove (CPU=>GPU)
AG[:] = AP[:];
#pragma xmp loop on AG(i)
for(i=0; ...) // computation on GPU (passed to CUDA compiler)
AG[i] = ...
#pragma xmp gmove // data movement by gmove (GPU=>CPU)
AP[:] = AG[:];
}
\end{verbatim}
References:
\begin{itemize}
\item P2S2-2010 Panel Is Hybrid Programming a Bad Idea Whose Time Has
Come? - Taisuke Boku. 2010
\end{itemize}
\subsection{CUDA 3.2 runtime with pthreads}
\label{sec:GPUs_pthreads}
The trick to using multiple GPUs with the CUDA runtime API is that each GPU needs to
be controlled by its own CPU thread. First we create a structure
\verb!DataStruct! that contains the information about the device we want to use
as well as the data it will process, and pass it to an entry function
\verb!start_thread()! that creates a new CPU thread. Typically we have 2 GPUs on a single machine, so we create an array of 2
elements.
\begin{lstlisting}
DataStruct data[2];
data[0].deviceID = 0;
data[0].size = N/2;
data[0].a = a; //point to the starting data address
data[0].b = b;
data[1].deviceID = 1;  // the second GPU
data[1].size = N/2;
data[1].a = a+N/2; // point to the starting data address
data[1].b = b+N/2;
\end{lstlisting}
The function \verb!start_thread(func_name, &(data[i]))! create a new thread
which then calls to the specified function \verb!func_name! and pass the
DataStruct information to it.
\begin{lstlisting}
CUTThread thread = start_thread ( func_name, &( data[0]));
func_name ( & (data[1])); // we don't have to create a new thread for this
\end{lstlisting}
In the main program, if it need to get the data from the newly created thread,
we call
\begin{lstlisting}
end_thread(thread);
/// free data
free(a);
free(b);
\end{lstlisting}
This is how we define the routine \verb!func_name! to do what we need
\begin{lstlisting}
void* func_name ( void *pvoidData) {
DataStruct *data = (DataStruct*) pvoidData;
// each thread call a device, using a different ID
HANDLE_ERROR( cudaSetDevice ( data->deviceID));
// the body of the function is the same as the original one,
// except the starting point of data need to be different
a = data->a;
b = data->b;
..... // the next step should be the same
\end{lstlisting}
There are many applications where you need to share data between GPUs, e.g.
stencil computation, where each GPU does the calculation for a subset of the data;
there must then be a mechanism to exchange data. Also, if your simulation iterates
hundreds of thousands of loops, you don't want to create and destroy a CPU
thread each time, but rather keep it alive during the whole simulation. This requires more
effort from the programmer. A sketch of a one-thread-per-GPU setup with plain
pthreads is given below.
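This is a minimal sketch using POSIX threads instead of the cutil \verb!start_thread()! helper; the \verb!DataStruct! fields mirror the example above, the kernel work is elided, and error checking is omitted.
\begin{lstlisting}
#include <pthread.h>
#include <cuda_runtime.h>

struct DataStruct { int deviceID; int size; float *a, *b; };

void* worker(void *arg) {
    DataStruct *data = (DataStruct*)arg;
    cudaSetDevice(data->deviceID);          // bind this thread to one GPU
    float *d_a;
    cudaMalloc(&d_a, data->size * sizeof(float));
    cudaMemcpy(d_a, data->a, data->size * sizeof(float),
               cudaMemcpyHostToDevice);
    /* ... launch kernels on d_a, copy results back ... */
    cudaFree(d_a);
    return NULL;
}

void run_on_two_gpus(DataStruct data[2]) {
    pthread_t t;
    pthread_create(&t, NULL, worker, &data[0]);  // GPU 0 in a new thread
    worker(&data[1]);                            // GPU 1 in the main thread
    pthread_join(t, NULL);                       // wait for GPU 0 to finish
}
\end{lstlisting}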
\subsection{OpenMP}
\label{sec:GPUs_openMPI}
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/OpenMP_GPU.eps}}
\caption{OpenMP and CUDA}
\label{fig:OpenMP_GPU}
\end{figure}
GPU communications use `pinned' buffers for data movement. Optimizations such
as write-combining, and overlapping GPU computation with data transfer, can be
applied to achieve better performance.
\subsection{GMAC}
\label{sec:GPUs_gmac}
GMAC = Global Memory for Accelerators
\url{http://code.google.com/p/adsm/wiki/API}
GMAC is based on a unified CPU/GPU virtual address space: a piece of memory can
be accessed by both the CPU and the GPU using the same pointer.
\verb!gmacPtr()! is only needed for CUDA 3.x, not for CUDA 4.x. The reason
is that in CUDA 4.0 a CPU host thread can get access to GPU data
allocated by any other CPU host thread. However, it's important to
know that a GPU kernel can only get access to the data allocated by
the host thread to which it is bound.
\subsection{GPUDirect 1.0 + Infiniband + MPI}
\label{sec:GPUs_GPUDirect1}
\textcolor{red}{Since the Infiniband HCA ConnectX-2 (Sect.\ref{sec:Mellanox_cards}),
GPUDirect is supported.}
\textcolor{red}{GPUDirect 1.0} was first released with CUDA 3.1 (June 2010) and
works better with CUDA 4.0. It is designed for applications that communicate over a
network or on the same machine using a Mellanox ConnectX card with the MPI programming
model. It is a joint effort between NVIDIA and Mellanox. It allows
third-party network/storage devices to directly access CUDA memory. This
eliminates unnecessary system memory copies \& CPU overhead, as the data does not need to go
through CPU DRAM before passing to the Infiniband network. This improves
communication performance by up to 30\%.
In CUDA 3.1 and 3.2, the GPU and Infiniband communicate by using shared `pinned'
buffers (page-locked buffers) for efficient RDMA transactions. One kernel copies
the data to the `pinned' buffers, and another kernel using a different GPU can
access this data thanks to zero-copy data transfer, i.e. the kernel can access
CPU `pinned' data without copying it to GPU global memory.
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/GPUDirect_1.0.eps}}
\caption{Without GPUDirect 1.0, two host-side copies are required for inter-node
communications. With GPUDirect 1.0, only one copy from GPU to `pinned' memory is
needed, and then Infiniband can access it from there}
\label{fig:GPUDirect1}
\end{figure}
The first Infiniband vendors to support GPUDirect are Mellanox and QLogic.
\textcolor{red}{The first requirement to use GPUDirect is that we need to use
pinned memory.} Without GPUDirect support, the data is first copied from GPU
memory to CPU pinned memory sysmem1, which cannot be accessed by Infiniband
devices. Instead, the CPU needs to copy sysmem1 to a new location sysmem2, which
Infiniband can access. With GPUDirect support, Infiniband can access
sysmem1 directly using remote direct memory access (RDMA), so we avoid one
unnecessary memory copy.
% Without GPUDirect, what the Infiniband driver doing is to allocate
% some buffers, actually 2: one is being used by the Infiniband...
With GPUDirect enabled, Infiniband can copy the data directly from the `pinned'
buffer to the Infiniband card. So, it removes a number of intermediate data copy
steps. However, notice that there is no way for Infiniband to retrieve the
data directly from the GPU, so you cannot pass a device pointer to the MPI function.
Instead, you need to copy the data back to host memory first.
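The resulting send-side pattern looks roughly as follows (a sketch; \verb!d_buf!, \verb!nbytes!, \verb!dest!, and \verb!tag! are placeholders), staging through a pinned host buffer that the Infiniband driver can register:
\begin{lstlisting}
// d_buf: device buffer holding the result; nbytes: its size in bytes
float *h_pinned;
cudaMallocHost((void**)&h_pinned, nbytes);  // page-locked host buffer

cudaMemcpy(h_pinned, d_buf, nbytes,
           cudaMemcpyDeviceToHost);         // GPU -> pinned host memory
MPI_Send(h_pinned, nbytes, MPI_BYTE,
         dest, tag, MPI_COMM_WORLD);        // Infiniband/RDMA from the pinned buffer

cudaFreeHost(h_pinned);
\end{lstlisting}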
Peer-to-peer GPU data copy without passing through host memory requires
both GPUs to be on the same PCIe bus in the same
machine; otherwise it cannot happen. The requirements for GPUDirect 1.0 are:
\begin{enumerate}
\item Fermi-based GPUs
\item IB software: OFED 1.5.x with GPUDirect support (Sect.\ref{sec:OFED})
\item MPI: MVAPICH 1.5+
\item NVIDIA driver 256.35 and above
\item CUDA 3.2+ (including the OpenCL driver)
\end{enumerate}
\section{Multi-GPU with CUDA 4.0}
\label{sec:multi-GPU_CUDA4.0}
CUDA 4.0 makes it easier to work with multiple GPUs (1) on a single node, and (2)
across the network:
\begin{enumerate}
\item within node: use runtime (Sect.\ref{sec:GPUs_cuda4_singlethread}) and
driver API (Sect.\ref{sec:GPUs_driverAPI4.0})
\item GPUDirect 2.0
(Sect.\ref{sec:GPUs_GPUDirect2})
\end{enumerate}
\begin{figure}[hbt]
\centerline{\includegraphics[height=5cm,
angle=0]{./images/multiple_GPU_CPU.eps}}
\caption{CPU/GPU coordination data management}
\label{fig:GPU_CPU}
\end{figure}
\subsection{CUDA runtime 4.x (single host thread)}
\label{sec:GPUs_cuda4_singlethread}
In CUDA 3.2 and earlier, you can call cudaSetDevice() only once. CUDA 4.0 has
relaxed this restriction, allowing a thread to switch from one device to another by calling
it multiple times, thanks to UVA (Sect.\ref{sec:UVA}). In the CUDA model, there is
only one CUDA context per device and one CUDA stream per CPU thread and device.
Once a CUDA context is created, we cannot change its GPU. However, with UVA, a
single host thread can work with the contexts of both GPU 1 and GPU 2.
Using UVA on Fermi-based GPUs, CUDA 4.x allows one CPU host thread to switch the
CUDA stream and CUDA context. NOTE: it works for multiple GPUs on a single