This repository contains CPU+GPU benchmark implementations of the kernel-independent fast multipole method (KIFMM), which were used most recently in the following papers:
- Jee Choi, Aparna Chandramowlishwaran, Kamesh Madduri, and Richard Vuduc. "A CPU-GPU hybrid implementation and model-driven scheduling of the fast multipole method." In Proceedings of the 7th Workshop on General-Purpose Processing using GPUs (GPGPU-7), Salt Lake City, UT, USA, March 2014. doi:10.1145/2576779.2576787
- Aparna Chandramowlishwaran, Jee Choi, Kamesh Madduri, and Richard Vuduc. "Brief announcement: Towards a communication optimal fast multipole method and its implications at exascale." In Proceedings of the 24th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'12), Pittsburgh, PA, USA, June 25-27, 2012. doi:10.1145/2312005.2312039
See Other notes, below, for more information.
Feel free to post issues or pull requests on GitHub: https://github.com/jeewhanchoi/kifmm--hybrid--double-only
From the root directory, run:
make clean; make
The build will generate several instances of the benchmark as separate executables (see Description below).
fmmd--{naive,omp,omp_sse,omp_sse_block,cuda,hybrid1,hybrid2,hybrid3} \
<# pts> {uniform,ellipseUniformAngles} <# pts-per-box>
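For example, a minimal invocation of the OpenMP variant might look like the following (the point count and points-per-box values are illustrative, not tuned recommendations):

# Illustrative run of the OpenMP version: 1M uniformly distributed points,
# 256 points per box (values are not tuned):
./fmmd--omp 1048576 uniform 256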
This benchmark includes CPU, GPU, and hybrid CPU+GPU implementations of the fast multipole method (FMM) in double precision.
The benchmark generates the following executables:
- `fmmd--naive`: baseline sequential code
- `fmmd--omp`: OpenMP-parallelized code
- `fmmd--omp_sse`: OpenMP-parallelized + SIMD-vectorized code
- `fmmd--omp_sse_block`: OpenMP-parallelized + SIMD-vectorized + blocking (translation vector) + blocking (up)
- `fmmd--cuda`: CUDA code
- `fmmd--hybrid1`: U-list on GPU; up, V-list, down on CPU
- `fmmd--hybrid2`: U-list on CPU; up, V-list, down on GPU
- `fmmd--hybrid3`: an optimal schedule for non-uniform distributions on hybrid CPU-GPU systems
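To compare the CPU variants quickly, they can be run back-to-back on the same input, for example as in the sketch below (the problem size and points-per-box are illustrative and not tuned):

# Hypothetical comparison of the CPU variants on one input
# (problem size and points-per-box are illustrative):
for exe in fmmd--naive fmmd--omp fmmd--omp_sse fmmd--omp_sse_block; do
    ./$exe 262144 uniform 128
done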
- NUMA-aware memory allocation can be enabled or disabled via the environment variable `NUMA`. Default = `yes`.
  Note: when NUMA-aware memory allocation is used, threads must be pinned appropriately, e.g., via `KMP_AFFINITY` (icc) or `GOMP_CPU_AFFINITY` (gcc). Example:
  export KMP_AFFINITY=granularity=fine,compact,1,0,verbose # without hyperthreading
  export KMP_AFFINITY=granularity=fine,compact,verbose     # with hyperthreading
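  For gcc builds, threads can be pinned with an explicit CPU list instead; a minimal example, assuming a 16-core node (adjust the list to your machine):

  # gcc example: pin OpenMP threads to cores 0-15 (assumes a 16-core node)
  export GOMP_CPU_AFFINITY=0-15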
- The number of threads can be varied via the environment variable `OMP_NUM_THREADS`. Default = the maximum number of threads.
- When blocking is enabled, the translation block size can be set via the environment variable `BS`. Default = 49.
- When blocking is enabled, the up block size can be set via the environment variable `UPBS`. Default = 1000.
- Accuracy can be varied via the environment variable `NP`. Default = 6. On the GPU, only three precision levels are supported (`NP=3`, `NP=4`, and `NP=6`). You must also set the corresponding `#define NP_(X)` in `cuda.cu` to 1 (all others set to 0) and recompile for the GPU version to work. Lastly, `env NP=<X>` must still be set, since the GPU code uses the CPU's tree-construction code.
- The error is computed by taking a random sample. The sample size can be varied via the environment variable `NV`. Default = 1000. (A combined example using these variables follows this list.)
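These run-time variables can be combined in a single invocation. A minimal illustration (the thread count, block sizes, and sample size below are arbitrary, not tuned recommendations):

# Hypothetical combined run: 16 threads, custom block sizes, 3 digits of
# precision, and a 2000-point error sample (all values are illustrative):
env OMP_NUM_THREADS=16 BS=64 UPBS=2000 NP=3 NV=2000 \
    ./fmmd--omp_sse_block 1048576 uniform 128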
# with NP_6 set to 1, GPU FMM with uniform
# distribution and 6 digits of precision:
./fmmd--cuda 4194304 uniform 512
# with NP_4 set to 1, GPU FMM with uniform
# distribution and 4 digits of precision:
env NP=4 ./fmmd--cuda 1048576 uniform 512
# Most optimized CPU FMM with 3 digits of
# precision:
env NP=3 ./fmmd--omp_sse_block 1048576 uniform 128
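The GPU examples above assume that `cuda.cu` has already been edited and the code rebuilt for the chosen precision. A sketch of the full workflow for 4 digits of precision (the edit itself is shown as a comment, since the surrounding source is not reproduced here):

# 1. In cuda.cu, set "#define NP_4" to 1 and the other NP_(X) defines to 0.
# 2. Rebuild from the root directory:
make clean; make
# 3. Run with the matching NP value so the CPU-side tree construction agrees:
env NP=4 ./fmmd--cuda 1048576 uniform 512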
- By default, for the GPU version, all data structures are allocated in memory prior to execution. However, because CUFFT can consume too much memory for a large number of points and/or higher precision, there is support for allocating data only as it is needed. This can be turned on or off by setting the `MIN_DATA` definition in `partial.h` to 1 or 0.
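  A minimal sketch of enabling on-demand allocation for a large GPU run (the edit is shown as a comment, and the problem size is illustrative):

  # In partial.h, set: #define MIN_DATA 1
  # Then rebuild and run a larger problem (assumes NP_6 is set to 1 in cuda.cu):
  make clean; make
  ./fmmd--cuda 8388608 uniform 512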
- Our code derives its main algorithmic ideas from KIFMM3d (http://cs.nyu.edu/~harper/kifmm3d/documentation/index.html), which was developed by others. The focus of our implementation is single-node performance optimization.
- Jee Whan Choi jee@gatech.edu
- Aparna Chandramowlishwaran amowli@uci.edu
- Kamesh Madduri madduri@cse.psu.edu
- Richard Vuduc richie@cc.gatech.edu