-
Notifications
You must be signed in to change notification settings - Fork 5
Tutorial
DynaSOAr is a CUDA framework for a class of object-oriented programs that we call Single-Method Multiple-Objects (SMMO). In SMMO, parallelism is expressed by running the same method on all objects of a type. DynaSOAr optimizes the performance of SMMO applications by storing them in a Structure of Arrays (SOA) data layout.
DynaSOAr consists of three parts:
- A dynamic memory allocator that provides
new
/delete
-style functions for object allocation/deallocation in device code. - A data layout DSL that allows programmers to use OOP abstractions in CUDA while data is stored in an SOA layout under the hood. The DSL is based on Ikra-Cpp.
- A parallel do-all operation that assigns objects to GPU threads in such a way that most field accesses have good coalescing. This is the main source of speedup of DynaSOAr.
DynaSOAr is a collection of C++/CUDA header files. Programmers must include the main header file dynasoar.h
and use the DynaSOAr API for defining classes etc.
DynaSOAr has a few software and hardware requirements:
- NVIDIA GPU with compute capability of at least 5.0.
- CUDA Toolkit 9.1 or greater. (DynaSOAr may work with older versions, but we cannot confirm that.)
- For the benchmarks with graphical output: SDL 2 (Ubuntu APT packages:
libsdl2-2.0-0
,libsdl2-dev
) - Linux operating system. (Tested on Ubuntu 16.04.4 LTS.)
We tested DynaSOAr on two different GPUs: NVIDIA Titan Xp and NVIDIA GeForce 940MX.
Clone the git repository and initialize all submodules.
$ git clone https://github.com/prg-titech/dynasoar.git
$ cd dynasoar
$ git submodule init
$ git submodule update
Make sure that you can build and run the API example application.
$ build_scripts/sanity_check.sh
Built API example.
Check passed!
Ran API example. Done.
If this fails, you may be missing some prerequisites. Make sure that the path to nvcc
is set correctly in build_scripts/nvcc.sh
.
We provide 10 benchmark applications that illustrate how DynaSOAr is used. Some of these applications have visualizations, which make it easier to analyze the effect of application parameters. In the following, we are building and running wator
, a simple fish-and-sharks simulation in which fish and shark objects are dynamically allocated and destroyed.
All benchmark build scripts are located in the build_scripts
directory and must be invoked from the DynaSOAr root directory. Before running the scripts, make sure that nvcc
is available or set the correct path in build_scripts/nvcc.sh
.
We analyze the wator
benchmark in this section. The source code of this benchmark is located in the example/wa-tor
directory. This benchmark has three/four classes: Agent
, Fish
, Shark
(, Cell
).
We compile wator
as follows:
$ build_scripts/build_wa_tor.sh
Certain parameters such as the benchmark size or the allocator can be modified. (Check -?
option.) We first take a look at the visualization to get a first impression of the benchmark.
$ build_scripts/build_wa_tor.sh -r -x 500 -y 500 -m 262144
$ build_scripts/wator_dynasoar_no_cell
We provide four variant of wator
:
-
wator_dynasoar
: Runs with the selected allocator (DynaSOAr unless overridden.) -
wator_dynasoar_no_cell
: Same as above, butCell
objects are statically allocated since the cell structure does not change in the benchmark. This variant has only 3 classes instead of 4. We used this variant for our benchmarks. -
wator_baseline_aos
: AOS baseline without any dynamic allocation. -
wator_baseline_soa
: SOA baseline without any static allocation.
Now let us try to understand how dynamic allocation is used in this benchmark. Take a look at the header file example/wa-tor/dynasoar_no_cell/wator.h
which defines the three classes Agent
, Fish
and Shark
. When we build the benchmark in debug mode, we can see how DynaSOAr allocates objects of these types.
$ build_scripts/build_wa_tor.sh -d -x 500 -y 500
$ bin/wator_dynasoar_no_cell
First, we get some information about our platform. We are running the benchmark on a GeForce 940MX at the moment.
Current Device Number: 0
Device name: GeForce 940MX
Memory Clock Rate (KHz): 1001000
Memory Bus Width (bits): 64
Peak Memory Bandwidth (GB/s): 16.016000
Total global memory: 2100.232192 MB
Available (free) global memory: 923.860992 MB
What follows is an overview of the memory usage of DynaSOAr.
┌───────────────────────────────────────────────────────────────────────┐
│ Smallest block type: 4Fish │
│ Max. #objects: 262144 │
│ Block size: 3904 bytes │
│ #Blocks: 4096 │
│ #Bitmap levels: 2 │
│ Data buffer size: 000015.250000 MB │
│ Allocator overead: 000000.114670 MB + block overhead │
│ Total memory usage: 000015.364670 MB │
└───────────────────────────────────────────────────────────────────────┘
We can see from here that:
- The smallest class in our system is
Fish
. This is important because block capacities are determined by the size of the smallest type. - We configured the allocator such that it can allocate up to 262144 objects of the smallest type
Fish
(-m
command line argument). This number must a multiple of 64. - The size of a block is 3904 bytes (see calculation later...).
- The heap consists of 262144 / 64 = 4096 blocks.
- The hierarchical bitmaps of this allocator (e.g., free block bitmap, allocated block bitmaps) have ceil(log_64(4096)) = 2 levels.
- The size of all blocks is 3904 B * 4096 = 15.25 MiB.
- The size of all block bitmaps is 0.115 MiB.
- The total memory usage of the allocator (including all of its allocations) is 15.4 MiB.
Note that the capacity/heap size of the allocator can currently not be specified in bytes. Instead, we specify the heap size via the number of objects of the smallest type in the system. We currently have to launch the program in debug mode to see how many bytes that equates to. This is just a matter of the API and will change in future versions of DynaSOAr.
Now let us take a look at the classes in our application, starting with class Agent
. This class is the first class that was passed as a template argument to SoaAllocator
in the header file, which is why it appears first and has a type ID of 0
.
┌───────────────────────────────────────────────────────────────────────┐
│ Block stats for 5Agent (type ID 0) │
├────────────────────┬──────────────────────────────────────────────────┤
│ #fields │ 3 │
│ #objects / block │ 64 │
│ block size │ 3648 bytes │
│ base class │ v │
│ is abstract │ 1 │
│ data seg. [64] sz │ 3584 bytes │
│ (unpadded) │ 3584 bytes │
│ (simple sz) │ 3584 bytes │
│ (padding waste) │ 0 bytes │
│ data seg. [ 1] sz │ 64 bytes │
│ (unpadded) │ 56 bytes │
│ data seg. [64] sz │ 3584 bytes │
│ (unpadded) │ 3584 bytes │
├────────────────────┴──────────────────────────────────────────────────┤
│ Fields │
├───────┬─────────────────┬───────────────────────┬──────────┬──────────┤
│ Index │ Def. Class │ Type │ Size │ Offset │
├───────┼─────────────────┼───────────────────────┼──────────┼──────────┤
│ 2 │ 5Agent │ i │ 4 │ 52 │
│ 1 │ 5Agent │ i │ 4 │ 48 │
│ 0 │ 5Agent │ 17curandStateXORWOW │ 48 │ 0 │
├───────┼─────────────────┼───────────────────────┼──────────┼──────────┤
│ Σ │ │ │ 56 │ │
└───────┴─────────────────┴───────────────────────┴──────────┴──────────┘
From here we can see that:
- The class has 3 field: Two fields of type
int
(i
) and one field of typecurandStateXORWOW
(defined ascurandState_t
in the source code, which is provided by the cuRAND library). Note that DynaSOAr does not know anything about the typecurandState_t
. It is probably a struct with multiple fields, but we do not change its layout. - The class is abstract. (See
static const bool kIsAbstract = true;
in the header file.) - The block capacity (
#objects / block
) has no meaning in abstract classes. - The class does not inherit from another class. (Base class
v
=void
.) - The size of the data segment is also not important for abstract classes.
The next two classes are non-abstract classes. Let us take a look at class Shark
.
┌───────────────────────────────────────────────────────────────────────┐
│ Block stats for 5Shark (type ID 2) │
├────────────────────┬──────────────────────────────────────────────────┤
│ #fields │ 5 │
│ #objects / block │ 60 │
│ block size │ 3904 bytes │
│ base class │ 5Agent │
│ is abstract │ 0 │
│ data seg. [60] sz │ 3840 bytes │
│ (unpadded) │ 3840 bytes │
│ (simple sz) │ 3840 bytes │
│ (padding waste) │ 0 bytes │
│ data seg. [ 1] sz │ 64 bytes │
│ (unpadded) │ 64 bytes │
│ data seg. [64] sz │ 4096 bytes │
│ (unpadded) │ 4096 bytes │
├────────────────────┴──────────────────────────────────────────────────┤
│ Fields │
├───────┬─────────────────┬───────────────────────┬──────────┬──────────┤
│ Index │ Def. Class │ Type │ Size │ Offset │
├───────┼─────────────────┼───────────────────────┼──────────┼──────────┤
│ 1 │ 5Shark │ j │ 4 │ 60 │
│ 0 │ 5Shark │ j │ 4 │ 56 │
│ 2 │ 5Agent │ i │ 4 │ 52 │
│ 1 │ 5Agent │ i │ 4 │ 48 │
│ 0 │ 5Agent │ 17curandStateXORWOW │ 48 │ 0 │
├───────┼─────────────────┼───────────────────────┼──────────┼──────────┤
│ Σ │ │ │ 64 │ │
└───────┴─────────────────┴───────────────────────┴──────────┴──────────┘
This class inherits from Agent
, so those classes are repeated in here. The capacity of blocks of type Shark
is 60
. (The capacity of the smallest type in the system, Fish
in this example, is always 64
.) The size of the data segment of the block is 3840 B, which is the sum of all SOA arrays (size 60). Sometimes, the data segment must be larger due to padding of SOA arrays.
Finally, we can see some statistics about the state of the allocator. The benchmark code prints statistics after every iteration when built in debug mode. Note that this slows down the allocator significantly, so do not use debug mode for measuring performance.
┌────┬──────────┬──────────┬──────────┬┬──────────┬──────────┬──────────┐
│ Ty │ #B_alloc │ #B_leq50 │ #B_activ ││ #O_alloc │ #O_used │ O_frag │
├────┼──────────┼──────────┼──────────┼┼──────────┼──────────┼──────────┤
│ fr │ 1320 │ n/a │ n/a ││ n/a │ n/a │ n/a │
│ 0 │ 0 │ 0 │ 0 ││ 0 │ 0 │ 0.000000 │
│ 1 │ 2224 │ 0 │ 2141 ││ 142336 │ 133631 │ 0.061158 │
│ 2 │ 552 │ 0 │ 411 ││ 33120 │ 12478 │ 0.623249 │
│ Σ │ 2776 │ 0 │ 2552 ││ 175456 │ 146109 │ 0.167261 │
└────┴──────────┴──────────┴──────────┴┴──────────┴──────────┴──────────┘
We can read these statistics as follows.
- The
fr
line shows the number of blocks that arefree
(not allocated). There are 1320 free blocks. - There is one line for every type. The number in the first column is the type ID (
0
=Agent
,1
=Fish
,2
=Shark
). -
B_alloc
is the number of allocated blocks,B_active
is the number of active blocks,O_alloc
is the number of allocated object slots (all object slots of allocated blocks) andO_used
is the number of object slots that actually contain an object. We can calculate the fragmentation from the last two values as defined in the paper:(O_alloc - O_used) / O_alloc
. - There are no objects of abstract classes, so all values for class
Agent
are0
.
If we watch these statistics for a while, we see that the fragmentation rate is pretty high around iterations 60-80. Sometimes over 80%. We can see this fragmentation spike in the Wa-Tor figures in the paper.
Now let us compare the performance of DynaSOAr and mallocMC.
$ build_scripts/build_wa_tor.sh -x 300 -y 300 -a dynasoar -m 262144
$ bin/wator_dynasoar_no_cell
634949, 159514
$ build_scripts/build_wa_tor.sh -x 300 -y 300 -a mallocmc -m 262144 -s 536870912
$ bin/wator_dynasoar_no_cell
2123027, 1344489
Note that for custom allocators (other than dyansoar
), the heap size in bytes can be specified in addition to the maximum number of objects with the -s
parameter. Custom allocators may then use that amount of memory. The -m
parameter determines the size of the auxiliary data structures that implement parallel enumeration of custom allocators.
We ran the benchmark with a very small problem size. If no problem size is specified, the default problem size (also used in the paper) is used. The first number is the overall benchmark running time. The second number is the time spent on parallel enumeration. We should substract this number from the first one to allow for a fair comparison (see paper).
With DynaSOAr, the benchmark finishes in 475435 microseconds (0.475s). With mallocMC, the benchmark runs for 778538 microseconds (0.779s). We could repeat this experiment with other allocators (bitmap
, cuda
, halloc
).
If we increase the problem size, it may also be necessary to increase the heap size -s
(and/or the max. number of objects -m
) to ensure that the allocator does not run out of memory. If DynaSOAr runs out of memory, it currently goes into an infinite loop (see algorithm in the paper). If the other allocators run out of memory, the program usually crashes.
DynaSOAr was designed for SMMO applicatons and we claim that many applications can be written in SMMO style. In the following, we give an overview of the SMMO applications that we implemented. These applications are located in the example
directory and we recommend following the source code when reading through this section.
Sugarscape is an agent-based social simulation. There are variety of variants that can simulate different aspects of a population. To keep the benchmark simple, we are focusing onthe movement of agents, ageing and mating.
The simulation consists of a 2D grid of cells. Every cell can store a certain amount of "sugar", which is a resource that agents need to survive. Every cell has a certain sugar grow rate. There are male and female agents, which can move to neighboring cells. Agent consume sugar and can mate when they are close enough, spawning a new agent. Every agent has properties such as a certain range of vision, metabolism rate, etc. These properties are passed down to offspring.
Every iteration in this simulation consists of 12 do-all operations. We do not describe every operation in detail here, but there are operations for growing sugar, diffusing sugar to neighboring cells, ageing and consuming sugar, deciding onto which cell to move (similar logic to Wa-Tor, see below) and mating.
$ build_scripts/build_traffic.sh -r -x 8 -m 16384
$ bin/traffic_dynasoar
This is an agent-based traffic flow simulation that implements the Nagel-Schreckenberg model. In this model, streets are divided into cells (class Cell
). Every cell can hold at most one car (class Car
). Every car has a velocity in "number of cells per iteration".
Every iteration in this simulation consists of 3 parallel do-all operations (step()
in dynasoar/traffic.cu
). This simulation also models traffic lights, but they are currently not implemented with DynaSOAr yet.
-
ProducerCell::create_car()
: Certain cells that are classified as "sources", produces a new car with a certain probability if the cell is empty. -
Car::step_prepare_path()
: First, the velocity of the car is increased by 1. Then, the car analyzes the nextv
cells wherev
is the velocity of the car. If one of these cells is currently occupied, the velocity is reduced, such that no occupied cell is entered. -
Car::step_move()
: Every car moves to a cell of distancev
, as comouted by the previous step.
This simulation generates a random street network with -x
intersections. We also experimented with loading real OSM city maps; however, we would like to keep this benchmark as simple as possible.
$ build_scripts/build_wa_tor.sh -r -x 500 -y 500 -m 262144
$ bin/wator_dynasoar_no_cell
This is an agent-based predator-prey simulation. See screenshot above. The green dots represent fish (class Fish
) and the red dots represent sharks (class Shark
). Both are moving on a 2D grid of cells (class Cell
).
Every iteration in this simulation consists of 8 parallel do-all operations (see step()
in dynasoar/wator.cu
):
-
Cell::prepare()
: There is a boolean flagneighbor_request
for every neighbor of a cell. These flags are set tofalse
. -
Fish::prepare()
: Every fish scans its neighboring cells. If there is a free cell among them, it sets the correspondingneighbor_request
flag on the neighboring cell totrue
. -
Cell::decide()
: Every cells checks itsneighbor_request
flags to see which agents from neighboring cells want to enter this cell. It selects one of those cells at random. -
Fish::update()
: If a fish was selected, it moves onto the target cell. Otherwise, it stays where it is.
There are 4 more do-all operations for sharks, which are implemented similarly. The only difference is that sharks can enter cells that are occupied by fish.