University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2
Hannah Bollar: LinkedIn, Website
Tested on: Windows 10 Pro, i7-6700HQ @ 2.60GHz 15.9GB, GTX 980M (Personal)
Introduction - Prefix Sum - Compaction - CPU Implementation - GPU Implementation - Additional Features - Runtime Analysis - Responses
Compaction is the removal of redundant or unnecessary elements from input data in order to speed up runtime and/or reduce the amount of memory being handled.
Throughout this assignment I implement a few different versions of prefix-sum scan to form a better understanding of how to optimize its implementation, and then use the work-efficient version for the actual compaction method. Additionally, there is a comparison of the different CPU and GPU implementations.
The idea of prefix-sum is that as we hit each index in the array, we sum up the elements prior to that index. For this project, I use an exclusive scan, meaning the value stored at each index is the sum of all elements before that index (the element at the index itself is not included). For example, let `A = {1, 5, 0, 1, 2, 0, 3}` be our input array and `B` be our output array. At the 0th index, we have no sum yet, so `B[0] = 0`. At the 1st index, the sum includes only the 0th element, so `B[1] = 1`. In general, `B[i] = A[i-1] + ... + A[0]`. So for our example we end up with `B = {0, 1, 6, 6, 7, 9, 9}`, and the last value, 3, can be ignored entirely.
You map `A`'s elements to a boolean array marking whether or not the value at each index is non-zero. You run prefix sum on that boolean array and then scatter. That is, given an index in the original array, 1) only keep its value in the final array if the value in the boolean mapping is 1, and, if you're keeping it, 2) place it at the index given by the scan output.
There is one basic idea for how to implement this on the CPU - simply iterate through a for loop over all the elements in `A` to build `B` using prefix-sum. Not the most optimal approach in general, but it has little overhead in regards to method preparation, memory management, etc., and runs in O(n).
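As a point of reference, here is a minimal sketch of that CPU exclusive scan; the function name is illustrative and not necessarily the project's actual signature.

```cpp
// Minimal sketch of a CPU exclusive scan (illustrative name, not the project's exact API).
void cpuExclusiveScan(int n, int* out, const int* in) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        out[i] = sum;   // sum of all elements strictly before index i
        sum += in[i];   // fold in the current element for later indices
    }
}
```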
Just like prefix-sum, stream compaction on the CPU is also only a few lines of code. You map `A`'s elements to a boolean array marking whether or not the value at each index is non-zero. You run prefix sum on it and then scatter, which is the idea of using one array as a set of destination indices and one array as the original values, and moving each kept value to wherever its index says it belongs in the output array.
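Putting the three steps together, a hedged sketch of the CPU map-scan-scatter flow might look like the following (names are illustrative, not the project's exact functions).

```cpp
#include <vector>

// Illustrative CPU compaction with scan: map -> exclusive scan -> scatter.
// Returns the number of elements kept.
int cpuCompactWithScan(int n, int* out, const int* in) {
    std::vector<int> bools(n), indices(n);
    // map: 1 if the element should be kept (non-zero), 0 otherwise
    for (int i = 0; i < n; ++i) bools[i] = (in[i] != 0) ? 1 : 0;
    // exclusive scan of the booleans gives each kept element's output index
    int sum = 0;
    for (int i = 0; i < n; ++i) { indices[i] = sum; sum += bools[i]; }
    // scatter: copy kept elements to their scanned positions
    for (int i = 0; i < n; ++i) {
        if (bools[i]) out[indices[i]] = in[i];
    }
    return sum; // total number of kept elements
}
```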
Since we're threading the computations here, we'd like a more optimized approach than iterating serially through every single element (i.e. we don't want the same O(n) serial loop). The code has two different implementations for this.
When running in parallel, each thread must have its value added to every element to the right of itself over the course of the iterations. Since there are log n iterations and each one performs up to n adds, with the number of elements still being added only decreasing by a power of two each iteration, the total work is O(n log n), which is worse than the CPU's O(n). Additionally, this algorithm has to use two buffers instead of just one, because the threads aren't guaranteed to all be at the same point of an iteration, so there would otherwise be cases where threads read from and write to the same locations (aka race conditions).
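For reference, a hedged sketch of one iteration of such a naive scan kernel is below; the host launches it once per offset (1, 2, 4, ...), ping-ponging between the two buffers, and typically shifts the result right by one at the end to make the scan exclusive. The kernel name and exact index math are assumptions, not necessarily the project's code.

```cuda
// One step of a naive (Hillis-Steele style) scan. `offset` doubles each launch.
__global__ void kernNaiveScanStep(int n, int offset, int* odata, const int* idata) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k >= offset) {
        odata[k] = idata[k - offset] + idata[k]; // add the element `offset` slots to the left
    } else {
        odata[k] = idata[k];                     // nothing to the left at this distance; copy through
    }
}
```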
This implementation has two phases, the `upsweep` and the `downsweep`.
[`upsweep`] - Instead of having every element summed into every index to its right, this implementation cuts out that extraneous summing. For each iteration, the values being summed sit at an offset from one another, and this offset doubles every iteration, so each pair's sum lands at the larger index of the pair, which is exactly where the next iteration's offset will look. Only doing the add for indices that map to these locations cuts the number of calculations per iteration by powers of two, and overall this is just O(n) total add calculations.
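A hedged sketch of an upsweep kernel in this unoptimized, offset-matched indexing style (see the indexing discussion under Additional Features) might look like this; the name and launch setup are assumptions.

```cuda
// Upsweep (reduce) step: for this iteration's offset (2, 4, 8, ...), the index at the
// top of each offset-sized pair accumulates the value half an offset to its left.
__global__ void kernUpSweep(int n, int offset, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if ((k % offset) == offset - 1) {
        data[k] += data[k - (offset >> 1)];
    }
}
```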
[`downsweep`] - First set the last element of the `upsweep` output to 0. Then, working back over the rest of the `upsweep` output, for each offset pair 1) replace the value at the upper index with the sum of the upper-index value and the lower-index value, and 2) place the original upper-index value at the lower index. This effectively unwinds the binary summing done in the `upsweep` while completing the remaining partial sums as it traverses downward, passing values along. This is just O(n) adds and O(n) swaps.
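In the same hedged style, a downsweep kernel could look like the following; the host zeroes the last element first and then launches this with the offset halving each iteration.

```cuda
// Downsweep step: swap the lower value up into the upper slot's running sum
// and pass the old upper value down to the lower slot.
__global__ void kernDownSweep(int n, int offset, int* data) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if ((k % offset) == offset - 1) {
        int lower = k - (offset >> 1);
        int t = data[lower];     // save the lower-index value
        data[lower] = data[k];   // place the original upper value at the lower index
        data[k] += t;            // upper index becomes the sum of the pair
    }
}
```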
Together they properly bring all the elements to the final scan output that we're looking for, and the `upsweep` and `downsweep` sequences combined are still just O(n) total work - the same as our CPU implementation.
This follows the same general idea as the CPU stream compaction, just using a parallel implementation for the boolean and scatter steps and the `work-efficient` version of scan explained in the earlier section.
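The map and scatter steps parallelize trivially; a hedged sketch of what those two kernels might look like is below (the exclusive scan of the boolean array supplies the `indices` buffer). Names are illustrative.

```cuda
// Map each element to 1 if it should be kept (non-zero), 0 otherwise.
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    bools[k] = (idata[k] != 0) ? 1 : 0;
}

// Scatter the kept elements to the destinations given by the scanned boolean array.
__global__ void kernScatter(int n, int* odata, const int* idata,
                            const int* bools, const int* indices) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (bools[k]) {
        odata[indices[k]] = idata[k];
    }
}
```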
I also implemented radix sorting using my `work-efficient` scan implementation. Currently it only works for arrays whose lengths are a power of 2; handling for non-power-of-two lengths is still to be done (tbd).
It is called on an input array, and at the end of this readme in the outputs section you can see an example input and output array printed above the confirmation that it works.
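For context, the core of a scan-based radix sort is a per-bit split pass; a minimal CPU-style sketch of one pass is below (the GPU version would replace the inline scan with the work-efficient scan kernels). All names here are illustrative, not the project's code.

```cpp
#include <vector>

// One radix split pass on `bit`: elements whose bit is 0 are packed to the front,
// elements whose bit is 1 follow, preserving relative order (stable).
void radixSplitPass(int n, int bit, int* out, const int* in) {
    std::vector<int> e(n), f(n);
    for (int i = 0; i < n; ++i) e[i] = ((in[i] >> bit) & 1) ? 0 : 1; // 1 if the bit is 0
    int totalFalses = 0;
    for (int i = 0; i < n; ++i) { f[i] = totalFalses; totalFalses += e[i]; } // exclusive scan of e
    for (int i = 0; i < n; ++i) {
        int dest = e[i] ? f[i]                       // bit == 0: scanned position
                        : totalFalses + (i - f[i]);  // bit == 1: after all of the 0-bit elements
        out[dest] = in[i];
    }
}
```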
This has to do with indexing and thread management. Currently, in my GPU `work-efficient` scan, the thread indexing always matches the offset indexing from one iteration to the next for both `upsweep` and `downsweep`. A possible future optimization is to change the thread indexing so that, as the traversal moves upward and downward for `upsweep` and `downsweep` respectively and starts skipping over more and more memory, the threads that still have work to do remain contiguous. That way, as warps are scheduled, we can drop not just individual threads but entire warps, and later even whole blocks, from the runtime thanks to how CUDA schedules work.
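As a hedged sketch of what that re-indexing could look like for the upsweep (my illustration of the idea, not code from the project): launch only as many threads as there are active pairs and have consecutive thread ids map onto the strided array locations.

```cuda
// Upsweep with contiguous thread indexing: thread t owns the t-th active pair,
// so inactive work retires as whole warps/blocks instead of scattered threads.
__global__ void kernUpSweepContiguous(int n, int offset, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x; // contiguous thread id
    int numActive = n / offset;                    // pairs touched this iteration
    if (t >= numActive) return;
    int k = (t + 1) * offset - 1;                  // strided location this thread updates
    data[k] += data[k - (offset >> 1)];
}
```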
The visual below depicts the memory difference more clearly. For this example, let's say the warp size is 2. After the first iteration, the optimized version has only 2 warps left and is down from 8 to 4 active threads, whereas in the unoptimized case the non-contiguous accesses keep all 4 warps in play, which is what actually creates the blockage.
Before we get into a full runtime analysis, let's pick out the appropriate block sizes for each algorithm so that we're comparing the runtime of the most optimal configuration of each technique. Note: this is on an array size of 1<<8 = 256.
Based on the graph, I chose the most optimal configuration for the non-power-of-two implementations, since an algorithm is more likely to hit a non-power-of-two input than a power-of-two input, especially as the lengths of the input arrays increase. Therefore, I use `naive` with a `blockSize` of 32 and `work-efficient` with a `blockSize` of 1024.
Now that we have the proper block sizes, let's view a comparison of all the implementations on varying array lengths.
See next section for observations.
This is mostly just a continuation of Runtime Analysis for this assignment.
Write a brief explanation of the phenomena you see in the comparison between the different implementations. Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
`cpu` - since there's no memory overhead or extra manipulation, its runtime depends mainly on the element count: O(n).
`gpu-naive` - the issue with this implementation explained in the GPU section makes the runtime O(n log n) even with optimized indexing, so it will always perform worse than the cpu implementation.
`gpu-work-efficient` - this is threaded so that its total work is about the same as the cpu implementation, O(n); however, it actually runs slower than the `cpu` implementation, especially for larger array sizes. This is because it does not have the indexing optimizations I mentioned earlier, so the memory overhead of working with heavy input and output arrays, proper memory arrangement, etc. ends up adding to the overall runtime.
`thrust` - this is the fastest implementation, yet in debug mode it oddly takes a lot of time (see the usage sketch below).
gpu is slower than cpu for all my implementations (which is understandable given my implementation; see Additional Features for the explanation).
Additionally, for all implementations, the non-power-of-two versions (`NPow2`) run worse than their power-of-two counterparts, simply because of thread utilization: our warps are of size 32, and our algorithms (though they account for non-power-of-two inputs) are more streamlined for 2^x inputs since those divide more evenly.
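For completeness, the thrust comparison boils down to a call like the following; the wrapper name is mine, but `thrust::exclusive_scan` is the library call being timed.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Hedged sketch of the thrust-based scan used for comparison.
void thrustScan(int n, int* odata, const int* idata) {
    thrust::device_vector<int> dv_in(idata, idata + n); // copy input to the device
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);  // copy result back to the host
}
```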
Note: added my own tests for radix sorting. This is the output with the optimized block sizes at 1<<8 array elements.
GPU Gems 3 - formula understanding and parts of some images
CIS 565 slide deck - parts of some of the thread images