Code for CS 179 final project, which implemented the Held-Karp traveling salesperson algorithm on a GPU.


TimMenninger/HeldKarp


/**---------------------------------------------------------------------------+
|                                                                             |
| README                                                                      |
|                                                                             |
| Description:  This contains information about the HeldKarp CUDA project for |
|               CS 179. Included are its usage, operation details and analyses|
|               of what we did.                                               |
|                                                                             |
| Authors:      Tim Menninger                                                 |
|               Jared Reed                                                    |
|                                                                             |
+----------------------------------------------------------------------------*/


I. File List
------------
CS179FinalProjectProposal.docx      Initial Project Proposal
HeldKarp.cc                         HeldKarp Implementation
HeldKarp.cu                         HeldKarp GPU Kernels
HeldKarp.cuh                        HeldKarp Header File
Makefile                            Makefile for HeldKarp
README                              
ta_utilities.cpp                    TA files
ta_utilities.hpp                    TA files
SimpleTest.txt                      Test points with 6 points
ComplexTest.txt                     Test points with 20 points
TestCases                           Directory containing test cases

The TestCases directory also contains images of the initial data points as
well as images of the output paths.


II. Usage
---------
After building with make, invoke the following command:
    ./HeldKarp <input text file> <threadsPerBlock> <maxBlocks>

Input data text files are stored in the TestCases directory:
    SimpleTest.txt    Simple test containing 6 points
    ComplexTest.txt   Complex test containing 20 points
Note that the algorithm has very high space complexity and may not be able to
run on the full point set.
To change how many points are used in the run, change the NUM_POINTS value in
HeldKarp.cuh.

An example invocation from the HeldKarp directory might look like:
    ./HeldKarp SimpleTest.txt 512 512
    
After running, the path will be saved into out.txt, which you can then plot
by running
    python Plotter.py SimpleTest.txt out.txt

Note that this requires matplotlib.



III. Program Explanation
-------------------------
This program contains two implementations of the Held-Karp algorithm for
solving the traveling salesman problem.  The first is a CPU implementation,
while the second is a GPU implementation intended to be much faster.  The
traveling salesman problem consists of finding the minimum-length route that
visits every point in a set and returns to the starting point.  Although the
problem is simple to state, no polynomial-time algorithm for it is known.
This is because TSP operates on a fully connected graph, where any point can
follow any other in the route.  The brute force approach has a time
complexity of O(n!), but the Held-Karp algorithm uses a memoization array
that breaks the problem down into subproblems and eliminates redundant
calculations.  This reduces the time complexity to O((n^2)(2^n)).

In this algorithm, there is a memoization array.  This array consists of a row
for every subset of the set of points.  There are 2^n subsets in total, but
because the only useful subsets are those that contain the start point and
have cardinality 2 or greater, this ends up being 2^(n - 1) - 1 rows.  Within
each row of the array, there is a value for
each member of the subset.  Thus, the value at index (i, j) would correspond to
the shortest distance through all points in subset i that ends with point j.
If j is not in subset i, then the cell is meaningless.  In addition to the
shortest distance, we also store the second-to-last point (because j is the
last point).  This allows us to iteratively recreate the path after filling
the array and examining the full-set row.

Part of creating this memoization array is knowing the distance between each
pair of points, so the first step is iterating through the roughly n^2 / 2
pairs and filling in the distance between each of them.  Then, we create our
base case, which
involves filling our memoization array at each row that corresponds to a
subset of length 2.  Finally, we recursively create subsets and fill the
memoization array in order of increasing subset length.
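
As an illustration of this setup, here is a minimal CPU-side sketch.  It
takes point 0 as the start, indexes subsets with a bitmask purely for
readability (the actual implementation uses the tree-based index described
in section IV and the NUM_POINTS constant from HeldKarp.cuh), and flattens
the tables into 1-D arrays; the layout and names are assumptions for the
sketch, not the project's code.

    #include <vector>
    #include <cmath>
    #include <limits>

    // dist[i*n + j]   : distance between points i and j
    // dp[s*n + k]     : shortest path starting at 0, visiting every point in
    //                   subset s (a bitmask), and ending at point k
    // parent[s*n + k] : second-to-last point on that path
    void initHeldKarp(const std::vector<float> &x, const std::vector<float> &y,
                      std::vector<float> &dist, std::vector<float> &dp,
                      std::vector<int> &parent)
    {
        int   n   = (int) x.size();
        float INF = std::numeric_limits<float>::infinity();

        dist.assign(n * n, 0.0f);
        dp.assign((1 << n) * n, INF);
        parent.assign((1 << n) * n, -1);

        // Fill the roughly n^2 / 2 pairwise distances (mirrored for convenience).
        for (int i = 0; i < n; ++i) {
            for (int j = i + 1; j < n; ++j) {
                float dx = x[i] - x[j], dy = y[i] - y[j];
                dist[i * n + j] = dist[j * n + i] = std::sqrt(dx * dx + dy * dy);
            }
        }

        // Base case: every subset {0, k} of cardinality 2.
        for (int k = 1; k < n; ++k)
            dp[((1u << k) | 1u) * n + k] = dist[0 * n + k];
    }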

After the base case, we use the following algorithm to fill the array.  For a
subset, s, we consider each point, k, as a possible "last point".  Thus, we can
remove k from s and use our memoization array to find the shortest path through
all points in s \ {k}.  Our array is structured in such a way that we know the
shortest path through s \ {k} ending at any point, m, in s \ {k}.  Then, for
each possible m, we take the distance in the memoization array for s \ {k}
ending at m, and add the distance from m to k.  This is the shortest distance
through the points in s that ends with m followed by k.  If we take the
smallest over all possible m, we get the shortest path through all points in
s ending at k.
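
Continuing the bitmask sketch above, the recurrence just described might look
like the following.  Processing rows in increasing bitmask order guarantees
that s \ {k} is always filled before s; again, this is an illustrative
stand-in for the tree-indexed table the project actually uses.

    #include <vector>

    void fillMemoTable(const std::vector<float> &dist, std::vector<float> &dp,
                       std::vector<int> &parent, int n)
    {
        for (unsigned int s = 0; s < (1u << n); ++s) {
            // Only subsets containing the start point with at least 3 members
            // need the recurrence; smaller ones were handled by the base case.
            // (__builtin_popcount is the GCC/Clang bit-count intrinsic.)
            if (!(s & 1u) || __builtin_popcount(s) < 3)
                continue;
            for (int k = 1; k < n; ++k) {
                if (!(s & (1u << k)))
                    continue;                      // k must be in the subset
                unsigned int prev = s ^ (1u << k); // s \ {k}
                for (int m = 1; m < n; ++m) {
                    if (m == k || !(prev & (1u << m)))
                        continue;
                    float cand = dp[prev * n + m] + dist[m * n + k];
                    if (cand < dp[s * n + k]) {
                        dp[s * n + k]     = cand;  // best path through s ending at k
                        parent[s * n + k] = m;     // second-to-last point on it
                    }
                }
            }
        }
    }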

Eventually, when we have filled the array, we look at the row that
corresponds to the entire set.  We iterate through each value in that row to
find the endpoint, k, that minimizes the path distance plus the distance from
k back to the start point.  We now know the last point is k.  This array cell
also stores the second-to-last point, j.  This knowledge allows us to find
the rest of the path in linear time.  Because we know the last point is k and
the second-to-last point is j, we look at the row corresponding to the full
set, S, without k (S \ {k}), and look at the j'th value in that row, which
stores the next point back in the path.  We follow this until the previous
point is our starting point (or until the size of the subset is 2), at which
point we have the entire path.  The final step is adding the distance from
the last point back to the source point, which closes the loop.
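
A sketch of that reconstruction, again over the hypothetical bitmask tables
from the sketches above, might look like this.  It returns the tour as a list
of point indices that starts and ends at point 0, choosing the last point k
so that the closing edge back to the start is included in the comparison.

    #include <vector>
    #include <algorithm>
    #include <limits>

    std::vector<int> reconstructTour(const std::vector<float> &dist,
                                     const std::vector<float> &dp,
                                     const std::vector<int> &parent, int n)
    {
        unsigned int full = (1u << n) - 1;      // row for the entire set

        // Pick the last point k minimizing the tour once the edge back to the
        // start point is added.
        int   bestK    = -1;
        float bestDist = std::numeric_limits<float>::infinity();
        for (int k = 1; k < n; ++k) {
            float cand = dp[full * n + k] + dist[k * n + 0];
            if (cand < bestDist) {
                bestDist = cand;
                bestK    = k;
            }
        }

        // Walk the parent pointers backwards, removing the current endpoint
        // from the subset each step, until we hit the cardinality-2 base case
        // (whose parent entry is -1).
        std::vector<int> tour;
        unsigned int s = full;
        for (int k = bestK; k != -1; ) {
            tour.push_back(k);
            int next = parent[s * n + k];
            s ^= (1u << k);
            k  = next;
        }
        tour.push_back(0);
        std::reverse(tour.begin(), tour.end()); // now starts at point 0
        tour.push_back(0);                      // close the loop
        return tour;
    }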

For more information, refer to:
https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm


IV. Design Decisions & Challenges
----------------------------------
One challenge we faced was finding a unique index into the array given a set
of numbers.  The final algorithm came out of a guess-and-check process that,
after countless hours, ended with the realization
that one could construct a balanced tree of subsets.  This tree would have
its root be the smallest possible useful subset: {0, 1}.  Then, each right
node would increment the largest value in the set by 1, and each left node
would add to the set the smallest number greater than the largest value
in the set.  This would happen until no larger value could be added to the
set.  This allowed us to find the index of a subset in linear time (there
are 2^n subsets, so the tree has log(2^n) = n levels).
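
To make the tree concrete, below is an illustrative index function built
directly from the left/right rules just described.  The particular numbering
(heap order: the root {0, 1} is row 0, and the children of row i are rows
2i + 1 and 2i + 2) is an assumption made for this sketch and may not match
the numbering the project actually uses.

    #include <vector>

    // subset holds the points of the subset in ascending order and always
    // starts with 0 (every useful subset contains the start point).
    int subsetIndex(const std::vector<int> &subset)
    {
        int    idx     = 0;   // start at the root, {0, 1}
        int    largest = 1;   // largest value in the current tree node
        size_t pos     = 1;   // next element of subset to account for

        // Walk down until the node's largest value matches the subset's max.
        while (largest != subset.back()) {
            if (subset[pos] == largest) {
                // largest stays in the subset: go left, which appends the
                // next larger value to the node's set.
                idx = 2 * idx + 1;
                ++pos;
            } else {
                // largest is not in the subset: go right, which increments
                // the node's largest value.
                idx = 2 * idx + 2;
            }
            ++largest;
        }
        return idx;
    }

With this numbering, {0, 1} maps to row 0, its left child {0, 1, 2} to row 1,
and its right child {0, 2} to row 2.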

Another difficulty we ran into was that the program did not work for any
input larger than 27 points.  Originally, we thought we were incorrectly
reading files.  Eventually, however, we realized the issue was the extremely
large space complexity of O(2^n), since the memoization array stores a row
for every subset.  The program was failing because the computer did not have
enough memory to store the array for such a large input.  It then turned out
that the GPU was even more space-limited, capping us at 12 points before
running out of memory.

Figuring out the proper way to iterate over the subsets was also more
difficult than we expected.  It was easy to generate all of the subsets, but
they were not sorted by size, which is what we needed: we must run our
function on all subsets of one size, then the next, and so on until we reach
the full set.  This is necessary because each subset builds off of the
subsets that are one element smaller than it.  In the end, rather than sort
the list of subsets, we changed our function to take the subset size as an
argument.  Then, we looped from 1 to the max size and called the function
each time.  For this approach we also needed to change our function to only
update the memoization array if the current subset matched the size argument
passed in.  This let us properly iterate over the subsets by size.  We
considered keeping only the memoization rows for the current size and one
size smaller, but then we would lose the data that allows us to reconstruct
the full path at the end.  Thus, our only choice was to use the extremely
space-heavy implementation.

Another challenge was attempting to run a recursive function on the GPU.
We had never run into this problem in the earlier labs in this class.  We
decided the best way to handle this was to have the recursive function run
on the CPU, with every call launching a single GPU kernel, rather than
trying to have the GPU kernel recursively call itself.
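
A rough sketch of that structure follows.  The recursion that enumerates
subsets stays on the host, and each subset of the requested size triggers a
pair of kernel launches (the kernels themselves are sketched below).  The
kernel names, the bitmask subset representation, and the launch configuration
are illustrative assumptions; the real code uses the tree index from above
and takes threadsPerBlock and maxBlocks from the command line.

    __global__ void cudaFillSubset(const float *dist, const float *dp,
                                   float *cand, unsigned int s, int nPoints);
    __global__ void cudaReduceSubset(const float *cand, float *dp, int *parent,
                                     unsigned int s, int nPoints);

    // Recursively extend the subset s (a bitmask that always contains point 0)
    // with points >= nextPoint, and launch the kernels once it holds exactly
    // targetSize points.  Called once per size, from 3 up to the number of
    // points, after the cardinality-2 base case has been filled:
    //     for (int size = 3; size <= nPoints; ++size)
    //         fillSubsetsOfSize(1u, 1, 1, size, nPoints, threadsPerBlock, ...);
    void fillSubsetsOfSize(unsigned int s, int count, int nextPoint,
                           int targetSize, int nPoints, int threadsPerBlock,
                           const float *devDist, float *devDp, int *devParent,
                           float *devCand)
    {
        if (count == targetSize) {
            int nPairs = nPoints * nPoints;    // one thread per (k, m) pair
            int blocks = (nPairs + threadsPerBlock - 1) / threadsPerBlock;
            cudaFillSubset<<<blocks, threadsPerBlock>>>(devDist, devDp, devCand,
                                                        s, nPoints);

            blocks = (nPoints + threadsPerBlock - 1) / threadsPerBlock;
            cudaReduceSubset<<<blocks, threadsPerBlock>>>(devCand, devDp,
                                                          devParent, s, nPoints);
            return;
        }
        for (int p = nextPoint; p < nPoints; ++p)
            fillSubsetsOfSize(s | (1u << p), count + 1, p + 1, targetSize,
                              nPoints, threadsPerBlock, devDist, devDp,
                              devParent, devCand);
    }

Because the launches all go to the default stream, each kernel finishes
before the next one reads the rows it produced.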

Inside this GPU kernel, we needed to find the best way to parallelize the
memoization process.  Originally, it seemed easy: we would assign each thread
to a subset and fill the rows quickly.  However, there were several issues
with that.  First, we found no good way of going from a thread index to the
values in a set.  Second, the memoization array is not ordered by subset
size, so threads would eventually deadlock waiting for other rows to fill.
Thus, we parallelized by enumerating the two nested loops of the recurrence
and assigning a thread to each pair of loop values.  We could do this because
no iteration of those loops depends on another iteration, only on parts of
the memoization array that were filled in previously.  Perhaps we could have
achieved a larger speedup by assigning threads to whole subsets instead, as
discussed in section VI.
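
The kernels used in the sketch above might then look roughly like this: the
first assigns one thread to each (k, m) pair and writes its candidate
distance into a scratch buffer, and a second, trivial kernel takes the per-k
minimum and records the second-to-last point.  The scratch-buffer reduction
and the flattened dp layout (dp[s * nPoints + k], initialized to FLT_MAX) are
assumptions made for the sketch, not the project's actual kernels.

    #include <cfloat>

    // One thread per (k, m) pair within the subset s: compute the candidate
    // distance through s \ {k} ending at m, plus the edge from m to k.
    __global__ void cudaFillSubset(const float *dist, const float *dp,
                                   float *cand, unsigned int s, int nPoints)
    {
        int pair = blockIdx.x * blockDim.x + threadIdx.x;
        if (pair >= nPoints * nPoints)
            return;

        int k = pair / nPoints;              // candidate last point
        int m = pair % nPoints;              // candidate second-to-last point

        cand[pair] = FLT_MAX;
        if (k == 0 || m == 0 || m == k)
            return;
        if (!(s & (1u << k)) || !(s & (1u << m)))
            return;

        unsigned int prev = s ^ (1u << k);   // s \ {k}
        cand[pair] = dp[prev * nPoints + m] + dist[m * nPoints + k];
    }

    // One thread per candidate last point k: reduce over m and record both the
    // shortest distance and the second-to-last point.
    __global__ void cudaReduceSubset(const float *cand, float *dp, int *parent,
                                     unsigned int s, int nPoints)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k <= 0 || k >= nPoints || !(s & (1u << k)))
            return;

        float best  = FLT_MAX;
        int   bestM = -1;
        for (int m = 1; m < nPoints; ++m) {
            float c = cand[k * nPoints + m];
            if (c < best) {
                best  = c;
                bestM = m;
            }
        }
        if (bestM >= 0) {
            dp[s * nPoints + k]     = best;  // shortest path through s ending at k
            parent[s * nPoints + k] = bestM; // second-to-last point on that path
        }
    }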


V. Expected Results
--------------------
We expect both the CPU and GPU implementations to return the correct shortest
path from the TSP problem. However, we expect the GPU implementation to be much
faster since we are using multiple threads to update the memoization array.
Unfortunately, there is some amount of overhead involved in allocating and
copying as much memory as we do.  Thus, on small cases the CPU is faster than
the GPU.  As we increase the size of the input, the GPU quickly closes the
gap.  We run out of memory on Mako, though, before we reach an input size
large enough to show the GPU's advantage.


VI. Analysis of Results
------------------------
Points   CPU Implementation   GPU Implementation   Performance Increase
  3      0.000s               0.001s               -11845%
  4      0.000s               0.001s               -5718%
  5      0.000s               0.001s               -5015%
 10      0.006s               0.107s               -1877%
 22      43.056s              -                    -

We ran into memory issues on the CPU when we used more than 27 points, and on
the GPU when we used more than 11 points.  We assume this is because the
algorithm uses O(2^n) space, which grows extremely quickly.  Therefore, we
could only run on small inputs, and thus were unable to see the GPU improve
the performance in any significant way.  However, we did notice that there
is a lot of overhead in the implementation, and the difference went down
significantly as we added more points.  We searched for what others had done
and found that they observed similarly poor performance on small inputs.
Additionally, they noted that their biggest limiting factor was space.  For
more information, refer to:
http://www.slideshare.net/DimitrisMavrommatis/travelling-salesman-problem-57460886

We wanted to try to improve ours by using streams, but were unable to because
after approximately every three runs, we would begin seeing an error message
"GPUassert: out of memory", which wouldn't cease for 30-60 minutes.  As a
result, we could debug on average about 5 lines per hour.

We also believe that, had we found a good way to assign threads to subsets
instead of to loop iterators within each subset, the GPU speedup would have
been much greater.
