-
Notifications
You must be signed in to change notification settings - Fork 0
TimMenninger/HeldKarp
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
/**---------------------------------------------------------------------------+ | | | README | | | | Description: This contains information about the HeldKarp cuda project for | | CS 179. Included is its usage, operation details and analyses| | of what we did. | | | | Authors: Tim Menninger | | Jared Reed | | | +----------------------------------------------------------------------------*/ I. File List ------------ CS179FinalProjectProposal.docx Initial Project Proposal HeldKarp.cc HeldKarp Implementation HeldKarp.cu HeldKarp GPU Kernels HeldKarp.cuh HeldKarp Header File Makefile Makefile for HeldKarp README ta_utilities.cpp TA files ta_utilities.hpp TA files SimpleTest.txt Test points with 6 points ComplexTest.txt Test points with 20 points TestCases Directory containing test cases TestCases directory also contains images with the intital data points as well as images with the outputted path II. Usage --------- After making, invoke the following command: ./HeldKarp < "input text file" > < threadsPerBlock > < maxBlocks > Input data text files are stored in the TestCases directory containing: SimpleTest.txt Simple Test containing 6 points ComplexTest.txt Complex Test containing 20 data points Note that this is very space complex and may not run the whole point set. To change how many points are used in the run, change the NUM_POINTS value in HeldKarp.cuh. An example runtime line might look like from the HeldKarp directory: ./HeldKarp SimpleTest.txt 512 512 After running, the path will be saved into out.txt, which you can then plot by running python Plotter.py SimpleTest.txt out.txt Note that this requires matplotlib. III. Program Explanation ------------------------- This program contains two implementions of the Held-Karp Algorithm for solving the traveling salesman problem. The first is a CPU implementation, while the second is a much faster GPU implementation. The traveling salesman problem consists of finding the minimum route between a set of points, while ending at the start point. Although this seems a simple problem, no algorithm has been conceived which reduces the time complexity by any large extent. This is because the TSP problem works in a fully connected graph, where every point could be connected. The brute force approach has a time complexity of O(n!), but the Held-Karp algorithm utilizes a memoization array that breaks the problem down into multiple subprobems, and eliminates redundant calculations. This implementation reduces the complexity to only O((n^2)(2^n)). In this algorithm, there is a memoization array. This array consists of a row for every subset of the set of points. There are 2^n subsets, but because the only useful subsets are those of cardinality 2 and greater, this ends up being 2^(n - 1) - 1. Within each row of the array, there is a value for each member of the subset. Thus, the value at index (i, j) would correspond to the shortest distance through all points in subset i that ends with point j. If j is not in subset i, then the cell is meaningless. In addition to the shortest distance, we also store the second-to-last point (because j is the last point). This allows us to iteratively recreate the path after filling the array and examining the full-set row. Part of creating this memoization array is knowing the distance between each set of two points, so the first step is iterating through each n^2 / 2 pairs and filling in the distance between them. Then, we create our base case, which involves filling our memoization array at each row that corresponds to a subset of length 2. Finally, we recursively create subsets and fill the memoization array in order of increasing subset length. After the base case, we use the following algorithm to fill the array. For a subset, s, we consider each point, k, as a possible "last point". Thus, we can remove k from s and use our memoization array to find the shortest path through all points in s \ {k}. Our array is structured in such a way that we know the shortest path through s \ {k} ending at any point, m, in s \ {k}. Then, for each possible m, we take the distance in the memoization array for s \ {k} ending at m, and add the distance from m to k. This is the shortest distance through our points in s ending with {m, k}. If we take the smallest over all possible m, we get the shortest path through all points in s ending at k. Eventually, when we have filled the array, we can look at the index that corresponds to the entire set. We iterate through each value in that row to find the endpoint, k, which had the smallest distance. We now know the last point is k. This array cell also stores the second-to-last point, j. This knowledge will allow us to find the rest of the path in linear time. Because we know the last point is k, and the second to last point is j, we look at the row corresponding to the full set, S, without k (S \ {k}) and look at the j'th value in the row, which then stores the next previous value. We follow this until the previous value is our starting point (or until the size of the subset is 2), at which point we have the entire path. The final step is adding the distance from the last point to the source point, and we get our final loop. For more information, refer to: https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm IV. Design Decisions & Challenges ---------------------------------- One challenge we faced was being able to find a unique index into the array given a set of numbers. The final algorithm ended up as a result of a guess- and-check system that, after countless hours, ended with the realization that one could construct a balanced tree of subsets. This tree would have its root be the smallest possible useful subset: {0, 1}. Then, each right node would increment the largest value in the set by 1, and each left node would add to the set the smallest number greater than the largest value in the set. This would happen until no larger value could be added to the set. This then allowed us to be able to find an index of a subset in linear time (2^n subsets, log(2^n) levels in this tree). Another difficult scenario we ran into was the program was not working for any inputs larger than 27 points. Originally, we thought we were incorrectly reading files. However, eventually we realized the issue was due to an extremely large space complexity of O(2^n) because it stores all subsets in the memoization array. The program was failing because the computer did not have enough space to store the array for such a large input. It then turned out that the GPU was even more space-limited, keeping us to only 12 point maximums before running out of memory. Figuring out the proper way to iterate over the subsets was also more difficult than we expected. It was very easy to generate all of the subsets, but the issue was they were not sorted by size, which is what we needed, as we need to run our function on all subsets size 1, then 2, and so on until we have the final set. This is necessary because each subset builds off of the ones which are size 1 smaller than they are. In the end, rather than sort the list of subsets, we changed our function to take size of subset in as an argument. Then, we looped from 1 to the max size, and called the function every time. Also, for this approach we needed to change our function to only update the memoization array if the current subset was the size of the size argument passed in. This let us properly iterate over the subsets by size. We considered only keeping memoization rows for the current size and one less than the current size, but then we lost the data that allows us to construct the full path in the end. Thus, our only choice was to use the extremely space-complex implementation. Another challenge was attempting to run a recursive function on the GPU. We had never run into this problem in the earlier labs in this class. We decidided the best way to handle this was to have the recursive function run inside the CPU, and every call would call a single GPU kernel, rather than trying to have the GPU kernel recursivelly call itself. Inside this GPU kernel, we needed to find the best way to parallelize the memoization process. Originally, it seemed easy; we would designate each thread to a subset and fill quickly. However, there were several issues with that. First, we found no good way of going from thread index to set values. Second, the order of the memoization array is not in size-order, so we would eventually deadlock waiting for other rows to fill. Thus, we parallelized by enumerating two nested loops and assigning a thread to each pair of two values. We could do this because each iteration in the loop doesn't depend on another iteration, but parts of the memoization array that were filled in previously. Perhaps, we could have V. Expected Results -------------------- We expect both the CPU and GPU implementations to return the correct shortest path from the TSP problem. However, we expect the GPU implementation to be much faster since we are using multiple threads to update the memoization array. Unfortunately, there was some amount of overhead involved in creating and copying as much memory as we were. Thus, on small cases, the CPU works faster than the GPU. As we increase the size of the input, the GPU quickly closes the gap. We run out of space on Mako, though, before we get to an input size that allows us to see the GPU's significance. VI. Analysis of Results ------------------------ Results when run on 3 points CPU Implementation: 0.000s GPU Implementation: 0.001s Performance Increase: -11845% Results when run on 4 points CPU Implementation: 0.000s GPU Implementation: 0.001s Performance Increase: -5718% Results when run on 5 points CPU Implementation: 0.000s GPU Implementation: 0.001s Performance Increase: -5015% Results when run on 10 points CPU Implementation: 0.006s GPU Implementation: 0.107s Performance Increase: -1877% Results when run on 22 points CPU Implementation: 43.056s GPU Implementation: Performance Increase: We ran into memory issues on the CPU when we used more than 27 points, and on the GPU when we used more than 11 points. We assume this is because the algorithm uses O(2^n) space, which grows extremely quickly. Therefore, we could only run on small inputs, and thus were unable to see the GPU improve the performance in any significant way. However, we did notice that there is a lot of overhead in the implementation, and the difference went down significantly as we added more points. We searched for what others observed, and found that they observed a similarly poor performance on small inputs. Additionally, they noted that their biggest limiting factor was space. For more information, refer to: http://www.slideshare.net/DimitrisMavrommatis/travelling-salesman-problem-57460886 We wanted to try to improve ours by using streams, but were unable to because after approximately every three runs, we would begin seeing an error message "GPUassert: out of memory", which wouldn't cease for 30-60 minutes. As a result, we could debug on average about 5 lines per hour. We also believe that had we found a good way to assign threads to subsets instead of to loop iterators within each subset that the GPU speedup would have been much greater.
About
Code for CS 179 final project, which implemented the Held-Karp traveling salesperson algorithm on a GPU.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published