## Longest Common Subsequence Parallelization with MPI, OpenMP, PThreads

### Design

Each thread/process takes care of a column of the F table and computes it based on the DP formulation. A slight variation is made in the code: each thread is instead responsible for a row, because of the way the matrix is loaded into the cache. For each thread, the row and column ranges are broken into smaller slices for computation and cache reuse, resulting in sub-matrix blocks. Since each entry in the F matrix depends on the three entries directly above, to the left, and in the upper-left diagonal position, the same dependency applies to each sub-matrix block (each block depends on three other blocks; more specifically, it needs the bottom-most row of the block above, the right-most column of the block to the left, and the bottom-right entry of the previous diagonal block). Because of this restriction, the threads have to synchronize using barriers, proceeding in anti-diagonal order from the top-left to the bottom-right of the whole F table.
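
The DP formulation referred to above is the standard LCS recurrence. A minimal serial sketch is shown below; the names `X`, `Y`, `F`, and `lcs_fill` are illustrative, not taken from the repository code.

```c
/* Hypothetical serial reference for the F table: X and Y are the two input
 * strings (lengths m and n), F is an (n+1) x (m+1) table whose row 0 and
 * column 0 are zero.  Each entry needs only the cells above, to the left,
 * and diagonally above-left, which is the dependency the blocked parallel
 * versions below have to respect. */
void lcs_fill(const char *X, int m, const char *Y, int n, int **F)
{
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++) {
            if (i == 0 || j == 0)
                F[i][j] = 0;                              /* empty prefix      */
            else if (Y[i - 1] == X[j - 1])
                F[i][j] = F[i - 1][j - 1] + 1;            /* characters match  */
            else
                F[i][j] = F[i - 1][j] > F[i][j - 1]       /* take the larger   */
                        ? F[i - 1][j] : F[i][j - 1];      /* neighbour         */
        }
}
```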

1. PThreads implementation:
   - For the PThreads-specific code, a loop in `main` starts the specified number of threads, passing each its id and running the `computeBlock` function. The column dimension (input text file 2) is broken into subsets of 100 rows each, while the row dimension (input text file 1) is broken into NUM_THREADS sets of columns.
   - In `computeBlock`, each thread has an outer loop over all row subsets and, for each, an inner loop over all column subsets, forming a small sub-matrix block. The DP algorithm for LCS is applied with two more inner loops over each row and column of the sub-matrix. Computed values are stored in the global F table, and traceback is applied to this table at the end. A minimal sketch of the per-thread work follows this item.
   - Note that traceback is better done serially after the threads have joined: the traceback procedure starts from the bottom-right entry and jumps back up and to the left depending on whether the corresponding characters match and on the values above and to the left. If we divide up the F table and give a piece to each thread, the threads cannot all start on their sub-tables at the same time and have to wait for results from another thread (this is implemented in the MPI solution below). The other option is to perform speculative traceback on each sub-table so that they can all start at once, but if the speculation is incorrect the whole sub-table computation is wasted, and the overall time/cost penalty may outweigh the benefit of parallelizing this step.
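
The sketch below illustrates the per-thread scheme under the assumption that each thread owns one column strip and the threads advance in skewed rounds, so the active blocks always lie on one anti-diagonal; the names (`computeBlock`, `F`, `X`, `Y`, `run_threads`) and the round structure are illustrative, not the repository's exact code.

```c
#include <pthread.h>

#define NUM_THREADS 4
#define ROW_BLOCK   100              /* rows per block, as in the description */

/* Hypothetical globals, assumed to be filled in from the two input files. */
char *X, *Y;                         /* text 1 (columns) and text 2 (rows)    */
int   n, m;                          /* length of Y (rows) and X (columns)    */
int **F;                             /* (n+1) x (m+1) table, row/col 0 = 0    */
pthread_barrier_t barrier;

/* Each thread owns one vertical strip of columns and walks down it in
 * ROW_BLOCK-row blocks.  The skewed round index keeps the active blocks on
 * one anti-diagonal, and the barrier after every round guarantees the blocks
 * above, to the left, and diagonally above-left are finished first. */
void *computeBlock(void *arg)
{
    int id     = *(int *)arg;
    int width  = (m + NUM_THREADS - 1) / NUM_THREADS;
    int jlo    = id * width + 1;
    int jhi    = (id + 1) * width < m ? (id + 1) * width : m;
    int blocks = (n + ROW_BLOCK - 1) / ROW_BLOCK;

    for (int round = 0; round < blocks + NUM_THREADS; round++) {
        int b = round - id;                        /* this thread's row block  */
        if (b >= 0 && b < blocks && jlo <= jhi) {
            int ilo = b * ROW_BLOCK + 1;
            int ihi = (b + 1) * ROW_BLOCK < n ? (b + 1) * ROW_BLOCK : n;
            for (int i = ilo; i <= ihi; i++)
                for (int j = jlo; j <= jhi; j++)
                    F[i][j] = (Y[i - 1] == X[j - 1])
                            ? F[i - 1][j - 1] + 1
                            : (F[i - 1][j] > F[i][j - 1] ? F[i - 1][j]
                                                         : F[i][j - 1]);
        }
        pthread_barrier_wait(&barrier);            /* advance the wavefront    */
    }
    return NULL;
}

/* Thread start-up as described: a loop creates the threads, passing each its
 * id; the serial traceback would run after the joins. */
void run_threads(void)
{
    pthread_t tid[NUM_THREADS];
    int       ids[NUM_THREADS];

    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, computeBlock, &ids[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&barrier);
}
```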
2. OpenMP implementation:
   - For the OpenMP-specific code, the column dimension (input text file 2) is again broken into subsets of 100 rows each, while the row dimension (input text file 1) is broken into NUM_THREADS sets of columns. Within the `#pragma omp parallel` structured block, we use the `#pragma omp for` directive with a dynamic schedule in front of the outer loop, so any idle thread is given a subset of rows to work on.
   - For each row subset, an inner loop goes through all column subsets, forming a small sub-matrix block. As in the PThreads version above, the DP algorithm for LCS is applied with two more inner loops over rows and columns, and the computed values are stored in the global F table for traceback later. Again, the traceback procedure is performed serially, for exactly the reasons given in the section above. A sketch of the blocked OpenMP fill follows this item.
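
Below is a minimal sketch of one common way to express the blocked fill in OpenMP: a serial outer loop over anti-diagonals of the block grid, with a dynamically scheduled `omp for` handing the independent blocks of each diagonal to idle threads. The names (`lcs_fill_omp`, `ROW_BLOCK`, `COL_BLOCK`) are hypothetical and the loop structure is an assumption, not necessarily the exact one used in this repository.

```c
#include <omp.h>

#define ROW_BLOCK 100            /* rows per block, as in the description */
#define COL_BLOCK 100            /* columns per block (assumed)           */

/* Blocks on the same anti-diagonal of the block grid are independent of
 * each other, so each diagonal becomes one parallel loop; the implicit
 * barrier at the end of the `omp for` synchronizes the wavefront.
 * Row 0 and column 0 of F are assumed to be zero-initialized. */
void lcs_fill_omp(const char *X, int m, const char *Y, int n, int **F)
{
    int rblocks = (n + ROW_BLOCK - 1) / ROW_BLOCK;
    int cblocks = (m + COL_BLOCK - 1) / COL_BLOCK;

    for (int d = 0; d < rblocks + cblocks - 1; d++) {     /* anti-diagonals */
        #pragma omp parallel for schedule(dynamic)
        for (int br = 0; br < rblocks; br++) {
            int bc = d - br;
            if (bc < 0 || bc >= cblocks)
                continue;                                  /* not on this diagonal */
            int ilo = br * ROW_BLOCK + 1, ihi = (br + 1) * ROW_BLOCK;
            int jlo = bc * COL_BLOCK + 1, jhi = (bc + 1) * COL_BLOCK;
            if (ihi > n) ihi = n;
            if (jhi > m) jhi = m;
            for (int i = ilo; i <= ihi; i++)
                for (int j = jlo; j <= jhi; j++)
                    F[i][j] = (Y[i - 1] == X[j - 1])
                            ? F[i - 1][j - 1] + 1
                            : (F[i - 1][j] > F[i][j - 1] ? F[i - 1][j]
                                                         : F[i][j - 1]);
        }
    }
}
```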
3. MPI implementation:
   - For MPI, the main process (rank 0) reads both text files and broadcasts the column-dimension string (input text file 1) to all other processes. It then divides the row-dimension string (input text file 2) into num_process subsets of rows (stored in the listSubY[] array) and sends one to each process.
   - All processes, including rank 0, create their own sub F table to store results and use the same DP LCS algorithm to compute its values. The received column string is divided into num_process subsets for computation and synchronization. Since a process also needs the bottom-most row values computed by the process before it, it calls MPI_Recv and waits for that result to be sent; after computing each block, it sends its own bottom-most row values from its F table to the next process. This message passing, together with barriers, ensures that every block is computed from correct values. A sketch of this pipelined fill phase follows this list.
   - At the end, the last process traces back through its own sub F table and sends the result to its preceding process, which does the same, until the accumulated result reaches rank 0, which performs its own traceback and outputs the LCS.
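
A minimal sketch of the pipelined fill phase is shown below. All names (`fill_block_pipeline`, `Ylocal`, `local_rows`, the use of the block index as the message tag) are hypothetical, and the zero-initialization of column 0 of the local F table is assumed; the sketch only illustrates the receive-compute-send pattern described above.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Each rank owns `local_rows` rows of the F table (its slice of text 2) and
 * walks across them in `nprocs` column blocks of the broadcast string X.
 * Before each block it receives the last row of the block directly above
 * from the previous rank; after each block it sends its own last row of
 * that block to the next rank.  Column 0 of F is assumed zero-initialized. */
void fill_block_pipeline(const char *X, int m,
                         const char *Ylocal, int local_rows,
                         int **F, int rank, int nprocs)
{
    int  block = (m + nprocs - 1) / nprocs;               /* columns per block */
    int *top   = malloc((size_t)(m + 1) * sizeof *top);   /* row above slice   */
    memset(top, 0, (size_t)(m + 1) * sizeof *top);        /* rank 0: all zeros */

    for (int b = 0; b < nprocs; b++) {
        int jlo = b * block + 1;
        int jhi = (b + 1) * block < m ? (b + 1) * block : m;

        if (rank > 0)                                      /* boundary from above */
            MPI_Recv(&top[jlo], jhi - jlo + 1, MPI_INT, rank - 1, b,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= local_rows; i++)
            for (int j = jlo; j <= jhi; j++) {
                int up   = (i == 1) ? top[j]     : F[i - 1][j];
                int diag = (i == 1) ? top[j - 1] : F[i - 1][j - 1];
                if (Ylocal[i - 1] == X[j - 1])
                    F[i][j] = diag + 1;
                else
                    F[i][j] = up > F[i][j - 1] ? up : F[i][j - 1];
            }

        if (rank < nprocs - 1)                             /* boundary for below */
            MPI_Send(&F[local_rows][jlo], jhi - jlo + 1, MPI_INT,
                     rank + 1, b, MPI_COMM_WORLD);
    }
    free(top);
}
```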
