Parallel construction of generalized suffix tree in Spark

sdwalker233/SparkGST

SparkGST

Parallel construction of generalized suffix tree in Spark.

Compile

git clone https://github.com/shad0w-walker233/SparkGST.git
cd SparkGST
sbt package

Execute

${SPARK_HOME}/bin/spark-submit \
--master <spark cluster master uri> \
--class GST.Main \
--executor-memory 15G \
--driver-memory 15G \
--executor-cores 4 \
<jar file path> \
hdfs://input_path \
hdfs://output_path \
<TASK_MUL (optional, default 7)> \
<MAX_PREFIX_LEN (optional, default 4)>

Algorithm

  1. Read all files under the input path.
  2. Preprocessing: determine which substrings can serve as keys.
  3. Map stage: for each suffix, generate a node linked to the root node, keyed by the leading characters of the suffix that form a valid key.
  4. Reduce stage: for each key, combine the nodes that share it into the corresponding subtree of the GST.
  5. Traverse each subtree recursively and output the information stored in its leaf nodes.
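
The steps above can be sketched in plain Python, without Spark, to show how keying by prefix splits the work into independent subtrees. This is only an illustration under several assumptions, not the code in this repository. The key-selection rule (lengthen a prefix until at most `max_bucket` suffixes share it, up to `max_prefix_len`) is a guess at what the preprocessing step might do. The names `choose_keys`, `key_of`, `build_subtree`, `leaves`, `build_gst` and `max_bucket` are hypothetical. Each subtree is an uncompressed trie rather than a compressed suffix tree, and a single shared terminator `$` is used for all input strings.

```python
from collections import Counter, defaultdict

END = "$"  # assumed terminator; must not occur in the input strings


def choose_keys(texts, max_bucket, max_prefix_len):
    """Step 2 (assumed rule): lengthen a prefix until its bucket is small."""
    keys, frontier = set(), {""}
    for length in range(1, max_prefix_len + 1):
        counts = Counter()
        for text in texts:
            s = text + END
            for i in range(len(s)):
                p = s[i:i + length]
                # Only extend prefixes that were too crowded one step earlier.
                if len(p) == length and p[:-1] in frontier:
                    counts[p] += 1
        frontier = set()
        for p, c in counts.items():
            if c <= max_bucket or length == max_prefix_len or p.endswith(END):
                keys.add(p)       # small enough, or cannot be extended
            else:
                frontier.add(p)   # too crowded: split further
    return keys


def key_of(suffix, keys, max_prefix_len):
    """Step 3: the key set is prefix-free, so at most one prefix matches."""
    for length in range(1, max_prefix_len + 1):
        if suffix[:length] in keys:
            return suffix[:length]
    raise ValueError("no key covers suffix %r" % suffix)


def build_subtree(suffixes):
    """Step 4: merge all suffixes of one key into a trie (nested dicts)."""
    root = {}
    for doc, start, s in suffixes:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append((doc, start))  # leaf payload
    return root


def leaves(node):
    """Step 5: recursive traversal returning (doc, start) of every leaf."""
    out = list(node.get(None, []))
    for ch in sorted(k for k in node if k is not None):
        out.extend(leaves(node[ch]))
    return out


def build_gst(texts, max_bucket=2, max_prefix_len=4):
    keys = choose_keys(texts, max_bucket, max_prefix_len)
    buckets = defaultdict(list)  # stands in for the shuffle between map and reduce
    for doc, text in enumerate(texts):
        s = text + END
        for i in range(len(s)):
            buckets[key_of(s[i:], keys, max_prefix_len)].append((doc, i, s[i:]))
    return {k: build_subtree(v) for k, v in buckets.items()}


if __name__ == "__main__":
    for key, tree in sorted(build_gst(["banana", "ana"]).items()):
        print(key, leaves(tree))
```

Because no key is a prefix of another key, every suffix falls into exactly one bucket, and each bucket can be reduced to its subtree without looking at any other bucket. That independence is what makes the construction parallelizable by key.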
