Parallel construction of a generalized suffix tree (GST) in Spark.
git clone https://github.com/shad0w-walker233/SparkGST.git
cd SparkGST
sbt package
${SPARK_HOME}/bin/spark-submit \
--master <spark cluster master uri> \
--class GST.Main \
--executor-memory 15G \
--driver-memory 15G \
--executor-cores 4 \
<jar file path> \
hdfs://input_path \
hdfs://output_path \
<TASK_MUL (optional, default 7)> \
<MAX_PREFIX_LEN (optional, default 4)>
- Read all the files under the input path.
- Preprocessing: Determine which substrings can serve as keys.
- Map stage: For each suffix, generate a node linked to the root node, keyed by the leading characters of the suffix that form a valid key.
- Reduce stage: Combine the trees that share a key to build the corresponding subtree of the GST.
- Output: Traverse each subtree recursively and output the information stored in its leaf nodes.
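
The steps above can be sketched in plain Python, without Spark, to show how keying suffixes by a short prefix splits the GST into independent subtrees. This is a minimal illustration under stated assumptions, not the project's implementation. The key-selection rule (a prefix becomes a key once at most `max_bucket` suffixes share it, or once it reaches `max_prefix_len` characters), the `max_bucket` parameter, the `$` terminator, and all function names are hypothetical. The subtrees are uncompressed tries stored as nested dicts rather than compressed suffix-tree nodes.

```python
from collections import Counter, defaultdict

TERM = "$"  # assumed terminator; must not occur in the input strings


def all_suffixes(strings):
    """Yield (string_id, offset, suffix) for every suffix of every terminated string."""
    for sid, s in enumerate(strings):
        text = s + TERM
        for off in range(len(text)):
            yield sid, off, text[off:]


def choose_key(suffix, counts, max_prefix_len, max_bucket):
    """Hypothetical rule: shortest prefix that is rare enough, hits the cap, or ends the suffix."""
    for n in range(1, len(suffix) + 1):
        prefix = suffix[:n]
        if counts[prefix] <= max_bucket or n == max_prefix_len or prefix.endswith(TERM):
            return prefix
    return suffix  # not reached: every suffix ends with TERM


def build_gst(strings, max_prefix_len=4, max_bucket=2):
    """Return {key: subtree}; each subtree is a nested-dict trie of the suffixes under that key."""
    suffixes = list(all_suffixes(strings))

    # Preprocessing: count every prefix of length <= max_prefix_len.
    counts = Counter(
        suf[:n]
        for _, _, suf in suffixes
        for n in range(1, min(len(suf), max_prefix_len) + 1)
    )

    # Map stage: emit one (key, suffix record) pair per suffix.
    groups = defaultdict(list)
    for sid, off, suf in suffixes:
        key = choose_key(suf, counts, max_prefix_len, max_bucket)
        groups[key].append((sid, off, suf))

    # Reduce stage: merge all suffixes that share a key into one subtree.
    subtrees = {}
    for key, records in groups.items():
        root = {}
        for sid, off, suf in records:
            node = root
            for ch in suf[len(key):]:  # the key itself labels the edge from the GST root
                node = node.setdefault(ch, {})
            node.setdefault(None, []).append((sid, off))  # None marks leaf data
        subtrees[key] = root
    return subtrees


def leaves(node):
    """Recursively collect the (string_id, offset) pairs stored below a node."""
    found = list(node.get(None, []))
    for ch, child in node.items():
        if ch is not None:
            found.extend(leaves(child))
    return found


# Usage: list the suffix positions that end up under each key.
trees = build_gst(["banana", "ananas"])
for key in sorted(trees):
    print(key, sorted(leaves(trees[key])))
```

Because the chosen keys never prefix one another, each suffix lands in exactly one group, so the subtrees can be built independently. In a Spark job the grouping loop would presumably correspond to a shuffle by key.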