Create dataset for Apache Storm benchmark
In this step, you will create the dataset file necessary for the StormEmailBenchmark.
Before you begin: Make sure you have completed this step: [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)
To create the dataset for the Apache Storm benchmark:
- The dataset naming convention is `name<n>.ext`, where n starts at 0 and goes up to m - 1, and m is the parallelism of the spout.
- The dataset should be present on NFS.
- Three different datasets can be generated. The generation code for all three is in the package com.ibm.storm.email.benchmark.testing.
  - Compressed and Serialized: for the main application benchmark
    - Generated using CreateDatasetSequential
    - For use with topologies: EnronTopology, TrivialTopology1, and TrivialTopology2
  - Compressed and Unserialized
    - Generated using CreateCompressedDatasetSequential
    - For use with topology TrivialTopology3
  - Uncompressed and Serialized
    - Generated using CreateSerializedDatasetSequential
    - For use with topology TrivialTopology4
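
As a minimal sketch of the naming convention, the following shell function lists the file names expected for a given spout parallelism. The function name `dataset_names`, the base name `enron`, and the extension `ser` are hypothetical placeholders chosen for illustration; they are not names defined by the benchmark.

```shell
#!/bin/sh
# Hypothetical helper: print the dataset file names expected for a spout
# with parallelism m, following the name<n>.ext convention (n = 0 .. m-1).
dataset_names() {
    name=$1   # base name (placeholder)
    ext=$2    # extension without the dot (placeholder)
    m=$3      # spout parallelism
    n=0
    while [ "$n" -lt "$m" ]; do
        printf '%s%s.%s\n' "$name" "$n" "$ext"
        n=$((n + 1))
    done
}

# Example: a spout with parallelism 3 expects enron0.ser, enron1.ser, enron2.ser
dataset_names enron ser 3
```

Running the same function with the spout's actual parallelism gives a checklist of the files that should exist on NFS before the topology is submitted.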
The input to each generator is the output of the preprocessing stage, and all three generators take similar arguments.
For instance, to generate the compressed and serialized data:
java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
com.ibm.storm.email.benchmark.testing.CreateDatasetSequential \
<input_path: the output of CoalesceEnronDataset> \
<output_file_path>
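
The mapping from dataset variant to generator class listed above can be sketched as a small shell helper that prints the matching command line. This is an illustration only: the function name `generator_cmd` and the variant labels are hypothetical, and the sketch assumes that all three generators accept the same `<input_path> <output_file_path>` arguments, which the text describes only as "similar".

```shell
#!/bin/sh
# Hypothetical helper: map a dataset variant to its generator class
# (class names taken from the list above) and print the java command.
JAR=target/storm-email-benchmark-1.0-jar-with-dependencies.jar
PKG=com.ibm.storm.email.benchmark.testing

generator_cmd() {
    variant=$1   # one of the hypothetical labels below
    input=$2     # output of the preprocessing stage
    output=$3    # dataset file to create
    case $variant in
        compressed-serialized)   cls=CreateDatasetSequential ;;
        compressed-unserialized) cls=CreateCompressedDatasetSequential ;;
        uncompressed-serialized) cls=CreateSerializedDatasetSequential ;;
        *) echo "unknown variant: $variant" >&2; return 1 ;;
    esac
    # Assumption: every generator takes <input_path> <output_file_path>.
    echo "java -cp $JAR $PKG.$cls $input $output"
}

# Print (do not run) the command for the TrivialTopology3 dataset.
generator_cmd compressed-unserialized /path/to/coalesced /path/to/output
```

Printing the command instead of running it allows the arguments to be checked against each generator's actual usage first.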