Create dataset for Apache Storm benchmark

In this step, you will create the dataset file necessary for the StormEmailBenchmark.

Before you begin: Make sure you have completed the [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset ) step.

To create the dataset for the Apache Storm benchmark:

The dataset files follow the naming convention `name<n>.ext`, where n runs from 0 to m - 1 and m is the parallelism of the spout, so there is one file per spout task.
The dataset must be present on NFS.
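
The naming convention can be illustrated with a minimal sketch that lists the file names expected for a given spout parallelism. The class `DatasetNames`, the method `shardNames`, and the sample values `enron` and `ser` are hypothetical and are not part of the benchmark code.

```java
import java.util.ArrayList;
import java.util.List;

final class DatasetNames {
    // Returns name0.ext through name(m-1).ext for a spout parallelism of m.
    static List<String> shardNames(String name, String ext, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be at least 1");
        }
        List<String> names = new ArrayList<>();
        for (int n = 0; n < parallelism; n++) {
            names.add(name + n + "." + ext);
        }
        return names;
    }

    public static void main(String[] args) {
        // "enron" and "ser" are placeholder values, not names used by the benchmark.
        for (String file : shardNames("enron", "ser", 4)) {
            System.out.println(file);
        }
    }
}
```

With a parallelism of 4, the sketch lists four names, numbered 0 through 3.
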
Three different datasets can be generated. The generation code for all three is in the package com.ibm.storm.email.benchmark.testing:
1. Compressed and Serialized: for the main application benchmark
  - Generated using CreateDatasetSequential
  - For use with topologies: EnronTopology, TrivialTopology1, and TrivialTopology2
2. Compressed and Unserialized
  - Generated using CreateCompressedDatasetSequential
  - For use with topology TrivialTopology3
3. Uncompressed and Serialized
  - Generated using CreateSerializedDatasetSequential
  - For use with topology TrivialTopology4
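
The pairing of topologies and generators listed above can be written down as a small lookup, which may help when scripting dataset creation. This is only a sketch: the class `GeneratorLookup` and its methods are hypothetical, and it assumes each generator's fully qualified name is the package above followed by the class name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GeneratorLookup {
    // Package named in this page as the home of all three generators.
    static final String PACKAGE = "com.ibm.storm.email.benchmark.testing";

    // Maps each topology to the simple class name of its dataset generator.
    static Map<String, String> generatorByTopology() {
        Map<String, String> generators = new LinkedHashMap<>();
        generators.put("EnronTopology", "CreateDatasetSequential");
        generators.put("TrivialTopology1", "CreateDatasetSequential");
        generators.put("TrivialTopology2", "CreateDatasetSequential");
        generators.put("TrivialTopology3", "CreateCompressedDatasetSequential");
        generators.put("TrivialTopology4", "CreateSerializedDatasetSequential");
        return generators;
    }

    // Returns the assumed fully qualified generator class for a topology.
    static String generatorFor(String topology) {
        String simpleName = generatorByTopology().get(topology);
        if (simpleName == null) {
            throw new IllegalArgumentException("unknown topology: " + topology);
        }
        return PACKAGE + "." + simpleName;
    }

    public static void main(String[] args) {
        for (String topology : generatorByTopology().keySet()) {
            System.out.println(topology + " -> " + generatorFor(topology));
        }
    }
}
```
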

Each generator takes the output of the preprocessing stage as its input, and all three accept similar arguments.

For instance, to generate the compressed and serialized dataset:

    java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
        com.ibm.storm.email.benchmark.testing.CreateDatasetSequential \
        <input_path: the output of CoalesceEnronDataset> \
        <output_file_path>
