Create dataset for Apache Storm benchmark

Samantha Chan edited this page Jun 13, 2014 · 3 revisions

In this step, you will create the dataset file necessary for the StormEmailBenchmark.

Before you begin: Make sure you have completed the [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset) step.

To create the dataset for the Apache Storm benchmark:

  • Dataset files follow the naming convention name<n>.ext, where n runs from 0 to m - 1 and m is the parallelism of the spout.

  • The dataset should be present on NFS

  • Three different datasets can be generated. The generation code for all three is in the package com.ibm.storm.email.benchmark.testing.

    1. Compressed and Serialized: for the main application benchmark
      • Generated using CreateDatasetSequential
      • For use with topologies: EnronTopology, TrivialTopology1, and TrivialTopology2
    2. Compressed and Unserialized
      • Generated using CreateCompressedDatasetSequential
      • For use with topology TrivialTopology3
    3. Uncompressed and Serialized
      • Generated using CreateSerializedDatasetSequential
      • For use with topology TrivialTopology4
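
The pairing of topologies and generator classes listed above can be written down as a small lookup table. This is only an illustrative sketch: the class `GeneratorLookup` and its method `generatorFor` are hypothetical helpers and are not part of the benchmark code. Only the topology names, generator class names, and package name come from this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the benchmark): maps each topology
// named on this page to the generator class that produces its dataset.
class GeneratorLookup {
    // Package name as given on this page.
    static final String PKG = "com.ibm.storm.email.benchmark.testing";

    // Topology name -> generator class, transcribed from the list above.
    static final Map<String, String> GENERATOR_FOR_TOPOLOGY = new LinkedHashMap<>();
    static {
        // 1. Compressed and serialized
        GENERATOR_FOR_TOPOLOGY.put("EnronTopology", "CreateDatasetSequential");
        GENERATOR_FOR_TOPOLOGY.put("TrivialTopology1", "CreateDatasetSequential");
        GENERATOR_FOR_TOPOLOGY.put("TrivialTopology2", "CreateDatasetSequential");
        // 2. Compressed and unserialized
        GENERATOR_FOR_TOPOLOGY.put("TrivialTopology3", "CreateCompressedDatasetSequential");
        // 3. Uncompressed and serialized
        GENERATOR_FOR_TOPOLOGY.put("TrivialTopology4", "CreateSerializedDatasetSequential");
    }

    // Returns the fully qualified generator class name for a topology.
    static String generatorFor(String topology) {
        String cls = GENERATOR_FOR_TOPOLOGY.get(topology);
        if (cls == null) {
            throw new IllegalArgumentException("Unknown topology: " + topology);
        }
        return PKG + "." + cls;
    }

    public static void main(String[] args) {
        for (String topology : GENERATOR_FOR_TOPOLOGY.keySet()) {
            System.out.println(topology + " -> " + generatorFor(topology));
        }
    }
}
```

Running `main` prints one line per topology, which is a quick way to check which of the three generator classes a given topology needs.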

Each of the three generators takes the output of the preprocessing stage as its input, and all three accept similar arguments.

For instance, to generate the compressed and serialized dataset:

java -cp target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
    com.ibm.storm.email.benchmark.testing.CreateDatasetSequential \
    <input_path: the output of CoalesceEnronDataset> \
    <output_file_path>
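
To make the name<n>.ext naming convention from the start of this page concrete, the sketch below lists the file names expected for a given spout parallelism m. It is an illustration only: the class `DatasetNames`, the base name `enron`, and the extension `ser` are hypothetical placeholders, and this page does not say how the generators themselves name their output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the benchmark): lists the dataset file
// names implied by the name<n>.ext convention for a spout parallelism m.
class DatasetNames {
    // Builds name0.ext, name1.ext, ..., name(m-1).ext.
    static List<String> fileNames(String name, String ext, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be at least 1");
        }
        List<String> names = new ArrayList<>();
        for (int n = 0; n < parallelism; n++) {
            names.add(name + n + "." + ext);
        }
        return names;
    }

    public static void main(String[] args) {
        // "enron" and "ser" are placeholder values chosen for illustration.
        System.out.println(fileNames("enron", "ser", 4));
        // prints [enron0.ser, enron1.ser, enron2.ser, enron3.ser]
    }
}
```

With a spout parallelism of 4, the convention therefore calls for four files, numbered 0 through 3.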