Running Apache Storm benchmark

Samantha Chan edited this page Jun 13, 2014 · 6 revisions

Overview

Before you begin, make sure you have compiled the application and created the dataset: [Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)

The Apache Storm benchmark contains the following topologies:

  1. EnronTopology: Complete application benchmark
  2. TrivialTopology1: Same as EnronTopology but filter, modify, and metrics are unity bolts
  3. TrivialTopology2: Same as TrivialTopology1 but filter and modify bolts are removed
  4. TrivialTopology3: Same as TrivialTopology2 but serialization and deserialization bolts are removed. This topology requires an unserialized but compressed dataset (create it using com.ibm.storm.email.benchmark.testing.CreateCompressedDatasetSequential)
  5. TrivialTopology4: Same as TrivialTopology3 but without compression and decompression. This topology requires an uncompressed dataset (create it using com.ibm.storm.email.benchmark.testing.CreateSerializedDatasetSequential)

To run the application benchmark

storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.storm.email.benchmark.<topology_name> <local_or_remote> <job_id>

These topologies use plain shuffle grouping. If you want to use localOrShuffle grouping instead, use com.ibm.storm.email.benchmark.local.<topology_name>.

For some setups, especially single-process ones, shuffle grouping seems to perform better than localOrShuffle grouping.
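
The launch command above can be wrapped in a small helper that picks the class name for either grouping. This is a minimal sketch, not part of the benchmark: build_storm_cmd and its optional fourth argument are hypothetical, and the values "local" and "remote" for <local_or_remote> are assumed from the placeholder name. The helper only prints the command so it can be checked before it is run.

```shell
#!/bin/sh
# JAR and PACKAGE are copied from the launch command on this page.
JAR="target/storm-email-benchmark-1.0-jar-with-dependencies.jar"
PACKAGE="com.ibm.storm.email.benchmark"

# build_storm_cmd is a hypothetical helper, not shipped with the benchmark.
build_storm_cmd() {
    topology="$1"             # e.g. EnronTopology or TrivialTopology1
    mode="$2"                 # assumed values: local or remote
    job_id="$3"               # also names the results directory under <logspath>
    grouping="${4:-shuffle}"  # "shuffle" (default) or "local" for localOrShuffle

    if [ "$grouping" = "local" ]; then
        class="$PACKAGE.local.$topology"
    else
        class="$PACKAGE.$topology"
    fi
    printf 'storm jar %s %s %s %s\n' "$JAR" "$class" "$mode" "$job_id"
}

# Print (do not execute) the command for a local EnronTopology run.
build_storm_cmd EnronTopology local job1
```

Once the printed command looks right, paste it into the shell from the project directory so that the relative path to the jar resolves.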

Final Metrics Configuration

  1. The final metrics are emitted by the Global Metrics Bolt
  2. To do so, the bolt needs to know the total number of emails, which it reads from the totalemails entry in the configuration file
  3. This number must be updated each time the dataset changes: uncomment the totalemails entry for the corresponding dataset in the configuration file

Results Collection

The final counts of characters, words, and paragraphs, along with throughput, elapsed time, and the number of processed emails, can be retrieved from <logspath>/<job_id>/GlobalMetricsBolt_Final

See the Configuration section above for details of "logspath".
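
As a rough sketch, the final metrics for a job can be displayed with a helper like the one below. show_final_metrics is a hypothetical name, and because this page does not say whether GlobalMetricsBolt_Final is a file or a directory, the sketch handles both cases.

```shell
#!/bin/sh
# show_final_metrics is a hypothetical helper, not shipped with the benchmark.
# Arguments: the logspath from the configuration file and the job_id used at launch.
show_final_metrics() {
    logspath="$1"
    job_id="$2"
    final="$logspath/$job_id/GlobalMetricsBolt_Final"

    if [ -d "$final" ]; then
        # Assumption: if it is a directory, list its contents.
        ls -l "$final"
    elif [ -f "$final" ]; then
        # Assumption: if it is a regular file, print it.
        cat "$final"
    else
        echo "no final metrics at $final (has the job finished?)" >&2
        return 1
    fi
}
```

Call it as show_final_metrics <logspath> <job_id> once the topology has processed the whole dataset.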

  1. Interval metrics can be obtained from <logspath>/<job_id>/GlobalMetricsBolt and <logspath>/<job_id>/GlobalMetricsBolt_Throughput

  2. To collect CPU time after the job has completed:
     a. Run jps and note down the PIDs of all Worker processes
     b. For each Worker PID, run ps -e -o pid,cputime | grep <pid>
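
The CPU time steps can be scripted. The sketch below is one possible approach under a few assumptions: worker_pids and cpu_time_for are hypothetical helper names, the Worker processes are assumed to appear in jps output with the name "worker" in any letter case, and ps is queried with -p for a single PID instead of filtering ps -e through grep, which avoids matching a PID that is a substring of another.

```shell
#!/bin/sh
# worker_pids (hypothetical helper): reads jps output on stdin and prints
# the PID of every line whose process name is "worker", ignoring case.
worker_pids() {
    awk 'tolower($2) == "worker" { print $1 }'
}

# cpu_time_for (hypothetical helper): prints the PID and accumulated CPU
# time of one process, without the ps header line.
cpu_time_for() {
    ps -o pid= -o cputime= -p "$1"
}

# Usage on a node that ran the topology (jps ships with the JDK):
#   jps | worker_pids | while read -r pid; do cpu_time_for "$pid"; done
```

Run this on every node that hosted Worker processes, since jps only lists JVMs on the local machine.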