Running MapReduce Jobs

egonina edited this page Dec 20, 2013 · 6 revisions

The video event detection example shows how to run MapReduce jobs (using the mrjob package to drive Hadoop) on a cluster of nodes. This work was done in collaboration with Penporn Koanantakool from UC Berkeley.

Installation

  1. Install Hadoop on your cluster.
  2. Install PyCASP.
  3. Install the mrjob package. For more detailed instructions see steps 15 and 16 here.

Running MapReduce with PyCASP

See the video event detection app example:

  1. First, we create a mapper() function; in our case it takes a video file name and runs the diarizer on that file (using the speaker diarization code in cluster.py). See the code here.

  2. Then we create a main() function to set up the mrjob parameters and call the mapper function. See the code example.
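The two steps above can be sketched in pure Python, with no Hadoop required to see the shape. The diarize() stub and the file names below are hypothetical stand-ins for the real code in cluster.py; in the actual app, mapper() becomes the mapper of an mrjob job class, and main() configures mrjob and submits the job to Hadoop instead of looping locally.

```python
def diarize(video_file):
    # Hypothetical stand-in for the real diarize() in cluster.py,
    # which runs PyCASP speaker diarization on the file.
    return "diarized %s" % video_file

def mapper(video_file):
    # Takes a video file name, runs the diarizer on it, and emits a
    # (file name, result) pair -- the key/value shape mrjob expects.
    yield video_file, diarize(video_file)

def main(video_files):
    # In the real app this step sets up the mrjob parameters and
    # launches the job on the cluster; here we just apply the mapper
    # to each input locally to illustrate the data flow.
    results = {}
    for f in video_files:
        for key, value in mapper(f):
            results[key] = value
    return results

if __name__ == "__main__":
    print(main(["meeting1.avi", "meeting2.avi"]))
```

With Hadoop in place, each input line (a video file name) is fed to one mapper invocation, so the per-file loop above is what the cluster parallelizes across nodes.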

That's it! Once the environment is set up, you should be able to run the main function on the cluster. Each node will then execute the mapper() function, in our case the diarize() function in cluster.py.

In our video event detection system, we also need to create a config file for each diarizer job; here is the code for that pre-processing step.
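As a minimal sketch of that pre-processing step: write one small plain-text config file per input video, which the diarizer job can then read on its cluster node. The parameter names below (video_file, num_gaussians, output_dir) are hypothetical placeholders; the real diarizer's config keys are defined in the linked pre-processing code.

```python
import os

def write_diarizer_configs(video_files, out_dir):
    # For each video file, write a config file named after the video
    # into out_dir, and return the list of config file paths.
    paths = []
    for video in video_files:
        name = os.path.splitext(os.path.basename(video))[0]
        path = os.path.join(out_dir, name + ".cfg")
        with open(path, "w") as f:
            # Hypothetical diarizer parameters, one "key = value" per line.
            f.write("video_file = %s\n" % video)
            f.write("num_gaussians = 5\n")
            f.write("output_dir = %s\n" % out_dir)
        paths.append(path)
    return paths
```

A typical call would be write_diarizer_configs(video_list, "/path/to/configs") just before submitting the MapReduce job, so every mapper finds a ready-made config for its file.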