Developer Guide
- Setting up Dr. Elephant
- Prerequisites
- Testing Dr. Elephant
- Project Structure
- Heuristics
- Schedulers
- Score Calculation
Setting up Dr. Elephant
Follow the quick setup instructions here.
See "Step 3" in the Quick Setup Instructions.
In order to deploy Dr. Elephant on your local box for testing, you need to setup Hadoop(version 2.x) or Spark(Yarn mode, version > 1.4.0) with the Resource Manager and Job History Server daemons running (a pseudo distributed mode). Instructions to run a MapReduce job on YARN in a pseudo-distributed mode can be found here.
Export the variable HADOOP_HOME if you haven't already:
$> export HADOOP_HOME=/path/to/hadoop/home
$> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Add Hadoop to the system path because Dr. Elephant uses 'hadoop classpath' to load the right classes:
$> export PATH=$HADOOP_HOME/bin:$PATH
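You can verify this step by printing the classpath that Dr. Elephant will pick up:
$> hadoop classpath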
Make sure the Job History Server is running, as Dr. Elephant requires it:
$> $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
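You can confirm the daemon is up with jps, which should list a JobHistoryServer process:
$> jps | grep JobHistoryServer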
Dr. Elephant requires a database to store information on the jobs and analysis results.
Set up and start MySQL locally on your box. You can get the latest version of MySQL from https://www.mysql.com/downloads/. Dr. Elephant currently supports MySQL 5.5+, as reported by Alex (wget.null@gmail.com) on the Google Groups forum. Create a database called 'drelephant':
$> mysql -u root -p
mysql> create database drelephant;
You can configure the database URL, database name, user, and password in the elephant.conf file in the app-conf directory.
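For reference, the database settings in app-conf/elephant.conf look roughly like the following. The property names are taken from the bundled sample file and the values shown are illustrative defaults; verify against the elephant.conf shipped with your version.
# Database settings (illustrative defaults)
db_url=localhost
db_name=drelephant
db_user=root
db_password=""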
Configuring a different SQL Database:
Dr. Elephant is currently configured to work with MySQL, and the evolution files contain MySQL DDL statements. If you wish to configure a different SQL database, you may follow the instructions here.
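In a standard Play layout the evolution files live under conf/evolutions (see the Project Structure section below); the DDL statements there would need to be ported to your database's SQL dialect.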
Testing Dr. Elephant
You can run all the unit tests by calling the compile.sh script. You can also run a single test as follows:
activator "test-only com.linkedin.drelephant.tez.fetchers.TezFetcherTest"
Project Structure
app → Contains all the source files
└ com.linkedin.drelephant → Application Daemons
└ org.apache.spark → Spark Support
└ controllers → Controller logic
└ models → Includes models that Map to DB
└ views → Page templates
app-conf → Application Configurations
└ elephant.conf → Port, DB, Keytab and other JVM Configurations (Overrides application.conf)
└ FetcherConf.xml → Fetcher Configurations
└ HeuristicConf.xml → Heuristic Configurations
└ JobTypeConf.xml → JobType Configurations
conf → Configuration files
└ evolutions → DB Schema
└ application.conf → Main configuration file
└ log4j.properties → Log configuration file
└ routes → Routes definition
images
└ wiki → Contains the images used in the wiki documentation
public → Public assets
└ assets → Library files
└ css → CSS files
└ images → Image files
└ js → JavaScript files
scripts
└ start.sh → Starts Dr. Elephant
└ stop.sh → Stops Dr. Elephant
test → Source folder for unit tests
compile.sh → Compiles the application
Heuristics
Dr. Elephant already has a number of heuristics for MapReduce and Spark. Refer to the Heuristics guide for details on these heuristics. All of these heuristics are pluggable and can be configured easily.
You can write your own heuristics and plug them into Dr. Elephant.
- Create a new heuristic and test it. (A minimal skeleton is sketched after the configuration examples below.)
- Create a new view for the heuristic. For example, helpMapperSpill.scala.html
- Add the details of the heuristic in the HeuristicConf.xml file.
- The HeuristicConf.xml file requires the following details for each heuristic:
- applicationtype: The type of application analysed by the heuristic. e.g. mapreduce or spark
- heuristicname: Name of the heuristic.
- classname: Fully qualified name of the class.
- viewname: Fully qualified name of the view.
- hadoopversions: Versions of Hadoop with which the heuristic is compatible.
- Run Dr. Elephant. It should now include the new heuristic.
A sample entry in HeuristicConf.xml would look like this:
<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper GC</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperGCHeuristic</classname>
<viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
If you wish to override the threshold values of the severities used in a heuristic with custom limits, you can specify them in HeuristicConf.xml inside a params tag, as in the example below.
A sample entry showing how to override/configure severity thresholds would look like this:
<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper Data Skew</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperDataSkewHeuristic</classname>
<viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
<params>
<num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
<deviation_severity>2, 4, 8, 16</deviation_severity>
<files_severity>1/8, 1/4, 1/2, 1</files_severity>
</params>
</heuristic>
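To make the first step above concrete, here is a minimal heuristic skeleton. This is a hedged sketch, not Dr. Elephant's canonical template: it assumes the Heuristic interface with apply() and getHeuristicConfData() methods, a constructor taking the HeuristicConfigurationData for the conf entry, and a HeuristicResult built from the class name, heuristic name, severity, and score. The class name is illustrative, and import paths and exact signatures should be verified against the bundled MapReduce heuristics in your version.

package com.linkedin.drelephant.mapreduce.heuristics;

import com.linkedin.drelephant.analysis.Heuristic;
import com.linkedin.drelephant.analysis.HeuristicResult;
import com.linkedin.drelephant.analysis.Severity;
import com.linkedin.drelephant.configurations.heuristic.HeuristicConfigurationData;
import com.linkedin.drelephant.mapreduce.data.MapReduceApplicationData;

// Hedged sketch of a custom MapReduce heuristic (class name is illustrative).
public class MyCustomHeuristic implements Heuristic<MapReduceApplicationData> {
  private final HeuristicConfigurationData _heuristicConfData;

  // Dr. Elephant constructs each heuristic with its HeuristicConf.xml entry,
  // which also carries any custom <params> thresholds.
  public MyCustomHeuristic(HeuristicConfigurationData heuristicConfData) {
    _heuristicConfData = heuristicConfData;
  }

  @Override
  public HeuristicConfigurationData getHeuristicConfData() {
    return _heuristicConfData;
  }

  @Override
  public HeuristicResult apply(MapReduceApplicationData data) {
    // Inspect counters and task data here and derive a severity.
    Severity severity = Severity.NONE;
    int tasks = data.getMapperData().length;

    // Score the result as described in the Score Calculation section below.
    int score = 0;
    if (severity != Severity.NONE && severity != Severity.LOW) {
      score = severity.getValue() * tasks;
    }
    return new HeuristicResult(_heuristicConfData.getClassName(),
        _heuristicConfData.getHeuristicName(), severity, score);
  }
}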
Schedulers
As of today, Dr. Elephant supports three workflow schedulers: Azkaban, Airflow, and Oozie. All of these schedulers are enabled by default and should work out of the box, although a few configurations may be required for Airflow and Oozie.
Schedulers and all their properties are configured in the SchedulerConf.xml file under the app-conf directory.
Look at the sample SchedulerConf.xml file below to see which properties need to be configured for the respective schedulers.
<!-- Scheduler configurations -->
<schedulers>
<scheduler>
<name>azkaban</name>
<classname>com.linkedin.drelephant.schedulers.AzkabanScheduler</classname>
</scheduler>
<scheduler>
<name>airflow</name>
<classname>com.linkedin.drelephant.schedulers.AirflowScheduler</classname>
<params>
<airflowbaseurl>http://localhost:8000</airflowbaseurl>
</params>
</scheduler>
<scheduler>
<name>oozie</name>
<classname>com.linkedin.drelephant.schedulers.OozieScheduler</classname>
<params>
<!-- URL of oozie host -->
<oozie_api_url>http://localhost:11000/oozie</oozie_api_url>
<!-- ### Non-mandatory properties ###
### Choose the authentication method
<oozie_auth_option>KERBEROS/SIMPLE</oozie_auth_option>
### Override the Oozie console URL with a template (the only parameter will be the id)
<oozie_job_url_template></oozie_job_url_template>
<oozie_job_exec_url_template></oozie_job_exec_url_template>
### (If scheduled jobs are expected, make sure to add the following templates, since Oozie doesn't provide their URLs on server v4.1.0)
<oozie_workflow_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_url_template>
<oozie_workflow_exec_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_exec_url_template>
### Use true if you can assure all app names are unique.
### When true, Dr. Elephant will unite all runs of a coordinator (e.g. when a coordinator is killed and then run again)
<oozie_app_name_uniqueness>false</oozie_app_name_uniqueness>
-->
</params>
</scheduler>
</schedulers>
To leverage the full functionality of Dr. Elephant, all four of the IDs below must be provided:
- Job Definition ID: A unique reference to the job in the entire flow independent of the execution. This should filter all the mr jobs triggered by the job for all the historic executions of that job.
- Job Execution ID: A unique reference to a specific execution of the job. This should filter all mr jobs triggered by the job for a particular execution.
- Flow Definition ID: A unique reference to the entire flow, independent of any execution. This should filter all the historic mr jobs belonging to the flow. Note that if your scheduler supports sub-workflows, this ID should reference the super parent flow that triggered all the jobs and sub-workflows.
- Flow Execution ID: A unique reference to a specific flow execution. This should filter all mr jobs for a particular flow execution. Again, note that if the scheduler supports sub-workflows, this ID should be the super parent flow execution ID that triggered the jobs and sub-workflows.
Dr. Elephant expects all of the above IDs to be available for it to integrate with any scheduler. Without these IDs, Dr. Elephant cannot provide the level of integration support that we see for Azkaban. For example, if the Job Definition ID is not provided, Dr. Elephant will not be able to capture a historic representation of the job. Similarly, if the Flow Definition ID is not provided, the flow's history cannot be captured. In the absence of all the above links, Dr. Elephant can only show the job's performance at the execution level (MapReduce job level).
In addition to the above four IDs, Dr. Elephant accepts an optional job name and four optional links, which help users navigate easily from Dr. Elephant to the scheduler. Omitting these will not affect the functionality of Dr. Elephant. A sketch of how these IDs and URLs map onto a scheduler class follows the list below.
- Flow Definition URL
- Flow Execution URL
- Job Definition URL
- Job Execution URL
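The IDs and URLs above correspond to getters on the scheduler classes under com.linkedin.drelephant.schedulers. The following is a hedged sketch of a custom scheduler, assuming a Scheduler interface shaped like the built-in Azkaban/Airflow/Oozie implementations; the method names, constructor signature, and the property names read from the job configuration are all illustrative and should be verified against your version.

package com.linkedin.drelephant.schedulers;

import java.util.Properties;

import com.linkedin.drelephant.configurations.scheduler.SchedulerConfigurationData;

// Hedged sketch of a custom scheduler (interface shape assumed from the
// built-in schedulers; verify against the actual Scheduler interface).
public class MyScheduler implements Scheduler {
  private final String _flowDefId;
  private final String _flowExecId;
  private final String _jobDefId;
  private final String _jobExecId;

  // Dr. Elephant passes the job's configuration properties, from which the
  // scheduler extracts its IDs (the property names here are purely illustrative).
  public MyScheduler(String appId, Properties properties, SchedulerConfigurationData schedulerConfData) {
    _flowDefId = properties.getProperty("myscheduler.flow.def.id");
    _flowExecId = properties.getProperty("myscheduler.flow.exec.id");
    _jobDefId = properties.getProperty("myscheduler.job.def.id");
    _jobExecId = properties.getProperty("myscheduler.job.exec.id");
  }

  @Override public String getSchedulerName() { return "myscheduler"; }
  @Override public boolean isEmpty() {
    return _flowDefId == null || _flowExecId == null || _jobDefId == null || _jobExecId == null;
  }

  // The four mandatory IDs described above.
  @Override public String getFlowDefId() { return _flowDefId; }
  @Override public String getFlowExecId() { return _flowExecId; }
  @Override public String getJobDefId() { return _jobDefId; }
  @Override public String getJobExecId() { return _jobExecId; }

  // The four optional links (here simply reusing the IDs as placeholders).
  @Override public String getFlowDefUrl() { return _flowDefId; }
  @Override public String getFlowExecUrl() { return _flowExecId; }
  @Override public String getJobDefUrl() { return _jobDefId; }
  @Override public String getJobExecUrl() { return _jobExecId; }

  @Override public String getJobName() { return _jobDefId; }
  @Override public int getWorkflowDepth() { return 0; }
}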
Score Calculation
A score is a value used by Dr. Elephant for analyzing and comparing two different executions of workflows or jobs. The score of a MapReduce job can be as simple as the product of the unweighted sum of all the severity values and the number of tasks.
// Illustrative wrapper (the method name is ours, not Dr. Elephant's):
// the task score is the severity value times the number of tasks,
// ignoring NONE and LOW severities.
int getTaskScore(Severity severity, int tasks) {
  int score = 0;
  if (severity != Severity.NONE && severity != Severity.LOW) {
    score = severity.getValue() * tasks;
  }
  return score;
}
We can define the following scores:
- Task score: The product of the unweighted sum of all the severity values and the number of tasks.
- Job score: The sum of the scores of all the tasks of the job.
- Flow score: The sum of the scores of all the jobs of the flow.
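For example, with the default severity values (NONE = 0, LOW = 1, MODERATE = 2, SEVERE = 3, CRITICAL = 4), a heuristic that reports SEVERE across a job's 200 tasks contributes a task score of 3 × 200 = 600; the job score adds up such contributions, and the flow score is the total across all of its jobs.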