This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

GetStarted_standalone

Andy Feng edited this page Sep 20, 2016 · 2 revisions

Running CaffeOnSpark in Standalone Spark Clusters

  1. Clone the CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
  1. Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html, and install Apache Spark 1.6.0 per the instructions at http://spark.apache.org/downloads.html.
${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
export PATH=${HADOOP_HOME}/bin:${PATH}
${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
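
Before moving on, it may help to confirm that the exported variables point at real directories. The following is a minimal sketch; `check_home_dirs` is a hypothetical helper, not part of CaffeOnSpark, Hadoop, or Spark.

```shell
# Hypothetical helper (an assumption for illustration, not shipped with CaffeOnSpark):
# for each variable NAME given, report whether its value is an existing directory.
check_home_dirs() {
    status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "dir=\${$name:-}"
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            echo "ok: $name=$dir"
        else
            echo "missing: $name (value: '$dir')"
            status=1
        fi
    done
    return "$status"
}

# Usage after the exports above:
#   check_home_dirs CAFFE_ON_SPARK HADOOP_HOME SPARK_HOME
```

The function returns non-zero if any variable is unset or does not name a directory, so it can gate the later steps in a script.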
  1. Install Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html

  2. Create a CaffeOnSpark/caffe-public/Makefile.config

pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd

Uncomment settings as needed:

CPU_ONLY := 1  # if you have CPU only (no GPU)
USE_CUDNN := 1 # if you want to use cuDNN
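
The uncommenting can also be scripted. This is a sketch that assumes the settings appear in Makefile.config as commented lines of the form `# NAME := value`; `enable_setting` is a hypothetical helper name.

```shell
# Hypothetical helper: uncomment a "# NAME := value" line in a Makefile-style config.
# It writes through a temporary file rather than using sed -i, because the
# -i option takes different arguments in GNU sed and BSD sed.
enable_setting() {
    config="$1"
    name="$2"
    tmp="${config}.tmp.$$"
    # Strip the leading "#" and any spaces in front of "NAME :=".
    sed "s/^#[[:space:]]*\(${name}[[:space:]]*:=\)/\1/" "$config" > "$tmp" &&
        mv "$tmp" "$config"
}

# Usage, from ${CAFFE_ON_SPARK}/caffe-public/:
#   enable_setting Makefile.config CPU_ONLY
```

Lines for other settings are left untouched, so the helper can be called once per setting you want to enable.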
  1. Build CaffeOnSpark
pushd ${CAFFE_ON_SPARK}
make build
popd
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/
  1. Install the MNIST dataset
${CAFFE_ON_SPARK}/scripts/setup-mnist.sh

Adjust ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt to use absolute paths, such as:

source: "file:///home/afeng/CaffeOnSpark/data/mnist_train_lmdb/"
source: "file:///home/afeng/CaffeOnSpark/data/mnist_test_lmdb/"

Adjust ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt to use the appropriate solver mode:

solver_mode: CPU #GPU if you use GPU nodes
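
Both prototxt adjustments can be applied with sed instead of by hand. The sketch below assumes each `source:` and `solver_mode:` setting sits on its own line; `set_lmdb_source` and `set_solver_mode` are hypothetical helper names, and the data directory must be an absolute path that does not contain the `|` character.

```shell
# Hypothetical helper: point the source: line that mentions <lmdb_name>
# at an absolute file:// URL under <data_dir>.
set_lmdb_source() {
    file="$1"
    lmdb_name="$2"
    data_dir="$3"    # absolute path, e.g. ${CAFFE_ON_SPARK}/data
    tmp="${file}.tmp.$$"
    sed "s|^\([[:space:]]*source:\).*${lmdb_name}.*|\1 \"file://${data_dir}/${lmdb_name}/\"|" \
        "$file" > "$tmp" && mv "$tmp" "$file"
}

# Hypothetical helper: set solver_mode to CPU or GPU.
set_solver_mode() {
    file="$1"
    mode="$2"
    case "$mode" in
        CPU|GPU) ;;
        *) echo "mode must be CPU or GPU" >&2; return 1 ;;
    esac
    tmp="${file}.tmp.$$"
    sed "s/^\([[:space:]]*solver_mode:\).*/\1 ${mode}/" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Usage:
#   set_lmdb_source ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt mnist_train_lmdb ${CAFFE_ON_SPARK}/data
#   set_lmdb_source ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt mnist_test_lmdb ${CAFFE_ON_SPARK}/data
#   set_solver_mode ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt CPU
```

Note that `set_solver_mode` replaces the whole line, so any trailing comment on it is dropped.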
  1. Launch a standalone Spark cluster

Start master:

${SPARK_HOME}/sbin/start-master.sh

The Spark log for the above command contains a Spark master URL starting with the prefix "spark://".
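
The master URL can be pulled out of the log with grep. This is a sketch; `master_url_from_log` is a hypothetical helper, and the log location in the usage comment is an assumption, so check ${SPARK_HOME}/logs for the actual file name on your machine.

```shell
# Hypothetical helper: print the first spark:// URL found in a master log file.
master_url_from_log() {
    grep -o 'spark://[^[:space:]]*' "$1" | head -n 1
}

# Usage (the file name pattern is an assumption):
#   master_url_from_log "${SPARK_HOME}"/logs/*Master*.out
```

If nothing is printed, the master has not logged its URL yet or the wrong file was given.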

Start one or more workers and connect them to the master via the master Spark URL. In the master web UI, make sure that the expected number of workers have been launched.

export MASTER_URL=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=1
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER_URL}
  1. Train a DNN using CaffeOnSpark over an Ethernet connection. The number of Spark executors is given by -clusterSize, which is set here from ${SPARK_WORKER_INSTANCES}. If you have an Infiniband interface, use "-connection infiniband" instead.
pushd ${CAFFE_ON_SPARK}/data
rm -rf ${CAFFE_ON_SPARK}/mnist_lenet.model
rm -rf ${CAFFE_ON_SPARK}/lenet_features_result
spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result
ls -l ${CAFFE_ON_SPARK}/mnist_lenet.model
cat ${CAFFE_ON_SPARK}/lenet_features_result/*
popd

Check the Spark worker web UI to follow the progress of training. You should see standard Caffe logs like the following:

I0215 04:45:41.444522 26306 solver.cpp:237] Iteration 0, loss = 2.45106
I0215 04:45:41.444772 26306 solver.cpp:253]     Train net output #0: loss = 2.45106 (* 1 = 2.45106 loss)
I0215 04:45:41.444911 26306 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0215 04:46:10.320430 26306 solver.cpp:237] Iteration 100, loss = 0.337411
I0215 04:46:10.320597 26306 solver.cpp:253]     Train net output #0: loss = 0.337411 (* 1 = 0.337411 loss)
I0215 04:46:10.320667 26306 sgd_solver.cpp:106] Iteration 100, lr = 0.00992565
I0215 04:46:37.602695 26306 solver.cpp:237] Iteration 200, loss = 0.2749
I0215 04:46:37.602886 26306 solver.cpp:253]     Train net output #0: loss = 0.2749 (* 1 = 0.2749 loss)
I0215 04:46:37.602932 26306 sgd_solver.cpp:106] Iteration 200, lr = 0.00985258
I0215 04:46:59.177289 26306 solver.cpp:237] Iteration 300, loss = 0.165734
I0215 04:46:59.177484 26306 solver.cpp:253]     Train net output #0: loss = 0.165734 (* 1 = 0.165734 loss)
I0215 04:46:59.177533 26306 sgd_solver.cpp:106] Iteration 300, lr = 0.00978075
I0215 04:47:27.075026 26306 solver.cpp:237] Iteration 400, loss = 0.26131
I0215 04:47:27.075108 26306 solver.cpp:253]     Train net output #0: loss = 0.26131 (* 1 = 0.26131 loss)
I0215 04:47:27.075125 26306 sgd_solver.cpp:106] Iteration 400, lr = 0.00971013
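
To follow the loss without reading every log line, the iteration and loss values can be extracted with sed. This sketch assumes log lines in the format shown above; `caffe_loss_table` is a hypothetical helper name.

```shell
# Hypothetical helper: print "iteration loss" pairs from Caffe solver log lines
# such as "... solver.cpp:237] Iteration 100, loss = 0.337411".
# Reads the files given as arguments, or standard input if none are given.
caffe_loss_table() {
    sed -n 's/.*] Iteration \([0-9][0-9]*\), loss = \([^[:space:]]*\).*/\1 \2/p' "$@"
}

# Usage, with a worker log saved to a file of your choice:
#   caffe_loss_table worker_stderr.log
```

The "Train net output" and learning-rate lines do not match the pattern and are skipped.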

The feature result file should look like:

{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}
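
A quick summary of the result file can be computed with sed and awk. This sketch averages the "accuracy" field over all records, assuming one JSON record per line with a single-element accuracy array as in the sample above; `mean_accuracy` is a hypothetical helper name.

```shell
# Hypothetical helper: average the "accuracy" values across result records.
# Reads the files given as arguments, or standard input if none are given.
mean_accuracy() {
    sed -n 's/.*"accuracy":\[\([0-9.eE+-]*\)].*/\1/p' "$@" |
        awk '{ sum += $1; n++ } END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Usage:
#   cat ${CAFFE_ON_SPARK}/lenet_features_result/* | mean_accuracy
```

Nothing is printed when no record contains an accuracy field.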
  1. Access CaffeOnSpark from Python

See the "Get started with Python on CaffeOnSpark" wiki page.

  1. Shut down the Spark cluster
${SPARK_HOME}/sbin/stop-slave.sh
${SPARK_HOME}/sbin/stop-master.sh