GetStarted_standalone
- Clone CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
- Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html, and install Apache Spark 1.6.0 per the instructions at http://spark.apache.org/downloads.html.
${CAFFE_ON_SPARK}/scripts/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
export PATH=${HADOOP_HOME}/bin:${PATH}
${CAFFE_ON_SPARK}/scripts/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
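Before moving on, it can help to confirm that the exported variables point at real directories. The helper below is a minimal sketch; check_dir_var is a hypothetical name used for illustration and is not part of CaffeOnSpark.

```shell
# check_dir_var NAME: succeed if the variable NAME is set and names an
# existing directory. Hypothetical helper, not part of CaffeOnSpark.
check_dir_var() {
    # Look up the value of the variable whose name was passed in $1.
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$1 points to a missing directory: $value" >&2
        return 1
    fi
    echo "$1=$value"
}

# Example:
#   for v in CAFFE_ON_SPARK HADOOP_HOME SPARK_HOME; do check_dir_var "$v"; done
```

A non-zero return for any of the three variables suggests that the corresponding export was run from a different working directory than the one the setup script unpacked into.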
- Install Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html
- Create a CaffeOnSpark/caffe-public/Makefile.config
pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd
Uncomment settings as needed:
CPU_ONLY := 1 #if you have CPU only
USE_CUDNN := 1 #if you want to use CUDNN
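The uncommenting step can also be scripted. The sketch below assumes the settings appear in Makefile.config as commented lines of the form "# CPU_ONLY := 1"; enable_setting is a hypothetical helper name, not something shipped with Caffe or CaffeOnSpark.

```shell
# enable_setting FILE NAME: uncomment a "# NAME := ..." line in FILE.
# Hypothetical helper; lines that are already uncommented are left unchanged.
enable_setting() {
    file=$1
    name=$2
    # Strip the leading "#" and any spaces in front of NAME, keep the rest.
    sed "s/^#[[:space:]]*\(${name}[[:space:]]*:=.*\)/\1/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example:
#   enable_setting "${CAFFE_ON_SPARK}/caffe-public/Makefile.config" CPU_ONLY
```

Writing to a temporary file and renaming it avoids relying on sed -i, whose syntax differs between GNU and BSD sed.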
- Build CaffeOnSpark
pushd ${CAFFE_ON_SPARK}
make build
popd
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/
- Install mnist dataset
${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
Adjust ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt to use absolute paths, such as:
source: "file:///home/afeng/CaffeOnSpark/data/mnist_train_lmdb/"
source: "file:///home/afeng/CaffeOnSpark/data/mnist_test_lmdb/"
Adjust data/lenet_memory_solver.prototxt to use the appropriate solver mode.
solver_mode: CPU #GPU if you use GPU nodes
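The absolute-path adjustment above can be done by hand in an editor, or roughly as sketched here. The sketch assumes that each existing source line in the prototxt ends with the LMDB directory name (mnist_train_lmdb or mnist_test_lmdb); set_lmdb_source is a hypothetical helper name, so check the file afterwards.

```shell
# set_lmdb_source FILE DB_NAME ABS_DIR: rewrite the source line that refers
# to DB_NAME so that it reads file://ABS_DIR/DB_NAME/.
# Hypothetical helper; assumes ABS_DIR contains no "|" character.
set_lmdb_source() {
    file=$1
    db=$2
    dir=$3
    # Replace everything between the quotes, keeping the line's indentation.
    sed "s|source: \".*${db}/*\"|source: \"file://${dir}/${db}/\"|" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example:
#   f="${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt"
#   set_lmdb_source "$f" mnist_train_lmdb "${CAFFE_ON_SPARK}/data"
#   set_lmdb_source "$f" mnist_test_lmdb "${CAFFE_ON_SPARK}/data"
```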
- Launch standalone Spark cluster
Start master:
${SPARK_HOME}/sbin/start-master.sh
The Spark log for the above command contains the Spark master URL, which starts with the prefix "spark://".
Start one or more workers and connect them to the master via the master Spark URL. Go to the master web UI and make sure that the expected number of workers has launched.
export MASTER_URL=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=1
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER_URL}
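To see how these variables feed the spark-submit settings used in the training step, here is a small worked calculation. plan_cores is a hypothetical helper for illustration only. It reflects the usual reading of the two Spark settings: spark.cores.max caps the total cores for the application, and spark.task.cpus is the number of cores reserved per task.

```shell
# plan_cores WORKERS CORES_PER_WORKER: print the values that the training
# command would pass to Spark. Hypothetical helper for illustration.
plan_cores() {
    workers=$1
    cores_per_worker=$2
    total=$((workers * cores_per_worker))
    # Each task reserves cores_per_worker cores, so at most this many
    # tasks can run at the same time.
    tasks=$((total / cores_per_worker))
    echo "spark.cores.max=${total} spark.task.cpus=${cores_per_worker} concurrent_tasks=${tasks}"
}

plan_cores 1 1   # the values exported above
plan_cores 2 4   # illustrative: 2 workers with 4 cores each
```

With this setup the number of concurrent tasks equals the number of workers, which is the same value the training command passes as -clusterSize.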
- Train a DNN using CaffeOnSpark with the Spark executors connected over Ethernet (the command below passes ${SPARK_WORKER_INSTANCES} as -clusterSize). If you have an InfiniBand interface, use "-connection infiniband" instead.
pushd ${CAFFE_ON_SPARK}/data
rm -rf ${CAFFE_ON_SPARK}/mnist_lenet.model
rm -rf ${CAFFE_ON_SPARK}/lenet_features_result
spark-submit --master ${MASTER_URL} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
-devices 1 \
-connection ethernet \
-model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
-output file:${CAFFE_ON_SPARK}/lenet_features_result
ls -l ${CAFFE_ON_SPARK}/mnist_lenet.model
cat ${CAFFE_ON_SPARK}/lenet_features_result/*
Check the Spark Worker Web UI to see the progress of training. You should see standard Caffe logs like the ones below.
I0215 04:45:41.444522 26306 solver.cpp:237] Iteration 0, loss = 2.45106
I0215 04:45:41.444772 26306 solver.cpp:253] Train net output #0: loss = 2.45106 (* 1 = 2.45106 loss)
I0215 04:45:41.444911 26306 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0215 04:46:10.320430 26306 solver.cpp:237] Iteration 100, loss = 0.337411
I0215 04:46:10.320597 26306 solver.cpp:253] Train net output #0: loss = 0.337411 (* 1 = 0.337411 loss)
I0215 04:46:10.320667 26306 sgd_solver.cpp:106] Iteration 100, lr = 0.00992565
I0215 04:46:37.602695 26306 solver.cpp:237] Iteration 200, loss = 0.2749
I0215 04:46:37.602886 26306 solver.cpp:253] Train net output #0: loss = 0.2749 (* 1 = 0.2749 loss)
I0215 04:46:37.602932 26306 sgd_solver.cpp:106] Iteration 200, lr = 0.00985258
I0215 04:46:59.177289 26306 solver.cpp:237] Iteration 300, loss = 0.165734
I0215 04:46:59.177484 26306 solver.cpp:253] Train net output #0: loss = 0.165734 (* 1 = 0.165734 loss)
I0215 04:46:59.177533 26306 sgd_solver.cpp:106] Iteration 300, lr = 0.00978075
I0215 04:47:27.075026 26306 solver.cpp:237] Iteration 400, loss = 0.26131
I0215 04:47:27.075108 26306 solver.cpp:253] Train net output #0: loss = 0.26131 (* 1 = 0.26131 loss)
I0215 04:47:27.075125 26306 sgd_solver.cpp:106] Iteration 400, lr = 0.00971013
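If you save the worker log to a file, the loss per iteration can be pulled out for a quick look at convergence. This is a minimal sketch that assumes the log lines have the "Iteration N, loss = X" shape shown above; extract_loss is a hypothetical helper name.

```shell
# extract_loss: read Caffe log lines on stdin and print "iteration loss"
# pairs. Lines without "Iteration N, loss = X" are skipped.
extract_loss() {
    sed -n 's/.*Iteration \([0-9][0-9]*\), loss = \([^ ]*\).*/\1 \2/p'
}

# Three lines copied from the log excerpt above; only the two
# solver.cpp lines carry a loss value.
extract_loss <<'EOF'
I0215 04:45:41.444522 26306 solver.cpp:237] Iteration 0, loss = 2.45106
I0215 04:45:41.444911 26306 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0215 04:46:10.320430 26306 solver.cpp:237] Iteration 100, loss = 0.337411
EOF
```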
The feature result file should look like:
{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}
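The result file holds one JSON object per line, so a rough summary can be computed with standard tools. The sketch below averages the "accuracy" field and assumes each line carries a single accuracy value in brackets, as in the sample above; mean_accuracy is a hypothetical helper name.

```shell
# mean_accuracy: read feature-result JSON lines on stdin and print the
# mean of the "accuracy" values, to four decimal places.
mean_accuracy() {
    # Pull out the number inside "accuracy":[...], then average with awk.
    sed -n 's/.*"accuracy":\[\([^]]*\)].*/\1/p' |
        awk '{ sum += $1; n++ } END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Two lines copied from the sample output above.
mean_accuracy <<'EOF'
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
EOF
```

To summarize a full run, pipe the output of the cat command shown earlier into mean_accuracy.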
- Access CaffeOnSpark from Python
See "Get started with Python on CaffeOnSpark".
- Shut down the Spark cluster
${SPARK_HOME}/sbin/stop-slave.sh
${SPARK_HOME}/sbin/stop-master.sh