- Clone CaffeOnSpark code.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark
- Install Apache Hadoop 2.6 per http://hadoop.apache.org/releases.html, and install Apache Spark 1.6.0 per instruction at http://spark.apache.org/downloads.html.
export HADOOP_HOME=$(pwd)/hadoop-2.6.4
export PATH=${HADOOP_HOME}/bin:${PATH}
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
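Before moving on, it can help to confirm that the exports above took effect. The following is a minimal sketch; `require_cmd` is a hypothetical helper name, and the only assumption is that `hadoop` and `spark-submit` are the launchers found under the two `bin` directories added to PATH.

```shell
# Print a message and return non-zero if a command is not on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "not found on PATH: $1" >&2
    return 1
  fi
}

# Launchers that the later steps rely on.
for cmd in hadoop spark-submit; do
  require_cmd "${cmd}" || echo "check HADOOP_HOME, SPARK_HOME and PATH"
done
```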
- Install Caffe prerequisites per http://caffe.berkeleyvision.org/installation.html or from http://installing-caffe-the-right-way.wikidot.com/start
For CPU Mode:
Make sure that all the dependent libraries are compiled with libc++.
Check this using otool, for example:
otool -L /usr/local/Cellar/opencv/2.4.12_2/lib/libopencv_objdetect.dylib
You should see libc++ linked like:
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
If any dependent library links against libstdc++ instead, recompile that library from source with libc++.
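Checking libraries one at a time gets tedious, so the otool check can be looped over a directory. This is only a sketch: `links_libstdcxx` is a hypothetical helper, and `LIB_DIR` defaults to `/usr/local/lib`, which is an assumption about where your dependent libraries live.

```shell
# Succeed if `otool -L` output read from stdin references libstdc++.
links_libstdcxx() {
  grep -qF 'libstdc++'
}

# Assumed library location; override LIB_DIR to match your installation.
LIB_DIR="${LIB_DIR:-/usr/local/lib}"

# otool exists only on OS X, so skip the scan elsewhere.
if command -v otool >/dev/null 2>&1; then
  for lib in "${LIB_DIR}"/*.dylib; do
    [ -e "${lib}" ] || continue
    if otool -L "${lib}" | links_libstdcxx; then
      echo "links libstdc++: ${lib}"
    fi
  done
fi
```

Any library the loop reports is a candidate for recompiling with libc++.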
- Create a CaffeOnSpark/caffe-public/Makefile.config
Check that $JAVA_HOME is set.
pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
Set the paths in INCLUDE_DIRS and LIBRARY_DIRS to the dependent libs as per your local installation. This step is critical: wrong paths are a common cause of compile and link failures.
Uncomment settings as needed:
CPU_ONLY := 1 # for CPU-only builds (no GPU)
For CPU_ONLY, as stated above, make sure all dependent libs are compiled with libc++.
For GPU mode on OS X:
With CUDA 7.0 or later, nothing special is required except commenting out CPU_ONLY.
On OS X >= 10.9 with CUDA < 7.0, you may need to compile all dependent libs with libstdc++.
Comment out INFINIBAND in all cases on OS X unless you have an InfiniBand verbs (libibverbs) driver (not tested).
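The `echo "INCLUDE_DIRS += ${JAVA_HOME}/include"` line above silently writes a broken path when JAVA_HOME is empty. A minimal sketch of a guard, assuming a standard JDK layout where `jni.h` sits under `${JAVA_HOME}/include`; `check_java_home` is a hypothetical helper name.

```shell
# Return non-zero with a message if JAVA_HOME is unset or lacks the JNI header.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -f "${JAVA_HOME}/include/jni.h" ]; then
    echo "jni.h not found under ${JAVA_HOME}/include" >&2
    return 1
  fi
}
```

Run `check_java_home` before appending to Makefile.config, and fix JAVA_HOME first if it reports a problem.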
- Build CaffeOnSpark
export DYLD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:/usr/local/cuda/lib:/usr/local/mkl/lib/intel64/
make buildosx
Make sure the DYLD_LIBRARY_PATH entries above match your local installation: the CUDA library path if you chose GPU mode, and the MKL library path if you use MKL.
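After `make buildosx` finishes, you can confirm that the outputs referenced elsewhere on this page exist. The sketch below checks only the two `distribute/lib` directories from DYLD_LIBRARY_PATH and the jar passed to spark-submit later; `check_build_outputs` is a hypothetical helper, and a real build may produce other files as well.

```shell
# Return non-zero if any expected build output is missing under the checkout at $1.
check_build_outputs() {
  root="$1"
  status=0
  for path in \
    "${root}/caffe-public/distribute/lib" \
    "${root}/caffe-distri/distribute/lib" \
    "${root}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
  do
    if [ ! -e "${path}" ]; then
      echo "missing build output: ${path}" >&2
      status=1
    fi
  done
  return "${status}"
}
```

Usage: `check_build_outputs "${CAFFE_ON_SPARK}"`.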
- Install mnist dataset
Adjust ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt to use absolute paths, such as:
source: "file:///home/afeng/CaffeOnSpark/data/mnist_train_lmdb/"
source: "file:///home/afeng/CaffeOnSpark/data/mnist_test_lmdb/"
Set the appropriate solver mode in ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt:
solver_mode: CPU #GPU if you use GPU nodes
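The path edit can also be scripted instead of done by hand. This is a sketch under assumptions: it expects each LMDB entry to be a `source: "..."` line whose value ends in `mnist_train_lmdb` or `mnist_test_lmdb`, and `set_lmdb_sources` is a hypothetical helper. Check the file afterwards, since your prototxt may be laid out differently.

```shell
# Rewrite the LMDB "source:" entries of a prototxt to absolute file:// URLs.
# $1: prototxt path, $2: absolute directory containing the two LMDB folders.
set_lmdb_sources() {
  proto="$1"
  data_dir="$2"
  # Keep the 'source: "' prefix, replace whatever path precedes the LMDB folder name.
  sed -E 's#(source:[[:space:]]*")[^"]*(mnist_(train|test)_lmdb)/?"#\1file://'"${data_dir}"'/\2/"#' \
    "${proto}" > "${proto}.tmp" && mv "${proto}.tmp" "${proto}"
}
```

Usage: `set_lmdb_sources "${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt" "${CAFFE_ON_SPARK}/data"`.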
- Launch standalone Spark cluster
Start master:
${SPARK_HOME}/sbin/start-master.sh
The Spark log for this command contains a Spark master URL that starts with the prefix "spark://".
Start one or more workers and connect them to the master via the master's Spark URL. Open the master web UI and make sure it lists exactly the number of workers you launched.
export MASTER_URL=spark://$(hostname):7077
${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER_URL}
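The start-slave.sh command above uses CORES_PER_WORKER, and the spark-submit command in the next step also uses TOTAL_CORES, but this page never sets either. A minimal sketch with example values, to be run in the same shell before starting the workers; `NUM_WORKERS` is a hypothetical name, and the numbers are placeholders to adjust for your machine.

```shell
# Example sizing only; adjust to your hardware.
NUM_WORKERS=2                 # how many workers you start (illustrative name)
export CORES_PER_WORKER=1     # used by start-slave.sh -c and spark.task.cpus
export TOTAL_CORES=$((CORES_PER_WORKER * NUM_WORKERS))   # used by spark.cores.max
echo "TOTAL_CORES=${TOTAL_CORES}"
```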
- Train a DNN using CaffeOnSpark with 2 Spark executors over an Ethernet connection. If you have an InfiniBand interface, use "-connection infiniband" instead.
Before launching CaffeOnSpark, check that your own hostname and the hostnames of any other hosts you connect to are resolvable.
You may need to add your own or your peers' host names to /etc/hosts.
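One quick way to see whether a name is already covered by /etc/hosts is sketched below. `has_hosts_entry` is a hypothetical helper, and it inspects only the hosts file: a name can still resolve through DNS without an entry there.

```shell
# Succeed if hosts file $2 (default /etc/hosts) has a non-comment entry naming host $1.
has_hosts_entry() {
  name="$1"
  hosts_file="${2:-/etc/hosts}"
  awk -v h="${name}" '
    { sub(/#.*/, "") }                                    # strip comments
    { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }  # field 1 is the address
    END { exit found ? 0 : 1 }
  ' "${hosts_file}"
}
```

For example, `has_hosts_entry "$(hostname)" || echo "consider adding $(hostname) to /etc/hosts"`.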
pushd ${CAFFE_ON_SPARK}/data
rm -rf ${CAFFE_ON_SPARK}/mnist_lenet.model
rm -rf ${CAFFE_ON_SPARK}/lenet_features_result
spark-submit --master ${MASTER_URL} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
--conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-devices 1 \
-connection ethernet \
-model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
-output file:${CAFFE_ON_SPARK}/lenet_features_result
ls -l ${CAFFE_ON_SPARK}/mnist_lenet.model
cat ${CAFFE_ON_SPARK}/lenet_features_result/*
Check the Spark Worker Web UI to follow the progress of training. You should see standard Caffe logs like the following:
I0215 04:45:41.444522 26306 solver.cpp:237] Iteration 0, loss = 2.45106
I0215 04:45:41.444772 26306 solver.cpp:253] Train net output #0: loss = 2.45106 (* 1 = 2.45106 loss)
I0215 04:45:41.444911 26306 sgd_solver.cpp:106] Iteration 0, lr = 0.01
I0215 04:46:10.320430 26306 solver.cpp:237] Iteration 100, loss = 0.337411
I0215 04:46:10.320597 26306 solver.cpp:253] Train net output #0: loss = 0.337411 (* 1 = 0.337411 loss)
I0215 04:46:10.320667 26306 sgd_solver.cpp:106] Iteration 100, lr = 0.00992565
I0215 04:46:37.602695 26306 solver.cpp:237] Iteration 200, loss = 0.2749
I0215 04:46:37.602886 26306 solver.cpp:253] Train net output #0: loss = 0.2749 (* 1 = 0.2749 loss)
I0215 04:46:37.602932 26306 sgd_solver.cpp:106] Iteration 200, lr = 0.00985258
I0215 04:46:59.177289 26306 solver.cpp:237] Iteration 300, loss = 0.165734
I0215 04:46:59.177484 26306 solver.cpp:253] Train net output #0: loss = 0.165734 (* 1 = 0.165734 loss)
I0215 04:46:59.177533 26306 sgd_solver.cpp:106] Iteration 300, lr = 0.00978075
I0215 04:47:27.075026 26306 solver.cpp:237] Iteration 400, loss = 0.26131
I0215 04:47:27.075108 26306 solver.cpp:253] Train net output #0: loss = 0.26131 (* 1 = 0.26131 loss)
I0215 04:47:27.075125 26306 sgd_solver.cpp:106] Iteration 400, lr = 0.00971013
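To follow the loss trend without reading every line, the "Iteration N, loss = X" entries can be filtered out of a saved copy of the log. A minimal sketch, assuming the log format shown above; `extract_losses` is a hypothetical helper, and `worker_stderr.log` in the usage note is a placeholder file name.

```shell
# Read Caffe log lines on stdin and print "iteration loss" pairs, one per line.
extract_losses() {
  sed -n -E 's/.*Iteration ([0-9]+), loss = ([0-9.]+).*/\1 \2/p'
}
```

Usage: `extract_losses < worker_stderr.log`. Lines that report the learning rate or the per-output loss are skipped because they do not match the pattern.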
The feature result file should look like:
- Access CaffeOnSpark from Python
Get started with python on CaffeOnSpark
- Shut down Spark cluster
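The commands for this step are not shown on this page. A minimal sketch, assuming the standard stop-slave.sh and stop-master.sh scripts that ship in Spark's sbin directory next to start-slave.sh; `stop_cluster` is a hypothetical wrapper that stops the workers first and then the master.

```shell
# Stop the local worker(s) first, then the master, using Spark's sbin scripts.
stop_cluster() {
  for script in stop-slave.sh stop-master.sh; do
    if [ -x "${SPARK_HOME}/sbin/${script}" ]; then
      "${SPARK_HOME}/sbin/${script}"
    else
      echo "not found or not executable: ${SPARK_HOME}/sbin/${script}" >&2
    fi
  done
}
```

Call `stop_cluster` from the shell in which SPARK_HOME is exported.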