# GetStarted_yarn_object_storage
This guide aims to enable training on datasets stored in online cloud object storage, such as OpenStack Swift, instead of on data stored locally.
For this, we need to configure both Hadoop and Spark with OpenStack credentials (auth URL, username, password, region, etc.) in the `core-site.xml` file.
Cloud object storage differs from a traditional file system. Once the dataset is uploaded to Swift, it can be accessed using the format `swift://<container-name>.PROVIDER/path` (for example, `swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb`).
For simplicity, this wiki is separated into sections. Section II installs Hadoop 2.7.1 and Spark 2.0.0 and adds the necessary JAR files to the Hadoop classpath. Section III configures Hadoop and Spark with OpenStack credentials. Finally, in Section IV we use the GetStarted_yarn guide to start a YARN cluster and train a model in CaffeOnSpark using data stored in Swift.
Please follow Steps 1 - 4 of GetStarted_yarn to build CaffeOnSpark.
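In brief, those steps amount to something like the following sketch (see GetStarted_yarn for the authoritative instructions; the `Makefile.config` details depend on your CUDA/BLAS setup):

```bash
# Sketch of GetStarted_yarn Steps 1-4; consult that guide for specifics
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark

# Configure Caffe for your machine (CUDA paths, CPU_ONLY, BLAS, etc.)
cp $CAFFE_ON_SPARK/caffe-public/Makefile.config.example $CAFFE_ON_SPARK/caffe-public/Makefile.config

# Build CaffeOnSpark
cd $CAFFE_ON_SPARK
make build
```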
- Update to Hadoop version 2.7.1 and Spark version 2.0.0, and update the Hadoop classpath to include the `hadoop-openstack-2.7.1.jar` file.
```bash
$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.7.1
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tools/lib/*

$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-2.0.0-bin-hadoop2.7
```
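As a quick sanity check (not part of the original steps), you can confirm that the OpenStack connector JAR is present and that the `tools/lib` wildcard is picked up on the classpath:

```bash
# The Swift connector ships with Hadoop's optional tools
ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-openstack-2.7.1.jar

# Print the effective Hadoop classpath and verify tools/lib is included
hadoop classpath | tr ':' '\n' | grep 'tools/lib'
```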
If you cannot ssh to localhost without a passphrase, execute the following commands:
```bash
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```
- Copy the new configuration template and update `$HADOOP_HOME/etc/hadoop/core-site.xml`.
```bash
sudo cp $CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/core-site.xml.template $HADOOP_HOME/etc/hadoop/
```
Edit the `core-site.xml.template` file. All properties starting with `fs.swift` (the auth URL, username, etc.) mentioned in `$HADOOP_HOME/etc/hadoop/core-site.xml.template` must be updated, and the `PROVIDER` name should be changed to any custom, preferred name. Please refer to the Spark Documentation and OpenStack Documentation for more information.
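For orientation, the Keystone-style Swift properties in Hadoop's `hadoop-openstack` module follow the pattern below. This is a hedged sketch with placeholder values (using `chameleoncloud` as the `PROVIDER` name), not the exact contents of the shipped template:

```xml
<!-- Swift filesystem implementation -->
<property>
  <name>fs.swift.impl</name>
  <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<!-- Credentials for the provider named "chameleoncloud"; replace all values -->
<property>
  <name>fs.swift.service.chameleoncloud.auth.url</name>
  <value>https://YOUR-KEYSTONE-HOST:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.username</name>
  <value>YOUR_USERNAME</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.password</name>
  <value>YOUR_PASSWORD</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.tenant</name>
  <value>YOUR_TENANT</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.region</name>
  <value>YOUR_REGION</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.public</name>
  <value>true</value>
</property>
```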
Rename `core-site.xml.template` to `core-site.xml`.
```bash
sudo mv $HADOOP_HOME/etc/hadoop/core-site.xml.template $HADOOP_HOME/etc/hadoop/core-site.xml
```
- Copy the `core-site.xml` file from Hadoop to Spark's config folder `$SPARK_HOME/conf/`.
```bash
sudo cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
```
After making the necessary changes for GPU or CPU training in the `data/lenet_memory_solver.prototxt` and `data/cifar10_quick_solver.prototxt` files, follow GetStarted_yarn Step 8 to initiate the training.
Make sure to change the source location to Swift, in the format `swift://<container-name>.PROVIDER/path`, in the `data/lenet_memory_train_test.prototxt` and `data/cifar10_quick_train_test.prototxt` files, as sketched below.
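In standard Caffe, GPU vs. CPU training is selected by the `solver_mode` line (`solver_mode: GPU` or `solver_mode: CPU`) in the solver prototxt. For the source change, the data layer ends up referencing the Swift URL; the sketch below shows only the relevant line, with the surrounding fields depending on the actual contents of these files:

```
layer {
  name: "data"
  type: "MemoryData"
  # ... other fields left as they are in the file ...
  memory_data_param {
    source: "swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb"
    # ... batch_size, channels, etc. left unchanged ...
  }
}
```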
Please note that the current implementation uses `lmdb` datasets. Since `lmdb` is not a distributed dataset format, this approach should be limited to small and medium-sized datasets. Please look into Spark DataFrames for large datasets; GetStarted_EC2 covers how to convert `lmdb` to a DataFrame.
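As a rough illustration only (the conversion itself is covered in GetStarted_EC2; the path and storage format here are hypothetical), a previously converted dataset stored in Swift could be read back in `spark-shell` along these lines:

```scala
// Hypothetical example: read a converted dataset from Swift as a DataFrame.
// The format (parquet here) must match whatever the conversion step wrote.
val df = spark.read.parquet("swift://MNISTlmdb.chameleoncloud/mnist_train_dataframe")
df.count()
```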
## Appendix
Assuming that there is an object named `imagenet_label.txt` in a container called `testcontainer` and the `PROVIDER` name is set to `chameleoncloud`, do the following to check that Hadoop and Spark are working.
```bash
hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt
```
Output should look like:
```
-rw-rw-rw-   1     741401 2016-10-08 22:18 swift://testcontainer.chameleoncloud/imagenet_label.txt
```
Next, for Spark, in `spark-shell`:
```scala
scala> val data = sc.textFile("swift://testcontainer.chameleoncloud/imagenet_label.txt")
data: org.apache.spark.rdd.RDD[String] = swift://testcontainer.chameleoncloud/imagenet_label.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> data.count()
res1: Long = 21842
```
If you run into errors, use `HADOOP_ROOT_LOGGER=DEBUG,console` to get verbose output from Hadoop commands. For example:
```bash
HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt
```