# GetStarted_yarn_object_storage
This guide aims to enable training on datasets stored in online cloud object storage, such as OpenStack Swift, instead of on data stored locally.
For this, we need to configure both Hadoop and Spark with OpenStack credentials (auth URL, username, password, region, etc.) in the `core-site.xml` file.
Cloud object storage differs from a traditional file system. Once the dataset is uploaded to Swift, it can be accessed using the format `swift://<container-name>.PROVIDER/path` (for example, `swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb`).
For simplicity, this wiki is separated into sections. Section II installs Hadoop 2.7.1 and Spark 2.0.0 and adds the necessary JAR files to the Hadoop classpath. Section III configures Hadoop and Spark with OpenStack credentials. Finally, in Section IV we use the GetStarted_yarn guide to start a YARN cluster and train a model in CaffeOnSpark using data stored in Swift.
Please follow Steps 1 - 4 of GetStarted_yarn to build CaffeOnSpark.
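In brief, those steps amount to something like the following sketch (see GetStarted_yarn for the authoritative instructions; the `Makefile.config` details depend on your CUDA/BLAS setup):

```bash
# Sketch of GetStarted_yarn Steps 1-4; consult that guide for specifics
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark

# Configure Caffe for your machine (CUDA paths, CPU_ONLY, BLAS, etc.)
cp $CAFFE_ON_SPARK/caffe-public/Makefile.config.example $CAFFE_ON_SPARK/caffe-public/Makefile.config

# Build CaffeOnSpark
cd $CAFFE_ON_SPARK
make build
```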
- Update to Hadoop version 2.7.1 and Spark version 2.0.0, and update the Hadoop classpath to include the `hadoop-openstack-2.7.1.jar` file.
```bash
$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.7.1
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tools/lib/*

$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-2.0.0-bin-hadoop2.7
```
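As a quick sanity check (not part of the original steps), you can confirm that the OpenStack connector JAR is present and that the `tools/lib` wildcard is picked up on the classpath:

```bash
# The Swift connector ships with Hadoop's optional tools
ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-openstack-2.7.1.jar

# Print the effective Hadoop classpath and verify tools/lib is included
hadoop classpath | tr ':' '\n' | grep 'tools/lib'
```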
If you cannot ssh to localhost without a passphrase, execute the following commands:
```bash
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```
- Copy the new configuration template and update `$HADOOP_HOME/etc/hadoop/core-site.xml`.
```bash
sudo cp $CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/core-site.xml.template $HADOOP_HOME/etc/hadoop/
```
Edit the `core-site.xml.template` file. All properties starting with `fs.swift` (the auth URL, username, etc.) mentioned in `$HADOOP_HOME/etc/hadoop/core-site.xml.template` must be updated, and the `PROVIDER` name should be changed to any custom, preferred name. Please refer to the Spark Documentation and OpenStack Documentation for more information.
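For orientation, the Keystone-style Swift properties in Hadoop's `hadoop-openstack` module follow the pattern below. This is a hedged sketch with placeholder values (using `chameleoncloud` as the `PROVIDER` name), not the exact contents of the shipped template:

```xml
<!-- Swift filesystem implementation -->
<property>
  <name>fs.swift.impl</name>
  <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<!-- Credentials for the provider named "chameleoncloud"; replace all values -->
<property>
  <name>fs.swift.service.chameleoncloud.auth.url</name>
  <value>https://YOUR-KEYSTONE-HOST:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.username</name>
  <value>YOUR_USERNAME</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.password</name>
  <value>YOUR_PASSWORD</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.tenant</name>
  <value>YOUR_TENANT</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.region</name>
  <value>YOUR_REGION</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.public</name>
  <value>true</value>
</property>
```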
Rename `core-site.xml.template` to `core-site.xml`.
```bash
sudo mv $HADOOP_HOME/etc/hadoop/core-site.xml.template $HADOOP_HOME/etc/hadoop/core-site.xml
```
- Copy the `core-site.xml` file from Hadoop to Spark's config folder `$SPARK_HOME/conf/`.
```bash
sudo cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
```
After making the necessary changes for GPU or CPU training in the `data/lenet_memory_solver.prototxt` and `data/cifar10_quick_solver.prototxt` files, follow GetStarted_yarn Step 8 to initiate the training.
Make sure to change the source location to Swift, in the format `swift://<container-name>.PROVIDER/path`, in the `data/lenet_memory_train_test.prototxt` and `data/cifar10_quick_train_test.prototxt` files, as sketched below.
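In standard Caffe, GPU vs. CPU training is selected by the `solver_mode` line (`solver_mode: GPU` or `solver_mode: CPU`) in the solver prototxt. For the source change, the data layer ends up referencing the Swift URL; the sketch below shows only the relevant line, with the surrounding fields depending on the actual contents of these files:

```
layer {
  name: "data"
  type: "MemoryData"
  # ... other fields left as they are in the file ...
  memory_data_param {
    source: "swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb"
    # ... batch_size, channels, etc. left unchanged ...
  }
}
```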
Please note that the current implementation uses `lmdb` datasets. Since `lmdb` is not a distributed dataset format, this approach should be limited to small and medium-sized datasets. Please look into Spark DataFrames for large datasets; GetStarted_EC2 covers how to convert `lmdb` to a DataFrame.
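As a rough illustration only (the conversion itself is covered in GetStarted_EC2; the path and storage format here are hypothetical), a previously converted dataset stored in Swift could be read back in `spark-shell` along these lines:

```scala
// Hypothetical example: read a converted dataset from Swift as a DataFrame.
// The format (parquet here) must match whatever the conversion step wrote.
val df = spark.read.parquet("swift://MNISTlmdb.chameleoncloud/mnist_train_dataframe")
df.count()
```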
## Appendix
Assuming that there is an object named `imagenet_label.txt` in a container called `testcontainer` and the `PROVIDER` name is set to `chameleoncloud`, do the following to check that Hadoop and Spark are working.
```bash
hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt
```
Output should look like:
```
-rw-rw-rw-   1     741401 2016-10-08 22:18 swift://testcontainer.chameleoncloud/imagenet_label.txt
```
Next, for Spark, in `spark-shell`:
```scala
scala> val data = sc.textFile("swift://testcontainer.chameleoncloud/imagenet_label.txt")
data: org.apache.spark.rdd.RDD[String] = swift://testcontainer.chameleoncloud/imagenet_label.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> data.count()
res1: Long = 21842
```
If you run into errors, use `HADOOP_ROOT_LOGGER=DEBUG,console` to get verbose output from Hadoop commands. For example:
```bash
HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt
```