Project to design a Raspberry Pi 4 Cluster using Spark for Distributed Machine Learning.
- (4) Raspberry Pi 4, 4GB Version
- (4) 32GB MicroSD Card
- (4) USB-C Power Supply
- (4) 1ft Ethernet cable
- (1) Raspberry Pi Cluster Case
- (1) Gigabit Ethernet Switch
- (1) Keyboard+Mouse combination
- (1) HDMI to Micro-HDMI Cable
- (1) HDMI Monitor
- Follow the Raspberry Pi Foundation's official guide to install the Raspbian OS.
- After setting up one Raspberry Pi fully, clone its SD card to each of the others (after formatting each new MicroSD card); one way to do this is sketched after this list.
- Physically assemble the cluster.
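- A minimal sketch of one way to clone the card, run from a separate Linux machine with the card in a USB reader (/dev/sdX is a placeholder; confirm the real device name with lsblk before writing anything):
lsblk                                                              # identify the SD card's device name
sudo dd if=/dev/sdX of=raspbian-master.img bs=4M status=progress   # image the fully configured card
sudo dd if=raspbian-master.img of=/dev/sdX bs=4M status=progress   # write the image to each new card
sync                                                               # flush writes before removing the card
- The desktop's built-in SD Card Copier tool is an equivalent GUI option.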
pi@raspberrypi:~$ sudo mousepad /etc/dhcpcd.conf
- Uncomment and edit the following lines:
interface eth0
static ip_address=192.168.0.10X/24
- Where X is the respective Raspberry Pi # (e.g. 1, 2, 3, 4)
- From the Raspberry Pi menu (top left corner of the desktop): Preferences -> Raspberry Pi Configuration -> Interfaces -> Enable SSH
pi@raspberrypi:~$ sudo mousepad /etc/hosts
- Change raspberrypi to piX
- Add all IPs and hostnames to bottom of file
192.168.0.101 pi1
192.168.0.102 pi2
192.168.0.103 pi3
192.168.0.104 pi4
pi@raspberrypi:~$ sudo mousepad /etc/hostname
- Change raspberrypi to piX
pi@raspberrypi:~$ sudo reboot
- Upon reopening the command prompt you should see the updated hostname.
pi@piX:~$
- The ssh config file is generated after the first time ssh is run, so just ssh into the current pi node.
pi@pi1:~$ ssh pi1
- Then exit the shell.
pi@pi1:~$ exit
- Now the file has been generated and can be modified.
pi@pi1:~$ sudo mousepad ~/.ssh/config
- Add hostname, user, and IP address for each node in the network (repeated 4x in my case).
Host piX
User pi
Hostname 192.168.0.10X
- On each Pi, generate an SSH key pair (accepting the defaults when prompted):
pi@piX:~$ ssh-keygen -t ed25519
- Once every Pi has generated its key pair, append each Pi's public key to pi1's authorized_keys file.
pi@piX:~$ cat ~/.ssh/id_ed25519.pub | ssh pi@192.168.0.101 'cat >> .ssh/authorized_keys'
pi@pi1:~$ cat ~/.ssh/id_ed25519.pub >> .ssh/authorized_keys
- Then copy the completed authorized_keys file and the ssh config from pi1 back out to each of the other Pis (repeat for X = 2, 3, 4):
pi@pi1:~$ scp ~/.ssh/authorized_keys piX:~/.ssh/authorized_keys
pi@pi1:~$ scp ~/.ssh/config piX:~/.ssh/config
- Now all devices in the cluster will be able to ssh into each other without requiring a password.
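- As an optional sanity check, loop over the hostnames and confirm that each node answers without a password prompt (each line of output should simply be the remote hostname):
pi@pi1:~$ for pi in pi1 pi2 pi3 pi4; do ssh $pi hostname; done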
- To make managing the cluster easier, we will create some new shell functions within the .bashrc file.
pi@pi1:~$ sudo mousepad ~/.bashrc
- Add to the end of the file:
function otherpis {
grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}
- This function will find and print the hostnames of all other nodes in the cluster (be sure to source .bashrc before running).
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ otherpis
pi2
pi3
pi4
- The next function runs the specified command on all other nodes in the cluster, and then on the current node. In ~/.bashrc:
function clustercmd {
for pi in $(otherpis); do ssh $pi "$@"; done
"$@"
}
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ clustercmd date
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST
- The following functions reboot and shut down all nodes in the cluster. In ~/.bashrc:
function clusterreboot {
clustercmd sudo shutdown -r now
}
function clustershutdown {
clustercmd sudo shutdown now
}
- This function copies a file from the current device to every other node in the cluster. In ~/.bashrc:
function clusterscp {
for pi in $(otherpis); do
cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
done
}
- First source ~/.bashrc if you haven't already, then copy it to the other nodes.
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ clusterscp ~/.bashrc
- The latest Raspbian (Buster) comes with Java 11 pre-installed. However, the latest Hadoop version (3.2.1) that we will be using only supports Java 8. To resolve this, we will install OpenJDK 8 and make it the default Java on each device.
pi@piX:~$ sudo apt-get install openjdk-8-jdk
pi@piX:~$ sudo update-alternatives --config java    # select the number corresponding to Java 8
pi@piX:~$ sudo update-alternatives --config javac   # select the number corresponding to Java 8
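- As a quick check, java -version should now report a 1.8.0 runtime on every node:
pi@piX:~$ java -version   # expect: openjdk version "1.8.0_..."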
pi@pi1:~$ cd && wget https://www-us.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
pi@pi1:~$ sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/
pi@pi1:~$ rm hadoop-3.2.1.tar.gz && cd /opt
pi@pi1:/opt$ sudo mv hadoop-3.2.1 hadoop
pi@pi1:/opt$ sudo chown pi:pi -R /opt/hadoop
pi@pi1:~$ sudo mousepad ~/.bashrc
- Add (insert at top of file):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hadoop-env.sh
- Uncomment/add the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ cd && hadoop version | grep Hadoop
Hadoop 3.2.1
- All of the following files are located within /opt/hadoop/etc/hadoop.
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/core-site.xml
- Modify end of file to be:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi1:9000</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
pi@pi1:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
pi@pi1:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
pi@pi1:~$ sudo chown pi:pi -R /opt/hadoop_tmp
pi@pi1:~$ hdfs namenode -format -force
pi@pi1:~$ start-dfs.sh && start-yarn.sh
- Verify the setup by using the jps command.
pi@pi1:~$ jps
- This command lists all of the Java processes running on the machine, of which there should be at least 6:
NameNode
DataNode
NodeManager
ResourceManager
SecondaryNameNode
Jps
- Create temporary directory to test the file system:
pi@pi1:~$ hadoop fs -mkdir /tmp
pi@pi1:~$ hadoop fs -ls /
- Stop the single node cluster using:
pi@pi1:~$ stop-dfs.sh && stop-yarn.sh
- Modify Hadoop environment configuration:
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hadoop-env.sh
- Change:
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
- To:
export HADOOP_OPTS="-XX:-PrintWarnings -Djava.net.preferIPv4Stack=true"
- Now in ~/.bashrc, add to the bottom:
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"
- Source ~/.bashrc:
pi@pi1:~$ source ~/.bashrc
- Copy .bashrc to the other nodes in the cluster:
pi@pi1:~$ clusterscp ~/.bashrc
pi@pi1:~$ clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
pi@pi1:~$ clustercmd sudo chown pi:pi -R /opt/hadoop_tmp
pi@pi1:~$ clustercmd sudo mkdir -p /opt/hadoop
pi@pi1:~$ clustercmd sudo chown pi:pi /opt/hadoop
pi@pi1:~$ for pi in $(otherpis); do rsync -avxP $HADOOP_HOME $pi:/opt; done
- Verify the install on the other nodes:
pi@pi1:~$ clustercmd hadoop version | grep Hadoop
Hadoop 3.2.1
Hadoop 3.2.1
Hadoop 3.2.1
Hadoop 3.2.1
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi1:9000</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
</configuration>
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>pi1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
pi@pi1:~$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/datanode/*
pi@pi1:~$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/namenode/*
pi@pi1:~$ cd $HADOOP_HOME/etc/hadoop
pi@pi1:/opt/hadoop/etc/hadoop$ mousepad master
- Add single line to file:
pi1
pi@pi1:/opt/hadoop/etc/hadoop$ mousepad workers
- Add other pi hostnames to the file:
pi2
pi3
pi4
pi@pi1:~$ sudo mousepad /etc/hosts
- Remove the line (All nodes will have identical host configuration):
127.0.1.1 pi1
- Copy updated file to the other cluster nodes:
pi@pi1:~$ clusterscp /etc/hosts
- Now reboot the cluster:
pi@pi1:~$ clusterreboot
pi@pi1:~$ hdfs namenode -format -force
pi@pi1:~$ start-dfs.sh && start-yarn.sh
- Now that we have configured Hadoop as a multi-node cluster, running jps on the master node (pi1) shows only the following processes:
NameNode
SecondaryNameNode
ResourceManager
Jps
- The following have been offloaded to the datanodes, as you'll see if you ssh into one and run jps:
DataNode
NodeManager
Jps
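- As an optional check that all three workers registered with the namenode, grep the HDFS report for the live-datanode count:
pi@pi1:~$ hdfs dfsadmin -report | grep 'Live datanodes'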
- Update the clusterreboot and clustershutdown functions in ~/.bashrc so that HDFS and YARN are stopped gracefully before the nodes go down:
function clusterreboot {
stop-yarn.sh && stop-dfs.sh && \
clustercmd sudo shutdown -r now
}
function clustershutdown {
stop-yarn.sh && stop-dfs.sh && \
clustercmd sudo shutdown now
}
pi@pi1:~$ start-dfs.sh && start-yarn.sh
- To test the Hadoop cluster, we will deploy a sample wordcount job to count word frequencies in several books obtained from Project Gutenberg.
- First, make the HDFS directories for the data.
pi@pi1:~$ hdfs dfs -mkdir -p /user/pi
pi@pi1:~$ hdfs dfs -mkdir books
pi@pi1:~$ cd /opt/hadoop
pi@pi1:/opt/hadoop$ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
pi@pi1:/opt/hadoop$ wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
pi@pi1:/opt/hadoop$ wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
pi@pi1:/opt/hadoop$ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
pi@pi1:/opt/hadoop$ hdfs dfs -ls books
pi@pi1:/opt/hadoop$ hdfs dfs -cat books/alice.txt
- You can monitor the status of all jobs deployed to the cluster via the YARN web UI: http://pi1:8088
- And the status of the cluster in general via the HDFS web UI: http://pi1:9870
pi@pi1:/opt/hadoop$ yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "books/*" output
pi@pi1:/opt/hadoop$ hdfs dfs -ls output
pi@pi1:/opt/hadoop$ hdfs dfs -cat output/part-r-00000 | less
pi@pi1:~$ cd && wget https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
pi@pi1:~$ sudo tar -xvf spark-2.4.4-bin-hadoop2.7.tgz -C /opt/
pi@pi1:~$ rm spark-2.4.4-bin-hadoop2.7.tgz && cd /opt
pi@pi1:/opt$ sudo mv spark-2.4.4-bin-hadoop2.7 spark
pi@pi1:/opt$ sudo chown pi:pi -R /opt/spark
pi@pi1:~$ sudo mousepad ~/.bashrc
- Add (insert at top of file):
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.4
/_/
Using Scala version 2.11.12, OpenJDK Client VM, 1.8.0_212
Branch
Compiled by user on 2019-08-27T21:21:38Z
Revision
Url
Type --help for more information.
- Similar to Hadoop, Spark also offers the ability to monitor the jobs you deploy. However, with Spark we will have to manually configure the monitoring options.
- Generate then modify the Spark default configuration file:
pi@pi1:~$ cd $SPARK_HOME/conf
pi@pi1:/opt/spark/conf$ sudo mv spark-defaults.conf.template spark-defaults.conf
pi@pi1:/opt/spark/conf$ mousepad spark-defaults.conf
- Add the following lines:
spark.master yarn
spark.executor.instances 4
spark.driver.memory 1024m
spark.yarn.am.memory 1024m
spark.executor.memory 1024m
spark.executor.cores 4
spark.eventLog.enabled true
spark.eventLog.dir hdfs://pi1:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://pi1:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
pi@pi1:/opt/spark/conf$ cd
pi@pi1:~$ hdfs dfs -mkdir /spark-logs
pi@pi1:~$ $SPARK_HOME/sbin/start-history-server.sh
- The Spark history server UI can be accessed at: http://pi1:18080
pi@pi1:~$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 7
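- The binary distribution also bundles PySpark examples; assuming the examples directory was kept when /opt/spark was unpacked, the Python version of the same Pi estimation can be submitted as a quick sketch:
pi@pi1:~$ spark-submit --deploy-mode client $SPARK_HOME/examples/src/main/python/pi.py 7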
Part 8: Acquiring Sloan Digital Sky Survey (SDSS) Data.
- The data I will be using to train and test a machine learning classifier is from the Sloan Digital Sky Survey (Data Release 16), a major multi-spectral imaging and spectroscopic redshift survey of different celestial bodies (Stars, Galaxies, Quasars).
- An abundance of data and features is collected each time the telescope captures images. In addition to light in the visible spectrum, the telescope records the galactic coordinates of the body, five distinct wavelength bands emitted from the body, the object's redshift, and a variety of metadata about how and when the images and data were captured. All of the data is freely obtainable in a variety of ways via the SDSS website; I will use a SQL query to pull my data from their databases. Along with the spectral information stored in those databases, SDSS also offers helpful image visualizations of the captured objects, such as individual galaxies.
2. SQL Querying SDSS SkyServer DR16.
- Navigate to the SQL Search page.
- The maximum number of rows that can be exported to CSV using this tool is 500,000, so we will request that many.
SELECT TOP 500000
p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
s.class, s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
- For a more detailed background on the data, refer to the links above.
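- Once the query results are downloaded, the CSV can be pushed into HDFS the same way as the wordcount books so every node can read it (the file name below is a stand-in for whatever SkyServer names your export):
pi@pi1:~$ hdfs dfs -mkdir -p /user/pi/sdss
pi@pi1:~$ hdfs dfs -put Skyserver_SQL_results.csv /user/pi/sdss/
pi@pi1:~$ hdfs dfs -ls /user/pi/sdss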
- The following commands worked to set up Jupyter and PySpark together on pi1:
pi@pi1:~$ sudo pip3 install jupyter
pi@pi1:~$ pip3 install --upgrade ipython tornado jupyter-client jupyter-core
pi@pi1:~$ python3 -m ipykernel install --user
pi@pi1:~$ pip3 install pyspark findspark
pi@pi1:~$ jupyter-notebook
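- As an alternative to calling findspark inside the notebook, pyspark can be told to launch Jupyter as its driver, in which case new notebooks start with a YARN-backed SparkSession already defined as spark (a sketch, relying on the SPARK_HOME and PATH settings from earlier):
pi@pi1:~$ export PYSPARK_DRIVER_PYTHON=jupyter
pi@pi1:~$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pi@pi1:~$ pyspark   # spark.master is already set to yarn in spark-defaults.conf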