
Distributed Computing with the Raspberry Pi 4

Project to design a Raspberry Pi 4 Cluster using Spark for Distributed Machine Learning.

Part 1: Hardware and Setup.

1. Hardware for my implementation:

  • (4) Raspberry Pi 4, 4GB Version
  • (4) 32GB MicroSD Card
  • (4) USB-C Power Supply
  • (4) 1ft Ethernet cable
  • (1) Raspberry Pi Cluster Case
  • (1) Gigabit Ethernet Switch
  • (1) Keyboard+Mouse combination
  • (1) HDMI to Micro-HDMI Cable
  • (1) HDMI Monitor

2. Preliminary Setup

  • Follow the Raspberry Pi Foundation's Official Guide to install the Raspbian OS.
  • After fully setting up one Raspberry Pi, clone its SD card to each of the others (after formatting each new MicroSD); a cloning sketch follows below.
  • Physically assemble the cluster. My Setup
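  • A rough sketch of the cloning from a separate Linux machine (the Raspbian desktop's "SD Card Copier" accessory is another option); /dev/sdX is a placeholder for the device your card reader shows up as, so confirm it with lsblk before writing:
# read the fully configured card into an image file
sudo dd if=/dev/sdX of=pi-master.img bs=4M status=progress
# write that image onto each freshly formatted MicroSD card
sudo dd if=pi-master.img of=/dev/sdX bs=4M status=progress conv=fsync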

Part 2: Passwordless SSH.

1. (Starting with Pi #1) Set up the wireless internet connection.
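  • The desktop Wi-Fi applet is the simplest route; a command-line sketch (replace the SSID and passphrase with your own network's) is:
pi@raspberrypi:~$ wpa_passphrase "YourSSID" "YourPassphrase" | sudo tee -a /etc/wpa_supplicant/wpa_supplicant.conf
pi@raspberrypi:~$ sudo wpa_cli -i wlan0 reconfigure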

2. Assign a static IP address to the Ethernet interface.

pi@raspberrypi:~$ sudo mousepad /etc/dhcpcd.conf
  • Uncomment and edit the example static IP block so it reads:
interface eth0
static ip_address=192.168.0.10X/24
  • Where X is the respective Raspberry Pi # (e.g. 1, 2, 3, 4)

3. Enable SSH.

  • From the Raspberry Pi dropdown menu (top left corner of the desktop): Preferences -> Raspberry Pi Configuration -> Interfaces -> Enable SSH

4. Modify /etc/hosts to include hostnames and IPs of each Raspberry Pi node.

pi@raspberrypi:~$ sudo mousepad /etc/hosts
  • Change raspberrypi to piX
  • Add the IPs and hostnames of all nodes to the bottom of the file:
192.168.0.101  pi1
192.168.0.102  pi2
192.168.0.103  pi3
192.168.0.104  pi4

5. Change the Pi's hostname to be its respective Pi #.

pi@raspberrypi:~$ sudo mousepad /etc/hostname
  • Change raspberrypi to piX

6. Reboot node.

pi@raspberrypi:~$ reboot
  • Upon reopening the command prompt you should see the updated hostname.
pi@piX:~$ 

7. (Only on Pi #1) Generate ssh config, then modify.

  • The ~/.ssh directory is created the first time ssh is run, so just ssh into the current node.
pi@pi1:~$ ssh pi1
  • Then exit the shell.
pi@pi1:~$ exit
  • Now the directory exists and the config file can be created:
pi@pi1:~$ sudo mousepad ~/.ssh/config
  • Add the hostname, user, and IP address of each node in the network (repeated 4x in my case); the fully expanded file is shown after the template below.
Host  piX
User  pi
Hostname 192.168.0.10X
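  • For example, with the 192.168.0.10X addresses assigned in step 2, the completed file for a four-node cluster looks like this (adjust the IPs to your own network):
Host pi1
  User pi
  Hostname 192.168.0.101

Host pi2
  User pi
  Hostname 192.168.0.102

Host pi3
  User pi
  Hostname 192.168.0.103

Host pi4
  User pi
  Hostname 192.168.0.104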

8. Create authentication key pairs for ssh.

pi@piX:~$ ssh-keygen -t ed25519

9. Repeat steps 1-8 on the rest of the Pis.

  • Once each Pi has generated its key pair, append its public key to pi1's authorized_keys file.
pi@piX:~$ cat ~/.ssh/id_ed25519.pub | ssh pi@192.168.0.101 'cat >> .ssh/authorized_keys'

10. (Back on Pi1) Append this Pi's public key to the file as well.

pi@pi1:~$ cat ~/.ssh/id_ed25519.pub >> .ssh/authorized_keys

11. Copy authorized_keys file and ssh configuration to all other nodes (2-4) in the cluster.

pi@pi1:~$ scp ~/.ssh/authorized_keys piX:~/.ssh/authorized_keys
pi@pi1:~$ scp ~/.ssh/config piX:~/.ssh/config
  • Now all devices in the cluster will be able to ssh into each other without requiring a password.
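  • A quick check (each hostname should print with no password prompt):
pi@pi1:~$ for host in pi2 pi3 pi4; do ssh $host hostname; done
pi2
pi3
pi4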

12. Create additional functions to improve the ease of use of the cluster.

  • To do this, we will create some new shell functions within the .bashrc file.
pi@pi1:~$ sudo mousepad ~/.bashrc

otherpis

  • Add to the end of the file:
function otherpis {
  grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}
  • This function will find and print the hostnames of all other nodes in the cluster (be sure to source .bashrc before running).
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ otherpis
pi2
pi3
pi4

clustercmd

  • This function will run the specified command on all other nodes in the cluster, and then on this node itself.
  • In ~/.bashrc:
function clustercmd {
  for pi in $(otherpis); do ssh "$pi" "$@"; done
  "$@"
}
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ clustercmd date
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST
Sun 26 Jan 2020 12:56:58 PM EST

clusterreboot and clustershutdown

  • Reboot and shutdown all nodes in the cluster.
function clusterreboot {
  clustercmd sudo shutdown -r now
}

function clustershutdown {
  clustercmd sudo shutdown now
}

clusterscp

  • Copies a file from this device to the same path on every other node in the cluster.
function clusterscp {
  for pi in $(otherpis); do
    cat "$1" | ssh "$pi" "sudo tee $1" > /dev/null 2>&1
  done
}

13. Copy .bashrc file to all other nodes in the network.

  • First source it if you haven't already, then copy.
pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ clusterscp ~/.bashrc

Part 3: Installing Hadoop.

1. Install Java 8 on each node and make it each node's default Java.

  • The latest Raspbian (Buster) comes with Java 11 pre-installed, but the Hadoop version we will be using (3.2.1) only supports Java 8. To resolve this, we will install OpenJDK 8 and make it the default Java on each device.
pi@piX:~$ sudo apt-get install openjdk-8-jdk
pi@piX:~$ sudo update-alternatives --config java    # select the number corresponding to Java 8
pi@piX:~$ sudo update-alternatives --config javac   # select the number corresponding to Java 8
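  • To confirm the switch took effect on a node (the exact build number will vary):
pi@piX:~$ java -version 2>&1 | head -n 1    # expect: openjdk version "1.8.0_..."
pi@piX:~$ javac -version                    # expect: javac 1.8.0_...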

2. Download Hadoop, unpack, and give pi ownership.

pi@pi1:~$ cd && wget https://www-us.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
pi@pi1:~$ sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/
pi@pi1:~$ rm hadoop-3.2.1.tar.gz && cd /opt
pi@pi1:/opt$ sudo mv hadoop-3.2.1 hadoop
pi@pi1:/opt$ sudo chown pi:pi -R /opt/hadoop

3. Configure Hadoop Environment variables.

pi@pi1:~$ sudo mousepad ~/.bashrc
  • Add (insert at top of file):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

4. Initialize JAVA_HOME for Hadoop environment.

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/

5. Validate Hadoop install.

pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ cd && hadoop version | grep Hadoop
Hadoop 3.2.1

Part 4: Setting up the Hadoop Cluster

1. Set up the Hadoop Distributed File System (HDFS) configuration files (single-node setup to start).

  • All of the following files are located within /opt/hadoop/etc/hadoop.

core-site.xml

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/core-site.xml
  • Modify the end of the file to be:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://pi1:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration> 

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration> 

2. Create Datanode and Namenode directories.

pi@pi1:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
pi@pi1:~$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
pi@pi1:~$ sudo chown pi:pi -R /opt/hadoop_tmp

3. Format HDFS.

pi@pi1:~$ hdfs namenode -format -force

4. Boot HDFS and verify functionality.

pi@pi1:~$ start-dfs.sh && start-yarn.sh
  • Verify the setup by using the jps command.
pi@pi1:~$ jps
  • This command lists all of the Java processes running on the machine, of which there should be at least 6:
  1. NameNode
  2. DataNode
  3. NodeManager
  4. ResourceManager
  5. SecondaryNameNode
  6. jps
  • Create temporary directory to test the file system:
pi@pi1:~$ hadoop fs -mkdir /tmp
pi@pi1:~$ hadoop fs -ls /
  • Stop the single node cluster using:
pi@pi1:~$ stop-dfs.sh && stop-yarn.sh

5. Silence Warnings (as a result of 32-bit Hadoop build w/ 64-bit OS)

  • Modify Hadoop environment configuration:
pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hadoop-env.sh
  • Change:
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
  • To:
export HADOOP_OPTS="-XX:-PrintWarnings -Djava.net.preferIPv4Stack=true"
  • Now in the ~/.bashrc, add to the bottom:
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA" 
  • Source ~/.bashrc:
pi@pi1:~$ source ~/.bashrc
  • Copy .bashrc to other nodes in the cluster:
pi@pi1:~$ clusterscp ~/.bashrc

6. Create Hadoop Cluster directories (Multi-node Setup).

pi@pi1:~$ clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
pi@pi1:~$ clustercmd sudo chown pi:pi -R /opt/hadoop_tmp
pi@pi1:~$ clustercmd sudo mkdir -p /opt/hadoop
pi@pi1:~$ clustercmd sudo chown pi:pi /opt/hadoop

7. Copy Hadoop files to the other nodes.

pi@pi1:~$ for pi in $(otherpis); do rsync -avxP $HADOOP_HOME $pi:/opt; done

  • Verify the install on the other nodes:

pi@pi1:~$ clustercmd hadoop version | grep Hadoop
Hadoop 3.2.1
Hadoop 3.2.1
Hadoop 3.2.1
Hadoop 3.2.1

8. Modify Hadoop configuration files for cluster setup.

core-site.xml

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://pi1:9000</value>
  </property>
</configuration>

hdfs-site.xml

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_tmp/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
</configuration> 

mapred-site.xml

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
      <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>512</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
      <value>256</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
      <value>256</value>
  </property>
</configuration>

yarn-site.xml

pi@pi1:~$ sudo mousepad /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
      <value>pi1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
      <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>128</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
  </property>
</configuration> 

9. Clean datanode and namenode directories.

pi@pi1:~$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/datanode/*
pi@pi1:~$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/namenode/*

10. Create/edit the master and workers files.

pi@pi1:~$ cd $HADOOP_HOME/etc/hadoop
pi@pi1:/opt/hadoop/etc/hadoop$ mousepad master
  • Add single line to file:
pi1
pi@pi1:/opt/hadoop/etc/hadoop$ mousepad workers
  • Add other pi hostnames to the file:
pi2
pi3
pi4

11. Edit hosts file.

pi@pi1:~$ sudo mousepad /etc/hosts
  • Remove the following line (so that all nodes share an identical hosts file):
127.0.1.1 pi1
  • Copy updated file to the other cluster nodes:
pi@pi1:~$ clusterscp /etc/hosts
  • Now reboot the cluster:
pi@pi1:~$ clusterreboot

12. Format and start multi-node cluster.

pi@pi1:~$ hdfs namenode -format -force
pi@pi1:~$ start-dfs.sh && start-yarn.sh
  • Now that Hadoop is configured as a multi-node cluster, running jps on the master node (pi1) will show only the following processes:
  1. NameNode
  2. SecondaryNameNode
  3. ResourceManager
  4. Jps
  • The DataNode and NodeManager processes have been offloaded to the worker nodes, as you'll see if you ssh into one and run jps (a quick check is shown after this list):
  1. DataNode
  2. NodeManager
  3. Jps
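  • For example, from the master:
pi@pi1:~$ ssh pi2 jps    # should list DataNode, NodeManager, and Jps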

13. Modify clusterreboot and clustershutdown to shut down the Hadoop cluster gracefully.

clusterreboot

function clusterreboot {
  stop-yarn.sh && stop-dfs.sh && \
  clustercmd sudo shutdown -r now
}

clustershutdown

function clustershutdown {
  stop-yarn.sh && stop-dfs.sh && \
  clustercmd sudo shutdown now
}

Part 5: Testing the Hadoop Cluster (Wordcount Example)

1. Start cluster, if not active already.

pi@pi1:~$ start-dfs.sh && start-yarn.sh

2. Make data directories.

  • To test the Hadoop cluster, we will deploy a sample wordcount job to count word frequencies in several books obtained from Project Gutenberg.
  • First, make the HDFS directories for the data.
pi@pi1:~$ hdfs dfs -mkdir -p /user/pi
pi@pi1:~$ hdfs dfs -mkdir books

3. Download books files.

pi@pi1:~$ cd /opt/hadoop
pi@pi1:/opt/hadoop$ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
pi@pi1:/opt/hadoop$ wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
pi@pi1:/opt/hadoop$ wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt

4. Upload book files to the HDFS.

pi@pi1:/opt/hadoop$ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
pi@pi1:/opt/hadoop$ hdfs dfs -ls books

5. Read one of the books from the HDFS.

pi@pi1:/opt/hadoop$ hdfs dfs -cat books/alice.txt

6. Monitor status of cluster and jobs.

  • You can monitor the status of all jobs deployed to the cluster via the YARN web UI: http://pi1:8088 Yarn Web UI

  • And the status of the cluster in general via the HDFS web UI: http://pi1:9870 HDFS Web UI

7. Deploy sample MapReduce job to cluster.

pi@pi1:/opt/hadoop$ yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "books/*" output

8. View output of job.

pi@pi1:/opt/hadoop$ hdfs dfs -ls output
pi@pi1:/opt/hadoop$ hdfs dfs -cat output/part-r-00000 | less
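  • To surface the most frequent words instead of paging through the whole file, the tab-separated output can be piped through sort (a small convenience beyond the original steps):
pi@pi1:/opt/hadoop$ hdfs dfs -cat output/part-r-00000 | sort -t$'\t' -k2 -rn | head -n 10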

Hadoop Test

Part 6: Install Spark on the Cluster.

1. Download Spark, unpack, and give pi ownership.

pi@pi1:~$ cd && wget https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
pi@pi1:~$ sudo tar -xvf spark-2.4.4-bin-hadoop2.7.tgz -C /opt/
pi@pi1:~$ rm spark-2.4.4-bin-hadoop2.7.tgz && cd /opt
pi@pi1:/opt$ sudo mv spark-2.4.4-bin-hadoop2.7 spark
pi@pi1:/opt$ sudo chown pi:pi -R /opt/spark

2. Configure Spark Environment variables.

pi@pi1:~$ sudo mousepad ~/.bashrc
  • Add (insert at top of file):
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

3. Validate Spark install.

pi@pi1:~$ source ~/.bashrc
pi@pi1:~$ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
                       
Using Scala version 2.11.12, OpenJDK Client VM, 1.8.0_212
Branch
Compiled by user  on 2019-08-27T21:21:38Z
Revision
Url
Type --help for more information.

Part 7: Test Spark on the Cluster (Approximating Pi).

1. Configure Spark job monitoring.

  • Similar to Hadoop, Spark also offers the ability to monitor the jobs you deploy. However, with Spark we will have to manually configure the monitoring options.
  • Generate then modify the Spark default configuration file:
pi@pi1:~$ cd $SPARK_HOME/conf
pi@pi1:/opt/spark/conf$ sudo mv spark-defaults.conf.template spark-defaults.conf
pi@pi1:/opt/spark/conf$ mousepad spark-defaults.conf
  • Add the following lines:
spark.master                       yarn
spark.executor.instances           4
spark.driver.memory                1024m
spark.yarn.am.memory               1024m
spark.executor.memory              1024m
spark.executor.cores               4

spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://pi1:9000/spark-logs
spark.history.provider             org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory      hdfs://pi1:9000/spark-logs
spark.history.fs.update.interval   10s
spark.history.ui.port              18080

2. Make logging directory on HDFS.

pi@pi1:/opt/spark/conf$ cd
pi@pi1:~$ hdfs dfs -mkdir /spark-logs

3. Start Spark history server.

pi@pi1:~$ $SPARK_HOME/sbin/start-history-server.sh

4. Run sample job (calculating pi).

pi@pi1:~$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 7
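  • The resulting estimate is easy to miss amid the log output; one way to surface it:
pi@pi1:~$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 7 2>&1 | grep "Pi is roughly"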

Spark Test

Part 8: Acquiring Sloan Digital Sky Survey (SDSS) Data.

1. The Data

  • The data I will be using to train and test a machine learning classifier comes from the Sloan Digital Sky Survey (Data Release 16), a major multi-spectral imaging and spectroscopic redshift survey of celestial objects (stars, galaxies, and quasars).

SDSS Telescope

  • An abundance of data and features is collected each time the telescope captures images. In addition to light in the visible spectrum, the telescope records the galactic coordinates of each object, five distinct wavelength bands emitted by the object, the object's redshift, and a variety of metadata describing how and when the images and data were captured. All of the data is freely available from the SDSS website in several ways; I will use a SQL query to pull my data directly from their databases. Along with the spectral information stored in the databases, the SDSS also offers helpful image visualizations of the captured objects. Below is one such visualization of a galaxy.

Galaxy Visualized Example

2. SQL Querying SDSS SkyServer DR16.

  • Navigate to the SQL Search page.
  • The maximum number of rows that can be exported to CSV using this tool is 500,000, so we will request exactly that (a sketch for loading the exported CSV into HDFS follows below).
SELECT TOP 500000
   p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z, 
   s.class, s.z as redshift
FROM PhotoObj AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
  • For a more detailed background on the data, refer to the above links, and also here.
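  • Once the query result has been exported and copied to pi1, it needs to be loaded into HDFS so the whole cluster can read it. A minimal sketch (Skyserver_SQL.csv is just a placeholder for whatever filename you saved the export under):
pi@pi1:~$ hdfs dfs -mkdir -p /user/pi/sdss
pi@pi1:~$ hdfs dfs -put Skyserver_SQL.csv /user/pi/sdss/
pi@pi1:~$ hdfs dfs -ls /user/pi/sdss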

Part 9: Installing Python Packages and Jupyter Notebook on the Master Node.

  • The following commands worked to set up Jupyter and PySpark together on my pi1 (an alternative way to launch Jupyter as the PySpark driver is sketched after these commands):
pi@pi1:~$ sudo pip3 install jupyter
pi@pi1:~$ pip3 install --upgrade ipython tornado jupyter-client jupyter-core
pi@pi1:~$ python3 -m ipykernel install --user
pi@pi1:~$ pip3 install pyspark findspark
pi@pi1:~$ jupyter-notebook
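  • findspark lets a plain notebook locate the Spark installation from within Python. As an alternative sketch (not part of the original setup), Spark 2.x can instead launch Jupyter directly as the PySpark driver via environment variables; the --ip value assumes you want to reach the notebook from another machine on the network:
pi@pi1:~$ export PYSPARK_PYTHON=python3
pi@pi1:~$ export PYSPARK_DRIVER_PYTHON=jupyter
pi@pi1:~$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=pi1 --no-browser"
pi@pi1:~$ pyspark --master yarn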

Jupyter Home

Part 10: Stars, Galaxies, and Quasars: Using PySpark to Classify SDSS Data.

2. After starting the PySpark job, the in-progress job can be viewed in the Spark History Server UI (http://pi1:18080).

Spark Job In-Progress

3. Via the "Executors" tab, per-executor statistics for the master and workers can be monitored.

Spark Job Beginning

4. Under the main "Jobs" tab, the timeline of events and tasks completed over the course of the job is recorded.

Spark Job Timeline

5. Upon finishing, the job statistics relating to the executors can again be referenced via the "Executors" tab.

Spark Job Done
