Hadoop setup: Single node cluster


Single node cluster with Hadoop 2.7.3 & Java 8

(optional) VMware Player + Ubuntu 16.04

  • Download and install VMware Player

  • Download Ubuntu and set it up in VMware Player. 4096 MB (4 GB) of RAM and 16 GB of fixed hard disk storage are sufficient for the single node cluster setup we use for learning.

Important
Windows PCs have had problems supporting the 64-bit Ubuntu desktop client. If you run into this, try the 32-bit desktop client version instead. Remember to choose a 32-bit JDK if you are choosing a 32-bit OS.
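
To confirm which architecture your Ubuntu install actually is, you can check with uname (it prints 'i686' for 32-bit and 'x86_64' for 64-bit):

m@ubuntu:~$ uname -m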

Inside the VM, follow the instructions below. The remaining instructions are NOT specific to the VM setup.

User creation

We create a separate user for the Hadoop environment. Set up the user with a password and a default group. Press Enter to accept all default options while creating the user, then add the new user to sudo.

m@ubuntu:~$ sudo addgroup hadoop
m@ubuntu:~$ sudo adduser --ingroup hadoop hduser
m@ubuntu:~$ sudo adduser hduser sudo

Verify by switching to hduser:

m@ubuntu:~$ su hduser
Password:
hduser@ubuntu:$ cd ~
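
You can also confirm the group memberships with id (the numeric ids below are just an example and will differ on your machine):

hduser@ubuntu:~$ id hduser
uid=1001(hduser) gid=1001(hadoop) groups=1001(hadoop),27(sudo)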

Software setup

Update the packages:

hduser@ubuntu:~$ sudo apt-get update

Installing SSH

  • ssh : the client program used to connect to a remote machine (most Linux distributions have this by default)

  • sshd : the daemon that runs on the server. It listens for client connection requests and facilitates connections from clients to the server. (the main installation installs this daemon)

  • rsync : remote-sync is used to sync files across Linux machines. It is not required for a single node cluster environment. (most Linux distributions have this by default)

    hduser@ubuntu:~$ sudo apt-get install ssh

Verify the installation

hduser@ubuntu:~$ which ssh

should display something like '/usr/bin/ssh'

hduser@ubuntu:~$ which sshd

should display something like '/usr/sbin/sshd'
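
You can also check that the sshd daemon is actually running (on Ubuntu 16.04 the service is named 'ssh'):

hduser@ubuntu:~$ sudo service ssh status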

Setup JDK:

We can set up the JDK for Hadoop in two ways:

  1. Install a separate JDK for Hadoop so that it won’t interfere with system-wide settings. (recommended)

  2. Use an existing installation of the JDK

Install JDK:

We are going to set up the JDK within the user’s home directory so that it will not interfere with system-wide settings. Choose the JDK version that matches your platform (64-bit or 32-bit). We have chosen 'jdk-8u112-linux-i586.tar.gz' as we have a 32-bit Ubuntu OS. I have tested Hadoop 2.7.3 with JDK 8 and it works with no major issues (so far). We will create a symlink as part of the instructions.

hduser@ubuntu:~$ mkdir /home/hduser/apps
hduser@ubuntu:~$ cd /home/hduser/apps/
hduser@ubuntu:~$ wget --no-cookies --no-check-certificate --header \
                 "Cookie: oraclelicense=accept-securebackup-cookie" \
                 "http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-i586.tar.gz"
hduser@ubuntu:~$ tar  zxvf  jdk-8u112-linux-i586.tar.gz
hduser@ubuntu:~$ rm  jdk-8u112-linux-i586.tar.gz
hduser@ubuntu:~$ cd ~
hduser@ubuntu:~$ ln -s  /home/hduser/apps/jdk1.8.0_112   /home/hduser/apps/jdk_for_hadoop
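
Before wiring this JDK into the environment, you can sanity-check the extracted binaries through the symlink:

hduser@ubuntu:~$ /home/hduser/apps/jdk_for_hadoop/bin/java -version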

Follow: Setup Java Home

Setup existing JDK

If you have decided to use an already installed Java, please follow the steps below. Find the Java location with the commands below, then create a symlink to the existing JDK.

hduser@ubuntu:~$ which javac
/usr/bin/javac
hduser@ubuntu:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-oracle/bin/javac
hduser@ubuntu:~$ readlink -f /usr/lib/jvm/java-8-oracle/bin/javac
/usr/lib/jvm/java-8-oracle/bin/javac
hduser@ubuntu:~$ cd ~
hduser@ubuntu:~$ mkdir -p /home/hduser/apps
hduser@ubuntu:~$ ln -s  /usr/lib/jvm/java-8-oracle   /home/hduser/apps/jdk_for_hadoop

Run readlink -f repeatedly on the previous output until there are no more links, i.e. the command returns the same path again. Strip the trailing '/bin/javac' from the final path to get the JDK home directory, and point the symlink at that.
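
As a shortcut, the whole resolution can be done in one line, since readlink -f already follows the complete chain of links:

hduser@ubuntu:~$ readlink -f $(which javac) | sed 's|/bin/javac||'
/usr/lib/jvm/java-8-oracle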

Follow: Setup Java Home

Setup JAVA_HOME

Edit the ~/.bashrc file and append the below content to the end of the file.

hduser@ubuntu:~$ vi ~/.bashrc
export JAVA_HOME=/home/hduser/apps/jdk_for_hadoop
export PATH=$JAVA_HOME/bin:$PATH

Save the file and execute below command to re-load the environment variables.

hduser@ubuntu:~$ source ~/.bashrc

Verify if the installation is successful

hduser@ubuntu:~$ java -version

should output something similar below:

java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) Client VM (build 25.112-b15, mixed mode)
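
You can also confirm that the shell now resolves java from the symlinked JDK:

hduser@ubuntu:~$ which java
/home/hduser/apps/jdk_for_hadoop/bin/java
hduser@ubuntu:~$ echo $JAVA_HOME
/home/hduser/apps/jdk_for_hadoop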

User configuration

Setup passwordless login

We need to set up passwordless SSH login for the user we just created, so that the Hadoop processes running as that user can communicate with other machines in the cluster. This setup is still needed in our single node cluster environment. Now we will create the RSA key; press Enter and accept all default options.

hduser@ubuntu:~$ ssh-keygen -t rsa  -P ""

Now add the generated key to the key store. In a distributed environment we need to add this key to all other machines.

hduser@ubuntu:~$ cat ~/.ssh/id_rsa.pub  >> ~/.ssh/authorized_keys
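
If SSH still prompts for a password after this, the permissions on ~/.ssh may be too open for sshd to accept the key. Tightening them is harmless either way:

hduser@ubuntu:~$ chmod 700 ~/.ssh
hduser@ubuntu:~$ chmod 600 ~/.ssh/authorized_keys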

Verify that SSH works; type 'yes' when asked whether to continue connecting.

hduser@ubuntu:~$ ssh localhost
hduser@ubuntu:~$ exit

Install Hadoop

Download & Setup

Download the desired version of the Hadoop binary from your nearest mirror and set up Hadoop under the '/usr/local' directory.

hduser@ubuntu:$ cd ~
hduser@ubuntu:~$ wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
hduser@ubuntu:~$ tar -xvzf hadoop-2.7.3.tar.gz
hduser@ubuntu:~$ sudo mv hadoop-2.7.3 /usr/local/hadoop
hduser@ubuntu:~$ sudo chown -R hduser:hadoop /usr/local/hadoop

Create a separate Hadoop configuration directory (it will be set as the HADOOP_CONF_DIR environment variable later):

hduser@ubuntu:~$  sudo cp -R /usr/local/hadoop/etc/hadoop /usr/local/hadoop-conf
hduser@ubuntu:~$ sudo chown -R hduser:hadoop /usr/local/hadoop-conf

Environment variables

Edit the ~/.bashrc file and append the below content to the end of the file.

hduser@ubuntu:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop-conf
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
#HADOOP VARIABLES END

Save the file and execute below command to re-load the environment variables.

hduser@ubuntu:~$ source ~/.bashrc
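
Verify that the hadoop command is now on the PATH; the first line of the output should read 'Hadoop 2.7.3':

hduser@ubuntu:~$ hadoop version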

Edit the hadoop-env.sh

hduser@ubuntu:~$ vi /usr/local/hadoop-conf/hadoop-env.sh

append below line at the end of file:

export JAVA_HOME=/home/hduser/apps/jdk_for_hadoop

core-site.xml

Edit the core-site.xml as below and add the configuration:

hduser@ubuntu:~$ vi /usr/local/hadoop-conf/core-site.xml
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://localhost/</value>
	</property>
</configuration>
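
You can read the setting back without starting any daemon. Since no port is given in the value, HDFS will use its default RPC port of 8020:

hduser@ubuntu:~$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost/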

hdfs-site.xml

Please create the below directories for the namenode and the datanode:

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Edit hdfs-site.xml as below:

hduser@ubuntu:~$ vi /usr/local/hadoop-conf/hdfs-site.xml
<configuration>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
	</property>
</configuration>

mapred-site.xml

Copy the mapred-site.xml.template into mapred-site.xml as below:

hduser@ubuntu:~$ cp /usr/local/hadoop-conf/mapred-site.xml.template /usr/local/hadoop-conf/mapred-site.xml

Edit mapred-site.xml as below:

hduser@ubuntu:~$ vi /usr/local/hadoop-conf/mapred-site.xml
<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>

yarn-site.xml

hduser@ubuntu:~$ vi /usr/local/hadoop-conf/yarn-site.xml
<configuration>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>localhost</value>
	</property>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>127.0.0.1:8032</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
</configuration>

HDFS format

Please execute the below command to format the HDFS file system.

Warning
This has to be done only once, as part of the setup. Running it again will re-format HDFS and cause loss of data.

hduser@ubuntu:~$ hdfs namenode -format
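
On success the log should report that the storage directory has been successfully formatted, and the namenode directory should now contain metadata files:

hduser@ubuntu:~$ ls /usr/local/hadoop_store/hdfs/namenode/current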

Hadoop Daemons

Start the daemons as below.

hduser@ubuntu:~$ start-dfs.sh
hduser@ubuntu:~$ start-yarn.sh
hduser@ubuntu:~$ mr-jobhistory-daemon.sh start historyserver

Check the list of running daemons (the process ids shown below will differ):

hduser@ubuntu:~$ jps
6048 NameNode
6753 NodeManager
6386 SecondaryNameNode
6630 ResourceManager
6202 DataNode
7099 JobHistoryServer
7182 Jps
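
With all the daemons up, you can optionally run a quick smoke test using the examples jar bundled with Hadoop 2.7.3; it estimates pi with a small MapReduce job (2 maps, 5 samples each):

hduser@ubuntu:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5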

Stop the daemons in the below order.

hduser@ubuntu:~$ mr-jobhistory-daemon.sh stop historyserver
hduser@ubuntu:~$ stop-yarn.sh
hduser@ubuntu:~$ stop-dfs.sh

The activity of the above daemons can be checked from the web UIs below.