
Building a Single Node Hadoop Cluster on Ubuntu

Big Data is a big world, and so are its uses. Given its complexity, and to make it approachable for everyone, this blog post gives a step-by-step process for installing and building a single-node Hadoop cluster that can be used for learning and operational purposes. The post guides a professional with a basic IT background through the various installation steps, explains what each task is for, and in the process gives you an idea of how Hadoop actually works.

Here we go !!

Installing Java

The Hadoop framework is written in Java, so you first have to install a JDK on your Linux machine.

dwh4u@pc:~$ cd ~
 # Update the source list
 dwh4u@pc:~$ sudo apt-get update
 # The OpenJDK project is the default version of Java
 # that is provided from a supported Ubuntu repository.
 dwh4u@pc:~$ sudo apt-get install default-jdk
 dwh4u@pc:~$ java -version
 java version "1.7.0_65"
 OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
 OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Adding a dedicated Hadoop user

dwh4u@pc:~$ sudo addgroup hadoop
 Adding group `hadoop' (GID 1002) ...
 Done.
 dwh4u@pc:~$ sudo adduser --ingroup hadoop hduser
 Adding user `hduser' ...
 Adding new user `hduser' (1001) with group `hadoop' ...
 Creating home directory `/home/hduser' ...
 Copying files from `/etc/skel' ...
 Enter new UNIX password:
 Retype new UNIX password:
 passwd: password updated successfully
 Changing the user information for hduser
 Enter the new value, or press ENTER for the default
 Full Name []: dwh4u ghose
 Room Number []: L305
 Work Phone []: 333-333-3333
 Home Phone []: 777777
 Other []: other
 Is the information correct? [Y/n] Y
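
To confirm that the new account ended up in the hadoop group, you can run id; the uid and gid shown here simply mirror the values printed above and may differ on your machine:

dwh4u@pc:~$ id hduser
uid=1001(hduser) gid=1002(hadoop) groups=1002(hadoop)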

Install SSH

Hadoop connects with its various components through SSH. You can install SSH on your machine using the following command:

sudo apt-get install ssh

This will install ssh on your machine. To check that it is installed correctly, execute the following commands:

dwh4u@pc:~$ which ssh
/usr/bin/ssh
dwh4u@pc:~$ which sshd
/usr/sbin/sshd

Create and Setup SSH Certificates

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public-key authentication.
Hadoop uses SSH to access its nodes, which would normally require the user to enter a password. This requirement can be eliminated by creating and setting up SSH certificates with the following commands. If asked for a filename, just leave it blank and press the Enter key to continue.

dwh4u@pc:~$ su hduser
Password: 
hduser@pc:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): 
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
60:9c:f3:fc:0f:32:bf:30:79:c8:65:71:26:cc:7d:e9 hduser@pc
The key's randomart image is:
+--[ RSA 2048]----+
| .oo.o |
| . .o=. o |
| . + . o . |
| o = E |
| S + |
| . + |
| O + |
| O o |
| o.. |
+-----------------+
hduser@pc:/home/dwh4u$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The second command adds the newly created key to the list of authorized keys so that Hadoop can use ssh without prompting for a password.

We can check if ssh works:

hduser@pc:/home/dwh4u$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
...

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu, open /etc/sysctl.conf with

sudo nano /etc/sysctl.conf

or with another editor of your choice, and add the following lines to the end of the file:

# disable ipv6 
net.ipv6.conf.all.disable_ipv6 = 1 
net.ipv6.conf.default.disable_ipv6 = 1 
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect.
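
If you want to apply the change without waiting for a reboot, sysctl can usually reload /etc/sysctl.conf in place:

dwh4u@pc:~$ sudo sysctl -p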

You can check whether IPv6 is enabled on your machine with the following command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
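
If you would rather leave IPv6 enabled, a common alternative is to make Hadoop's JVMs prefer IPv4 by appending the standard JVM flag to HADOOP_OPTS in hadoop-env.sh (a file we edit later in this post); a minimal sketch:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"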

Install Hadoop

To install Hadoop, you have to download it from the Apache website and extract it.

hduser@pc:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@pc:~$ tar xvzf hadoop-2.6.0.tar.gz

We want to move the extracted Hadoop files to the /usr/local/hadoop directory. That requires root privileges, and hduser is not yet a sudoer, so switch back to a sudo-capable account and add hduser to the sudo group:

hduser@pc:/usr/local/hadoop$ su dwh4u
Password: 
dwh4u@pc:/home/hduser$ sudo adduser hduser sudo
[sudo] password for dwh4u: 
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

Now switch back to hduser, create the /usr/local/hadoop directory, move the extracted files into it, and change the ownership of that directory to hduser:

dwh4u@pc:/home/hduser$ sudo su hduser
hduser@pc:/home/hduser$ sudo mkdir -p /usr/local/hadoop
hduser@pc:/home/hduser$ sudo mv hadoop-2.6.0/* /usr/local/hadoop
hduser@pc:/home/hduser$ sudo chown -R hduser:hadoop /usr/local/hadoop
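
At this point /usr/local/hadoop should contain the usual top-level directories of the 2.6.0 release; a quick sanity check (the exact listing may differ slightly):

hduser@pc:/home/hduser$ ls /usr/local/hadoop
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share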

Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:

~/.bashrc
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
/usr/local/hadoop/etc/hadoop/core-site.xml
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc:

Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed, since that is the value we will use for the JAVA_HOME environment variable.
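
One way to find that path is update-alternatives; on the setup above it resolves to /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java, which is why JAVA_HOME below is set to /usr/lib/jvm/java-7-openjdk-amd64 (your path may differ if you installed a different JDK):

hduser@pc:~$ update-alternatives --config java

With that path in hand, install nano if it is not already present and open ~/.bashrc: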

# Install nano in case it is not preinstalled
hduser@pc:~$ sudo apt-get install nano
hduser@pc:~$ nano ~/.bashrc

Append the following lines to the end of the ~/.bashrc file:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

Use the source command to apply the new environment variables to the current session:

hduser@pc:~$ source ~/.bashrc
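
With the variables sourced, the hadoop command should now be on the PATH and report the version we downloaded (2.6.0):

hduser@pc:~$ hadoop version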

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

We need to set JAVA_HOME by modifying the hadoop-env.sh file:

hduser@pc:~$ nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Change the line

export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Setting JAVA_HOME explicitly in hadoop-env.sh ensures that the value is available to Hadoop whenever it starts up.

3. /usr/local/hadoop/etc/hadoop/core-site.xml:

The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop reads when starting up, and it can be used to override the default settings Hadoop ships with. Before editing this file, we need to create the directories that will hold the namenode data, the datanode data, and Hadoop's temporary files for this installation.
This can be done using the following commands:

hduser@pc:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@pc:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@pc:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/tmp
hduser@pc:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
hduser@pc:~$ nano /usr/local/hadoop/etc/hadoop/core-site.xml
Open the file and enter the following in between the <configuration></configuration> tag:
<configuration>
 <property>
 <name>hadoop.tmp.dir</name>
 <value>/usr/local/hadoop_store/hdfs/tmp</value>
 <description>A base for other temporary directories.</description>
 </property>
 
 <property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description>The name of the default file system. A URI whose
 scheme and authority determine the FileSystem implementation. The
 uri's scheme determines the config property (fs.SCHEME.impl) naming
 the FileSystem implementation class. The uri's authority is used to
 determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>
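
A small note on naming: fs.default.name still works on Hadoop 2.x but is deprecated in favour of fs.defaultFS, so you may prefer the newer property name with the same value:

 <property>
 <name>fs.defaultFS</name>
 <value>hdfs://localhost:54310</value>
 </property>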

4. /usr/local/hadoop/etc/hadoop/mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains a template file which has to be copied and renamed to mapred-site.xml:

/usr/local/hadoop/etc/hadoop/mapred-site.xml.template

hduser@pc:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

hduser@pc:~$ nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:

<configuration>
 <property>
 <name>mapred.job.tracker</name>
 <value>localhost:54311</value>
 <description>The host and port that the MapReduce job tracker runs
 at. If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
 </property>
</configuration>
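
mapred.job.tracker is the classic MRv1 JobTracker property. If you would rather have MapReduce jobs run on YARN, which ships with Hadoop 2.6, the property usually set in this file is mapreduce.framework.name, for example:

<configuration>
 <property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 </property>
</configuration>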

5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on that host.

hduser@pc:~$ nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Open the file and enter the following content in between the <configuration></configuration> tag:
<configuration>
 <property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default block replication.
 The actual number of replications can be specified when the file is created.
 The default is used if replication is not specified in create time.
 </description>
 </property>
 <property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
 </configuration>

That completes the installation and configuration of a single-node Hadoop cluster.
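
Before using the cluster for the first time, the remaining steps are to format the new HDFS filesystem and start the daemons; a minimal sketch of those commands, run as hduser:

hduser@pc:~$ hdfs namenode -format
hduser@pc:~$ start-dfs.sh
hduser@pc:~$ start-yarn.sh
hduser@pc:~$ jps

If everything is in order, jps should list processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager, and the NameNode web UI should be reachable at http://localhost:50070.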
