In this article, we’ll explore how to set up a custom single-node Hadoop cluster in pseudo-distributed mode.
Apache Hadoop provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. It then ships packaged code to those nodes so the data can be processed in parallel. This approach takes advantage of data locality, where each node works on the data it stores locally, allowing the dataset to be processed faster and more efficiently. Hadoop can run in three different modes: Standalone mode, Pseudo-Distributed mode, and Fully-Distributed mode.
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging and does not use HDFS.
Pseudo-distributed mode is also known as a single-node cluster where both NameNode and DataNode will reside on the same machine.
Fully-distributed mode is the production mode of Hadoop where multiple nodes will be running.
This guide helps you create your own custom Hadoop pseudo-distributed cluster. The environment used in this setup is Ubuntu 18.04 and the Hadoop version is 3.1.2.
Prerequisites
Create A New User [optional]
Note: Follow this step if you want to install Hadoop for a new user; otherwise, skip to the next part.
Open a new terminal (Ctrl+Alt+T) & type the following commands.
Firstly, create a group hadoop
sudo addgroup hadoop
And add a new user hdfsuser within the same hadoop group.
sudo adduser --ingroup hadoop hdfsuser
Note:
If you want to add an existing user to the group instead, use the following command
sudo usermod -a -G hadoop username
Now give hdfsuser the root permissions needed for the installation. Root privileges can be granted by updating the sudoers file. Open the sudoers file by running the following command in the terminal
sudo visudo
Add or edit the following line
hdfsuser ALL=(ALL:ALL) ALL
Now, save the changes (Ctrl+O & press enter) and close the editor (Ctrl+X).
Now, let’s switch to our newly created user for the remaining installation steps.
su - hdfsuser
Java installation
Hadoop is built using Java, and Java is required to run MapReduce code. The Java version should be Java 8 or above (i.e. Java 1.8+) for recent Hadoop releases. If you already have Java installed on your system, check whether you have the required version by running the following command in the terminal
java -version
If you have the required version, skip to the next step.
Note: If you plan on installing Hive as well, prefer Java 8: Hive expects the system class loader to be a URLClassLoader, which is no longer the case from Java 9 onward.
You can either install java from your OS package manager or from the official oracle website (https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html).
Installation using apt (Java 8)
sudo apt-get update
sudo apt install openjdk-8-jre-headless openjdk-8-jdk
To verify your installation, run the following command in terminal
java -version
Setting up SSH keys
Hadoop uses SSH (Secure Shell) to launch the daemon processes on the worker (slave) nodes. It requires a password-less SSH connection between the master and all the slave and secondary machines.
We need password-less SSH because, when the cluster is live and running, this communication happens far too frequently for interactive password prompts; the start-up scripts and the resource manager must be able to reach the worker daemons without delay.
Note: Do not skip this step unless you already have password-less SSH set up. It is essential for starting Hadoop services such as the ResourceManager and NodeManager.
Installing required packages
Run the following commands in your terminal
sudo apt-get install ssh
sudo apt-get install openssh-server
Generating keys … (when prompted, press Enter to accept the default file location and leave the passphrase empty)
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
Now we’ve successfully generated the SSH key pair and appended the public key to authorized_keys.
Verify the secure connection by running the following command in your terminal.
ssh localhost
Note: If it logs you in without asking for a password, the configuration was successful; otherwise, remove the generated keys and follow the steps again (a sketch of this is shown below).
Don’t forget to exit from the localhost (Type exit in the terminal and press enter)
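If it did prompt for a password, one way to start over is to remove the generated key pair and its copied entry, then repeat the key-generation steps. A minimal sketch, assuming the default key file names used above:
# remove the old key pair; this also invalidates the copied public key
rm ~/.ssh/id_rsa ~/.ssh/id_rsa.pub
# remove (or edit) authorized_keys; note this drops any other keys stored there
rm ~/.ssh/authorized_keys
# regenerate and re-authorize, as before
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys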
Now our prerequisites for the Hadoop installation have been completed successfully.
Hadoop 3.x Installation
Now, let’s begin the installation process by downloading the latest stable Hadoop release.
To download the release of your choice, use the following commands. (Change the directory and download link according to your preference).
cd /usr/local
sudo wget http://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
Extract the Hadoop archive in the same location.
sudo tar xvzf hadoop-3.1.2.tar.gz
Rename the extracted folder
sudo mv hadoop-3.1.2 hadoop
Setup Hadoop in Pseudo-Distributed Mode
Now let’s provide ownership of hadoop to our hdfsuser [skip if you do not want to change ownership]
sudo chown -R hdfsuser:hadoop /usr/local/hadoop
Give the hadoop folder read, write and execute permissions (mode 777 is convenient for a learning setup, but far too permissive for production).
sudo chmod -R 777 /usr/local/hadoop
Disable IPv6
Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster.
Take a look at HADOOP-3437 and HADOOP-6056 to understand why it is necessary to disable IPv6 to get Hadoop working.
You can check the status of your IPv6 configuration by running the following command in the terminal
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If the result is not 1, follow the steps below to disable IPv6
sudo nano /etc/sysctl.conf
Now add the following lines to the end of the file
# Disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
Save the file and exit
If IPv6 is still not disabled, the changes in /etc/sysctl.conf have probably not been applied yet. Apply them by running
sudo sysctl -p
After that, the cat command above should print 1.
Adding Hadoop Environment Variables
Adding the Hadoop path to the environment is necessary; otherwise, you would have to change into the Hadoop directory every time you want to run a command.
Open the bashrc file by running
sudo nano ~/.bashrc
Add the following lines to the end of the bashrc file
# HADOOP ENVIRONMENT
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
export PATH=$PATH:/usr/local/hadoop/sbin
# HADOOP NATIVE PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Now, load the hadoop environment variables by running the following command
source ~/.bashrc
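To confirm that the new variables are loaded, a quick check such as the following should print the Hadoop install directory and the location of the hadoop binary (assuming the paths used above):
echo $HADOOP_HOME    # should print /usr/local/hadoop
which hadoop         # should print /usr/local/hadoop/bin/hadoop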
Configuring Hadoop …
Change the working directory to the Hadoop configuration location
cd /usr/local/hadoop/etc/hadoop/
- hadoop-env.sh
Open the hadoop-env file by running the following command
sudo nano hadoop-env.sh
Add the following configurations to the end of the file (Change the java path and user name according to your setup)
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/usr
export HADOOP_HOME_WARN_SUPPRESS="TRUE"
export HADOOP_ROOT_LOGGER="WARN,DRFA"
export HDFS_NAMENODE_USER="hdfsuser"
export HDFS_DATANODE_USER="hdfsuser"
export HDFS_SECONDARYNAMENODE_USER="hdfsuser"
export YARN_RESOURCEMANAGER_USER="hdfsuser"
export YARN_NODEMANAGER_USER="hdfsuser"
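If you are unsure what to use for JAVA_HOME, one common way to find a candidate is to resolve the real location of the java binary; the exact output depends on your system, but with the openjdk-8 packages on Ubuntu it typically lives under /usr/lib/jvm:
# resolve the java symlink and strip the trailing /bin/java
readlink -f $(which java) | sed 's:/bin/java::'
Setting JAVA_HOME=/usr, as above, also works here because Hadoop looks for the executable at $JAVA_HOME/bin/java.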
- yarn-site.xml
Open the yarn-site file by running the following command
sudo nano yarn-site.xml
Add the following configurations between the configuration tag (<configuration></configuration>)
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
- hdfs-site.xml
Open the hdfs-site file by running the following command
sudo nano hdfs-site.xml
Add the following configurations between the configuration tag (<configuration></configuration>)
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:50070</value>
</property>
- core-site.xml
Open the core-site file by running the following command
sudo nano core-site.xml
Add the following configurations between the configuration tag (<configuration></configuration>)
<property>
  <name>hadoop.tmp.dir</name>
  <value>/bigdata/hadoop/tmp</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
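After saving the file, you can verify that Hadoop picks up the configured filesystem URI with the getconf tool (this assumes the environment variables and the hadoop-env.sh changes from the previous steps are already in place):
hdfs getconf -confKey fs.defaultFS    # should print hdfs://localhost:9000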
- mapred-site.xml
Open the mapred-site file by running the following command
sudo nano mapred-site.xml
Add the following configurations between the configuration tag (<configuration></configuration>)
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>localhost:10020</value>
</property>
All done… Now let’s create some directories for Hadoop to store its data.
Creating directories
Let’s create a temp directory for dfs as mentioned in our core-site.xml file
sudo mkdir -p /bigdata/hadoop/tmp
sudo chown -R hdfsuser:hadoop /bigdata/hadoop/tmp
sudo chmod -R 777 /bigdata/hadoop/tmp
Note: If you don’t want to change ownership, skip the chown command. Remember this in further steps also.
Now, as mentioned in our hdfs-site.xml file, create the directories for the NameNode and DataNode data
sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/datanode
sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/namenode
sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/datanode
sudo chown -R hdfsuser:hadoop /usr/local/hadoop/yarn_data/hdfs/namenode
sudo chown -R hdfsuser:hadoop /usr/local/hadoop/yarn_data/hdfs/datanode
Okay, so far so good. We’ve completed all the necessary configuration; now let’s start the Hadoop services (HDFS and YARN).
Finishing up
Before we start the Hadoop core services, we need to initialise the cluster by formatting the NameNode. Do this again whenever you change the NameNode or DataNode directory configuration, but keep in mind that formatting erases any data already stored in HDFS.
hdfs namenode -format
Now we can start all the Hadoop services by running the following commands
start-dfs.sh
start-yarn.sh
Note:
You can also use
start-all.sh
to start all the services
Extras
You can check whether your NameNode is up and running by navigating to the following URL (this is the port we set via dfs.namenode.http-address; the Hadoop 3 default is 9870).
http://localhost:50070
To access ResourceManager, navigate to the ResourceManager web UI at
http://localhost:8088
To check whether the Hadoop daemons are running, you can use jps, the JVM Process Status tool.
jps
This lists the currently running Java processes.
In case of a successful installation, you should see these services listed,
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Note: If you find namenode not listed in the services, then ensure that you have formatted the namenode.
If the DataNode is missing, it might be due to insufficient permissions on the DataNode directory; either change the directory to a location where the user has read & write access, or use the chown method described earlier.
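Once all five daemons are listed, you can optionally smoke-test the cluster by copying a few files into HDFS and running the word-count example that ships with Hadoop. The directory names below are only illustrative, and the jar path assumes the 3.1.2 layout used in this guide:
# create an input directory in HDFS and upload some sample files
hdfs dfs -mkdir -p /user/hdfsuser/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hdfsuser/input
# run the bundled word-count job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount /user/hdfsuser/input /user/hdfsuser/output
# inspect the result
hdfs dfs -cat /user/hdfsuser/output/part-r-00000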
To stop all the hadoop core services, try any of the following methods
stop-dfs.sh
stop-yarn.sh
OR
stop-all.sh