How to set up a custom Hadoop single node cluster in the pseudo-distributed mode

  • by Adarsh M S on Thu Aug 13

In this article, we’ll explore how to set up a custom Hadoop single node cluster in the pseudo-distributed mode.

Apache Hadoop provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes in a cluster. It then transfers packaged code to the nodes so that the data can be processed in parallel. This approach takes advantage of data locality, where each node works on the data it has local access to, allowing the dataset to be processed faster and more efficiently. Hadoop can be installed in three different modes: Standalone mode, Pseudo-distributed mode, and Fully-distributed mode.

Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging and does not really use HDFS.

Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the DataNode reside on the same machine.

Fully-distributed mode is the production mode of Hadoop, in which the cluster runs across multiple nodes.

This guide helps you create your own custom Hadoop pseudo-distributed cluster. The environment used in this setup is Ubuntu 18.04, and the Hadoop version is 3.1.2.

Prerequisites

Create A New User [optional]

Note: Follow this step if you want to install Hadoop under a new user; otherwise, skip to the next part.

Open a new terminal (Ctrl+Alt+T) and type the following commands.

First, create a group named hadoop

sudo addgroup hadoop

Then add a new user hdfsuser to the same hadoop group.

sudo adduser --ingroup hadoop hdfsuser

Note: If you want to add an existing user to the group instead, use the following command

sudo usermod -a -G hadoop username

Now give hdfsuser the root permissions needed to install files. Root privileges can be granted by updating the sudoers file. Open the sudoers file by running the following command in the terminal

sudo visudo

Add or edit the following line

hdfsuser ALL=(ALL:ALL) ALL

Now, save the changes (Ctrl+O & press enter) and close the editor (Ctrl+X).

Now, let's switch to the newly created user for the remaining installation steps.

su - hdfsuser

Java installation

Hadoop is built with Java, and Java is required to run MapReduce code. The Java version should be Java 8 or above (i.e., Java 1.8+) for recent Hadoop releases. If you already have Java on your system, check whether you have the required version by running the following command in the terminal

java -version

If you have the required version, skip to the next step.

Note: If you plan on installing Hive as well, prefer Java 8, as newer Java versions no longer use URLClassLoader as the system class loader, which breaks older Hive releases.

You can install Java either from your OS package manager or from the official Oracle website (https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html).

Installation using apt (Java 8)

sudo apt-get update
sudo apt install openjdk-8-jre-headless openjdk-8-jdk

To verify your installation, run the following command in terminal

java -version
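
If you later need the full JDK path (for example, when setting JAVA_HOME in hadoop-env.sh), the following command resolves the location of the java binary; the example output below is only an illustration and may differ on your machine.

readlink -f "$(which java)"
# Example (Ubuntu, OpenJDK 8): /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java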

Setting up SSH keys

The Hadoop core uses Secure Shell (SSH) to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slave and secondary machines.

We need password-less SSH because, when the cluster is live and running, communication between the nodes is very frequent; the job tracker should be able to send tasks to the task trackers quickly.

Note: Do not skip this step unless you already have a password-less SSH setup. This step is essential for starting Hadoop services such as the ResourceManager and NodeManager.

Installing required packages

Run the following commands in your terminal

sudo apt-get install ssh
sudo apt-get install openssh-server

Generating Keys …

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys

Now we've successfully generated the SSH keys and copied the public key to authorized_keys.

Verify the secure connection by running the following command in your terminal.

ssh localhost

Note: If it doesn’t ask for a password and logs you in, then the configuration was successful, else remove the generated key and follow the steps again.

Don't forget to exit from localhost (type exit in the terminal and press Enter).

Now the prerequisites for the Hadoop installation have been completed successfully.

Hadoop 3.x Installation

Now, let's begin the Hadoop installation by downloading a stable release.

To download the release of your choice, use the following commands. (Change the directory and download link according to your preference).

cd /usr/local
sudo wget http://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz

Extract the Hadoop archive in the same location.

sudo tar xvzf hadoop-3.1.2.tar.gz

Rename the extracted folder

sudo mv hadoop-3.1.2 hadoop
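
To confirm the extraction, you can list the renamed directory; you should see folders such as bin, etc, sbin, and share.

ls /usr/local/hadoop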

Set Up Hadoop in Pseudo-Distributed Mode

Now let's give ownership of the hadoop folder to our hdfsuser [skip this if you do not want to change ownership]

sudo chown -R hdfsuser:hadoop /usr/local/hadoop

Change the permissions of the hadoop folder so that it can be read, written, and executed.

sudo chmod -R 777 /usr/local/hadoop

Disable IPv6

Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster.

Take a look at HADOOP-3437 and HADOOP-6056 to understand why it is necessary to disable IPv6 to get Hadoop working.

You can check the status of your IPv6 configuration by running the following command in the terminal

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If the result is not 1, follow the steps below to disable IPv6

sudo nano /etc/sysctl.conf

Now add the following lines to the end of the file

# Disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Save the file and exit

If IPv6 is still not disabled, the problem is likely that /etc/sysctl.conf has not been reloaded. To fix this, apply the configuration by running

sudo sysctl -p
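
Then re-run the earlier check; it should now print 1.

cat /proc/sys/net/ipv6/conf/all/disable_ipv6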

Adding Hadoop Environment Variables

Adding the Hadoop path to the environment is necessary; otherwise, you would have to change into the Hadoop directory every time you run a command.

Open the bashrc file by running

nano ~/.bashrc

Add the following lines to the end of the bashrc file

# HADOOP ENVIRONMENT
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
export PATH=$PATH:/usr/local/hadoop/sbin

# HADOOP NATIVE PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Now, load the hadoop environment variables by running the following command

source ~/.bashrc
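
To verify that the variables were loaded, print HADOOP_HOME and confirm that the hadoop binary is on your PATH.

echo $HADOOP_HOME
hadoop version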

Configuring Hadoop …

Change the working directory to the Hadoop configuration location

cd /usr/local/hadoop/etc/hadoop/
  • hadoop-env.sh
    Open the hadoop-env file by running the following command
sudo nano hadoop-env.sh

Add the following configurations to the end of the file (Change the java path and user name according to your setup)

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export JAVA_HOME=/usr
export HADOOP_HOME_WARN_SUPPRESS="TRUE"
export HADOOP_ROOT_LOGGER="WARN,DRFA"
export HDFS_NAMENODE_USER="hdfsuser"
export HDFS_DATANODE_USER="hdfsuser"
export HDFS_SECONDARYNAMENODE_USER="hdfsuser"
export YARN_RESOURCEMANAGER_USER="hdfsuser"
export YARN_NODEMANAGER_USER="hdfsuser"
  • yarn-site.xml
    Open the yarn-site file by running the following command
sudo nano yarn-site.xml

Add the following configurations between the configuration tag (<configuration></configuration>)

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
  • hdfs-site.xml
    Open the hdfs-site file by running the following command
sudo nano hdfs-site.xml

Add the following configurations between the configuration tag (<configuration></configuration>)

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:50070</value>
</property>
  • core-site.xml
    Open the core-site file by running the following command
sudo nano core-site.xml

Add the following configurations between the configuration tag (<configuration></configuration>)

<property>
<name>hadoop.tmp.dir</name>
<value>/bigdata/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
  • mapred-site.xml
    Open the mapred-site file by running the following command
sudo nano mapred-site.xml

Add the following configurations between the configuration tag (<configuration></configuration>)

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
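
As a quick sanity check on the configuration files, you can ask Hadoop to print back an effective property value, for example, the filesystem URI set in core-site.xml.

hdfs getconf -confKey fs.defaultFS

This should print hdfs://localhost:9000.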

All done. Now let's create some directories for Hadoop to store its data.

Creating directories

Let's create the temp directory for DFS, as specified in our core-site.xml file

sudo mkdir -p /bigdata/hadoop/tmp
sudo chown -R hdfsuser:hadoop /bigdata/hadoop/tmp
sudo chmod -R 777 /bigdata/hadoop/tmp

Note: If you don't want to change ownership, skip the chown command. The same applies to the following steps.

Now, as specified in our hdfs-site.xml file, create the directories where the NameNode and DataNode store their data

sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/yarn_data/hdfs/datanode
sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/namenode
sudo chmod -R 777 /usr/local/hadoop/yarn_data/hdfs/datanode
sudo chown -R hdfsuser:hadoop /usr/local/hadoop/yarn_data/hdfs/namenode
sudo chown -R hdfsuser:hadoop /usr/local/hadoop/yarn_data/hdfs/datanode

Okay, so far so good. We've completed all the necessary configuration; now let's start the ResourceManager and NodeManager.

Finishing up

Before we start the Hadoop core services, we need to clean the cluster by formatting the NameNode. Whenever you change the NameNode or DataNode configuration, don't forget to do this.

hdfs namenode -format

Now we can start all the Hadoop services by running the following commands

start-dfs.sh
start-yarn.sh

Note: You can also use

start-all.sh

to start all the services.

Extras

You can check whether your NameNode is up and running by navigating to the following URL.

http://localhost:50070
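
If you prefer the command line, a quick way to check that the web UI responds (assuming curl is installed) is:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070

A 200 response means the NameNode UI is up.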

To access ResourceManager, navigate to the ResourceManager web UI at

http://localhost:8088

To check if HDFS is running, you can use the Java Process Status tool.

jps

This lists the running Java processes.

If the installation was successful, you should see these services listed:

ResourceManager
DataNode
SecondaryNameNode
NodeManager
NameNode

Note: If the NameNode is not listed among the services, make sure that you have formatted the NameNode.
If the DataNode is missing, it might be due to insufficient user permissions on the DataNode directory; change the directory to a location where the user has read and write access, or use the chown method described earlier.
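
As a final sanity check (a minimal example, assuming the services above are running), create a directory in HDFS and list it back.

hdfs dfs -mkdir -p /user/hdfsuser
hdfs dfs -ls /user

If both commands succeed, HDFS is accepting reads and writes.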

To stop all the Hadoop core services, use either of the following methods

stop-dfs.sh
stop-yarn.sh

OR

stop-all.sh

I hope we all agree that our future will be highly data-driven. In today's connected and digitally transformed world, data collected from several sources can help an organization foresee its future and make informed decisions to perform better. Here is an interesting article to learn more about Big Data and Big Data Ingestion.
