Updated June 7, 2023
Introduction to HDFS Commands
Big data is a term for datasets so large or complex that conventional data processing software cannot deal with them. Hadoop is an open-source, Java-based framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is developed and maintained by the Apache Software Foundation. In this topic, we will learn about the different HDFS commands.
Features of HDFS
- HDFS follows a master/slave architecture.
- HDFS stores user data in files and holds a large set of directories and files in a hierarchical format.
- A file is split into smaller blocks, and these blocks are stored across a set of DataNodes (see the sketch after this list).
- The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux OS.
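To see this block layout in practice, you can ask HDFS to report the blocks behind a file with hdfs fsck. This is a minimal sketch; /user/cloudera/sample.txt is a hypothetical example path:

```bash
# Report how HDFS split the file into blocks and where each replica lives.
# /user/cloudera/sample.txt is a hypothetical example path.
hdfs fsck /user/cloudera/sample.txt -files -blocks -locations
```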
Namenode
- The NameNode maintains the file system namespace.
- The NameNode also logs every file system change in an EditLog, and it keeps an image of the complete file system namespace and file Blockmap in memory.
- Checkpointing is done periodically, so the file system can easily be recovered to its state at the last checkpoint before a crash (a manual checkpoint is sketched after this list).
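A checkpoint can also be triggered by hand with hdfs dfsadmin. A minimal sketch, assuming a cluster where HDFS runs as the hdfs superuser; note that -saveNamespace requires the NameNode to be in safe mode:

```bash
# Enter safe mode, write the current in-memory namespace out as a new FsImage
# checkpoint, then leave safe mode again. Needs HDFS superuser privileges.
sudo -u hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs hdfs dfsadmin -saveNamespace
sudo -u hdfs hdfs dfsadmin -safemode leave
```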
Datanode
- A DataNode stores HDFS data in files in its local file system.
- To signal that it is alive, each DataNode sends a periodic heartbeat to the NameNode (see the sketch after this list).
- In addition to heartbeats, each DataNode periodically sends a block report listing the blocks it stores.
- The data stored on these DataNodes is replicated.
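The effect of these heartbeats and block reports can be observed from the command line; hdfs dfsadmin -report prints each DataNode together with its last contact time:

```bash
# Summarize cluster state: live and dead DataNodes, per-node capacity, and
# the timestamp of each DataNode's last contact (heartbeat) with the NameNode.
hdfs dfsadmin -report
```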
Data Replication
- A file is stored as a sequence of blocks, with a default block size of 128 MB (a per-file override is sketched after this list).
- All blocks in a file, apart from the final one, are the same size.
- The NameNode receives a heartbeat from every DataNode in the cluster.
- A BlockReport contains a list of all the blocks on a DataNode.
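The 128 MB figure is only a default; the block size can be overridden per upload through the dfs.blocksize property. A minimal sketch, where sample.txt and the destination directory are hypothetical examples:

```bash
# Upload a file with a 256 MB (268435456-byte) block size instead of the
# 128 MB default. The file and destination path are hypothetical examples.
hadoop fs -D dfs.blocksize=268435456 -put sample.txt /user/cloudera/
```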
Job tracker: The JobTracker consults the NameNode to determine the location of the data, and then locates the most suitable TaskTracker nodes to execute the tasks, based on data locality.
Task tracker: A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce, and Shuffle operations – from a JobTracker.
Secondary Name node (or) checkpoint node: It regularly fetches the EditLog from the NameNode, applies it to its copy of the FsImage, and copies the merged FsImage back to the NameNode, which uses it on its next restart. The Secondary NameNode's whole purpose is to provide a checkpoint in HDFS.
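Where a Secondary NameNode is running, a checkpoint can be forced from the command line; a minimal sketch:

```bash
# Force an immediate checkpoint: fetch the EditLog from the NameNode, merge
# it into the FsImage, and upload the result back to the NameNode.
hdfs secondarynamenode -checkpoint force
```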
YARN
- YARN has a central ResourceManager component that manages resources and assigns them to each application.
- The ResourceManager is the master that arbitrates the resources of the cluster; it is composed of two components, the ApplicationsManager and a Scheduler, which together manage the jobs running on the cluster (see the CLI sketch after this list). A separate NodeManager (NM) component manages the users' jobs and workflow on a given node.
- The Standby NameNode holds an exact replica of the active NameNode's namespace. It acts as a slave and maintains enough state to provide a fast failover, if necessary.
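Both halves of this design can be inspected from the command line. The yarn CLI queries the ResourceManager directly; a minimal sketch:

```bash
# Ask the ResourceManager for the applications it is currently tracking.
yarn application -list

# Ask the ResourceManager for the NodeManagers registered with it.
yarn node -list
```

In an HA deployment, the role of each NameNode can likewise be checked with hdfs haadmin. Here nn1 and nn2 are hypothetical NameNode IDs; substitute the ones from your cluster's HA configuration:

```bash
# Report whether each configured NameNode is currently active or standby.
# nn1 and nn2 are hypothetical NameNode IDs from hdfs-site.xml.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```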
Basic HDFS Commands
Given below are the basic commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 1 | Print the Hadoop version | $ hadoop version |
| 2 | List the contents of the root directory in HDFS | $ hadoop fs -ls / |
| 3 | Report the amount of space used and available on the currently mounted filesystem | $ hadoop fs -df hdfs:/ |
| 4 | Run the HDFS balancer, which re-balances data across the DataNodes by moving blocks from over-utilized to under-utilized nodes | $ hadoop balancer |
| 5 | Print help for the fs commands | $ hadoop fs -help |
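Assuming a configured Hadoop client, the basic commands can be run in sequence as a quick smoke test:

```bash
# Print the installed Hadoop version.
hadoop version

# List the contents of the HDFS root directory.
hadoop fs -ls /

# Show used and available space on the mounted HDFS filesystem.
hadoop fs -df hdfs:/

# Print usage help for the fs subcommands.
hadoop fs -help
```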
Intermediate HDFS Commands
Given below are the intermediate commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 6 | Create a directory at the specified HDFS location | $ hadoop fs -mkdir /user/cloudera/ |
| 7 | Copy data from the local file system to HDFS | $ hadoop fs -put data/sample.txt /user/training/Hadoop |
| 8 | Show the space occupied by a particular directory in HDFS | $ hadoop fs -du -s -h /user/cloudera/ |
| 9 | Remove a directory in HDFS | $ hadoop fs -rm -r /user/cloudera/pigjobs/ |
| 10 | Remove all files in the given directory, bypassing the trash | $ hadoop fs -rm -skipTrash hadoop/retail/* |
| 11 | Empty the trash | $ hadoop fs -expunge |
| 12 | Copy data between the local file system and HDFS | $ hadoop fs -copyFromLocal /home/cloudera/sample/ /user/cloudera/flume/ <br> $ hadoop fs -copyToLocal /user/cloudera/pigjobs/* /home/cloudera/oozie/ |
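A typical round trip with these commands, using hypothetical paths:

```bash
# Create a working directory in HDFS (hypothetical path).
hadoop fs -mkdir -p /user/cloudera/demo/

# Upload a local file, then check how much space the directory occupies.
hadoop fs -put data/sample.txt /user/cloudera/demo/
hadoop fs -du -s -h /user/cloudera/demo/

# Copy the file back to the local filesystem, then remove the HDFS copy.
hadoop fs -copyToLocal /user/cloudera/demo/sample.txt /tmp/
hadoop fs -rm -r /user/cloudera/demo/
```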
Advanced HDFS Commands
Given below are the advanced commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 13 | Change file permissions | $ sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/flume/ |
| 14 | Set the data replication factor for a file | $ hadoop fs -setrep -w 5 /user/cloudera/pigjobs/ |
| 15 | Count the number of directories, files, and bytes under HDFS | $ hadoop fs -count hdfs:/ |
| 16 | Make the NameNode leave safe mode | $ sudo -u hdfs hdfs dfsadmin -safemode leave |
| 17 | Format the NameNode | $ hadoop namenode -format |
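A sketch of these advanced commands on hypothetical paths; note that -setrep -w waits until the new replication factor is actually met, which can take a while on a busy cluster:

```bash
# Open up permissions on a directory (hypothetical path) as the HDFS superuser.
sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/flume/

# Raise the replication factor to 5 and wait (-w) until it is satisfied.
hadoop fs -setrep -w 5 /user/cloudera/pigjobs/

# Count directories, files, and bytes under the HDFS root.
hadoop fs -count hdfs:/
```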
Tips and Tricks for Using HDFS Commands
1) Recovery is faster when the cluster has more nodes, since re-replication work is spread across them.
2) An increase in storage per node increases the recovery time.
3) NameNode hardware has to be very reliable.
4) Sophisticated monitoring can be achieved through Ambari.
5) System starvation can be decreased by increasing the reducer count (see the sketch below for setting it per job).
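For tip 5, the reducer count is normally set per job through the mapreduce.job.reduces property. A minimal sketch, assuming the job driver uses ToolRunner; wordcount.jar, WordCount, and the paths are hypothetical examples:

```bash
# Run a MapReduce job with 10 reducers instead of the configured default.
# The jar, class name, and input/output paths are hypothetical examples.
hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=10 /input /output
```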
Recommended Articles
This has been a guide to HDFS Commands. Here we discussed the features of HDFS and its basic, intermediate, and advanced commands in tabular form, along with useful tips and tricks. You can also go through our other suggested articles to learn more –