Updated June 7, 2023
Introduction to HDFS Commands
Big data is a term for datasets so large or complex that conventional data processing software cannot deal with them. Hadoop is an open-source, Java-based framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is developed and maintained by the Apache Software Foundation. In this topic, we will learn about the different HDFS commands.
Features of HDFS
- HDFS follows a master/slave architecture.
- HDFS stores user data in files and holds a large set of directories and files in a hierarchical format.
- A file is split into smaller blocks, and these blocks are stored across a set of DataNodes (see the sketch after this list).
- The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux OS.
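To see this block layout in practice, you can ask HDFS to report the blocks behind a file with hdfs fsck. This is a minimal sketch; /user/cloudera/sample.txt is a hypothetical example path:

```bash
# Report how HDFS split the file into blocks and where each replica lives.
# /user/cloudera/sample.txt is a hypothetical example path.
hdfs fsck /user/cloudera/sample.txt -files -blocks -locations
```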
Namenode
- The NameNode maintains the file system namespace.
- The NameNode also logs every file system change in an EditLog, and it keeps an image of the complete file system namespace and file Blockmap in memory.
- Checkpointing is done periodically, so the file system can easily be recovered to its state at the last checkpoint before a crash (a manual checkpoint is sketched after this list).
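A checkpoint can also be triggered by hand with hdfs dfsadmin. A minimal sketch, assuming a cluster where HDFS runs as the hdfs superuser; note that -saveNamespace requires the NameNode to be in safe mode:

```bash
# Enter safe mode, write the current in-memory namespace out as a new FsImage
# checkpoint, then leave safe mode again. Needs HDFS superuser privileges.
sudo -u hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs hdfs dfsadmin -saveNamespace
sudo -u hdfs hdfs dfsadmin -safemode leave
```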
Datanode
- A DataNode stores HDFS data in files in its local file system.
- To signal that it is alive, each DataNode sends a periodic heartbeat to the NameNode (see the sketch after this list).
- In addition to heartbeats, each DataNode periodically sends a block report listing the blocks it stores.
- The data stored on these DataNodes is replicated.
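The effect of these heartbeats and block reports can be observed from the command line; hdfs dfsadmin -report prints each DataNode together with its last contact time:

```bash
# Summarize cluster state: live and dead DataNodes, per-node capacity, and
# the timestamp of each DataNode's last contact (heartbeat) with the NameNode.
hdfs dfsadmin -report
```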
Data Replication
- A file is stored as a sequence of blocks, with a default block size of 128 MB (a per-file override is sketched after this list).
- All blocks in a file, apart from the final one, are the same size.
- The NameNode receives a heartbeat from every DataNode in the cluster.
- A BlockReport contains a list of all the blocks on a DataNode.
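The 128 MB figure is only a default; the block size can be overridden per upload through the dfs.blocksize property. A minimal sketch, where sample.txt and the destination directory are hypothetical examples:

```bash
# Upload a file with a 256 MB (268435456-byte) block size instead of the
# 128 MB default. The file and destination path are hypothetical examples.
hadoop fs -D dfs.blocksize=268435456 -put sample.txt /user/cloudera/
```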
Job tracker: The JobTracker consults the NameNode to determine the location of the data, and then locates the most suitable TaskTracker nodes to execute the tasks, based on data locality.
Task tracker: A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce, and Shuffle operations – from a JobTracker.
Secondary Name node (or) checkpoint node: It regularly fetches the EditLog from the NameNode, applies it to its copy of the FsImage, and copies the merged FsImage back to the NameNode, which uses it on its next restart. The Secondary NameNode's whole purpose is to provide a checkpoint in HDFS.
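Where a Secondary NameNode is running, a checkpoint can be forced from the command line; a minimal sketch:

```bash
# Force an immediate checkpoint: fetch the EditLog from the NameNode, merge
# it into the FsImage, and upload the result back to the NameNode.
hdfs secondarynamenode -checkpoint force
```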
YARN
- YARN has a central ResourceManager component that manages resources and assigns them to each application.
- The ResourceManager is the master that arbitrates the resources of the cluster; it is composed of two components, the ApplicationsManager and a Scheduler, which together manage the jobs running on the cluster (see the CLI sketch after this list). A separate NodeManager (NM) component manages the users' jobs and workflow on a given node.
- The Standby NameNode holds an exact replica of the active NameNode's namespace. It acts as a slave and maintains enough state to provide a fast failover, if necessary.
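Both halves of this design can be inspected from the command line. The yarn CLI queries the ResourceManager directly; a minimal sketch:

```bash
# Ask the ResourceManager for the applications it is currently tracking.
yarn application -list

# Ask the ResourceManager for the NodeManagers registered with it.
yarn node -list
```

In an HA deployment, the role of each NameNode can likewise be checked with hdfs haadmin. Here nn1 and nn2 are hypothetical NameNode IDs; substitute the ones from your cluster's HA configuration:

```bash
# Report whether each configured NameNode is currently active or standby.
# nn1 and nn2 are hypothetical NameNode IDs from hdfs-site.xml.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```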
Basic HDFS Commands
Given below are the basic commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 1 | Print the Hadoop version | $ hadoop version |
| 2 | List the contents of the root directory in HDFS | $ hadoop fs -ls / |
| 3 | Report the amount of space used and available on the currently mounted filesystem | $ hadoop fs -df hdfs:/ |
| 4 | Run the HDFS balancer, which re-balances data across the DataNodes by moving blocks from over-utilized to under-utilized nodes | $ hadoop balancer |
| 5 | Print help for the fs commands | $ hadoop fs -help |
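Assuming a configured Hadoop client, the basic commands can be run in sequence as a quick smoke test:

```bash
# Print the installed Hadoop version.
hadoop version

# List the contents of the HDFS root directory.
hadoop fs -ls /

# Show used and available space on the mounted HDFS filesystem.
hadoop fs -df hdfs:/

# Print usage help for the fs subcommands.
hadoop fs -help
```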
Intermediate HDFS Commands
Given below are the intermediate commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 6 | Create a directory at the specified HDFS location | $ hadoop fs -mkdir /user/cloudera/ |
| 7 | Copy data from the local file system to HDFS | $ hadoop fs -put data/sample.txt /user/training/Hadoop |
| 8 | Show the space occupied by a particular directory in HDFS | $ hadoop fs -du -s -h /user/cloudera/ |
| 9 | Remove a directory in HDFS | $ hadoop fs -rm -r /user/cloudera/pigjobs/ |
| 10 | Remove all files in the given directory, bypassing the trash | $ hadoop fs -rm -skipTrash hadoop/retail/* |
| 11 | Empty the trash | $ hadoop fs -expunge |
| 12 | Copy data between the local file system and HDFS | $ hadoop fs -copyFromLocal /home/cloudera/sample/ /user/cloudera/flume/ <br> $ hadoop fs -copyToLocal /user/cloudera/pigjobs/* /home/cloudera/oozie/ |
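A typical round trip with these commands, using hypothetical paths:

```bash
# Create a working directory in HDFS (hypothetical path).
hadoop fs -mkdir -p /user/cloudera/demo/

# Upload a local file, then check how much space the directory occupies.
hadoop fs -put data/sample.txt /user/cloudera/demo/
hadoop fs -du -s -h /user/cloudera/demo/

# Copy the file back to the local filesystem, then remove the HDFS copy.
hadoop fs -copyToLocal /user/cloudera/demo/sample.txt /tmp/
hadoop fs -rm -r /user/cloudera/demo/
```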
Advanced HDFS Commands
Given below are the advanced commands:
| Sr.No | HDFS Command Property | HDFS Command |
|-------|-----------------------|--------------|
| 13 | Change file permissions | $ sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/flume/ |
| 14 | Set the data replication factor for a file | $ hadoop fs -setrep -w 5 /user/cloudera/pigjobs/ |
| 15 | Count the number of directories, files, and bytes under HDFS | $ hadoop fs -count hdfs:/ |
| 16 | Make the NameNode leave safe mode | $ sudo -u hdfs hdfs dfsadmin -safemode leave |
| 17 | Format the NameNode | $ hadoop namenode -format |
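A sketch of these advanced commands on hypothetical paths; note that -setrep -w waits until the new replication factor is actually met, which can take a while on a busy cluster:

```bash
# Open up permissions on a directory (hypothetical path) as the HDFS superuser.
sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/flume/

# Raise the replication factor to 5 and wait (-w) until it is satisfied.
hadoop fs -setrep -w 5 /user/cloudera/pigjobs/

# Count directories, files, and bytes under the HDFS root.
hadoop fs -count hdfs:/
```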
Tips and Tricks for Using HDFS Commands
1) Recovery is faster when the cluster has more nodes, since re-replication work is spread across them.
2) An increase in storage per node increases the recovery time.
3) NameNode hardware has to be very reliable.
4) Sophisticated monitoring can be achieved through Ambari.
5) System starvation can be decreased by increasing the reducer count (see the sketch below for setting it per job).
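For tip 5, the reducer count is normally set per job through the mapreduce.job.reduces property. A minimal sketch, assuming the job driver uses ToolRunner; wordcount.jar, WordCount, and the paths are hypothetical examples:

```bash
# Run a MapReduce job with 10 reducers instead of the configured default.
# The jar, class name, and input/output paths are hypothetical examples.
hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=10 /input /output
```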
Recommended Articles
This has been a guide to HDFS Commands. Here we discussed the features of HDFS and its basic, intermediate, and advanced commands in tabular form, along with useful tips and tricks. You can also go through our other suggested articles to learn more –