Updated May 4, 2023
Hadoop Cluster Interview Questions and Answers
This article aims to help Big Data aspirants answer Hadoop Cluster interview questions related to setting up a Big Data environment in an organization. The questions cover setting up Data Nodes and Name Nodes and defining the capacity of the servers that host the Big Data daemons.
Suppose you have finally found your dream job working with Hadoop clusters but are wondering how to crack the 2023 Hadoop Cluster interview and what the probable interview questions could be. Every interview is different, and the scope of the job differs too. Keeping this in mind, we have designed the most common Hadoop Cluster interview questions and answers to help you succeed in your interview.
Some of the most essential Hadoop Cluster Interview Questions that are frequently asked in an interview are as follows:
Top 10 Hadoop Cluster Interview Questions and Answers
The top 10 Hadoop Cluster interview questions and answers are listed below.
1. What are the major Hadoop components in the Hadoop cluster?
Answer:
Hadoop is a framework for processing big data, i.e., a platform on which one can process vast amounts of data on commodity servers. Hadoop is a combination of many components. The following are the major components of a Hadoop environment:
Name Node: the master (controller) node; it holds information about all the Data Nodes and the storage location of the data in the form of metadata.
Secondary Name Node: it periodically merges the Name Node's edit log with the file system image (checkpointing) so the Name Node can recover quickly; it is not an automatic standby for the primary Name Node.
HDFS (Hadoop Distributed File System): handles all the storage of the Hadoop cluster.
Data Nodes: the worker (slave) nodes; the actual data is saved on the Data Nodes and processed there.
YARN (Yet Another Resource Negotiator): the resource-management and job-scheduling layer of the Hadoop cluster. It allocates cluster resources to applications such as MapReduce jobs and allows multiple jobs to run in parallel on the cluster.
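One quick way to check which of these daemons are running on a given host, assuming the JDK's jps tool is on the PATH, is to list the Java daemon processes (the exact set of names depends on your installation):
# List the Hadoop daemon JVMs running on the local host
jps
# Typical output on a master node: NameNode, SecondaryNameNode, ResourceManager
# Typical output on a worker node: DataNode, NodeManager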
2. How to plan data storage in a Hadoop cluster?
Answer:
Storage is based on the formula: Storage = daily data ingestion × replication factor.
If the Hadoop cluster receives 120 TB of data daily and we use the default replication factor, the daily storage requirement would be:
Storage requirement = 120 TB (daily data ingestion) × 3 (default replication factor) = 360 TB
As a result, we need to set up at least 360 TB of cluster capacity just to absorb one day of data ingestion.
Storage also depends on the data retention requirement. If we want the data to be stored for 2 years in the same cluster, we need to provision Data Nodes as per that retention requirement, as in the sketch below.
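As a rough sizing sketch (assuming the figures above plus a 2-year, i.e. 730-day, retention window), the raw capacity requirement can be estimated as follows:
# Hypothetical sizing sketch: 120 TB/day ingestion, replication factor 3, 730-day retention
DAILY_INGESTION_TB=120
REPLICATION_FACTOR=3
RETENTION_DAYS=730
echo $(( DAILY_INGESTION_TB * REPLICATION_FACTOR * RETENTION_DAYS ))  # 262800 TB, roughly 263 PB of raw capacity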
3. How to calculate the number of Data Nodes?
Answer:
We need to calculate the number of Data Nodes required for the Hadoop cluster. Suppose we have servers with a JBOD of 10 disks and each disk has 4 TB of storage, so each server provides 40 TB of storage. The Hadoop cluster receives 120 TB of data per day, which becomes 360 TB after applying the default replication factor of 3.
Number of Data Nodes = daily storage requirement / Data Node capacity
Number of Data Nodes = 360 / 40 = 9 Data Nodes
Hence, for the Hadoop cluster to ingest 120 TB of data per day with the above configuration, only 9 Data Nodes need to be set up.
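Once such a cluster is running, the aggregate configured capacity and the per-Data-Node breakdown can be verified from the command line; a minimal sketch:
# Report total configured capacity, used and remaining space, and a per-Data-Node breakdown
# (typically run as the HDFS superuser)
hdfs dfsadmin -report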
4. How to change the replication factor in the Hadoop cluster?
Answer:
Edit the hdfs-site.xml file. By default it is located in the conf/ folder (etc/hadoop/ in newer releases) of the Hadoop installation directory. Change/add the following property in hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Block Replication</description>
</property>
It is not mandatory to use a replication factor of 3; it can also be set to 1, and a replication factor of 5 works in a Hadoop cluster as well. Keeping the default value of 3 gives an efficient cluster with the minimum hardware required for fault tolerance.
Increasing the replication factor increases the hardware requirement because the stored data gets multiplied by the replication factor.
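Note that changing dfs.replication only affects files written after the change. For files that already exist, the replication factor can be changed with the hdfs dfs -setrep command; a minimal sketch, assuming a hypothetical path /data/logs:
# Check the configured default replication factor
hdfs getconf -confKey dfs.replication
# Change the replication factor of existing files under a hypothetical path; -w waits for completion
hdfs dfs -setrep -w 3 /data/logs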
5. What is the default data block size in Hadoop, and how to modify it?
Answer:
The block size determines how HDFS cuts/divides data into blocks and saves them across different Data Nodes.
By default, the block size is 128 MB (in Apache Hadoop), and we can modify this default.
Edit the hdfs-site.xml file. By default it is located in the conf/ folder (etc/hadoop/ in newer releases) of the Hadoop installation directory. Change/add the following property in hdfs-site.xml:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>Block size</description>
</property>
The block size here is given in bytes: 134,217,728 bytes = 128 MB. You can also specify the size with case-insensitive suffixes such as k (kilo), m (mega), g (giga), or t (tera), for example 128m, to set the block size in KB, MB, GB, TB, etc.
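The block size of an existing file can be checked, and individual files can be written with a non-default block size without changing the cluster-wide setting; a minimal sketch with hypothetical paths:
# Print the block size (in bytes) of an existing HDFS file
hdfs dfs -stat "%o" /data/logs/part-00000
# Write a single file with a 256 MB (268435456-byte) block size, overriding the default for this file only
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.csv /data/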
6. How long should the Hadoop cluster keep a deleted HDFS file in the delete/trash directory?
Answer:
The parameter fs.trash.interval specifies how long HDFS keeps a deleted file in the trash directory so that it can still be retrieved.
The interval is defined in minutes. For a 2-day retrieval window (2 × 24 × 60 = 2880 minutes), we need to specify the property in the following format.
Edit the core-site.xml file and add/modify the following property:
<property>
<name>fs.trash.interval</name>
<value>2880</value>
</property>
By default, the interval is 0, which means the trash feature is disabled and deleted files are removed immediately; the Hadoop administrator can add/modify the above property as per requirement.
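With trash enabled, a normal delete only moves the file into the user's trash directory; a minimal sketch with a hypothetical path:
# Deleting a file moves it to the current user's .Trash directory (recoverable until fs.trash.interval expires)
hdfs dfs -rm /data/old_report.csv
# Delete immediately, bypassing the trash
hdfs dfs -rm -skipTrash /data/old_report.csv
# Force-empty the current user's trash
hdfs dfs -expunge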
7. What are the basic commands to Start and Stop Hadoop daemons?
Answer:
All the scripts to start and stop the daemons are stored in the sbin/ folder of the Hadoop installation directory.
./sbin/stop-all.sh – To stop all the daemons at once.
./sbin/hadoop-daemon.sh start namenode – To start the Name Node.
./sbin/hadoop-daemon.sh start datanode – To start a Data Node.
./sbin/yarn-daemon.sh start resourcemanager – To start the Resource Manager.
./sbin/yarn-daemon.sh start nodemanager – To start a Node Manager.
./sbin/mr-jobhistory-daemon.sh start historyserver – To start the MapReduce Job History Server.
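On clusters where passwordless SSH between the nodes is configured, the HDFS and YARN daemons can also be started or stopped cluster-wide with the aggregate scripts; a minimal sketch:
# Start all HDFS daemons (Name Node, Data Nodes, Secondary Name Node) across the cluster
./sbin/start-dfs.sh
# Start all YARN daemons (Resource Manager, Node Managers)
./sbin/start-yarn.sh
# The matching stop-dfs.sh and stop-yarn.sh scripts stop them again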
8. What is the property to define memory allocation for tasks managed by YARN?
Answer:
The property yarn.nodemanager.resource.memory-mb needs to be added/modified to change the memory allocation for all the tasks that are managed by YARN.
It specifies the amount of RAM, in MB, that YARN containers may use on a node. A common rule of thumb is to give YARN about 70% of the physical RAM; a Data Node with 96 GB of RAM would then allocate roughly 67 GB (68,608 MB) to YARN, while the rest of the RAM is used by the operating system and the Data Node daemon for non-YARN work.
Edit the yarn-site.xml file and add/modify the following property.
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>68608</value>
</property>
The default value of yarn.nodemanager.resource.memory-mb is 8,192 MB (8 GB). If the Data Nodes have a large RAM capacity, we should raise the value to roughly 70% of the physical RAM; otherwise, we will be wasting memory.
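The new value only takes effect after the Node Managers are restarted; a minimal sketch using the same daemon scripts as above:
# Restart the Node Manager on each Data Node so the new memory setting takes effect
./sbin/yarn-daemon.sh stop nodemanager
./sbin/yarn-daemon.sh start nodemanager
# List the nodes registered with the Resource Manager; 'yarn node -status <node-id>' shows the memory a node reports
yarn node -list -all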
9. What are the recommendations for Sizing the Name Node?
Answer:
The following details are recommended for setting up the Master (Name) Node at the initial stage.
Processors: a single CPU with 6-8 cores is enough.
RAM: for metadata and job handling, the server should have at least 24-96 GB of RAM.
Storage: since no HDFS data blocks are stored on the Master Node, 1-2 TB of local storage is sufficient.
Since it is difficult to predict future workloads, design your cluster by selecting hardware such as CPU, RAM, and storage that is easily upgradeable over time.
10. What are the default ports in the Hadoop cluster?
Answer:
Daemon Name | Default Port No.
Name Node | 50070
Data Node | 50075
Secondary Name Node | 50090
Backup/Checkpoint Node | 50105
Job Tracker | 50030
Task Tracker | 50060
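These are the web UI ports used up to Hadoop 2.x; in Hadoop 3.x, the Name Node UI moved to port 9870, the Data Node UI to 9864, and the Secondary Name Node UI to 9868. A quick way to confirm a daemon is up is to probe its web UI port; a minimal sketch assuming a hypothetical host name namenode-host:
# Probe the Name Node web UI; an HTTP 200 response means the daemon is up and listening
# (replace namenode-host and the port with the values for your cluster)
curl -s -o /dev/null -w "%{http_code}\n" http://namenode-host:50070/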
Recommended Articles
We hope that this EDUCBA information on “Hadoop Cluster Interview Questions” was beneficial to you. You can view EDUCBA’s recommended articles for more information.