Updated March 1, 2023
Introduction To Hadoop Admin Interview Questions And Answers
So you have finally found your dream job in Hadoop Admin but are wondering how to crack the 2023 Hadoop Admin Interview and what the probable Hadoop Admin Interview Questions could be. Every interview is different, and the scope of every job is different too. Keeping this in mind, we have designed the most common Hadoop Admin Interview Questions and Answers to help you succeed in your interview.
Following are the Hadoop Admin Interview Questions that will help you in cracking an interview with Hadoop.
Hadoop Admin Interview Questions & Answers
Below are some useful Hadoop Admin Interview Questions and Answers
1. What is Rack awareness? And why is it necessary?
Answer:
Rack awareness is about distributing data nodes across multiple racks. HDFS follows the rack awareness algorithm to place data blocks. A rack holds multiple servers, and a cluster can span multiple racks. Say a Hadoop cluster is set up with 12 nodes: there could be 3 racks with 4 servers each, all connected so that the 12 nodes form one cluster. While deciding on the rack count, the important point to consider is the replication factor. Suppose 100 GB of data flows in every day with a replication factor of 3; then 300 GB of data will have to reside on the cluster. It is better to have the data replicated across racks, so that even if a node (or an entire rack) goes down, a replica is still available on another rack.
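Rack awareness is typically enabled by pointing the net.topology.script.file.name property in core-site.xml at a small script that maps each node to a rack. The sketch below is a minimal example, assuming a simple subnet-to-rack mapping; the IP ranges and rack names are placeholders.

```
#!/bin/bash
# Hypothetical topology script referenced by net.topology.script.file.name
# in core-site.xml. Hadoop invokes it with one or more node IPs/hostnames
# and expects one rack path (e.g. /rack1) per argument on stdout.
for node in "$@"; do
  case "$node" in
    192.168.1.*) echo "/rack1" ;;   # illustrative subnet-to-rack mapping
    192.168.2.*) echo "/rack2" ;;
    192.168.3.*) echo "/rack3" ;;
    *)           echo "/default-rack" ;;
  esac
done
```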
2. What is the default block size, and how is it defined?
Answer:
128 MB, and it is defined by the dfs.blocksize property in hdfs-site.xml; it is also customizable depending on the volume of the data and the level of access. Say 100 GB of data flows in per day and gets segregated and stored across the cluster. How many blocks will that be? 800 blocks (1024 * 100 / 128, where 1024 converts GB to MB). There are two ways to customize the data block size:
- Per command: hadoop fs -D dfs.blocksize=134217728 (the value is in bytes; 134217728 bytes = 128 MB)
- Cluster-wide: in hdfs-site.xml, set the dfs.blocksize property to the desired size in bytes.
If you change the default size to 512 MB because the data volume is huge, the number of blocks generated will be 200 (1024 * 100 / 512).
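For example, both approaches might look like the sketch below; the file paths are placeholders, and the values are plain byte counts (536870912 bytes = 512 MB).

```
# Per-command override: write one file with a 512 MB block size.
hdfs dfs -D dfs.blocksize=536870912 -put /tmp/daily_feed.csv /ingest/

# Cluster-wide default: the equivalent hdfs-site.xml entry, which applies
# to files written after the change.
#   <property>
#     <name>dfs.blocksize</name>
#     <value>536870912</value>
#   </property>
```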
3. How do you get a report of the HDFS file system, covering disk availability and the number of active nodes?
Answer:
Command: sudo -u hdfs hdfs dfsadmin -report
It displays the following information:
- Configured Capacity – Total capacity available in hdfs
- Present Capacity – The total space actually available to HDFS after space used or reserved for non-DFS purposes (such as local files, metadata, and fsimage storage) is excluded
- DFS Remaining – It is the amount of storage space still available to the HDFS to store more files
- DFS Used – It is the storage space that HDFS has used up.
- DFS Used% – In percentage
- Under replicated blocks – The number of blocks that currently have fewer replicas than the configured replication factor
- Blocks with corrupt replicas – The number of blocks that have at least one corrupt replica
- Missing blocks
- Missing blocks (with replication factor 1)
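As a quick check, the report can be filtered for the fields above; the grep patterns below simply match the labels listed here (the exact labels can vary slightly between Hadoop versions).

```
# Overall capacity summary from the dfsadmin report.
sudo -u hdfs hdfs dfsadmin -report | grep -E 'Configured Capacity|Present Capacity|DFS Remaining|DFS Used'

# Count of active nodes (reported under the "Live datanodes" section).
sudo -u hdfs hdfs dfsadmin -report | grep -i 'live datanodes'
```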
4. What is Hadoop balancer, and why is it necessary?
Answer:
The data spread across the nodes is not always distributed in the right proportion, meaning each node’s utilization might not be balanced. One node might be over-utilized while another is under-utilized. This has a high cost impact when running processes, which end up relying heavily on the over-used nodes. To solve this, the Hadoop balancer is used to even out the utilization of data across the nodes. Whenever the balancer is executed, data gets moved around: under-utilized nodes get filled up, and over-utilized nodes are freed up.
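A typical invocation looks like the sketch below; the 10 percent threshold and the bandwidth cap are illustrative starting values, not prescriptions.

```
# Rebalance until every DataNode's utilization is within 10% of the
# cluster average (the threshold is a percentage).
sudo -u hdfs hdfs balancer -threshold 10

# Optionally cap the bandwidth the balancer may use so it does not
# compete with production jobs (value in bytes per second; ~100 MB/s here).
hdfs dfsadmin -setBalancerBandwidth 104857600
```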
5. Difference between Cloudera and Ambari?
Answer:
| Cloudera Manager | Ambari |
| --- | --- |
| Administration tool for the Cloudera distribution | Administration tool for the Hortonworks distribution |
| Monitors and manages the entire cluster and reports usage and any issues | Monitors and manages the entire cluster and reports usage and any issues |
| Comes with the paid Cloudera service | Open source |
6. What are the main actions performed by the Hadoop admin?
Answer:
- Monitor the health of the cluster – many application pages have to be monitored while processes run (Job History Server, YARN Resource Manager, and Cloudera Manager/Ambari depending on the distribution).
- Turn on security – SSL or Kerberos.
- Tune performance – Hadoop balancer.
- Add new data nodes as needed – infrastructure changes and configuration.
- Optionally turn on the MapReduce Job History Server – sometimes restarting the services helps release cache memory; this is advisable when the cluster has no running processes.
7. What is Kerberos?
Answer:
Kerberos is the authentication each service uses to establish its identity before running a process, and it is recommended to enable it. Since we are dealing with distributed computing, it is always good practice to have encryption while accessing and processing data, because the nodes are connected and all information passes across the network. When Hadoop uses Kerberos, passwords are not sent across the network; instead, passwords are used to compute encryption keys, and encrypted messages are exchanged between the client and the server. In simple terms, Kerberos lets the nodes prove their identities to each other in a secure manner using encryption.
Configuration in core-site.xml:
hadoop.security.authentication: kerberos
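A minimal sketch of what enabling Kerberos looks like from the admin side, assuming the standard core-site.xml property names; the principal and realm below are placeholders.

```
# core-site.xml entries (shown as comments for reference):
#   <property>
#     <name>hadoop.security.authentication</name>
#     <value>kerberos</value>
#   </property>
#   <property>
#     <name>hadoop.security.authorization</name>
#     <value>true</value>
#   </property>

# Once Kerberos is on, a user needs a valid ticket before using the cluster.
kinit hdfs-user@EXAMPLE.COM   # placeholder principal and realm
klist                         # confirm a ticket-granting ticket was issued
hdfs dfs -ls /                # this call now authenticates via Kerberos
```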
8. What are the important HDFS commands?
Answer:
| Command | Purpose |
| --- | --- |
| hdfs dfs -ls <hdfs path> | List the files in the HDFS file system. |
| hdfs dfs -put <local file> <hdfs folder> | Copy a file from the local system to the HDFS file system. |
| hdfs dfs -chmod 777 <hdfs file> | Give read, write, and execute permissions on the file. |
| hdfs dfs -get <hdfs folder/file> <local filesystem> | Copy a file from the HDFS file system to the local file system. |
| hdfs dfs -cat <hdfs file> | View the file content from the HDFS file system. |
| hdfs dfs -rm <hdfs file> | Remove the file from the HDFS file system; it is moved to the trash path (like the Recycle Bin in Windows). |
| hdfs dfs -rm -skipTrash <hdfs file> | Remove the file permanently from the cluster. |
| hdfs dfs -touchz <hdfs file> | Create an empty file in the HDFS file system. |
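To make the placeholders concrete, a quick round trip through these commands might look like this; the file and directory names are purely illustrative.

```
hdfs dfs -put sales_2023.csv /data/raw/           # local -> HDFS
hdfs dfs -ls /data/raw/                           # confirm the copy
hdfs dfs -chmod 777 /data/raw/sales_2023.csv      # open up permissions
hdfs dfs -cat /data/raw/sales_2023.csv | head     # peek at the content
hdfs dfs -rm -skipTrash /data/raw/sales_2023.csv  # delete permanently
```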
9. How do you check the logs of a Hadoop job submitted in the cluster, and how do you terminate an already running process?
Answer:
yarn logs -applicationId <application_id> — The application master generates logs in its container, and they are tagged with the application id it generates. This is helpful for monitoring the running status of a process and its log information.
yarn application -kill <application_id> — If a process already running in the cluster needs to be terminated, the kill command is used, where the application id identifies the job to terminate.
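In practice, the application id usually comes from listing the running applications first; the id below is a made-up example.

```
# Find the application id of the job to inspect or stop.
yarn application -list -appStates RUNNING

# Fetch its aggregated logs and, if needed, terminate it.
yarn logs -applicationId application_1677654321000_0042 | less
yarn application -kill application_1677654321000_0042
```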
Recommended Articles
This has been a guide to the list of Hadoop Admin Interview Questions and Answers. Here we have covered the 9 most useful interview questions so that the job seeker can crack the interview with ease. You may also look at the following articles to learn more.