Q1 : What is Hadoop and what are its components?
A : When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us with various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.
- Storage unit – HDFS (NameNode, DataNode)
- Processing framework – YARN (ResourceManager, NodeManager)
Q2 : What happens when two clients try to access the same file in the HDFS?
A : HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
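For illustration, here is a minimal sketch of a writer using the HDFS Java FileSystem API (the file path is hypothetical); while the lease below is held, a concurrent create() on the same path from a second client is rejected by the NameNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriterClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode grants this client a lease on the file when create() succeeds.
        Path file = new Path("/tmp/exclusive-write-demo.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("written by the first client");
            // While this stream is open, another client calling fs.create(file)
            // from a second process is rejected by the NameNode with an IOException
            // (the exact exception class varies by Hadoop version).
        }
    }
}
```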
Q3 : What are side data distribution techniques in Hadoop?
A : The extra read-only data required by a Hadoop job to process the main dataset is referred to as side data. Hadoop has two side data distribution techniques –
i) Using the job configuration – This technique should not be used for transferring more than a few kilobytes of data, as it puts pressure on the memory usage of the Hadoop daemons, particularly if the system is running several Hadoop jobs.
ii) Distributed Cache – Rather than serializing side data using the job configuration, it is suggested to distribute data using Hadoop’s distributed cache mechanism.
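A minimal sketch of both techniques, assuming the Hadoop 2.x Java MapReduce API (the property name and cache file path are hypothetical):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SideDataJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // i) Job configuration: suitable only for a few kilobytes of side data,
        //    because every daemon deserializes the full configuration.
        conf.set("my.app.stopwords", "a,an,the"); // hypothetical property

        Job job = Job.getInstance(conf, "side-data-demo");

        // ii) Distributed cache: the file is copied to each task node once,
        //     and tasks read it locally (e.g. in the Mapper's setup() method).
        job.addCacheFile(new URI("/apps/lookup/stopwords.txt")); // hypothetical HDFS path

        // ... set mapper/reducer/input/output as usual, then submit the job.
    }
}
```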
Q4 : How does NameNode tackle DataNode failures?
A : The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.
A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send heartbeat messages for a specific period of time, it is marked dead.
The NameNode replicates the blocks of a dead node to another DataNode using the replicas created earlier.
Q5 : How is HDFS fault tolerant?
A : When data is stored in HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3. You can change the replication factor as per your need. If a DataNode goes down, the NameNode will automatically copy the data to another node from the replicas and make the data available. This provides fault tolerance in HDFS.
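For illustration, the replication factor can be changed through the dfs.replication property or per file via the FileSystem API; a minimal Java sketch (the file path and the factor of 5 are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created through this configuration.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file (hypothetical path).
        fs.setReplication(new Path("/data/important.log"), (short) 5);
    }
}
```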
Q6 : What are active and passive “NameNodes”?
A : In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.
- The Active “NameNode” is the “NameNode” that works and runs in the cluster.
- The Passive “NameNode” is a standby “NameNode”, which has data similar to the active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces it in the cluster. Hence, the cluster is never without a “NameNode”, and so it never fails.
Q7 : What does ‘jps’ command do?
A : The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.
Q8 : What are the different operational commands in HBase at record level and table level?
A : Record Level Operational Commands in HBase are – put, get, increment, scan, and delete.
Table Level Operational Commands in HBase are – describe, list, drop, disable, and scan.
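For illustration, the record-level operations correspond directly to the HBase Java client API; a minimal sketch (the table name, column family, and qualifiers are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRecordOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // put: insert/update a cell
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // get: read a single row
            Result result = table.get(new Get(Bytes.toBytes("row1")));

            // increment: atomically add to a counter cell
            table.incrementColumnValue(Bytes.toBytes("row1"),
                    Bytes.toBytes("info"), Bytes.toBytes("visits"), 1L);

            // scan: iterate over a range of rows
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }

            // delete: mark the row's cells with tombstones (removed later during compaction)
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```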
Q9 : What is “speculative execution” in Hadoop?
A : If a node appears to be executing a task more slowly than expected, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted and the other one is killed. This process is called “speculative execution”.
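Speculative execution is enabled by default; assuming the standard Hadoop 2.x property names, it can be switched off per job, for example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Disable speculative execution for map and reduce tasks
        // (useful when tasks have side effects that must not run twice).
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "no-speculation-demo");
        // ... configure and submit the job as usual.
    }
}
```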
Q10 : Name the three modes in which Hadoop can run.
A : The three modes in which Hadoop can run are as follows:
- Standalone (local) mode: This is the default mode if we don’t configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This uses the local filesystem.
- Pseudo-distributed mode: A single-node Hadoop deployment is considered as running the Hadoop system in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, are executed on a single compute node.
- Fully distributed mode: A Hadoop deployment in which the master and slave services run on separate nodes is said to be running in fully distributed mode.
Q11 : What is the purpose of “RecordReader” in Hadoop?
A : The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “InputFormat”.
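As a concrete example, the default “TextInputFormat” provides a “LineRecordReader”, so a “Mapper” over text input receives (byte offset, line) pairs; a minimal sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the RecordReader (LineRecordReader) turns each line of the
// InputSplit into a (LongWritable byte offset, Text line) pair for the Mapper.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Simply emit the line together with the offset the RecordReader supplied.
        context.write(line, offset);
    }
}
```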
Q12 : What is a “Combiner”?
A : A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the amount of data that needs to be sent to the “reducers”.
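A minimal sketch based on the classic word-count example, where the same summing Reducer class is reused as the Combiner:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Sums the counts for a word; used both as the Combiner (local reduce on the
    // mapper node) and as the final Reducer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the "mini reducer" run on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```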
Q13 : Why do we need Hadoop?
A : Hadoop came into existence to deal with Big Data challenges. The challenges with Big Data are-
- Storage – Since the data is very large, storing such a huge amount of data is difficult.
- Security – Since the data is huge in size, keeping it secure is another challenge.
- Analytics – In Big Data, most of the time we are unaware of the kind of data we are dealing with. So analyzing that data is even more difficult.
- Data Quality – In the case of Big Data, data is very messy, inconsistent and incomplete.
- Discovery – Using powerful algorithms to find patterns and insights is very difficult.
Hadoop is an open-source software framework that supports the storage and processing of large data sets. Apache Hadoop is the best solution for storing and processing Big data because:
- Apache Hadoop stores huge files as they are (raw) without specifying any schema.
- High scalability – We can add any number of nodes, hence enhancing performance dramatically.
- Reliable – It stores data reliably on the cluster despite machine failure.
- High availability – In Hadoop data is highly available despite hardware failure. If a machine or hardware crashes, then we can access data from another path.
- Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive.
Q14 : What is Row Key?
A : Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells with the same RowKey are co-located on the same server. The RowKey is internally regarded as a byte array.
Q15 : Explain the process of row deletion in HBase.
A : On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
Q16 : Explain the differences between Hadoop 1.x and Hadoop 2.x.
A :
- In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.
- Hadoop 2.x scales better than Hadoop 1.x, supporting close to 10,000 nodes per cluster.
- Hadoop 1.x has a single point of failure (SPOF) problem: whenever the NameNode fails, it has to be recovered manually. In Hadoop 2.x, the Standby NameNode overcomes the SPOF problem, and whenever the active NameNode fails, failover can be configured to happen automatically.
- Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.
Q17 : What are the core components of Hadoop?
A : Hadoop is an open-source software framework for distributed storage and processing of large datasets. Apache Hadoop core components are HDFS, MapReduce, and YARN.
- HDFS – Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure. It provides high-throughput access to applications by allowing data to be accessed in parallel.
- MapReduce – MapReduce is the data processing layer of Hadoop. Applications written with it process large structured and unstructured data stored in HDFS. MapReduce processes a huge amount of data in parallel by dividing the job (the submitted job) into a set of independent tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. Map is the first phase of processing, where we specify the complex processing logic. Reduce is the second phase of processing, where we specify lightweight processing such as aggregation/summation.
- YARN – YARN is the processing framework in Hadoop. It provides resource management and allows multiple data processing engines, for example real-time streaming, data science, and batch processing.
Q18 : What is Apache Hadoop?
A : Hadoop emerged as a solution to Big Data problems. It is part of the Apache project sponsored by the Apache Software Foundation (ASF). It is an open-source software framework for distributed storage and distributed processing of large data sets. Open source means it is freely available, and we can even change its source code as per our requirements. Apache Hadoop makes it possible to run applications on a system with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure. Apache Hadoop provides:
- Storage layer – HDFS
- Batch processing engine – MapReduce
- Resource Management Layer – YARN
Q19 : What are the modes in which Hadoop runs?
A : Apache Hadoop runs in three modes:
- Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the configuration files.
- Pseudo-Distributed Mode – Just like Standalone mode, Hadoop runs on a single node in pseudo-distributed mode. The difference is that each daemon runs in a separate Java process in this mode. In pseudo-distributed mode, we need to configure all four configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml). In this case, all daemons run on one node and thus both the Master and Slave node are the same.
- Fully-Distributed Mode – In this mode, all daemons execute on separate nodes forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave.
Q20 : What are the features of Pseudo mode?
A : Just like Standalone mode, Hadoop can also run on a single node in this mode. The difference is that each Hadoop daemon runs in a separate Java process. In pseudo-distributed mode, we need to configure all four configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml). In this case, all daemons run on one node and thus both the Master and Slave node are the same.
Pseudo mode is suitable for both development and testing environments. In pseudo mode, all the daemons run on the same machine.
Q21 : What are the features of Standalone (local) mode?
A : By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. One can also use it for debugging purposes. It does not support the use of HDFS. Standalone mode is suitable only for running programs during development and testing. Further, in this mode, no custom configuration is required for the configuration files. The configuration files are:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
Q22 : What is Safemode in Hadoop?
A : Safemode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode doesn't allow any modifications to the file system. In Safemode, the HDFS cluster is read-only and doesn't replicate or delete blocks. At the startup of the NameNode:
- It loads the file system namespace from the last saved FsImage and the edits log file into its main memory.
- It merges the edits log file with the FsImage, resulting in a new file system namespace.
- It then receives block reports containing information about block locations from all the DataNodes.
In Safemode, the NameNode collects these block reports from the DataNodes. The NameNode enters Safemode automatically during startup and leaves Safemode after the DataNodes have reported that most blocks are available. Use the commands:
- hadoop dfsadmin -safemode get : to know the status of Safemode
- hadoop dfsadmin -safemode enter : to enter Safemode
- hadoop dfsadmin -safemode leave : to come out of Safemode
The NameNode web UI front page shows whether Safemode is on or off.
Q23 : How is security achieved in Hadoop?
A : Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:
- Authentication – The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).
- Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
- Service Request – The client uses the service ticket to authenticate itself to the server.
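On the client side, a Kerberos-secured Hadoop client typically obtains its TGT from a keytab before issuing requests; a minimal sketch (the principal and keytab path are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");

        UserGroupInformation.setConfiguration(conf);
        // Obtain Kerberos credentials (the TGT) from a keytab; principal and path are hypothetical.
        UserGroupInformation.loginUserFromKeytab("etl@EXAMPLE.COM",
                "/etc/security/keytabs/etl.keytab");

        // Subsequent RPCs carry service tickets obtained with the TGT.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}
```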
Q24 : What is throughput in Hadoop?
A : Throughput is the amount of work done in unit time. HDFS provides good throughput for the following reasons:
- HDFS follows a write-once, read-many model. This simplifies data coherency issues, as data written once cannot be modified, and thus provides high-throughput data access.
- Hadoop works on the data locality principle: computation is moved to the data instead of moving data to the computation. This reduces network congestion and therefore enhances the overall system throughput.