Big Data Hadoop Interview Questions and Answers

Q1 : What is the use of RecordReader in Hadoop?
A : Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row 1: Welcome to
Row 2: Intellipaat
it will be read as “Welcome to Intellipaat” using RecordReader.
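
As a rough illustration of where the RecordReader fits, here is a minimal sketch using the new (mapreduce) API. SimpleTextInputFormat is a hypothetical class name; it simply reuses Hadoop's built-in LineRecordReader, which is what reassembles a line that was cut across two splits into one record.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // The RecordReader turns the raw bytes of a split into (byte offset, line)
        // records; LineRecordReader reads past the split boundary if needed, so
        // "Welcome to" + "Intellipaat" reaches the mapper as one record.
        return new LineRecordReader();
    }
}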

Q2 : What Are The Basic Features Of Hadoop?
A : Written in Java, the Hadoop framework is capable of solving problems involving Big Data analysis. Its programming model is based on Google MapReduce, and its infrastructure is based on Google’s Big Data and distributed file systems. Hadoop is scalable, and more nodes can be added to it.

Q3 : What companies use Hadoop, any idea?
A : Yahoo! (the biggest contributor to the creation of Hadoop) – the Yahoo search engine uses Hadoop; Facebook – developed Hive for analysis; Amazon, Netflix, Adobe, eBay, Spotify, and Twitter.

Q4 : What is the usage of Hadoop?
A : With Hadoop, the user can run applications on systems that have thousands of nodes spread over countless terabytes. Rapid data processing and transfer among nodes helps ensure continuous operation even when a node fails, preventing system failure.

Q5 : What is the difference between Map Side join and Reduce Side Join?
A : A map-side join is performed as the data reaches the map; you need a strict structure for defining a map-side join. On the other hand, a reduce-side join (repartitioned join) is simpler than a map-side join since the input datasets need not be structured. However, it is less efficient, as it has to go through the sort and shuffle phases, which come with network overheads.
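
Below is a minimal sketch of one common map-side join pattern (sometimes called a broadcast or replicated join), in which the smaller dataset is shipped to every mapper via the distributed cache and joined in memory. The file name departments.txt and the field layouts are hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> deptById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "departments.txt" is assumed to have been added to the distributed cache
        // (see Q12), so it is available in the task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("departments.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");      // deptId,deptName
                deptById.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(","); // empId,empName,deptId
        String deptName = deptById.get(parts[2]);
        if (deptName != null) {
            // Emit the joined record; no shuffle or reduce phase is required.
            context.write(new Text(parts[0]), new Text(parts[1] + "," + deptName));
        }
    }
}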

Q6 : What Is Hadoop And How Does It Work?
A : When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework that offers various services and tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which cannot be done efficiently and effectively using traditional systems.

Q7 : How can you debug Hadoop code?
A : First, check the list of MapReduce jobs currently running. Next, make sure that there are no orphaned jobs running; if there are, you need to determine the location of the ResourceManager (RM) logs.

  1. Run: “ps -ef | grep -i ResourceManager”
    and look for the log directory in the displayed result. Find out the job-id from the displayed list and check whether there is an error message associated with that job.
  2. On the basis of RM logs, identify the worker node that was involved in the execution of the task.
  3. Now, log in to that node and run “ps -ef | grep -i NodeManager”.
  4. Examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job.

Q8 : What Is A Block?
A : The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB (128 MB in Hadoop 2.x).
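
The block size is a configurable property, not a fixed constant. The sketch below assumes the Hadoop 2.x property name dfs.blocksize and a placeholder NameNode URI; it shows both the client-side default and a per-file override.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created through this client: 128 MB.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // "hdfs://namenode:8020" is a placeholder for the actual NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Per-file override: create(path, overwrite, bufferSize, replication, blockSize).
        FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo.txt"),
                true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("written with a 64 MB block size");
        out.close();
    }
}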

Q9 :  What is Speculative Execution in Hadoop?
A : One limitation of Hadoop is that, by distributing the tasks over several nodes, there is a chance that a few slow nodes limit the rest of the program. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is called Speculative Execution.
It creates a duplicate copy of the task on another node. The same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When one of these tasks finishes, it is reported to the JobTracker. If other copies are still executing speculatively, Hadoop notifies the TaskTrackers to kill those tasks and discard their output.

Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
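
A minimal sketch of turning speculative execution off, using the old-API property names quoted above; newer releases expose the same switches as mapreduce.map.speculative and mapreduce.reduce.speculative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DisableSpeculation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Prevent the framework from launching backup copies of slow tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        Job job = Job.getInstance(conf, "no-speculation-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}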

Q10 : What Is Block Scanner In HDFS?
A : Block Scanner tracks the list of blocks present on a DataNode and verifies them to detect any kind of checksum errors. Block Scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.

Q11 : What is Job Tracker role in Hadoop?
A : Job Tracker’s primary function is resource management (managing the task trackers), tracking resource availability and task lifecycle management (tracking the task progress and fault tolerance).

  • It is a process that runs on a separate node, often not on a DataNode.
  • Job Tracker communicates with the NameNode to identify data location.
  • Finds the best Task Tracker Nodes to execute tasks on given nodes.
  • Monitors individual Task Trackers and submits the overall job back to the client.
  • It tracks the execution of MapReduce workloads local to the slave node.

Q12 : What Do You Understand By Distributed Cache In MapReduce Framework?
A : Distributed Cache is an important feature of the MapReduce framework. When you wish to share files (such as jars, archives, or small lookup data) across all the nodes in a given Hadoop cluster, Distributed Cache is used for that.
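
A minimal sketch of registering a cache file with the new (mapreduce) API; the HDFS path and symlink name are hypothetical. Inside a task, the file can then be opened by its symlink name, as in the map-side join sketch under Q5.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-demo");

        // Ship /apps/lookup/departments.txt to every task node; the "#departments.txt"
        // fragment creates a symlink with that name in each task's working directory.
        job.addCacheFile(new URI("/apps/lookup/departments.txt#departments.txt"));
    }
}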

Q13 : What are the core methods of a Reducer?
A : The three core methods of a Reducer are:

  1. setup(): this method is used for configuring various parameters like the input data size and the distributed cache.
    public void setup (context)
  2. reduce(): the heart of the reducer; it is called once per key with the associated list of values.
    public void reduce (key, values, context)
  3. cleanup(): this method is called only once, at the end of the task, to clean up temporary files.
    public void cleanup (context)
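
Putting the three methods together, here is a minimal sketch of a complete Reducer that sums counts per key; the class and variable names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once per task before any reduce() call: read configuration,
        // open side files from the distributed cache, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of its values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last reduce() call: release resources,
        // delete temporary files, etc.
    }
}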

Q14 :  What Is Heartbeat In HDFS?
A : A heartbeat is the signal sent between a DataNode and the NameNode, and also between a TaskTracker and the JobTracker. If the NameNode or the JobTracker does not receive the heartbeat signal, it is assumed that there is some issue with the DataNode or the TaskTracker.

Q15 : What is SequenceFile in Hadoop?
A : Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

  1. Uncompressed key/value records.
  2. Record compressed key/value records – only ‘values’ are compressed here.
  3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
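
A minimal write/read sketch, assuming the Hadoop 2.x option-style SequenceFile API and an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq");

        // Write binary key/value pairs.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("welcome"), new IntWritable(1));
            writer.append(new Text("intellipaat"), new IntWritable(2));
        }

        // Read the pairs back.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        }
    }
}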

Q16 : How Does The NameNode Handle DataNode Failures?
A : The NameNode periodically receives a heartbeat signal from each DataNode in the cluster, which implies that the DataNode is functioning properly.
A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send the heartbeat message, after a specific period it is marked dead.
The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.

Q17 : What happens if you try to run a Hadoop job with an output directory that is already present?
A : It will throw an exception saying that the output file directory already exists.
To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.
To delete the directory before running the job, you can use the shell:

hadoop fs -rmr /path/to/your/output/

Or via the Java API:

FileSystem.get(conf).delete(outputDir, true);

Q18 : Explain The Difference Between HDFS And NAS.
A : In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware.

Q19 : How to compress mapper output but not the reducer output?
A : To achieve this compression, you should set:

conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)

Q20 : What Is Checkpoint Node?
A : The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.

Q21 : How can you transfer data from Hive to HDFS?
A : By writing the query:

hive> insert overwrite directory '/user/hive/output' select * from emp;

You can modify the query to select the data you want to export from Hive to HDFS. The output will be stored as part files in the specified HDFS directory.

Q22 : What Is The Best Hardware Configuration To Run Hadoop?
A : The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end, and ECC memory is recommended because many Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

Q23 : What Happens In Text Input Format?
A : In the text input format, each line in the text file is a record. The value is the content of the line being processed, whereas the key is the byte offset of that line within the file.
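
A minimal sketch of the key/value types a mapper receives with the text input format; the mapper here simply echoes each record to show what the framework delivers.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TextFormatMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable byteOffset, Text line, Context context)
            throws IOException, InterruptedException {
        // byteOffset is the position of the line in the file; line is its content.
        context.write(byteOffset, line);
    }
}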

Q24 : Differentiate Between An Input Split And HDFS Block?
A : The logical division of data in the Hadoop framework is known as an input split, whereas the physical division of data is known as an HDFS block.

Q25 : What Is The Function Of The MapReduce Partitioner?
A : The function of the MapReduce Partitioner is to ensure that all the values of a single key go to the same reducer, which eventually helps in an even distribution of the map output across the reducers.
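
A minimal sketch of a custom Partitioner that reproduces the default hash-based behaviour, making the contract explicit: every value of a given key is routed to the same reduce task. It would be registered with job.setPartitionerClass(KeyHashPartitioner.class); the class name is illustrative, and Hadoop’s built-in HashPartitioner already behaves this way.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key and map it onto [0, numReduceTasks); identical keys always
        // produce the same hash, so they always land on the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}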