Q1 : How can I set up Hadoop nodes (data nodes/name nodes) to use multiple volumes/disks?
A : Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir (dfs.datanode.data.dir in newer releases). Datanodes will attempt to place an equal amount of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit logs. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir (dfs.namenode.name.dir in newer releases). The NameNode directories hold redundant copies of the namespace metadata, so the image and edit log can be restored from the remaining disks/volumes if one of the disks fails.
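As a rough sketch (the mount points below are hypothetical), the corresponding hdfs-site.xml entries could look like this:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
</property>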
Q2 : What is the Best Operating System to Run Hadoop?
A : Linux is the preferred operating system for running Hadoop, with Ubuntu being a popular choice. Windows can also be used to run Hadoop, but it leads to several problems and is not recommended.
Q3 : What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?
A : Schedulers are responsible for assigning tasks to open slots on TaskTrackers. The scheduler is a plug-in within the JobTracker (a configuration sketch follows the list below). The three types of schedulers are:
- FIFO (First in First Out) Scheduler
- Fair Scheduler
- Capacity Scheduler
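For example, in classic MapReduce (MRv1) the scheduler plug-in is selected through the JobTracker configuration in mapred-site.xml; a sketch for enabling the Fair Scheduler (class name as shipped in the standard contrib package) is:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>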
Q4 : How often should the Namenode be Reformatted?
A : The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.
Q5 : Explain the different configuration files and where they are located.
A : The configuration files are located in the “conf” subdirectory. Hadoop has three main configuration files: hdfs-site.xml, core-site.xml, and mapred-site.xml.
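As an illustration, a minimal core-site.xml (the host name and port are placeholders) points clients at the NameNode:
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>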
Q6 : What is the best practice to deploy a Secondary Namenode?
A : It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.
Q7 : What are the network requirements to Run Hadoop?
A :
- SSH is required: it is used to launch server processes on the slave nodes.
- A passwordless SSH connection is required between the master, the secondary machine, and all the slaves.
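A minimal sketch of setting this up from the master node (the user name and host are placeholders):
ssh-keygen -t rsa -P ""
ssh-copy-id hadoop@slave-node
ssh hadoop@slave-node   # should now log in without a password prompt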
Q8 : List Some Use Cases Of The Hadoop Ecosystem?
A : Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
Q9 : How Many Namenodes Can You Run On A Single Hadoop Cluster?
A : Only one.
Q10 : What is Jps Command Used for?
A : The jps command is used to verify whether the daemons that run the Hadoop cluster are working. Its output shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker, and JobTracker.
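For example, on a healthy single-node cluster the output might look like this (process IDs are illustrative):
4852 NameNode
4923 DataNode
4997 SecondaryNameNode
5172 JobTracker
5316 TaskTracker
5481 Jps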
Q11 : How will you restart a Namenode?
A : The easiest way is to stop all the daemons by running the stop-all.sh script, and then restart them, including the NameNode, by running start-all.sh.
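In shell terms, from the Hadoop installation directory:
bin/stop-all.sh
bin/start-all.sh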
Q12 : Apart from using the jps command, is there any other way to check whether the Namenode is working or not?
A : Use the command /etc/init.d/hadoop-0.20-namenode status.
Q13 : How do you decide which scheduler to use?
A : The Capacity Scheduler (CS) can be used in the following situations:
- When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
- When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
- When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
- When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
- When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
- When you have a lot of variability in the utilization between pools, the Fair Scheduler’s pre-emption model achieves much greater overall cluster utilization by giving away otherwise reserved resources when they are not used.
- When you require jobs within a pool to make equal progress rather than running in FIFO order.
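As an illustration of Fair Scheduler pools (element names as used by the classic MRv1 allocation file; the pool name and numbers are hypothetical), an allocations file might contain:
<allocations>
  <pool name="analytics">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>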
Q14 : What are the two main modules which help you interact with HDFS and what are they used for?
A : HDFS commands are issued through the hadoop wrapper script:
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
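For example (paths are placeholders):
bin/hadoop dfs -ls /user/hadoop     # file-level operation via the dfs (FsShell) module
bin/hadoop dfsadmin -report         # whole-filesystem report via the dfsadmin module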
Q15 : What are the important Hardware considerations when deploying Hadoop in Production Environment?
A : Memory – the system’s memory requirements will vary between the worker services and the management services, depending on the application.
Operating system – a 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
Storage – it is preferable to design the Hadoop platform so that compute moves to the data, which achieves scalability and high performance.
Capacity – Large Form Factor (3.5”) disks cost less and store more than Small Form Factor disks.
Network – two TOR (top-of-rack) switches per rack provide better redundancy.
Computational capacity – this can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.
Q16 : Explain the different schedulers available in Hadoop.
A : FIFO Scheduler – this scheduler does not consider the heterogeneity of the system; it orders the jobs in a queue based on their arrival times.
COSHH – this scheduler considers the workload, the cluster, and user heterogeneity when making scheduling decisions.
Fair Sharing – this Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource, and each user can use their own pool to execute jobs.
Q17 : What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
A : This file provides the environment for Hadoop to run and contains variables such as HADOOP_CLASSPATH, JAVA_HOME, and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
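A typical entry (the JDK path is a placeholder for the local installation) is:
export JAVA_HOME=/usr/lib/jvm/java-6-sun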
Q18 : What is the default block size in HDFS and what are the benefits of having such large blocks?
A : Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and it is often configured even larger. This allows HDFS to decrease the amount of metadata stored per file and enables fast streaming reads by keeping large amounts of data sequentially organized on disk. HDFS is therefore designed for very large files that are read sequentially. Unlike file systems such as NTFS or ext, which typically hold numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes or gigabytes each.
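The block size can be tuned per cluster in hdfs-site.xml; for example, 128 MB expressed in bytes (older releases use dfs.block.size, newer ones dfs.blocksize):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>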
Q19 : How Can You Kill A Hadoop Job?
A : hadoop job -kill <job_id>
Q20 : What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
A : The file system checking utility fsck is used to check and display the health of the file system and the files and blocks in it. When used with a path (bin/hadoop fsck <path> -files -blocks -locations -racks) it recursively reports the health of all files under that path, and when used with ‘/’ it checks the entire file system. By default, fsck ignores files still open for writing by a client; to list such files, run fsck with the -openforwrite option.
fsck checks the file system, prints a dot for each healthy file it finds, and prints a message for each file that is less than healthy, including those with over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks, and missing replicas.
Q21 : Is it possible to copy Files across Multiple Clusters? If yes, How can you accomplish this?
A : Yes, it is possible to copy files across multiple Hadoop clusters, and this can be achieved using distributed copy. The DistCp command is used for intra- or inter-cluster copying.
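For example (cluster host names and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/destination/path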
Q22 : I want to see all the jobs running in a Hadoop Cluster. How can you do this?
A : The command hadoop job -list gives the list of jobs running in a Hadoop cluster.
Q23 : The mapred.output.compress property is set to true to make sure that all output files are compressed for efficient space usage on the Hadoop cluster. If a cluster user does not require compressed data for a particular job, what would you suggest they do?
A : If the user does not want to compress the data for a particular job, they should create their own configuration file, set the mapred.output.compress property to false in it, and load that file as a resource into the job.
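Alternatively, if the job driver uses ToolRunner/GenericOptionsParser, the property can be overridden on the command line (the jar, class, and paths are placeholders):
hadoop jar my-job.jar com.example.MyJob -D mapred.output.compress=false input output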
Q24 : If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?
A : The task will be started again on a new TaskTracker, and if it fails more than 4 times (the default limit, which can be changed), the job will be killed.
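The retry limit itself is configurable, e.g. in mapred-site.xml (the MRv1 property for map tasks; a corresponding mapred.reduce.max.attempts exists for reduces):
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>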
Q25 : Which command is used to verify whether HDFS is corrupt?
A : The hadoop fsck (file system check) command is used to check for missing and corrupt blocks.