Apache Flume Interview Questions and Answers

Q1 : What is Flume?
A : Flume is a distributed service for collecting, aggregating, and moving large amounts of log data.

Q2 : Why do we use Flume?
A : Hadoop developers most often use this tool to pull log data from sources such as social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster.

Q3 : What is FlumeNG?
A : FlumeNG is a real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original Flume.

Q4 : What is a Flume event?
A : A Flume event is a unit of data consisting of a byte payload and an optional set of string attributes. An external source such as a web server sends events to a Flume source, and Flume has built-in functionality to understand the source format; for example, an Avro client sends events to a Flume Avro source.
Each log record is treated as an event. Each event has a header section and a body: the headers carry key-value metadata about the event, and the body carries the actual data.
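As a rough illustration (the header names and values here are hypothetical), a single event might look like:

    Event
      headers: { timestamp=1428769920000, host=web01 }
      body:    10.0.0.5 - - "GET /index.html HTTP/1.1" 200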

Q5 : What is Apache Flume?
A : Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Review the Flume use case describing how Mozilla collects and analyses logs using Flume and Hive.

Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.

Q6 : Does Apache Flume support third-party plug-ins?
A : Yes. Apache Flume has a plugin-based design: it can load data from external sources and transfer it to external destinations, which is why most data analysts use it.

Q7 : What is the most reliable channel in Flume to ensure there is no data loss?
A : Among the three channels – JDBC, FILE, and MEMORY – the FILE channel is the most reliable, because it persists events to disk.
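A minimal sketch of a file channel definition (the agent and channel names and the directory paths are placeholders you would adapt):

    agent1.channels = c1
    agent1.channels.c1.type = file
    # events survive an agent restart because they are checkpointed to disk
    agent1.channels.c1.checkpointDir = /var/flume/checkpoint
    agent1.channels.c1.dataDirs = /var/flume/data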

Q8 : What are the complicated steps in Flume configuration?
A : Flume processes streaming data, so once started there is no stop or end to the process; it asynchronously flows data from the source to HDFS via the agent. Above all, the agent must know how its individual components are connected in order to load data, so the configuration is the trigger that starts the streaming flow. For example, consumerKey, consumerSecret, accessToken, and accessTokenSecret are the key settings needed to download data from Twitter, as the sketch below shows.
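A sketch of a Twitter source configured with those four credentials (the agent and component names are illustrative, and the placeholder values must be replaced with real OAuth keys):

    agent1.sources = twitter1
    agent1.sources.twitter1.type = org.apache.flume.source.twitter.TwitterSource
    agent1.sources.twitter1.consumerKey = YOUR_CONSUMER_KEY
    agent1.sources.twitter1.consumerSecret = YOUR_CONSUMER_SECRET
    agent1.sources.twitter1.accessToken = YOUR_ACCESS_TOKEN
    agent1.sources.twitter1.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
    agent1.sources.twitter1.channels = c1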

Q9 : Can you explain Consolidation in Flume?
A : The beauty of Flume is consolidation: it can collect data from many different sources, even from different Flume agents. A consolidating Flume source receives the flows from the upstream agents, passes them through its channel and sink, and finally sends the data to HDFS or another target destination.
(Figure: Flume consolidation)
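A minimal two-tier sketch of consolidation, where each web-tier agent ships its events over Avro to a single collector agent (hostnames, ports, and component names are placeholders):

    # web-tier agent: forwards its events to the collector
    web1.sinks = avroOut
    web1.sinks.avroOut.type = avro
    web1.sinks.avroOut.hostname = collector.example.com
    web1.sinks.avroOut.port = 4545
    web1.sinks.avroOut.channel = c1

    # collector agent: receives from all web-tier agents
    collector.sources = avroIn
    collector.sources.avroIn.type = avro
    collector.sources.avroIn.bind = 0.0.0.0
    collector.sources.avroIn.port = 4545
    collector.sources.avroIn.channels = c1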

Q10 : What are interceptors?
A : Interceptors are used to filter or modify events in flight between a source and its channel. They can drop unnecessary events or select targeted log records, and depending on your requirements you can chain any number of interceptors.
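For instance, a sketch of a regex-filtering interceptor attached to a source (the component names and the pattern are illustrative):

    agent1.sources.r1.interceptors = i1
    agent1.sources.r1.interceptors.i1.type = regex_filter
    agent1.sources.r1.interceptors.i1.regex = .*ERROR.*
    # false keeps only matching events; true would drop the matches instead
    agent1.sources.r1.interceptors.i1.excludeEvents = false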

Q11 : What are sink processors?
A : A sink processor is a mechanism by which you can create fail-over and load-balancing behavior across a group of sinks.
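A sketch of a failover sink group (component names, priorities, and the penalty value are illustrative); a load-balancing group is configured the same way with processor.type = load_balance:

    agent1.sinkgroups = g1
    agent1.sinkgroups.g1.sinks = k1 k2
    agent1.sinkgroups.g1.processor.type = failover
    # the sink with the higher priority is used first; k2 takes over if k1 fails
    agent1.sinkgroups.g1.processor.priority.k1 = 10
    agent1.sinkgroups.g1.processor.priority.k2 = 5
    agent1.sinkgroups.g1.processor.maxpenalty = 10000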

Q12 : Can Flume distribute data to multiple destinations?
A : Yes, it supports multiplexing flows: an event flows from one source into multiple channels and on to multiple destinations. This is achieved by defining a flow multiplexer.
For example, the data flow can be replicated so that one copy is written to HDFS by one sink, another copy goes to a different destination, and a third copy becomes the input of another agent, as in the sketch below.
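A sketch of a replicating flow in which one source fans out into two channels feeding two different sinks (all names are placeholders):

    # replicate every event into both channels
    agent1.sources.r1.channels = c1 c2
    agent1.sources.r1.selector.type = replicating
    # c1 feeds an HDFS sink; c2 feeds an Avro sink that forwards to the next agent
    agent1.sinks.k1.channel = c1
    agent1.sinks.k2.channel = c2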

Q13 : Can Flume provide 100% reliability to the data flow?
A : Yes, it provides end-to-end reliability of the flow. By default, Flume uses a transactional approach in the data flow: sources and sinks are encapsulated in transactions provided by the channels, and the channels are responsible for passing events reliably from end to end of the flow. So it provides 100% reliability to the data flow.

Q14 : What are the important steps in the configuration?
A : The configuration file is the heart of an Apache Flume agent.
Every source must have at least one channel.
Every sink must have exactly one channel.
Every component must have a specific type.
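A minimal sketch that satisfies all three rules, using a netcat source, a memory channel, and a logger sink (the names a1, r1, c1, and k1 are conventional placeholders):

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # every component has a specific type
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.channels.c1.type = memory
    a1.sinks.k1.type = logger

    # a source lists at least one channel; a sink names exactly one
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1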

Q15 : Do agents communicate with other agents?
A : No, each agent runs independently. Flume can easily scale horizontally, and as a result there is no single point of failure.

Q16 : What are Channel selectors?
A : Channel selectors control and separating the events and allocate to a particular channel. There are default/ replicated channel selectors. Replicated channel selectors can replicated the data in multiple/all channels.
Multiplexing channel selectors used to separate and aggregate the data based on the event’s header information. It means based on Sink’s destination, the event aggregate into the particular sink.
Leg example: One sink connected with Hadoop, another with S3 another with Hbase, at that time, Multiplexing channel selectors can separate the events and flow to the particular sink.
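A sketch of a multiplexing selector routing on a hypothetical dest header (the header name and the mappings are illustrative):

    agent1.sources.r1.selector.type = multiplexing
    agent1.sources.r1.selector.header = dest
    # events with dest=hdfs go to c1, dest=s3 to c2, dest=hbase to c3
    agent1.sources.r1.selector.mapping.hdfs = c1
    agent1.sources.r1.selector.mapping.s3 = c2
    agent1.sources.r1.selector.mapping.hbase = c3
    agent1.sources.r1.selector.default = c1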

Q17 : Does Apache Flume also support third-party plugins?
A : Yes, Flume has a fully plugin-based architecture. It can load and ship data from external sources to external destinations via components packaged separately from Flume itself, which is why most big data analysts use this tool for streaming data.

Q18 : What are Flume's core components?
A : Source, channel, and sink are the core components of Apache Flume.
When a Flume source receives an event from an external source, it stores the event in one or more channels.
A Flume channel temporarily stores the event and keeps it until it is consumed by a Flume sink; it acts as the agent's repository.
A Flume sink removes the event from the channel and puts it into an external repository such as HDFS, or forwards it to the next Flume agent.
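For example, a sketch of an HDFS sink draining a channel into a date-partitioned directory (the path and names are placeholders; the date escapes require a timestamp header, e.g. from a timestamp interceptor):

    agent1.sinks = k1
    agent1.sinks.k1.type = hdfs
    agent1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
    # write raw events rather than SequenceFiles
    agent1.sinks.k1.hdfs.fileType = DataStream
    agent1.sinks.k1.channel = c1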