Predictive Analytics with Spark in Azure HDInsight Interview Questions and Answers

Q1 : What are the various programming languages supported by Spark?

A : Although Spark is written in Scala, it lets users write code in several languages:

  • Scala
  • Java
  • Python
  • R (using SparkR)
  • SQL (using Spark SQL)

In addition, by piping data through external commands (the RDD pipe operation), you can process it with programs written in other languages or with arbitrary binaries.
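The piping idea can be sketched without Spark. In Spark the corresponding operation is the RDD pipe method, which writes a partition's elements to an external process one per line and reads the results back from its standard output. The minimal Python sketch below models that for a single partition only. The helper name pipe_partition is hypothetical, and the child command is just the Python interpreter standing in for any external binary.

```python
import subprocess
import sys

def pipe_partition(elements, command):
    """Send elements to an external command, one per line, and
    return the lines the command writes to standard output."""
    payload = "".join(f"{e}\n" for e in elements)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()

# Stand-in for "any binary": a child Python process that upper-cases stdin.
UPPERCASE_CMD = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]

words = ["spark", "hdinsight", "azure"]
piped = pipe_partition(words, UPPERCASE_CMD)
```

Because the external program only sees lines of text on standard input, it can be written in any language, which is the point the answer makes.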

Q2 : What are the various storages from which Spark can read data?

A : Spark has been designed to process data from various sources. It can process data stored in HDFS, Cassandra, Amazon S3, Hive, HBase, and Alluxio (previously Tachyon). It can also read data from any system that provides a Hadoop-compatible data source.

Q3 : What are the various libraries available on top of Apache Spark?

A : Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX and Spark Streaming. You can combine these libraries seamlessly in the same application.

  • MLlib: The machine learning library provided by Spark. Its algorithms are built internally on Spark Core (RDD operations) and the data structures they need. For example, it provides ways to represent a matrix as an RDD and to express recommendation algorithms as sequences of transformations and actions. As a result, MLlib's machine learning algorithms can run in parallel across many computers.
  • GraphX: GraphX provides libraries that help in manipulating very large graph data structures. It represents graphs as RDDs internally, so graph algorithms such as PageRank are converted into operations on RDDs.
  • Spark Streaming: A library for processing unbounded datasets, that is, datasets in which data flows in continuously. If the source is not providing data, processing pauses and waits for more to arrive. The library groups the incoming stream into batches, each holding the data collected over n seconds, converts each batch into an RDD, and then runs the provided operations on those RDDs.
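The batching behaviour described for Spark Streaming can be illustrated with a small plain-Python model. This is only a sketch of the idea, not Spark's API: the function micro_batches, the sample events, and the 2-second interval are all assumptions made for illustration, and unlike a real stream the input here is a finite list.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive time windows.

    Returns a list of (window_start, values) pairs ordered by time.
    Windows with no events are omitted in this simplified model.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // batch_seconds) * batch_seconds
        windows[window_start].append(value)
    return sorted(windows.items())

# Hypothetical input: (arrival time in seconds, word)
events = [(0.5, "a"), (1.2, "b"), (2.1, "a"), (3.9, "c"), (4.0, "a")]

# The "provided operation": count the items in each 2-second batch.
batch_counts = [(start, len(values)) for start, values in micro_batches(events, 2)]
```

Each (window_start, values) pair plays the role of one batch RDD, and the counting step stands in for whatever operations the application applies to every batch.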

Q4 : What is SparkContext?

A : SparkContext is the entry point to Spark. Using a SparkContext you create RDDs, which provide various ways of processing data.

Q5 : What do we mean by Parquet?

A : Apache Parquet is a columnar storage format available in the Hadoop ecosystem. It is space efficient, because the values of each column are stored together and compress well, and it can be used from any programming language or processing framework.

Apache Spark supports reading and writing data in Parquet format.
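To make "columnar" concrete, the sketch below contrasts a row-oriented layout with a column-oriented one in plain Python. It is a conceptual model only: the records and the helper to_columns are hypothetical, and the real Parquet format adds row groups, encodings, compression, and metadata, none of which are modeled here.

```python
# Hypothetical records, as a row-oriented store would keep them.
rows = [
    {"id": 1, "city": "Pune", "temp": 31},
    {"id": 2, "city": "Oslo", "temp": 4},
    {"id": 3, "city": "Lima", "temp": 19},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# A query that needs one field touches a single contiguous list
# instead of every field of every row.
average_temp = sum(columns["temp"]) / len(columns["temp"])
```

Keeping each column's values side by side is also what makes them compress well, since neighbouring values share a type and often a similar range.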

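Q6 : What does the map transformation do? Provide an example.

A : The map transformation on an RDD produces another RDD by applying a user-supplied function to each element. The original RDD is left unchanged. For example, mapping a function that doubles its input over an RDD containing 1, 2, 3 yields a new RDD containing 2, 4, 6.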
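The semantics of map can be modeled in a few lines of plain Python, with no Spark dependency. The helper map_model and the two-partition sample data are hypothetical. In PySpark itself, given a SparkContext sc, the equivalent would be sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * 2).collect(). Note one simplification: Spark evaluates map lazily, when an action runs, whereas this model computes the result immediately.

```python
def map_model(partitions, func):
    """Apply func to every element, partition by partition, and
    return new partitions; the input is left unchanged."""
    return [[func(x) for x in part] for part in partitions]

# Hypothetical RDD contents split across two partitions.
numbers = [[1, 2, 3], [4, 5]]

doubled = map_model(numbers, lambda x: x * 2)

# Flattening the partitions plays the role of collect().
collected = [x for part in doubled for x in part]
```

Each partition is transformed independently of the others, which is why Spark can run a map across many machines without moving data between them.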