Exploratory Data Analysis in Machine Learning

Exploratory Data Analysis (EDA) is the process of analyzing a dataset with simple tools from statistics, linear algebra, and plotting to understand what the data is telling us. The term "exploratory" means that we explore data we have never seen before, much as a detective would. EDA is an important step in a Machine Learning/Data Science project: for a given problem, we first perform EDA to gain insights from the data. It is done before modeling, i.e. before we apply machine learning algorithms to the dataset.

Dataset Overview:

Let us consider a small part of the well-known Iris dataset. The dataset is labelled. Each row is an observation (a data point) and each column is a feature (an input variable). The column ‘species’ is the output (the class label). The table below lists the sepal_length, sepal_width, petal_length, and petal_width of each Iris flower along with its species.

No.  sepal_length  sepal_width  petal_length  petal_width  species
1    5.1           3.5          1.4           0.2          Iris-setosa
2    4.9           3.0          1.4           0.2          Iris-setosa
3    4.7           3.2          1.3           0.2          Iris-setosa
4    4.6           3.1          1.5           0.2          Iris-setosa
5    7.0           3.2          4.7           1.4          Iris-versicolor
6    6.4           3.2          4.5           1.5          Iris-versicolor
7    6.9           3.1          4.9           1.5          Iris-versicolor
8    5.5           2.3          4.0           1.3          Iris-versicolor
9    6.3           3.3          6.0           2.5          Iris-virginica
10   5.8           2.7          5.1           1.9          Iris-virginica
11   7.1           3.0          5.9           2.1          Iris-virginica
12   6.3           2.9          5.6           1.8          Iris-virginica

(a) An example dataset

  1. Sepal length – Length of the sepal of the Iris flower
  2. Sepal width – Width of the sepal of the Iris flower
  3. Petal length – Length of the petal of the Iris flower
  4. Petal width – Width of the petal of the Iris flower
  5. Species – One of three species of Iris flower: Iris-setosa, Iris-versicolor, or Iris-virginica
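To experiment with the example table, it can be loaded into a pandas DataFrame. The sketch below simply types in the 12 rows by hand; the variable names `rows`, `columns`, and `iris` are illustrative choices, and the "No." column is left out because it is only a row counter.

```python
import pandas as pd

# The 12 rows of the example table, entered by hand (the "No." column is omitted).
rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
    (4.6, 3.1, 1.5, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
    (5.5, 2.3, 4.0, 1.3, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    (6.3, 2.9, 5.6, 1.8, "Iris-virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Each tuple becomes one observation; each name in `columns` becomes one column.
iris = pd.DataFrame(rows, columns=columns)

# Show the first five observations.
print(iris.head())
```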

Let us perform EDA on the Iris dataset and try to gain insights from it. Below are some initial checks, all of which can be done easily with built-in methods of the pandas library in Python.

  1. Balanced/Unbalanced Dataset: This is determined from the class label. If every class has the same number of data points, the dataset is balanced; otherwise it is unbalanced.

Example: Our example dataset is balanced because its 3 class labels occur equally often (Iris-setosa: 4, Iris-versicolor: 4, Iris-virginica: 4). Real-world datasets are often unbalanced.
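As a minimal sketch of this check, the class labels of the example table can be counted with pandas. The names `species`, `counts`, and `is_balanced` are illustrative, and treating "all counts equal" as the test for balance follows the definition given above.

```python
import pandas as pd

# Class labels of the 12 example rows: 4 per species.
species = pd.Series(
    ["Iris-setosa"] * 4 + ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    name="species",
)

# Number of data points for each class label.
counts = species.value_counts()
print(counts)

# The dataset is balanced when every class has the same count,
# i.e. the counts contain exactly one distinct value.
is_balanced = counts.nunique() == 1
print("Balanced:", is_balanced)
```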

  1. The number of data points for each class label: In our case, we have 3 classes with 4 data points each.
  2. The shape of the dataset: This gives the number of data points and the number of columns. The shape of our dataset is (12, 5).
  3. Column names in the dataset: We have 5 columns, namely the four features sepal_length, sepal_width, petal_length, and petal_width, and the class label species.
  4. Duplicate rows and null values: Duplicate rows can be dropped. Missing values can be either filled or dropped, depending on the problem at hand.
  5. Data types: We check the type of data in each column, and we can also drop irrelevant columns.

pandas methods and attributes for the above-mentioned checks:

  1. value_counts() – Gives the number of data points for each class label, and therefore shows whether the dataset is balanced or unbalanced.
  2. shape – An attribute (written without parentheses) that gives the shape of the dataset as (rows, columns).
  3. columns – An attribute that gives the column names.
  4. drop_duplicates() – Drops the duplicate rows in the dataset.
  5. dtypes – An attribute that gives the data type of each column, such as int64, float64, bool, or object (typically used for strings).
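The checks above can be sketched on the example table as follows. In pandas, `shape`, `columns`, and `dtypes` are attributes and are written without parentheses, while `value_counts()` and `drop_duplicates()` are methods. The null-value check with `isnull()` is an addition that corresponds to the duplicate/null step of the checklist, and the variable names are illustrative.

```python
import pandas as pd

rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"), (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris-setosa"), (4.6, 3.1, 1.5, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"), (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"), (5.5, 2.3, 4.0, 1.3, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"), (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris-virginica"), (6.3, 2.9, 5.6, 1.8, "Iris-virginica"),
]
iris = pd.DataFrame(
    rows,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
)

# Data points per class label.
print(iris["species"].value_counts())

# (number of rows, number of columns)
print(iris.shape)

# Column names as a plain Python list.
print(list(iris.columns))

# Drop exact duplicate rows; the example table has none, so nothing is removed.
deduplicated = iris.drop_duplicates()
print(len(iris) - len(deduplicated), "duplicate rows removed")

# Missing values per column; all zeros for the example table.
null_counts = iris.isnull().sum()
print(null_counts)

# Data type of each column: the measurements are float64, while the species
# column holds strings (its exact dtype name depends on the pandas version).
print(iris.dtypes)
```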

Exploratory Data Analysis can be done using both statistical and graphical methods. Each has its strengths and limitations. The graphical approach is often preferred because it lets us see patterns in the data.

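Some of the plotting tools are scatter plots, pair plots, histograms, plots of the probability density function (PDF) and cumulative distribution function (CDF), box plots (also called box-and-whisker plots), correlation plots, and many more.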
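To give a feel for what a histogram, PDF, and CDF plot actually display, the sketch below computes the underlying numbers for the petal_length column of the example table using NumPy, without drawing anything. The choice of 5 bins is arbitrary, and the per-bin fractions are only a coarse, discretized stand-in for a true probability density.

```python
import numpy as np

# petal_length column of the example table.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6])

# Histogram: how many values fall into each of 5 equal-width bins.
counts, bin_edges = np.histogram(petal_length, bins=5)

# Discretized PDF: fraction of the data points that fall into each bin.
pdf = counts / counts.sum()

# Empirical CDF at the right edge of each bin: running total of the fractions.
cdf = np.cumsum(pdf)

for left, right, p, c in zip(bin_edges[:-1], bin_edges[1:], pdf, cdf):
    print(f"[{left:.2f}, {right:.2f}]  fraction={p:.2f}  cumulative={c:.2f}")
```

The first bin holds the four Iris-setosa petals and the second bin is empty, which is the kind of gap between classes that a histogram makes visible at a glance.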

Some of the statistical concepts are mean, variance, standard deviation, median, percentile, interquartile range (IQR), and mean absolute deviation (MAD).
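As an illustration, these statistics can be computed for the petal_length column of the example table with NumPy. This is a sketch under a few assumptions: the variance and standard deviation use NumPy's default population formula, the 90th percentile is an arbitrary choice, and MAD is taken as the mean absolute deviation around the mean, as named above.

```python
import numpy as np

# petal_length column of the example table.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6])

mean = petal_length.mean()
variance = petal_length.var()      # population variance (ddof=0); pass ddof=1 for the sample variance
std = petal_length.std()           # square root of the variance above
median = np.median(petal_length)
p90 = np.percentile(petal_length, 90)   # value below which about 90% of the data falls

# Interquartile range: distance between the 75th and 25th percentiles.
q1, q3 = np.percentile(petal_length, [25, 75])
iqr = q3 - q1

# Mean absolute deviation: average distance of each point from the mean.
mad = np.mean(np.abs(petal_length - mean))

print(f"mean={mean:.3f} variance={variance:.3f} std={std:.3f}")
print(f"median={median:.3f} p90={p90:.3f} IQR={iqr:.3f} MAD={mad:.3f}")
```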

Advantages of doing Exploratory Data Analysis:

  1. Helps find better features/variables for the Machine Learning model by visualizing them in plots.
  2. Reveals patterns in the data.
  3. Shows the relationships between the features.
  4. Detects outlier data points.
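Two of these points, relationships between features and outlier detection, can be sketched numerically on the example table. The code below assumes Pearson correlation as the measure of relationship and the common 1.5 × IQR rule as the outlier criterion; neither choice is prescribed above, and with only 12 rows the result is purely illustrative.

```python
import numpy as np

# Three columns of the example table.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6])
petal_width = np.array([0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8])
sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.2, 3.2, 3.1, 2.3, 3.3, 2.7, 3.0, 2.9])

# Relationship between two features: Pearson correlation coefficient.
# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r = np.corrcoef(petal_length, petal_width)[0, 1]
print(f"correlation(petal_length, petal_width) = {r:.3f}")

# Outlier detection with the 1.5 * IQR rule: flag values that lie more than
# 1.5 interquartile ranges below the first or above the third quartile.
q1, q3 = np.percentile(sepal_width, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sepal_width[(sepal_width < lower) | (sepal_width > upper)]
print(f"fences = [{lower:.4f}, {upper:.4f}], outliers = {outliers.tolist()}")
```

On these 12 rows, petal length and petal width are strongly positively correlated, and the only sepal width flagged by the rule is 2.3.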