Dimensionality Reduction in Machine Learning
In the real world, we deal with huge amounts of data, which a Data Scientist uses to extract insights. The data can be structured, semi-structured, or unstructured. Structured data represented in a tabular format is called a dataset. A dataset can contain anywhere from a handful of columns to hundreds, thousands, or even more. These columns are called features (or variables).
What is Dimensionality?
First of all, what is dimensionality? It is simply the number of columns/features in the dataset. For example, suppose we have a dataset containing 100 rows and 4 columns (100 records with 4 features). We say that the dataset is 4-dimensional (4D). In general, if there are n columns, the dataset is n-dimensional, written as nD.
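As a quick illustration, here is a minimal sketch assuming pandas is installed; the column names and values are made up for the example:

```python
import pandas as pd

# A 4-column (4D) dataset; the columns are hypothetical features
df = pd.DataFrame({
    "age":       [25, 32, 47, 51],
    "height_cm": [170, 165, 180, 175],
    "weight_kg": [68, 59, 82, 77],
    "income":    [40000, 52000, 61000, 58000],
})

rows, cols = df.shape
print(f"{rows} records, {cols} features -> {cols}D dataset")
# 4 records, 4 features -> 4D dataset
```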
The Curse of Dimensionality:
The curse of dimensionality arises when we analyze data in high-dimensional spaces, i.e., when we deal with a large number of features. ML models struggle to generalize when the number of features is high compared to the number of observations/records in the dataset.
Consider the example below: let M1, M2, M3, M4, M5, M6 and M7 be ML models trained on different sets of features.
| Model | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
|---|---|---|---|---|---|---|---|
| No. of dimensions | 2 | 5 | 10 | 20 | 50 | 100 | 1000 |
| Accuracy | 64% | 72% | 79% | 85% | 88% | 76% | 60% |
The accuracy follows the trend above: as the number of dimensions increases, the accuracy improves up to a certain point. Once the number of features crosses some threshold (which is problem-specific), the accuracy starts decreasing. This is the curse of dimensionality.
While training a machine learning model, overfitting is a phenomenon that often accompanies high-dimensional data: the model fits the training samples almost perfectly, but in return it cannot generalize well, and its accuracy on new samples of data is poor.
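The trend in the table above can be reproduced with a small experiment. Below is a hedged sketch using scikit-learn: the amount of informative signal is held fixed while noise features are added, so test accuracy eventually drops. The dataset, the model (k-nearest neighbors), and the sample counts are illustrative choices, and the exact numbers will vary; only the trend matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

for n_features in [2, 5, 10, 50, 200, 1000]:
    # Keep the informative signal small and fixed; everything else is noise
    X, y = make_classification(
        n_samples=500,
        n_features=n_features,
        n_informative=min(n_features, 5),
        n_redundant=0,
        random_state=0,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    # Distance-based models like k-NN suffer visibly in high dimensions
    acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{n_features:5d} features -> test accuracy {acc:.2f}")
```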
To avoid overfitting caused by the curse of dimensionality, we have a solution called Dimensionality Reduction. Dimensionality reduction means reducing the dimensions of the dataset, i.e., reducing the number of columns/features. It helps us obtain good features for solving ML problems (both classification and regression) and also makes it easier to draw insights from the data. There are two ways of achieving it:
- Feature Selection
- Feature Extraction
Feature Selection means selecting a subset of the original features/columns. This largely relies on domain experience, where one knows whether a particular feature contributes to the output or not.
Feature Extraction is the process of creating new features from combinations of the original features. Feature Selection and Feature Extraction can also be used together on a dataset to reach lower dimensions. The sketch below contrasts the two.
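A minimal sketch of the difference, assuming NumPy and scikit-learn are available; the data is random and the column choice is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 records with 4 features (4D)

# Feature Selection: keep a subset of the original columns unchanged
X_selected = X[:, [0, 2]]       # e.g. keep columns 0 and 2
print(X_selected.shape)         # (100, 2)

# Feature Extraction: derive new columns from all the original ones
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_extracted.shape)        # (100, 2)
```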
Dimensionality Reduction Techniques:
There are many techniques designed for dimensionality reduction. Some of the most important are:
In the Feature Selection category:
- Filter methods – These measure the statistical relationship (e.g., correlation) between each feature and the output variable, and keep the features most relevant to it.
- Wrapper methods – A subset of features is selected, a machine learning model is trained on it, and based on the trained model's accuracy we decide to add or remove features. This method requires substantial computing power. A sketch of both a filter and a wrapper method follows this list.
- Ensemble methods – Usually a combination of the above two methods, recursively checking for good accuracy.
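Below is a hedged sketch of a filter method (SelectKBest) and a wrapper method (recursive feature elimination, RFE) using scikit-learn; the dataset and the choice of k = 10 are illustrative, not prescriptive:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)   # 569 rows x 30 features

# Filter method: score each feature against the target independently
# of any model, then keep the 10 highest-scoring features
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
print(X_filter.shape)                        # (569, 10)

# Wrapper method: repeatedly train a model and prune the weakest
# features until only 10 remain (computationally heavier)
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)
print(X_wrapper.shape)                       # (569, 10)
```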
In the Feature Extraction category:
- Linear dimensionality reduction, e.g., Principal Component Analysis (PCA)
- Non-linear dimensionality reduction, e.g., t-Distributed Stochastic Neighbor Embedding (t-SNE); a sketch of both follows this list
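A minimal sketch of one technique from each family on scikit-learn's digits dataset (64 features); the hyperparameters are library defaults, not tuned values:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 rows x 64 features

# Linear: PCA projects the data onto the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear: t-SNE preserves local neighborhood structure instead
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)             # (1797, 2) (1797, 2)
```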
Other uses of reducing the dimensions of the data:
- It helps in visualizing data. Humans can visualize 2D/3D data on a sheet of paper, but higher dimensions like 10D, 50D, or 1000D cannot be visualized directly. Hence, we use dimensionality reduction techniques to bring high-dimensional data down to lower dimensions, especially 2D/3D (see the plotting sketch after this list).
- In a classification ML model, the output variable depends on the features. The complexity of the model increases with the number of features, so reducing them to lower dimensions decreases the memory, storage, and training time of the ML model.
- A simple rule of thumb in designing an ML model: if we put noise into the model, we get noise out. Dimensionality reduction lets us remove unwanted features (noise) that do not contribute to deciding the output variable.
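As an example of the visualization use case, here is a sketch that projects the same 64-dimensional digits data down to 2D with PCA and plots it; matplotlib is assumed to be installed:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)      # 64D data: not directly drawable
X_2d = PCA(n_components=2).fit_transform(X)

# Each point is one 64-dimensional record, colored by its class label
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("64-dimensional digits projected to 2D with PCA")
plt.colorbar(label="digit")
plt.show()
```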
Cons: Each dimensionality reduction technique has its weaknesses. The most common one is that we may lose some information that contributes to the output variable.