Semi-Supervised Learning in Machine Learning

Machine learning algorithms need data to learn from. We have large amounts of data in the form of text, video, audio, images, and so on, but most of it is only partially labeled or completely unlabeled, and labeling it takes manual effort or additional algorithms. Supervised learning algorithms generally require labeled data, whereas unsupervised learning algorithms need no labels and learn by discovering patterns in unlabeled data. In semi-supervised learning, we have data that is only partially labeled, and based on that partially labeled portion we want the machine learning model to classify the remaining unlabeled examples in the dataset. Semi-supervised learning therefore sits between supervised and unsupervised learning: its algorithms learn from partially labeled data.

Semi-supervised learning has become very popular because of the huge volume of unlabeled data available in the world. A semi-supervised learning algorithm can be used to label this data and construct a new, fully labeled dataset.

Examples of Semi-Supervised Learning:

For example, consider a large amount of text data such as the descriptions of 10,000 books. We cannot manually go through all of it, since having a person read thousands of books just to assign a single class to each one is very expensive. So, suppose that with some manual effort we classify a portion of the descriptions into categories based on genre, such as thriller, horror, romance, crime, and action. Using this partially labeled text data, we then classify the remaining unlabeled descriptions into those same categories.
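
As a rough illustration of how such partially labeled text data might be represented, here is a minimal sketch in Python (the book descriptions, the genre names, and the convention of marking unlabeled items with -1 are all assumptions made for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up sample of book descriptions; in practice there would be ~10,000.
descriptions = [
    "A detective hunts a serial killer through a rain-soaked city.",
    "Two strangers fall in love on a summer trip across Italy.",
    "A haunted mansion slowly drives its new owners to madness.",
    "A heist crew plans one last impossible bank robbery.",
    "An ordinary commuter is pulled into an international conspiracy.",
]

# Genres for the few descriptions labeled by hand; -1 marks unlabeled descriptions.
genres = ["crime", "romance", -1, -1, -1]

# Turn the raw text into numeric features a classifier can work with.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)

n_labeled = sum(1 for g in genres if g != -1)
print(f"{n_labeled} labeled, {len(genres) - n_labeled} unlabeled descriptions")
```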

Consider another example: a teacher solves some math problems in class and asks the students to solve the remaining ones as an exercise. The worked problems act like manually labeled data, giving hints about how the problems should be handled, while the problems the students have yet to solve are the unlabeled data.

There are three main assumptions in semi-supervised learning, namely the continuity (smoothness) assumption, the cluster assumption, and the manifold assumption. A semi-supervised learning algorithm makes at least one of these assumptions in order to make use of unlabeled data.

Methods in Semi-Supervised Learning:

  1. Generative models
  2. Low-Density Separation
  3. Graph-based methods (see the sketch after this list)
  4. Heuristic approaches
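
As a rough illustration of the graph-based family, the following sketch uses scikit-learn's LabelPropagation on a toy two-moons dataset; the dataset, the number of labeled points, and the kernel settings are arbitrary choices for illustration rather than a definitive recipe:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Toy two-class dataset.
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)

# Pretend only 10 points are labeled; -1 marks the unlabeled points.
rng = np.random.default_rng(0)
y = np.full_like(y_true, -1)
labeled_idx = rng.choice(len(y_true), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# LabelPropagation builds a similarity graph over all points and
# spreads the known labels along its edges.
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y)

# transduction_ holds the inferred label for every training point.
unlabeled = y == -1
accuracy = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"Accuracy on the originally unlabeled points: {accuracy:.2f}")
```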

The wrapper method for semi-supervised learning is called self-training: a model is first trained on the labeled data, its most confident predictions on the unlabeled data are added to the training set as pseudo-labels, and the model is retrained. Co-training is an extension of self-training in which two models, trained on different views (feature sets) of the data, generate labeled examples for one another.
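
To make the self-training idea concrete, here is a minimal sketch of the loop, assuming a logistic regression base model and a fixed confidence threshold (both arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=10):
    """Repeatedly pseudo-label the most confident unlabeled points and retrain."""
    model = LogisticRegression(max_iter=1000)
    X_lab, y_lab = X_labeled.copy(), y_labeled.copy()
    X_unl = X_unlabeled.copy()

    for _ in range(max_rounds):
        model.fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        proba = model.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is confident enough; stop early
        # Move the confident points, with their pseudo-labels, into the labeled set.
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unl = X_unl[~confident]

    return model
```

scikit-learn also ships a ready-made wrapper, sklearn.semi_supervised.SelfTrainingClassifier, which implements essentially this loop around any classifier that exposes predict_proba.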

Advantages of Semi-Supervised Learning:

  1. Learning accuracy often improves when unlabeled data is used in conjunction with a small amount of labeled data.
  2. Manual labeling often requires a skilled expert or an expensive experiment; semi-supervised learning algorithms reduce this time and effort.
  3. A semi-supervised dataset is flexible: we can discard the labels and perform unsupervised learning, or discard the unlabeled data and perform supervised learning.

Disadvantage of Semi-Supervised Learning:

The biggest disadvantage of a semi-supervised learning algorithm is that it cannot correct its own mistakes: once a wrong pseudo-label is added, it is reused as if it were ground truth. Outliers are especially harmful, because if the model assigns a label to an outlier point, that error can propagate and the final model will give very poor results.

Applications of Semi-Supervised Learning:

  1. Genetics (protein sequence classification)
  2. Speech analysis
  3. Website data analysis
  4. Internet content classification
  5. Text analysis

Semi-supervised learning algorithms can be framed as either transductive learning or inductive learning. In transductive learning, we only infer the correct labels for the specific unlabeled points we already have, whereas in inductive learning we infer a general mapping from inputs to labels that can also be applied to new, unseen data. In practice, algorithms designed for the transductive setting and the inductive setting are often used interchangeably.
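
To make the distinction concrete, here is a rough sketch using LabelPropagation as a stand-in for the transductive setting and a classifier trained on the propagated labels as a stand-in for the inductive one (the dataset and models are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=1)
y = np.full_like(y_true, -1)
y[:20] = y_true[:20]  # only the first 20 points are labeled

# Transductive: infer labels only for the specific unlabeled points we already have.
transductive = LabelPropagation().fit(X, y)
inferred_labels = transductive.transduction_

# Inductive: learn a general mapping from inputs to labels, here by training a
# classifier on the transductively inferred labels, so it can score unseen points.
inductive = LogisticRegression(max_iter=1000).fit(X, inferred_labels)
X_new = np.array([[0.5, 0.25], [-0.5, 0.75]])  # points never seen during training
print(inductive.predict(X_new))
```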

Finally, semi-supervised models are generally observed to improve the performance of the base model, and they work best when only a small number of labeled samples is available. With the right strategy, a semi-supervised learning algorithm helps reduce the cost of labeling a significant amount of data and, based on the data that is already labeled, classifies the rest as accurately as possible.