Cross Validation in Machine Learning

In this article, we will understand the concept of cross validation and its different types. Suppose we have data on the symptoms of patients who may or may not have a disease. When a new patient shows up, we can compare their symptoms against these data and predict whether that person has the disease. For that, we have to choose the best machine learning method. Cross validation allows us to compare different machine learning methods and tells us how well each one is likely to work in practice. It gives us an estimate of the error rate we can expect when we deploy our model.

Now, let’s consider all the data on symptoms of patients who do or do not have the disease. We need to do two things with this data –

  • Estimate the parameters for the machine learning methods – In other words, if we use logistic regression, we have to use some of the data to estimate the shape of the curve. In machine learning, estimating the parameters is called ‘training the algorithm’.

  • Evaluate how well the machine learning methods work – In other words, we have to find out whether this curve will do a good job of categorizing new data. In machine learning, evaluating a method is called ‘testing the algorithm’.

So basically, we need the data to –

  1. Train the machine learning methods.
  2. Test the machine learning methods.

A terrible approach would be to use all of the data to train the algorithm, because then we wouldn’t have any data left to test the method. Reusing the same data for both training and testing is also a bad idea, because we need to know how the method will work on data it wasn’t trained on. A better idea would be to use the first 75% of the data for training and the last 25% for testing. We could then compare methods by seeing how well each one categorized the test data. But how do we know which block of data should be used for testing? Cross validation sidesteps this question: it uses every block of data for testing, one at a time, and summarizes the results at the end. Once every block has been used for testing, we can compare methods by seeing how well they performed overall.
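The block rotation described above can be sketched in plain Python. This is only an illustration; `rotating_splits` is a hypothetical helper (real libraries such as scikit-learn provide equivalents), and it assumes the data length is divisible by the number of blocks:

```python
def rotating_splits(data, k):
    """Yield (train, test) pairs so that each of the k blocks
    of `data` serves as the test set exactly once."""
    size = len(data) // k  # assumes len(data) is divisible by k
    for i in range(k):
        test = data[i * size:(i + 1) * size]
        train = data[:i * size] + data[(i + 1) * size:]
        yield train, test

# Every element ends up in a test block exactly once:
for train, test in rotating_splits(list(range(8)), k=4):
    print(test)  # [0, 1], then [2, 3], then [4, 5], then [6, 7]
```

After the loop finishes, every data point has been tested on exactly once, which is what lets cross validation summarize performance over the whole data set.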

The basic purpose of cross validation is to estimate how the model will perform on unseen data. Imagine trying to score a goal into an empty net: it looks pretty easy, and we can score from a considerable distance too. The real test starts when there is a goalkeeper and a bunch of defenders. In the same way, a model is only proven when it still performs under real conditions, on data it has never seen before.

Types of cross validation –

Following are the different types of cross validation –

  • Hold Out method –

In this type of cross validation, we divide the data into two subsets –

  a) Training sample – We use this sample to build the model.
  b) Hold-out sample – We test our model on this sample.

If the model statistics are consistent across these two samples, then the model is likely to perform well not just on the given data but on new data as well.
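A minimal sketch of the hold-out split, assuming an unshuffled list and a 75/25 ratio (`hold_out_split` is a hypothetical helper name, not a standard function):

```python
def hold_out_split(data, train_fraction=0.75):
    """Split `data` into a training sample and a hold-out sample."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]  # in practice, shuffle the data first

train_sample, hold_out_sample = hold_out_split(list(range(100)))
print(len(train_sample), len(hold_out_sample))  # 75 25
```

We would then fit the model on `train_sample` and compare its statistics on `hold_out_sample` to check for consistency.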

  • K-fold cross validation –

In this type of cross validation, we split the data into k equal-sized subsets. We take one subset and treat it as the validation (testing) set for the model, and use the remaining k-1 subsets for training. This is repeated k times, so that each subset serves as the validation set exactly once, and the k scores are averaged.

  • Leave-p-out cross validation –

In this cross validation technique, we leave out p data points for validation; those points are not used for training. If we have m data points in the data set, then m-p points are used for training and the remaining p points for validation. This process is exhaustive: it is repeated for every possible combination of p points from the original data set, and the error is averaged over all the trials to check the effectiveness of the model. Because the model must be trained for every possible combination, the method becomes computationally infeasible for even moderately large p.
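The exhaustive enumeration can be sketched with the standard library (`leave_p_out_splits` is a hypothetical helper name). The number of trials is the binomial coefficient C(m, p), which is what makes the method blow up:

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(data, p):
    """Yield every (train, test) pair where the test set is one of
    the C(m, p) possible p-element subsets of the data."""
    m = len(data)
    for test_idx in combinations(range(m), p):
        train = [data[i] for i in range(m) if i not in test_idx]
        test = [data[i] for i in test_idx]
        yield train, test

m, p = 5, 2
splits = list(leave_p_out_splits(list(range(m)), p))
print(len(splits), comb(m, p))  # 10 10 -- the trial count grows fast with p
```

Even for this toy case, m = 5 and p = 2 already require 10 training runs; for realistic m and p the count is astronomically larger.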

  • Leave-one-out cross validation –

This type of cross validation is similar to leave-p-out, but with p = 1. This cuts the number of trials from C(m, p) down to m, which saves a lot of time. If the data set is very large it can still be slow, since the model must be trained m times, but it is still far quicker than general leave-p-out cross validation.
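With p = 1 the combinations reduce to simply removing one point at a time, so the sketch becomes (again, `leave_one_out_splits` is an illustrative helper):

```python
def leave_one_out_splits(data):
    """Each data point serves as the test set exactly once (p = 1),
    giving m trials for m data points."""
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        test = [data[i]]
        yield train, test

for train, test in leave_one_out_splits([10, 20, 30]):
    print(train, test)
# [20, 30] [10]
# [10, 30] [20]
# [10, 20] [30]
```

Three data points give exactly three trials, compared with C(m, p) trials in the general leave-p-out case.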
