
Splitting datasets and cross-validation

An article describing different methods of splitting datasets and cross-validation.

It’s good practice in all machine learning projects to split your dataset so that the data you use to evaluate your models is separate from the data used to train them.

In this article we give a quick review of the important topics discussed in the previous videos, including how to split your data into training, validation and test sets, and cross-validation.

Training and validation sets

At the very least, for any machine learning task you should split your dataset into two subsets. One subset, known as the training set, is used in the initial step to fit the model parameters. The data in this subset should never be used to evaluate the performance of your models (unless you are using cross-validation, as we will see later).

The second subset, known as the validation set, is used to evaluate the performance of the model learned from the training data. This data should not have been used during the training step. In other words, the validation data is ‘new’ data that the model learned from the training data has not seen before.

In general, the data selected for each subset should be a random representative sample from across the entire dataset. Usually the majority of the data is allotted to the training set, with the remaining minority allotted to the validation set. A common choice is 80% training and 20% validation data, but this can vary according to the size of your dataset and the type of problem and model you are using.

Diagram showing the split of a complete dataset into training and validation sets. The training set is around 80% of the full dataset.
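To make the idea of a random 80/20 selection concrete, here is a minimal NumPy sketch. The arrays features and targets, their contents and the seed are hypothetical values chosen purely for illustration. The sample indices are shuffled once and then cut into two groups.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

# Hypothetical data: 10 samples with 2 features each, and one target per sample.
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Shuffle the sample indices once, then cut them 80% / 20%.
indices = rng.permutation(len(features))
n_train = int(0.8 * len(features))
train_idx, validation_idx = indices[:n_train], indices[n_train:]

# Indexing both arrays with the same indices keeps samples and targets paired.
features_train, targets_train = features[train_idx], targets[train_idx]
features_validation = features[validation_idx]
targets_validation = targets[validation_idx]

print(len(features_train), len(features_validation))  # 8 2
```

Because every index appears in exactly one of the two groups, no sample ends up in both the training and validation sets.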

It’s important to remember to split both your features matrix and target vector, and to make sure the random selection is the same for both so that they still match one another. Scikit-learn has a convenient function for splitting datasets called train_test_split(). Suppose you have a features matrix named X and a target vector named y. To split them both in an identical way you can use the following code:

from sklearn.model_selection import train_test_split

# Split X and y with the same random selection; 20% goes to validation
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2)

This assigns values from X to two new variables X_train and X_validation (respectively the training and validation sets), and values from y to two new variables y_train and y_validation (again respectively the training and validation sets).

The proportion of the dataset allotted to the validation set is set with the parameter test_size. Here we have set it to 0.2, or 20%, but it can be any fraction strictly between 0.0 and 1.0 (values close to either extreme are unlikely to be very useful!). If you omit the test_size= parameter (and do not set train_size), the default is 0.25, or 25% validation to 75% training data.
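As a quick check of how test_size behaves, the sketch below runs train_test_split() on a small synthetic dataset. The arrays and the random_state value are assumptions made for illustration. random_state is an optional train_test_split() parameter that fixes the shuffle so the same split is produced on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 100 samples, 3 features.
# Each target equals the first feature of its row, so pairing can be checked.
X = np.arange(300).reshape(100, 3)
y = X[:, 0].copy()

# Explicit 20% validation split; random_state makes it reproducible.
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_validation.shape)  # (80, 3) (20, 3)

# Omitting test_size falls back to the default of 25% validation data.
X_train_d, X_validation_d, y_train_d, y_validation_d = train_test_split(
    X, y, random_state=0)
print(len(X_train_d), len(X_validation_d))  # 75 25

# Rows of X and entries of y were shuffled together, so they still match.
print(np.array_equal(X_train[:, 0], y_train))  # True
```

Fixing random_state is useful when you want to compare models on exactly the same split.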

Test sets

Often, as well as the training and validation subsets, an additional split of the dataset is made, making a third subset known as the test set.

In many machine learning models and algorithms there are a number of parameters and settings that you need to select prior to training, independent from your data, that determine how the algorithm proceeds. These are known as hyperparameters.

A common method to set or tune these hyperparameters is to repeat the training process a number of times with different hyperparameter values, producing lots of models, and then select the set of hyperparameter values that gives the best performance on the validation set. The trouble with this approach is that the validation data has now influenced the choice of model, so it’s difficult to claim that it is independent of the training process when reporting your results.

This is where the test set comes in. This is a subset of your data that is not used either to train the models themselves (that’s the training data), or to select hyperparameters (that’s the validation data). Since the test data has not been seen prior to the final evaluation step, using it to evaluate your final model ought to be a truer test of its quality (or otherwise).

Diagram showing the split of a complete dataset into training, validation and test sets. The training set is most of the full dataset, while the validation and test sets are smaller splits.

While at time of writing there’s no direct way to make a three-way split into training, validation and test sets in scikit-learn, you can just use train_test_split() twice to make the two splits as follows (supposing we start with X and y as before):

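from sklearn.model_selection import train_test_split

# First split: hold back 10% of all the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1)

# Second split: hold back 10% of the remaining data as the validation set
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.1)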

Here we have split the test set away from X and y in the first step, and then split the remaining data into training and validation sets as before. Ten percent of the entire dataset is allotted to the test set, and the remaining 90% is split in the same ratio (90% training, 10% validation). Overall, that gives roughly 81% training, 9% validation and 10% test data.
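Putting the three subsets to work, the following is a minimal sketch of the tune-then-test workflow described in this section. It makes several assumptions that are not part of the article: it uses scikit-learn’s bundled digits dataset, a k-nearest neighbours classifier, an arbitrary list of candidate n_neighbors values, and a fixed random_state. Any other model and hyperparameter could be substituted.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small image dataset bundled with scikit-learn (assumed here for illustration).
X, y = load_digits(return_X_y=True)

# Hold out the test data first, then split the rest into training/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_rest, y_rest, test_size=0.1, random_state=0)
print(len(X_train), len(X_validation), len(X_test))

# Train one model per candidate hyperparameter value and score each on the
# validation data only.
validation_scores = {}
for n_neighbors in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    validation_scores[n_neighbors] = model.score(X_validation, y_validation)

# Pick the value with the best validation accuracy.
best_n = max(validation_scores, key=validation_scores.get)

# The test set is used exactly once, to evaluate the selected model.
final_model = KNeighborsClassifier(n_neighbors=best_n).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_n, test_score)
```

The key design point is that X_test and y_test appear only in the final evaluation, never inside the tuning loop.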

K-fold cross-validation

As discussed above, using a validation set to tune hyperparameters and then a test set to evaluate the final model is generally good practice. But can we be sure the hyperparameters aren’t just tuned to a particular random selection of training and validation subsets?

A good way to check this is to use cross-validation. In cross-validation, as before, we separate some portion of the data and use it purely to test the model. We refer to this as the test data.

However, when using K-fold cross-validation, rather than making a single further split to the remaining data, we split it into K equally sized subsets. A common choice for K is ten, so in that case there would be ten equally sized subsets of the training data.

Then, for each of the K subsets, we hold back that data, train the model on the remaining data, and then evaluate the performance of the trained model on the data we held back. We can then take an average of the model performance over all the K-folds.
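The hold-back-and-average loop just described can be sketched by hand with scikit-learn’s KFold class, which generates the index sets for each fold. As before, the digits dataset, the k-nearest neighbours model, the choice of five folds and the random_state values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Set the test data aside; cross-validation only ever sees the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Split the training data into K = 5 folds of (nearly) equal size.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fit_idx, held_back_idx in kfold.split(X_train):
    # Train on the other K-1 folds...
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # ...and evaluate on the fold that was held back.
    score = model.score(X_train[held_back_idx], y_train[held_back_idx])
    fold_scores.append(score)

# Average the performance over all K folds.
print(np.mean(fold_scores))
```

When the number of training samples is not an exact multiple of K, KFold makes the folds differ in size by at most one sample.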

Diagram showing a possible cross-validation scheme. The cross-validation splits are five repeated and equal splits of the training data, with a different validation split used in each. The test data is held out for testing of the final run using all the training split.

If we then repeat this process with different model hyperparameters, we can be more sure when selecting the best performing set of hyperparameters that we haven’t just tuned them to a particular set of validation data.

Once we have chosen the optimal values for our hyperparameters, we can recombine all our training data, train the model one last time on all of it, and perform a final evaluation on the test data. Remember, this test data should never be used at any time during the training process.

Cross-validation in scikit-learn

As ever, there’s a quick way to perform cross-validation in scikit-learn, this time using the cross_val_score() function.

Let’s suppose you have some training data named X_train and y_train already loaded in memory, as well as some scikit-learn machine learning model and associated hyperparameters initialised and named eg_model. Then to perform cross-validation on that model you can use the following code:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(eg_model, X_train, y_train, cv=5)

This returns an evaluation score for each of the K folds. Here, cv=5 sets the number of folds to five, so the result contains five cross-validation scores.
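To show how these scores might be used, here is a minimal sketch of the whole procedure: cross-validate each candidate hyperparameter value, pick the one with the best mean score, retrain on all the training data, and evaluate once on the test data. The digits dataset, the k-nearest neighbours model, the candidate values and random_state are assumptions for illustration, not part of the course material.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hold out the test data before any cross-validation takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Cross-validate one model per candidate hyperparameter value.
mean_scores = {}
for n_neighbors in [1, 3, 5, 7]:
    eg_model = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(eg_model, X_train, y_train, cv=5)
    mean_scores[n_neighbors] = scores.mean()  # average over the five folds

# Choose the value with the best mean cross-validation score.
best_n = max(mean_scores, key=mean_scores.get)

# Retrain on all the training data, then evaluate once on the test data.
final_model = KNeighborsClassifier(n_neighbors=best_n).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_n, mean_scores[best_n], test_score)
```

Note that cross_val_score() fits a fresh copy of the model for each fold, so eg_model itself is left unfitted. That is why the final model is trained separately.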

For a working example of using K-fold cross validation see the practical at the end of this week.

This article is from the free online course Machine Learning for Image Data.

