
How to Evaluate the Performance of Statistical Models

In this article, we look at how to evaluate the performance of statistical models to help you with your machine learning and data science projects.
© Dr Michael Ashcroft

Model Evaluation

Standard data-science methodology involves:
  1. Model Generation: Generating a set of models.
  2. Model Validation: Evaluating these models to choose the best.
  3. Model Testing: Estimating the expected performance of the chosen model on new data.

Validation Techniques

In validation-based techniques, which we assume you are familiar with, steps two and three involve calculating a chosen loss function, such as misclassification rate or mean squared error, on new data. Validation techniques include hold-out validation, where the loss function is calculated on hold-out sets of data (data not used during model training). In this case we would split our original data into three sets (training, validation, and testing) and use each in the corresponding step. It is important to randomize the rows of data before splitting them into subsets like this, as the original order may be the result of data collection methods that group like cases.
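As a minimal sketch of the shuffled three-way split described above, the following uses only the Python standard library. The function name `train_val_test_split`, the 60/20/20 proportions, and the toy data are illustrative assumptions, not part of the original course material.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation and test sets."""
    rows = list(rows)                  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(rows)  # randomize the order before splitting
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Toy data ordered by class, as might result from grouped data collection.
data = [(i, 0) for i in range(50)] + [(i, 1) for i in range(50, 100)]
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Without the shuffle, the test set here would contain only one class, which is exactly the problem that randomizing the rows is meant to avoid.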
Cross-validation (or k-fold validation) combines the training and validation data in steps one and two. This combined set is divided into k subsets. For each model-type under consideration, k models are trained, each on all subsets but one (with a different subset held out for each model). Each model is then used to calculate the loss function on the data subset not used in its training, and the results are averaged. The extreme case is when k is equal to the number of cases in the data, so that there are as many models as rows in the data, and each model is trained on all rows except one and tested on this hold-out row. This is known as all-but-one validation (more commonly called leave-one-out cross-validation).
Once a model-type has been selected as best, there are a number of options for obtaining the final model, including averaging the model parameters of all k models, using the k models as an ensemble, or training a final model on the entire cross-validation data. We prefer the final option, treating the validation performance as an estimate for the model-type and relying on the fact that more training data can be expected to improve a model.
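The cross-validation procedure and the final refit can be sketched as follows. This is an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_val_loss`), the two toy model-types (a constant mean predictor and a least-squares line), and the synthetic data are all hypothetical and chosen only to keep the example self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_loss(xs, ys, k, fit, loss, seed=0):
    """Average held-out loss over k models, each trained on the other k-1 folds."""
    fold_losses = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        test_x = [xs[i] for i in held_out]
        test_y = [ys[i] for i in held_out]
        fold_losses.append(loss(model, test_x, test_y))
    return sum(fold_losses) / k

def fit_mean(xs, ys):
    """Model-type 1: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Model-type 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def mse(model, xs, ys):
    """Mean squared error of the model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
noise = random.Random(1)
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]

mean_cv = cross_val_loss(xs, ys, k=5, fit=fit_mean, loss=mse)
line_cv = cross_val_loss(xs, ys, k=5, fit=fit_line, loss=mse)
# Setting k to the number of rows gives all-but-one validation.
mean_loo = cross_val_loss(xs, ys, k=len(xs), fit=fit_mean, loss=mse)
line_loo = cross_val_loss(xs, ys, k=len(xs), fit=fit_line, loss=mse)

# The preferred option from the text: refit the winning model-type on all the data.
best_fit = fit_line if line_cv < mean_cv else fit_mean
final_model = best_fit(xs, ys)
print(mean_cv, line_cv, mean_loo, line_loo)
```

Note that the cross-validated score is attached to the model-type, not to any one of the k fitted models; the final model is a new fit on the combined data.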

Validation and Test?

It sometimes surprises students that step three is required even when validation involves selecting the best model based on expected performance on new data. This is because the fact that we are selecting the best performing model of a set of models biases the result. In order to get an unbiased estimate of the chosen model’s performance on new data, this must be undertaken separately. To put it another way, the fact that we are choosing the best model biases the resulting performance estimate for the model we chose.
Since this explanation is seldom sufficient to convince doubting students, we include a simple example illustrating this phenomenon. We will go through it here, and also demonstrate it in a video. Consider the case of trying to predict the results of tossing a fair coin. The expected misclassification rate of all classifiers is 50%. This is because there is no pattern to be found; there is only randomness. Nonetheless, we collect some data and split it into training, validation and test data sets of 50 rows each. We train one hundred classifiers on the training data.
The best performing model obtains 64% accuracy on the validation data! We choose this model. But it then only performs at 46% accuracy on the test data.
What has happened? We know that all models have an expected performance of 50% on new data. The 64% accuracy on the validation data was entirely down to chance. But by selecting the best we are building in a bias, in that we are more likely to select a model that was lucky in this way. The performance of a model on a finite sample of data will diverge from its expected performance on the population data. This is simply variance, and we will be biased towards selecting models whose validation performance happened to deviate in the favourable direction. In this case, since there is no pattern, we are going to select a model purely on the basis of chance, that is, of this variance.
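The coin-toss experiment can be simulated with a short sketch. One simplifying assumption is made: because the labels are pure noise, each "trained classifier" is modelled as a set of independent random guesses, which stands in for whatever a real learner would produce from patternless training data. The function names and the number of repetitions are illustrative.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def one_experiment(rng, n_models=100, n_rows=50):
    """Pick the best of n_models on validation data, then score it on test data."""
    val_y = [rng.randint(0, 1) for _ in range(n_rows)]   # fair coin tosses
    test_y = [rng.randint(0, 1) for _ in range(n_rows)]
    best_val, best_test = -1.0, 0.0
    for _ in range(n_models):
        # Assumption: a model fitted to noise behaves like a random guesser.
        val_pred = [rng.randint(0, 1) for _ in range(n_rows)]
        test_pred = [rng.randint(0, 1) for _ in range(n_rows)]
        val_acc = accuracy(val_pred, val_y)
        if val_acc > best_val:
            best_val, best_test = val_acc, accuracy(test_pred, test_y)
    return best_val, best_test

rng = random.Random(0)
results = [one_experiment(rng) for _ in range(100)]
mean_best_val = sum(v for v, _ in results) / len(results)
mean_test = sum(t for _, t in results) / len(results)
print(f"mean validation accuracy of the selected model: {mean_best_val:.3f}")
print(f"mean test accuracy of the selected model:       {mean_test:.3f}")
```

Averaged over many repetitions, the selected model's validation accuracy sits well above 50%, while its test accuracy stays close to 50%: the gap is the selection bias described above.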
Of course, as the size of the validation data increases, this variance will decrease, and so will the bias of the validation performance estimate for our chosen model. The same happens as the number of candidate models decreases. Moreover, if our chosen model performs much better than the others, such that it is very unlikely the difference was the result of variance, the probability that our choice was based on variance decreases and the reasonableness of treating the estimate as unbiased increases. This means that it can sometimes be reasonable to accept the validation estimate of the chosen model’s performance as a good estimate of its expected performance on new data.
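To see how the bias shrinks with more validation rows or fewer candidate models, one can compute the expected validation accuracy of the best of several chance-level classifiers exactly. The sketch below assumes the classifiers are independent and each has a true accuracy of 50%; models trained on the same data are usually correlated, so treat the numbers as indicative only. The function name is hypothetical.

```python
from math import comb

def expected_best_accuracy(n_rows, n_models):
    """Expected maximum validation accuracy of n_models independent
    chance-level classifiers, each scored on n_rows validation rows."""
    # Probability that one classifier gets exactly c rows right: Binomial(n_rows, 0.5).
    pmf = [comb(n_rows, c) / 2 ** n_rows for c in range(n_rows + 1)]
    expected_max = 0.0
    cdf = 0.0
    for c in range(n_rows + 1):
        prev_cdf = cdf
        cdf += pmf[c]
        # P(max == c) = P(all <= c) - P(all <= c - 1)
        expected_max += c * (cdf ** n_models - prev_cdf ** n_models)
    return expected_max / n_rows

for n_rows in (50, 500, 5000):
    for n_models in (1, 10, 100):
        best = expected_best_accuracy(n_rows, n_models)
        print(f"rows={n_rows:5d} models={n_models:4d} expected best accuracy={best:.3f}")
```

With a single model the expected accuracy is exactly the true 50%, so there is no selection bias; the excess over 50% grows with the number of models and shrinks with the number of validation rows.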
If you do find that the chosen model performs significantly worse on the test data than on the validation data, this indicates that the model was chosen on the basis of variance. This is an extremely bad situation to be in, since you have now used all your hold-out data sets in the training/selection process. In theory, the only legitimate thing to do is to obtain new data, which you can then use as a hold-out test set on a new set of models when you redo the entire process. Otherwise you risk overfitting on the test data. In practice, it can be difficult, expensive or impossible to obtain new data, and a data scientist may have no choice but to redo the entire process without new test data.

Skipping validation or testing

There may be cases where steps two or three are omitted. For example, you may only create one model, and so there is no need to perform an evaluation step to select the best model of those created (though typically this would be indicative of poor practice). Alternatively, you may not care to obtain an unbiased estimate of the chosen model’s expected performance on new data, so long as you are certain that it is the best model (and have some indication from the validation results that it performs sufficiently well).

Maximum Likelihood Techniques

The disadvantage of validation techniques is that they are expensive. Hold-out validation is expensive in terms of data: Data that could be used in training is instead used for validation. Cross-validation can minimize this data ‘waste’, but only at the cost of training multiple (and often very many) models. It is, therefore, expensive in terms of computation.
One alternative is to not bother with validation at all, and to evaluate the performance of a model on the same data that it was trained on. This in-sample performance is, of course, a biased estimate of the expected performance of the model on new data. But, as the number of cases in the training data approaches infinity, in-sample performance will approach the expected performance of the model on new data. Where the loss function is based on measuring the probability of the data, this is the maximum likelihood method.
What bias there is in the in-sample estimate of the performance of a model on new data is the result of overfitting, which we have discussed previously: the fitting of the model to the particularities of the training data at the expense of generalizable patterns in the population. We know that models tend to overfit more as they become more complex. This leads to the idea of penalized maximum likelihood approaches, where the maximum likelihood estimate is combined with a complexity penalty.
Two common penalized maximum likelihood scores are the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). Both use the number of parameters in a model as a measure of model complexity. They are defined:
\[\mathrm{AIC}(M) = 2k - 2\log(\hat{L})\]
\[\mathrm{BIC}(M) = \log(n)\,k - 2\log(\hat{L})\]
where \(k\) is the number of parameters in the model, \(\hat{L}\) is the maximized likelihood (the probability of the training data given the fitted model), and \(n\) is the number of rows in the training data. For both criteria, lower scores are better.
Both criteria have large-sample optimality guarantees (i.e. guarantees about the result as the amount of data approaches infinity). In practice, the AIC tends to overfit with large amounts of data, and the BIC tends to underfit with small amounts of data (that is, it chooses a model that is too simple).
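As a small worked illustration of the two scores, the sketch below compares two models of a coin: a fair-coin model with no fitted parameters, and a model whose bias is fitted by maximum likelihood. The counts (58 heads in 100 tosses) are invented for the example, and the function names are hypothetical.

```python
from math import log

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 log(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: log(n) k - 2 log(L)."""
    return log(n) * k - 2 * log_likelihood

def bernoulli_log_likelihood(heads, n, p):
    """Log-probability of a specific sequence with `heads` heads in n tosses."""
    return heads * log(p) + (n - heads) * log(1 - p)

# Made-up data for illustration: 58 heads in 100 tosses.
n, heads = 100, 58

# Model A: fair coin, nothing fitted (k = 0).
ll_fair = bernoulli_log_likelihood(heads, n, 0.5)

# Model B: bias fitted by maximum likelihood (k = 1).
p_hat = heads / n
ll_fit = bernoulli_log_likelihood(heads, n, p_hat)

print("AIC fair:", aic(ll_fair, 0), " AIC fitted:", aic(ll_fit, 1))
print("BIC fair:", bic(ll_fair, 0, n), " BIC fitted:", bic(ll_fit, 1, n))
```

With these counts the fitted model has the higher likelihood, as it always must, and AIC comes out slightly lower (better) for it. BIC, whose penalty per parameter is \(\log(n)\) rather than 2, comes out lower for the fair-coin model. The two criteria can therefore disagree on the same data.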
Equivalent to the penalized maximum likelihood approaches are penalized maximum entropy methods. They are useful in situations where calculating the entropy of a model is simpler than its likelihood given the training data.
This article is from the free online

Advanced Machine Learning
