Overfitting

In this article, Prof. Hao Ni illustrates the meaning and consequences of overfitting.

Overfitting Issue

Overfitting refers to the case where the model prediction corresponds too closely or exactly to a particular data set (typically the training data), but may fail to fit additional data or predict future observations reliably.

Figure 1. An illustrative example of overfitting. The blue circles represent the training data of input-output pairs simulated by a linear model. The red solid line is the ground-truth mean function, and the yellow dashed curve is the estimate of the mean function obtained by a high-degree polynomial regression.

In the example shown in Figure 1, when one applies polynomial regression to a small dataset generated by a linear model, the yellow dashed curve is the estimated conditional mean function: it is highly oscillatory and far from the ground-truth mean function marked in red. This phenomenon is overfitting, caused by the limited sample size relative to the model complexity.
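As a minimal sketch of this setup (added here for illustration; the specific linear model, noise level, and polynomial degree used in Figure 1 are not given in the article, so the ones below are assumptions), the following NumPy snippet simulates a small training set from a linear mean function, fits a high-degree polynomial, and compares the near-zero training error with the much larger error against the true mean function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small training set from an assumed linear ground truth y = 2x + 1 plus noise.
n_train = 10
x_train = rng.uniform(-1.0, 1.0, size=n_train)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.3, size=n_train)

# Fit a high-degree polynomial: with only 10 points, a degree-9 polynomial
# can (nearly) interpolate the training data.
degree = 9
coeffs = np.polyfit(x_train, y_train, deg=degree)

# Evaluate on a dense grid and compare against the true mean function.
x_grid = np.linspace(-1.0, 1.0, 200)
y_true_mean = 2.0 * x_grid + 1.0
y_poly = np.polyval(coeffs, x_grid)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
mean_fn_mse = np.mean((y_poly - y_true_mean) ** 2)
print(f"training MSE:            {train_mse:.2e}")   # close to zero (near-interpolation)
print(f"error vs. true mean fn.: {mean_fn_mse:.2e}") # much larger: the fitted curve oscillates
```

The training error is essentially zero, yet the fitted curve departs sharply from the true mean function between the training points, which is exactly the behaviour shown by the yellow dashed curve in Figure 1.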

Overfitting can occur even for the simplest linear model. For example, when the sample size is much smaller than the input dimension \(d\), there are infinitely many \(\theta\) such that \( L(\theta \mid X, Y) = (Y - X\theta)^{T}(Y - X\theta) = 0, \) which leaves the estimated model with little predictive power on the testing dataset.
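To make the under-determined case concrete, here is a small NumPy sketch (again added for illustration; the dimensions, noise level, and sparse ground-truth \(\theta\) are arbitrary choices). With far fewer samples than dimensions, the least-squares fit drives the training loss \(L(\theta \mid X, Y)\) to essentially zero while the error on fresh test data remains large:

```python
import numpy as np

rng = np.random.default_rng(1)

# Under-determined regression: far fewer samples than input dimensions (n << d).
n, d = 20, 200
X = rng.normal(size=(n, d))

# Assumed ground truth: only the first 5 coordinates matter.
theta_true = np.zeros(d)
theta_true[:5] = 1.0
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# lstsq returns the minimum-norm solution among the infinitely many theta
# that drive the training loss (Y - X theta)^T (Y - X theta) to zero.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
train_loss = np.sum((y - X @ theta_hat) ** 2)

# Fresh test data from the same model reveals the loss of predictive power.
X_test = rng.normal(size=(1000, d))
y_test = X_test @ theta_true + rng.normal(scale=0.1, size=1000)
test_mse = np.mean((y_test - X_test @ theta_hat) ** 2)

print(f"training loss: {train_loss:.2e}")  # essentially zero
print(f"test MSE:      {test_mse:.2e}")    # far from zero: poor generalisation
```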

This article is from the free online course An Introduction to Machine Learning in Quantitative Finance, created on FutureLearn.
