This content is taken from The Open University & Persontyle's online course, Advanced Machine Learning.
Statistical Models, Loss Functions and Training as Optimization

Statistical Models

A statistical model is simply a mathematical function. Different models are functions of different forms, but all require the specification of a number of parameters that are found in the function. We will normally use $$\beta_i$$ to represent the ith parameter of a model. We can think of the general form of statistical models arising from supervised learning as being one of:

$\hat Y=f(X,\beta)$

$\hat P(Y)=f(X,\beta)$

Where different model types utilize different functions for $$f$$.

Note that although this article works with supervised learning cases, the process discussed applies to all types of learning.

The distinction between a model that outputs a 'point estimate' of Y, giving the estimated value(s) of the target variable(s), and one that outputs a probability distribution over the values of the target variable(s) is not strictly necessary: we can think of the former as a special case of the latter, where all the probability is located at a single value. Nonetheless, the distinction is often important in practice, so we emphasise it here.

As concrete examples, we examine two models that you should already be familiar with: linear regression and logistic regression. With a single input feature and a single target variable (the target variable in the logistic regression case is binary), they take the forms:

Linear Regression: $$\hat Y = \beta_0 + \beta_1 X$$

Logistic Regression: $$\hat P(Y=1 \mid X)=\frac{1}{1+e^{-(\beta_0 + \beta_1 X)}}$$

We remind you that logistic regression can also be understood as a linear classifier (rather than a regression model of the probability) with the form:

Logistic Regression: $$\hat Y = \mathbf{I}(\beta_0 + \beta_1 X \geq 0)$$

Where $$\mathbf{I}(p)=1$$ if $$p$$ is true and 0 otherwise. For the remainder of this step we assume the first form (which is in any case used in the optimization process to find the parameters of the second form).
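As an illustration, the three single-feature forms above can be written as ordinary functions. This is a minimal sketch using only the Python standard library; the function names and the parameter values are hypothetical and chosen purely for illustration.

```python
import math

def linear_regression(x, beta0, beta1):
    """Point estimate: Y_hat = beta0 + beta1 * x."""
    return beta0 + beta1 * x

def logistic_regression(x, beta0, beta1):
    """Probability estimate: P_hat(Y=1 | x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def logistic_classifier(x, beta0, beta1):
    """Classifier form: 1 if beta0 + beta1 * x >= 0, otherwise 0."""
    return 1 if beta0 + beta1 * x >= 0 else 0

# Illustrative parameter values, chosen arbitrarily.
beta0, beta1 = -1.0, 2.0
for x in (0.0, 0.5, 2.0):
    p = logistic_regression(x, beta0, beta1)
    # The classifier outputs 1 exactly when the estimated probability is at least 0.5.
    assert logistic_classifier(x, beta0, beta1) == (1 if p >= 0.5 else 0)
    print(x, linear_regression(x, beta0, beta1), round(p, 3))
```

Changing `beta0` and `beta1` gives a different model of the same form, which is the sense in which each formula describes infinitely many models.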

We see that the two types of statistical models have different forms, and also that each formula will produce infinitely many different models depending on the values assigned to the parameters.

Deciding upon the values to assign to the parameters of a model based on available data is the process of learning a model from, or fitting a model to, data.

Training Models

Selecting values for model parameters given data is an optimization problem. This process requires the specification of a loss function, $$L(\beta,X,Y)$$. Examples include the mean squared error (MSE) and the negative log likelihood (NLL):

$L_{MSE}(\beta,X,Y)=\frac{1}{N}\sum_{i=1}^{N}(f(X_i,\beta)-Y_i)^2$

$L_{NLL}(\beta,X,Y)=-\log(P(Y|X,\beta))$
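To make the two loss functions concrete, here is a minimal sketch for the single-feature linear and logistic models. It assumes the training cases are independent, so that the negative log likelihood of binary targets becomes a sum over cases; the function names and the toy data are hypothetical.

```python
import math

def predict_linear(x, beta):
    return beta[0] + beta[1] * x

def predict_logistic(x, beta):
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))

def mse_loss(beta, xs, ys):
    """Mean of the squared differences between predictions and targets."""
    n = len(xs)
    return sum((predict_linear(x, beta) - y) ** 2 for x, y in zip(xs, ys)) / n

def nll_loss(beta, xs, ys):
    """Negative log likelihood of binary targets, assuming independent cases."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_logistic(x, beta)
        # Each case contributes -log of the probability assigned to its observed label.
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy data, made up for illustration.
xs = [0.0, 1.0, 2.0]
print(mse_loss((1.0, 2.0), xs, [1.0, 3.0, 5.0]))  # parameters that fit these targets exactly
print(mse_loss((0.0, 2.0), xs, [1.0, 3.0, 5.0]))  # a worse choice gives a larger loss
print(nll_loss((0.0, 0.0), xs, [0, 1, 1]))
```

Both functions take the parameters as their first argument, which mirrors the view of the loss as a function of $$\beta$$ for fixed data.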

In the simplest case, we seek values for the parameters that minimize the loss function on the training data:

$\hat{\beta}=\arg\min_{\beta}L(\beta,X,Y)$

In the cases we will look at in this course, such optimization problems will be solvable either analytically or numerically. We give examples of both cases here.

For instance, optimal parameter values for a linear regression model given an MSE loss function and given data (X and Y matrices) can be found analytically, in the familiar derivation (we drop the constant factor $$\frac{1}{N}$$, which does not change the minimizing $$\beta$$):

$L_{MSE}(\beta)= ( Y - X \beta )^T ( Y - X \beta )$

$= Y^TY - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta$

To find the optimal $$\beta$$ values, we set the derivative to 0:

$\frac{dL}{d\beta}= 2X^TX\beta - 2X^T Y = 0$

Therefore, provided $$X^TX$$ is invertible:

$2X^TX\beta = 2X^T Y$

$X^TX\beta = X^T Y$

$\beta = (X^TX)^{-1} X^T Y$
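As a worked instance of this formula, the sketch below handles the single-feature case, assuming the X matrix has a column of ones (for $$\beta_0$$) followed by the feature values. In that case $$X^TX$$ is a 2×2 matrix that can be inverted by hand. The function name and data are hypothetical, and with many features one would normally rely on a linear algebra library instead.

```python
def fit_linear_regression(xs, ys):
    """Solve X^T X beta = X^T Y for a design matrix with columns [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Here X^T X = [[n, sx], [sx, sxx]] and X^T Y = [sy, sxy].
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: the x values must not all be equal")
    # Apply the inverse of the 2x2 matrix X^T X to the vector X^T Y.
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1

# Toy data lying exactly on the line y = 1 + 2x.
beta0, beta1 = fit_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(beta0, beta1)
```

The singularity check corresponds to the requirement that $$X^TX$$ be invertible: with a single feature, this fails only when every x value is the same.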

You should memorize the final formula, since we will see variants on it in a number of places later.

Optimal (or near optimal) solutions to the optimization problem for logistic regression, on the other hand, require a numerical method. We briefly review a basic gradient descent algorithm that could be used to solve the logistic regression optimization problem:

First, a constant learning rate, R, or a learning rate schedule, R(i), where i is the step of the algorithm, is specified. We assume a constant rate for simplicity.

1. Initial random parameter values are assigned.

2. Repeat until convergence:

i. Calculate the partial derivatives of the loss function with respect to the parameters.

ii. Update the parameter values given their partial derivatives and the learning rate: $$\beta = \beta - \frac{dL}{d\beta} \cdot R$$
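The steps above can be sketched for single-feature logistic regression as follows. This is an illustration under stated assumptions rather than a reference implementation: the loss is taken to be the mean negative log likelihood, whose partial derivatives are the averages of $$(\hat P_i - Y_i)$$ and $$(\hat P_i - Y_i)X_i$$; convergence is taken to mean a small gradient norm; and the learning rate, tolerance, step limit and data are arbitrary choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(beta, xs, ys):
    """Partial derivatives of the mean NLL with respect to beta0 and beta1."""
    n = len(xs)
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(beta[0] + beta[1] * x) - y
        g0 += err / n
        g1 += err * x / n
    return g0, g1

def gradient_descent(xs, ys, rate=0.5, tol=1e-6, max_steps=10000, seed=0):
    rng = random.Random(seed)
    # Step 1: assign initial random parameter values.
    beta = [rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)]
    # Step 2: repeat until the gradient is small or max_steps is reached.
    for _ in range(max_steps):
        # Step 2.i: partial derivatives of the loss with respect to the parameters.
        g = gradient(beta, xs, ys)
        if math.hypot(g[0], g[1]) < tol:
            break
        # Step 2.ii: move each parameter against its partial derivative.
        beta[0] -= rate * g[0]
        beta[1] -= rate * g[1]
    return beta

# Small toy data set (made up); the classes overlap, so a finite optimum exists.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
beta = gradient_descent(xs, ys)
print(beta, gradient(beta, xs, ys))
```

The learning rate matters: too large a value can make the updates overshoot and diverge, while too small a value makes convergence slow.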

In fact, logistic regression would normally use second-order information (estimating the Hessian). But we will see this basic gradient descent algorithm again soon, in circumstances where it would be infeasible to calculate or even estimate second-order information. So try to get comfortable with it.