Linear regression

An article describing linear regression in more detail.

As we saw in the video, linear regression is the machine learning equivalent of drawing a straight line through your data.

In one dimension

The simplest linear regression is when we have a single explanatory variable (x) that we want to use to predict the value of a single response variable (y). Linear regression assumes that a good model for this data is a straight line, which we can write mathematically as:

[y = p_0 + p_1x]

Here (p_0) and (p_1) are constants: (p_0) is the y-intercept, which is the value of (y) when (x) is zero, and (p_1) is the gradient of the line.

The aim of linear regression is to find the straight line that best fits the data. In one dimension, this boils down to finding the two numbers (p_0) and (p_1) that give the best fit.

Measuring the fit – the residuals

So how do we measure how well a line fits the data? We add up the squares of the so-called residual values for every data point; the smaller this sum, the better the fit. The residual value of a data point is its vertical distance from the straight line given by the model, that is, the difference between the observed value and the value the model predicts.

So if we have (n) observations in our dataset and label them (x_i) and (Y_i) for (i = 1, …, n), each residual value is:

[Y_i - y_i = Y_i - p_0 - p_1x_i]

where (y_i) is the value predicted by the model at point (i). If we label the measure of the fit as (F) we have the following equation:

[F = \sum_{i=1}^n (Y_i - y_i)^2]

The aim of linear regression in one dimension is to find the values of (p_0) and (p_1) for which (F) is smallest.
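To make the definition of (F) concrete, here is a minimal sketch in plain Python. The data values and the function name sum_squared_residuals are made up for illustration; they are not part of the course material.

```python
# Hypothetical toy data: five observations (x_i, Y_i).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.1, 2.9, 5.2, 6.8, 9.1]

def sum_squared_residuals(p0, p1, x, Y):
    """Return F, the sum of squared residuals for the line y = p0 + p1 * x."""
    total = 0.0
    for x_i, Y_i in zip(x, Y):
        y_i = p0 + p1 * x_i          # value predicted by the model at point i
        total += (Y_i - y_i) ** 2    # squared residual for point i
    return total

# Compare two candidate lines: the one with the smaller F fits better.
F_good = sum_squared_residuals(1.0, 2.0, x, Y)
F_poor = sum_squared_residuals(0.0, 1.0, x, Y)
print(F_good, F_poor)
```

For this toy data, the line with (p_0 = 1) and (p_1 = 2) gives a much smaller (F) than the line with (p_0 = 0) and (p_1 = 1), so it is the better fit of the two.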

Linear regression in more than one dimension

Of course, many datasets have more than one explanatory variable, but linear regression still works in the same way. Rather than (y) depending on a single explanatory variable (x), we might have a set of (m) features (x_1, x_2, …, x_m), and assume (y) is a linear combination of these features with the parameters (p_0, p_1, …, p_m):

[y = p_0 + p_1x_1 + p_2x_2 + … + p_mx_m]

But beyond that, the process is the same. The function for (F) is the same:

[F = \sum_{i=1}^n (Y_i - y_i)^2]

but rather than finding just two parameters (p_0) and (p_1), we need to find (m+1) parameters (p_0, p_1, … , p_m) to fit the model.
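The same calculation can be sketched for several features using NumPy, which is assumed to be installed. The data, the candidate parameters and the helper names predict and fit_measure are hypothetical choices for illustration only.

```python
import numpy as np

# Hypothetical data: n = 4 observations of m = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
Y = np.array([9.0, 5.0, 9.0, 15.0])

def predict(p, X):
    """Evaluate y = p_0 + p_1*x_1 + ... + p_m*x_m for every row of X."""
    return p[0] + X @ p[1:]

def fit_measure(p, X, Y):
    """Return F, the sum of squared residuals for the parameters p."""
    residuals = Y - predict(p, X)
    return float(np.sum(residuals ** 2))

p = np.array([1.0, 2.0, 3.0])   # one candidate choice of (p_0, p_1, p_2)
print(predict(p, X))
print(fit_measure(p, X, Y))
```

The only change from the one-dimensional case is that each prediction is now a weighted sum over a whole row of the features matrix.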

Polynomial regression

As we mentioned in the video, it is possible to use linear regression to fit quadratic, cubic and even higher order polynomial curves to data. This uses the trick of adding extra features to your dataset that are just calculated as the powers of the original data.

For example, you might choose to try and fit a quadratic equation to your data rather than a straight line:

[y = p_0 + p_1x_1 + p_2x_1^2]

To do this using linear regression you can just add an extra column to your data matrix, labelled (x_2), that is just the square of all the values in the first column. Then, fitting your model is as simple as fitting a linear regression model to the equation in two variables: (y = p_0 + p_1x_1 + p_2x_2)
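As a rough sketch of this trick, the example below builds the extra squared column by hand and fits it with scikit-learn. The data is hypothetical and generated without noise from a known quadratic, so the fitted parameters should come out close to the values used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated exactly from y = 1 + 2*x + 3*x^2.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x1 ** 2

# Build the data matrix with the extra column x_2 = x_1 squared.
X = np.column_stack([x1, x1 ** 2])

# Fit an ordinary linear regression in the two variables x_1 and x_2.
quad_reg = LinearRegression().fit(X, y)
print(quad_reg.intercept_)  # estimate of p_0
print(quad_reg.coef_)       # estimates of p_1 and p_2
```

The model is still linear in its parameters; only the features have been transformed.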

How is (F) minimized?

There is a range of methods for finding the best model (i.e. the parameter values that minimize (F)), which we won’t go into in detail here. These include:

• The Normal equation – this is an algebraic method of finding the best fit directly. The main drawback is the computation time for large datasets with lots of variables.
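• Gradient descent – this is an algorithm that takes an initial estimate and iteratively improves it by stepping in the direction of steepest descent of (F). This method can be faster for large datasets with lots of variables, but it can fail to converge if the step size is poorly chosen. In general, gradient descent can also get stuck in so-called local minima, although the sum of squares (F) used here is convex, so any local minimum is also a global minimum.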
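The two approaches listed above can be compared on a tiny problem. This is an illustrative sketch using NumPy, not how any particular library implements them; the data, the learning rate of 0.01 and the 5000 iterations are assumptions chosen so that this small example converges.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones, so p_0 acts as the intercept.
A = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (A^T A) p = A^T Y directly.
p_normal = np.linalg.solve(A.T @ A, A.T @ Y)

# Gradient descent: start from zeros and repeatedly step against the gradient of F.
p_gd = np.zeros(2)
learning_rate = 0.01   # assumed step size; too large a value makes the steps diverge
for _ in range(5000):
    residuals = Y - A @ p_gd
    gradient = -2.0 * A.T @ residuals   # gradient of F with respect to p
    p_gd = p_gd - learning_rate * gradient

print(p_normal)
print(p_gd)
```

Both methods should arrive at nearly the same parameters here. The difference lies in cost: the normal equation solves a linear system whose size grows with the number of features, while gradient descent trades that for many cheap iterations.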

Linear regression in scikit-learn

In practice, you won’t need to worry about how to minimize (F), as you will use a software implementation such as scikit-learn, which provides the LinearRegression class in sklearn.linear_model. Assuming you have a features matrix (X) and target values (y) in memory, you can fit a linear regression model with the following code:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.coef_)       # displays the gradient parameters
print(lin_reg.intercept_)  # displays the y-intercept value
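For a version that runs on its own, the sketch below first creates a small features matrix and target vector. The data is hypothetical and noise-free, so the fitted parameters should closely match the ones used to generate it. It also shows the predict method, which applies the fitted model to new observations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data from y = 4 + 3*x_1 - 2*x_2.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.coef_)       # gradient parameters p_1 and p_2
print(lin_reg.intercept_)  # y-intercept p_0

# Predict the response for a new observation with x_1 = 3 and x_2 = 2.
X_new = np.array([[3.0, 2.0]])
y_new = lin_reg.predict(X_new)
print(y_new)
```

Note that predict expects a two-dimensional array with one row per observation, even when there is only a single new observation.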