# Derivation of linear regression

In this article, we walk through the derivation of the ordinary least squares (OLS) estimator for the parameters of a linear regression model.

One great advantage of OLS is that it yields an analytic formula for the optimal model parameters. Let \(\hat{\theta}\) denote the estimator for the linear coefficients \(\theta\) of the linear regression model. A short matrix computation gives the formula for \(\hat{\theta}\) stated in the following lemma.

Lemma 1 (OLS estimator). Let \(\mathcal{D} = (X, Y)\) denote the collection of input-output pairs. Assume the standard setting of the OLS model holds and that \(X^{T}X\) is invertible. Then the estimator of the optimal parameter \(\hat{\theta}\) is given by

\[\hat{\theta} = (X^{T}X)^{-1}X^{T}Y.\]
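As a quick numerical sanity check of the lemma, the sketch below (with hypothetical synthetic data and sizes of my choosing) computes the closed-form estimator and compares it against NumPy's least-squares solver:

```python
import numpy as np

# Hypothetical synthetic data: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.5, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=100)

# Closed-form OLS estimator from the lemma: (X^T X)^{-1} X^T Y.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Cross-check against NumPy's built-in least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))  # True
```

Both routes recover essentially the same coefficients, and with low noise they land close to `theta_true`.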

Proof: Recall that the loss function of OLS is

\[L(\theta \vert X, Y) = (Y - X\theta)^{T}(Y - X\theta) = Y^{T}Y - 2\theta^{T}X^{T}Y + \theta^{T}X^{T}X\theta.\]
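The expansion of the quadratic form can be verified numerically on arbitrary data; the snippet below (with made-up shapes) checks that the residual sum of squares matches the expanded expression term by term:

```python
import numpy as np

# Arbitrary data and parameter vector (hypothetical sizes).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
theta = rng.normal(size=4)

# Left-hand side: residual sum of squares (Y - X theta)^T (Y - X theta).
lhs = (Y - X @ theta) @ (Y - X @ theta)

# Right-hand side: Y^T Y - 2 theta^T X^T Y + theta^T X^T X theta.
rhs = Y @ Y - 2 * theta @ X.T @ Y + theta @ X.T @ X @ theta

print(np.isclose(lhs, rhs))  # True
```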

Here we use the fact that \(Y^{T}X\theta = (X\theta)^{T}Y = \theta^{T}X^{T}Y\), since both sides are scalars and the transpose of a scalar is itself. Note that the loss function of OLS is a quadratic function of the parameter \(\theta\); since \(X^{T}X\) is positive definite when it is invertible, the loss is strictly convex, which ensures the existence and uniqueness of the global minimum. Moreover, \(L(\theta \vert X, Y)\) is differentiable with respect to \(\theta\), so the optimal parameter \(\hat{\theta}\) must satisfy that the derivative of \(L(\theta \vert X, Y)\) evaluated at \(\theta = \hat{\theta}\) equals zero, i.e.

\[\frac{\partial L(\theta \vert X, Y)}{\partial \theta}\Big\vert_{\theta = \hat{\theta}} = 0.\]

Differentiating the quadratic form above with respect to \(\theta\) gives

\[\frac{\partial L(\theta \vert X, Y)}{\partial \theta} = -2X^{T}Y + 2X^{T}X\theta.\]

Setting \(\frac{\partial L(\theta \vert X, Y)}{\partial \theta}\) to zero, we obtain a linear system of equations for \(\hat{\theta}\), i.e.

\[-2X^{T}Y + 2X^{T}X\hat{\theta} = 0.\]

Since \(X^{T}X\) is invertible by assumption, the above equation implies that

\[\hat{\theta} = (X^{T}X)^{-1}X^{T}Y. \qquad \square\]
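In practice, one solves the normal equations \(X^{T}X\hat{\theta} = X^{T}Y\) directly rather than forming the explicit inverse, which is both cheaper and numerically safer. A minimal sketch, using hypothetical random data:

```python
import numpy as np

# Hypothetical data: 80 samples, 3 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
Y = rng.normal(size=80)

# Solve X^T X theta = X^T Y directly (no explicit matrix inverse).
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The derivative -2 X^T Y + 2 X^T X theta from the proof
# vanishes at theta_hat, confirming it is the stationary point.
grad = -2 * X.T @ Y + 2 * X.T @ X @ theta_hat
print(np.allclose(grad, 0))  # True
```

`np.linalg.solve` uses an LU factorization of \(X^{T}X\), avoiding the extra error introduced by computing \((X^{T}X)^{-1}\) explicitly.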