Gradient Descent (GD) is a first-order iterative optimization algorithm for finding the local minimum of a function, which can be applied to tackle numerical optimization. We start with Gradient Descent (GD) method and explain the main idea and intuition behind it. The main idea is to find a local minimum of a function using GD by taking steps, which is proportional to the negative of the gradient of the function at the current point.

Intuitively, imagine that you are lost in the mountains in a dense fog, and you only feel the slope of the ground below your feet. A reasonable strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. Mathematically what we aim to do is to construct a convergent sequence of ((theta_{n})_{n = 0}^{infty}) such that (theta^{*} = lim_{n rightarrow infty} theta_{n},) where (theta^{*}) is a local minimum. A sufficient condition for such such ((theta_{n})_{n = 0}^{infty}) is that

[a] there exists an integer (N_{0}) large enough, such that ((f(theta_{n}))_{n geq N_{0}}) is a non-increasing sequence w.r.t (n);

[b] when (lim_{n rightarrow infty }theta_{n} = theta^{*}), (lim_{n rightarrow infty} triangledown f(theta_{n}) = 0.)

where (triangledown f(theta)) is the derivative of (f) at (theta), i.e. (triangledown L(theta) = (partial_{theta_{1}} L(theta),cdots, partial_{theta_{p}} L(theta)))..

The continuity of (triangledown L(theta)) and condition (b) imply that

[triangledown f(theta^{*}) = triangledown f(lim_{n rightarrow infty}theta_{n}) = 0,]

i.e. (f(theta^{*})) is the local minimum of (f).

In GD algorithm, we first Initialize (theta_{0}). For any integer (n geq 1), we update the ((n+1)^{th}) estimator (theta_{n+1}) based on the previous step estimator (theta_{n}) by

[boxed{theta_{n + 1} = theta_{n} – {color{blue}eta} triangledown L(theta_{n})},]

where ({color{blue}eta} >0) is a constant, which is also called the learning rate.

In the next, let us explain why the above update can fulfil the sufficient condition.

• When (eta) is small enough, by the first order Taylor expansion,

[L(theta_{n+1}) – L(theta_{n}) approx triangledown L(theta_{n})^{T} underbrace{(theta_{n+1} – theta_{n})}_{{color{red}- eta triangledown L(theta_{n} )}} = – eta left|triangledown L(theta_{n} )right|^{2}leq 0.]

It follows that for some (N), ((L(theta_{n}))_{n geq N}) is a decreasing sequence. – Suppose that ({theta_{n}}) is a convergent series, then

[lim_{n rightarrow infty} theta_{n+1} = lim_{n rightarrow infty} theta_{n} – eta lim_{n rightarrow infty} triangledown L(theta_{n} ) = theta_{*} = theta_{*} – eta lim_{n rightarrow infty} triangledown L(theta_{n} ).]

This implies that (lim_{n rightarrow infty}triangledown L(theta_{n}) = 0). If (triangledown L) is continuous, then it follows that (lim_{n rightarrow +infty} theta_{n}) is a local minimun.

The GD algorithm is often called the steepest gradient descent. Let us explain to you the reason behind this name. By Taylor expansion, we have that

[L(theta) approx L(theta_{0}) + triangledown L(theta_{0}) (theta – theta_{0}).]

The above Tayler expansion approximation (L(theta)) decreases fastest on the optimal direction, which is equivalent to the minimization of (triangledown L(theta_{0}) (theta – theta_{0})). We can show that the gradient direction (nabla L(theta_{0})) is the optimal direction, given the constraint that the distance between (theta_{0}) and (theta) is a positive constant (eta). Mathematically, it is equivalent to show the following statement:

Theorem: Let (theta^{*}) be the solution to the following constraint optimization problem

[hat{L}(theta):=triangledown L(theta_{0}) (theta – theta_{0}) rightarrow text{min},\ text{subject to}vertvert theta – theta_{0}vertvert_{2} = eta,]

then there exists (lambda_{*} in mathbb{R}) such that

[theta^{*} = theta_{0} – lambda_{*} triangledown L(theta_{0}),]

where (lambda_{*} = frac{eta}{vertvert triangledown L(theta_{0}) vertvert_{2}}).

Proof

This constraint optimization problem can be rewritten by the unconstrained problem with Lagrange multiplier:

[tilde{L}(theta, lambda) = triangledown L(theta_{0}) (theta – theta_{0}) – lambda (vertvert theta – theta_{0}vertvert_{2}^{2}- eta^{2}) rightarrow text{min},]

where (lambda in mathbb{R}).

Then the optimal ((theta^{*}, lambda^{*})) satisfies that

[triangledown tilde{L}(theta^{*}, lambda^{*}) = 0.]

Thus we have that

[triangledown L(theta_{0}) – 2lambda^{*} (theta^{*} – theta_{0}) = 0.]

By rearranging the above equation, we have the formula for (theta^{*}) as follows:

[theta^{*} = theta_{0} +frac{1}{2lambda^{*}} triangledown L(theta_{0})~~~~~~~~~~~~~~~~~(1)]

It is noted that as (lambda^{*}) is the scalar, the optimal direction (theta^{*}) from (theta_{0}) is along the gradient of (triangledown L(theta_{0})).

The only remaining part is to find the scalar (lambda^{*}). Equation (1) ensures that

[vertvert theta^{*} – theta_{0}vertvert_{2} = frac{1}{2vert lambda^*vert} vertvert triangledown L(theta_{0}) vertvert_{2} = eta.]

Thus it implies that (2vert lambdavert^* = frac{1}{eta} vertvert triangledown L(theta_{0}) vertvert_{2}). Then we have that (lambda^* =pm frac{1}{2eta} vertvert triangledown L(theta_{0}) vertvert_{2}). Thus there are only two possibilities for (lambda^{*}), which is either (frac{eta}{2} triangledown L(theta_{0})) and (-frac{eta}{2} triangledown L(theta_{0})). It follows that

[hat{L}(theta^{*}) = begin{cases} eta, & text{ if } lambda^{*} = frac{1}{2eta} triangledown L(theta_{0}); \ -eta, & text{ if } lambda^{*} = -frac{1}{2eta} triangledown L(theta_{0}). end{cases}] [square]

In conclusion, The summary of GD is given as follows.

• Goal: Find the local optimum (theta^{*}) to minimize a differentiable function (L).
• Algorithm: Initialize (theta_{0}). For (n = 1:N_{e}), (~~~~~~~theta_{n + 1} = theta_{n} – eta triangledown L(theta_{n})). where (N_{e}) is the maximum number of iterations and (eta) is the learning rate.
• Idea: We construct a sequence of ({theta_{n}}_{n geq 0}) such that
• For some (N), ((L(theta_{n}))_{n geq N}) is a decreasing sequence, i.e. & (~~~~~L(theta_{N}) geq L(theta_{N+1}) geq cdots);
• (lim_{n}triangledown L(theta_{n}) = 0). It implies that ({theta_{n}}_{n geq 0}) converges to the local minimum (theta_{*}).

The learning rate (eta) is an important hyperparameter in the GD algorithm. A hyperparameter is a model parameter whose value is set before the learning process begins. By contrast, the parameters of the model can be trained from data, like (theta). Most of machine learning algorithms require hyperparameters.

Figure 1(a) and 1(b) show that there is the trade-off between the scale of the learning rate; when the learning rate is too small, the convergence of the parameters ((theta_{n})_{n}) is relatively slow; however, if the learning rate is too large, there may be the possibility that ((theta_{n})_{n}) is bouncing between two valleys, which may also take a very long time to converge.

It is important to note that the GD algorithm can not ensure the global minimum in a general setting, which makes the initialization of the parameters and learning rate important. The GD algorithm may be stuck at some local minimum, which is depicted in Figure 1(b). In this case, a sufficiently large learning rate can help with escaping the local minimum.

1(a) The Effect of Learning Rate 1(b) Global Minimum and Local Minimum