# Summary of Week 3

# Artificial Neural Network

## Network Architecture

An Artificial Neural Network (ANN) is composed of an input layer, one or more hidden layers, and an output layer.

• Input layer \(h^{(0)}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{n_{0}}\): \(h^{(0)}(x) = x\).
• Hidden layers \(h^{(l)}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{n_{l}}\), \(\forall l \in \{1, 2, \cdots, L-1\}\).
• Output layer \(h^{(L)}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{e}\).

Here \((h^{(l)})_{l = 1}^{L}\) is defined recursively: for any \(x \in \mathbb{R}^{d}\),

\[z^{(l+1)}(x) = W^{(l)} h^{(l)}(x) + b^{(l)},\]

\[h^{(l+1)}(x) = \sigma(z^{(l+1)}(x)).\]

Here \(W^{(l)}\) is an \(n_{l+1} \times n_{l}\) matrix, \(b^{(l)}\) is an \(n_{l+1}\)-dimensional vector, and \(\sigma\) is an activation function, e.g. the sigmoid function \(\sigma(x) = \frac{1}{1+e^{-x}}\). The parameter set is \(\theta = (W^{(l)}, b^{(l)})_{l = 0}^{L-1}\).
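The recursion above can be sketched in NumPy as follows. The layer sizes, random weights, and the choice of sigmoid activation in the toy example are illustrative assumptions, not values from the notes.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, params):
    """Compute h^{(L)}(x), where params = [(W^{(0)}, b^{(0)}), ..., (W^{(L-1)}, b^{(L-1)})]
    and each W^{(l)} is an n_{l+1} x n_l matrix (so z = W h)."""
    h = x  # h^{(0)}(x) = x
    for W, b in params:
        z = W @ h + b    # z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)}
        h = sigmoid(z)   # h^{(l+1)} = sigma(z^{(l+1)})
    return h

# Toy network (assumed sizes): d = 3 inputs, one hidden layer of 4 units, 2 outputs.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
          (rng.standard_normal((2, 4)), rng.standard_normal(2))]
y_hat = forward(np.array([1.0, -0.5, 2.0]), params)
print(y_hat.shape)  # (2,)
```

Since the sigmoid maps into \((0, 1)\), every entry of the output vector lies in that interval.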

## Neural Network Building Block

The building block of artificial neural networks is given in the following figure. Let \(W^{(l)}_{i, j}\) denote the weight from the \(i^{th}\) incoming node at the \(l^{th}\) layer to the \(j^{th}\) outgoing node at the \((l+1)^{th}\) layer, and \(b^{(l)}_j\) the bias term of the \(j^{th}\) outgoing node at the \((l+1)^{th}\) layer. The \(l^{th}\) and \((l+1)^{th}\) layers are fully connected. More precisely, for any \(l \in \{2, \cdots, L-1\}\),

\[h^{(l)}_{j} = \sigma(z^{(l)}_{j}),\]

\[z^{(l+1)}_{j} = \sum_{i = 1}^{n_{l}} W_{i, j}^{(l)} h^{(l)}_{i} + b_{j}^{(l)},\]

where \(j \in \{1, \cdots, n_{l+1}\}\) and \(W_{i, j}^{(l)}\) is the weight from incoming node \(i\) at layer \(l\) to outgoing node \(j\) at layer \(l+1\).

Figure 1: The illustration of a neural network building block.

## Model Training

There are three major types of gradient descent-based optimization methods for optimizing the model parameters of neural networks:

• Batch gradient descent (BGD);
• Stochastic gradient descent (SGD);
• Mini-batch gradient descent (Mini-batch GD).

Batch gradient descent (BGD):

• Goal: Find a local optimum \(\theta\) that minimizes the loss function \(L(\theta \vert \mathcal{D})\) in the following additive form: \(L(\theta \vert \mathcal{D}) = \frac{1}{N} \sum_{i = 1}^{N} Q_{\theta}(x_{i}, y_{i})\).
• Algorithm: \(\theta_{n+1} = \theta_{n} - \eta \nabla L(\theta_{n} \vert \mathcal{D})\), with the gradient term \(\nabla L(\theta_{n} \vert \mathcal{D}) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{\theta} Q_{\theta_{n}}(x_{i}, y_{i})\).
• Idea: Direct application of gradient descent to the empirical loss function.
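The BGD update can be sketched for one concrete choice of per-sample loss. The notes leave \(Q\) generic; the least-squares loss \(Q_{\theta}(x_i, y_i) = \frac{1}{2}(\theta^{\top} x_i - y_i)^2\), the toy data, and the step size below are all illustrative assumptions.

```python
import numpy as np

def batch_gd(X, y, eta=0.1, steps=100):
    """Batch gradient descent for an assumed least-squares loss
    Q_theta(x_i, y_i) = 0.5 * (theta . x_i - y_i)^2."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        # Full-batch gradient: (1/N) * sum_i grad_theta Q_theta(x_i, y_i)
        grad = X.T @ (X @ theta - y) / N
        theta -= eta * grad  # theta_{n+1} = theta_n - eta * grad L(theta_n | D)
    return theta

# Noiseless toy data generated from theta* = [2, -1]; BGD recovers it.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
print(batch_gd(X, y, eta=0.2, steps=500))
```

Note that every step touches all \(N\) samples, which is exactly what the update frequency/complexity trade-off in the comparison below refers to.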

Stochastic gradient descent (SGD):

• Algorithm: At each step \(n\), randomly choose an index \(i_{n}\) from \(\{1, \cdots, N\}\), and update the weights \(\theta_{n}\) as follows:

\[\theta_{n+1} = \theta_{n} - {\color{blue}\eta_{n}} \underbrace{\nabla_{\theta} Q_{\theta_{n}}(x_{i_{n}}, y_{i_{n}})}_{\text{stochastic gradient term}},\]

for a suitably chosen decreasing step-size sequence \(\{{\color{blue}\eta_{n}}\}_{n}\).

• Idea: \(\mathbb{E}_{x, y}[\nabla_{\theta} Q_{\theta}(x_{i_{n}}, y_{i_{n}})] = \nabla L(\theta \vert \mathcal{D})\), where \((x_{i_{n}}, y_{i_{n}})\) is sampled from the empirical distribution of \((x_{i}, y_{i})_{i = 1}^{N}\).
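The SGD loop can be sketched on the same illustrative least-squares loss. The specific decreasing schedule \(\eta_n = 0.1 / (1 + 0.001\,n)\) is an assumption; the notes only require that the sequence decrease suitably.

```python
import numpy as np

def sgd(X, y, steps=2000, seed=0):
    """SGD for the assumed loss Q_theta(x_i, y_i) = 0.5 * (theta . x_i - y_i)^2,
    with an assumed decreasing step-size sequence eta_n."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for n in range(steps):
        i = int(rng.integers(N))                 # draw i_n uniformly from {1, ..., N}
        eta_n = 0.1 / (1.0 + 0.001 * n)          # decreasing step sizes (assumed schedule)
        grad_i = (X[i] @ theta - y[i]) * X[i]    # stochastic gradient term
        theta -= eta_n * grad_i                  # theta_{n+1} = theta_n - eta_n * grad
    return theta

# Same noiseless toy data as for BGD; SGD also approaches theta* = [2, -1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
print(sgd(X, y))
```

Each step costs one gradient evaluation instead of \(N\), at the price of a noisier (lower-fidelity) gradient estimate.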

Mini-batch gradient descent (Mini-batch GD) is another variant of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients. Mini-batch GD can be viewed as a combination of BGD and SGD.

At each iteration, instead of going over all examples, Mini-batch GD updates the parameters based on a subset of samples (a mini-batch) of a given batch size \(b\). When \(b = 1\), Mini-batch GD reduces to SGD; when \(b = N\), it reduces to BGD. In Mini-batch GD, the typical way of creating mini-batches consists of two steps:

• Shuffle the dataset to remove any existing ordering of the samples.
• Split the entire training dataset into non-overlapping mini-batches of batch size \(b\). If the sample size is not divisible by the batch size, the remaining samples form a final, smaller batch.

Then we apply a gradient descent update to each mini-batch until all samples have been processed (this is called one epoch); we repeat this procedure until the number of epochs reaches the maximum epoch number \(N_{e}\).
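The two mini-batch creation steps above can be sketched as follows; the dataset and batch size are toy values chosen so that the final batch is smaller, as described in the text.

```python
import numpy as np

def make_minibatches(X, y, b, seed=0):
    """Step 1: shuffle. Step 2: split into non-overlapping mini-batches of size b.
    If len(X) is not divisible by b, the remaining samples form a smaller final batch."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))               # shuffle sample indices
    return [(X[idx[s:s + b]], y[idx[s:s + b]])  # consecutive index slices
            for s in range(0, len(X), b)]

# 10 samples with batch size b = 4 -> batches of sizes 4, 4, 2.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
batches = make_minibatches(X, y, b=4)
print([len(yb) for _, yb in batches])  # [4, 4, 2]
```

Each sample lands in exactly one batch per epoch, so iterating over the returned list once is one epoch.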

## Comparison of BGD, SGD and Mini-batch GD

The comparison table of the above three methods is given as follows:

| | BGD | SGD | Mini-batch GD |
| --- | --- | --- | --- |
| Update frequency | Low | High | Medium |
| Update complexity | High | Low | Medium |
| Fidelity of error gradient | High | Low | Medium |
| Getting stuck in a local minimum | Easy | Difficult | Difficult |
| Easy to converge | Yes | No | No |

## Backpropagation

The hierarchical structure of an ANN enables efficient, inductive calculation of its gradients; this is called backpropagation. We summarize the backpropagation algorithm as follows:

Input: \({\color{blue}\theta}\) and \((x, y)\).

Forward Phase:

For each \(l \in \{0, 1, \cdots, L-1\}\), compute

\[z^{(l+1)}(x) = {\color{blue}W^{(l)}} h^{(l)}(x) + {\color{blue}b^{(l)}},\]

\[h^{(l+1)}(x) = \sigma(z^{(l+1)}(x)).\]

Backward Phase:

For \(l = L\),

\[\delta^{(L)} = \partial_{z^{(L)}} Q_{\theta}(x, y) = \partial_{z^{(L)}} Q(h^{(L)}(x) - y) = Q'(h^{(L)}(x) - y) \odot \sigma'(z^{(L)}(x));\]

For \(l = L-1, L-2, \cdots, 1\),

\[\delta^{(l)} = \sigma'(z^{(l)}(x)) \odot (W^{(l)} \delta^{(l+1)});\]

\[\partial_{W^{(l)}_{i, j}} Q_{\theta}(x, y) = h_{i}^{(l)} \delta^{(l+1)}_{j};\]

\[\partial_{b^{(l)}_{j}} Q_{\theta}(x, y) = \delta^{(l+1)}_{j}.\]

(Note that \(\delta^{(l)} \in \mathbb{R}^{n_l}\), so the elementwise factor must be \(\sigma'(z^{(l)}(x))\), and \(W^{(l)} \delta^{(l+1)}\) is read with the index convention \(W^{(l)}_{i,j}\): incoming node \(i\), outgoing node \(j\).)

Output: \(\nabla_{\theta} Q_{\theta}(x, y)\).
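The two phases can be sketched end-to-end for a small sigmoid network. The squared-error choice \(Q = \frac{1}{2}\|h^{(L)}(x) - y\|^2\) and the 2-3-1 layer sizes are illustrative assumptions; note that the code stores \(W^{(l)}\) as an \(n_{l+1} \times n_l\) matrix (so \(z = Wh\)), which is why a transpose appears in the backward recursion where the notes' \(W^{(l)}_{i,j}\) index convention has none.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}); note sigma'(x) = sigma(x) * (1 - sigma(x))
    return 1.0 / (1.0 + np.exp(-x))

def backprop(params, x, y):
    """Forward + backward phases for the assumed loss Q = 0.5 * ||h^{(L)}(x) - y||^2.
    params = [(W^{(0)}, b^{(0)}), ..., (W^{(L-1)}, b^{(L-1)})], W^{(l)} of shape
    n_{l+1} x n_l. Returns [(dQ/dW^{(l)}, dQ/db^{(l)})] in layer order."""
    # Forward phase: cache every z^{(l)} and h^{(l)}.
    hs, zs = [x], []
    for W, b in params:
        zs.append(W @ hs[-1] + b)
        hs.append(sigmoid(zs[-1]))
    # Backward phase, l = L: delta^{(L)} = Q'(h^{(L)} - y) ⊙ sigma'(z^{(L)}).
    delta = (hs[-1] - y) * sigmoid(zs[-1]) * (1.0 - sigmoid(zs[-1]))
    grads = []
    for l in range(len(params) - 1, -1, -1):
        grads.append((np.outer(delta, hs[l]), delta))  # dQ/dW^{(l)}_{j,i} = delta_j h_i; dQ/db = delta
        if l > 0:
            sp = sigmoid(zs[l - 1]) * (1.0 - sigmoid(zs[l - 1]))
            delta = sp * (params[l][0].T @ delta)  # delta^{(l)} = sigma'(z^{(l)}) ⊙ (W^T delta^{(l+1)})
    return grads[::-1]

# Toy 2 -> 3 -> 1 network with random parameters.
rng = np.random.default_rng(1)
params = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
grads = backprop(params, np.array([0.5, -1.0]), np.array([0.2]))
print([g[0].shape for g in grads])  # [(3, 2), (1, 3)]
```

A standard sanity check is to compare one entry of the returned gradient against a finite-difference approximation of the loss; the two should agree to several decimal places.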