# Backpropagation through time (BPTT)

Backpropagation through time is the algorithm used to optimize the model parameters of recurrent neural networks (RNNs). Watch Hao Ni explain more.

In this video, Hao introduces backpropagation through time, an efficient algorithm for computing the gradients of RNNs, and walks through its step-by-step derivation.

As with DNNs, RNNs are trained by minimizing a loss function via stochastic or mini-batch gradient descent. Backpropagation through time is an efficient algorithm for computing the required gradients.

The main challenge in computing the gradients of an RNN model lies in the recurrence structure of the hidden neurons.

To be more specific, let us consider the task of predicting a sequential output using RNNs. First, recall the network architecture of RNNs. Let \(x_t\), \(s_t\) and \(o_t\) denote the input neuron, hidden neuron and output neuron at time \(t\) respectively, satisfying the following equations:

• \(s_{t+1} = g(Ux_t + Ws_t),\)
• \(o_t = h(V s_t),\)

where \(U\), \(W\) and \(V\) are model parameters and \(g\) and \(h\) are activation functions.
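To make the architecture concrete, here is a minimal NumPy sketch of the forward unrolling. The choices \(g = \tanh\), \(h = \text{identity}\), and the dimensions and random weights are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Illustrative dimensions and random weights (assumptions, not from the lecture)
d_in, d_hidden, d_out = 3, 5, 2
rng = np.random.default_rng(0)
U = rng.normal(size=(d_hidden, d_in))      # input-to-hidden weights
W = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights
V = rng.normal(size=(d_out, d_hidden))     # hidden-to-output weights

def forward(xs, s0):
    """Unroll the RNN: o_t = h(V s_t) with h = identity,
    s_{t+1} = g(U x_t + W s_t) with g = tanh."""
    states, outputs = [s0], []
    for x in xs:
        outputs.append(V @ states[-1])                  # o_t = V s_t
        states.append(np.tanh(U @ x + W @ states[-1]))  # s_{t+1} = tanh(U x_t + W s_t)
    return states, outputs
```

Running `forward` on a length-\(T\) input sequence returns the hidden states \(s_1, \dots, s_{T+1}\) and the outputs \(o_1, \dots, o_T\).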

The loss function often takes the following additive form:

\[\text{Loss Function:} \quad L(\bar{o}, \bar{y}) = \sum_{t=1}^{T} E(o_{t}, y_{t}),\]

where \(y_{t}\) and \(o_{t}\) are the actual and estimated outputs at time \(t\) respectively, \(\bar{o} = (o_{t})_{t=1}^{T}\), \(\bar{y} = (y_{t})_{t=1}^{T}\) and \(E(o, y) := \Vert o - y \Vert_2^2\). We denote the error term at time \(t\) by \(E_t := E(o_t, y_t)\).
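In code, this additive loss is a straightforward sum of per-step squared errors. A minimal sketch following the definitions above:

```python
import numpy as np

def E(o, y):
    """Per-step squared error E(o, y) = ||o - y||_2^2."""
    return np.sum((np.asarray(o) - np.asarray(y)) ** 2)

def total_loss(o_bar, y_bar):
    """Additive loss L(o_bar, y_bar) = sum over t of E(o_t, y_t)."""
    return sum(E(o, y) for o, y in zip(o_bar, y_bar))
```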

By the chain rule, to compute the gradients of \(E_t\) with respect to the model parameters \(U\), \(W\) and \(V\), one needs to compute

• \(\frac{d E_{t}}{d V} = \frac{\partial E_{t}}{\partial o_{t}} \cdot \frac{\partial o_{t}}{\partial V}\)
• \(\frac{d E_{t}}{d U} = \frac{\partial E_{t}}{\partial o_{t}} \cdot \frac{\partial o_{t}}{\partial U} = \frac{\partial E_{t}}{\partial o_{t}} \frac{\partial o_{t}}{\partial s_{t}} \cdot \frac{d s_{t}}{d U}\)
• \(\frac{d E_{t}}{d W} = \frac{\partial E_{t}}{\partial o_{t}} \frac{\partial o_{t}}{\partial s_{t}} \cdot \frac{d s_{t}}{d W}\).

The computation of these total derivatives then boils down to:

• \(\frac{\partial E_{t}}{\partial o_{t}}\), \(\frac{\partial o_{t}}{\partial s_{t}}\), \(\frac{\partial o_{t}}{\partial V}\) (easy to compute);
• \(\frac{d s_{t}}{d W}\), \(\frac{d s_{t}}{d U}\) (more care to be taken).
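The "easy" partials can be written out directly for this loss: with \(h\) the identity, \(\frac{\partial E_t}{\partial o_t} = 2(o_t - y_t)\) and hence \(\frac{d E_t}{d V} = 2(o_t - y_t)\, s_t^{\top}\). A small sketch verifying this analytic gradient against finite differences (the names and dimensions here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_hidden, d_out = 4, 3
V = rng.normal(size=(d_out, d_hidden))  # hidden-to-output weights
s_t = rng.normal(size=d_hidden)         # hidden state at time t
y_t = rng.normal(size=d_out)            # target output at time t

def E_of_V(Vm):
    o_t = Vm @ s_t                      # o_t = V s_t with h = identity
    return np.sum((o_t - y_t) ** 2)     # E_t = ||o_t - y_t||_2^2

# Analytic gradient: dE_t/dV = 2 (o_t - y_t) s_t^T
grad_V = 2.0 * np.outer(V @ s_t - y_t, s_t)

# Central finite-difference check, entry by entry
eps = 1e-6
num = np.zeros_like(V)
for i in range(d_out):
    for j in range(d_hidden):
        Vp = V.copy(); Vp[i, j] += eps
        Vm_ = V.copy(); Vm_[i, j] -= eps
        num[i, j] = (E_of_V(Vp) - E_of_V(Vm_)) / (2 * eps)

assert np.allclose(grad_V, num, atol=1e-5)
```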

Note that the total derivative \(\frac{d s_t}{d U}\) has a recurrence structure, because \(s_t\) is a function of \(U\) and \(s_{t-1}\), which in turn depends on \(U\). The same holds for the total derivative \(\frac{d s_t}{d W}\). This recurrence structure enables these total derivatives to be computed inductively. You can find the algorithms for computing \(\frac{d s_t}{d U}\) and \(\frac{d s_t}{d W}\) in the attached slides or the lecture video.
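The slides derive forward recurrences for \(\frac{d s_t}{d U}\) and \(\frac{d s_t}{d W}\). As an illustration of the same inductive idea, the sketch below instead uses the standard reverse-mode form of BPTT, accumulating \(\frac{d L}{d s_t}\) backwards in time, and checks the resulting gradients against finite differences. It assumes \(g = \tanh\), \(h\) the identity, the squared-error loss, and illustrative dimensions; none of these specifics come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out, T = 3, 4, 2, 5
U = rng.normal(size=(d_h, d_in))
W = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_out, d_h))
xs = rng.normal(size=(T, d_in))   # inputs x_1 .. x_T
ys = rng.normal(size=(T, d_out))  # targets y_1 .. y_T
s0 = np.zeros(d_h)                # initial hidden state

def loss(U, W, V):
    """L = sum over t of ||o_t - y_t||^2 with o_t = V s_t, s_{t+1} = tanh(U x_t + W s_t)."""
    s, total = s0, 0.0
    for t in range(T):
        total += np.sum((V @ s - ys[t]) ** 2)
        s = np.tanh(U @ xs[t] + W @ s)
    return total

def bptt(U, W, V):
    """Reverse-mode BPTT: accumulate dL/ds_t backwards through the unrolled network."""
    # Forward pass, storing all hidden states
    states = [s0]
    for t in range(T):
        states.append(np.tanh(U @ xs[t] + W @ states[-1]))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds = np.zeros(d_h)                            # dL/ds_{t+1}; zero beyond the horizon
    for t in reversed(range(T)):
        delta_a = (1 - states[t + 1] ** 2) * ds   # through tanh: g'(a_t) = 1 - s_{t+1}^2
        dU += np.outer(delta_a, xs[t])            # a_t = U x_t + W s_t
        dW += np.outer(delta_a, states[t])
        e = 2 * (V @ states[t] - ys[t])           # dE_t/do_t = 2 (o_t - y_t)
        dV += np.outer(e, states[t])
        ds = W.T @ delta_a + V.T @ e              # recurrence: dL/ds_t
    return dU, dW, dV
```

A quick way to gain confidence in such a sketch is to compare one or two entries of each gradient against a central finite difference of `loss`, as in the check for \(\frac{d E_t}{d V}\) above.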