Skip main navigation

An incremental action update rule for nonstationary problems

Programming or coding concept

In the previous step, we introduced an update rule as follows:

[Q_{j+1} = Q_j + frac{1}{j} (R_j – Q_j)]

Sutton and Barto’s (2018) Reinforcement Learning: An Introduction, makes the observation that this could be interpreted as:

NewEstimate = OldEstimate + [Target – Old Estimate]

[Target – Old Estimate] is an error in the estimate and we reduce this by taking a step towards the target – which we assume is where we want to go, despite it being noisy.

In this example, the target is the (Q_{j+1}) reward. The step size (alpha) changes from step to step as it is (frac{1}{j}) and starts out as 1, and gradually as (j) increases goes to zero. The target is our desired direction of travel, and indeed for (j=1) we take our first reward as our first data-updated value of (Q). Thereafter we take increasingly smaller steps towards the target because, over time, our sample average becomes a better approximation of our true desired target which is (q(a)), so we begin to reduce the weight given to new data. This is a nice idea.

Pseudocode for a complete bandit algorithm using incrementally computed sample averages and the (varepsilon)-Greedy Method is shown below. The function bandit(a) is assumed to take an action and return a corresponding reward.

A simple bandit algorithm.Initialize, for a = 1 to k: Q(a)←0, N (a)←0, Loop forever: A←argmax(small a) Q(a) with probability 1 - e (breaking ties randomly), a randowm action with probability e. R←bandit (A). N(A)←N(A)+1. Q(A)←Q(A)+1overN(A)[R-Q(A)] (Sutton and Barto, 2018)

What if we have a nonstationary problem? i.e. the statistics of (q(a)) is changing over time? In such a case new data is much more important as time goes by compared to the previous situation. So let’s reflect this in a weighting scheme where (alpha) is now a fixed size.

We can show this to be a weighted average which has increasing forgetfulness the further back in time a reward value was encountered. Let’s look at this.

(Q_{j+1} = Q_j + alpha(R_j – Q_j))
(= alpha R_j + (1-alpha)Q_j)
(= alpha R_j + (1-alpha)[alpha R_{j-1} + (1-alpha) Q_{j-1}]) (= alpha R_j + (1-alpha)alpha R_{j-1} + (1-alpha)^2Q_{j-1}) (= alpha R_j + (1-alpha)alpha R_{j-1} + (1-alpha)^2alpha R_{j-2}+cdots+(1-alpha)^{j-1} alpha R_1 + (1-alpha)^j Q_1)

[= (1-alpha)^j Q_1 + sum_{i=1}^{j}alpha(1-alpha)^{j-i}R_i]

We choose our step size (alphain[0,1])

Note, because ((1-alpha<1)), then weight decreases for rewards further back in time. Also, by choosing (alpha) we can make our algorithm more or less sensitive to recent rewards. For example, for (alpha = 1) then the only term that survives is when (i=j) as (0^0 = 1). This yields (Q_{j+1} = R_j) so only the last reward is used. For very small (alpha) we have a longer weighting over past rewards. Consequently, through choice of (alpha) we can determine how responsive our value estimate is to changes in the statistical properties of the true action value.

It is clear that a useful extension of the above approach is to vary (alpha) over time. We have already seen that setting (alpha(a) = frac{1}{j}) converges to (q(a)), i.e. the true value of the action after infinite steps. You can imagine that allowing (alpha) to vary over time could possibly be useful especially in a dynamic environment.


Sutton, R. and Barto, A. (2018) Reinforcement Learning: An Introduction, 2nd ed., Cambridge, MA: MIT Press. Available here.

© Dublin City University
This article is from the free online

Reinforcement Learning

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now