Incremental action value methods

In the previous step, we estimated action values using sample averages of the observed rewards.

This means keeping track of the following computation:

\[Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}\]

In terms of practical implementation, this presents a problem: it requires storing all past rewards for every action, and the computation grows with every new time step. This would be a very naïve implementation, and the obvious improvement is to compute the average incrementally. We introduce this idea as follows.
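To make the storage and computation cost concrete, here is a minimal sketch of the naive approach (the function names are illustrative, not from the course materials): every reward is kept, and the average is recomputed from scratch each time.

```python
from collections import defaultdict

# Naive sample-average sketch: store every reward per action and
# recompute the mean on demand. Memory and work both grow with N_t(a).
rewards = defaultdict(list)  # action -> list of all observed rewards

def record(action, reward):
    rewards[action].append(reward)

def q_estimate(action):
    history = rewards[action]
    return sum(history) / len(history)  # O(N_t(a)) time, O(N_t(a)) memory

record("a", 1.0)
record("a", 0.0)
record("a", 2.0)
print(q_estimate("a"))  # mean of the three rewards: 1.0
```

Both the memory footprint and the per-query work scale linearly with the number of rewards seen, which is exactly what the incremental formulation below avoids.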

Let \(Q_j\) be our estimate of the value of a particular action when considering the \(j\)-th reward. This means that \(Q_j\) is based on the sample average of the first \(j-1\) rewards. Calling the \(j\)-th reward \(R_j\), we can write our new estimate of the value of \(a\) as:

\[
\begin{aligned}
Q_{j+1} &= \frac{1}{j} \sum_{i=1}^{j} R_i \\
&= \frac{1}{j} \left( R_j + \sum_{i=1}^{j-1} R_i \right) \\
&= \frac{1}{j} \left( R_j + (j-1)Q_j + Q_j - Q_j \right) \\
&= \frac{1}{j} \left( R_j + jQ_j - Q_j \right) \\
&= Q_j + \frac{1}{j} \left( R_j - Q_j \right)
\end{aligned}
\]

This holds for \(j=1\) (where \(Q_1\) can be anything) and lets us maintain a running average of the past \(j\) rewards (i.e. \(Q_{j+1}\)) while storing only the previous estimate \(Q_j\) and the time step \(j\).
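The incremental update can be sketched as follows (the class name is illustrative): only the current estimate and a count are stored, and each new reward costs constant time and memory.

```python
# Incremental sample average: Q_{j+1} = Q_j + (1/j)(R_j - Q_j).
# Only the current estimate and the reward count are stored.
class IncrementalAverage:
    def __init__(self, q_init=0.0):
        self.q = q_init  # Q_1 can be anything; it is overwritten on the first update
        self.j = 0       # number of rewards seen so far

    def update(self, reward):
        self.j += 1
        self.q += (reward - self.q) / self.j  # constant time, constant memory
        return self.q

est = IncrementalAverage()
for r in [1.0, 0.0, 2.0]:
    est.update(r)
print(est.q)  # matches the plain sample average of [1.0, 0.0, 2.0]: 1.0
```

Note that the update has the general form "new estimate = old estimate + step size × (target − old estimate)", a pattern that recurs throughout reinforcement learning.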

© Dublin City University
This article is from the free online course Reinforcement Learning.
