# Action value methods: an introduction

x

If we want to evaluate actions, we must estimate the values of these actions.

We call these action evaluation and action selection estimates action value methods.

### The simplest action selection rule

We begin by thinking about simple ways to track our estimates of the values for each action. After that, we need to come up with a simple scheme that allows us to map from this set of estimates to an action selection.

First of all we represent the true, actual value of an action a as (q(a)). Following the convention used in Sutton and Barto’s (2018) Reinforcement Learning: An Introduction we denote the estimated value on the t-th time step as (Q_t(a)). As (q(a)) is the mean reward received when the action is selected, one idea to estimate this is to keep a running average of past rewards from that action as follows:

[Q_t(a) = frac{R_1 + R_2+cdots+R_{N_t(a)}}{N_t(a)}]

Where (N_t (a)) is the number of times (a) has been chosen prior to time (t). (R) denotes the rewards to date over those (N_t (a)) occasions indexed via the subscript. This is the sample average approach to estimating values. By the law of large numbers as (N_t (a)) tends to infinity we approach the population mean. For (N_t (a)=0) we need to define (Q_t (a)) as something reasonable as a default value. In general, this gets us going and we can deploy this to see what happens.

### Algorithm 1: The greedy method

Let’s start with the simplest thing we can do in selecting an action which is to select the greedy action, i.e. the action with the highest value. This greedy action method can be expressed as:

[A_T=underset{a}{argmax}⁡Q_t (a)]

Such an approach always exploits current knowledge to maximize the immediate reward. It takes no risks trying out what look like inferior actions to explore if they might be better.

### Algorithm 2: The ε-greedy method

A simple extension, which is pretty intuitive, is to play greedy almost all the time but occasionally with a probability ε choose one of the alternative actions. You don’t necessarily choose the next highest valued-action. You choose from the remaining actions with equal likelihood instead. So if, and it’s a big if, the agent plays for long enough, every action will be sampled many, many, times. Therefore, (N_t (a)) becomes larger and larger for every action (a). Consequently, (Q_t (a)) converges to (q(a)) as (N_t (a)toinfty) for every action (a).

Therefore, in this case, the probability of choosing the optimal action converges to (1- varepsilon), which is great, although it is an asymptotic result. The actual size of (varepsilon) impacts performance in quite a significant way. A larger value leads to more exploration and a shorter time to find the highest rewarding action compared to a smaller value of (varepsilon). However, a larger value of (varepsilon) means that the agent does not exploit this optimal value as much as for a smaller (varepsilon) by definition of (varepsilon). So, an interesting further extension is to start with a larger (varepsilon) and then reduce it over time. Also, it is worth noting that the best choice of (varepsilon) would depend on the variance of the distributions around a. To appreciate this, take the extreme case of the variance being equal to zero. If so, then (varepsilon) is best set to zero and a pure greedy method should be applied as the very first value sample for (q(a)).

References:

Sutton, R. and Barto, A. (2018) Reinforcement Learning: An Introduction, 2nd ed., Cambridge, MA: MIT Press. Available here.