
Upper confidence bound action selection


In reinforcement learning, exploration is always needed because we are never certain that our action-value estimates are accurate.

Greedy actions seek immediate gratification: they maximise gain at the time of the action, but at the expense of actions that could lead to bigger payoffs over longer time scales. Recall our \(\varepsilon\)-greedy method, which occasionally tries out other actions but does so indiscriminately. Wouldn't it be smarter to choose among the non-greedy actions according to their potential for actually being good? To do this, we take into account both how close their estimates are to the maximum and the uncertainty in those estimates, as captured in the following expression.

\[A_t = \underset{a}{\operatorname{argmax}}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]\]

What we have here looks alarming, but it is sensible. Let's go through it. \(N_t(a)\) is the number of times action \(a\) has been chosen up to time \(t\). The scalar \(c>0\) controls the gain on the square-root term and, in effect, controls the degree of exploration.
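As a sketch, the selection rule above might be implemented as follows. The array names `Q` and `N` and the handling of unvisited actions are my own assumptions, not part of the course material:

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick the action with the highest upper confidence bound.

    Q : array of current action-value estimates Q_t(a)
    N : array of counts N_t(a), how often each action was chosen
    t : current time step (1-based)
    c : exploration gain; larger c widens the confidence bound
    """
    # An action never tried has unbounded uncertainty, so try it first.
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    ucb = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(ucb))
```

Note that an action with a slightly lower estimate but a much smaller count can win the argmax, which is exactly the targeted exploration we wanted.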

So the square-root term is a measure of the uncertainty (variance) in the value of an action \(a\). We can imagine a confidence interval around our estimate of \(Q_t(a)\); by taking the maximum, we are looking at an upper confidence bound (UCB) on \(Q_t(a)\). You can see then that we are using \(c\) to determine the confidence level. How about our use of \(N_t(a)\) in the expression, does it make sense? If we have selected \(a\) many times previously, then \(N_t(a)\) is a large number and our uncertainty should be small compared to a situation where \(N_t(a)\) is small. So having \(N_t(a)\) in the denominator makes sense.

Every time we select this particular action \(a\), our uncertainty in its value goes down as we get more information. What about \(t\), why is it there? It captures a desired characteristic: when an action other than \(a\) is selected, the uncertainty in the value of \(a\) increases. Why? When some other action is selected, \(N_t(a)\) does not increase, but \(t\) does, as it is independent of the action. Therefore, the confidence bound loosens as time goes on while \(a\) remains unselected.
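We can see this numerically. Holding \(N_t(a)\) fixed while \(t\) grows, the exploration bonus keeps increasing (the choice \(c=2\) here is purely illustrative):

```python
import math

def ucb_bonus(t, n_a, c=2.0):
    """Exploration bonus c * sqrt(ln t / N_t(a))."""
    return c * math.sqrt(math.log(t) / n_a)

# Suppose action a has been chosen 5 times. While other actions are
# picked, t keeps growing, a's bound widens, and a becomes more likely
# to be re-selected.
for t in (10, 100, 1000):
    print(t, ucb_bonus(t, 5))
```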

But why the natural log? It squashes \(t\) so that increases in \(t\) have a smaller effect over time. Consequently, all actions will eventually be selected, and over time we get good exploration behaviour across actions. The improvement over our previous algorithm is shown in the figure below.
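Putting the pieces together, a minimal UCB bandit loop might look like this. The Gaussian testbed setup and the incremental sample-average update follow the standard k-armed bandit formulation; the variable names and the seed are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
true_values = rng.normal(0.0, 1.0, k)    # q*(a) for each arm
Q = np.zeros(k)                          # value estimates Q_t(a)
N = np.zeros(k)                          # selection counts N_t(a)
c = 2.0

for t in range(1, 1001):
    # UCB rule; untried arms get priority (unbounded uncertainty).
    if (N == 0).any():
        a = int(np.argmax(N == 0))
    else:
        a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
    reward = rng.normal(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]       # incremental sample average
```

Every arm is tried at least once, and as \(t\) grows the loop concentrates its selections on the arms with the best estimates while still revisiting neglected ones.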

Average performance of UCB action selection on the 10-armed testbed. As shown in the diagram, UCB generally performs better than \(\varepsilon\)-greedy, except in the first \(k\) steps.

Figure 1 (Sutton and Barto, 2018)

We will end our discussion of reinforcement learning here. For k-armed bandit problems, UCB is really good, although gradient bandit algorithms are well worth looking into if you wish to pursue this topic further. For now though, we have done well.


Sutton, R. and Barto, A. (2018) Reinforcement Learning: An Introduction, 2nd ed., Cambridge, MA: MIT Press.

© Dublin City University
This article is from the free online course Reinforcement Learning.