# The k-Armed Bandit problem

x

The k-armed bandit problem is a classic learning problem in reinforcement learning, describing a situation where the agent has a choice among k different options.

The selection of each option is a distinct action, so you can consider this selection among k different actions. After each action, the agent is presented with a reward (numerical) which is chosen from a probability distribution (stationary for now). The goal of the agent is to maximise the total reward over a large number of trials. We consider these trials as time steps, i.e. a single agent acting sequentially. We take the analogy of a slot machine, which has a number of levers (arms), each of which may have varying chances of generating good payouts. A selection of an action in this analogy is the selection of a specific “arm” or lever.

Aside: The slot machine analogy is a good one, but do have a think about other problems which have this structure but which are not actual slot machines! In the comments section suggest other examples.

k-Armed Bandits, as considered in the set of examples which follow, occur in a non-associative setting. This is in contrast to an associative setting. What do we mean by this? If we have to associate different actions with different situations then we have an associative setting. In a non-associative setting, the agent encounters the same situation repeatedly. I hope it is clear that our k-armed bandit situation is non-associative, as we are always interacting with the same problem. it’s the same situation over and over, i.e. trying to maximize return from this “slot machine”. We will focus on this non-associative setting for now, as it simplifies our introduction to Reinforcement Learning.

Let’s think about the k-armed bandit problem in terms of the elements of Reinforcement Learning we have established. Firstly, every action, (i.e. selection of a particular lever) has a mean reward. This is considered the value of that action. It is the mean of the reward distribution associated with that action,i.e. that particular lever. If the agent knows the value of each action, i.e. the agent knows the mean of each distribution for each lever, then it would be able to master this challenge easily because it would simply always select the lever with the greatest mean. To make this challenge interesting we assume the agent only has estimates for these value actions, i.e. it does not have certainty on these values.

Let us think about this idea of keeping estimates of the value actions. At any time step in our task, we have a set of levers that we have tagged with an estimate of their value. Among this set of values, there is a maximum. Selecting the lever corresponding to this maximum value is called a greedy action. With your greedy action, you are exploiting your knowledge of the values of the actions. If you select a non-greedy action, i.e. one of the lower-value estimated levers/actions, then you are engaging in exploratory behaviour, as this will improve your knowledge of the value for that non-greedy action. While exploitation is the best strategy to take, on average, on a single step, the exploratory route may reveal an alternative solution, one which produces greater reward over a much longer time scale. For example, if we know the value for action 1 with great certainty and it is a little greater than the remaining actions but these remaining have great uncertainty, then after exploration, you may find that one of these, let’s say Action 4 is actually better than Action 1.

Once identified, Action 4 can be exploited again and again over many trials to produce a greater return over Action 1 only. In such a scenario, exploring leads to rewards which are lower in the short run, but which pay off over a longer time scale. Clearly both exploring and exploiting are important for success, but balancing exploration and exploitation is challenging as it depends on factors such as initial estimates, uncertainties, and available time steps. In the next step, we show how mixing up exploring and exploitation is superior in the context of k-armed bandits over exploitation-only strategies.

What do you think?

Can you suggest other examples of learning problems which have this structure?

Share and discuss your responses with other learners in the comments section below.