4.18

# Multi-Armed Bandit Optimization

## Multi-Armed Bandit Optimization

The phrase ‘multi-armed bandit’ comes from the description/nickname ‘one-armed bandit’ which refers to slot machines, each of which has a single ‘arm’ which is pulled to run the game. They are bandits since they are programmed, in the long run, to take the player’s money. A multi-armed bandit is therefore a group of such slot machines.

The statistical problem this nickname targets is a situation where there are multiple different actions that can be performed (machines to be played), where each action will either produce a positive or negative result (we win or not). We assume all negative results are equally bad and all positive results equally good. These different actions each have potentially different probabilities of producing a positive result. We begin with no knowledge of the probability of any of these actions producing a good result. How should we proceed?

### An Example: Web Layout Optimization

An example of such a situation is web site layout optimization. Web designers often wish to test different layouts to see how people react to them. One statistic of interest is the ‘click through’ rate: The proportion of visiters to a webpage who end up clicking on a particular link or control. Designers are often asked to produce a layout that will maximize a particular click through rate, such as a ‘BUY NOW’ button.

A traditional and still common approach to compare two such layouts is to perform what is known as A-B testing. In this, two layouts are exposed to a previously decided upon number, $n$, of randomly selected visitors. The responses of these customers (whether they clicked-through) is recorded, and the click-through rate for both layouts for these visitors calculateed. Finally statistical tests are applied to determine the confidence we can have that the better performing layout out-performed the worse due to a reason other than chance. If the difference is not statistically significant, we can proceed to increase $n$.

Problems with basic A-B testing include:

• Only two options are tested at once.
• If one option is clearly worse, it is still tested $n$ times.
• It stops at a specific time.

Multi-armed bandits is an alternative approach that has advantages over basic A-B testing (though A-B testing can itself can be improved). Any number of options can be tested at the same time, and poor options are quickly ignored so as to concentrate on those options that are performing well. Finally it can be run until a particular option is clearly dominant, or even run without stop. The key to the first two of these points is that it is a reinforcement learning technique, which means the algorithm itself has some degree of controls on the data it collects. It is able to balance the exploration of less examined options that might have big payoffs versus exploiting known high value options. This exploration-exploitation balancing is one of the two key characteristics of reinforcement learning. The other is that the algorithm learns via feedback from the environment, where this feedback is not simply the label it is seeking to learn. In this case, the feedback is whether a visitor clicks-through on a particular layout. The label the algorithm is trying to learn is which layout is best (in each case, insofar as we might conditionalize on customize characteristics, but here we assume it is unconditional).

## Algorithm

We associated with each option (layout) a beta distribution, which is a Dirichlet distribution for a categorical distribution with only two options. (If you need to, re-read the section on Dirichlet distributions in the Clustering: Dirichlet Process Mixture Models step in week 3.) Initially, the parameters of each of these Dirichlet distributions are all ones, representing ignorance about the probability of a visitor ‘clicking through’ for each layout.

When a visitor to the website appears, we take n samples from each of these Dirichlet distributions. This will give us n numbers between 0 and 1 for each layout. We choose to show the visitor the layout which has the highest mean in these associated samples. Initially, of course, all layouts will have an equal chance of being chosen, but this will change as we update the parameters of the Dirichlet distributions.

Once we have chosen the layout to show the visitor, they will either click through or not. We update the parameters of the Dirichlet distribution according to their action (adding one to the appropriate parameter). This change reflects our updated belief about the click through rate for that layout, and changes the expected value of the samples taken from that distribution when the next visitor appears.

We continue doing this as long as we like. Assuming the click through rates of all layouts are less than .5, the effect will be that all layouts are initially chosen quite regularly but poor performing ones are quickly ignored (or chosen very rarely) allowing the algorithm to concentrate attention on higher performing options. New layouts can be added at a later point and, again assuming all click through rates are less than .5, they too will initially be chosen regularly while they are being ‘explored’.