Suppose I present you a collection of coins. These coins may be biased in different ways. As in the previous problem, you don't know what the bias of each coin is.
I then invite you to toss any coin you want. This is repeated an arbitrary number of times. Every time you toss a coin, I will give you $1 if it comes up heads and $0 if it comes up tails.
The question is simple: How can you maximize your earnings?
This is called the multi-armed bandit problem. Any strategy must balance exploration of different coins (since more samples give a better estimate of the expected return) and exploitation of coins known to yield good results. Always choosing exploitation is called the greedy strategy. It can lead to good rewards in the very short term, but in the long term it will probably leave you 'stuck' with a suboptimal coin.
The Thompson sampling strategy is remarkably powerful and elegant. Simply put, Thompson sampling says: Choose an action according to the probability that it is the best action. Despite its simplicity, Thompson sampling achieves logarithmic expected regret for the multi-armed bandit problem.
[To be continued]