How can we determine whether a coin is fair?

We know that the outcome of each toss follows a Bernoulli distribution. Let \(p\) be the probability that the coin comes up heads, and \(1 - p\) be the probability that it comes up tails.

If the coin is fair, it should be equally likely to come up heads or tails. This means \(p\) and \(1 - p\) should be equal to each other, and therefore equal to \(1/2\).

The only way to estimate \(p\) is by tossing the coin many times and seeing how often it comes up heads or tails. Our estimate of \(p\) should be high if it comes up heads more often, and low if it comes up tails more often. Furthermore, our confidence in this estimate should increase as the number of tosses increases.

Before we start tossing the coin, let us consider how to represent our beliefs about \(p\).

We can let \(\mathrm{P}(p)\) be a probability density function for \(p\): integrating it over an interval gives our degree of belief that \(p\) lies in that interval. For example, \(\mathrm{P}(p \lt 1/2)\) represents our degree of belief that \(p\) is less than \(1/2\), meaning the coin is biased toward tails.

Because we have not observed any tosses yet, \(\mathrm{P}(p)\) is called the prior. It represents our beliefs about \(p\) *before* we observe any tosses. These beliefs will be updated once we start seeing how often the coin comes up heads or tails.

Since we have no prior information about \(p\), we will assume that \(p\) is uniformly distributed on the interval between 0 and 1, meaning that \(\mathrm{P}(p)\) is equal to 1 on \([0, 1]\) and 0 everywhere else.

It is possible to use other priors, such as the non-informative Jeffreys prior, and different priors yield slightly different posterior distributions. For now, we will use the uniform prior.

Now we start tossing the coin. Let \(X\) be the sequence of outcomes we have seen so far, where each outcome \(x \in X\) is either heads or tails.
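As a concrete illustration, here is a minimal sketch in Python of how such a sequence of outcomes might be simulated (the `toss` helper is mine, not from any library; the true \(p\) is known to the simulator but would be unknown to us as observers):

```python
import random

def toss(p, n, seed=0):
    """Simulate n independent tosses of a coin with P(heads) = p.

    Returns a list of "H"/"T" outcomes; a fixed seed makes the run repeatable.
    """
    rng = random.Random(seed)
    return ["H" if rng.random() < p else "T" for _ in range(n)]

X = toss(0.5, 10)
```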

Let \(\mathrm{P}(p \mid X)\) represent our degree of belief that \(p\) is within a certain interval *after* we have observed \(X\). \(\mathrm{P}(p \mid X)\) is called the posterior, because it represents our beliefs about \(p\) after observing \(X\).

Bayes' theorem states that, for a given likelihood, the posterior is proportional to the prior:

\[\mathrm{P}(p \mid X) \propto \mathrm{P}(p)\]

It also states that, for a given prior, the posterior is proportional to the likelihood:

\[\mathrm{P}(p \mid X) \propto \mathrm{P}(X \mid p)\]

The likelihood represents our degree of belief that we would observe this number of heads and tails *if this were the value of \(p\)*. In general, the likelihood is the probability that we would make an observation under the assumption that a given hypothesis is correct.

Because the posterior is proportional to the likelihood, the likelihood is a measure of the *impact* that observations have on our hypotheses.

Putting the prior and the likelihood together, we have

\[\mathrm{P}(p \mid X) \propto \mathrm{P}(p) \mathrm{P}(X \mid p)\]

To make sure our probabilities integrate to 1, we introduce the following normalizing constant:

\[\frac{1}{\int \mathrm{P}(p') \mathrm{P}(X \mid p') \mathrm{d}p'}\]

which is simply the reciprocal of our previous expression integrated over all possible values of \(p\). Hence the posterior is

\[\mathrm{P}(p \mid X) = \frac{\mathrm{P}(p) \mathrm{P}(X \mid p)}{\int \mathrm{P}(p') \mathrm{P}(X \mid p') \mathrm{d}p'}\]

All we need to do now is determine the likelihood, \(\mathrm{P}(X \mid p)\).
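The normalization step can be sketched numerically: evaluate the prior times the likelihood on a grid over \([0, 1]\), and divide by a trapezoid-rule estimate of the integral in the denominator. (A minimal sketch; the function names are illustrative, not from any particular library.)

```python
def posterior_on_grid(prior, likelihood, n=10001):
    """Approximate the posterior density P(p | X) on a grid over [0, 1].

    `prior` and `likelihood` are functions of p. The result is a list of
    (p, density) pairs normalized so the trapezoid-rule integral is 1.
    """
    dp = 1 / (n - 1)
    ps = [i * dp for i in range(n)]
    unnorm = [prior(p) * likelihood(p) for p in ps]
    # trapezoid-rule estimate of the normalizing integral
    Z = sum((a + b) / 2 * dp for a, b in zip(unnorm, unnorm[1:]))
    return [(p, u / Z) for p, u in zip(ps, unnorm)]
```

For example, with a uniform prior and a likelihood corresponding to two heads and one tail, the resulting density peaks near \(p = 2/3\).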

Assuming the outcome of each toss is independent of the outcome of every other toss, we can express the likelihood as

\[\mathrm{P}(X \mid p) = \prod_{x \in X} \mathrm{P}(x \mid p)\]

That is, we multiply the probabilities of the individual outcomes in \(X\), each evaluated at the given value of \(p\).

Recall that \(p\) is the probability that the outcome of a toss is heads, and \(1 - p\) is the probability that the outcome is tails. Thus we can express the likelihood of each individual outcome as follows:

\[\mathrm{P}(x \mid p) = \begin{cases} p & \text{if } x \text{ is heads} \\ 1 - p & \text{if } x \text{ is tails} \end{cases}\]

Separating the cases where \(x\) is heads and where \(x\) is tails, we have

\begin{align} \mathrm{P}(X \mid p) &= \left[\prod_{x \in X \,:\, x \text{ is heads}} \mathrm{P}(x \mid p)\right] \left[\prod_{x \in X \,:\, x \text{ is tails}} \mathrm{P}(x \mid p)\right] \\ &= \left[\prod_{x \in X \,:\, x \text{ is heads}} p\right] \left[\prod_{x \in X \,:\, x \text{ is tails}} (1 - p)\right] \end{align}

If we have observed \(h\) heads and \(t\) tails in total, then

\[\mathrm{P}(X \mid p) = p^h (1-p)^t\]

Substituting this into our expression for the posterior yields
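As a quick sanity check (a sketch; the helper names are mine, not standard), the per-toss product and the closed form \(p^h (1-p)^t\) agree:

```python
def likelihood_product(X, p):
    """P(X | p) as a product over individual outcomes."""
    out = 1.0
    for x in X:
        out *= p if x == "H" else 1.0 - p
    return out

def likelihood_closed(h, t, p):
    """P(X | p) in closed form, given h heads and t tails."""
    return p**h * (1.0 - p)**t

X = ["H", "T", "H", "H", "T"]
h, t = X.count("H"), X.count("T")
assert abs(likelihood_product(X, 0.3) - likelihood_closed(h, t, 0.3)) < 1e-12
```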

\[\mathrm{P}(p \mid X) = \frac{\mathrm{P}(p) p^h (1-p)^t}{\int \mathrm{P}(p') p'^h (1-p')^t \mathrm{d}p'}\]

Since we are using a uniform prior, \(\mathrm{P}(p) = 1\) for \(p \in [0,1]\) and

\[\mathrm{P}(p \mid X) = \frac{p^h (1-p)^t}{\int_0^1 p'^h (1-p')^t \mathrm{d}p'}\]

for \(p \in [0,1]\). The denominator can be expressed in terms of the beta function as follows:

\[ \mathrm{P}(p \mid X) = \frac{p^h (1 - p)^t}{\mathrm{B}(h + 1, t + 1)} \]

This is the beta distribution with parameters \(h + 1\) and \(t + 1\). It gives the probability density function for \(p\) after \(h\) heads and \(t\) tails have been observed.
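Under a uniform prior, this closed form can be evaluated directly. A minimal sketch in Python, using `math.lgamma` to compute the beta function (the function names here are illustrative):

```python
from math import exp, lgamma

def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), via log-gamma for stability."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def posterior_density(p, h, t):
    """Beta(h + 1, t + 1) density at p: the posterior under a uniform prior."""
    return p**h * (1 - p)**t / beta_fn(h + 1, t + 1)
```

With no tosses observed (\(h = t = 0\)) this reduces to the uniform prior, since \(\mathrm{B}(1, 1) = 1\).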

The distribution concentrates near 0 if the number of tails exceeds the number of heads, and near 1 if the number of heads exceeds the number of tails.

It also becomes more sharply peaked as the total number of tosses increases, reflecting greater precision in our estimate.

Assuming we start with a uniform prior, the posterior attains its maximum probability density, or mode, where its derivative with respect to \(p\) vanishes. Writing \(\mathrm{P}(p \mid h, t)\) for the posterior after \(h\) heads and \(t\) tails, we have \begin{align} \frac{\partial}{\partial p} \mathrm{P}(p \mid h, t) &= \frac{\partial}{\partial p} \frac{p^h (1 - p)^t}{\mathrm{B}(h + 1, t + 1)} \\ &\propto \frac{\partial}{\partial p} p^h (1 - p)^t \\ &= (1 - p)^t \frac{\partial}{\partial p} p^h + p^h \frac{\partial}{\partial p} (1 - p)^t \\ &= h p^{h - 1} (1 - p)^t - t p^h (1 - p)^{t - 1} \\ &= p^{h - 1} (1 - p)^{t - 1} \left[ h (1 - p) - t p \right] \\ &= p^{h - 1} (1 - p)^{t - 1} \left[ h - (h + t) p \right] \end{align} For \(0 \lt p \lt 1\), this vanishes exactly when \(h - (h + t) p = 0\).

Therefore

\[ p = \frac{h}{h + t} \]

This is called the maximum a posteriori estimate, since it is the value of \(p\) that possesses the greatest probability density.
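As a sketch, we can check the closed-form mode against a brute-force grid search over the unnormalized posterior (the helper names are mine):

```python
def map_estimate(h, t):
    """Mode of the posterior under a uniform prior: h / (h + t); needs h + t > 0."""
    return h / (h + t)

def map_by_grid_search(h, t, n=100001):
    """Brute-force argmax of the unnormalized posterior p^h (1 - p)^t."""
    ps = (i / (n - 1) for i in range(n))
    return max(ps, key=lambda p: p**h * (1 - p)**t)

h, t = 7, 3
assert abs(map_estimate(h, t) - map_by_grid_search(h, t)) < 1e-4
```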

We can find the mean (expected value) of the distribution as follows: \begin{align} \mathrm{E}(p \mid h, t) &= \int p \mathrm{P}(p \mid h, t) \mathrm{d}p \\ &= \int_0^1 p \frac{p^h (1 - p)^t}{\mathrm{B}(h + 1, t + 1)} \mathrm{d}p \\ &= \frac{1}{\mathrm{B}(h + 1, t + 1)} \int_0^1 p^{h + 1} (1 - p)^t \mathrm{d}p \\ &= \frac{\mathrm{B}(h + 2, t + 1)}{\mathrm{B}(h + 1, t + 1)} \\ &= \frac{h + 1}{(h + 1) + (t + 1)} \\ &= \frac{h + 1}{h + t + 2} \end{align}

Notice that the mean of the distribution is slightly different from the mode. However, they both approach the same value as the total number of tosses approaches infinity.
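A small sketch illustrating that final point: the mean \((h + 1)/(h + t + 2)\) and the mode \(h/(h + t)\) get closer as tosses accumulate (the 60% heads ratio below is an arbitrary illustration).

```python
def posterior_mean(h, t):
    """Mean of the Beta(h + 1, t + 1) posterior: (h + 1) / (h + t + 2)."""
    return (h + 1) / (h + t + 2)

def posterior_mode(h, t):
    """Mode of the posterior: h / (h + t); needs h + t > 0."""
    return h / (h + t)

# the gap between mean and mode shrinks as the number of tosses grows
gaps = []
for n in (10, 1_000, 100_000):
    h = 6 * n // 10          # suppose 60% of tosses came up heads
    gaps.append(abs(posterior_mean(h, n - h) - posterior_mode(h, n - h)))
assert gaps[0] > gaps[1] > gaps[2]
```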