Previously a method was outlined for testing a die for fairness.
The test calls for rolling the die a large number of times, computing a p-value, and classifying the die as unfair if the p-value is below a significance threshold.
The example uses 100 trials and a significance threshold of .05. It doesn't mention why these numbers are chosen.
The Two Types of Error
To choose the numbers intelligently, we have to consider two types of mistakes. We might reject a fair die, or we might accept an unfair die.
The chance of making the first error (rejecting a fair die) is our significance threshold. To reduce the chance of making the first error, we make the significance threshold low. A threshold of .05 is a bit high; our test has a 1 in 20 chance of rejecting a fair die!
The Power of a Test
The power of a test is the chance of not making the second error. That is, the power is the chance of rejecting an unfair die.
The power of a test usually depends on how unfair the die is. The more lopsided the chances, the easier it is to detect the unfairness.
An Approximation
Because of the dependence on the amount of unfairness, we can't represent the power as a single percentage. It is natural to characterize an unfair die as a categorical distribution with n - 1 free parameters where n is the number of faces. Even in the case of a d6 we can't plot the power as a function of the free parameters: there are too many.
For this reason let's only consider unfair dice characterized by a single parameter which is the probability of rolling the most likely face. All other faces are assumed to be equally likely.
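As a minimal sketch of this one-parameter family, the function below builds the face probabilities from the probability of the most likely face. The name `one_parameter_die` and its arguments are hypothetical, chosen for illustration; they are not taken from the original code.

```python
def one_parameter_die(p_top, faces=6):
    """Return face probabilities for a die with one favored face.

    The first face has probability p_top; the remaining faces share
    the leftover probability equally.
    """
    if not 1.0 / faces <= p_top <= 1.0:
        raise ValueError("p_top must lie between 1/faces and 1")
    p_other = (1.0 - p_top) / (faces - 1)
    return [p_top] + [p_other] * (faces - 1)


# p_top = 1/6 gives back the fair die.
fair = one_parameter_die(1 / 6)

# A d6 whose favored face is twice as likely as it would be on a fair die.
loaded = one_parameter_die(1 / 3)
print(loaded)
```

With `p_top = 1/3` on a d6, each of the other five faces gets probability 2/15, so the whole die is described by that single number.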
The Power of the .05 Significance Test
Now we can plot the power of a test, given the probability of the most likely face.
If we increase the number of trials, the chance of detecting an unfair die goes up. Even if we only do 100 trials, the chance of detecting an unfair die where one of the faces is twice as likely as expected is about 90%.
I would like the percentage to be higher, but when I think about the effort of making 200 trials instead, the percentage seems acceptable.
The code used for calculating the power of a test is online. It uses the Monte Carlo method as discussed previously. The number of Monte Carlo trials was 10,000, which gives 95% confidence that our estimate is within 1 percentage point of the true power.
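The 1% figure can be checked with a short calculation. The sketch below assumes the usual normal approximation to the binomial, under which the 95% margin of error for an estimated proportion is 1.96 standard errors. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(n_sim, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from n_sim independent Monte Carlo trials.

    p is the true proportion; p = 0.5 is the worst case because it
    maximizes p * (1 - p).
    """
    return z * math.sqrt(p * (1.0 - p) / n_sim)


worst_case = margin_of_error(10_000)
print(round(worst_case, 4))  # 0.0098
```

The worst case occurs when the true power is 50%. When the power is closer to 0% or 100% the margin is smaller, so 1% is a safe bound across the whole chart.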
The Power of the .01 Significance Test
What if we decide to use a .01 significance level instead?
The chart doesn't change much. The chance of detecting the die where one of the faces is twice as likely as expected is now 77%.
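For readers who want to experiment, here is a rough sketch of a Monte Carlo power estimate at both thresholds. It is not the code referred to above. It assumes the test is Pearson's chi-square goodness-of-fit test on a d6, and it hard-codes the standard critical values for 5 degrees of freedom instead of computing p-values. The names `estimate_power`, `chi_square_statistic`, and `CRITICAL_DF5` are made up for this example. Because the test statistic may differ from the one used for the charts, the numbers it prints need not match the percentages quoted here.

```python
import random

# Upper-tail critical values of the chi-square distribution with
# 5 degrees of freedom (six faces), hard-coded to stay stdlib-only.
CRITICAL_DF5 = {0.05: 11.070, 0.01: 15.086}


def chi_square_statistic(counts, n_rolls):
    """Pearson's statistic against the fair-die expectation."""
    expected = n_rolls / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def estimate_power(p_top, n_rolls, alpha, n_sim=2000, seed=0):
    """Estimate the chance of rejecting a d6 whose most likely face
    has probability p_top, with the other faces equally likely."""
    rng = random.Random(seed)
    faces = 6
    weights = [p_top] + [(1.0 - p_top) / (faces - 1)] * (faces - 1)
    critical = CRITICAL_DF5[alpha]
    rejections = 0
    for _ in range(n_sim):
        # Simulate one experiment of n_rolls rolls and tally the faces.
        rolls = rng.choices(range(faces), weights=weights, k=n_rolls)
        counts = [0] * faces
        for face in rolls:
            counts[face] += 1
        if chi_square_statistic(counts, n_rolls) > critical:
            rejections += 1
    return rejections / n_sim


for alpha in (0.05, 0.01):
    print(alpha, estimate_power(1 / 3, n_rolls=100, alpha=alpha))
```

Passing `p_top = 1/6` simulates a fair die, in which case the returned fraction estimates the chance of the first type of error and should land near the chosen threshold.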