Hypothesis Testing

Hypothesis testing is a formal procedure for drawing conclusions from collected data. It underpins much of modern science.

The procedure can be understood in terms of what we have learned so far about probability. When we collect data, we can think of each measurement as the value of some random variable $X$. This random variable will have some underlying distribution, which we might know something about or at least might have some assumptions about.

Our goal is to determine the probability of collecting the data, given the assumptions we have made about the distribution of $X$. If this probability is low enough, then we can confidently determine that some of our initial assumptions about the distribution of $X$ were wrong.

Hypothesis Testing Procedure: The procedure follows five main steps:

  1. Formulate the null hypothesis $H_0$. This is our set of initial assumptions about the distribution of $X$.
  2. Formulate an alternative hypothesis $H_a$. This should contradict the null hypothesis in some aspect: if the null hypothesis makes several assumptions, the alternative hypothesis negates one of them and keeps the rest fixed. The alternative hypothesis is what you are trying to show with the data.
  3. Choose a significance level $\alpha$. This is a value between 0 and 1, typically chosen to be small. The significance level is a threshold used to decide whether the discrepancy between our assumptions and the data can plausibly be attributed to chance. If the probability of the data given the assumptions is less than $\alpha$, then we conclude that our initial assumptions about $X$ were wrong.
  4. Calculate the probability of the collected data, assuming the null hypothesis is true.
  5. If the probability of the collected data is less than $\alpha$, then we reject the null hypothesis $H_0$ and accept the alternative hypothesis $H_a$. If the probability is greater than $\alpha$, then we fail to reject $H_0$.
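To make steps 4 and 5 concrete, here is a minimal sketch in Python, assuming the null hypothesis specifies a binomial model for the data; the function names and the binomial form of $H_0$ are illustrative choices, not part of the general procedure.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def reject_null(k_observed, n, p_null, alpha):
    """Steps 4-5: compute the probability of the observed count
    under H0 and compare it to the significance level alpha.
    Returns True if H0 is rejected in favor of H_a."""
    prob = binomial_pmf(k_observed, n, p_null)
    return prob < alpha
```

For example, `reject_null(11, 20, 1/6, 0.01)` carries out the test worked through in the example below.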

Note: It's important to point out that we never accept the null hypothesis. If we fail to reject $H_0$, we have not shown that $H_0$ is true; we have merely failed to provide enough evidence that it is false.

Example: Anna has two dice that she thinks are weighted so that their sum comes up as 7 more often than it would for fair dice. To test this, she rolls the dice 20 times and gets a sum of 7 on 11 of the rolls. Can she conclude that the dice are weighted?

Solution: Before setting up a hypothesis test, let's fix some notation for Anna's experiment. Let $w_i$ be the sum of both dice on the $i$th roll and let $$X(w_i) = \begin{cases} 1 & w_i = 7\\ 0 & w_i \ne 7.\end{cases} $$ Then $X(w_i)$ is a Bernoulli random variable with some probability of success $p$. If the dice are weighted to land on 7 more often, then $p > 1/6$, and if they are not, then $p \le 1/6$. In particular, if the dice are perfectly fair, then $p = 1/6$.
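The value $p = 1/6$ for fair dice comes from counting the 6 ways out of 36 to roll a sum of 7. A quick simulation (a sketch; the trial count and seed are arbitrary) confirms this:

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

def roll_sum():
    """Sum of two fair six-sided dice."""
    return random.randint(1, 6) + random.randint(1, 6)

trials = 100_000
sevens = sum(1 for _ in range(trials) if roll_sum() == 7)
print(sevens / trials)  # close to 1/6 ≈ 0.1667 for fair dice
```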

Let $Y = \sum_{i=1}^{20}X(w_i)$ be the number of times that the dice came up as 7 in the 20 rolls. Then $Y$ is binomially distributed with $n=20$. Our null hypothesis and alternative hypothesis will be about the random variable $Y$.

  1. Let $H_0 = \{Y $ is binomially distributed with $n=20$ and $p = 1/6 \}.$ We use this as the null hypothesis since we are trying to prove the opposite.
  2. Let $H_a = \{Y $ is binomially distributed with $n=20$ and $p > 1/6\}.$
  3. Let's take $\alpha = 0.01$. This choice is somewhat arbitrary. Choosing a smaller value of $\alpha$ makes a rejection of $H_0$ more convincing.
  4. The probability of the data under the assumptions of the null hypothesis is $$P(Y = 11 ~|~ H_0) = \binom{20}{11}\left(\frac{1}{6}\right)^{11}\left(\frac{5}{6}\right)^9.$$ The right hand side is the pmf for the binomial distribution with $n=20$ and $p = 1/6.$
  5. Calculating the probability in step 4 gives $P(Y=11~|~H_0) \approx 8.97\times 10^{-5}.$ Since this number is less than the significance level $\alpha$ we set in step 3, we reject the null hypothesis and accept the alternative hypothesis. In other words, we can conclude that the dice are weighted at significance level 0.01.
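The probability in step 4 can be checked numerically; this is a sketch using only the binomial pmf from the example:

```python
from math import comb

n, p, alpha = 20, 1/6, 0.01

# P(Y = 11 | H0): binomial pmf with n = 20, p = 1/6, evaluated at 11
prob = comb(n, 11) * p**11 * (1 - p)**9
print(f"{prob:.3g}")  # ≈ 8.97e-05
print(prob < alpha)   # True, so H0 is rejected
```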

Although we were able to reject $H_0$ in this example, it is possible that the data have misled us. That is, although the probability of getting a 7 eleven times was low, the event is still possible under the assumptions of $H_0$. Fortunately, we can compute the probability that the data have misled us in our analysis.

Definition: A Type I error occurs when the data lead us to reject the null hypothesis even though it is true. This is also called a false positive. A Type II error occurs when the data lead us to fail to reject $H_0$ even though $H_0$ is false. This is also called a false negative.

These errors are not the fault of the person running the experiment, or the person doing the analysis. They are just due to the randomness of the data.

Example: When Anna was testing the dice, she concluded that the dice were weighted. From the hypothesis test, we would have rejected the null hypothesis for any observed count $y$ satisfying $$P(Y = y ~|~H_0) < \alpha.$$ We can determine which values satisfy this by examining the pmf of $Y$. Since the pmf is not invertible, we find the threshold value of $Y$ by trial and error: it turns out we would have rejected $H_0$ for any value of $Y$ greater than or equal to 8. Therefore the probability of a Type I error is $$P(Y \ge 8~|~H_0) = 1- P(Y \le 7 ~|~ H_0) \approx 0.0113,$$ or about 1%.
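The trial-and-error search for the threshold, and the resulting Type I error probability, can be sketched as a short loop over the upper tail of the pmf:

```python
from math import comb

n, p, alpha = 20, 1/6, 0.01

def pmf(k):
    """P(Y = k) for Y ~ Binomial(n, p) under H0."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Scan counts above the mean n*p for the smallest one whose
# probability under H0 drops below alpha.
threshold = next(k for k in range(n + 1) if k > n * p and pmf(k) < alpha)
print(threshold)  # 8

# The Type I error rate is the total probability, under H0,
# of all counts that lead to rejection.
type1 = sum(pmf(k) for k in range(threshold, n + 1))
print(round(type1, 4))  # ≈ 0.0113
```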

Example: Suppose the dice are actually weighted so that a 7 comes up with probability $p = 1/3$. What is the probability that Anna makes a Type II error?

Solution: Anna would fail to reject $H_0$ for all values of $Y \le 7$. Then the probability of a Type II error is $$P(Y\le 7 ~|~ p= 1/3, n=20) \approx 0.66.$$
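This tail probability can be checked with the same binomial pmf, now evaluated at the assumed true weighting $p = 1/3$:

```python
from math import comb

n, p_true = 20, 1/3  # assumed true weighting from the example

def pmf(k, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Anna fails to reject H0 whenever Y <= 7, so the Type II error
# probability is the lower tail of the true distribution.
type2 = sum(pmf(k, p_true) for k in range(8))
print(round(type2, 2))  # ≈ 0.66
```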