Hypothesis Testing: Continuous Random Variables

Hypothesis testing with continuous random variables is slightly different from hypothesis testing with discrete random variables. Recall that if $X$ is a continuous random variable, then for each value $x$ in the range of $X$ the probability that $X$ takes on that exact value is 0, $P(X=x)=0$. So instead of using the significance level to check whether the probability of the data is low enough, we use it to define a cutoff value $X^*$ and a rejection region, which is a region of values of $X$ for which we will reject the null hypothesis.

In many cases, we will be considering the mean of values sampled from the distribution of $X$. Instead of determining the exact distribution of this mean, we will often use the approximation provided by the central limit theorem.

The Central Limit Theorem: Suppose $X_1, \dots, X_n$ are independent samples of a random variable $X$ with finite mean and variance. If $n$ is large enough, then the random variable defined by $$Y = \frac{1}{n}\sum_{j=1}^n X_j$$ is approximately normally distributed with $\mu = \mathbb{E}(X)$ and $\sigma^2 = \frac{\mathbb{V}(X)}{n}.$

What this means is that when we take the mean of the data, the mean is an approximately normally distributed random variable. As a rule of thumb, this approximation is good for $n>30$, so as long as we have taken more than 30 samples, we can treat the mean as normally distributed. As we take more samples, the variance of the resulting normal distribution decreases, and we are less likely to get a result far away from the mean of the underlying distribution of $X$.
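The effect described above can be checked with a small simulation. The following is a minimal sketch, assuming an exponential distribution with mean 3 and samples of size 40; the function name `sample_means`, the seed, and the number of trials are illustrative choices, not part of the original notes.

```python
import random
from statistics import fmean, pvariance

def sample_means(n, trials, mean=3.0, seed=0):
    """Return `trials` sample means, each computed from n exponential draws."""
    rng = random.Random(seed)
    rate = 1.0 / mean  # expovariate takes the rate lambda = 1 / mean
    return [fmean(rng.expovariate(rate) for _ in range(n)) for _ in range(trials)]

means = sample_means(n=40, trials=20000)

# The central limit theorem predicts a mean near E(X) = 3
# and a variance near V(X) / n = 9 / 40 = 0.225.
print("mean of sample means:    ", fmean(means))
print("variance of sample means:", pvariance(means))
```

Increasing `n` should shrink the printed variance roughly in proportion to $1/n$.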

Example: An environmental monitoring group has been taking samples from Lake Simcoe in southern Ontario to test for lead. The water is considered unsafe if there is more than 3$\mu$g/L of lead. Suppose they have taken 25 samples of water from the lake, and that the lead concentrations in each sample are exponentially distributed. Set up a hypothesis test to determine if the average lead concentration is too high.

Solution: Let $X_j$ be the concentration of lead in the $j$th sample, and suppose each $X_j$ is exponentially distributed with rate $\lambda$, so that $\mathbb{E}(X) = \frac{1}{\lambda}$ and $\mathbb{V}(X) = \frac{1}{\lambda^2}$. Our test statistic is the average concentration $Y = \frac{1}{25}\sum_{j=1}^{25}X_j$, and we need to determine its cut-off value. By the central limit theorem, $Y$ is approximately normally distributed with $\mu = \frac{1}{\lambda}$ and $\sigma^2 = \frac{1}{25 \lambda^2}.$ Since the distribution of $Y$ is determined by $\lambda$, we can use $Y$ to draw conclusions about $X$.

  1. Recall that the null hypothesis is what we are assuming is true about the distribution of $X$. We will take the null hypothesis to be that $X$ is exponentially distributed with $\mathbb{E}(X) \le 3.$ In other words, that the water is safe. Since $X$ is assumed to be exponentially distributed, this implies that $\frac{1}{\lambda} \le 3$, or $\lambda \ge 1/3.$
  2. The alternative hypothesis is the complement of the null hypothesis. Therefore, this would be that the water is not safe. In terms of $X,$ this can be expressed as $\mathbb{E}(X)> 3$ or $\lambda < 1/3.$
  3. Let's set the traditional significance level of $\alpha = 0.05$. As was stated last lecture, this is the "standard" value used.
  4. We need to use $\alpha$ to define a cut-off value of $Y$. That is, we need to find a value $Y^*$ so that $P(Y\ge Y^*) =\alpha$ when $H_0$ holds at its boundary, $\lambda = 1/3$, where $Y$ is approximately normal with $\mu = 3$ and $\sigma = \frac{3}{5}$. That way, if the sample mean comes out larger than $Y^*$, we can conclude that it was too unlikely to have occurred by chance, and that our assumptions about $\lambda$ must be incorrect. We can find this cutoff value using the inverse of the normal cdf, which can be computed numerically with a calculator, with a programming language like Python, or by referring to a z-score table if technology scares you. Any one of these procedures gives $Y^* = 3 + 1.645 \cdot \frac{3}{5} \approx 3.987$.

  5. Using the cut-off value found in step 4, we will reject $H_0$ if $Y > 3.987.$ In that case we conclude that the water is unsafe.
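The cut-off from the steps above can be reproduced numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library, assuming the cut-off is computed at the boundary of the null hypothesis ($\lambda = 1/3$, so $\mu = 3$ and $\sigma = 3/5$); the helper `reject_h0` is a hypothetical name introduced for illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 25                   # number of water samples
alpha = 0.05             # significance level
mu0 = 3.0                # boundary of H0: E(X) = 1 / lambda = 3
sigma0 = mu0 / sqrt(n)   # an exponential's sd equals its mean, so sd of Y is 3 / 5

# Inverse normal cdf: the value with probability alpha in the upper tail.
y_star = NormalDist(mu0, sigma0).inv_cdf(1 - alpha)
print(round(y_star, 3))  # 3.987

def reject_h0(sample_mean, cutoff=y_star):
    """Return True when the observed mean falls in the rejection region."""
    return sample_mean > cutoff
```

For instance, an observed mean of 4.2 would fall in the rejection region, while an observed mean of 3.5 would not.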

As we saw before, the fact that the collected samples produce a particular value doesn't guarantee that we draw the correct conclusion. We need to determine the probabilities of type 1 and type 2 errors based on our experimental design.

Type 1 error: The probability we reject $H_0$ even though $H_0$ is true is $P(Y \ge Y^* | H_0) = \alpha = 0.05$. As we can see, the probability of a type 1 error is built into the problem construction as just the significance level.
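One way to sanity-check this is to simulate the experiment many times with the water exactly at the safety limit and count how often the test rejects. The sketch below assumes exponential samples with mean 3 and the cut-off 3.987; `rejection_rate`, the seed, and the trial count are illustrative assumptions. Because the normal distribution is only an approximation for the mean of 25 skewed samples, the simulated rate may differ somewhat from 0.05.

```python
import random
from statistics import fmean

def rejection_rate(true_mean, cutoff=3.987, n=25, trials=20000, seed=1):
    """Estimate how often the mean of n exponential samples exceeds the cutoff."""
    rng = random.Random(seed)
    rate = 1.0 / true_mean  # expovariate takes the rate lambda = 1 / mean
    rejections = sum(
        fmean(rng.expovariate(rate) for _ in range(n)) > cutoff
        for _ in range(trials)
    )
    return rejections / trials

# H0 true at its boundary: the actual mean concentration is exactly 3.
type1 = rejection_rate(true_mean=3.0)
print("simulated type 1 error rate:", type1)
```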

Type 2 error: Suppose that the actual concentration of lead in the water is 4 $\mu$g/L. Under this alternative, $Y$ is approximately normally distributed with $\mu = 4$ and $\sigma^2 = \frac{16}{25}$. Thus the probability of not rejecting $H_0$ is $P( Y < Y^* | H_A) = P( Y < 3.987~| H_A) = 0.49,$ which is very high.
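The type 2 error probability can be computed the same way as the cut-off, using the normal cdf under the alternative. This is a minimal sketch with `statistics.NormalDist`; the function name `type2_error` and its parameters are assumptions introduced for illustration, and it relies on the same normal approximation as the text.

```python
from math import sqrt
from statistics import NormalDist

def type2_error(true_mean, n=25, alpha=0.05, mu0=3.0):
    """P(Y < Y*) when the true mean concentration is `true_mean`."""
    # Cut-off from the boundary of H0, where Y is approximately N(mu0, (mu0 / sqrt(n))^2).
    y_star = NormalDist(mu0, mu0 / sqrt(n)).inv_cdf(1 - alpha)
    # Under the alternative, Y is approximately N(true_mean, (true_mean / sqrt(n))^2).
    y_alt = NormalDist(true_mean, true_mean / sqrt(n))
    return y_alt.cdf(y_star)

print(round(type2_error(4.0), 2))  # 0.49
```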

Exercises

The probability of a type 2 error is very high. What could you change about the analysis or experimental setup to decrease the probability of a type 2 error?