In Chapter 6 we saw that two means can be compared by using a confidence interval for the difference $\mu_1-\mu_2$. Now we will explore another approach to the comparison of means: the procedure known as hypothesis testing. The general idea is to formulate as a hypothesis the statement that $\mu_1$ and $\mu_2$ differ and then to see whether the data provide sufficient evidence in support of that hypothesis.
The hypothesis that $\mu_1$ and $\mu_2$ are not equal is called an alternative hypothesis (or a research hypothesis) and is abbreviated $H_A$. It can be written as $$H_A:\mu_1\neq\mu_2.$$ Its antithesis is the null hypothesis, $$H_0:\mu_1=\mu_2,$$ which asserts that $\mu_1$ and $\mu_2$ are equal.
A statistical test of hypothesis is a procedure for assessing the strength of evidence present in the data in support of $H_A$. The data are considered to demonstrate evidence for $H_A$ if any discrepancies from $H_0$ (the opposite of $H_A$) could not be readily attributed to chance (i.e., to sampling error).
We consider the problem of testing the null hypothesis $$H_0:\mu_1=\mu_2\text{ or }H_0:\mu_1-\mu_2=0$$ against the alternative hypothesis $$H_A:\mu_1\neq\mu_2\text{ or }H_A:\mu_1-\mu_2\neq0.$$ The $t$ test is a standard method of choosing between these two hypotheses. To carry out the $t$ test, the first step is to compute the test statistic, which for a $t$ test is defined as $$T=\frac{(\bar{Y}_1-\bar{Y}_2)-0}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}.$$ Notice the structure of $T$: it is a measure of how far the difference between the sample means is from the difference we would expect to see if $H_0$ were true (zero difference), expressed in relation to the SE of the difference, that is, the amount of variation we expect to see in differences of means from random samples.
Abuse of substances containing toluene (e.g., glue) can produce various neurological symptoms. In an investigation of the mechanism of these toxic effects, researchers measured the concentrations of various chemicals in the brains of rats that had been exposed to a toluene-laden atmosphere, and also in unexposed control rats. The concentrations of the brain chemical norepinephrine (NE) in the medulla region of the brain, for six toluene-exposed rats and five control rats, are given in the following table.
The observed mean NE in the toluene group ($\bar{Y}_1=540.8$ ng/gm) is substantially higher than the mean in the control group ($\bar{Y}_2=444.2$ ng/gm). One might ask whether this observed difference indicates a real biological phenomenon (the effect of toluene) or whether the truth might be that toluene has no effect and that the observed difference between $\bar{Y}_1$ and $\bar{Y}_2$ reflects only chance variation.
The essence of the $t$ test procedure is to identify where the observed value $T$ falls in the Student's $t$ distribution.
For the brain NE data, the value of $T$ is $2.34$. We can ask, "If $H_0$ were true, so that one would expect $\bar{Y}_1-\bar{Y}_2=0$ on average, what is the probability that $\bar{Y}_1-\bar{Y}_2$ would differ from zero by as many as $2.34$ SEs?" The $p$-value answers this question. The formula $$\nu=\frac{(\mathrm{SE}_1^2+\mathrm{SE}_2^2)^2}{\mathrm{SE}_1^4/(n_1-1)+\mathrm{SE}_2^4/(n_2-1)}$$ yields $8.47$ degrees of freedom for these data. Thus, the $p$-value is the area under the $t$ curve (with $8.47$ degrees of freedom) beyond $\pm2.34$. This area, which was found using a computer, is shown in the following figure to be $0.0454$.
The $p$-value for a hypothesis test is the probability, computed under the condition that the null hypothesis is true, of the test statistic being at least as extreme as the value of the test statistic that was actually obtained.
From the definition of $p$-value, it follows that the $p$-value is a measure of compatibility between the data and $H_0$ and thus measures the evidence for $H_A$: A large $p$-value (close to 1) indicates a value of $T$ near the center of the $t$ distribution (lack of evidence for $H_A$), whereas a small $p$-value (close to 0) indicates a value of $T$ in the far tails of the $t$ distribution (evidence for $H_A$).
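The tail area in the brain NE example can be approximated without special software. The following is a minimal sketch, not the method used to produce the figure: it integrates the Student's $t$ density numerically with Simpson's rule. The function names `t_pdf` and `two_sided_p_value` and the `steps` parameter are illustrative choices made here.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t_obs, df, steps=2000):
    """Area under the t curve beyond +/- |t_obs|.

    Simpson's rule (steps must be even) gives the area between 0 and
    |t_obs|; by symmetry the two tails together hold 1 - 2 * that area.
    """
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_half = total * h / 3
    return 1.0 - 2.0 * central_half

# Brain NE example: T = 2.34 with 8.47 degrees of freedom.
p = two_sided_p_value(2.34, 8.47)
print(f"two-sided p-value: {p:.4f}")
```

The printed value should be close to the $0.0454$ reported for these data. Note that the degrees of freedom need not be a whole number, because the density formula accepts any positive value.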
In Chapter 6 we saw that the mean height of fast plants was smaller when ancy was used than when water (the control) was used. The following table summarizes the data.
The difference between the sample means is $15.9-11.0=4.9$. The SE for the difference is $$\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}=\sqrt{\frac{4.8^2}{8}+\frac{4.7^2}{7}}=2.46.$$
Suppose we choose to use $\alpha=0.05$ in testing $$H_0: \mu_1-\mu_2=0$$ against the alternative hypothesis $$H_A: \mu_1-\mu_2\neq0.$$ The value of the test statistic is $$T=\frac{(15.9-11.0)-0}{2.46}=1.99.$$ Using the formula, we find the degrees of freedom to be: $$\nu=\frac{(1.7^2+1.8^2)^2}{1.7^4/(8-1)+1.8^4/(7-1)}=12.8.$$
The $p$-value for the test is the probability of getting a $t$ statistic that is at least as far away from zero as $1.99$.
In this section we have considered tests of the form $H_0: \mu_1-\mu_2=0$ against $H_A: \mu_1-\mu_2\neq0$; this is the most common pair of hypotheses. However, it may be that we wish to test that $\mu_1$ differs from $\mu_2$ by some specific, nonzero amount, say $c$. To test $H_0: \mu_1-\mu_2=c$ against $H_A: \mu_1-\mu_2\neq c$ we use the $t$ test with test statistic given by $$T=\frac{\bar{Y}_1-\bar{Y}_2-c}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}.$$ From this point on, the test proceeds as before (i.e., as for the case when $c=0$).
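The arithmetic of the fast-plant example, together with the shifted null value $c$ just described, can be sketched in a few lines. This is an illustration under stated assumptions: the function name `welch_t_summary` and the parameter `null_diff` (which plays the role of $c$) are names chosen here, and the value $c=2$ in the second call is purely hypothetical.

```python
import math

def welch_t_summary(mean1, sd1, n1, mean2, sd2, n2, null_diff=0.0):
    """Return (SE of the difference, T, degrees of freedom).

    null_diff is the hypothesized value of mu1 - mu2 (c in the text).
    """
    se1_sq = sd1 ** 2 / n1          # squared SE of the first mean
    se2_sq = sd2 ** 2 / n2          # squared SE of the second mean
    se_diff = math.sqrt(se1_sq + se2_sq)
    t_stat = (mean1 - mean2 - null_diff) / se_diff
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1)
    )
    return se_diff, t_stat, df

# Fast-plant summary statistics quoted in the text.
se_diff, t_stat, df = welch_t_summary(15.9, 4.8, 8, 11.0, 4.7, 7)
print(f"SE = {se_diff:.2f}, T = {t_stat:.2f}, df = {df:.1f}")

# Hypothetical nonzero null value c = 2 (not from the text).
_, t_shift, _ = welch_t_summary(15.9, 4.8, 8, 11.0, 4.7, 7, null_diff=2.0)
print(f"T for H0: mu1 - mu2 = 2 is {t_shift:.2f}")
```

Only the numerator of $T$ changes when $c\neq0$; the SE and the degrees of freedom depend on the sample SDs and sample sizes alone.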
There is a close connection between the confidence interval approach and the hypothesis testing approach to the comparison of $\mu_1$ and $\mu_2$. The $t$ test and the confidence interval use the same three quantities $\bar{Y}_1-\bar{Y}_2$, $\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}$, and $t_{\nu}(\alpha/2)$ but manipulate them in different ways.
The $p$-value is less than $\alpha$ if and only if the test statistic $T$ is in the tail of the $t$ distribution, beyond $\pm t_{\nu}(\alpha/2)$. Thus we lack significant evidence for $H_A: \mu_1-\mu_2\neq0$ if and only if $|T|\leq t_{\nu}(\alpha/2)$, i.e., $$\frac{|\bar{Y}_1-\bar{Y}_2|}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}\leq t_{\nu}(\alpha/2).$$
This is equivalent to $$(\bar{Y}_1-\bar{Y}_2)-t_{\nu}(\alpha/2)\times\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}\leq0\leq(\bar{Y}_1-\bar{Y}_2)+t_{\nu}(\alpha/2)\times\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}.$$ Thus we have shown that we lack significant evidence for $H_A: \mu_1-\mu_2\neq0$ if and only if the confidence interval for $\mu_1-\mu_2$ includes zero.
The rejection region for a hypothesis test is the set of values of the test statistic for which we reject the null hypothesis. It is determined based on the desired significance level ($\alpha$), the degrees of freedom $\nu$, and the alternative hypothesis $H_A$. If the test statistic falls within the rejection region, it provides significant evidence against the null hypothesis $H_0$.
To test $H_0: \mu_1-\mu_2=0$ against $H_A: \mu_1-\mu_2\neq0$, we reject $H_0$ at the significance level $\alpha$ if $|T|>t_{\nu}(\alpha/2)$. The corresponding rejection region is $\{T: |T|>t_{\nu}(\alpha/2)\}$.
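The cutoff $t_{\nu}(\alpha/2)$ is usually read from a table or obtained from software. As a rough sketch of what such software does, the code below finds the cutoff by bisection on a numerically integrated tail area. The function names are illustrative, and the values $\nu=18$ and $\alpha=0.05$ are example inputs chosen here rather than taken from a particular data set.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_area(t, df, steps=2000):
    """P(T >= t) for t >= 0, via Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(alpha, df):
    """Find t such that the area beyond +/- t equals alpha (bisection)."""
    target = alpha / 2
    lo, hi = 0.0, 1.0
    while upper_tail_area(hi, df) > target:   # widen until hi is past the cutoff
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def in_rejection_region(t_stat, t_crit):
    """Two-sided rule: reject H0 when |T| exceeds the cutoff."""
    return abs(t_stat) > t_crit

t_crit = critical_value(0.05, 18)
print(f"cutoff for alpha = 0.05, df = 18: {t_crit:.3f}")
print(in_rejection_region(2.5, t_crit), in_rejection_region(1.0, t_crit))
```

The numerical integration is adequate for moderate cutoffs such as this one; very small values of $\alpha$ combined with very few degrees of freedom would call for a more careful method.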
Biologists took samples of the crawfish species Orconectes sanborii from two rivers in central Ohio, the Upper Cuyahoga River (CUY) and East Fork of Pine Creek (EFP), and measured the length (mm) of each crawfish captured.
For these data the two SEs are $3.78/\sqrt{30}=0.69$ and $2.90/\sqrt{30}=0.53$ for CUY and EFP, respectively. The degrees of freedom are $$\nu=\frac{(0.69^2+0.53^2)^2}{0.69^4/(30-1)+0.53^4/(30-1)}=54.4.$$ The SE of the difference, which is needed to compute the test statistic, is $$\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}=\sqrt{0.69^2+0.53^2}=0.87.$$
The test statistic is $$T=\frac{(22.91-21.97)-0}{0.87}=\frac{0.94}{0.87}=1.08.$$ The $p$-value for this test (found using a computer) is $0.2850$, which is greater than $0.05$, so we do not reject $H_0$. (A quick look at the $t$ table, using $\mathrm{df}=50$, shows that the $p$-value is between $0.20$ and $0.40$.)
If we construct a $95\%$ confidence interval for $\mu_1-\mu_2$ we get $$0.94\pm2.004\times0.87$$ or $(-0.80, 2.68)$. The confidence interval includes zero, which is consistent with not having significant evidence for $H_A: \mu_1-\mu_2\neq0$ in the $t$ test.
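The agreement between the test and the confidence interval for the crawfish data can be checked directly. The sketch below uses the sample means, SDs, and sample sizes quoted in the example and takes the multiplier $2.004$ from the text as given; the variable names are illustrative.

```python
import math

# Summary statistics quoted in the crawfish example.
mean_diff = 22.91 - 21.97
se_diff = math.sqrt(3.78 ** 2 / 30 + 2.90 ** 2 / 30)
t_crit = 2.004                     # multiplier quoted in the text

# Test statistic and the two-sided decision at alpha = 0.05.
t_stat = mean_diff / se_diff
fail_to_reject = abs(t_stat) <= t_crit

# 95% confidence interval for mu1 - mu2.
margin = t_crit * se_diff
ci = (mean_diff - margin, mean_diff + margin)
ci_contains_zero = ci[0] <= 0 <= ci[1]

print(f"T = {t_stat:.2f}, fail to reject H0: {fail_to_reject}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f}), contains zero: {ci_contains_zero}")
```

Because both checks compare the same three quantities, `fail_to_reject` and `ci_contains_zero` always agree, whatever the data.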
Students sometimes find it hard to distinguish between significance level $\alpha$ and $p$-value. For the $t$ test, both $\alpha$ and the $p$-value are tail areas under Student's $t$ curve. But $\alpha$ is an arbitrary prespecified value; it can be (and should be) chosen before looking at the data. By contrast, the $p$-value is determined from the data; indeed, giving the $p$-value is a way of describing the data.
We have seen that $\alpha$ can be interpreted as a probability: $$\alpha=P(\text{finding significant evidence for }H_A)\text{ if }H_0\text{ is true}.$$
The probability of making a Type II error is denoted by $$\beta=P(\text{lack of significant evidence for }H_A)\text{ if }H_A\text{ is true}.$$ The chance of not making a Type II error when $H_A$ is true (the chance of having significant evidence for $H_A$ when $H_A$ is true) is called the power of a statistical test: $$\mathrm{power}=1-\beta=P(\text{finding significant evidence for }H_A)\text{ if }H_A\text{ is true}.$$ Thus, the power of a $t$ test is a measure of the sensitivity of the test, or the ability of the test procedure to detect a difference between $\mu_1$ and $\mu_2$ when such a difference really does exist.
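Power can be made concrete with a small simulation. The sketch below is an illustration under assumptions chosen here, not a calculation from the text: both populations are taken to be normal with SD $1$, each sample has $10$ observations, and the true difference in means is either $0$ (so $H_0$ holds) or $1.5$. All function names, the seed, and the number of simulated experiments are likewise illustrative.

```python
import math
import random
import statistics

def two_sided_p_value(t_obs, df, steps=100):
    """Area under the t curve beyond +/- |t_obs| (Simpson's rule)."""
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                    - 0.5 * math.log(df * math.pi))
    exponent = -(df + 1) / 2

    def pdf(x):
        return norm * (1 + x * x / df) ** exponent

    b = abs(t_obs)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def t_test_p_value(sample1, sample2):
    """Two-sided p-value using the SE and df formulas from the text."""
    se1_sq = statistics.variance(sample1) / len(sample1)
    se2_sq = statistics.variance(sample2) / len(sample2)
    t_stat = (statistics.mean(sample1) - statistics.mean(sample2)) \
        / math.sqrt(se1_sq + se2_sq)
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq ** 2 / (len(sample1) - 1) + se2_sq ** 2 / (len(sample2) - 1)
    )
    return two_sided_p_value(t_stat, df)

def estimate_rejection_rate(true_diff, n=10, sd=1.0, alpha=0.05,
                            n_sims=2000, seed=1):
    """Fraction of simulated experiments with p-value <= alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample1 = [rng.gauss(true_diff, sd) for _ in range(n)]
        sample2 = [rng.gauss(0.0, sd) for _ in range(n)]
        if t_test_p_value(sample1, sample2) <= alpha:
            rejections += 1
    return rejections / n_sims

type_one_rate = estimate_rejection_rate(0.0)   # H0 true: estimates alpha
power = estimate_rejection_rate(1.5)           # H_A true: estimates 1 - beta
print(f"rejection rate when H0 is true: {type_one_rate:.3f}")
print(f"estimated power when mu1 - mu2 = 1.5: {power:.3f}")
```

The first estimate should land near $\alpha=0.05$ and the second well above it. Rerunning with a smaller true difference or smaller samples shows how the power falls as the difference becomes harder to detect.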
Consider a feeding experiment with lambs. The observation $Y$ will be weight gain in a 2-week trial. Ten animals will receive Diet 1, and 10 animals will receive Diet 2, where Diet 1 = Standard ration + Niacin and Diet 2 = Standard ration. On biological grounds it is expected that niacin may increase weight gain; there is no reason to suspect that it could possibly decrease weight gain. An appropriate alternative would be $$H_A:\text{ Niacin is effective in increasing weight gain }(\mu_1-\mu_2>0),$$ which calls for a right-sided hypothesis test. Suppose that we have $\bar{Y}_1=14$ lb, $\bar{Y}_2=10$ lb, $\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}=2.2$ lb, and $\nu=18$ and that we choose the significance level $\alpha=0.05$.
The test statistic is thus $$T=\frac{(\bar{Y}_1-\bar{Y}_2)-0}{\mathrm{SE}_{\bar{Y}_1-\bar{Y}_2}}=\frac{(14-10)-0}{2.2}=1.82.$$ The (right-sided) $p$-value for the test is the probability of getting a $t$ statistic, with $18$ degrees of freedom, that is as large as or larger than $1.82$. This upper tail probability (found with a computer) is $0.043$. If we did not have a computer or graphing calculator available, we could use the $t$ table to bracket the $p$-value. From the $t$ table, we see that the $p$-value would be bracketed as follows: $$0.04<p\text{-value}<0.05.$$ Since $p$-value $<\alpha=0.05$, we reject $H_0$ and conclude that there is some evidence that niacin is effective.
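For a directional alternative only one tail of the $t$ curve is counted. The sketch below approximates the upper tail area for the lamb example by numerical integration; it is an illustration rather than the computation used in the text, and the function names are chosen here.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def right_sided_p_value(t_obs, df, steps=2000):
    """P(T >= t_obs): area under the t curve to the right of t_obs."""
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3          # area between 0 and |t_obs|
    # A negative T lies on the wrong side for H_A, so its right tail is large.
    return 0.5 - half_area if t_obs >= 0 else 0.5 + half_area

# Lamb example: means 14 and 10 lb, SE of the difference 2.2 lb, 18 df.
t_stat = (14 - 10) / 2.2
p_right = right_sided_p_value(t_stat, 18)
alpha = 0.05
reject = p_right < alpha
print(f"T = {t_stat:.2f}, right-sided p-value = {p_right:.3f}, reject H0: {reject}")
```

Had the observed difference pointed the other way (a negative $T$), the same function would return a $p$-value above $0.5$, so a right-sided test could never find evidence for $H_A$ in that case.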
The $t$ test and confidence interval procedures we have described are appropriate if the following conditions hold: