In this chapter we discuss some methods for analyzing the relationship between two quantitative variables, $X$ and $Y$. Linear regression and correlation analysis are techniques based on fitting a straight line to the data.
The level of dissolved oxygen in a river is one measure of the overall health of the river. Researchers recorded water temperature (°C) and level of dissolved oxygen (mg/L) for 75 days at Dairy Creek in California. The figure below shows a scatterplot of the data, with $Y=$ level of dissolved oxygen (mg/L) plotted against $X=$ water temperature (°C). The scatterplot suggests that higher water temperatures ($X$) are associated with lower levels of dissolved oxygen ($Y$).
Suppose we have a sample of $n$ pairs for which each pair represents the measurements of two variables, $X$ and $Y$. If a scatterplot of $Y$ versus $X$ shows a general linear trend, then it is natural to try to describe the strength of the linear association. We will learn how to measure the strength of linear association using the correlation coefficient.
In a study of a free-living population of the snake Vipera bertis, researchers caught and measured nine adult females. Their body lengths and weights are listed in the following table and displayed as a scatterplot in the following figure. The number of observations is $n=9$.
The scatterplot in the preceding figure shows a clear upward trend. We say that weight shows a positive association with length: greater lengths are associated with greater weights. Thus, snakes that are longer than the average length of $\bar{X}=63$ cm tend to be heavier than the average weight of $\bar{Y}=152$ gm. The line superimposed on the plot is called the fitted regression line or least-squares line of $Y$ on $X$. We will learn how to compute and interpret the regression line later.
How strong is the linear relationship between snake length and weight? Are the data points tightly clustered around the regression line, or is the scatter loose? To answer these questions we will compute the correlation coefficient, a scale-invariant numeric measure of the strength of linear association between two quantitative variables.
To understand how the correlation coefficient works, consider again the snake length and weight example. Rather than plotting the original data, the figure and table below show the standardized data (Z scores); note that the figure looks identical to the original figure except that the scales are now unitless.
The correlation coefficient is based on the sum of the products of the standardized scores. It is computed as the average product of standardized scores (using $n-1$ rather than $n$ to compute the average): $$r=\frac{1}{n-1}\sum_{i=1}^n\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right).$$ From this formula it is clear that $X$ and $Y$ enter $r$ symmetrically; therefore, if we were to interchange the labels $X$ and $Y$ of our variables, $r$ would remain unchanged.
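As a concrete sketch, the formula above can be computed directly. This is illustrative Python, not code from the text; the function name and the toy data are our own:

```python
import math

def correlation(x, y):
    """Correlation coefficient r: the average (dividing by n - 1)
    of the products of the standardized scores."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Perfectly linear data give r = 1, and swapping x and y leaves r
# unchanged, reflecting the symmetry of the formula.
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```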
The figure below displays several examples with a variety of correlation coefficient values.
For the snake length and weight example, the sum of the products of the standardized scores is 7.5494. Thus, the correlation coefficient for the lengths and weights of our sample of nine snakes is $$r=\frac{1}{9-1}\times7.5494=0.94.$$ In this example we may also refer to the value 0.94 as the sample correlation, since the lengths and weights of these nine snakes comprise a sample from a larger population. The sample correlation is an estimate of the population correlation (often denoted by the Greek letter "rho," $\rho$).
In some investigations it is not a foregone conclusion that there is any relationship between $X$ and $Y$. It then may be relevant to consider the possibility that any apparent trend in the data is illusory and reflects only sampling variability. In this situation it is natural to formulate the null hypothesis $$H_0: X\text{ and }Y\text{ are uncorrelated in the population}$$ or, equivalently, $$H_0:\text{ There is no linear relationship between }X\text{ and }Y$$ or, symbolically, $$H_0: \rho=0\text{ vs. }H_A: \rho\neq0.$$
A traditional approach to investigate the null hypothesis is to use a $t$ test that is based on the test statistic $$T=r\sqrt{\frac{n-2}{1-r^2}}.$$ The null distribution of the test statistic is $t_{n-2}$, i.e., $$T\overset{H_0}{\sim}t_{n-2}.$$ Therefore, $H_0$ is rejected at the $\alpha$ level of significance if $$p\text{-value }=2\times P(t_{n-2}>|T|)<\alpha\mbox{ or, equivalently, }|T|>t_{n-2}(\alpha/2).$$
It is suspected that calcium in blood platelets may be related to blood pressure. As part of a study of this relationship, researchers recruited 38 subjects whose blood pressure was normal (i.e., not abnormally elevated). For each subject two measurements were made: pressure (average of systolic and diastolic measurements) and calcium concentration in the blood platelets. The data are shown in the figure below. The sample size is $n=38$, and the sample correlation is $r=0.5832$.
We wish to test the null hypothesis that there is no linear relationship between blood pressure and blood platelet calcium. Let us choose $\alpha=0.05$. The test statistic is $$T=0.5832\sqrt{\frac{38-2}{1-0.5832^2}}=4.308.$$ From the $t$ table with $\mathrm{df}=30$ (the closest available value to $\mathrm{df}=n-2=36$), we find $t_{30}(0.0005)=3.646$. Since $T=4.308>3.646$, we find $p$-value $<2\times0.0005=0.001$ (two-sided), and we reject $H_0$. The data provide strong evidence that platelet calcium is linearly related to blood pressure.
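The arithmetic of this test can be sketched in Python, with the example's summary values ($r=0.5832$, $n=38$) hard-coded:

```python
import math

def t_stat_correlation(r, n):
    """Test statistic T = r * sqrt((n - 2) / (1 - r^2)) for H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Blood pressure / platelet calcium example.
T = t_stat_correlation(0.5832, 38)
print(round(T, 3))  # → 4.308
```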
We learned how the correlation coefficient describes the strength of linear association between two numeric variables, $X$ and $Y$. In this section we will learn how to find and interpret the line that best summarizes their linear relationship.
Consider a data set for which there is a perfect linear relationship between $X$ and $Y$, for example, temperature measured in $X=$ Celsius and $Y=$ Fahrenheit. The following figure displays 20 weekly ocean temperatures (in both °C and °F) for a coastal California city along with a line that perfectly describes the relationship: $Y=32+1.8X$.
A summary of the data appears in the following table.
Because $X$ and $Y$ are measuring the same variable (temperature), it stands to reason that a water specimen that is $1$ SD above average in °C ($s_X=1.60$) will also be 1 SD above average in °F ($s_Y=2.88$). Combined, these values can describe the slope of the line that fits these data exactly: $$\frac{\mathrm{rise}}{\mathrm{run}}=\frac{s_Y}{s_X}=\frac{2.88}{1.60}=1.8.$$ In this example we also happen to know the equation of the line that describes the Celsius to Fahrenheit conversion. The slope of this line is $1.80$, the same value we found previously.
In the dissolved oxygen example, we observed a scatterplot indicating that the amount of dissolved oxygen in a river and water temperature appear to be linearly related ($r=-0.391$). The following figure displays a scatterplot of these data along with the SD line (dashed line) and fitted regression line (solid line). Each solid triangle indicates the mean dissolved oxygen level for a range of temperatures specified by the shading.
The dissolved oxygen example shows that the SD line tends to overestimate the mean value of $Y$ for below average $X$ values and underestimate the mean value of $Y$ for above average $X$ values.
For the dissolved oxygen data, the slope of the fitted regression line is $$r\frac{s_Y}{s_X}=-0.391\times\frac{1.30}{2.30}=-0.22,$$ meaning that each 1 °C increase in water temperature is associated with a 0.22 mg/L decrease in dissolved oxygen level, on average.
For the dissolved oxygen data, we found the slope of the fitted regression line to be $b_1=-0.22$. Using this value we find the intercept, $$b_0=8.73-(-0.22)\times14.58=11.94.$$ Thus, our fitted regression line is $\hat{Y}=11.94-0.22X$.
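The slope and intercept formulas can be sketched as follows. The correlation, the means, and $s_Y$ are summary values from the text; $s_X=2.30$ is an assumed value chosen to be consistent with the reported slope of $-0.22$:

```python
def fit_from_summaries(r, x_bar, s_x, y_bar, s_y):
    """Least-squares slope b1 = r * s_y / s_x and intercept
    b0 = y_bar - b1 * x_bar, computed from five summary statistics."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Dissolved oxygen example (s_x = 2.30 is assumed, see lead-in).
b0, b1 = fit_from_summaries(-0.391, 14.58, 2.30, 8.73, 1.30)
print(round(b1, 2), round(b0, 2))  # → -0.22 11.95
```

The intercept here differs from 11.94 in the last digit only because the text rounds the slope to $-0.22$ before computing $b_0$.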
We now consider a statistic that describes the scatter of the points about the fitted regression line. The equation of the fitted line is $\hat{Y}=b_0+b_1X$. Thus, for each observed $X_i$ in our data there is a predicted $Y$ value of $$\hat{Y}_i=b_0+b_1X_i.$$ Also associated with each observed pair $(X_i, Y_i)$ is a quantity called a residual, defined as $$e_i=Y_i-\hat{Y}_i.$$ A summary measure of the distances of the data points from the regression line is the error sum of squares, or SSE, which is defined as follows: $$\mathrm{SSE}=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2=\sum_{i=1}^ne_i^2.$$
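As a sketch, the residuals and SSE can be computed directly from their definitions (illustrative Python with made-up data, not from the text):

```python
def residuals(x, y, b0, b1):
    """Residuals e_i = Y_i - Yhat_i, where Yhat_i = b0 + b1 * X_i."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def sse(x, y, b0, b1):
    """Error sum of squares: the sum of the squared residuals."""
    return sum(e ** 2 for e in residuals(x, y, b0, b1))

# Points lying exactly on the line y = 1 + 2x have SSE = 0.
print(sse([1, 2, 3], [3, 5, 7], 1.0, 2.0))  # → 0.0
```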
For the dissolved oxygen data, the table below indicates how SSE would be calculated from its definition. The values displayed are abbreviated to improve readability.
Several facts:
Many different criteria can be proposed to define the straight line that “best” fits a set of data points. The classical criterion is the least-squares criterion:
The formulas given for $b_0$ and $b_1$ were derived from the least-squares criterion by applying calculus to solve the minimization problem. The fitted regression line is also called the "least-squares line".
A measure derived from the error sum of squares (SSE) that is easier to interpret is the residual standard deviation, $$s_e=\sqrt{\frac{\mathrm{SSE}}{n-2}}.$$ The residual standard deviation tells us roughly how far above or below the regression line the data points tend to be. Thus, the residual standard deviation specifies how far off predictions made using the regression model tend to be.
For the dissolved oxygen data, the residual standard deviation is $$s_e=\sqrt{\frac{106.14}{75-2}}=\sqrt{1.454}=1.21.$$ Thus, predictions for the levels of dissolved oxygen based on the regression model tend to deviate by about 1.21 mg/L on average.
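The calculation above can be sketched directly, using the SSE and sample size reported in the text:

```python
import math

def residual_sd(sse_value, n):
    """Residual standard deviation s_e = sqrt(SSE / (n - 2))."""
    return math.sqrt(sse_value / (n - 2))

# Dissolved oxygen example: SSE = 106.14 with n = 75 observations.
print(round(residual_sd(106.14, 75), 2))  # → 1.21
```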
We have said that the magnitude of $r$ describes the tightness of the linear relationship between $X$ and $Y$ and have seen how its value is related to the slope of the regression line. When squared, it also provides an additional and very interpretable summary of the regression relationship. The coefficient of determination, $r^2$, describes the proportion of the variance in $Y$ that is explained by the linear relationship between $Y$ and $X$. $$r^2=\frac{\sum_{i=1}^n(Y_i-\bar{Y})^2-\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n(Y_i-\bar{Y})^2}=1-\frac{\mathrm{SSE}}{(n-1)s_Y^2}.$$ For the dissolved oxygen data, we found $r=-0.391$, so $r^2=0.153$. Thus, $15.3\%$ of the variance in dissolved oxygen level is explained by the linear relationship between dissolved oxygen level and water temperature.
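Both routes to $r^2$ can be checked with the summary values from the text; the small discrepancy between them reflects rounding in the reported summaries:

```python
r = -0.391          # correlation from the text
sse_value = 106.14  # SSE from the text
n, s_y = 75, 1.30   # sample size and (rounded) SD of Y

r2_direct = r ** 2
r2_from_sse = 1 - sse_value / ((n - 1) * s_y ** 2)
print(round(r2_direct, 3))    # → 0.153
print(round(r2_from_sse, 3))  # → 0.151 (differs slightly: rounded inputs)
```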
One use of regression analysis is simply to provide a concise description of the data. The quantities $b_0$ and $b_1$ locate the regression line, and $s_e$ describes the scatter of the points about the line. For many purposes, however, data description is not enough. In this section we consider inference from the data to a larger population.
A conditional population of $Y$ values is a population of $Y$ values associated with a fixed, or given, value of $X$. Within a conditional population we may speak of the conditional distribution of $Y$. The mean and standard deviation of a conditional population distribution are denoted as $$E(Y|X)=\mu_{Y|X}=\text{ Population mean }Y\text{ value for a given }X$$ $$\mathrm{Var}(Y|X)=\sigma_{Y|X}^2=\text{ Population variance of }Y\text{ values for a given }X$$
Consider the variables $X=$ Height and $Y=$ Weight for a population of young men. The conditional means and standard deviations are $$\mu_{Y|X}=\text{ Mean weight of men who are }X\text{ inches tall}$$ $$\sigma_{Y|X}=\text{ SD of weights of men who are }X\text{ inches tall}$$ Thus, $\mu_{Y|X}$ and $\sigma_{Y|X}$ are the mean and standard deviation of weight in the subpopulation of men whose height is $X$. Of course, there is a different subpopulation for each value of $X$.
When we conduct a linear regression analysis, we think of $Y$ as having a distribution that depends on $X$. The analysis can be given a parametric interpretation if two conditions are met.
In the linear model $Y=\beta_0+\beta_1X+\varepsilon$, the $\varepsilon$ term represents random error. We include this term in the model to reflect the fact that $Y$ varies, even when $X$ is fixed.
Consider now the analysis of a set of $(X, Y)$ data. Suppose we assume that the linear model is an adequate description of the true relationship of $Y$ and $X$. Suppose further that we are willing to adopt the following random subsampling model:
Within the framework of the linear model and the random subsampling model, the quantities $b_0$, $b_1$, and $s_e$ calculated from a regression analysis can be interpreted as estimates of population parameters:
From the summaries of the snake data, we can compute the following regression coefficients: $b_0=-301$, $b_1=7.19$, and $s_e=12.5$ (computing these yourself from the provided summaries would be a good exercise). Thus,
The linear model provides interpretations of $b_0$, $b_1$, and $s_e$ that take them beyond data description into the domain of statistical inference. In this section we consider inference about the true slope $\beta_1$ of the regression line. The methods are based on the condition that the conditional population distribution of $Y$ for each value of $X$ is a normal distribution. This is equivalent to stating that in the linear model $Y=\beta_0+\beta_1X+\varepsilon$, the $\varepsilon$ values come from a normal distribution.
Within the context of the linear model, $b_1$ is an estimate of $\beta_1$. Like all estimates calculated from data, $b_1$ is subject to sampling error. The standard error of $b_1$ is $$\mathrm{SE}_{b_1}=\frac{s_e}{s_X\sqrt{n-1}}.$$ For the snake data, we found that $n=9, s_X=4.637$ and $s_e=12.5$. The standard error of $b_1$ is $$\mathrm{SE}_{b_1}=\frac{12.5}{4.637\sqrt{9-1}}=0.9531.$$
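The standard error formula and the snake-data calculation can be sketched as:

```python
import math

def se_b1(s_e, s_x, n):
    """Standard error of the fitted slope: s_e / (s_x * sqrt(n - 1))."""
    return s_e / (s_x * math.sqrt(n - 1))

# Snake example: s_e = 12.5, s_x = 4.637, n = 9.
print(round(se_b1(12.5, 4.637, 9), 4))  # → 0.9531
```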
In many studies the quantity $\beta_1$ is a biologically meaningful parameter and a primary aim of the data analysis is to estimate $\beta_1$. A confidence interval for $\beta_1$ can be constructed by the familiar method based on the SE and Student's $t$ distribution.
A $1-\alpha$ confidence interval for $\beta_1$ is constructed as $$b_1\pm t_{n-2}(\alpha/2)\times\mathrm{SE}_{b_1}.$$
For the snake data, we found that $b_1=7.19$ and $\mathrm{SE}_{b_1}=0.9531$. There are $n=9$ observations; we refer to the $t$ table with $\mathrm{df}=9-2=7$ and obtain $t_7(0.025)=2.365$. The $95\%$ confidence interval is $$7.19\pm2.365\times0.9531$$ or $(4.94, 9.45)$. We are $95\%$ confident that the true slope of the regression of weight on length for this snake population is between 4.94 gm/cm and 9.45 gm/cm; this is a rather wide interval because the sample size is not very large.
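The interval can be reproduced from the rounded summary values; because the inputs are rounded, the upper limit can differ from the text's in the last digit:

```python
def slope_ci(b1, se, t_crit):
    """Confidence interval for beta_1: b1 ± t_crit * SE."""
    margin = t_crit * se
    return b1 - margin, b1 + margin

# Snake example: b1 = 7.19, SE = 0.9531, t_7(0.025) = 2.365.
lo, hi = slope_ci(7.19, 0.9531, 2.365)
print(round(lo, 2), round(hi, 2))  # → 4.94 9.44
```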
In some investigations it is not a foregone conclusion that there is any linear relationship between $X$ and $Y$. It then may be relevant to consider the possibility that any apparent trend in the data is illusory and reflects only sampling variability. In this situation it is natural to formulate the null hypothesis $$H_0: \mu_{Y|X}\text{ does not depend on }X.$$ Within the linear model, this hypothesis can be translated as $$H_0:\beta_1=0.$$ A $t$ test of $H_0$ is based on the test statistic $$T=\frac{b_1-0}{\mathrm{SE}_{b_1}}.$$ The null distribution of the test statistic is $t_{n-2}$. Specifically, $$T\overset{H_0}{\sim}t_{n-2}.$$ $H_0$ is rejected at the $\alpha$ level of significance if $$p\text{-value }=2\times P(t_{n-2}>|T|)<\alpha\mbox{ or, equivalently, }|T|>t_{n-2}(\alpha/2).$$
While the forms of the test statistic are quite different, testing $H_0: \beta_1=0$ is equivalent to testing $H_0: \rho=0$. Recall that a population correlation of zero indicates that there is no linear relationship between $X$ and $Y$. In this case, the slope that best summarizes "no linear relationship" is a slope of zero.
Note that $$b_1=r\frac{s_Y}{s_X},\quad r^2=1-\frac{\mathrm{SSE}}{(n-1)s_Y^2},\quad s_e=\sqrt{\frac{\mathrm{SSE}}{n-2}},\quad\mathrm{SE}_{b_1}=\frac{s_e}{s_X\sqrt{n-1}}.$$ One can verify that the test statistic for $H_0: \beta_1=0$ is equal to the test statistic for $H_0: \rho=0$, i.e., $$\frac{b_1-0}{\mathrm{SE}_{b_1}}=r\sqrt{\frac{n-2}{1-r^2}}.$$
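The algebraic identity above can also be verified numerically on any data set. This sketch (with made-up data) computes the test statistic both ways, using $\mathrm{SE}_{b_1}=s_e/\sqrt{S_{xx}}$, where $S_{xx}=(n-1)s_X^2$:

```python
import math

def t_from_r(r, n):
    """Test statistic for H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def t_from_slope(x, y):
    """Test statistic b1 / SE_b1 for H0: beta_1 = 0, from raw data."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    s_e = math.sqrt((syy - b1 * sxy) / (n - 2))  # sqrt(SSE / (n - 2))
    return b1 / (s_e / math.sqrt(sxx))           # SE_b1 = s_e / sqrt(Sxx)

# Made-up data: both routes give the same value.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
r = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / math.sqrt(sum((xi - xbar) ** 2 for xi in x)
                 * sum((yi - ybar) ** 2 for yi in y)))
print(abs(t_from_r(r, n) - t_from_slope(x, y)) < 1e-8)  # → True
```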
For the snake data, we found that $b_1=7.19$ and $\mathrm{SE}_{b_1}=0.9531$. The test statistic is $$T=\frac{7.19-0}{0.9531}=7.54.$$ There are $n=9$ observations; we refer to the $t$ table with $\mathrm{df}=9-2=7$ and obtain $t_7(0.0005)=5.408$. Since $T=7.54>5.408$, we find that $p$-value $<0.001$ and we reject $H_0$. The data provide sufficient (and very strong) evidence to conclude that the true slope of the regression of snake body weight on body length in this population is nonzero.
Note that the test on $\beta_1$ does not ask whether the relationship between $\mu_{Y|X}$ and $X$ is linear. Rather, the test asks whether, assuming that the linear model holds, we can conclude that the slope is nonzero. It is therefore necessary to be careful in phrasing the conclusion from this test. For instance, the statement "There is a significant linear trend" could easily be misunderstood.
The quantities $b_0, b_1, s_e$, and $r$ can be used to describe a scatterplot that shows a linear trend. However, statistical inference based on these quantities depends on certain conditions concerning the design of the study, the parameters, and the conditional population distributions.