The Normal Distribution¶

Introduction¶

  • In this chapter we study the most important type of density curve: the normal curve.
  • The normal curve is a symmetric "bell-shaped" curve whose exact form we will describe next.
  • A distribution represented by a normal curve is called a normal distribution.

Example: serum cholesterol¶

The relationship between the concentration of cholesterol in the blood and the occurrence of heart disease has been the subject of much research. As part of a government health survey, researchers measured serum cholesterol levels for a large sample of Americans, including children. The distribution for children between $12$ and $14$ years of age can be fairly well approximated by a normal curve with mean $\mu=155$ mg/dl and standard deviation $\sigma=27$ mg/dl. The following figure shows a histogram based on a sample of $431$ children between $12$ and $14$ years old, with the normal curve superimposed.

Distribution of serum cholesterol in 431 12- to 14-year-old children

The Normal Curves¶

  • There are many normal curves; each particular normal curve is characterized by its mean and standard deviation.
  • If a random variable $Y$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$, then it is common to write $Y\sim N(\mu, \sigma)$.
  • The probability density function (pdf) of $Y\sim N(\mu, \sigma)$ is $$f(y)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-\mu)^2}{2\sigma^2}},$$ which expresses the height of the normal curve as a function of the position along the horizontal axis. The quantities $e$ and $\pi$ that appear in the formula are constants, with $e$ approximately equal to $2.718$ and $\pi$ approximately equal to $3.14$.
  • The figure below shows a graph of a normal curve. The shape of the curve is like a symmetric bell, centered at $y = \mu$.
  • The direction of curvature is downward (like an inverted bowl) in the central portion of the curve, and upward in the tail portions.
  • In principle the curve extends to $+\infty$ and $-\infty$, never actually touching the horizontal axis; however, the height of the curve is very small for $y$ values more than three standard deviations from the mean.
  • The area under the curve is exactly equal to $1$.

A normal curve with mean $\mu$ and standard deviation $\sigma$
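As a quick sanity check, the short R sketch below evaluates the pdf formula above directly and compares it with R's built-in dnorm(); the evaluation point $y = 160$ is an arbitrary choice, while the mean and SD are taken from the serum cholesterol example.

In [ ]:
mu    <- 155   # mean (mg/dl), from the serum cholesterol example
sigma <- 27    # standard deviation (mg/dl)
y     <- 160   # an arbitrary point on the horizontal axis

# height of the curve from the pdf formula
f_y <- 1 / (sqrt(2 * pi) * sigma) * exp(-(y - mu)^2 / (2 * sigma^2))

f_y                              # direct evaluation of the formula
dnorm(y, mean = mu, sd = sigma)  # built-in density; the two values agree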

Normal curves with different means and SDs¶

  • The location of the normal curve along the horizontal axis is governed by $\mu$, since the curve is centered at $y=\mu$;
  • The width and the height of the curve (i.e., whether tall and thin or short and wide) are governed by $\sigma$.

Three normal curves with different means and standard deviations
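The ggplot2 sketch below (the same package used for the quantile plot later in this chapter) overlays three normal curves; the particular means and standard deviations are arbitrary choices, intended only to illustrate how $\mu$ shifts the curve and $\sigma$ changes its spread.

In [ ]:
library(ggplot2)

# curve heights for three illustrative choices of mean and standard deviation
y <- seq(-6, 8, length.out = 400)
curves <- rbind(
  data.frame(y = y, density = dnorm(y, mean = 0, sd = 1), curve = "N(0, 1)"),
  data.frame(y = y, density = dnorm(y, mean = 0, sd = 2), curve = "N(0, 2)"),
  data.frame(y = y, density = dnorm(y, mean = 3, sd = 1), curve = "N(3, 1)")
)

ggplot(curves, aes(x = y, y = density, colour = curve)) +
  geom_line() +
  labs(x = "y", y = "density") +
  theme_bw()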

Areas under a Normal Curve¶

  • The standard normal distribution, represented by $Z$, is the normal distribution having a mean of $0$ and a standard deviation of $1$. That is, $Z\sim N(0, 1)$.
  • If $Y$ is a random variable from a normal distribution with mean $\mu$ and standard deviation $\sigma$, its Z-score (standardized value) is calculated by subtracting $\mu$ and dividing by the standard deviation $\sigma$: $$Z=\frac{Y-\mu}{\sigma}.$$
  • The Z table gives areas under the standard normal curve, with distances along the horizontal axis measured on the Z scale.
  • Each area in the body of the Z table is the area under the standard normal curve below the value of $z$ given in the margins.
  • To find the area above a given value of $z$, we subtract the tabulated area from $1$.
  • To find the area between two values of $z$, we subtract the tabulated area for the smaller $z$ from the tabulated area for the larger $z$.

A normal curve, showing the relationship between the natural scale ($Y$) and the standardized scale ($Z$)
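In R, pnorm() returns the same lower-tail areas that the Z table reports; a few illustrative calls (the z values here are arbitrary):

In [ ]:
pnorm(1.96)               # area below z = 1.96
1 - pnorm(1.96)           # area above z = 1.96
pnorm(0.4) - pnorm(-1.2)  # area between z = -1.2 and z = 0.4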

The empirical rule for normal distribution¶

If the variable $Y$ follows a normal distribution, then

  • about $68\%$ of the $y$'s are within $\pm1$ SD of the mean.
  • about $95\%$ of the $y$'s are within $\pm2$ SDs of the mean.
  • about $99.7\%$ of the $y$'s are within $\pm3$ SDs of the mean.

Empirical rule for standard normal distribution
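A quick check of the empirical rule using pnorm() on the standard normal scale:

In [ ]:
pnorm(1) - pnorm(-1)  # area within 1 SD of the mean, about 0.68
pnorm(2) - pnorm(-2)  # area within 2 SDs of the mean, about 0.95
pnorm(3) - pnorm(-3)  # area within 3 SDs of the mean, about 0.997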

Determining areas for a normal curve¶

By taking advantage of the standardized scale, we can use the Z table to answer detailed questions about any normal population once the population mean and standard deviation are specified.

A professor's exam scores are approximately normally distributed with mean $80$ and standard deviation $5$.

  • What is the probability that a student scores an $82$ or less? $0.65542$
  • What is the probability that a student scores a $90$ or more? $0.02275$
  • What is the probability that a student scores between $74$ and $82$? $0.54035$
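These answers can be reproduced in R with pnorm(), which accepts the mean and standard deviation directly, so no hand standardization is needed:

In [ ]:
pnorm(82, mean = 80, sd = 5)                                # P(Y <= 82), about 0.6554
1 - pnorm(90, mean = 80, sd = 5)                            # P(Y >= 90), about 0.0228
pnorm(82, mean = 80, sd = 5) - pnorm(74, mean = 80, sd = 5) # P(74 <= Y <= 82), about 0.5403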

Inverse reading of Z table¶

We often need to find the $z$-value corresponding to a given area when we want to determine a percentile of a normal distribution. For example, suppose we want to find the $70$th percentile of a standard normal distribution. We look in the Z table for an area of $0.7000$. The closest value is an area of $0.6985$, corresponding to a $z$ value of $0.52$.

  • What is the first quartile of the exam score distribution? $76.65$
  • What is the $70$th percentile of the exam score distribution? $82.6$
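The same inverse lookups can be done with qnorm(), the inverse of pnorm(); its answers may differ slightly from the table-based ones because it does not round $z$ to two decimal places:

In [ ]:
qnorm(0.70)                    # 70th percentile of the standard normal, about 0.52
qnorm(0.25, mean = 80, sd = 5) # first quartile of the exam scores, about 76.6
qnorm(0.70, mean = 80, sd = 5) # 70th percentile of the exam scores, about 82.6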

Assessing Normality¶

Many statistical procedures are based on having data from a normal population. In this section we consider ways to assess whether it is reasonable to use a normal curve model for a set of data and, if not, how we might proceed.

Normal quantile plots¶

A normal quantile plot is a special statistical graph that is used to assess normality. We present this statistical tool with an example using the heights (in inches) of a sample of $11$ women, sorted from smallest to largest:

$$61, 62.5, 63, 64, 64.5, 65, 66.5, 67, 68, 68.5, 70.5$$

Based on these data, does it make sense to use a normal curve to model the distribution of women's heights?

Computing indices and percentiles for the heights of 11 women

  • sort the data from smallest to largest.
  • calculate the adjusted percentiles $100(i-1/2)/n$.
  • find the corresponding Z scores.
  • calculate the theoretical quantiles $\mu+Z\times\sigma$.
  • plot the sample quantiles against the theoretical quantiles in a scatterplot.

Computing theoretical Z scores and heights for 11 women
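A minimal R sketch of these steps, reproducing the table above; the sample mean and standard deviation stand in for $\mu$ and $\sigma$, so the values may differ slightly from the table because of rounding:

In [ ]:
heights <- c(61, 62.5, 63, 64, 64.5, 65, 66.5, 67, 68, 68.5, 70.5)  # already sorted
n <- length(heights)
i <- 1:n

p <- (i - 1/2) / n                              # adjusted percentiles
z <- qnorm(p)                                   # theoretical Z scores
theoretical <- mean(heights) + z * sd(heights)  # theoretical heights, mu + Z * sigma

data.frame(i, height = heights, percentile = 100 * p,
           z = round(z, 2), theoretical = round(theoretical, 1))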

Normal quantile plot of the heights of 11 women

  • In this case our plot appears fairly linear, suggesting that the observed values generally agree with the theoretical values and that the normal model provides a reasonable approximation to the data.
  • If the data do not agree with the normal model, then the plot will show strong nonlinear patterns such as curvature or S shapes.

Skewness in normal quantile plots¶

Histogram and normal quantile plot of a distribution that is skewed to the left

Histogram and normal quantile plot of a distribution that is skewed to the right

Histogram and normal quantile plot of a distribution that has long tails

In [7]:
library(ggplot2)

# Create the normal quantile plot using ggplot2
g <- ggplot(data.frame(y = c(61, 62.5, 63, 64, 
                             64.5, 65, 66.5, 67, 
                             68, 68.5, 70.5)), 
            aes(sample = y)) +
    stat_qq() +
    stat_qq_line() +
    labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
    ggtitle("Normal Quantile Plot") +
    theme_bw() +
    theme(text = element_text(size = 20))
options(repr.plot.width=6, repr.plot.height=5)
g

Transformations for nonnormal data¶

  • Sometimes a histogram or normal quantile plot shows that our data are nonnormal, but a transformation of the data gives us a symmetric, bell-shaped curve.
  • In such a situation, we may wish to transform the data and continue our analysis in the new (transformed) scale.
  • In general, if the distribution is skewed to the right then one of the following transformations should be considered: $$\sqrt{Y}, \log Y, 1/\sqrt{Y}, 1/Y.$$
  • These transformations will pull in the long right-hand tail and push out the short left-hand tail, making the distribution more nearly symmetric. Each of these is more drastic than the one before. Thus, a square root transformation will change a mildly skewed distribution into a symmetric distribution, but a log transformation may be needed if the distribution is more heavily skewed, and so on.
  • If the distribution of a variable $Y$ is skewed to the left, then raising $Y$ to a power greater than $1$ (for example, $Y^2$) can be helpful.
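
As an illustration, the sketch below simulates a right-skewed variable (lognormal data, chosen only for demonstration) and shows how a log transformation straightens its normal quantile plot:

In [ ]:
set.seed(1)
y <- rlnorm(200, meanlog = 0, sdlog = 1)  # simulated data, skewed to the right

par(mfrow = c(1, 2))
qqnorm(y, main = "Original scale")
qqline(y)                                 # strong curvature: poor normal fit
qqnorm(log(y), main = "Log scale")
qqline(log(y))                            # nearly linear after the transformation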