Sampling Distributions¶

Basic Ideas¶

An important goal of data analysis is to distinguish between features of the data that reflect real biological facts and features that may reflect only chance effects.
The random sampling model provides a framework for making this distinction: chance effects are regarded as sampling error, that is, discrepancy between the sample and the population.
In this chapter we develop the theoretical background that will enable us to place specific limits on the degree of sampling error to be expected in a study.

Sampling variability¶

The variability among random samples from the same population is called sampling variability.
A probability distribution that characterizes some aspect of sampling variability is termed a sampling distribution.
Because of sampling error, we must expect a certain amount of discrepancy between the sample and the population.

The meta-study¶

A meta-study consists of indefinitely many repetitions, or replications, of the same study. If the study consists of drawing a random sample of size $n$ from some population, the corresponding meta-study involves drawing repeated random samples of size $n$ from the same population.

Schematic representation of study and meta-study

The Sample Mean¶

The sample mean $\bar{y}$ can be used, not only as a description of the data in the sample, but also as an estimate of the population mean $\mu$. It is natural to ask, "How close to $\mu$ is $\bar{y}$?"
We cannot answer this question for the mean $\bar{y}$ of a particular sample due to the randomness of the sample. Regarding the sample mean as a random variable $\bar{Y}$, the question then becomes: "How close to $\mu$ is $\bar{Y}$ likely to be?"
To characterize such randomness, we resort to the sampling distribution of the sample mean $\bar{Y}$, the probability distribution that describes sampling variability in $\bar{Y}$.

To visualize the sampling distribution of $\bar{Y}$, imagine the meta-study as follows:

Random samples of size $n$ are repeatedly drawn from a fixed population with mean $\mu$ and standard deviation $\sigma$; each sample has its own mean $\bar{y}$.
The variation of the $\bar{y}$'s among the samples is specified by the sampling distribution of $\bar{Y}$.

Schematic representation of the sampling distribution of $\bar{Y}$

When we think of $\bar{Y}$ as a random variable, we need to be aware of two basic facts:

On average, the sample mean equals the population mean. That is, the mean of the sampling distribution of $\bar{Y}$ is $\mu$.
As the sample size increases, the standard deviation of $\bar{Y}$ decreases. That is, for larger samples, $\bar{Y}$ will tend to be closer to the population mean.
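The two facts above can be checked with a small simulated meta-study. The sketch below is only an illustration: the population values ($\mu=100$, $\sigma=20$), the normal population shape, the number of replications, and the function name `simulate_sample_means` are all assumptions made for the example, not part of the text.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, replications, seed=0):
    """Simulate a meta-study: draw `replications` random samples of size n
    from a normal population and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(replications)
    ]

# Hypothetical population values, chosen only for illustration.
mu, sigma = 100.0, 20.0
for n in (4, 16, 64):
    means = simulate_sample_means(mu, sigma, n, replications=5000)
    # The average of the sample means should be near mu, and their
    # standard deviation should shrink as n grows.
    print(n, round(statistics.fmean(means), 2), round(statistics.stdev(means), 2))
```

In such a run the average of the simulated means should stay near $\mu$ for every $n$, while their spread should become smaller as $n$ increases.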

Theorem¶

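Theorem 5.2.1 (the sampling distribution of $\bar{Y}$)

Consider the random sample $Y_1, \ldots, Y_n$, drawn from a population with mean $\mu$ and standard deviation $\sigma$. The sample mean is denoted as $\bar{Y}=\frac{1}{n}\sum_{i=1}^nY_i$. Then:

Part 1 (mean): $\mu_{\bar{Y}}=\mu$.
Part 2 (standard deviation): $\sigma_{\bar{Y}}=\sigma/\sqrt{n}$.
Part 3 (shape): if the population distribution of $Y$ is normal, then the sampling distribution of $\bar{Y}$ is normal for any $n$; otherwise, it is approximately normal when $n$ is large (the Central Limit Theorem).

Try to derive Parts 1 and 2 of the above theorem.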
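One possible derivation of the mean and standard deviation of $\bar{Y}$ is sketched here. It assumes that $Y_1, \ldots, Y_n$ are independent and that each has mean $\mu$ and variance $\sigma^2$, which is what the random sampling model is taken to imply.

For the mean, expectation is linear, so
$$E(\bar{Y})=E\left(\frac{1}{n}\sum_{i=1}^nY_i\right)=\frac{1}{n}\sum_{i=1}^nE(Y_i)=\frac{1}{n}\cdot n\mu=\mu.$$

For the standard deviation, the variance of a sum of independent random variables is the sum of their variances, so
$$\mathrm{Var}(\bar{Y})=\frac{1}{n^2}\sum_{i=1}^n\mathrm{Var}(Y_i)=\frac{1}{n^2}\cdot n\sigma^2=\frac{\sigma^2}{n},\qquad\text{hence}\qquad\sigma_{\bar{Y}}=\frac{\sigma}{\sqrt{n}}.$$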

The Central Limit Theorem states that, no matter what distribution $Y$ may have in the population, if the sample size is large enough, then the sampling distribution of $\bar{Y}$ will be approximately a normal distribution.
It is because of the Central Limit Theorem (and other similar theorems) that the normal distribution plays such a central role in statistics.
It is natural to ask how large a sample size the Central Limit Theorem requires. The answer depends on the shape of the population distribution of $Y$:
- If the population distribution is normal, any $n$ will do.
- If it is moderately nonnormal, a moderate $n$ is adequate.
- If it is highly nonnormal, then a rather large $n$ will be required.
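The effect of sample size on the shape of the sampling distribution can be explored by simulation. The sketch below uses an exponential population as a stand-in for a "highly nonnormal" shape and uses skewness (which is zero for a normal distribution) as a rough measure of nonnormality. The choice of population, the helper names `skewness` and `sample_mean_skewness`, and the number of replications are assumptions made for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_mean_skewness(n, replications=4000, seed=1):
    """Skewness of the simulated sampling distribution of the mean for
    samples of size n from an exponential (strongly skewed) population."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]
    return skewness(means)

# Skewness near 0 suggests a shape close to normal.
for n in (1, 5, 25, 100):
    print(n, round(sample_mean_skewness(n), 2))
```

The printed skewness should move toward zero as $n$ grows, which is the Central Limit Theorem at work for a skewed population.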

Example: weights of seeds¶

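A large population of seeds of the princess bean Phaseolus vulgaris is to be sampled. The weights of the seeds in the population follow a normal distribution with mean $\mu=500$ mg and standard deviation $\sigma=120$ mg. Suppose now that a random sample of four seeds is to be weighed, and let $\bar{Y}$ represent the mean weight of the four seeds. What is the sampling distribution of $\bar{Y}$? Because the population is normal, $\bar{Y}$ is normal with mean $\mu_{\bar{Y}}=500$ mg and standard deviation $\sigma_{\bar{Y}}=120/\sqrt{4}=60$ mg, that is, $N(500, 3600)$, where the second parameter is the variance $60^2$.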
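The calculation in this example can be reproduced with Python's standard library. The sketch below builds the sampling distribution of $\bar{Y}$ from the population values given in the example. The follow-up question (the chance that the mean of four seeds exceeds $550$ mg) is a hypothetical one added only to show how the distribution might be used.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500.0, 120.0, 4   # population values from the example (mg)
se = sigma / sqrt(n)             # standard deviation of the sample mean
ybar_dist = NormalDist(mu, se)   # normal because the population is normal

# Hypothetical question, not from the text: chance that the mean weight
# of the four seeds exceeds 550 mg.
p_above_550 = 1 - ybar_dist.cdf(550)
print(se, ybar_dist.variance, round(p_above_550, 4))
```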

Dependence on sample size¶

Larger $n$ gives a smaller value of $\sigma_{\bar{Y}}$ and consequently a smaller expected sampling error if $\bar{y}$ is used as an estimate of $\mu$.
If the population distribution is not normal, then the shape of the sampling distribution of $\bar{Y}$ depends on $n$, being more nearly normal for larger $n$.
The mean of a larger sample is not necessarily closer to $\mu$ than the mean of a smaller sample, but it has a greater probability of being close. It is in this sense that a larger sample provides more information about the population mean than a smaller sample.
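To make "a greater probability of being close" concrete, the following sketch computes $P(|\bar{Y}-\mu|<\text{tol})$ for several sample sizes, using the seed-weight population ($\mu=500$ mg, $\sigma=120$ mg, normal). The tolerance of $50$ mg, the sample sizes, and the function name `prob_within` are arbitrary choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(mu, sigma, n, tol):
    """P(|Ybar - mu| < tol) when Ybar is normal with mean mu and
    standard deviation sigma / sqrt(n)."""
    dist = NormalDist(mu, sigma / sqrt(n))
    return dist.cdf(mu + tol) - dist.cdf(mu - tol)

# Seed-weight population from the example; tol = 50 mg is an assumption.
for n in (4, 16, 64):
    print(n, 120 / sqrt(n), round(prob_within(500, 120, n, 50), 3))
```

The probability of landing within the tolerance rises with $n$, although no single larger sample is guaranteed to be closer than a smaller one.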

Sampling distribution of $\bar{Y}$ for various sample sizes $n$

Populations, samples, and sampling distributions¶

It is important to distinguish clearly among three different distributions related to a quantitative variable $Y$:

the distribution of $Y$ in the population;
the distribution of $Y$ in a sample of data; and
the sampling distribution of $\bar{Y}$.

Distribution	Mean	Standard deviation
$Y$ in population	$\mu$	$\sigma$
$Y$ in sample	$\bar{y}$	$s$
$\bar{Y}$ (in meta-study)	$\mu_{\bar{Y}}=\mu$	$\sigma_{\bar{Y}}=\sigma/\sqrt{n}$

Example¶

Recall the weights of seeds example, in which the population mean and standard deviation are $\mu=500$ mg and $\sigma=120$ mg. Suppose we weigh a random sample of $n=25$ seeds from the population and obtain the data in the table below.

Weights of $25$ princess bean seeds

The population distribution of $Y=$ weights is represented in (a).
The sample mean is $\bar{y}=526.1$ mg and the sample standard deviation is $s=113.7$ mg. (b) shows a histogram of the data; this histogram represents the distribution of $Y$ in the sample.
The sampling distribution of $\bar{Y}$ as shown in (c) is a theoretical distribution which relates, not to the particular sample shown in the histogram, but rather to the meta-study of infinitely repeated samples of size $n=25$. The mean and standard deviation of the sampling distribution are $\mu_{\bar{Y}}=500$ mg and $\sigma_{\bar{Y}}=120/\sqrt{25}=24$ mg.

Three distributions

Notice that the distributions in (a) and (b) are more or less similar; in fact, the distribution in (b) is an estimate of the distribution in (a). By contrast, the distribution in (c) is much narrower, because it represents a distribution of means rather than of individual observations.

The Normal Approximation to the Binomial Distribution¶

The binomial random variable $X\sim B(n, p)$ is the sum of $n$ independent and identically distributed Bernoulli random variables, each with expected value $p$ and variance $p(1-p)$. In other words, if $X_1, \ldots, X_n$ are independent Bernoulli random variables with the same parameter $p$, then $X=X_1+\cdots+X_n$.
Think of $X_1, \ldots, X_n$ as a random sample. Then the sample mean $\hat{P}=\frac{1}{n}\sum_{i=1}^nX_i$, which is the sample proportion of successes, is governed by the Central Limit Theorem.

Theorem¶

If $n$ is large, then the sampling distribution of the sample proportion of successes, $\hat{P}$, can be approximated by a normal distribution with mean $=p$ and standard deviation $=\sqrt{p(1-p)/n}$.
If $n$ is large, then the binomial distribution of the number of successes, $Y=n\hat{P}$ (the variable written $X$ above), can be approximated by a normal distribution with mean $=np$ and standard deviation $=\sqrt{np(1-p)}$.

Example: normal approximation to binomial¶

We consider a binomial distribution with $n=50$ and $p=0.3$. (a) shows this binomial distribution, using spikes to represent probabilities; superimposed is a normal curve with mean $=np=15$ and standard deviation $=\sqrt{np(1-p)}=3.24$. (b) shows the sampling distribution of $\hat{P}$; superimposed is a normal curve with mean $=p=0.3$ and standard deviation $=\sqrt{p(1-p)/n}=0.0648$.

The normal approximation (blue curve) to the binomial distribution (black spikes) with $n=50$ and $p=0.3$

To illustrate the use of the normal approximation, let us find the probability that $50$ independent trials result in at least $18$ successes, i.e., $P(Y\geq18)$. The exact calculation using the binomial formula is tedious because it involves $50-18+1=33$ terms; it gives $0.2178$. If instead the normal approximation is adopted, we only need to find the corresponding area under the normal curve.

Normal approximation to the probability of at least $18$ successes

The Z score that corresponds to $18$ is $$z=\frac{18-15}{3.2404}=0.93.$$ Using the Z table, we find that the area above $z=0.93$ is $1-0.8238=0.1762$.
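The exact binomial sum is tedious by hand but easy by computer, so the two answers can be compared directly. The sketch below uses only the Python standard library; the helper name `binom_pmf` is an assumption. Because the code does not round $z$ to two decimals, its normal approximation may differ slightly from the Z table value.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50, 0.3

# Exact P(Y >= 18): sum the 33 binomial terms for k = 18, ..., 50.
exact = sum(binom_pmf(k, n, p) for k in range(18, n + 1))

# Normal approximation without continuity correction: area above 18.
approx_dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
approx = 1 - approx_dist.cdf(18)

print(round(exact, 4), round(approx, 4))
```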

The continuity correction¶

How would we use the normal approximation to compute $P(Y=18)$, the probability of exactly $18$ successes? A single point has zero area under a continuous curve.
Instead, we think of "$18$" as covering the space from $17.5$ to $18.5$ and thus we consider the area under the normal curve between $17.5$ and $18.5$.
Now compute $P(Y\geq18)$ using the continuity correction. The event $Y\geq18$ covers everything above $17.5$.
- The Z score is $$z=\frac{17.5-15}{3.2404}=0.77.$$
- From the Z table, we find that the area above $0.77$ is $1-0.7794=0.2206$, which is closer to the exact value $0.2178$ than the uncorrected approximation $0.1762$.
What about $P(12\leq Y\leq18)$ and $P(12<Y<18)$?

Continuity correction

Summary of continuity correction¶

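For an integer $k$, replace the binomial probability named first by the normal-curve area that follows:
For $P(Y=k)$, use $$P(k-0.5<Y<k+0.5).$$
For $P(Y>k)$, use $$P(Y>k+0.5).$$
For $P(Y\leq k)$, use $$P(Y<k+0.5).$$
For $P(Y<k)$, use $$P(Y<k-0.5).$$
For $P(Y\geq k)$, use $$P(Y>k-0.5).$$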
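The rules above all amount to widening an integer range by $0.5$ at each end. The sketch below applies that idea to the binomial distribution with $n=50$ and $p=0.3$ and compares each corrected normal approximation with the exact binomial probability. The function names `binom_prob` and `normal_prob_cc` are assumptions made for illustration. A one-sided event such as $Y\geq18$ is written as the range $18$ to $50$, which leaves a negligible normal tail above $50.5$.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_prob(lo, hi, n, p):
    """Exact P(lo <= Y <= hi) for Y ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lo, hi + 1))

def normal_prob_cc(lo, hi, n, p):
    """Normal approximation to P(lo <= Y <= hi) with continuity correction:
    the integers lo..hi are taken to cover the interval (lo - 0.5, hi + 0.5)."""
    dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
    return dist.cdf(hi + 0.5) - dist.cdf(lo - 0.5)

n, p = 50, 0.3
cases = {
    "P(Y = 18)": (18, 18),
    "P(Y >= 18)": (18, n),
    "P(12 <= Y <= 18)": (12, 18),
    "P(12 < Y < 18)": (13, 17),   # strict inequalities exclude 12 and 18
}
for label, (lo, hi) in cases.items():
    print(label, round(binom_prob(lo, hi, n, p), 4), round(normal_prob_cc(lo, hi, n, p), 4))
```

Note how the strict inequality $12<Y<18$ becomes the integer range $13$ to $17$ before the correction is applied.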

How large must $n$ be?¶

The required $n$ depends on the value of $p$.

If $p=0.5$, then the binomial distribution is symmetric and the normal approximation is quite good even for $n$ as small as $10$.
However, if $p=0.1$, the binomial distribution for $n=10$ is quite skewed and is poorly fitted by a normal curve; for larger $n$ the skewness is diminished and the normal approximation is better.
A simple rule of thumb is the following:
- The normal approximation to the binomial distribution is fairly good if both $np$ and $n(1-p)$ are at least equal to $5$.
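The rule of thumb is easy to encode, and its effect can be seen by measuring how far the approximation is from the exact distribution. In the sketch below, `rule_of_thumb_ok` implements the rule as stated, while `max_cdf_error` is a diagnostic chosen for this illustration (the largest gap between the exact binomial cumulative probabilities and their continuity-corrected normal approximations). Neither function name comes from the text.

```python
from math import comb, sqrt
from statistics import NormalDist

def rule_of_thumb_ok(n, p):
    """True if both np and n(1 - p) are at least 5."""
    return n * p >= 5 and n * (1 - p) >= 5

def max_cdf_error(n, p):
    """Largest absolute gap between the exact binomial CDF and its
    continuity-corrected normal approximation, over k = 0, ..., n."""
    dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
    cumulative, worst = 0.0, 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(cumulative - dist.cdf(k + 0.5)))
    return worst

for n, p in ((10, 0.5), (10, 0.1), (50, 0.1), (50, 0.3)):
    print(n, p, rule_of_thumb_ok(n, p), round(max_cdf_error(n, p), 4))
```

With $n=10$, the error for $p=0.1$ should come out clearly larger than for $p=0.5$, in line with the skewness described above.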