# 5 Introduction to statistical inference

Statistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates. While the equations and details change depending on the setting, the foundations for inference are the same throughout all of statistics.

We start with two case studies designed to motivate the process of making decisions about research claims. We formalize the process through the introduction of the hypothesis testing framework, which allows us to formally evaluate claims about the population.

Finally, we expand on the familiar idea of using a sample proportion to estimate a population proportion. That is, we create what is called a confidence interval, which is a range of plausible values for the true population value.

Throughout the book so far, you have worked with data in a variety of contexts. You have learned how to summarize and visualize the data as well as how to model multiple variables at the same time. Sometimes the dataset at hand represents the entire research question. But more often than not, the data have been collected to answer a research question about a larger group of which the data are a (hopefully) representative subset.

You may agree that there is almost always variability in data – one dataset will not be identical to a second dataset even if they are both collected from the same population using the same methods. However, quantifying the variability in the data is neither obvious nor easy to do, i.e. answering the question “how different is one dataset from another?” is not trivial.

Suppose your professor splits the students in your class into two groups: students who sit on the left side of the classroom and students who sit on the right side of the classroom. If $$\hat{p}_{L}$$ represents the proportion of students who sit on the left side of the classroom and own an Apple product and $$\hat{p}_{R}$$ represents the proportion of students who sit on the right side of the classroom and own an Apple product, would you be surprised if $$\hat{p}_{L}$$ did not exactly equal $$\hat{p}_{R}$$?

While the proportions $$\hat{p}_{L}$$ and $$\hat{p}_{R}$$ would probably be close to each other, it would be unusual for them to be exactly the same. We would probably observe a small difference due to chance.

If we don’t think the side of the room a person sits on in class is related to whether the person owns an Apple product, what assumption are we making about the relationship between these two variables? (Reminder: for these Guided Practice questions, you can check your answer in the footnote.)106

Studying randomness of this form is a key focus of statistics. Throughout this chapter, and those that follow, we provide three different approaches for quantifying the variability inherent in data: randomization, bootstrapping, and mathematical models. Using the methods provided in this chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that these samples come from.

## 5.1 Randomization tests

The first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units. As you learned in Chapter 1, a randomized experiment can be used to assess whether or not one variable (the explanatory variable) causes changes in a second variable (the response variable). Every dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison. That is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level. By reassigning the treatments many, many times, we can compare the actual experiment to the sham experiments. If the actual experiment has more extreme results than nearly all of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not just variability inherent to the data. Using a few different case studies, let’s look more carefully at this idea of a randomization test.

### 5.1.1 Gender discrimination case study

We consider a study investigating gender discrimination in the 1970s, which is set in the context of personnel decisions within a bank.107 The research question we hope to answer is, “Are females discriminated against in promotion decisions made by male managers?”

The data from this study can be found in the openintro package: gender_discrimination.

#### Observed data

The participants in this study were 48 male bank supervisors attending a management institute at the University of North Carolina in 1972. They were asked to assume the role of the personnel director of a bank and were given a personnel file to judge whether the person should be promoted to a branch manager position. The files given to the participants were identical, except that half of them indicated the candidate was male and the other half indicated the candidate was female. These files were randomly assigned to the subjects.

Is this an observational study or an experiment? How does the type of study impact what can be inferred from the results?108

For each supervisor we recorded the gender associated with the assigned file and the promotion decision. Using the results of the study summarized in Table 5.1, we would like to evaluate if females are unfairly discriminated against in promotion decisions. In this study, a smaller proportion of females are promoted than males (0.583 versus 0.875), but it is unclear whether the difference provides convincing evidence that females are unfairly discriminated against.

Table 5.1: Summary results for the gender discrimination study.

| gender | decision: promoted | decision: not promoted | Total |
|---|---|---|---|
| male | 21 | 3 | 24 |
| female | 14 | 10 | 24 |
| Total | 35 | 13 | 48 |
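As a quick check, the promotion rates quoted above can be recomputed from the counts in Table 5.1. The following is a minimal Python sketch; the dictionary layout and the function name `promotion_rate` are illustrative choices made here, not part of the openintro package.

```python
# Counts from Table 5.1: promotion decisions by the gender listed on the file.
counts = {
    "male": {"promoted": 21, "not promoted": 3},
    "female": {"promoted": 14, "not promoted": 10},
}


def promotion_rate(group):
    """Proportion of files in a group that were recommended for promotion."""
    total = group["promoted"] + group["not promoted"]
    return group["promoted"] / total


rate_male = promotion_rate(counts["male"])
rate_female = promotion_rate(counts["female"])
observed_diff = rate_male - rate_female

print(f"male: {rate_male:.3f}, female: {rate_female:.3f}, "
      f"difference: {observed_diff:.3f}")
```

Rounded to three decimals, the two rates are 0.875 and 0.583, and their difference is 0.292.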

The data are visualized in Figure 5.1. Note that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted. Additionally, the observations are broken up into the male and female groups.

Figure 5.1: The gender discrimination study can be thought of as 48 red and white cards.

Statisticians are sometimes called upon to evaluate the strength of evidence. When looking at the rates of promotion for males and females in this study, why might we be tempted to immediately conclude that females are being discriminated against?

The large difference in promotion rates (58.3% for females versus 87.5% for males) suggests there might be discrimination against women in promotion decisions. However, we cannot yet be sure if the observed difference represents discrimination or is just due to random chance. Since we wouldn’t expect the sample proportions to be exactly equal, even if the truth was that the promotion decisions were independent of gender, we can’t rule out random chance as a possible explanation when simply comparing the sample proportions.

The previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population. Table 5.1 shows there were 7 fewer promotions in the female group than in the male group, a difference in promotion rates of 29.2% $$\left( \frac{21}{24} - \frac{14}{24} = 0.292 \right)$$. This observed difference is what we call a point estimate of the true difference. The point estimate of the difference is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance. We label these two competing claims, $$H_0$$ and $$H_A$$:

• $$H_0$$: Null hypothesis. The variables gender and decision are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 29.2%, was due to random chance.

• $$H_A$$: Alternative hypothesis. The variables gender and decision are not independent. The difference in promotion rates of 29.2% was not due to random chance, and equally qualified females are less likely to be promoted than males.

Hypothesis testing

These hypotheses are part of what is called a hypothesis test. A hypothesis test is a statistical technique used to evaluate competing claims using data. Often, the null hypothesis takes a stance of no difference or no effect.

If the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.

There are many nuances to hypothesis testing, so don’t worry if you aren’t a master of hypothesis testing at the end of this section. We’ll discuss these ideas and details many times in this chapter as well as in the chapters that follow.

What would it mean if the null hypothesis, which says the variables gender and decision are unrelated, was true? It would mean each banker would decide whether to promote the candidate without regard to the gender indicated on the file. That is, the difference in the promotion percentages would be due to the way the files were randomly distributed to the bankers, and the randomization just happened to give rise to a relatively large difference of 29.2%.

Consider the alternative hypothesis: bankers were influenced by which gender was listed on the personnel file. If this was true, and especially if this influence was substantial, we would expect to see some difference in the promotion rates of male and female candidates. If this gender bias was against females, we would expect a smaller fraction of promotion recommendations for female personnel files relative to the male files.

We will choose between these two competing claims by assessing if the data conflict so much with $$H_0$$ that the null hypothesis cannot be deemed reasonable. If this is the case, and the data seem to support $$H_A$$, then we will reject the notion of independence and conclude that these data provide strong evidence of discrimination.

#### Variability of the statistic

Table 5.1 shows that 35 bank supervisors recommended promotion and 13 did not. Now, suppose the bankers’ decisions were independent of gender. Then, if we conducted the experiment again with a different random assignment of gender to the files, differences in promotion rates would be based only on random fluctuation. We can actually perform this randomization, which simulates what would have happened if the bankers’ decisions had been independent of gender but we had distributed the file genders differently.109

In this simulation, we thoroughly shuffle 48 personnel files, 35 labelled promoted and 13 labelled not promoted, and we deal these files into two stacks. Note that by keeping 35 promoted and 13 not promoted, we are assuming that 35 of the bank supervisors would have promoted the candidate described in the file regardless of the gender listed. We will deal 24 files into the first stack, which will represent the 24 “female” files. The second stack will also have 24 files, and it will represent the 24 “male” files. Figure 5.2 highlights both the shuffle and the reallocation to the sham gender groups.

Figure 5.2: The gender discrimination data is shuffled and reallocated to the gender groups.

Then, as we did with the original data, we tabulate the results and determine the fraction of personnel files designated as “male” and “female” who were promoted.
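The shuffle-and-deal step can be sketched in a few lines of Python using only the standard library. This illustrates one sham experiment and is not code from the study or from the openintro package. The seed value and variable names are arbitrary assumptions made here, so the simulated difference will generally not match Table 5.2.

```python
import random

# 48 personnel files: 35 labelled promoted, 13 labelled not promoted
# (the column totals of Table 5.1).
files = ["promoted"] * 35 + ["not promoted"] * 13

rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
rng.shuffle(files)

# Deal the shuffled files into two stacks of 24.
sham_female = files[:24]   # first stack: the 24 "female" files
sham_male = files[24:]     # second stack: the 24 "male" files

rate_female = sham_female.count("promoted") / len(sham_female)
rate_male = sham_male.count("promoted") / len(sham_male)
sim_diff = rate_male - rate_female

print(f"simulated promotion rates: male {rate_male:.3f}, female {rate_female:.3f}")
print(f"simulated difference: {sim_diff:.3f}")
```

Because the labels are shuffled without regard to the decisions, any difference printed here arises from chance alone.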

Since the randomization of files in this simulation is independent of the promotion decisions, any difference in the two promotion rates is entirely due to chance. Table 5.2 shows the results of one such simulation.

Table 5.2: Simulation results, where the difference in promotion rates between male and female is purely due to random chance.

| gender | decision: promoted | decision: not promoted | Total |
|---|---|---|---|
| male | 18 | 6 | 24 |
| female | 17 | 7 | 24 |
| Total | 35 | 13 | 48 |

What is the difference in promotion rates between the two simulated groups in Table 5.2? How does this compare to the observed difference of 29.2% from the actual study?110

Figure 5.3 shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 versus 0.042). The quantity of interest throughout this case study has been the difference in promotion rates. We call the summary value the statistic of interest (or often the test statistic). When we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample.

Figure 5.3: We summarize the randomized data to produce one estimate of the difference in proportions given no gender discrimination.

#### Observed statistic vs. null statistics

We computed one possible difference under the null hypothesis in the Guided Practice above, which represents one difference due to chance. While in this first simulation, we physically dealt out files, it is much more efficient to perform this simulation using a computer. Repeating the simulation on a computer, we get another difference due to chance: -0.042. And another: 0.208. And so on until we repeat the simulation enough times that we have a good idea of the shape of the distribution of differences from chance alone. Figure 5.4 shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion.

Figure 5.4: A stacked dot plot of differences from 100 simulations produced under the null hypothesis, $$H_0$$, where the simulated gender and decision are independent. Two of the 100 simulations had a difference of at least 29.2%, the difference observed in the study, and are shown as solid red dots.

Note that the distribution of these simulated differences in proportions is centered around 0. Because we simulated differences in a way that made no distinction between men and women, this makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation.
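To see this centering without the figure, one can repeat the shuffle 100 times and tally the simulated differences. The sketch below is a hedged illustration in Python. The helper `simulate_diff`, the seed, and the text-based dot plot are assumptions introduced here, and the tallies will not reproduce Figure 5.4 exactly.

```python
import random
from collections import Counter


def simulate_diff(rng):
    """One sham experiment: shuffle the 48 files, return male minus female rate."""
    files = ["promoted"] * 35 + ["not promoted"] * 13
    rng.shuffle(files)
    female_promoted = files[:24].count("promoted")
    male_promoted = 35 - female_promoted  # the remaining promoted files
    return (male_promoted - female_promoted) / 24


rng = random.Random(1)  # arbitrary seed for reproducibility
diffs = [simulate_diff(rng) for _ in range(100)]

mean_diff = sum(diffs) / len(diffs)
print(f"mean of simulated differences: {mean_diff:.3f}")

# A rough text version of a stacked dot plot: one star per simulation.
tally = Counter(round(d, 3) for d in diffs)
for value in sorted(tally):
    print(f"{value:+.3f} {'*' * tally[value]}")
```

The simulated differences can only take values such as ±0.042, ±0.125, and ±0.208, because the 35 promoted files must split into whole numbers across the two stacks.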

How often would you observe a difference of at least 29.2% (0.292) according to Figure 5.4? Often, sometimes, rarely, or never?

It appears that a difference of at least 29.2% due to chance alone would only happen about 2% of the time according to Figure 5.4. Such a low probability indicates that observing such a large difference from chance alone is rare.
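The same proportion can be approximated by simulation. The sketch below assumes 10,000 shuffles rather than the 100 shown in Figure 5.4, so its estimate will be close to, but not identical to, the 2% read off the figure. The function name `count_extreme` and the seed are illustrative choices made here.

```python
import random


def count_extreme(n_sims, seed=0):
    """Count shuffles whose male-minus-female promotion gap is at least 7 files."""
    rng = random.Random(seed)
    files = [1] * 35 + [0] * 13  # 1 = promoted, 0 = not promoted
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(files)
        male_promoted = sum(files[:24])
        female_promoted = 35 - male_promoted
        # Observed study: 21 male vs 14 female promotions, a gap of 7 files,
        # which corresponds to the 29.2% difference in promotion rates.
        if male_promoted - female_promoted >= 7:
            extreme += 1
    return extreme


n_sims = 10_000
proportion = count_extreme(n_sims) / n_sims
print(f"share of shuffles at least as extreme as observed: {proportion:.3f}")
```

Comparing counts of files rather than rounded proportions avoids floating-point surprises when checking whether a simulated difference reaches the observed one.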

The difference of 29.2% is a rare event if there really is no impact from listing gender in the candidates’ files, which provides us with two possible interpretations of the study results:

• $$H_0$$: Null hypothesis. Gender has no effect on promotion decision, and we observed a difference that is so large that it would only happen rarely.

• $$H_A$$: Alternative hypothesis. Gender has an effect on promotion decision, and what we observed was actually due to equally qualified women being discriminated against in promotion decisions, which explains the large difference of 29.2%.

When we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.111 In our analysis, we determined that there was only about a 2% probability of obtaining, by chance alone, a difference in promotion rates of at least 29.2% in favor of males, so we conclude that the data provide strong evidence of gender discrimination against women by the supervisors. In this case, we reject the null hypothesis in favor of the alternative.

Statistical inference is the practice of making decisions and conclusions from data in the context of uncertainty. Errors do occur, just like rare events, and the data set at hand might lead us to the wrong conclusion. While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur. Before getting into the nuances of hypothesis testing, let’s work through another case study.

### 5.1.2 Opportunity cost case study

How rational and consistent is the behavior of the typical American college student? In this section, we’ll explore whether college student consumers always consider the following: money not spent now can be spent later.

In particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier. A skeptic might think that such a reminder would have no impact. We can summarize the two different perspectives using the null and alternative hypothesis framework.

• $$H_0$$: Null hypothesis. Reminding students that they can save money for later purchases will not have any impact on students’ spending decisions.
• $$H_A$$: Alternative hypothesis. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.

In this section, we’ll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.112

#### Observed data

One-hundred and fifty students were recruited for the study, and each was given the following statement:

> Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below.

Half of the 150 students were randomized into a control group and were given the following two options:

(A) Buy this entertaining video.

(B) Not buy this entertaining video.

The remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):

(A) Buy this entertaining video.

(B) Not buy this entertaining video. Keep the $14.99 for other purchases.

Would the extra statement reminding students of an obvious fact impact the purchasing decision? Table 5.3 summarizes the study results.

The data from this study can be found in the openintro package: opportunity_cost.

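Table 5.3: Summary results of the opportunity cost study.

| group | decision: buy video | decision: not buy video | Total |
|---|---|---|---|
| control | 56 | 19 | 75 |
| treatment | 41 | 34 | 75 |
| Total | 97 | 53 | 150 |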

It might be a little easier to review the results using a visualisation. Figure 5.5 shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.

Figure 5.5: Segmented bar plot of results of the opportunity cost study.

Another useful way to review the results from Table 5.3 is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video. These summaries are given in Table 5.4.

Table 5.4: The opportunity cost data are summarized using row proportions. Row proportions are particularly useful here since we can view the proportion of buy and not buy decisions in each group.

| group | decision: buy video | decision: not buy video | Total |
|---|---|---|---|
| control | 0.747 | 0.253 | 1 |
| treatment | 0.547 | 0.453 | 1 |

We will define a success in this study as a student who chooses not to buy the video.113 Then, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later.

We can construct a point estimate for this difference as

$\hat{p}_{treatment} - \hat{p}_{control} = \frac{34}{75} - \frac{19}{75} = 0.453 - 0.253 = 0.200$

The proportion of students who chose not to buy the video was 20 percentage points higher in the treatment group than the control group. However, is this result statistically significant? In other words, is a 20% difference between the two groups so prominent that it is unlikely to have occurred from chance alone?
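The point estimate can be reproduced directly from the counts in Table 5.3. The following Python sketch is only an illustration; the data structure and the function name `not_buy_proportion` are assumptions made here and not part of the openintro package.

```python
# Counts from Table 5.3 as (buy video, not buy video) for each group.
table = {"control": (56, 19), "treatment": (41, 34)}


def not_buy_proportion(buy, not_buy):
    """Proportion of a group choosing not to buy (the 'success' in this study)."""
    return not_buy / (buy + not_buy)


p_control = not_buy_proportion(*table["control"])
p_treatment = not_buy_proportion(*table["treatment"])
point_estimate = p_treatment - p_control

print(f"{p_treatment:.3f} - {p_control:.3f} = {point_estimate:.3f}")
```

The two proportions agree with the "not buy" row proportions in Table 5.4 (0.453 and 0.253).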

#### Variability of the statistic

The primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students. For this, we’ll use the same procedure we applied in Section 5.1.1: randomization.

Let’s think about the data in the context of the hypotheses. If the null hypothesis ($$H_0$$) was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance. If, on the other hand, the alternative hypothesis ($$H_A$$) is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.

#### Observed statistic vs. null statistics

Just like with the gender discrimination study, we can perform a statistical analysis. Using the same randomization technique from the last section, let’s see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.

While we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer. We start with 150 index cards and label each card to indicate the distribution of our response variable: decision. That is, 53 cards will be labeled “not buy video” to represent the 53 students who opted not to buy, and 97 will be labeled “buy video” for the other 97 students. Then we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups. Any observed difference between the proportions of “not buy video” cards (what we earlier defined as success) can be attributed entirely to chance.

If we are randomly assigning the cards into the simulated treatment and control groups, how many “not buy video” cards would we expect to end up with in each simulated group? What would be the expected difference between the proportions of “not buy video” cards in each group?

Since the simulated groups are of equal size, we would expect $$53 / 2 = 26.5$$, i.e., 26 or 27, “not buy video” cards in each simulated group, yielding a simulated point estimate of the difference in proportions of 0%. However, due to random chance, we might actually observe a number a little above or below 26.5 in either group.
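The index-card procedure translates directly into code. The sketch below, in Python with an arbitrary seed, shuffles the 150 cards once and compares the counts in each simulated group with the expected 26.5. All names are illustrative, and the resulting counts will generally differ from those in Table 5.5.

```python
import random

# 150 index cards labelled with the observed decisions.
cards = ["not buy video"] * 53 + ["buy video"] * 97

# Each simulated group receives 75 of the 150 cards, so on average it
# receives half of the 53 "not buy video" cards.
expected_per_group = 53 * 75 / 150

rng = random.Random(7)  # arbitrary seed for reproducibility
rng.shuffle(cards)
sim_treatment, sim_control = cards[:75], cards[75:]

not_buy_treatment = sim_treatment.count("not buy video")
not_buy_control = sim_control.count("not buy video")
sim_diff = not_buy_treatment / 75 - not_buy_control / 75

print(f"expected per group: {expected_per_group}")
print(f"simulated counts: treatment {not_buy_treatment}, control {not_buy_control}")
print(f"simulated difference: {sim_diff:.3f}")
```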

The results of a single randomization from chance alone are shown in Table 5.5.

Table 5.5: Summary of student choices against their simulated groups. The group assignment had no connection to the student decisions, so any difference between the two groups is due to chance.

| simulated group | decision: buy video | decision: not buy video | Total |
|---|---|---|---|
| control | 46 | 29 | 75 |
| treatment | 51 | 24 | 75 |
| Total | 97 | 53 | 150 |

From this table, we can compute a difference that occurred from chance alone:

$\hat{p}_{treatment, sim} - \hat{p}_{control, sim} = \frac{24}{75} - \frac{29}{75} = 0.32 - 0.387 = - 0.067$

Just one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone.

We’ll generate another set of simulated groups and compute the new difference: 0.04.

And again: 0.12.

And again: -0.013.

We’ll do this 1,000 times.

The results are summarized in a dot plot in Figure 5.6, where each point represents a simulation.

Figure 5.6: A stacked dot plot of 1,000 simulated differences produced under the null hypothesis, $$H_0$$. Six of the 1,000 simulations had a difference of at least 20%, which was the difference observed in the study.

Since there are so many points, it is more convenient to summarize the results in a histogram such as the one in Figure 5.7, where the height of each histogram bar represents the fraction of observations in that group.

Figure 5.7: A histogram of 1,000 chance differences produced under the null hypothesis, $$H_0$$. Histograms like this one are a more convenient representation of data or results when there are a large number of observations.

If there was no treatment effect, then we’d only observe a difference of at least +20% about 0.6% of the time.

That is really rare!

Instead, we will conclude the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase.

Notice that we are able to make a causal statement for this study since the study is an experiment.
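The proportion of chance differences at least as large as the observed 20% can be approximated with a short simulation. The Python sketch below assumes 10,000 shuffles instead of the 1,000 used for Figures 5.6 and 5.7, so its estimate should land near, but not exactly on, 0.6%. The function name `null_differences` and the seed are illustrative.

```python
import random


def null_differences(n_sims, seed=0):
    """Simulated treatment-minus-control differences in 'not buy' proportions."""
    rng = random.Random(seed)
    cards = [1] * 53 + [0] * 97  # 1 = "not buy video", 0 = "buy video"
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(cards)
        treatment_not_buy = sum(cards[:75])
        control_not_buy = 53 - treatment_not_buy
        diffs.append((treatment_not_buy - control_not_buy) / 75)
    return diffs


observed = (34 - 19) / 75  # 0.20, from Table 5.3
diffs = null_differences(10_000)
share_extreme = sum(d >= observed for d in diffs) / len(diffs)

print(f"observed difference: {observed:.2f}")
print(f"share of simulations at least as large: {share_extreme:.4f}")
```

The observed and simulated differences are both computed as a count gap divided by 75, which keeps the `>=` comparison exact when a shuffle reproduces the observed gap of 15 cards.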

### 5.1.3 Hypothesis testing

In the last two sections, we utilized a hypothesis test, which is a formal technique for evaluating two competing possibilities.

In each scenario, we described a null hypothesis, which represented either a skeptical perspective or a perspective of no difference.

We also laid out an alternative hypothesis, which represented a new perspective such as the possibility that there has been a change or that there is a treatment effect in an experiment.

The alternative hypothesis is usually the reason the scientists set out to do the research in the first place.

Null and alternative hypotheses.

The null hypothesis ($$H_0$$) often represents either a skeptical perspective or a claim to be tested.

The alternative hypothesis ($$H_A$$) represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.

The hypothesis testing framework is a very general tool, and we often use it without a second thought.

If a person makes a somewhat unbelievable claim, we are initially skeptical.

However, if there is sufficient evidence that supports the claim, we set aside our skepticism.

The hallmarks of hypothesis testing are also found in the US court system.

#### The US court system

A US court considers two possible claims about a defendant: they are either innocent or guilty.

If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?

The jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person’s guilt.

That is, the skeptical perspective (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis).

Jurors examine the evidence to see whether it convincingly shows a defendant is guilty.

Notice that if a jury finds a defendant not guilty, this does not necessarily mean the jury is confident in the person’s innocence.

They are simply not convinced of the alternative that the person is guilty.

This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as truth.

Failing to find strong evidence for the alternative hypothesis is not equivalent to providing evidence that the null hypothesis is true.

#### p-value and statistical significance

In Section 5.1.1 we encountered a study from the 1970s that explored whether there was strong evidence that women were less likely to be promoted than men. The research question – are females discriminated against in promotion decisions? – was framed in the context of hypotheses:

• $$H_0$$: Gender has no effect on promotion decisions.

• $$H_A$$: Women are discriminated against in promotion decisions.

The null hypothesis ($$H_0$$) was a perspective of no difference. The data on gender discrimination provided a point estimate of a 29.2% difference in recommended promotion rates between men and women. We determined that such a difference from chance alone would be rare: it would only happen about 2 in 100 times. When results like these are inconsistent with $$H_0$$, we reject $$H_0$$ in favor of $$H_A$$. Here, we concluded there was discrimination against women.

The 2-in-100 chance is what we call a p-value, which is a probability quantifying the strength of the evidence against the null hypothesis and in favor of the alternative.

p-value.

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true. We typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses. This summary value that is used to compute the p-value is often called the test statistic.

In the gender discrimination study, the difference in promotion rates was our test statistic. What was the test statistic in the opportunity cost study covered in Section 5.1.2?

The test statistic in the opportunity cost study was the difference in the proportion of students who decided against the video purchase in the treatment and control groups. In each of these examples, the point estimate of the difference in proportions was used as the test statistic.

When the p-value is small, i.e., less than a previously set threshold, we say the results are statistically significant. This means the data provide such strong evidence against $$H_0$$ that we reject the null hypothesis in favor of the alternative hypothesis. The threshold, called the significance level and often represented by $$\alpha$$ (the Greek letter alpha), is typically set to $$\alpha = 0.05$$, but can vary depending on the field or the application. Using a significance level of $$\alpha = 0.05$$ in the discrimination study, we can say that the data provided statistically significant evidence against the null hypothesis.

Statistical significance.

We say that the data provide statistically significant evidence against the null hypothesis if the p-value is less than some reference value, often $$\alpha=0.05$$.

In the opportunity cost study in Section 5.1.2, we analyzed an experiment where study participants had a 20 percentage point drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future. We determined that such a large difference would only occur about 6 in 1,000 times if the reminder actually had no influence on student decision-making. What is the p-value in this study? Was the result statistically significant?

The p-value was 0.006. Since the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.
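The decision rule is simple enough to write down as a one-line comparison. The Python sketch below applies it to the two p-values reported in this section (about 0.02 and 0.006). The function name `is_statistically_significant` is a hypothetical helper introduced for illustration.

```python
def is_statistically_significant(p_value, alpha=0.05):
    """Return True when the p-value falls below the significance level alpha."""
    return p_value < alpha


# Approximate p-values from the two case studies in this section.
studies = {"gender discrimination": 0.02, "opportunity cost": 0.006}

for name, p_value in studies.items():
    verdict = "reject H0" if is_statistically_significant(p_value) else "fail to reject H0"
    print(f"{name}: p-value = {p_value}, {verdict} at alpha = 0.05")
```

Passing a different `alpha` shows how the conclusion can change with the threshold: at `alpha=0.01` the opportunity cost result remains significant, while the gender discrimination result does not.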

We often use a threshold of 0.05 to determine whether a result is statistically significant. But why 0.05? Maybe we should use a bigger number, or maybe a smaller number. If you’re a little puzzled, that probably means you’re reading with a critical eye – good job! We’ve made a video to help clarify why 0.05:

https://www.openintro.org/book/stat/why05/

Sometimes it’s also a good idea to deviate from the standard. We’ll discuss when to choose a threshold different than 0.05 in Section 6.2.1.

### 5.1.4 Randomization test summary

Figure 5.8 provides a visual summary of the randomization testing procedure.

Figure 5.8: An example of one simulation of the full randomization procedure. We repeat the steps hundreds or thousands of times.

We can summarise this procedure as follows:

• Frame the research question in terms of hypotheses. Hypothesis tests are appropriate for research questions that can be summarized in two competing hypotheses. The null hypothesis ($$H_0$$) usually represents a skeptical perspective or a perspective of no difference. The alternative hypothesis ($$H_A$$) usually represents a new view or a difference.
• Collect data with an observational study or experiment. If a research question can be formed into two hypotheses, we can collect data to run a hypothesis test. If the research question focuses on associations between variables but does not concern causation, we would run an observational study. If the research question seeks a causal connection between two or more variables, then an experiment should be used.
• Model the randomness as if the null hypothesis was true. In the examples above, the variability has been modeled as if the treatment (e.g., gender, opportunity, blood thinner) allocation was independent of the outcome of the study. The computer generated the null distribution from many different randomizations in order to quantify the null variability.
• Analyze the data. Choose an analysis technique appropriate for the data and identify the p-value. So far, we’ve only seen one analysis technique: randomization. Throughout the rest of this textbook, we’ll encounter several new methods suitable for many other contexts.
• Form a conclusion. Using the p-value from the analysis, determine whether the data provide statistically significant evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results.

Table 5.6 is another look at this summary.

Table 5.6: Summary of randomization tests as an inferential statistical method.

| Question | Randomization test |
|---|---|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment |
| What is the random process described? | Randomized experiment |
| Is there flexibility? | Yes, can be used to describe random sampling in an observational model |
| What is it best for? | Hypothesis Testing (can be used for Confidence Intervals, but not covered in this text) |
| What physical object represents the simulation process? | Shuffling cards |
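The five steps summarized above can be sketched in R. This is a minimal illustration, not the code used to produce the figures in this chapter. The group sizes and counts below are assumed for the sake of the example (75 students per group, with a 20 percentage point difference in purchase rates), and the names `diff_in_props` and `null_diffs` are hypothetical.

```r
set.seed(1)

# Assumed, illustrative data: 75 students per group
group   <- rep(c("control", "treatment"), each = 75)
outcome <- c(rep(c("buy", "not"), times = c(56, 19)),   # control group
             rep(c("buy", "not"), times = c(41, 34)))   # treatment group

# Statistic: proportion buying in control minus proportion buying in treatment
diff_in_props <- function(outcome, group) {
  mean(outcome[group == "control"] == "buy") -
    mean(outcome[group == "treatment"] == "buy")
}

observed <- diff_in_props(outcome, group)

# Model the null hypothesis: shuffle the group labels so that group
# membership is independent of the outcome, then recompute the statistic
null_diffs <- replicate(10000, diff_in_props(outcome, sample(group)))

# p-value: share of shuffled differences at least as large as the observed one
# (the small tolerance guards against floating-point ties)
p_value <- mean(null_diffs >= observed - 1e-12)
p_value
```

With these assumed counts the simulated p-value should come out well below 0.05, though the exact value changes with the seed and the number of shuffles.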

### 5.1.5 Exercises

Exercises for this section are under construction.

## 5.2 Bootstrap confidence intervals

Randomization is a statistical technique suitable for evaluating whether a difference in sample proportions is due to chance.

Randomization tests are best suited for modeling experiments where the treatment (explanatory variable) has been randomly assigned to the observational units and there is an attempt to answer a simple yes/no question.

For example, consider the following research questions that can be well assessed with a randomization test:

• Does this vaccine make it less likely that a person will get malaria?
• Does drinking caffeine affect how quickly a person can tap their finger?
• Can we predict which candidate will win the upcoming election?

In this section, however, we are instead interested in estimating the unknown value of a population parameter.

For example,

• How much less likely am I to get malaria if I get the vaccine?
• How much faster (or slower) can a person tap their finger, on average, if they drink caffeine first?
• What proportion of the vote will go to candidate A?

Here, we explore the situation where focus is on a single proportion, and we introduce a new simulation method, bootstrapping.

Bootstrapping is best suited for modeling studies where the data have been generated through random sampling from a population.

As with randomization tests, our goal with bootstrapping is to understand variability of a statistic.

Unlike randomization tests (which modeled how the statistic would change if the treatment had been allocated differently), the bootstrap will model how a statistic changes from repeated sampling. How a statistic varies from sample to sample will provide information about how different the statistic is from the parameter of interest.

Quantifying the variability of a statistic from sample to sample is a hard problem.

Fortunately, sometimes the mathematical theory for how a statistic varies (across different samples) is well-known; this is the case for the sample proportion as seen in Section 5.3.

However, some statistics don’t have simple theory for how they vary, and bootstrapping provides a computational approach for providing interval estimates for almost any population parameter (we will revisit bootstrapping in Chapters 6, 7, and 8 so you’ll get plenty of practice as well as exposure to bootstrapping in many different data settings).

Our goal with bootstrapping will be to produce an interval estimate (a range of plausible values) for the population parameter.

If we could, we would measure the variability of the statistic by taking a new sample from the population and computing its sample proportion.

Then we could do it again.

And again.

And so on until we have a good sense of the variability of our original estimate.

When the sampling variability is large, we would assume that the original statistic is possibly far from the true population parameter of interest (and the interval estimate will be wide).

When the variability across the samples is small, we expect the sample statistic to be close to the true parameter of interest (and the interval estimate will be narrow).

In practice, sampling is almost never free or even cheap, so taking repeated samples from a population is usually impossible.

So, instead of using a “resample from the population” approach, bootstrapping uses a “resample from the sample” approach.

The sections below provide examples and details about the bootstrapping process.

### 5.2.1 Medical consultant case study

People providing an organ for donation sometimes seek the help of a special medical consultant. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients.

#### Observed data

One consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

We will let $$p$$ represent the true complication rate for liver donors working with this consultant. (The “true” complication rate will be referred to as the parameter.) We estimate $$p$$ using the data, and label this estimate $$\hat{p}$$.

The sample proportion for the complication rate is 3 complications divided by the 62 surgeries the consultant has worked on: $$\hat{p} = 3/62 = 0.048$$.

Is it possible to assess the consultant’s claim using the data?

No.

The claim is that there is a causal connection, but the data are observational, so we must be on the lookout for confounding variables.

For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.

While it is not possible to assess the causal claim, it is still possible to understand the consultant’s true rate of complications.

Parameter.

A parameter is the “true” value of interest.

We typically estimate the parameter using a point estimate from a sample of data.

For example, we estimate the probability $$p$$ of a complication for a client of the medical consultant by examining the past complication rates of her clients:

$\hat{p} = 3 / 62 = 0.048~\text{is used to estimate}~p$

#### Variability of the statistic

In the medical consultant case study, the parameter is $$p$$, the true probability of a complication for a client of the medical consultant.

There is no reason to believe that $$p$$ is exactly $$\hat{p} = 3/62$$, but there is also no reason to believe that $$p$$ is particularly far from $$\hat{p} = 3/62$$.

By sampling with replacement from the dataset (a process called bootstrapping), the variability of the possible $$\hat{p}$$ values can be approximated.

Most of the inferential procedures covered in this text are grounded in quantifying how one data set would differ from another when they are both taken from the same population.

In practice, it doesn’t make sense to take repeated samples from the same population: if you have the means to collect more data, one larger sample will benefit you more than a second sample of the same size.

Instead, we measure how the samples behave under an estimate of the population.

Figure 5.9 shows how the unknown original population can be estimated by using the sample to approximate the proportion of successes and failures (in our case, the proportion of complications and no complications for the medical consultant).

Figure 5.9: The unknown population is estimated using the observed sample data. Note that we can use the sample to create an estimated or bootstrapped population from which to sample. The observed data include three red and four white marbles, so the estimated population also has red marbles in a proportion of 3/7.

By taking repeated samples from the estimated population, the variability from sample to sample can be observed.

In Figure 5.10 the repeated bootstrap samples are obviously different both from each other and from the original population.

Recall that the bootstrap samples were taken from the same (estimated) population, and so the differences are due entirely to natural variability in the sampling procedure.

Figure 5.10: Bootstrap sampling provides a measure of the variability from sample to sample. Note that we are taking samples from the estimated population that was created from the sample.

By summarizing each of the bootstrap samples (here, using the sample proportion), we see, directly, the variability of the sample proportion, $$\hat{p}$$, from sample to sample.

The distribution of $$\hat{p}_{BS}$$ for the example scenario is shown in Figure 5.11, and the bootstrap distribution for the medical consultant data is shown in Figure 5.14.

Figure 5.11: The bootstrapped proportion is estimated for each bootstrap sample. The resulting bootstrap distribution (histogram) provides a measure for how the proportions vary from sample to sample.

It turns out that in practice, it is very difficult for computers to work with an infinite population (with the same proportional breakdown as in the sample).

However, there is a physical and computational model which produces an equivalent bootstrap distribution of the sample proportion in a computationally efficient manner.

Consider the observed data to be a bag of marbles, 3 of which are successes (red) and 4 of which are failures (white).

By drawing the marbles out of the bag with replacement, we depict the exact same sampling process as was done with the infinitely large estimated population.

Note in Figure 5.12 that when sampling the original observations, a particular data point may end up in the new sample one time (evidenced by a circle around the observation), two times (evidenced by two circles around the observation), or not at all (no circles around the observation).

Figure 5.12: Taking repeated resamples from the sample data is the same process as creating an infinitely large estimate of the population. It is computationally more feasible to take resamples directly from the sample. Note that the resampling is now done with replacement (that is, the original sample does not ever change) so that the original sample and estimated hypothetical population are equivalent.

Figure 5.13: A comparison of the process of sampling from the estimated infinite population and resampling with replacement from the original sample. Note that the histogram of bootstrapped proportions is the same because the process by which the statistics were estimated is equivalent.

If we apply the bootstrap sampling process to the medical consultant example, we consider each client to be one of the marbles in the bag.

There will be 59 white marbles (no complication) and 3 red marbles (complication).

If we choose 62 marbles out of the bag (one at a time with replacement) and compute the proportion of simulated patients with complications, $$\hat{p}_{BS}$$, then this “bootstrap” proportion represents a single simulated proportion from the “resample from the sample” approach.

In a simulation of 62 patients, about how many would we expect to have had a complication?114

One simulation isn’t enough to get a sense of the variability from one bootstrap proportion to another bootstrap proportion, so we repeated the simulation 10,000 times using a computer.

Figure 5.14 shows the distribution from the 10,000 bootstrap simulations.

The middle 95% of the bootstrapped proportions range from about 0 to 11.3%.

The variability in the bootstrapped proportions leads us to believe that the true probability of complication (the parameter, $$p$$) is somewhere between 0 and 11.3%.

The range of values for the true proportion is called a bootstrap percentile confidence interval, and we will see it again throughout the next few sections and chapters.

Figure 5.14: The original medical consultant data is bootstrapped 10,000 times. Each simulation creates a sample from the original data where the probability of a complication is $$\hat{p} = 3/62$$. The bootstrap 2.5 percentile proportion is 0 and the 97.5 percentile is 0.113. The result is: we are confident that, in the population, the true probability of a complication is between 0% and 11.3%.
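The simulation behind Figure 5.14 can be approximated in a few lines of R. This is a sketch under stated assumptions, not the authors' code: the vector `complication` encodes the 62 clients as 1 (complication) or 0 (no complication), and the seed and object names are arbitrary choices.

```r
set.seed(62)

# 3 complications (1) and 59 surgeries without complications (0)
complication <- rep(c(1, 0), times = c(3, 59))
p_hat <- mean(complication)   # observed proportion, 3/62

# One bootstrap proportion: draw 62 clients with replacement, then summarize
one_boot <- mean(sample(complication, size = 62, replace = TRUE))

# Repeat 10,000 times to build the bootstrap distribution
boot_props <- replicate(
  10000,
  mean(sample(complication, size = 62, replace = TRUE))
)

# The 2.5th and 97.5th percentiles give the 95% percentile interval
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

The endpoints should land close to the 0 and 0.113 reported in the caption, with small differences possible from run to run.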

The original claim was that the consultant’s true rate of complication was under the national rate of 10%. Does the interval estimate of 0 to 11.3% for the true probability of complication indicate that the medical consultant has a lower rate of complications than the national average? Explain.

No. Because the interval overlaps 10%, it might be that the consultant’s work is associated with a lower risk of complications, or it might be that the consultant’s work is associated with a higher risk (i.e., greater than 10%) of complications! Additionally, as previously mentioned, because this is an observational study, even if an association can be measured, there is no evidence that the consultant’s work is the cause of the complication rate (being higher or lower).

### 5.2.2 Tappers and listeners case study

Here’s a game you can try with your friends or family: pick a simple, well-known song, tap that tune on your desk, and see if the other person can guess the song. In this simple game, you are the tapper, and the other person is the listener.

#### Observed data

A Stanford University graduate student named Elizabeth Newton conducted an experiment using the tapper-listener game.115 She recruited 120 tappers and 120 listeners into the study. About 50% of the tappers expected that the listener would be able to guess the song. Newton wondered, is 50% a reasonable expectation?

In Newton’s study, only 3 out of 120 listeners ($$\hat{p} = 0.025$$) were able to guess the tune! That seems like quite a low number which leads the researcher to ask: what is the true proportion of people who can guess the tune?

#### Variability of the statistic

To answer the question, we will again use a simulation.

To simulate 120 games, this time we use a bag of 120 marbles: 3 are red (for those who guessed correctly) and 117 are white (for those who could not guess the song).

Sampling from the bag 120 times (while not actually removing the marbles from the bag) produces one bootstrap sample.

For example, we can start by simulating 5 tapper-listener pairs by sampling 5 marbles from the bag of 3 red and 117 white marbles.

W W W R W
Wrong Wrong Wrong Correct Wrong

After selecting 120 marbles, we counted 2 red for $$\hat{p}_{BS1} = 0.0167$$. As we did with the randomization technique, seeing what would happen with one simulation isn’t enough. In order to evaluate whether our originally observed proportion of 0.025 is unusual or not, we should generate more simulations. Here we’ve repeated the entire simulation ten times:

$0.0417 \quad 0.0250 \quad 0.0250 \quad 0.0083 \quad 0.0500 \quad 0.0333 \quad 0.0250 \quad 0.000 \quad 0.0083 \quad 0.000$
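The marble-drawing steps above can be mimicked in R. The sketch below is illustrative only: the object names (`bag`, `p_hat_bs1`, `ten_props`) and the seed are arbitrary, and the simulated values will not match the ten proportions listed above because they depend on the random draws.

```r
set.seed(120)

# The bag: 3 red marbles (correct guess) and 117 white marbles (wrong guess)
bag <- rep(c("R", "W"), times = c(3, 117))

# Five draws with replacement, as in the small example above
five_draws <- sample(bag, size = 5, replace = TRUE)
five_draws

# One full bootstrap sample: 120 draws with replacement,
# summarized by the proportion of red marbles
one_resample <- sample(bag, size = length(bag), replace = TRUE)
p_hat_bs1 <- mean(one_resample == "R")

# Repeat the entire simulation ten times
ten_props <- replicate(10, mean(sample(bag, replace = TRUE) == "R"))
round(ten_props, 4)
```

Every bootstrapped proportion is a multiple of 1/120, which is why values such as 0.0083 and 0.0250 recur in the list above.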

As before, we’ll run a total of 10,000 simulations using a computer. As seen in Figure 5.15, the range of 95% of the resampled $$\hat{p}_{BS}$$ is 0.000 to 0.0583. That is, we expect that between 0% and 5.83% of people are truly able to guess the tapper’s tune.

Figure 5.15: The original listener-tapper data is bootstrapped 10,000 times. Each simulation creates a sample where the probability of being correct is $$\hat{p} = 3/120$$. The 2.5 percentile proportion is 0 and the 97.5 percentile is 0.0583. The result is that we are confident that, in the population, the true percent of people who can guess correctly is between 0% and 5.83%.

Do the data provide “statistically significant” evidence against the claim that 50% of listeners can guess the tapper’s tune?116

### 5.2.3 Confidence intervals

A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely perfect; usually there is some error in the estimate. In addition to supplying a point estimate of a parameter, a next logical step would be to provide a plausible range of values for the parameter.

#### Population parameter

A plausible range of values for the population parameter is called a confidence interval. Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net. We can throw a spear where we saw a fish, but we will probably miss. On the other hand, if we toss a net in that area, we have a good chance of catching the fish.

If we report a point estimate, we probably will not hit the exact population parameter. On the other hand, if we report a range of plausible values – a confidence interval – we have a good shot at capturing the parameter.

If we want to be very certain we capture the population parameter, should we use a wider interval or a narrower interval?117

#### Bootstrap confidence interval

As we saw above, a bootstrap sample is a sample of the original sample. In the case of the medical complications data, we proceed as follows:

• Randomly sample one observation from the 62 patients (draw one marble and return it to the bag, so that all 62 marbles are available for every draw).
• Randomly sample a second observation from the 62 patients. Because we sample with replacement (i.e., we don’t actually remove the marbles from the bag), there is a 1-in-62 chance that the second observation will be the same one sampled in the first step!
• …
• Randomly sample a 62nd observation from the 62 patients.

Bootstrap sampling is often called sampling with replacement.

A bootstrap sample behaves similarly to how an actual sample would behave, and we compute the point estimate of interest (here, compute $$\hat{p}_{BS}$$).

Due to theory that is beyond this text, we know that the bootstrap proportions $$\hat{p}_{BS}$$ vary around $$\hat{p}$$ in a similar way to how sample proportions from different samples (i.e., different datasets, which would produce different $$\hat{p}$$ values) vary around the true parameter $$p$$.

Therefore, an interval estimate for $$p$$ can be produced using the $$\hat{p}_{BS}$$ values themselves.

95% Bootstrap percentile confidence interval for a parameter $$p$$.

The 95% bootstrap confidence interval for the parameter $$p$$ can be obtained directly using the ordered $$\hat{p}_{BS}$$ values.

Consider the sorted $$\hat{p}_{BS}$$ values, and let $$\hat{p}_{BS, 0.025}$$ be the 2.5% value and $$\hat{p}_{BS, 0.975}$$ be the 97.5% value.

The 95% confidence interval is given by:

($$\hat{p}_{BS, 0.025}$$, $$\hat{p}_{BS, 0.975}$$)

In Section 6.1.1 we will discuss different percentages for the confidence interval (e.g., 90% confidence interval or 99% confidence interval).

Section 6.1.1 also provides a longer discussion on what “95% confidence” actually means.
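The percentile recipe in the box above can be wrapped in a small helper function. This is a hedged sketch rather than a definitive implementation: `boot_percentile_ci()` is a hypothetical name, its `level` and `reps` arguments are assumptions made for illustration, and it is applied here to the tapper-listener data (3 correct guesses out of 120).

```r
# Hypothetical helper: percentile bootstrap interval for a proportion.
# x is a vector of 0s and 1s, level is the confidence level,
# and reps is the number of bootstrap samples.
boot_percentile_ci <- function(x, level = 0.95, reps = 10000) {
  # Each replicate resamples the data with replacement and
  # records the bootstrapped proportion
  boot_stats <- replicate(reps, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  # For level = 0.95 these are the 2.5% and 97.5% values
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(5)
listeners <- rep(c(1, 0), times = c(3, 117))   # 1 = guessed the tune
ci_95 <- boot_percentile_ci(listeners)
ci_95
```

The result should be close to the interval of 0 to 0.0583 shown in Figure 5.15. Changing `level` to 0.90 or 0.99 picks different percentiles of the same bootstrap distribution.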

### 5.2.4 Bootstrap summary

Figure 5.16: We will use sampling with replacement to measure the variability of the statistic of interest (here the proportion). Sampling with replacement is a computational tool which is equivalent to using the sample as a way of estimating an infinitely large population from which to sample.

Table 5.7: Summary of bootstrapping as an inferential statistical method.

| Question | Bootstrapping |
|---|---|
| What does it do? | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data. |
| What is the random process described? | Random sampling |
| Is there flexibility? | Yes, can be used to describe random allocation in an experiment |
| What is it best for? | Confidence Intervals (HT for one proportion covered in Chapter 6) |
| What physical object represents the simulation process? | Pulling balls from a bag |

### 5.2.5 Exercises

Exercises for this section are under construction.

## 5.3 Mathematical models

### 5.3.1 Central Limit Theorem

We’ve encountered four case studies so far in this chapter. While they differ in the settings, in their outcomes, and also in the technique we’ve used to analyze the data, they all have something in common: the general shape of the null distribution.

#### Null distribution

Figure 5.17 shows the null distributions in each of the four case studies where we ran 10,000 simulations. In the case of the opportunity cost study, which originally had just 1,000 simulations, we’ve included an additional 9,000 simulations.

Figure 5.17: The null distribution for each of the four case studies presented in the previous sections.

Describe the shape of the distributions and note anything that you find interesting.118

The case study for the medical consultant is the only distribution with any evident skew.

As we observed in Chapter 1, it’s common for distributions to be skewed or contain outliers.

However, the null distributions we’ve so far encountered have all looked somewhat similar and, for the most part, symmetric.

They all resemble a bell-shaped curve.

This is not a coincidence, but rather, is guaranteed by mathematical theory.

Central Limit Theorem for proportions.

If we look at a proportion (or difference in proportions) and the scenario satisfies certain conditions, then the sample proportion (or difference in proportions) will appear to follow a bell-shaped curve called the normal distribution.

An example of a perfect normal distribution is shown in Figure 5.18.

Imagine laying a normal curve over each of the four null distributions in Figure 5.17.

While the mean (center) and standard deviation (width or spread) may change for each plot, the general shape remains roughly intact.

Figure 5.18: A normal curve.

Mathematical theory guarantees that if repeated samples are taken, a sample proportion or a difference in sample proportions will follow something that resembles a normal distribution when certain conditions are met. (Note: we typically only take one sample, but the mathematical model lets us know what to expect if we had taken repeated samples.) These conditions fall into two categories:

• Observations in the sample are independent. Independence is guaranteed when we take a random sample from a population. It can also be guaranteed if we randomly divide individuals into treatment and control groups.
• The sample is large enough. The sample size cannot be too small. What qualifies as “small” differs from one context to the next, and we’ll provide suitable guidelines for proportions in Chapter 6.
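A small simulation can make the bell shape concrete. The sketch below is illustrative only: the population proportion `p = 0.3` and sample size `n = 1000` are made-up values, and `rbinom()` stands in for drawing repeated independent random samples from a population. It compares percentiles of the simulated sample proportions with those of a normal curve that has the same mean and standard deviation, using base R's `qnorm()` to get the normal cutoffs.

```r
set.seed(2024)

n <- 1000   # assumed sample size
p <- 0.3    # assumed population proportion

# 10,000 sample proportions, each from an independent sample of size n
p_hats <- rbinom(10000, size = n, prob = p) / n

# Percentiles of the simulated proportions
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
simulated <- quantile(p_hats, probs = probs)

# Matching percentiles of a normal curve with the same center and spread
normal <- qnorm(probs, mean = mean(p_hats), sd = sd(p_hats))

round(cbind(simulated, normal), 3)
```

When both conditions hold, the two columns should agree closely. Rerunning the sketch with a very small `n` or a `p` near 0 or 1 is one way to see the agreement break down, as it does for the skewed medical consultant distribution.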

So far we’ve had no need for the normal distribution. We’ve been able to answer our questions somewhat easily using simulation techniques. However, soon this will change. Simulating data can be non-trivial. For example, some of the scenarios encountered in Chapters 3 and 4 would require complex simulations in order to make inferential conclusions. Instead, the normal distribution and other distributions like it offer a general framework for statistical inference that applies to a very large number of settings.

#### Anticipating frequent use of the normal distribution

Below we introduce three new settings where the normal distribution will be useful, and constructing suitable simulations can be difficult.

The opportunity cost study determined that students are thriftier if they are reminded that saving money now means they can spend the money later. The study’s point estimate for the estimated impact was 20%, meaning 20% fewer students would move forward with a video purchase in the study scenario. However, as we’ve learned, point estimates aren’t perfect – they only provide an approximation of the truth.

It would be useful if we could provide a range of plausible values for the impact, more formally known as a confidence interval. In many situations it is difficult to construct a reliable confidence interval using simulations.119 However, doing so is reasonably straightforward using the normal distribution.

Book prices were collected for 73 courses at UCLA in Spring 2010. Data were collected from both the UCLA Bookstore and Amazon. The differences in these prices are shown in Figure 5.19. The mean difference in the price of the books was \$12.76, and we might wonder, does this provide strong evidence that the prices differ between the two book sellers?

Here again we can apply the normal distribution, this time in the context of numerical data. We’ll explore this example and construct such a hypothesis test in Section 7.3.

Figure 5.19: Histogram of the difference in price for each book sampled. These data are strongly skewed.

Elmhurst College in Illinois released anonymized data for family income and financial support provided by the school for Elmhurst’s first-year students in 2011. Figure 5.20 shows a regression line fit to a scatterplot of a sample of the data. One question we will ask is, do the data show a real linear trend, or is the trend we observe reasonably explained by random chance?

In Chapter 3 we learned how to apply least squares regression to quantify the trend. In Chapter 8 we will focus on whether or not that trend can be explained by chance alone. For this case study, we could again use the normal distribution to help us answer this question.

Figure 5.20: Gift aid and family income for a random sample of 50 first-year students from Elmhurst College, shown with a regression line.

These examples highlight the value of the normal distribution approach. However, before we can apply the normal distribution to statistical inference, it is necessary to become familiar with the mechanics of the normal distribution. In Section 5.3.2 we discuss characteristics of the normal distribution, explore examples of data that follow a normal distribution, and learn a new plotting technique that is useful for evaluating whether a data set roughly follows the normal distribution. In Section 5.3.3, we apply the new knowledge in the context of hypothesis tests and confidence intervals.

### 5.3.2 Normal Distribution

Among all the distributions we see in statistics, one is overwhelmingly the most common. The symmetric, unimodal, bell curve is ubiquitous throughout statistics. It is so common that people know it by a variety of names including the normal curve, normal model, or normal distribution.120 Under certain conditions, sample proportions, sample means, and sample differences can be modeled using the normal distribution. Additionally, some variables such as SAT scores and heights of US adult males closely follow the normal distribution.

Normal distribution facts.

Many summary statistics and variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems. We will use it in data exploration and to solve important problems in statistics.

In this section, we will discuss the normal distribution in the context of data to become familiar with normal distribution techniques. In the following sections and beyond, we’ll move our discussion to focus on applying the normal distribution and other related distributions to model point estimates for hypothesis tests and for constructing confidence intervals.

#### Normal distribution model

The normal distribution always describes a symmetric, unimodal, bell-shaped curve. However, normal curves can look different depending on the details of the model. Specifically, the normal model can be adjusted using two parameters: mean and standard deviation. As you can probably guess, changing the mean shifts the bell curve to the left or right, while changing the standard deviation stretches or constricts the curve. Figure 5.21 shows the normal distribution with mean $$0$$ and standard deviation $$1$$ (which is commonly referred to as the standard normal distribution) on top. A normal distribution with mean $$19$$ and standard deviation $$4$$ is shown on the bottom. Figure 5.22 shows the same two normal distributions on the same axis.

Figure 5.21: Both curves represent the normal distribution; however, they differ in their center and spread. The normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution.

Figure 5.22: The two normal models shown in Figure 5.21, but plotted together on the same scale.

If a normal distribution has mean $$\mu$$ and standard deviation $$\sigma$$, we may write the distribution as $$N(\mu, \sigma)$$. The two distributions in Figure 5.22 can be written as

$N(\mu = 0, \sigma = 1)\quad\text{and}\quad N(\mu = 19, \sigma = 4)$

Because the mean and standard deviation describe a normal distribution exactly, they are called the distribution’s parameters.

Write down the short-hand for a normal distribution with (a) mean 5 and standard deviation 3, (b) mean -100 and standard deviation 10, and (c) mean 2 and standard deviation 9.121

#### Standardizing with Z scores

Table 5.8 shows the mean and standard deviation for total scores on the SAT and ACT. The distributions of SAT and ACT scores are both nearly normal. Suppose Ann scored 1800 on her SAT and Tom scored 24 on his ACT. Who performed better?122

Table 5.8: Mean and standard deviation for the SAT and ACT.

| | SAT | ACT |
|---|---|---|
| Mean | 1500 | 21 |
| SD | 300 | 5 |

Figure 5.23: Ann’s and Tom’s scores shown with the distributions of SAT and ACT scores.

The solution to the previous example relies on a standardization technique called a Z score, a method most commonly employed for nearly normal observations (but that may be used with any distribution). The Z score of an observation is defined as the number of standard deviations it falls above or below the mean. If the observation is one standard deviation above the mean, its Z score is 1. If it is 1.5 standard deviations below the mean, then its Z score is -1.5. If $$x$$ is an observation from a distribution $$N(\mu, \sigma)$$, we define the Z score mathematically as

$Z = \frac{x-\mu}{\sigma}$

Using $$\mu_{SAT}=1500$$, $$\sigma_{SAT}=300$$, and $$x_{Ann}=1800$$, we find Ann’s Z score:

$Z_{Ann} = \frac{x_{Ann} - \mu_{SAT}}{\sigma_{SAT}} = \frac{1800-1500}{300} = 1$

The Z score.

The Z score of an observation is the number of standard deviations it falls above or below the mean. We compute the Z score for an observation $$x$$ that follows a distribution with mean $$\mu$$ and standard deviation $$\sigma$$ using

$Z = \frac{x-\mu}{\sigma}$
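The Z score formula translates directly into R. In this minimal sketch, `z_score()` is a hypothetical helper name, Ann's score and the SAT mean and standard deviation come from Table 5.8, and the score of 1050 is a made-up value chosen to fall 1.5 standard deviations below the mean.

```r
# Hypothetical helper implementing Z = (x - mu) / sigma
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

# Ann's SAT score: one standard deviation above the mean
z_ann <- z_score(1800, mu = 1500, sigma = 300)
z_ann   # 1

# A made-up SAT score of 1050: 1.5 standard deviations below the mean
z_low <- z_score(1050, mu = 1500, sigma = 300)
z_low   # -1.5
```

Because the arithmetic is vectorized, `z_score()` also accepts a whole vector of observations at once.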

Use Tom’s ACT score, 24, along with the ACT mean and standard deviation to compute his Z score.123

Observations above the mean always have positive Z scores while those below the mean have negative Z scores. If an observation is equal to the mean (e.g., SAT score of 1500), then the Z score is $$0$$.

Let $$X$$ represent a random variable from $$N(\mu=3, \sigma=2)$$, and suppose we observe $$x=5.19$$. (a) Find the Z score of $$x$$. (b) Use the Z score to determine how many standard deviations above or below the mean $$x$$ falls.124

Head lengths of brushtail possums follow a nearly normal distribution with mean 92.6 mm and standard deviation 3.6 mm. Compute the Z scores for possums with head lengths of 95.4 mm and 85.8 mm.125

We can use Z scores to roughly identify which observations are more unusual than others. One observation $$x_1$$ is said to be more unusual than another observation $$x_2$$ if the absolute value of its Z score is larger than the absolute value of the other observation’s Z score: $$|Z_1| > |Z_2|$$. This technique is especially insightful when a distribution is symmetric.

Which of the two brushtail possum observations in the previous guided practice is more unusual?126

#### Normal probability calculations

Ann from the SAT Guided Practice earned a score of 1800 on her SAT with a corresponding $$Z=1$$. She would like to know what percentile she falls in among all SAT test-takers.

Ann’s percentile is the percentage of people who earned a lower SAT score than Ann. We shade the area representing those individuals in Figure 5.24. The total area under the normal curve is always equal to 1, and the proportion of people who scored below Ann on the SAT is equal to the area shaded in Figure 5.24: 0.8413. In other words, Ann is in the $$84^{th}$$ percentile of SAT takers.

Figure 5.24: The normal model for SAT scores, shading the area of those individuals who scored below Ann.

We can use the normal model to find percentiles or probabilities. A normal probability table, which lists Z scores and corresponding percentiles, can be used to identify a percentile based on the Z score (and vice versa). Statistical software can also be used.

Normal probabilities are most commonly found using statistical software which we will show here using R. We use the software to identify the percentile corresponding to any particular Z score. For instance, the percentile of $$Z=0.43$$ is 0.6664, or the $$66.64^{th}$$ percentile. The pnorm() function is available in default R and will provide the percentile associated with any cutoff on a normal curve. The normTail() function is available in the OpenIntro R package and will draw the associated curve if it is helpful.

pnorm(0.43, mean = 0, sd = 1)
#>  0.666
openintro::normTail(0.43, m = 0, s = 1)

We can also find the Z score associated with a percentile. For example, to identify Z for the $$80^{th}$$ percentile, we use qnorm() which identifies the quantile for a given percentage. The quantile represents the cutoff value. (To remember the function qnorm() as providing a cutoff, notice that both qnorm() and “cutoff” start with the sound “kuh.” To remember the pnorm() function as providing a probability from a given cutoff, notice that both pnorm() and probability start with the sound “puh.”) We determine the Z score for the $$80^{th}$$ percentile using qnorm(): 0.84.

qnorm(0.80, mean = 0, sd = 1)
#>  0.842
openintro::normTail(0.84162, m = 0, s = 1)

Determine the proportion of SAT test takers who scored better than Ann on the SAT.127
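The pnorm() and qnorm() functions are inverses of one another, and both accept a mean and standard deviation so that a cutoff can be given on its original scale. The sketch below illustrates both points; the variable names are illustrative only.

```r
# pnorm() maps a cutoff to a lower-tail area; qnorm() maps the area back
area <- pnorm(0.43, mean = 0, sd = 1)
z_back <- qnorm(area, mean = 0, sd = 1)   # recovers 0.43

# Working on the SAT scale gives the same area as working with Ann's Z score
p_sat_scale <- pnorm(1800, mean = 1500, sd = 300)
p_z_scale <- pnorm(1, mean = 0, sd = 1)

round(c(z_back = z_back, p_sat_scale = p_sat_scale, p_z_scale = p_z_scale), 4)
```

Supplying `mean` and `sd` directly saves the standardization step, although computing the Z score by hand remains useful for judging how unusual an observation is.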

#### Normal probability examples

Cumulative SAT scores are approximated well by a normal model, $$N(\mu=1500, \sigma=300)$$.

Shannon is a randomly selected SAT taker, and nothing is known about Shannon’s SAT aptitude. What is the probability that Shannon scores at least 1630 on her SATs?

First, always draw and label a picture of the normal distribution. (Drawings need not be exact to be useful.) We are interested in the chance she scores above 1630, so we shade the upper tail. See the normal curve below.

The picture shows the mean and the values at 2 standard deviations above and below the mean. The simplest way to find the shaded area under the curve makes use of the Z score of the cutoff value. With $$\mu=1500$$, $$\sigma=300$$, and the cutoff value $$x=1630$$, the Z score is computed as

$Z = \frac{x - \mu}{\sigma} = \frac{1630 - 1500}{300} = \frac{130}{300} = 0.43$

We use software to find the percentile of $$Z=0.43$$, which yields 0.6664. However, the percentile describes those who had a Z score lower than 0.43. To find the area above $$Z=0.43$$, we compute one minus the area of the lower tail, as seen below.

The probability Shannon scores at least 1630 on the SAT is 0.3336.

Always draw a picture first, and find the Z score second.

For any normal probability situation, always always always draw and label the normal curve and shade the area of interest first. The picture will provide an estimate of the probability.

After drawing a figure to represent the situation, identify the Z score for the observation of interest.
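Shannon’s upper-tail probability from the example above can be checked in R as a rough sketch. Rounding the Z score to 0.43 first, as the worked example does, reproduces 0.3336; passing the score of 1630 directly avoids the rounding and gives a slightly smaller value, about 0.332. Variable names are illustrative.

```r
mu <- 1500
sigma <- 300
x <- 1630

# Following the text: round Z to two decimals, then take one minus the lower tail
z_rounded <- round((x - mu) / sigma, 2)
p_rounded <- 1 - pnorm(z_rounded)

# Without rounding: pass the cutoff on its original scale
p_unrounded <- 1 - pnorm(x, mean = mu, sd = sigma)

round(c(p_rounded = p_rounded, p_unrounded = p_unrounded), 4)
```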

If the probability of Shannon scoring at least 1630 is 0.3336, then what is the probability she scores less than 1630? Draw the normal curve representing this exercise, shading the lower region instead of the upper one.128

Edward earned a 1400 on his SAT. What is his percentile?

First, a picture is needed. Edward’s percentile is the proportion of people who do not get as high as a 1400. These are the scores to the left of 1400. Identifying the mean $$\mu=1500$$, the standard deviation $$\sigma=300$$, and the cutoff for the tail area $$x=1400$$ makes it easy to compute the Z score:

$Z = \frac{x - \mu}{\sigma} = \frac{1400 - 1500}{300} = -0.33$

Using the normal probability table, identify the row of $$-0.3$$ and column of $$0.03$$, which corresponds to the probability $$0.3707$$. Edward is at the $$37^{th}$$ percentile.

Use the results of the previous example to compute the proportion of SAT takers who did better than Edward. Also draw a new picture.

If Edward did better than 37% of SAT takers, then about 63% must have done better than him.

Areas to the right.

The normal probability table in most books gives the area to the left. If you would like the area to the right, first find the area to the left and then subtract this amount from one.
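In R, the subtraction can be skipped by asking pnorm() for the upper tail through its `lower.tail` argument. A small sketch using Edward’s score follows. Because it does not round the Z score to -0.33, the areas differ slightly from the table values (roughly 0.369 and 0.631 rather than 0.3707 and 0.6293).

```r
# Area to the left of Edward's score (his percentile)
p_left <- pnorm(1400, mean = 1500, sd = 300)

# Area to the right, computed two equivalent ways
p_right_subtract <- 1 - p_left
p_right_direct <- pnorm(1400, mean = 1500, sd = 300, lower.tail = FALSE)

round(c(left = p_left, right = p_right_direct), 3)
```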

Stuart earned an SAT score of 2100. Draw a picture for each part. (a) What is his percentile? (b) What percent of SAT takers did better than Stuart?129

Based on a sample of 100 men,130 the heights of male adults between the ages 20 and 62 in the US are nearly normal with mean 70.0’’ and standard deviation 3.3’’.

Mike is 5’7’’ and Jim is 6’4’’. (a) What is Mike’s height percentile? (b) What is Jim’s height percentile? Also draw one picture for each part.^[First put the heights into inches: 67 and 76 inches. Figures are shown below. (a) $$Z_{Mike} = \frac{67 - 70}{3.3} = -0.91\ \to\ 0.1814$$. (b) $$Z_{Jim} = \frac{76 - 70}{3.3} = 1.82\ \to\ 0.9656$$.  ]

The last several problems have focused on finding the probability or percentile for a particular observation. What if you would like to know the observation corresponding to a particular percentile?

Erik’s height is at the $$40^{th}$$ percentile. How tall is he?

As always, first draw the picture. In this case, the lower tail probability is known (0.40), which can be shaded on the diagram. We want to find the observation that corresponds to this value. As a first step in this direction, we determine the Z score associated with the $$40^{th}$$ percentile.

Because the percentile is below 50%, we know $$Z$$ will be negative. Looking in the negative part of the normal probability table, we search for the probability inside the table closest to 0.4000. We find that 0.4000 falls in row $$-0.2$$ and between columns $$0.05$$ and $$0.06$$. Since it falls closer to $$0.05$$, we take this one: $$Z=-0.25$$.

Knowing $$Z_{Erik}=-0.25$$ and the population parameters $$\mu=70$$ and $$\sigma=3.3$$ inches, the Z score formula can be set up to determine Erik’s unknown height, labeled $$x_{Erik}$$:

$-0.25 = Z_{Erik} = \frac{x_{Erik} - \mu}{\sigma} = \frac{x_{Erik} - 70}{3.3}$

Solving for $$x_{Erik}$$ yields the height 69.18 inches. That is, Erik is about 5’9’’ (this is notation for 5-feet, 9-inches).

qnorm(0.4, mean = 0, sd = 1)
#>  -0.253
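Rearranging the Z score formula gives $$x = \mu + Z\sigma$$, so Erik’s height can also be computed directly. The sketch below compares the table-based calculation (Z rounded to -0.25) with a call to qnorm() on the height scale; the two answers agree to within a few hundredths of an inch. Variable names are illustrative.

```r
mu <- 70
sigma <- 3.3

# Table approach from the text: Z rounded to -0.25, then solve for x
x_table <- mu + (-0.25) * sigma

# Software approach: qnorm() returns the cutoff on the height scale directly
x_software <- qnorm(0.40, mean = mu, sd = sigma)

c(x_table = x_table, x_software = x_software)
```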

What is the adult male height at the $$82^{nd}$$ percentile?

Again, we draw the figure first.

qnorm(0.82, mean = 0, sd = 1)
#>  0.915

Next, we want to find the Z score at the $$82^{nd}$$ percentile, which will be a positive value. Using qnorm(), the $$82^{nd}$$ percentile corresponds to $$Z=0.92$$. Finally, the height $$x$$ is found using the Z score formula with the known mean $$\mu$$, standard deviation $$\sigma$$, and Z score $$Z=0.92$$:

$0.92 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3}$

This yields 73.04 inches or about 6’1’’ as the height at the $$82^{nd}$$ percentile.

1. What is the $$95^{th}$$ percentile for SAT scores?
2. What is the $$97.5^{th}$$ percentile of the male heights? As always with normal probability problems, first draw a picture.131

1. What is the probability that a randomly selected male adult is at least 6’2’’ (74 inches)?
2. What is the probability that a male adult is shorter than 5’9’’ (69 inches)?132

What is the probability that a random adult male is between 5’9’’ and 6’2’’?

These heights correspond to 69 inches and 74 inches. First, draw the figure. The area of interest is no longer an upper or lower tail. The total area under the curve is 1. If we find the area of the two tails that are not shaded (from the previous Guided Practice, these areas are $$0.3821$$ and $$0.1131$$), then we can find the middle area:

$1.0000 - 0.3821 - 0.1131 = 0.5048$

That is, the probability of being between 5’9’’ and 6’2’’ is 0.5048.
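The middle area can also be computed in R as a difference of two lower-tail areas. The sketch below does so without rounding the Z scores, so it returns about 0.506 rather than the table-based 0.5048. Variable names are illustrative.

```r
mu <- 70
sigma <- 3.3

# Area below 74 inches minus area below 69 inches
p_between <- pnorm(74, mean = mu, sd = sigma) - pnorm(69, mean = mu, sd = sigma)

# Equivalent: one minus the two unshaded tails
p_lower_tail <- pnorm(69, mean = mu, sd = sigma)
p_upper_tail <- pnorm(74, mean = mu, sd = sigma, lower.tail = FALSE)
p_complement <- 1 - p_lower_tail - p_upper_tail

round(c(p_between = p_between, p_complement = p_complement), 4)
```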

What percent of SAT takers get between 1500 and 2000?133

What percent of adult males are between 5’5’’ and 5’7’’?134

#### 68-95-99.7 rule

Here, we present a useful general rule for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution. The rule will be useful in a wide range of practical settings, especially when trying to make a quick estimate without a calculator or Z table.

Figure 5.25: Probabilities for falling within 1, 2, and 3 standard deviations of the mean in a normal distribution.

Use pnorm() (or a Z table) to confirm that about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean in the normal distribution, respectively. For instance, first find the area that falls between $$Z=-1$$ and $$Z=1$$, which should have an area of about 0.68. Similarly there should be an area of about 0.95 between $$Z=-2$$ and $$Z=2$$.135

It is possible for a normal random variable to fall 4, 5, or even more standard deviations from the mean. However, these occurrences are very rare if the data are nearly normal. The probability of falling more than 4 standard deviations above the mean is about 1-in-30,000, and by symmetry the same holds for falling that far below it. For 5 and 6 standard deviations, the corresponding one-sided probabilities are about 1-in-3.5 million and 1-in-1 billion, respectively.
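Both the rule and the rare-event figures can be checked with pnorm(). In the sketch below, the one-sided tail areas beyond 4, 5, and 6 standard deviations come out close to the 1-in-30,000, 1-in-3.5 million, and 1-in-1 billion figures quoted above; doubling them gives the chance of falling that far from the mean in either direction. Variable names are illustrative.

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)

# Tail areas far from the mean
k_far <- 4:6
one_side <- pnorm(k_far, lower.tail = FALSE)   # beyond k SDs on one side
both_sides <- 2 * one_side                     # beyond k SDs on either side

# Express the one-sided areas as "1 in N"
round(1 / one_side)
```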

SAT scores closely follow the normal model with mean $$\mu = 1500$$ and standard deviation $$\sigma = 300$$.
(a) About what percent of test takers score 900 to 2100?
(b) What percent score between 1500 and 2100?136

### 5.3.3 Hypothesis testing case studies

The approach for using the normal model in the context of inference is very similar to the practice of applying the model to individual observations that are nearly normal. We will replace null distributions we previously obtained using the randomization or simulation techniques and verify the results once again using the normal model. When the sample size is sufficiently large, the normal approximation generally provides us with the same conclusions as the simulation model.

#### Standard error

Point estimates vary from sample to sample, and we quantify this variability with what is called the standard error (SE). The standard error is equal to the standard deviation associated with the estimate. So, for example, if we used the standard deviation to quantify the variability of a point estimate from one sample to the next, this standard deviation would be called the standard error of the point estimate.

The way we determine the standard error varies from one situation to the next. However, typically it is determined using a formula based on the Central Limit Theorem.

#### Opportunity cost

##### Observed data

In Section 5.1.2 we were introduced to the opportunity cost study, which found that students became thriftier when they were reminded that not spending money now means the money can be spent on other things in the future. Let’s re-analyze the data in the context of the normal distribution and compare the results.

##### Variability of the statistic

Figure 5.26 summarizes the null distribution as determined using the randomization method. The best fitting normal distribution for the null distribution has a mean of 0. We can calculate the standard error of this distribution by borrowing a formula that we will become familiar with in Section 6.2, but for now let’s just take the value $$SE = 0.078$$ as a given. Recall that the point estimate of the difference was 0.20, as shown in the plot. Next, we’ll use the normal distribution approach to compute the p-value from the right tail.

Figure 5.26: Null distribution of differences with an overlaid normal curve for the opportunity cost study. 10,000 simulations were run for this figure.

##### Observed statistic vs. null statistics

As we learned in Section 5.3.2, it is helpful to draw and shade a picture of the normal distribution so we know precisely what we want to calculate. Here we want to find the area of the tail beyond 0.2, representing the p-value. Next, we can calculate the Z score using the observed difference, 0.20, and the two model parameters. The standard error, $$SE = 0.078$$, is the equivalent of the model’s standard deviation.

$Z = \frac{\text{observed difference} - 0}{SE} = \frac{0.20 - 0}{0.078} = 2.56$

We can either use statistical software or look up $$Z = 2.56$$ in the normal probability table to determine the right tail area: 0.0052, which is about the same as what we got for the right tail using the randomization approach (0.006). Using this area as the p-value, we see that the p-value is less than 0.05, so we conclude that the treatment did indeed impact students’ spending.

Z score in a hypothesis test.

In the context of a hypothesis test, the Z score for a point estimate is

$Z = \frac{\text{point estimate} - \text{null value}}{SE}$

The standard error in this case is the equivalent of the standard deviation of the point estimate, and the null value comes from the null hypothesis.
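The Z score calculation for the opportunity cost study can be sketched in R as follows. The helper name `z_test()` is an illustrative assumption, and the standard error of 0.078 is taken as given, as in the text.

```r
# Hypothetical helper (name is an assumption): Z score for a point estimate
z_test <- function(point_estimate, null_value, se) {
  (point_estimate - null_value) / se
}

z_opp <- z_test(point_estimate = 0.20, null_value = 0, se = 0.078)

# Right tail area beyond the observed Z score, used as the p-value
p_right <- pnorm(z_opp, lower.tail = FALSE)

round(c(z = z_opp, p_value = p_right), 4)
```

The same helper applies to any point estimate whose null distribution is well approximated by a normal model; only the null value and standard error change.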

We have confirmed that the randomization approach we used earlier and the normal distribution approach provide almost identical p-values and conclusions in the opportunity cost case study. Next, let’s turn our attention to the medical consultant case study.

#### Medical consultant

##### Observed data

In Section 5.2.1 we learned about a medical consultant who reported that only 3 of her 62 clients who underwent a liver transplant had complications, which is less than the more common complication rate of 0.10. In that work, we did not model a null scenario, but we will discuss a simulation method for a one proportion null distribution in Section 6.1.1; such a distribution is provided in Figure 5.27. We have added the best-fitting normal curve to the figure, which has a mean of 0.10. Borrowing a formula that we’ll encounter in Chapter 6, the standard error of this distribution was also computed: $$SE = 0.038$$.

##### Variability of the statistic

Before we begin, we want to point out a simple detail that is easy to overlook: the null distribution we generated from the simulation is slightly skewed, and the histogram is not particularly smooth. In fact, the normal distribution only sort-of fits this model.

Figure 5.27: The null distribution for the sample proportion, created from 10,000 simulated studies, along with the best-fitting normal model.

##### Observed statistic vs. null statistics

As always, we’ll draw a picture before finding the normal probabilities. Below is a normal distribution centered at 0.10 with a standard error of 0.038. Next, we can calculate the Z score using the observed complication rate, $$\hat{p} = 0.048$$ along with the mean and standard deviation of the normal model. Here again, we use the standard error for the standard deviation.

$Z = \frac{\hat{p} - p_0}{SE_{\hat{p}}} = \frac{0.048 - 0.10}{0.038} = -1.37$

Identifying $$Z = -1.37$$ using statistical software or in the normal probability table, we can determine that the left tail area is 0.0853, which is the estimated p-value for the hypothesis test. There is a small problem: the p-value of 0.0853 is slightly different from the simulation p-value of 0.1222, which will be calculated in Section 6.1.1.

The discrepancy is explained by the normal model’s poor representation of the null distribution in Figure 5.27. As noted earlier, the null distribution from the simulations is not very smooth, and the distribution itself is slightly skewed. That’s the bad news. The good news is that we can foresee these problems using some simple checks. We’ll learn about these checks in the following chapters.
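The gap between the two p-values can be reproduced with a short simulation. The sketch below assumes the null distribution is generated by drawing binomial counts with $$n = 62$$ and $$p = 0.10$$; the simulation method presented in Section 6.1.1 may differ in its details, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible

n <- 62
p0 <- 0.10

# Simulate 10,000 complication counts under the null hypothesis p = 0.10
sim_counts <- rbinom(10000, size = n, prob = p0)

# Simulation p-value: share of simulated studies with 3 or fewer complications
p_sim <- mean(sim_counts <= 3)

# Normal approximation using the rounded estimate and SE quoted in the text
p_norm <- pnorm(0.048, mean = p0, sd = 0.038)

round(c(simulation = p_sim, normal = p_norm), 3)
```

The simulated tail area lands near 0.12, while the normal approximation gives a value near 0.086, mirroring the mismatch described above.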

In Section 5.3.1 we noted that the two common requirements to apply the Central Limit Theorem are (1) the observations in the sample must be independent, and (2) the sample must be sufficiently large. The guidelines for this particular situation – which we will learn in Section 6.1 – would have alerted us that the normal model was a poor approximation.

#### Conditions for applying the normal model

The success story in this section was the application of the normal model in the context of the opportunity cost data. However, the biggest lesson comes from our failed attempt to use the normal approximation in the medical consultant case study.

Statistical techniques are like a carpenter’s tools. When used responsibly, they can produce amazing and precise results. However, if the tools are applied irresponsibly or under inappropriate conditions, they will produce unreliable results. For this reason, with every statistical method that we introduce in future chapters, we will carefully outline conditions when the method can reasonably be used. These conditions should be checked in each application of the technique.

### 5.3.4 Confidence interval case study

A point estimate is our best guess for the value of the parameter, so it makes sense to build the confidence interval around that value. The standard error, which is a measure of the uncertainty associated with the point estimate, provides a guide for how large we should make the confidence interval. The 68-95-99.7 rule tells us that, in general, 95% of observations are within 2 standard errors of the mean. Here, we use the value 1.96 to be slightly more precise.

Constructing a 95% confidence interval.

When the sampling distribution of a point estimate can reasonably be modeled as normal, the point estimate we observe will be within 1.96 standard errors of the true value of interest about 95% of the time. Thus, a 95% confidence interval for such a point estimate can be constructed:

$\text{point estimate} \pm 1.96 \times SE$

We can be 95% confident this interval captures the true value.

Compute the area between -1.96 and 1.96 for a normal distribution with mean 0 and standard deviation 1.137

The point estimate from the opportunity cost study was that 20% fewer students would buy a video if they were reminded that money not spent now could be spent later on something else. The point estimate from this study can reasonably be modeled with a normal distribution, and a proper standard error for this point estimate is $$SE = 0.078$$. Construct a 95% confidence interval.138

Since the conditions for the normal approximation have already been verified, we can move forward with the construction of the 95% confidence interval:

$\text{point estimate} \pm 1.96 \times SE = 0.20 \pm 1.96 \times 0.078 = (0.047, 0.353)$

We are 95% confident that the video purchase rate resulting from the treatment is between 4.7% and 35.3% lower than in the control group. Since this confidence interval does not contain 0, it is consistent with our earlier result where we rejected the notion of “no difference” using a hypothesis test.
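The interval arithmetic can be sketched in R with a small helper. The name `normal_ci()` is an illustrative assumption; the multiplier comes from qnorm(), which returns approximately 1.96 for a 95% confidence level.

```r
# Hypothetical helper (name is an assumption): normal-based confidence interval
normal_ci <- function(point_estimate, se, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 when conf_level = 0.95
  c(lower = point_estimate - z_star * se,
    upper = point_estimate + z_star * se)
}

# Opportunity cost study: point estimate 0.20 with SE = 0.078
ci_opp <- normal_ci(point_estimate = 0.20, se = 0.078)
round(ci_opp, 3)
```

Changing `conf_level` changes the multiplier, so the same helper would produce wider or narrower intervals for other confidence levels.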

#### Stents

##### Observed data

Consider an experiment that examined whether implanting a stent in the brain of a patient at risk for a stroke helps reduce the risk of a stroke. The results from the first 30 days of this study, which included 451 patients, are summarized in Table 5.9. These results are surprising! The point estimate suggests that patients who received stents may have a higher risk of stroke: $$p_{trmt} - p_{ctrl} = 0.090$$.

Table 5.9: Descriptive statistics for 30-day results for the stent study.

| | stroke | no event | Total |
|---|---|---|---|
| treatment | 33 | 191 | 224 |
| control | 13 | 214 | 227 |
| Total | 46 | 405 | 451 |

##### Variability of the statistic

Consider the stent study and results. The conditions necessary to ensure the point estimate $$p_{trmt} - p_{ctrl} = 0.090$$ is nearly normal have been verified for you, and the estimate’s standard error is $$SE = 0.028$$. Construct a 95% confidence interval for the change in 30-day stroke rates from usage of the stent.

The conditions for applying the normal model have already been verified, so we can proceed to the construction of the confidence interval:

$\text{point estimate} \pm 1.96 \times SE = 0.090 \pm 1.96 \times 0.028 = (0.035, 0.145)$

We are 95% confident that implanting a stent in a stroke patient’s brain increased the risk of stroke within 30 days by a rate of 0.035 to 0.145. This confidence interval can also be used in a way analogous to a hypothesis test: since the interval does not contain 0 (is completely above 0), it means the data provide statistically significant evidence that the stent used in the study increases the risk of stroke within 30 days.
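As a rough sketch, the point estimate can be recomputed from the counts in Table 5.9 and combined with the standard error given in the text. Variable names are illustrative, and the standard error of 0.028 is taken as given rather than derived.

```r
# Counts from Table 5.9
stroke <- c(treatment = 33, control = 13)
total <- c(treatment = 224, control = 227)

# Sample proportions and their difference (the point estimate)
p_hat <- stroke / total
diff_hat <- unname(p_hat["treatment"] - p_hat["control"])

# Standard error taken as given in the text
se <- 0.028
ci_stent <- diff_hat + c(-1, 1) * 1.96 * se

round(c(estimate = diff_hat, lower = ci_stent[1], upper = ci_stent[2]), 3)
```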

As with hypothesis tests, confidence intervals are imperfect. About 1-in-20 properly constructed 95% confidence intervals will fail to capture the parameter of interest, simply due to natural variability in the observed data. Figure 5.28 shows 25 confidence intervals for a proportion that were constructed from simulations where the true proportion was $$p = 0.3$$. However, 1 of these 25 confidence intervals happened not to include the true value. The interval which does not capture $$p=0.3$$ is not due to bad science. Instead, it is due to natural variability, and we should expect some of our intervals to miss the parameter of interest. Indeed, over a lifetime of creating 95% intervals, you should expect 5% of your reported intervals to miss the parameter of interest (unfortunately, you will not ever know which of your reported intervals captured the parameter and which missed the parameter).

Figure 5.28: Twenty-five samples of size $$n=300$$ were simulated when $$p = 0.30$$. For each sample, a confidence interval was created to try to capture the true proportion $$p$$. However, 1 of these 25 intervals did not capture $$p = 0.30$$.

In Figure 5.28, one interval does not contain the true proportion, $$p = 0.3$$. Does this imply that there was a problem with the simulations run?139
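The long-run behavior of 95% intervals can be explored with a simulation like the one behind Figure 5.28. The sketch below uses 10,000 intervals rather than 25 so that the capture rate is stable. It assumes the usual standard error for a sample proportion, $$\sqrt{\hat{p}(1-\hat{p})/n}$$, which this section does not state, and the seed is arbitrary.

```r
set.seed(2)   # arbitrary seed so the sketch is reproducible

p_true <- 0.30
n <- 300
n_intervals <- 10000

# Simulate sample proportions and build one interval per sample
p_hat <- rbinom(n_intervals, size = n, prob = p_true) / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # assumed SE formula, not from this section
lower <- p_hat - 1.96 * se_hat
upper <- p_hat + 1.96 * se_hat

# Proportion of intervals that capture the true value
coverage <- mean(lower <= p_true & p_true <= upper)
coverage
```

The capture rate should land close to 0.95, which means roughly 1-in-20 intervals miss the true proportion even though nothing went wrong in the simulation.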

#### Interpreting confidence intervals

A careful eye might have observed the somewhat awkward language used to describe confidence intervals.

Correct interpretation:

We are XX% confident that the population parameter is between…

Incorrect language might try to describe the confidence interval as capturing the population parameter with a certain probability. This is one of the most common errors: while it might be useful to think of it as a probability, the confidence level only quantifies how plausible it is that the parameter is in the interval.

Another especially important consideration of confidence intervals is that they only try to capture the population parameter. Our intervals say nothing about the confidence of capturing individual observations, a proportion of the observations, or about capturing point estimates. Confidence intervals provide an interval estimate for and attempt to capture population parameters.

### 5.3.5 Mathematical model summary

Math flow chart

Table 5.10: Summary of Mathematical Models as inferential statistical methods.

| | Mathematical Model |
|---|---|
| What does it do? | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples |
| What is the random process described? | Either / both |
| Is there flexibility? | Yes |
| What is it best for? | Quick analyses through, for example, calculating a Z score |
| What physical object represents the simulation process? | Not applicable |

### 5.3.6 Exercises

Exercises for this section are under construction.

## 5.4 Chapter review

### 5.4.1 Summary

In this chapter, we have provided three different methods for statistical inference. We will continue to build on the methods throughout the text, and by the end, you should have an understanding of their similarities and differences. Meanwhile, it is important to note that the methods are designed to mimic variability with data, and we know that variability can come from different sources (e.g., random sampling vs. random allocation). In Table 5.11, we have summarized some of the ways the inferential procedures feature specific sources of variability. We hope that you refer back to the table often as you dive more deeply into the ideas in future chapters.

Table 5.11: Summary and comparison of Randomization Tests, Bootstrapping, and Mathematical Models as inferential statistical methods.

| | Randomization Test | Bootstrapping | Mathematical Model |
|---|---|---|---|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples |
| What is the random process described? | Randomized experiment | Random sampling | Either / both |
| Is there flexibility? | Yes, can be used to describe random sampling in an observational model | Yes, can be used to describe random allocation in an experiment | Yes |
| What is it best for? | Hypothesis Testing (can be used for Confidence Intervals, but not covered in this text) | Confidence Intervals (HT for one proportion covered in Chapter 6) | Quick analyses through, for example, calculating a Z score |
| What physical object represents the simulation process? | Shuffling cards | Pulling balls from a bag | Not applicable |

Figure 5.29: Analysis conclusions should be made carefully according to how the data were collected. Note that very few datasets come from the top left box because usually ethics require that randomly allocated treatments can only be given to volunteers.

### 5.4.2 Terms

We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However, you should be able to easily spot them as bolded text.

| | | | |
|---|---|---|---|
| 95% confidence interval | hypothesis test | parameter | standard normal distribution |
| 95% confident | independent | percentile | statistic |
| alternative hypothesis | normal curve | permutation test | statistically significant |
| bootstrap percentile confidence interval | normal distribution | point estimate | success |
| bootstrap sample | normal model | randomization test | test statistic |
| bootstrapping | normal probability table | sampling with replacement | Z score |
| Central Limit Theorem | null hypothesis | simulation | |
| confidence interval | p-value | standard error | |

### 5.4.3 Chapter exercises

Exercises for this section are under construction.

### 5.4.4 Interactive R tutorials

Navigate the concepts you’ve learned in this chapter in R using the following self-paced tutorials. All you need is your browser to get started!

You can also access the full list of tutorials supporting this book here.

### 5.4.5 R labs

Further apply the concepts you’ve learned in this chapter in R with computational labs that walk you through a data analysis case study.