5 Introduction to statistical inference
Statistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates. While the equations and details change depending on the setting, the foundations for inference are the same throughout all of statistics.
We start with two case studies designed to motivate the process of making decisions about research claims. We formalize the process through the introduction of the hypothesis testing framework, which allows us to formally evaluate claims about the population.
Finally, we expand on the familiar idea of using a sample proportion to estimate a population proportion. That is, we create what is called a confidence interval, which is a range of plausible values where we may find the true population value.
Throughout the book so far, you have worked with data in a variety of contexts. You have learned how to summarize and visualize the data as well as how to model multiple variables at the same time. Sometimes the dataset at hand represents the entire research question. But more often than not, the data have been collected to answer a research question about a larger group of which the data are a (hopefully) representative subset.
You may agree that there is almost always variability in data – one dataset will not be identical to a second dataset even if they are both collected from the same population using the same methods. However, quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the question “how different is one dataset from another?” is not trivial.
Suppose your professor splits the students in your class into two groups: students who sit on the left side of the classroom and students who sit on the right side of the classroom. If \(\hat{p}_{L}\) represents the proportion of students who sit on the left side of the classroom and own an Apple product and \(\hat{p}_{R}\) represents the proportion of students who sit on the right side of the classroom and own an Apple product, would you be surprised if \(\hat{p}_{L}\) did not exactly equal \(\hat{p}_{R}\)?
While the proportions \(\hat{p}_{L}\) and \(\hat{p}_{R}\) would probably be close to each other, it would be unusual for them to be exactly the same. We would probably observe a small difference due to chance.
If we don’t think the side of the room a person sits on in class is related to whether the person owns an Apple product, what assumption are we making about the relationship between these two variables? (Reminder: for these Guided Practice questions, you can check your answer in the footnote.)^{105}
Studying randomness of this form is a key focus of statistics. Throughout this chapter, and those that follow, we provide three different approaches for quantifying the variability inherent in data: randomization, bootstrapping, and mathematical models. Using the methods provided in this chapter, we will be able to draw conclusions beyond the dataset at hand to research questions about larger populations that the samples come from.
5.1 Randomization tests
The first type of variability we will explore comes from experiments where the explanatory variable (or treatment) is randomly assigned to the observational units. As you learned in Chapter 1, a randomized experiment can be used to assess whether or not one variable (the explanatory variable) causes changes in a second variable (the response variable). Every dataset has some variability in it, so to decide whether the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized experiment as a comparison. That is, we assume that each observational unit would have gotten the exact same response value regardless of the treatment level. By reassigning the treatments many, many times, we can compare the actual experiment to the sham experiments. If the actual experiment has more extreme results than any of the sham experiments, we are led to believe that it is the explanatory variable which is causing the result and not just variability inherent to the data. Using a few different case studies, let’s look more carefully at this idea of a randomization test.
5.1.1 Gender discrimination case study
We consider a study investigating gender discrimination in the 1970s, which is set in the context of personnel decisions within a bank.^{106} The research question we hope to answer is, “Are females discriminated against in promotion decisions made by male managers?”
The data from this study can be found in the openintro package: gender_discrimination.
Observed data
The participants in this study were 48 male bank supervisors attending a management institute at the University of North Carolina in 1972.^{107} They were asked to assume the role of the personnel director of a bank and were given a personnel file to judge whether the person should be promoted to a branch manager position. The files given to the participants were identical, except that half of them indicated the candidate was male and the other half indicated the candidate was female. These files were randomly assigned to the subjects.
Is this an observational study or an experiment? How does the type of study impact what can be inferred from the results?^{108}
For each supervisor both the gender associated with the assigned file and the promotion decision were recorded. Using the results of the study summarized in Table 5.1, we would like to evaluate if females are unfairly discriminated against in promotion decisions. In this study, a smaller proportion of females are promoted than males (0.583 versus 0.875), but it is unclear whether the difference provides convincing evidence that females are unfairly discriminated against.
gender | promoted | not promoted | Total |
---|---|---|---|
male | 21 | 3 | 24 |
female | 14 | 10 | 24 |
Total | 35 | 13 | 48 |
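The promotion rates quoted above can be recomputed directly from the counts in Table 5.1. The short Python sketch below is only an illustration: the dictionary layout and the names `counts`, `promotion_rate`, and `observed_diff` are assumptions made for this example and are not part of the study or of the openintro package.

```python
# Counts copied from Table 5.1; the dictionary layout is an assumption
# made for this illustration.
counts = {
    "male": {"promoted": 21, "not promoted": 3},
    "female": {"promoted": 14, "not promoted": 10},
}


def promotion_rate(group):
    """Proportion of files in `group` that were recommended for promotion."""
    row = counts[group]
    return row["promoted"] / (row["promoted"] + row["not promoted"])


p_male = promotion_rate("male")
p_female = promotion_rate("female")
observed_diff = p_male - p_female

print(f"male: {p_male:.3f}, female: {p_female:.3f}, difference: {observed_diff:.3f}")
# prints: male: 0.875, female: 0.583, difference: 0.292
```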
The data are visualized in Figure 5.1 as a set of cards. Note that each card denotes a personnel file (an observation from our dataset) and the colors indicate the decision: red for promoted and white for not promoted. Additionally, the observations are broken up into the male and female groups.
Statisticians are sometimes called upon to evaluate the strength of evidence. When looking at the rates of promotion for males and females in this study, why might we be tempted to immediately conclude that females are being discriminated against?
The large difference in promotion rates (58.3% for females versus 87.5% for males) suggests there might be discrimination against women in promotion decisions. However, we cannot yet be sure if the observed difference represents discrimination or is just due to random chance. Since we wouldn’t expect the sample proportions to be exactly equal, even if the truth were that the promotion decisions were independent of gender, we can’t rule out random chance as a possible explanation when simply comparing the sample proportions.
The previous example is a reminder that the observed outcomes in the sample may not perfectly reflect the true relationships between variables in the underlying population. Table 5.1 shows there were 7 fewer promotions in the female group than in the male group, a difference in promotion rates of 29.2% \(\left( \frac{21}{24} - \frac{14}{24} = 0.292 \right).\) This observed difference is what we call a point estimate of the true difference. The point estimate of the difference is large, but the sample size for the study is small, making it unclear if this observed difference represents discrimination or whether it is simply due to chance. Chance can be thought of as the claim due to natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate. We label these two competing claims, \(H_0\) and \(H_A:\)
- \(H_0:\) Null hypothesis. The variables gender and decision are independent. They have no relationship, and the observed difference between the proportion of males and females who were promoted, 29.2%, was due to random chance.
- \(H_A:\) Alternative hypothesis. The variables gender and decision are not independent. The difference in promotion rates of 29.2% was not due to random chance, and equally qualified females are less likely to be promoted than males.
Hypothesis testing
These hypotheses are part of what is called a hypothesis test. A hypothesis test is a statistical technique used to evaluate competing claims using data. Often, the null hypothesis takes a stance of no difference or no effect.
If the null hypothesis and the data notably disagree, then we will reject the null hypothesis in favor of the alternative hypothesis.
There are many nuances to hypothesis testing, so don’t worry if you aren’t a master of hypothesis testing at the end of this section. We’ll discuss these ideas and details many times in this chapter as well as in the chapters that follow.
What would it mean if the null hypothesis, which says the variables gender and decision are unrelated, was true?
It would mean each banker would decide whether to promote the candidate without regard to the gender indicated on the file.
That is, the difference in the promotion percentages would be due to the way the files were randomly allocated to the bankers, and the randomization just happened to give rise to a relatively large difference of 29.2%.
Consider the alternative hypothesis: bankers were influenced by which gender was listed on the personnel file. If this was true, and especially if this influence was substantial, we would expect to see some difference in the promotion rates of male and female candidates. If this gender bias was against females, we would expect a smaller fraction of promotion recommendations for female personnel files relative to the male files.
We will choose between the two competing claims by assessing if the data conflict so much with \(H_0\) that the null hypothesis cannot be deemed reasonable. If data and the null claim seem to be at odds with one another, and the data seem to support \(H_A,\) then we will reject the notion of independence and conclude that the data provide strong evidence of discrimination.
Variability of the statistic
Table 5.1 shows that 35 bank supervisors recommended promotion and 13 did not.
Now, suppose the bankers’ decisions were independent of gender.
Then, if we conducted the experiment again with a different random assignment of gender to the files, differences in promotion rates would be based only on random fluctuation.
We can actually perform this randomization, which simulates what would have happened if the bankers’ decisions had been independent of gender but we had distributed the file genders differently.^{109}
In the simulation, we thoroughly shuffle 48 personnel files, 35 labelled promoted and 13 labelled not promoted, and we deal files into two stacks.
Note that by keeping 35 files labelled promoted and 13 labelled not promoted, we are assuming that 35 of the bank supervisors would have recommended promotion regardless of the gender listed on the file.
We will deal 24 files into the first stack, which will represent the 24 “female” files.
The second stack will also have 24 files, and it will represent the 24 “male” files.
Figure 5.2 highlights both the shuffle and the reallocation to the sham gender groups.
Then, as we did with the original data, we tabulate the results and determine the fraction of personnel files designated as “male” and “female” who were promoted.
Since the randomization of files in this simulation is independent of the promotion decisions, any difference in the two promotion rates is entirely due to chance. Table 5.2 shows the results of one such simulation.
gender | promoted | not promoted | Total |
---|---|---|---|
male | 18 | 6 | 24 |
female | 17 | 7 | 24 |
Total | 35 | 13 | 48 |
What is the difference in promotion rates between the two simulated groups in Table 5.2? How does this compare to the observed difference of 29.2% from the actual study?^{110}
Figure 5.3 shows that the difference in promotion rates is much larger in the original data than it is in the simulated groups (0.292 versus 0.042). The quantity of interest throughout this case study has been the difference in promotion rates. We call the summary value the statistic of interest (or often the test statistic). When we encounter different data structures, the statistic is likely to change (e.g., we might calculate an average instead of a proportion), but we will always want to understand how the statistic varies from sample to sample.
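A single shuffle of the kind described above can be mimicked in a few lines of Python. This is a minimal sketch, not the procedure used to produce Table 5.2: the function name `shuffle_once`, the string labels, and the seed are all assumptions chosen for illustration, so the counts it prints will generally differ from those in the table.

```python
import random


def shuffle_once(rng):
    """Deal 48 shuffled files into sham 'female' and 'male' stacks of 24 each."""
    files = ["promoted"] * 35 + ["not promoted"] * 13
    rng.shuffle(files)
    female, male = files[:24], files[24:]  # first stack "female", second "male"
    return {
        "male": (male.count("promoted"), male.count("not promoted")),
        "female": (female.count("promoted"), female.count("not promoted")),
    }


rng = random.Random(1)  # arbitrary seed, used only to make the run repeatable
table = shuffle_once(rng)
diff = table["male"][0] / 24 - table["female"][0] / 24

print(table)
print(f"simulated difference in promotion rates: {diff:.3f}")
```

Because the labels are shuffled without regard to the sham gender, any difference printed here reflects chance alone.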
Observed statistic vs. null statistics
We computed one possible difference under the null hypothesis in Guided Practice, which represents one difference due to chance. While in this first simulation, we physically dealt out files, it is much more efficient to perform this simulation using a computer. Repeating the simulation on a computer, we get another difference due to chance: -0.042. And another: 0.208. And so on until we repeat the simulation enough times that we have a good idea of the shape of the distribution of differences from chance alone. Figure 5.4 shows a plot of the differences found from 100 simulations, where each dot represents a simulated difference between the proportions of male and female files recommended for promotion.
Note that the distribution of these simulated differences in proportions is centered around 0. Because we simulated differences in a way that made no distinction between men and women, this makes sense: we should expect differences from chance alone to fall around zero with some random fluctuation for each simulation.
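To see why the simulated differences cluster around zero, the shuffle can be repeated many times on a computer. The sketch below repeats it 100 times and prints a crude text version of a dot plot. The 0/1 encoding, the names `simulated_difference` and `null_diffs`, and the seed are assumptions for illustration; the simulated values will not match Figure 5.4 exactly.

```python
import random
from collections import Counter


def simulated_difference(rng):
    """One shuffle: male promotion rate minus female promotion rate."""
    files = [1] * 35 + [0] * 13  # 1 = promoted, 0 = not promoted
    rng.shuffle(files)
    female, male = files[:24], files[24:]
    return sum(male) / 24 - sum(female) / 24


rng = random.Random(2024)  # arbitrary seed for repeatability
null_diffs = [simulated_difference(rng) for _ in range(100)]

mean_diff = sum(null_diffs) / len(null_diffs)
print(f"mean of simulated differences: {mean_diff:.3f}")

# Text dot plot: one asterisk per simulation at each distinct difference.
for value, n in sorted(Counter(round(d, 3) for d in null_diffs).items()):
    print(f"{value:+.3f} {'*' * n}")
```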
How often would you observe a difference of at least 29.2% (0.292) according to Figure 5.4? Often, sometimes, rarely, or never?
It appears that a difference of at least 29.2% due to chance alone would only happen about 2% of the time according to Figure 5.4. Such a low probability indicates that observing such a large difference from chance alone is rare.
The difference of 29.2% is a rare event if there really is no impact from listing gender in the candidates’ files, which provides us with two possible interpretations of the study results:
If \(H_0,\) the Null hypothesis is true: Gender has no effect on promotion decision, and we observed a difference that is so large that it would only happen rarely.
If \(H_A,\) the Alternative hypothesis is true: Gender has an effect on promotion decision, and what we observed was actually due to equally qualified women being discriminated against in promotion decisions, which explains the large difference of 29.2%.
When we conduct formal studies, we reject a null position (the idea that the data are a result of chance only) if the data strongly conflict with that null position.^{111} In our analysis, we determined that there was only a \(\approx\) 2% probability of obtaining a sample where \(\geq\) 29.2% more males than females get promoted by chance alone, so we conclude that the data provide strong evidence of gender discrimination against women by the supervisors. In this case, we reject the null hypothesis in favor of the alternative.
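The probability of about 2% above was read off a simulation. As a cross-check, and not as the method used in this chapter, the same tail probability can be computed exactly by counting the ways the 35 promoted files could be dealt into the 24 "male" slots (a hypergeometric calculation). The constant and function names below are assumptions made for this sketch.

```python
from math import comb

TOTAL_FILES, PROMOTED, MALE_FILES = 48, 35, 24


def prob_male_promotions(k):
    """P(exactly k of the 24 'male' files are promoted) when files are dealt at random."""
    ways = comb(PROMOTED, k) * comb(TOTAL_FILES - PROMOTED, MALE_FILES - k)
    return ways / comb(TOTAL_FILES, MALE_FILES)


# A difference of at least 7/24 (about 0.292) means k - (35 - k) >= 7, i.e. k >= 21.
exact_p = sum(prob_male_promotions(k) for k in range(21, MALE_FILES + 1))
print(f"exact probability of a difference of at least 0.292: {exact_p:.4f}")
```

The exact value lands close to the simulated figure, which is what we would expect if the simulation is a reasonable approximation of the chance model.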
Statistical inference is the practice of making decisions and conclusions from data in the context of uncertainty. Errors do occur, just like rare events, and the data set at hand might lead us to the wrong conclusion. While a given data set may not always lead us to a correct conclusion, statistical inference gives us tools to control and evaluate how often these errors occur. Before getting into the nuances of hypothesis testing, let’s work through another case study.
5.1.2 Opportunity cost case study
How rational and consistent is the behavior of the typical American college student? In this section, we’ll explore whether college student consumers always consider the following: money not spent now can be spent later.
In particular, we are interested in whether reminding students about this well-known fact about money causes them to be a little thriftier. A skeptic might think that such a reminder would have no impact. We can summarize the two different perspectives using the null and alternative hypothesis framework.
- \(H_0:\) Null hypothesis. Reminding students that they can save money for later purchases will not have any impact on students’ spending decisions.
- \(H_A:\) Alternative hypothesis. Reminding students that they can save money for later purchases will reduce the chance they will continue with a purchase.
In this section, we’ll explore an experiment conducted by researchers that investigates this very question for students at a university in the southwestern United States.^{112}
Observed data
One-hundred and fifty students were recruited for the study, and each was given the following statement:
Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below.
Half of the 150 students were randomized into a control group and were given the following two options:
- (A) Buy this entertaining video.
- (B) Not buy this entertaining video.
The remaining 75 students were placed in the treatment group, and they saw a slightly modified option (B):
- (A) Buy this entertaining video.
- (B) Not buy this entertaining video. Keep the $14.99 for other purchases.
Would the extra statement reminding students of an obvious fact impact the purchasing decision? Table 5.3 summarizes the study results.
The data from this study can be found in the openintro package: opportunity_cost.
group | buy video | not buy video | Total |
---|---|---|---|
control | 56 | 19 | 75 |
treatment | 41 | 34 | 75 |
Total | 97 | 53 | 150 |
It might be a little easier to review the results using a visualization. Figure 5.5 shows that a higher proportion of students in the treatment group chose not to buy the video compared to those in the control group.
Another useful way to review the results from Table 5.3 is using row proportions, specifically considering the proportion of participants in each group who said they would buy or not buy the video. These summaries are given in Table 5.4.
group | buy video | not buy video | Total |
---|---|---|---|
control | 0.747 | 0.253 | 1 |
treatment | 0.547 | 0.453 | 1 |
We will define a success in this study as a student who chooses not to buy the video.^{113} Then, the value of interest is the change in video purchase rates that results by reminding students that not spending money now means they can spend the money later.
We can construct a point estimate for this difference as (\(T\) for treatment and \(C\) for control):
\[\hat{p}_{T} - \hat{p}_{C} = \frac{34}{75} - \frac{19}{75} = 0.453 - 0.253 = 0.200\]
The proportion of students who chose not to buy the video was 20 percentage points higher in the treatment group than the control group. However, is this result statistically significant? In other words, is a difference of 20 percentage points between the two groups so prominent that it is unlikely to have occurred from chance alone?
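The row proportions and the point estimate above follow mechanically from the counts in Table 5.3. The sketch below recomputes them. The tuple layout and the names `not_buy_rate`, `p_hat_t`, and `p_hat_c` are assumptions made for this example.

```python
# Counts from Table 5.3, stored as (buy video, not buy video); layout is assumed.
counts = {"control": (56, 19), "treatment": (41, 34)}


def not_buy_rate(group):
    """Proportion of students in `group` who chose not to buy (a 'success')."""
    buy, not_buy = counts[group]
    return not_buy / (buy + not_buy)


p_hat_c = not_buy_rate("control")
p_hat_t = not_buy_rate("treatment")
point_estimate = p_hat_t - p_hat_c

for group in counts:
    rate = not_buy_rate(group)
    print(f"{group}: buy {1 - rate:.3f}, not buy {rate:.3f}")
print(f"point estimate (treatment - control): {point_estimate:.3f}")
```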
Variability of the statistic
The primary goal in this data analysis is to understand what sort of differences we might see if the null hypothesis were true, i.e., the treatment had no effect on students. For this, we’ll use the same procedure we applied in Section 5.1.1: randomization.
Let’s think about the data in the context of the hypotheses. If the null hypothesis \((H_0)\) was true and the treatment had no impact on student decisions, then the observed difference between the two groups of 20% could be attributed entirely to random chance. If, on the other hand, the alternative hypothesis \((H_A)\) is true, then the difference indicates that reminding students about saving for later purchases actually impacts their buying decisions.
Observed statistic vs. null statistics
Just like with the gender discrimination study, we can perform a statistical analysis. Using the same randomization technique from the last section, let’s see what happens when we simulate the experiment under the scenario where there is no effect from the treatment.
While we would in reality do this simulation on a computer, it might be useful to think about how we would go about carrying out the simulation without a computer.
We start with 150 index cards and label each card to indicate the distribution of our response variable: decision.
That is, 53 cards will be labeled “not buy video” to represent the 53 students who opted not to buy, and 97 will be labeled “buy video” for the other 97 students.
Then we shuffle these cards thoroughly and divide them into two stacks of size 75, representing the simulated treatment and control groups.
Any observed difference between the proportions of “not buy video” cards (what we earlier defined as success) can be attributed entirely to chance.
If we are randomly assigning the cards into the simulated treatment and control groups, how many “not buy video” cards would we expect to end up in each simulated group? What would be the expected difference between the proportions of “not buy video” cards in each group?
Since the simulated groups are of equal size, we would expect \(53 / 2 = 26.5,\) i.e., 26 or 27, “not buy video” cards in each simulated group, yielding a simulated point estimate of the difference in proportions of 0%. However, due to random chance, we might also expect to sometimes observe a number a little above or below 26 and 27.
The results of a single randomization from chance alone are shown in Table 5.5.
group | buy video | not buy video | Total |
---|---|---|---|
control | 46 | 29 | 75 |
treatment | 51 | 24 | 75 |
Total | 97 | 53 | 150 |
From this table, we can compute a difference that occurred from the first shuffle of the data (i.e., from chance alone):
\[\hat{p}_{T, shfl1} - \hat{p}_{C, shfl1} = \frac{24}{75} - \frac{29}{75} = 0.32 - 0.387 = - 0.067\]
Just one simulation will not be enough to get a sense of what sorts of differences would happen from chance alone.
We’ll simulate another set of simulated groups and compute the new difference: 0.04.
And again: 0.12.
And again: -0.013.
We’ll do this 1,000 times.
The results are summarized in a dot plot in Figure 5.6, where each point represents a simulation.
Since there are so many points, it is more convenient to summarize the results in a histogram such as the one in Figure 5.7, where the height of each histogram bar represents the number of simulations resulting in an outcome of that magnitude.
If there was no treatment effect, then we’d only observe a difference of at least +20% about 0.6% of the time.
That is really rare!
Rather than attribute the observed difference to chance, we conclude that the data provide strong evidence there is a treatment effect: reminding students before a purchase that they could instead spend the money later on something else lowers the chance that they will continue with the purchase.
Notice that we are able to make a causal statement for this study since the study is an experiment, although we don’t know why the reminder induces a lower purchase rate.
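The card-shuffling simulation for this study can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind Figures 5.6 and 5.7: the 0/1 encoding, the names, and the seed are invented for the example, and the estimated proportion will vary from run to run around the roughly 0.6% reported above.

```python
import random

N_SIMULATIONS = 1_000
OBSERVED_CARD_GAP = 34 - 19  # treatment minus control "not buy video" counts


def shuffled_card_gap(rng):
    """One shuffle: 'not buy' cards in sham treatment minus sham control."""
    cards = [1] * 53 + [0] * 97  # 1 = "not buy video", 0 = "buy video"
    rng.shuffle(cards)
    treatment, control = cards[:75], cards[75:]
    return sum(treatment) - sum(control)


rng = random.Random(7)  # arbitrary seed for repeatability
gaps = [shuffled_card_gap(rng) for _ in range(N_SIMULATIONS)]

# Both groups have 75 students, so a gap of 15 cards is a difference of 0.20.
n_extreme = sum(gap >= OBSERVED_CARD_GAP for gap in gaps)
p_value = n_extreme / N_SIMULATIONS
print(f"{n_extreme} of {N_SIMULATIONS} shuffles gave a difference of at least 0.20")
```

Comparing integer card counts instead of proportions avoids any floating-point ties when checking whether a shuffle is at least as extreme as the observed data.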
5.1.3 Hypothesis testing
In the last two sections, we utilized a hypothesis test, which is a formal technique for evaluating two competing possibilities.
In each scenario, we described a null hypothesis, which represented either a skeptical perspective or a perspective of no difference.
We also laid out an alternative hypothesis, which represented a new perspective such as the possibility that there has been a change or that there is a treatment effect in an experiment.
The alternative hypothesis is usually the reason the scientists set out to do the research in the first place.
Null and alternative hypotheses.
The null hypothesis \((H_0)\) often represents either a skeptical perspective or a claim to be tested.
The alternative hypothesis \((H_A)\) represents an alternative claim under consideration and is often represented by a range of possible values for the value of interest.
If a person makes a somewhat unbelievable claim, we are initially skeptical.
However, if there is sufficient evidence that supports the claim, we set aside our skepticism.
The hallmarks of hypothesis testing are also found in the US court system.
The US court system
A US court considers two possible claims about a defendant: they are either innocent or guilty.
If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?
The jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person’s guilt.
That is, the skeptical perspective (null hypothesis) is that the person is innocent until evidence is presented that convinces the jury that the person is guilty (alternative hypothesis).
Jurors examine the evidence to see whether it convincingly shows a defendant is guilty.
Notice that if a jury finds a defendant not guilty, this does not necessarily mean the jury is confident in the person’s innocence.
They are simply not convinced of the alternative that the person is guilty.
This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as truth.
Failing to find strong evidence for the alternative hypothesis is not equivalent to providing evidence that the null hypothesis is true. We will see this idea in greater detail in Section 6.2.2.
p-value and statistical significance
In Section 5.1.1 we encountered a study from the 1970s that explored whether there was strong evidence that women were less likely to be promoted than men. The research question – are females discriminated against in promotion decisions? – was framed in the context of hypotheses:
\(H_0:\) Gender has no effect on promotion decisions.
\(H_A:\) Women are discriminated against in promotion decisions.
The null hypothesis \((H_0)\) was a perspective of no difference. The data on gender discrimination provided a point estimate of a 29.2% difference in recommended promotion rates between men and women. We determined that such a difference from chance alone would be rare: it would only happen about 2 in 100 times. When results like these are inconsistent with \(H_0,\) we reject \(H_0\) in favor of \(H_A.\) Here, we concluded there was discrimination against women.
The 2-in-100 chance is what we call a p-value, which is a probability quantifying the strength of the evidence against the null hypothesis and in favor of the alternative.
p-value.
The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current dataset, if the null hypothesis were true. We typically use a summary statistic of the data, such as a difference in proportions, to help compute the p-value and evaluate the hypotheses. This summary value that is used to compute the p-value is often called the test statistic.
In the gender discrimination study, the difference in promotion rates was our test statistic. What was the test statistic in the opportunity cost study covered in Section 5.1.2?
The test statistic in the opportunity cost study was the difference in the proportion of students who decided against the video purchase in the treatment and control groups. In each of these examples, the point estimate of the difference in proportions was used as the test statistic.
When the p-value is small, i.e., less than a previously set threshold, we say the results are statistically significant. This means the data provide such strong evidence against \(H_0\) that we reject the null hypothesis in favor of the alternative hypothesis. The threshold, called the significance level and often represented by \(\alpha\) (the Greek letter alpha), is typically set to \(\alpha = 0.05,\) but can vary depending on the field or the application. Using a significance level of \(\alpha = 0.05\) in the discrimination study, we can say that the data provided statistically significant evidence against the null hypothesis.
Statistical significance.
We say that the data provide statistically significant evidence against the null hypothesis if the p-value is less than some reference value, often \(\alpha=0.05.\)
In the opportunity cost study in Section 5.1.2, we analyzed an experiment where study participants had a 20% drop in likelihood of continuing with a video purchase if they were reminded that the money, if not spent on the video, could be used for other purchases in the future. We determined that such a large difference would only occur 6-in-1000 times if the reminder actually had no influence on student decision-making. What is the p-value in this study? Was the result statistically significant?
The p-value was 0.006. Since the p-value is less than 0.05, the data provide statistically significant evidence that US college students were actually influenced by the reminder.
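The decision rule described above is simple enough to write down as a function. The sketch below applies it to the two approximate p-values from this section. The function name `is_statistically_significant` is an assumption made for illustration.

```python
def is_statistically_significant(p_value, alpha=0.05):
    """Return True when the p-value falls below the significance level alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    return p_value < alpha


# Approximate p-values reported in Sections 5.1.1 and 5.1.2.
for study, p in [("gender discrimination", 0.02), ("opportunity cost", 0.006)]:
    verdict = "reject H0" if is_statistically_significant(p) else "fail to reject H0"
    print(f"{study}: p-value = {p}, {verdict} at alpha = 0.05")
```

Note that the conclusion depends on the chosen threshold: with a stricter level such as 0.01, the gender discrimination result would no longer count as statistically significant, while the opportunity cost result still would.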
What’s so special about 0.05?
We often use a threshold of 0.05 to determine whether a result is statistically significant. But why 0.05? Maybe we should use a bigger number, or maybe a smaller number. If you’re a little puzzled, that probably means you’re reading with a critical eye – good job! We’ve made a video to help clarify why 0.05:
https://www.openintro.org/book/stat/why05/
Sometimes it’s also a good idea to deviate from the standard. We’ll discuss when to choose a threshold different than 0.05 in Section 6.2.2.
5.1.4 Randomization test summary
Figure 5.8 provides a visual summary of the randomization testing procedure.
We can summarize the randomization test procedure as follows:
- Frame the research question in terms of hypotheses. Hypothesis tests are appropriate for research questions that can be summarized in two competing hypotheses. The null hypothesis \((H_0)\) usually represents a skeptical perspective or a perspective of no difference. The alternative hypothesis \((H_A)\) usually represents a new view or a difference.
- Collect data with an observational study or experiment. If a research question can be formed into two hypotheses, we can collect data to run a hypothesis test. If the research question focuses on associations between variables but does not concern causation, we would run an observational study. If the research question seeks a causal connection between two or more variables, then an experiment should be used.
- Model the randomness as if the null hypothesis was true. In the examples above, the variability has been modeled as if the treatment (e.g., gender, opportunity) allocation was independent of the outcome of the study. The computer generated the null distribution from many different randomizations in order to quantify the null variability.
- Analyze the data. Choose an analysis technique appropriate for the data and identify the p-value. So far, we’ve only seen one analysis technique: randomization. Throughout the rest of this textbook, we’ll encounter several new methods suitable for many other contexts.
- Form a conclusion. Using the p-value from the analysis, determine whether the data provide statistically significant evidence against the null hypothesis. Also, be sure to write the conclusion in plain language so casual readers can understand the results.
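The modelling and analysis steps above can be collected into one small, reusable function. The sketch below is a one-sided randomization test for a difference in proportions, written under stated assumptions: outcomes are coded as 0/1 lists, and the name `randomization_test` and its parameters `n_shuffles` and `seed` are hypothetical choices for this illustration, not an established API.

```python
import random


def randomization_test(group_a, group_b, n_shuffles=1_000, seed=0):
    """Estimate a one-sided p-value for the difference in proportions (a - b).

    group_a, group_b: lists of 0/1 outcomes (1 = success).
    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def diff(a, b):
        return sum(a) / len(a) - sum(b) / len(b)

    observed = diff(group_a, group_b)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # model the null: group labels are unrelated to outcomes
        if diff(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            extreme += 1
    return observed, extreme / n_shuffles


# Usage with the gender discrimination counts (1 = promoted).
male = [1] * 21 + [0] * 3
female = [1] * 14 + [0] * 10
observed, p_value = randomization_test(male, female)
print(f"observed difference: {observed:.3f}, estimated p-value: {p_value:.3f}")
```

The estimated p-value depends on the seed and the number of shuffles, so it should be read as an approximation of the chance model rather than an exact figure.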
Table 5.6 is another look at the Randomization Test summary.
Randomization Test | |
---|---|
What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment. |
What is the random process described? | Randomized experiment. |
What other random processes can be approximated? | Can also be used to describe random sampling in an observational model |
What is it best for? | Hypothesis Testing (can be used for Confidence Intervals, but not covered in this text). |
What physical object represents the simulation process? | Shuffling cards |
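The shuffling step summarized above can be sketched in base R. This is only an illustration under stated assumptions: the counts below (24 units per group, 18 and 12 successes) are hypothetical and are not taken from any case study in this chapter, and the names `diff_in_props`, `null_diffs`, and `p_value` are ours.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data (not from the text): 24 units per group.
group   <- rep(c("treatment", "control"), each = 24)
outcome <- c(rep(c(1, 0), times = c(18, 6)),    # treatment: 18 successes
             rep(c(1, 0), times = c(12, 12)))   # control: 12 successes

# Difference in sample proportions, treatment minus control.
diff_in_props <- function(outcome, group) {
  mean(outcome[group == "treatment"]) - mean(outcome[group == "control"])
}

obs_diff <- diff_in_props(outcome, group)

# Shuffle the explanatory variable many times to build the null distribution.
null_diffs <- replicate(10000, diff_in_props(outcome, sample(group)))

# One-sided p-value: share of shuffles at least as large as the observed value.
p_value <- mean(null_diffs >= obs_diff)
p_value
```

Calling `sample(group)` with no other arguments returns a random permutation of the labels, which plays the role of shuffling the cards.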
5.1.5 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
5.2 Bootstrap confidence intervals
As seen in Section 5.1, randomization is a statistical technique suitable for evaluating whether a difference in sample proportions is due to chance.
Randomization tests are best suited for modeling experiments where the treatment (explanatory variable) has been randomly assigned to the observational units and there is an attempt to answer a simple yes/no research question.
For example, consider the following research questions that can be well assessed with a randomization test:
- Does this vaccine make it less likely that a person will get malaria?
- Does drinking caffeine affect how quickly a person can tap their finger?
- Can we predict whether candidate A will win the upcoming election?
In this section, however, we are instead interested in a different approach to understanding population parameters. Instead of testing a claim, the goal now is to estimate the unknown value of a population parameter.
For example,
- How much less likely am I to get malaria if I get the vaccine?
- How much faster (or slower) can a person tap their finger, on average, if they drink caffeine first?
- What proportion of the vote will go to candidate A?
Here, we explore the situation where focus is on a single proportion, and we introduce a new simulation method: bootstrapping.
Bootstrapping is best suited for modeling studies where the data have been generated through random sampling from a population.
As with randomization tests, our goal with bootstrapping is to understand variability of a statistic.
Unlike randomization tests (which modeled how the statistic would change if the treatment had been allocated differently), the bootstrap will model how a statistic changes from repeated sampling. How a statistic varies from sample to sample will provide information about how different the statistic is from the parameter of interest.
Quantifying the variability of a statistic from sample to sample is a hard problem.
Fortunately, sometimes the mathematical theory for how a statistic varies (across different samples) is well-known; this is the case for the sample proportion as seen in Section 5.3.
However, some statistics don’t have simple theory for how they vary, and bootstrapping provides a computational approach for providing interval estimates for almost any population parameter (we will revisit bootstrapping in Chapters 6, 7, and 8 so you’ll get plenty of practice as well as exposure to bootstrapping in many different data settings).
Our goal with bootstrapping will be to produce an interval estimate (a range of plausible values) for the population parameter.
If we could, we would measure the variability of the statistic by repeatedly taking sample data from the population and computing the sample proportion.
Then we could do it again.
And again.
And so on until we have a good sense of the variability of our original estimate.
When the sampling variability is large, we would assume that the original statistic is possibly far from the true population parameter of interest (and the interval estimate will be wide).
When the variability across the samples is small, we expect the sample statistic to be close to the true parameter of interest (and the interval estimate will be narrow).
In practice, sampling data is almost never free or extremely cheap, and taking repeated samples from a population is usually impossible.
So, instead of using a “resample from the population” approach, bootstrapping uses a “resample from the sample” approach.
The sections below provide examples and details about the bootstrapping process.
5.2.1 Medical consultant case study
People providing an organ for donation sometimes seek the help of a special medical consultant. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients.
Observed data
One consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).
We will let \(p\) represent the true complication rate for liver donors working with this consultant. (The “true” complication rate will be referred to as the parameter.) We estimate \(p\) using the data, and label the estimate \(\hat{p}.\)
The sample proportion for the complication rate is 3 complications divided by the 62 surgeries the consultant has worked on: \(\hat{p} = 3/62 = 0.048.\)
Is it possible to assess the consultant’s claim (that the reduction in complications is due to her work) using the data?
No.
The claim is that there is a causal connection, but the data are observational, so we must be on the lookout for confounding variables.
For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.
While it is not possible to assess the causal claim, it is still possible to understand the consultant’s true rate of complications.
Parameter.
A parameter is the “true” value of interest.
We typically estimate the parameter using a point estimate from a sample of data. The point estimate is also known as the statistic.
For example, we estimate the probability \(p\) of a complication for a client of the medical consultant by examining the past complication rates of her clients:
\[\hat{p} = 3 / 62 = 0.048~\text{is used to estimate}~p\]
Variability of the statistic
In the medical consultant case study, the parameter is \(p,\) the true probability of a complication for a client of the medical consultant.
There is no reason to believe that \(p\) is exactly \(\hat{p} = 3/62,\) but there is also no reason to believe that \(p\) is particularly far from \(\hat{p} = 3/62.\)
By sampling with replacement from the dataset (a process called bootstrapping), the variability of the possible \(\hat{p}\) values can be approximated.
Most of the inferential procedures covered in this text are grounded in quantifying how one data set would differ from another when they are both taken from the same population.
It doesn’t make sense to take repeated samples from the same population because if you have the means to take more samples, a larger sample size will benefit you more than separately evaluating two samples of the exact same size.
Instead, we measure how the samples behave under an estimate of the population.
Figure 5.9 shows how the unknown original population can be estimated by using the sample to approximate the proportion of successes and failures (in our case, the proportion of complications and no complications for the medical consultant).
By taking repeated samples from the estimated population, the variability from sample to sample can be observed.
In Figure 5.10 the repeated bootstrap samples are obviously different both from each other and from the original population.
Recall that the bootstrap samples were taken from the same (estimated) population, and so the differences are due entirely to natural variability in the sampling procedure.
By summarizing each of the bootstrap samples (here, using the sample proportion), we see, directly, the variability of the sample proportion, \(\hat{p},\) from sample to sample.
The distribution of \(\hat{p}_{boot}\) for the example scenario is shown in Figure 5.11, and the full bootstrap distribution for the medical consultant data is shown in Figure 5.14.
It turns out that in practice, it is very difficult for computers to work with an infinite population (with the same proportional breakdown as in the sample).
However, there is a physical and computational model which produces an equivalent bootstrap distribution of the sample proportion in a computationally efficient manner.
Consider the observed data to be a bag of marbles, 3 of which are successes (red) and 4 of which are failures (white).
By drawing the marbles out of the bag with replacement, we depict the exact same sampling process as was done with the infinitely large estimated population.
If we apply the bootstrap sampling process to the medical consultant example, we consider each client to be one of the marbles in the bag.
There will be 59 white marbles (no complication) and 3 red marbles (complication).
If we choose 62 marbles out of the bag (one at a time with replacement) and compute the proportion of simulated patients with complications, \(\hat{p}_{boot},\) then this “bootstrap” proportion represents a single simulated proportion from the “resample from the sample” approach.
In a simulation of 62 patients, about how many would we expect to have had a complication?^{114}
One simulation isn’t enough to get a sense of the variability from one bootstrap proportion to another bootstrap proportion, so we repeat the simulation 10,000 times using a computer.
Figure 5.14 shows the distribution from the 10,000 bootstrap simulations.
The middle 95% of the bootstrapped proportions fall between about zero and 11.3%.
The variability in the bootstrapped proportions leads us to believe that the true probability of complication (the parameter, \(p\)) is somewhere between 0 and 11.3%.
The range of values for the true proportion is called a bootstrap percentile confidence interval, and we will see it again throughout the next few sections and chapters.
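The marble-bag simulation for the medical consultant can be sketched in base R as follows. This is a minimal sketch rather than the code used to produce the figures; the seed and the names `clients`, `one_boot_prop`, and `p_hat_boot` are our own choices.

```r
set.seed(62)  # arbitrary seed so the sketch is reproducible

# The "bag of marbles": 1 = complication (red), 0 = no complication (white).
clients <- rep(c(1, 0), times = c(3, 59))

# One bootstrap proportion: draw 62 marbles with replacement and average.
one_boot_prop <- function(x) {
  mean(sample(x, size = length(x), replace = TRUE))
}

# Repeat the resampling 10,000 times.
p_hat_boot <- replicate(10000, one_boot_prop(clients))

range(p_hat_boot)                              # smallest and largest values
quantile(p_hat_boot, probs = c(0.025, 0.975))  # middle 95% of the values
```

Because the resampling is random, the exact numbers change from run to run (and with the seed), but the overall spread of the bootstrap proportions should be stable across runs.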
The original claim was that the consultant’s true rate of complication was under the national rate of 10%. Does the interval estimate of 0 to 11.3% for the true probability of complication indicate that the medical consultant has a lower rate of complications than the national average? Explain.
No. Because the interval overlaps 10%, it might be that the consultant’s work is associated with a lower risk of complications, or it might be that the consultant’s work is associated with a higher risk (i.e., greater than 10%) of complications! Additionally, as previously mentioned, because this is an observational study, even if an association can be measured, there is no evidence that the consultant’s work is the cause of the complication rate (being higher or lower).
5.2.2 Tappers and listeners case study
Here’s a game you can try with your friends or family: pick a simple, well-known song, tap that tune on your desk, and see if the other person can guess the song. In this simple game, you are the tapper, and the other person is the listener.
Observed data
A Stanford University graduate student named Elizabeth Newton conducted an experiment using the tapper-listener game.^{115} She recruited 120 tappers and 120 listeners into the study. About 50% of the tappers expected that the listener would be able to guess the song. Newton wondered, is 50% a reasonable expectation?
In Newton’s study, only 3 out of 120 listeners (\(\hat{p} = 0.025\)) were able to guess the tune! That seems like quite a low number, which leads the researcher to ask: what is the true proportion of people who can guess the tune?
Variability of the statistic
To answer the question, we will again use a simulation.
To simulate 120 games, this time we use a bag of 120 marbles: 3 are red (for those who guessed correctly) and 117 are white (for those who could not guess the song).
Sampling from the bag 120 times (remembering to replace the marble back into the bag each time to keep constant the population proportion of red) produces one bootstrap sample.
For example, we can start by simulating 5 tapper-listener pairs by sampling 5 marbles from the bag of 3 red and 117 white marbles.
W | W | W | R | W |
---|---|---|---|---|
Wrong | Wrong | Wrong | Correct | Wrong |
After selecting 120 marbles, we counted 2 red for \(\hat{p}_{boot1} = 0.0167.\) As we did with the randomization technique, seeing what would happen with one simulation isn’t enough. In order to understand how far the observed proportion of 0.025 might be from the true parameter, we should generate more simulations. Here we’ve repeated the entire simulation ten times:
\[0.0417 \quad 0.0250 \quad 0.0250 \quad 0.0083 \quad 0.0500 \quad 0.0333 \quad 0.0250 \quad 0.0000 \quad 0.0083 \quad 0.0000\]
As before, we’ll run a total of 10,000 simulations using a computer. As seen in Figure 5.15, the range of 95% of the resampled values of \(\hat{p}_{boot}\) is 0.000 to 0.0583. That is, we expect that between 0% and 5.83% of people are truly able to guess the tapper’s tune.
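For a single proportion, drawing 120 marbles with replacement from a bag in which 3 of 120 are red is equivalent to one binomial draw with 120 trials and success probability 3/120. The sketch below uses that shortcut, which is our assumption about a convenient implementation and not necessarily how the figure was produced; the variable names and seed are also ours.

```r
set.seed(120)  # arbitrary seed so the sketch is reproducible

n     <- 120       # listeners in the study
p_hat <- 3 / 120   # observed proportion who guessed the tune

# The count of red marbles in n draws with replacement is Binomial(n, p_hat),
# so rbinom() stands in for physically drawing the marbles.
p_hat_boot <- rbinom(10000, size = n, prob = p_hat) / n

# Middle 95% of the bootstrap proportions.
quantile(p_hat_boot, probs = c(0.025, 0.975))
```

The resulting endpoints can be compared with the interval read from Figure 5.15; small differences are expected because each run uses different random draws.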
Do the data provide “statistically significant” evidence against the claim that 50% of listeners can guess the tapper’s tune?^{116}
5.2.3 Confidence intervals
A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely perfect; usually there is some error in the estimate. In addition to supplying a point estimate of a parameter, a next logical step would be to provide a plausible range of values for the parameter.
Plausible range of values for the population parameter
A plausible range of values for the population parameter is called a confidence interval. Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net. We can throw a spear where we saw a fish, but we will probably miss. On the other hand, if we toss a net in that area, we have a good chance of catching the fish.
If we report a point estimate, we probably will not hit the exact population parameter. On the other hand, if we report a range of plausible values – a confidence interval – we have a good shot at capturing the parameter.
If we want to be very certain we capture the population parameter, should we use a wider interval or a narrower interval?^{117}
Bootstrap confidence interval
As we saw above, a bootstrap sample is a sample of the original sample. In the case of the medical complications data, we proceed as follows:
- Randomly sample one observation from the 62 patients (replace the marble back into the bag so as to keep constant the population).
- Randomly sample a second observation from the 62 patients. Because we sample with replacement (i.e., we don’t actually remove the marbles from the bag), there is a 1-in-62 chance that the second observation will be the same one sampled in the first step!
- Keep going one sampled observation at a time …
- Randomly sample a 62nd observation from the 62 patients.
Bootstrap sampling is often called sampling with replacement.
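The one-at-a-time sampling steps above can be mimicked in base R by labeling the clients 1 through 62. The sketch below is illustrative only (the names and seed are ours); it also checks the 1-in-62 claim by simulation.

```r
set.seed(5)  # arbitrary seed so the sketch is reproducible

patients <- 1:62   # one label per client

# One bootstrap sample: 62 draws, each made from all 62 clients.
boot_ids <- sample(patients, size = 62, replace = TRUE)
length(unique(boot_ids))    # number of distinct clients drawn
sum(duplicated(boot_ids))   # number of draws that repeat an earlier client

# Chance that the second draw repeats the first: should be near 1/62.
repeat_second <- replicate(20000, {
  two <- sample(patients, size = 2, replace = TRUE)
  two[1] == two[2]
})
mean(repeat_second)
1 / 62
```

Since some clients are drawn more than once, others are necessarily left out of any given bootstrap sample, which is why different bootstrap samples produce different proportions.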
A bootstrap sample behaves similarly to how an actual sample from a population would behave, and we compute the point estimate of interest (here, compute \(\hat{p}_{boot}\)).
Due to theory that is beyond this text, we know that the bootstrap proportions \(\hat{p}_{boot}\) vary around \(\hat{p}\) in a similar way to how different sample proportions (i.e., values of \(\hat{p}\)) vary around the true parameter \(p.\)
Therefore, an interval estimate for \(p\) can be produced using the \(\hat{p}_{boot}\) values themselves.
95% Bootstrap percentile confidence interval for a parameter \(p.\)
The 95% bootstrap confidence interval for the parameter \(p\) can be obtained directly using the ordered \(\hat{p}_{boot}\) values.
Consider the sorted \(\hat{p}_{boot}\) values. Call the 2.5th percentile of the bootstrapped proportions “lower,” and call the 97.5th percentile “upper.”
The 95% confidence interval is given by: (lower, upper)
In Section 6.1.1 we will discuss different percentages for the confidence interval (e.g., 90% confidence interval or 99% confidence interval).
Section 6.1.1 also provides a longer discussion on what “95% confidence” actually means.
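The percentile recipe can be wrapped in a small helper. This is a sketch under our own naming (`percentile_ci`, `boot_stats`, and `level` are hypothetical names, not from the text), and it relies on R's `quantile()`, which handles the ordering of the values internally.

```r
# Percentile interval from a vector of bootstrap statistics.
percentile_ci <- function(boot_stats, level = 0.95) {
  tail_area <- (1 - level) / 2   # e.g. 0.025 in each tail for 95%
  c(lower = unname(quantile(boot_stats, probs = tail_area)),
    upper = unname(quantile(boot_stats, probs = 1 - tail_area)))
}

# Stand-in values 0, 0.001, ..., 1 (not real bootstrap output), chosen so
# the cutoffs are easy to check by eye.
stand_in <- seq(0, 1, by = 0.001)
percentile_ci(stand_in)                 # lower 0.025, upper 0.975
percentile_ci(stand_in, level = 0.90)   # lower 0.05,  upper 0.95
```

In practice `boot_stats` would hold the simulated \(\hat{p}_{boot}\) values, and changing `level` gives the other confidence levels mentioned above.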
5.2.4 Bootstrap summary
We can summarize the bootstrap process as follows:
- Frame the research question in terms of a parameter to estimate. Confidence Intervals are appropriate for research questions that aim to estimate a number from the population (called a parameter).
- Collect data with an observational study or experiment. If a research question can be formed as a query about the parameter, we can collect data to calculate a statistic which is the best guess we have for the value of the parameter. However, we know that the statistic won’t be exactly equal to the parameter due to natural variability.
- Model the randomness by using the data values as a proxy for the population. In order to assess how far the statistic might be from the parameter, we take repeated resamples from the dataset to measure the variability in bootstrapped statistics. The variability of the bootstrapped statistics around the observed statistic (a quantity which can be measured through computational technique) should be approximately the same as the variability of many observed sample statistics around the parameter (a quantity which is very difficult to measure because in real life we only get exactly one sample).
- Create the interval. After choosing a particular confidence level, use the variability of the bootstrapped statistics to create an interval estimate which we hope will capture the true parameter. While the interval estimate associated with the particular sample at hand may or may not capture the parameter, the researcher knows that over their lifetime, the confidence level will determine the percentage of their research confidence intervals that do capture the true parameter.
- Form a conclusion. Using the confidence interval from the analysis, report on the interval estimate for the parameter of interest. Also, be sure to write the conclusion in plain language so casual readers can understand the results.
Table 5.7 is another look at the Bootstrap process summary.
Bootstrapping | |
---|---|
What does it do? | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population. |
What is the random process described? | Random sampling from a population. |
What other random processes can be approximated? | Can also be used to describe random allocation in an experiment |
What is it best for? | Confidence Intervals (bootstrap HT for one proportion covered in Chapter 6). |
What physical object represents the simulation process? | Pulling marbles from a bag |
5.2.5 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
5.3 Mathematical models
5.3.1 Central Limit Theorem
We’ve encountered four case studies so far in this chapter. While they differ in the settings, in their outcomes, and also in the technique we’ve used to analyze the data, they all have something in common: the general shape of the distribution of the statistics (called the sampling distribution).
Null distribution
Figure 5.17 shows the null distributions in each of the four case studies where we ran 10,000 simulations. Note that the null distribution is the sampling distribution of the statistic created under the setting where the null hypothesis is true. Therefore, the null distribution will always be centered at the value of the parameter given by the null hypothesis. In the case of the opportunity cost study, which originally had just 1,000 simulations, we’ve included an additional 9,000 simulations.
Describe the shape of the distributions and note anything that you find interesting.^{118}
The case study for the medical consultant is the only distribution with any evident skew.
As we observed in Chapter 1, it’s common for distributions to be skewed or contain outliers.
However, the null distributions we’ve so far encountered have all looked somewhat similar and, for the most part, symmetric.
They all resemble a bell-shaped curve.
The bell-shaped curve similarity is not a coincidence, but rather, is guaranteed by mathematical theory.
Central Limit Theorem for proportions.
If we look at a proportion (or difference in proportions) and the scenario satisfies certain conditions, then the sample proportion (or difference in proportions) will appear to follow a bell-shaped curve called the normal distribution.
An example of a perfect normal distribution is shown in Figure 5.18.
Imagine laying a normal curve over each of the four null distributions in Figure 5.17.
While the mean (center) and standard deviation (width or spread) may change for each plot, the general shape remains roughly intact.
Mathematical theory guarantees that if repeated samples are taken, a sample proportion or a difference in sample proportions will follow something that resembles a normal distribution when certain conditions are met. (Note: we typically only take one sample, but the mathematical model lets us know what to expect if we had taken repeated samples.) These conditions fall into two general categories describing the independence between observations and the need to take a sufficiently large sample size.
Observations in the sample are independent. Independence is guaranteed when we take a random sample from a population. Independence can also be guaranteed if we randomly divide individuals into treatment and control groups.
The sample is large enough. The sample size cannot be too small. What qualifies as “small” differs from one context to the next, and we’ll provide suitable guidelines for proportions in Chapter 6.
So far we’ve had no need for the normal distribution. We’ve been able to answer our questions somewhat easily using simulation techniques. However, soon this will change. Simulating data can be non-trivial. For example, some of the scenarios encountered in Chapters 3 and 4 would require complex simulations in order to make inferential conclusions. Instead, the normal distribution and other distributions like it offer a general framework for statistical inference that applies to a very large number of settings.
Technical Conditions
In order for the normal approximation to describe the sampling distribution of the sample proportion as it varies from sample to sample, two conditions must hold. If these conditions do not hold, it is unwise to use the normal distribution (and related concepts like Z scores, probabilities from the normal curve, etc.) for inferential analyses.
- independent observations
- large enough sample (for proportions, at least 10 successes and 10 failures should have been observed in the sample)
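The sample-size condition is simple to check with code. The helper below is a sketch with a hypothetical name (`success_failure_ok`); it checks only the count condition, and independence still has to be judged from how the data were collected.

```r
# TRUE when at least `min_count` successes and `min_count` failures
# were observed in a sample of size n.
success_failure_ok <- function(successes, n, min_count = 10) {
  failures <- n - successes
  successes >= min_count && failures >= min_count
}

success_failure_ok(3, 62)    # medical consultant: 3 complications, 59 without
success_failure_ok(3, 120)   # tappers and listeners: 3 correct, 117 wrong
```

Both calls return FALSE because only 3 successes were observed in each study, so the normal approximation would be a poor choice for either proportion.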
Anticipating frequent use of the normal distribution
Below we introduce three new settings where the normal distribution will be useful, and constructing suitable simulations can be difficult.
The opportunity cost study determined that students are thriftier if they are reminded that saving money now means they can spend the money later. The study’s point estimate for the impact was 20%, meaning 20% fewer students would move forward with a video purchase in the study scenario. However, as we’ve learned, point estimates aren’t perfect – they only provide an approximation of the truth.
It would be useful if we could provide a range of plausible values for the impact, more formally known as a confidence interval. In many situations it is difficult to construct a reliable confidence interval using simulations.^{119} However, doing so is reasonably straightforward using the normal distribution.
Book prices were collected for 73 courses at UCLA in Spring 2010. Data were collected from both the UCLA Bookstore and Amazon. The differences in these prices are shown in Figure 5.19. The mean difference in the price of the books was $12.76, and we might wonder, does this provide strong evidence that the prices differ between the two book sellers?
Here again we can apply the normal distribution, this time in the context of numerical data. We’ll explore this example and construct such a hypothesis test in Section 7.3.
Elmhurst College in Illinois released anonymized data for family income and financial support provided by the school for Elmhurst’s first-year students in 2011. Figure 5.20 shows a regression line fit to a scatterplot of a sample of the data. One question we will ask is, do the data show a real linear trend, or is the trend we observe reasonably explained by random chance?
In Chapter 3 we learned how to apply least squares regression to quantify the trend. In Chapter 8 we will focus on whether or not that trend can be explained by chance alone. For this case study, we could again use the normal distribution to help us answer this question.
These examples highlight the value of the normal distribution approach. However, before we can apply the normal distribution to statistical inference, it is necessary to become familiar with the mechanics of the normal distribution. In Section 5.3.2 we discuss characteristics of the normal distribution and explore examples of data that follow a normal distribution. In Section 5.3.3, we apply the new knowledge in the context of hypothesis tests and confidence intervals.
5.3.2 Normal Distribution
Among all the distributions we see in statistics, one is overwhelmingly the most common. The symmetric, unimodal, bell curve is ubiquitous throughout statistics. It is so common that people know it by a variety of names including the normal curve, normal model, or normal distribution.^{120} Under certain conditions, sample proportions, sample means, and sample differences can be modeled using the normal distribution. Additionally, some variables such as SAT scores and heights of US adult males closely follow the normal distribution.
Normal distribution facts.
Many summary statistics and variables are nearly normal, but none are exactly normal. Thus the normal distribution, while not perfect for any single problem, is very useful for a variety of problems. We will use it in data exploration and to solve important problems in statistics.
In this section, we will discuss the normal distribution in the context of data to become familiar with normal distribution techniques. In the following sections and beyond, we’ll move our discussion to focus on applying the normal distribution and other related distributions to model point estimates for hypothesis tests and for constructing confidence intervals.
Normal distribution model
The normal distribution always describes a symmetric, unimodal, bell-shaped curve. However, normal curves can look different depending on the details of the model. Specifically, the normal model can be adjusted using two parameters: mean and standard deviation. As you can probably guess, changing the mean shifts the bell curve to the left or right, while changing the standard deviation stretches or constricts the curve. Figure 5.21 shows the normal distribution with mean \(0\) and standard deviation \(1\) (which is commonly referred to as the standard normal distribution) on top. A normal distribution with mean \(19\) and standard deviation \(4\) is shown on the bottom. Figure 5.22 shows the same two normal distributions on the same axis.
If a normal distribution has mean \(\mu\) and standard deviation \(\sigma,\) we may write the distribution as \(N(\mu, \sigma).\) The two distributions in Figure 5.22 can be written as
\[ N(\mu = 0, \sigma = 1)\quad\text{and}\quad N(\mu = 19, \sigma = 4) \]
Because the mean and standard deviation describe a normal distribution exactly, they are called the distribution’s parameters.
Write down the short-hand for a normal distribution with (a) mean 5 and standard deviation 3, (b) mean -100 and standard deviation 10, and (c) mean 2 and standard deviation 9.^{121}
Standardizing with Z scores
Table 5.8 shows the mean and standard deviation for total scores on the SAT and ACT. The distributions of SAT and ACT scores are both nearly normal. Suppose Ann scored 1800 on her SAT and Tom scored 24 on his ACT. Who performed better?^{122}
 | SAT | ACT |
---|---|---|
Mean | 1500 | 21 |
SD | 300 | 5 |
The solution to the previous example relies on a standardization technique called a Z score, a method most commonly employed for nearly normal observations (but that may be used with any distribution). The Z score of an observation is defined as the number of standard deviations it falls above or below the mean. If the observation is one standard deviation above the mean, its Z score is 1. If it is 1.5 standard deviations below the mean, then its Z score is -1.5. If \(x\) is an observation from a distribution \(N(\mu, \sigma),\) we define the Z score mathematically as
\[ Z = \frac{x-\mu}{\sigma} \]
Using \(\mu_{SAT}=1500,\) \(\sigma_{SAT}=300,\) and \(x_{Ann}=1800,\) we find Ann’s Z score:
\[ Z_{Ann} = \frac{x_{Ann} - \mu_{SAT}}{\sigma_{SAT}} = \frac{1800-1500}{300} = 1 \]
The Z score.
The Z score of an observation is the number of standard deviations it falls above or below the mean. We compute the Z score for an observation \(x\) that follows a distribution with mean \(\mu\) and standard deviation \(\sigma\) using
\[ Z = \frac{x-\mu}{\sigma} \]
If the observation \(x\) comes from a normal distribution centered at \(\mu\) with standard deviation of \(\sigma\), then the Z score will be distributed according to a normal distribution with a center of 0 and a standard deviation of 1. That is, the normality remains when transforming from \(x\) to \(Z\), with a change in both the center and the spread.
Use Tom’s ACT score, 24, along with the ACT mean and standard deviation to compute his Z score.^{123}
Observations above the mean always have positive Z scores while those below the mean have negative Z scores. If an observation is equal to the mean (e.g., SAT score of 1500), then the Z score is \(0.\)
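The Z score formula translates directly into a one-line R function. The name `z_score` is ours; the inputs are the SAT values from Table 5.8.

```r
# Number of standard deviations that x falls above (+) or below (-) the mean.
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

z_score(1800, mu = 1500, sigma = 300)   # Ann's SAT score: 1
z_score(1500, mu = 1500, sigma = 300)   # a score equal to the mean: 0
```

Because the function is vectorized, passing a vector of observations for `x` returns one Z score per observation.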
Let \(X\) represent a random variable from \(N(\mu=3, \sigma=2),\) and suppose we observe \(x=5.19.\) (a) Find the Z score of \(x.\) (b) Use the Z score to determine how many standard deviations above or below the mean \(x\) falls.^{124}
Head lengths of brushtail possums follow a nearly normal distribution with mean 92.6 mm and standard deviation 3.6 mm. Compute the Z scores for possums with head lengths of 95.4 mm and 85.8 mm.^{125}
We can use Z scores to roughly identify which observations are more unusual than others. One observation \(x_1\) is said to be more unusual than another observation \(x_2\) if the absolute value of its Z score is larger than the absolute value of the other observation’s Z score: \(|Z_1| > |Z_2|.\) This technique is especially insightful when a distribution is symmetric.
Which of the two brushtail possum observations in the previous guided practice is more unusual?^{126}
Normal probability calculations
Ann from the SAT Guided Practice earned a score of 1800 on her SAT with a corresponding \(Z=1.\) She would like to know what percentile she falls in among all SAT test-takers.
Ann’s percentile is the percentage of people who earned a lower SAT score than Ann. We shade the area representing those individuals in Figure 5.24. The total area under the normal curve is always equal to 1, and the proportion of people who scored below Ann on the SAT is equal to the area shaded in Figure 5.24: 0.8413. In other words, Ann is in the \(84^{th}\) percentile of SAT takers.
We can use the normal model to find percentiles or probabilities. A normal probability table, which lists Z scores and corresponding percentiles, can be used to identify a percentile based on the Z score (and vice versa). Statistical software can also be used.
Normal probabilities are most commonly found using statistical software, which we will show here using R.
We use the software to identify the percentile corresponding to any particular Z score.
For instance, the percentile of \(Z=0.43\) is 0.6664, or the \(66.64^{th}\) percentile.
The pnorm() function is available in base R and will provide the percentile associated with any cutoff on a normal curve.
The normTail() function is available in the OpenIntro R package and will draw the associated curve if it is helpful.
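As a short sketch, the percentile for \(Z=0.43\) can be found as follows. The commented line shows how the curve might be drawn with normTail(); it assumes the openintro package is installed and is left commented out so the example runs in base R alone.

```r
# Lower-tail area (percentile) for Z = 0.43 on the standard normal curve
p_below <- pnorm(0.43, mean = 0, sd = 1)
round(p_below, 4)
#> [1] 0.6664

# Optional picture of the shaded lower tail (assumes openintro is installed):
# openintro::normTail(m = 0, s = 1, L = 0.43)
```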
We can also find the Z score associated with a percentile.
For example, to identify Z for the \(80^{th}\) percentile, we use qnorm(), which identifies the quantile for a given percentage.
The quantile represents the cutoff value.
(To remember the function qnorm() as providing a cutoff, notice that both qnorm() and “cutoff” start with the sound “kuh.”
To remember the pnorm() function as providing a probability from a given cutoff, notice that both pnorm() and “probability” start with the sound “puh.”) We determine the Z score for the \(80^{th}\) percentile using qnorm(): 0.84.
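A minimal sketch of this calculation, which also checks that pnorm() and qnorm() undo one another:

```r
# Z score (quantile) that cuts off the lowest 80% of the standard normal curve
z_80 <- qnorm(0.80, mean = 0, sd = 1)
round(z_80, 2)
#> [1] 0.84

# Round trip: feeding the cutoff back into pnorm() recovers the percentile
round(pnorm(z_80), 2)
#> [1] 0.8
```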
Determine the proportion of SAT test takers who scored better than Ann on the SAT.^{127}
Normal probability examples
Cumulative SAT scores are approximated well by a normal model, \(N(\mu=1500, \sigma=300).\)
Shannon is a randomly selected SAT taker, and nothing is known about Shannon’s SAT aptitude. What is the probability that Shannon scores at least 1630 on her SATs?
First, always draw and label a picture of the normal distribution. (Drawings need not be exact to be useful.) We are interested in the chance she scores above 1630, so we shade the upper tail. See the normal curve below.
The \(x\)-axis identifies the mean and the values at 2 standard deviations above and below the mean. The simplest way to find the shaded area under the curve makes use of the Z score of the cutoff value. With \(\mu=1500,\) \(\sigma=300,\) and the cutoff value \(x=1630,\) the Z score is computed as
\[ Z = \frac{x - \mu}{\sigma} = \frac{1630 - 1500}{300} = \frac{130}{300} = 0.43 \]
We use software to find the percentile of \(Z=0.43,\) which yields 0.6664. However, the percentile describes those who had a Z score lower than 0.43. To find the area above \(Z=0.43,\) we compute one minus the area of the lower tail, as seen below.
The probability Shannon scores at least 1630 on the SAT is 0.3336.
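The same steps can be sketched in R. Here we round the Z score to two decimals to match the hand calculation; working with the unrounded score gives a slightly different answer (roughly 0.332), which is a rounding effect rather than a different method.

```r
mu <- 1500     # mean of the SAT model
sigma <- 300   # standard deviation of the SAT model

# Z score for the cutoff, rounded as in the worked example
z <- round((1630 - mu) / sigma, 2)
z
#> [1] 0.43

# Upper-tail area: one minus the lower-tail area
upper_rounded <- 1 - pnorm(z)
round(upper_rounded, 4)
#> [1] 0.3336

# Same probability without rounding Z, working on the original scale
upper_unrounded <- 1 - pnorm(1630, mean = mu, sd = sigma)
```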
Always draw a picture first, and find the Z score second.
For any normal probability situation, always always always draw and label the normal curve and shade the area of interest first. The picture will provide an estimate of the probability.
After drawing a figure to represent the situation, identify the Z score for the observation of interest.
If the probability of Shannon scoring at least 1630 is 0.3336, then what is the probability she scores less than 1630? Draw the normal curve representing this exercise, shading the lower region instead of the upper one.^{128}
Edward earned a 1400 on his SAT. What is his percentile?
First, a picture is needed. Edward’s percentile is the proportion of people who do not get as high as a 1400. These are the scores to the left of 1400.
The mean \(\mu=1500,\) the standard deviation \(\sigma=300,\) and the cutoff for the tail area \(x=1400\) are used to compute the Z score:
\[ Z = \frac{x - \mu}{\sigma} = \frac{1400 - 1500}{300} = -0.33\]
Statistical software can be used to find the proportion of the \(N(0,1)\) curve to the left of \(-0.33\) which is 0.3707. Edward is at the \(37^{th}\) percentile.
Use the results of the previous example to compute the proportion of SAT takers who did better than Edward. Also draw a new picture.
If Edward did better than 37% of SAT takers, then about 63% must have done better than him.
Areas to the right.
Most statistical software, as well as normal probability tables in most books, give the area to the left. If you would like the area to the right, first find the area to the left and then subtract the amount from one.
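In R, the subtraction can be done by hand or by asking pnorm() for the upper tail through its `lower.tail` argument. A short sketch using Edward's Z score from the example above:

```r
z <- -0.33   # Edward's Z score, rounded as in the text

# Area to the left (his percentile)
left <- pnorm(z)

# Area to the right, found two equivalent ways
right_by_subtraction <- 1 - left
right_direct <- pnorm(z, lower.tail = FALSE)

round(c(left, right_by_subtraction, right_direct), 4)
#> [1] 0.3707 0.6293 0.6293
```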
Stuart earned an SAT score of 2100. Draw a picture for each part. (a) What is his percentile? (b) What percent of SAT takers did better than Stuart?^{129}
Based on a sample of 100 men,^{130} the heights of male adults between the ages 20 and 62 in the US are nearly normal with mean 70.0’’ and standard deviation 3.3’’.
Mike is 5’7’’ and Jim is 6’4’’. (a) What is Mike’s height percentile? (b) What is Jim’s height percentile? Also draw one picture for each part.^{131}
The last several problems have focused on finding the probability or percentile for a particular observation. What if you would like to know the observation corresponding to a particular percentile?
Erik’s height is at the \(40^{th}\) percentile. How tall is he?
As always, first draw the picture.
In this case, the lower tail probability is known (0.40), which can be shaded on the diagram. We want to find the observation that corresponds to the known probability of 0.4. As a first step in this direction, we determine the Z score associated with the \(40^{th}\) percentile.
Because the percentile is below 50%, we know \(Z\) will be negative. Statistical software provides the \(Z\) value to be \(-0.25.\)
Here, we show the format for calculating the value of \(Z\) using the R statistical software.
qnorm(0.4, mean = 0, sd = 1)
#> [1] -0.253
Knowing \(Z_{Erik}=-0.25\) and the population parameters \(\mu=70\) and \(\sigma=3.3\) inches, the Z score formula can be set up to determine Erik’s unknown height, labeled \(x_{Erik}\):
\[ -0.25 = Z_{Erik} = \frac{x_{Erik} - \mu}{\sigma} = \frac{x_{Erik} - 70}{3.3} \]
Solving for \(x_{Erik}\) yields the height 69.18 inches. That is, Erik is about 5’9’’ (this is notation for 5-feet, 9-inches).
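Solving the Z score formula for the observation gives \(x = \mu + Z \times \sigma,\) which can be sketched in R as below. Passing the mean and standard deviation directly to qnorm() skips the intermediate rounding of Z, so the two answers agree only approximately.

```r
mu <- 70      # mean adult male height (inches)
sigma <- 3.3  # standard deviation (inches)

# Step 1: Z score for the 40th percentile, rounded as in the text
z_erik <- round(qnorm(0.40), 2)
z_erik
#> [1] -0.25

# Step 2: rearrange Z = (x - mu) / sigma to solve for x
x_from_z <- mu + z_erik * sigma

# Alternative: let qnorm() work on the height scale directly
x_direct <- qnorm(0.40, mean = mu, sd = sigma)

round(c(x_from_z, x_direct), 1)
#> [1] 69.2 69.2
```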
What is the adult male height at the \(82^{nd}\) percentile?
Again, we draw the figure first.
We then calculate the Z score associated with the \(82^{nd}\) percentile, which will be positive because the percentile is greater than 50%:
qnorm(0.82, mean = 0, sd = 1)
#> [1] 0.915
Using qnorm(), the \(82^{nd}\) percentile corresponds to \(Z=0.92.\) Finally, the height \(x\) is found using the Z score formula with the known mean \(\mu,\) standard deviation \(\sigma,\) and Z score \(Z=0.92\):
\[ 0.92 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3} \]
This yields 73.04 inches or about 6’1’’ as the height at the \(82^{nd}\) percentile.
- What is the \(95^{th}\) percentile for SAT scores?
- What is the \(97.5^{th}\) percentile of the male heights? As always with normal probability problems, first draw a picture.^{132}
- What is the probability that a randomly selected male adult is at least 6’2’’ (74 inches)?
- What is the probability that a male adult is shorter than 5’9’’ (69 inches)?^{133}
What is the probability that a randomly selected adult male is between 5’9’’ and 6’2’’?
These heights correspond to 69 inches and 74 inches. First, draw the figure. The area of interest is no longer an upper or lower tail.
The total area under the curve is 1. If we find the area of the two tails that are not shaded (from the previous Guided Practice, these areas are \(0.3821\) and \(0.1131\)), then we can find the middle area:
\[ 1.0000 - 0.3821 - 0.1131 = 0.5048 \]
That is, the probability of being between 5’9’’ and 6’2’’ is 0.5048.
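A sketch of the same middle-area calculation in R. Because the code does not round the Z scores, its result (about 0.506) differs slightly from the table-based value of 0.5048.

```r
mu <- 70      # mean adult male height (inches)
sigma <- 3.3  # standard deviation (inches)

# The two unshaded tails
lower_tail <- pnorm(69, mean = mu, sd = sigma)                      # shorter than 5'9''
upper_tail <- pnorm(74, mean = mu, sd = sigma, lower.tail = FALSE)  # taller than 6'2''

# Middle area: total area minus both tails
middle <- 1 - lower_tail - upper_tail

# Equivalent shortcut: difference of two lower-tail areas
middle_direct <- pnorm(74, mean = mu, sd = sigma) - pnorm(69, mean = mu, sd = sigma)
```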
What percent of SAT takers get between 1500 and 2000?^{134}
What percent of adult males are between 5’5’’ and 5’7’’?^{135}
68-95-99.7 rule
Here, we present a useful general rule for the probability of falling within 1, 2, and 3 standard deviations of the mean in the normal distribution. The rule will be useful in a wide range of practical settings, especially when trying to make a quick estimate without a calculator or Z table.
Use pnorm() (or a Z table) to confirm that about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean in the normal distribution, respectively.
For instance, first find the area that falls between \(Z=-1\) and \(Z=1,\) which should have an area of about 0.68.
Similarly there should be an area of about 0.95 between \(Z=-2\) and \(Z=2.\)^{136}
It is possible for a normal random variable to fall 4, 5, or even more standard deviations from the mean. However, these occurrences are very rare if the data are nearly normal. The probability of falling more than 4 standard deviations above the mean is about 1-in-30,000 (and the same for falling that far below it). For 5 and 6 standard deviations, it is about 1-in-3.5 million and 1-in-1 billion, respectively.
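Both the 68-95-99.7 rule and the rarity of extreme values can be checked with a few lines of R. In this sketch, the tail calculation looks at one direction only (for example, far above the mean).

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)
#> [1] 0.6827 0.9545 0.9973

# One-directional tail area beyond 4, 5, and 6 standard deviations,
# expressed as "1-in-N" (roughly 32 thousand, 3.5 million, and 1 billion)
one_in <- 1 / pnorm(4:6, lower.tail = FALSE)
```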
SAT scores closely follow the normal model with mean \(\mu = 1500\) and standard deviation \(\sigma = 300.\)
(a) About what percent of test takers score 900 to 2100?
(b) What percent score between 1500 and 2100?^{137}
5.3.3 Hypothesis testing case studies
The approach for using the normal model in the context of inference is very similar to the practice of applying the model to individual observations that are nearly normal. We will replace null distributions we previously obtained using the randomization or simulation techniques and verify the results once again using the normal model. When the sample size is sufficiently large, the normal approximation generally provides us with the same conclusions as the simulation model.
5.3.3.1 Standard error
Point estimates vary from sample to sample, and we quantify this variability with what is called the standard error (SE). The standard error is equal to the standard deviation associated with the statistic. So, for example, the variability of a point estimate from one sample to the next is called the standard error of the point estimate. Almost always, the standard error is itself an estimate, calculated from the sample of data.
The way we determine the standard error varies from one situation to the next. However, typically it is determined using a formula based on the Central Limit Theorem.
Opportunity cost
Observed data
In Section 5.1.2 we were introduced to the opportunity cost study, which found that students became thriftier when they were reminded that not spending money now means the money can be spent on other things in the future. Let’s re-analyze the data in the context of the normal distribution and compare the results.
Variability of the statistic
Figure 5.26 summarizes the null distribution as determined using the randomization method. The best fitting normal distribution for the null distribution has a mean of 0. We can calculate the standard error of this distribution by borrowing a formula that we will become familiar with in Section 6.2, but for now let’s just take the value \(SE = 0.078\) as a given. Recall that the point estimate of the difference was 0.20, as shown in Figure 5.26. Next, we’ll use the normal distribution approach to compute the p-value.
Observed statistic vs. null statistics
As we learned in Section 5.3.2, it is helpful to draw and shade a picture of the normal distribution so we know precisely what we want to calculate. Here we want to find the area of the tail beyond 0.2, representing the p-value.
Next, we can calculate the Z score using the observed difference, 0.20, and the two model parameters. The standard error, \(SE = 0.078,\) is the equivalent of the model’s standard deviation.
\[Z = \frac{\text{observed difference} - 0}{SE} = \frac{0.20 - 0}{0.078} = 2.56\]
We can either use statistical software or look up \(Z = 2.56\) in the normal probability table to determine the right tail area: 0.0052, which is about the same as what we got for the right tail using the randomization approach (0.006). Using this area as the p-value, we see that the p-value is less than 0.05, so we conclude that the treatment did indeed impact students’ spending.
Z score in a hypothesis test.
In the context of a hypothesis test, the Z score for a point estimate is
\[Z = \frac{\text{point estimate} - \text{null value}}{SE}\]
The standard error in this case is the equivalent of the standard deviation of the point estimate, and the null value comes from the claim made in the null hypothesis.
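As an illustration, the Z score formula can be wrapped in a small R function and applied to the opportunity cost study. The function name `z_statistic` is our own, and the standard error of 0.078 is taken as given, as in the text.

```r
# Z score for a point estimate under the null hypothesis.
# `z_statistic` is a hypothetical helper, not a function from any package.
z_statistic <- function(point_estimate, null_value, se) {
  (point_estimate - null_value) / se
}

# Opportunity cost study: observed difference 0.20, null value 0, SE 0.078
z <- round(z_statistic(point_estimate = 0.20, null_value = 0, se = 0.078), 2)
z
#> [1] 2.56

# Right-tail area beyond the observed Z score
p_value <- pnorm(z, lower.tail = FALSE)
round(p_value, 4)
#> [1] 0.0052
```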
We have confirmed that the randomization approach we used earlier and the normal distribution approach provide almost identical p-values and conclusions in the opportunity cost case study. Next, let’s turn our attention to the medical consultant case study.
Medical consultant
Observed data
In Section 5.2.1 we learned about a medical consultant who reported that only 3 of her 62 clients who underwent a liver transplant had complications, which is less than the more common complication rate of 0.10. In that work, we did not model a null scenario, but we will discuss a simulation method for a one proportion null distribution in Section 6.1.1; such a distribution is provided in Figure 5.27. We have added the best-fitting normal curve to the figure, which has a mean of 0.10. Borrowing a formula that we’ll encounter in Chapter 6, the standard error of this distribution was also computed: \(SE = 0.038.\)
Variability of the statistic
Before we begin, we want to point out a simple detail that is easy to overlook: the null distribution we generated from the simulation is slightly skewed, and the histogram is not particularly smooth. In fact, the normal distribution only sort-of fits this model.
Observed statistic vs. null statistics
As always, we’ll draw a picture before finding the normal probabilities. Below is a normal distribution centered at 0.10 with a standard error of 0.038.
Next, we can calculate the Z score using the observed complication rate, \(\hat{p} = 0.048\) along with the mean and standard deviation of the normal model. Here again, we use the standard error for the standard deviation.
\[Z = \frac{\hat{p} - p_0}{SE_{\hat{p}}} = \frac{0.048 - 0.10}{0.038} = -1.37\]
Identifying \(Z = -1.37\) using statistical software or in the normal probability table, we can determine that the left tail area is 0.0853, which is the estimated p-value for the hypothesis test. There is a small problem: the p-value of 0.0853 is slightly different from the simulation p-value of 0.1222, which will be calculated in Section 6.1.1.
The discrepancy is explained by the normal model’s poor representation of the null distribution in Figure 5.27. As noted earlier, the null distribution from the simulations is not very smooth, and the distribution itself is slightly skewed. That’s the bad news. The good news is that we can foresee these problems using some simple checks. We’ll learn more about these checks in the following chapters.
In Section 5.3.1 we noted that the two common requirements to apply the Central Limit Theorem are (1) the observations in the sample must be independent, and (2) the sample must be sufficiently large. The guidelines for this particular situation – which we will learn in Section 6.1 – would have alerted us that the normal model was a poor approximation.
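For reference, the normal-model calculation for the medical consultant study can be reproduced with the short sketch below. It uses the values quoted above, and it only reproduces the approximation; it does not address whether the normal model is appropriate here.

```r
p_hat <- 0.048   # observed complication rate (3 of 62, rounded)
p_0 <- 0.10      # null value
se <- 0.038      # standard error, taken as given in the text

# Z score for the observed proportion under the null hypothesis
z <- round((p_hat - p_0) / se, 2)
z
#> [1] -1.37

# Left-tail area, since the observed rate is below the null value
p_value_normal <- pnorm(z)
round(p_value_normal, 4)
#> [1] 0.0853

# The simulation p-value quoted in the text is 0.1222; the gap reflects
# the poor fit of the normal curve to the skewed null distribution.
```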
Conditions for applying the normal model
The success story in this section was the application of the normal model in the context of the opportunity cost data. However, the biggest lesson comes from the less successful attempt to use the normal approximation in the medical consultant case study.
Statistical techniques are like a carpenter’s tools. When used responsibly, they can produce amazing and precise results. However, if the tools are applied irresponsibly or under inappropriate conditions, they will produce unreliable results. For this reason, with every statistical method that we introduce in future chapters, we will carefully outline conditions when the method can reasonably be used. These conditions should be checked in each application of the technique.
After covering the introductory topics in this course, advanced study may lead to working with complex models which, for example, bring together many variables with different variability structure. Working with data that come from normal populations makes higher-order models easier to estimate and interpret. There are times when simulation, randomization, or bootstrapping are unwieldy in either structure or computational demand. Normality can often lead to excellent approximations of the data using straightforward modeling techniques.
5.3.4 Confidence interval case study
A point estimate is our best guess for the value of the parameter, so it makes sense to build the confidence interval around that value. The standard error, which is a measure of the uncertainty associated with the point estimate, provides a guide for how large we should make the confidence interval. The 68-95-99.7 rule tells us that, in general, 95% of observations are within 2 standard errors of the mean. Here, we use the value 1.96 to be slightly more precise.
Constructing a 95% confidence interval.
When the sampling distribution of a point estimate can reasonably be modeled as normal, the point estimate we observe will be within 1.96 standard errors of the true value of interest about 95% of the time. Thus, a 95% confidence interval for such a point estimate can be constructed:
\[\text{point estimate} \pm 1.96 \times SE\]
We can be 95% confident this interval captures the true value.
Compute the area between -1.96 and 1.96 for a normal distribution with mean 0 and standard deviation 1.^{138}
The point estimate from the opportunity cost study was that 20% fewer students would buy a video if they were reminded that money not spent now could be spent later on something else. The point estimate from this study can reasonably be modeled with a normal distribution, and a proper standard error for this point estimate is \(SE = 0.078.\) Construct a 95% confidence interval.^{139}
Since the conditions for the normal approximation have already been verified, we can move forward with the construction of the 95% confidence interval:
\[\text{point estimate} \pm 1.96 \times SE = 0.20 \pm 1.96 \times 0.078 = (0.047, 0.353)\]
We are 95% confident that the video purchase rate resulting from the treatment is between 4.7 and 35.3 percentage points lower than in the control group. Since this confidence interval does not contain 0, it is consistent with our earlier result where we rejected the notion of “no difference” using a hypothesis test.
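The interval arithmetic can be sketched in R as follows. The first lines show where the multiplier 1.96 comes from; the rest applies the formula to the opportunity cost estimate with the standard error given in the text.

```r
# The multiplier: the Z score with 2.5% of the area in each tail
z_star <- qnorm(0.975)
round(z_star, 2)
#> [1] 1.96

# Area between -1.96 and 1.96 on the standard normal curve
round(pnorm(1.96) - pnorm(-1.96), 3)
#> [1] 0.95

# 95% confidence interval for the opportunity cost study
point_estimate <- 0.20
se <- 0.078
ci <- point_estimate + c(-1, 1) * 1.96 * se
round(ci, 3)
#> [1] 0.047 0.353
```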
Stents
Observed data
Consider an experiment that examined whether implanting a stent in the brain of a patient at risk for a stroke helps reduce the risk of a stroke. The results from the first 30 days of this study, which included 451 patients, are summarized in Table 5.9. These results are surprising! The point estimate suggests that patients who received stents may have a higher risk of stroke: \(p_{trmt} - p_{ctrl} = 0.090.\)
Group | stroke | no event | Total |
---|---|---|---|
treatment | 33 | 191 | 224 |
control | 13 | 214 | 227 |
Total | 46 | 405 | 451 |
Variability of the statistic
Consider the stent study and results. The conditions necessary to ensure the point estimate \(p_{trmt} - p_{ctrl} = 0.090\) is nearly normal have been verified for you, and the estimate’s standard error is \(SE = 0.028.\) Construct a 95% confidence interval for the change in 30-day stroke rates from usage of the stent.
The conditions for applying the normal model have already been verified, so we can proceed to the construction of the confidence interval:
\[\text{point estimate} \pm 1.96 \times SE = 0.090 \pm 1.96 \times 0.028 = (0.035, 0.145)\]
We are 95% confident that implanting a stent in a stroke patient’s brain increased the risk of stroke within 30 days by a rate of 0.035 to 0.145. This confidence interval can also be used in a way analogous to a hypothesis test: since the interval does not contain 0 (is completely above 0), it means the data provide statistically significant evidence that the stent used in the study increases the risk of stroke within 30 days.
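The point estimate and interval for the stent study can be recomputed from the counts in Table 5.9. This sketch takes the standard error of 0.028 as given rather than deriving it; the variable names are our own.

```r
# Counts from Table 5.9
strokes <- c(treatment = 33, control = 13)
totals <- c(treatment = 224, control = 227)

# Stroke rate in each group and their difference
rates <- strokes / totals
p_diff <- unname(rates["treatment"] - rates["control"])
round(p_diff, 3)
#> [1] 0.09

# 95% confidence interval, using the standard error stated in the text
se <- 0.028
ci <- round(p_diff, 3) + c(-1, 1) * 1.96 * se
round(ci, 3)
#> [1] 0.035 0.145

# The whole interval lies above 0
all(ci > 0)
#> [1] TRUE
```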
As with hypothesis tests, confidence intervals are imperfect. About 1-in-20 properly constructed 95% confidence intervals will fail to capture the parameter of interest, simply due to natural variability in the observed data. Figure 5.28 shows 25 confidence intervals for a proportion that were constructed from 25 different datasets that all came from the same population where the true proportion was \(p = 0.3.\) However, 1 of these 25 confidence intervals happened not to include the true value. The interval which does not capture \(p=0.3\) is not due to bad science. Instead, it is due to natural variability, and we should expect some of our intervals to miss the parameter of interest. Indeed, over a lifetime of creating 95% intervals, you should expect 5% of your reported intervals to miss the parameter of interest (unfortunately, you will not ever know which of your reported intervals captured the parameter and which missed the parameter).
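The idea that some intervals miss can be explored with a small simulation. This is only a sketch under stated assumptions: the sample size of 300 is hypothetical, and the standard error formula \(\sqrt{\hat{p}(1-\hat{p})/n}\) for a single proportion is borrowed from later chapters rather than from this section. The number of intervals that capture \(p = 0.3\) will vary with the random seed.

```r
set.seed(2021)       # arbitrary seed so the run is reproducible

p_true <- 0.3        # true population proportion
n <- 300             # hypothetical sample size for each dataset
n_intervals <- 25    # number of datasets, and hence intervals

# Sample proportion from each simulated dataset
p_hat <- rbinom(n_intervals, size = n, prob = p_true) / n

# Assumed standard error formula for a sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval from each dataset
lower <- p_hat - 1.96 * se
upper <- p_hat + 1.96 * se

# Which intervals capture the true proportion?
captured <- lower <= p_true & p_true <= upper
sum(captured)    # count of intervals containing p = 0.3
mean(captured)   # proportion of intervals containing p = 0.3
```

Increasing `n_intervals` to several thousand should bring the capture proportion close to 0.95.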
Interpreting confidence intervals
A careful eye might have observed the somewhat awkward language used to describe confidence intervals.
Correct confidence interval interpretation:
We are XX% confident that the population parameter is between lower and upper (where lower and upper are both numerical values).
Incorrect language might try to describe the confidence interval as capturing the population parameter with a certain probability.
This is one of the most common errors: while it might be useful to think of it as a probability, the confidence level only quantifies how plausible it is that the parameter is in the interval.
Another especially important consideration of confidence intervals is that they only try to capture the population parameter. Our intervals say nothing about the confidence of capturing individual observations, a proportion of the observations, or about capturing point estimates. Confidence intervals provide an interval estimate for and attempt to capture population parameters.
5.3.5 Mathematical model summary
We can summarize the normal model as follows:
- Frame the research question. The mathematical model can be applied to both the hypothesis testing and the confidence interval framework. Make sure that your research question is being addressed by the most appropriate inference procedure.
- Collect data with an observational study or experiment. To address the research question, collect data on the variables of interest. Note that your data may be a random sample from a population or may be part of a randomized experiment.
- Model the randomness of the statistic. In many cases, the normal distribution will be an excellent model for the randomness associated with the statistic of interest. The Central Limit Theorem tells us that if the sample size is large enough, sample averages (which can be calculated as either a proportion or a sample mean) will be approximately normally distributed when describing how the statistics change from sample to sample.
- Calculate the variability of the statistic. Using formulas, come up with the standard deviation (or more typically, an estimate of the standard deviation called the standard error) of the statistic. The SE of the statistic will give information on how far the observed statistic is from the null hypothesized value (if performing a hypothesis test) or from the unknown population parameter (if creating a confidence interval).
- Use the normal distribution to quantify the variability. The normal distribution will provide a probability that measures how likely it is for the observed statistic to differ from the hypothesized value (or from the unknown parameter) by the amount measured. The unusualness (or not) of the discrepancy will form the conclusion to the research question.
- Form a conclusion. Using the p-value or the confidence interval from the analysis, report on the research question of interest. Also, be sure to write the conclusion in plain language so casual readers can understand the results.
Table 5.10 is another look at the mathematical model approach to inference.
Question | Mathematical Model |
---|---|
What does it do? | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples. |
What is the random process described? | Randomized experiment or random sampling. |
What other random processes can be approximated? | Randomized experiment or random sampling. |
What is it best for? | Quick analyses through, for example, calculating a Z score. |
What physical object represents the simulation process? | Not applicable |
5.3.6 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
5.4 Chapter review
5.4.1 Summary
In this chapter, we have provided three different methods for statistical inference. We will continue to build on all three of the methods throughout the text, and by the end, you should have an understanding of the similarities and differences between them. Meanwhile, it is important to note that the methods are designed to mimic variability with data, and we know that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure 1.11). In Table 5.11, we have summarized some of the ways the inferential procedures feature specific sources of variability. We hope that you refer back to the table often as you dive more deeply into inferential ideas in future chapters.
Question | Randomization Test | Bootstrapping | Mathematical Model |
---|---|---|---|
What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment. | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population. | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples. |
What is the random process described? | Randomized experiment. | Random sampling from a population. | Randomized experiment or random sampling. |
What other random processes can be approximated? | Can also be used to describe random sampling in an observational model | Can also be used to describe random allocation in an experiment | Randomized experiment or random sampling. |
What is it best for? | Hypothesis Testing (can be used for Confidence Intervals, but not covered in this text). | Confidence Intervals (bootstrap HT for one proportion covered in Chapter 6). | Quick analyses through, for example, calculating a Z score. |
What physical object represents the simulation process? | Shuffling cards | Pulling marbles from a bag | Not applicable |
5.4.2 Terms
We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However, you should be able to easily spot them as bolded text.
95% confidence interval | hypothesis test | p-value | simulation |
95% confident | independent | parameter | standard error |
alternative hypothesis | normal curve | percentile | standard normal distribution |
bootstrap percentile confidence interval | normal distribution | permutation test | statistic |
bootstrap sample | normal model | point estimate | statistically significant |
bootstrapping | normal probability table | randomization test | success |
Central Limit Theorem | null distribution | sampling distribution | test statistic |
confidence interval | null hypothesis | sampling with replacement | Z score |
5.4.3 Chapter exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
5.4.4 Interactive R tutorials
Navigate the concepts you’ve learned in this chapter in R using the following self-paced tutorials. All you need is your browser to get started!
You can also access the full list of tutorials supporting this book here.