6 Inference for categorical responses
Focusing now on statistical inference for categorical data, we will revisit many of the foundational aspects of hypothesis testing from Chapter 5.
The three data structures we detail are one binary variable, summarized using a single proportion; two binary variables, summarized using a difference of two proportions; and two categorical variables, summarized using a two-way table. When appropriate, each of the data structures will be analyzed using the three methods from Chapter 5: randomization test, bootstrapping, and mathematical models.
As we build on the inferential ideas, we will visit new foundational concepts in statistical inference. For example, we will cover the conditions for when a normal model is appropriate; the two different error rates in hypothesis testing; and choosing the confidence level for a confidence interval.
6.1 One proportion
We encountered inference methods for a single proportion in Chapter 5, exploring point estimates, confidence intervals, and hypothesis tests. In this section, we'll review these topics and also discuss how to choose an appropriate sample size when collecting data for single proportion contexts.
Note that there is only one variable being measured in a study which focuses on one proportion. For each observational unit, the single variable is measured as either a success or failure (e.g., “surgical complication” vs. “no surgical complication”). Because the nature of the research question at hand focuses on only a single variable, there is not a way to randomize the variable across a different (explanatory) variable. For this reason, we will not use randomization as an analysis tool when focusing on a single proportion. Instead, we will apply bootstrapping techniques to test a given hypothesis, and we will also revisit the associated mathematical models.
6.1.1 Bootstrap test for \(H_0: p = p_0\)
The bootstrap simulation concept when \(H_0\) is true is similar to the ideas used in the case studies presented in Section 5.2 where we bootstrapped without an assumption about \(H_0.\) Because we will be testing a hypothesized value of \(p\) (referred to as \(p_0),\) the bootstrap simulation for hypothesis testing has a fantastic advantage that it can be used for any sample size (a huge benefit for small samples, a nice alternative for large samples).
We expand on the medical consultant example, see Section 5.2.1, where instead of finding an interval estimate for the true complication rate, we work to test a specific research claim.
Observed data
Recall the setup for the example:
People providing an organ for donation sometimes seek the help of a special “medical consultant.” These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients. One consultant tried to attract patients by noting the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).
Using the data, is it possible to assess the consultant’s claim that her complication rate is less than 10%?
No. The claim is that there is a causal connection, but the data are observational. Patients who hire this medical consultant may have lower complication rates for other reasons.
While it is not possible to assess this causal claim, it is still possible to test for an association using these data. For this question we ask, could the low complication rate of \(\hat{p} = 0.048\) be due to chance?
Write out hypotheses in both plain and statistical language to test for the association between the consultant’s work and the true complication rate, \(p,\) for the consultant’s clients.^{141}
Because, as it turns out, the conditions for working with the normal distribution are not met (see Section 6.1.2), the uncertainty associated with the sample proportion should not be modeled using the normal distribution. However, we would still like to assess the hypotheses from the previous Guided Practice in the absence of the normal framework. To do so, we need to evaluate the possibility of a sample value \((\hat{p})\) as far below the null value, \(p_0=0.10,\) as what was observed. The deviation of the sample value from the hypothesized parameter is usually quantified with a p-value.
The p-value is computed based on the null distribution, which is the distribution of the test statistic if the null hypothesis is true. Supposing the null hypothesis is true, we can compute the p-value by identifying the chance of observing a test statistic that favors the alternative hypothesis at least as strongly as the observed test statistic. Here we will use a bootstrap simulation to measure the p-value.
Variability of the statistic
We want to identify the sampling distribution of the test statistic \((\hat{p})\) if the null hypothesis was true. In other words, we want to see how the sample proportion changes due to chance alone. Then we plan to use this information to decide whether there is enough evidence to reject the null hypothesis.
Under the null hypothesis, 10% of liver donors have complications during or after surgery. Suppose this rate was really no different for the consultant’s clients (for all the consultant’s clients, not just the 62 previously measured). If this was the case, we could simulate 62 clients to get a sample proportion for the complication rate from the null distribution. Simulating observations using a hypothesized null parameter value is often called a parametric bootstrap simulation.
Similar to the process described in Section 5.2, each client can be simulated using a bag of marbles with 10% red marbles and 90% white marbles. Sampling a marble from the bag (with 10% red marbles) is one way of simulating whether a patient has a complication if the true complication rate is 10% for the data. If we select 62 marbles and then compute the proportion of patients with complications in the simulation, \(\hat{p}_{sim1},\) then the resulting sample proportion is exactly a sample from the null distribution.
An undergraduate student was paid $2 to complete this simulation. There were 5 simulated cases with a complication and 57 simulated cases without a complication, i.e., \(\hat{p}_{sim1} = 5/62 = 0.081.\)
Is this one simulation enough to determine whether or not we should reject the null hypothesis?
No. To assess the hypotheses, we need to see a distribution of many values of \(\hat{p}_{sim},\) not just a single draw from this sampling distribution.
Observed statistic vs. null statistics
One simulation isn’t enough to get a sense of the null distribution; many simulation studies are needed. Roughly 10,000 seems sufficient. However, paying someone to simulate 10,000 studies by hand is a waste of time and money. Instead, simulations are typically programmed into a computer, which is much more efficient.
Figure 6.1 shows the results of 10,000 simulated studies. The proportions that are equal to or less than \(\hat{p}=0.048\) are shaded. The shaded areas represent sample proportions under the null distribution that provide at least as much evidence as \(\hat{p}\) favoring the alternative hypothesis. There were 1222 simulated sample proportions with \(\hat{p}_{sim} \leq 0.048.\) We use these to construct the null distribution’s left-tail area and find the p-value: \[\begin{align} \text{left tail area }\label{estOfPValueBasedOnSimulatedNullForSingleProportion} &= \frac{\text{Number of observed simulations with }\hat{p}_{sim}\leq\text{ 0.048}}{10000} \end{align}\] Of the 10,000 simulated \(\hat{p}_{sim},\) 1222 were equal to or smaller than \(\hat{p}.\) Since the hypothesis test is one-sided, the estimated p-value is equal to this tail area: 0.1222.
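Although this text does not present code, the parametric bootstrap simulation described above is straightforward to carry out on a computer. Below is a minimal Python sketch (an illustration, not part of the original study); the seed and variable names are our own choices.

```python
import random

random.seed(2021)  # fix the seed so the simulation is reproducible

n = 62              # number of surgeries
p0 = 0.10           # null value of the complication rate
p_hat_obs = 3 / 62  # observed sample proportion, about 0.048
reps = 10_000       # number of simulated studies

# Parametric bootstrap: simulate `reps` samples of size n assuming H0: p = 0.10
p_hat_sims = []
for _ in range(reps):
    complications = sum(random.random() < p0 for _ in range(n))
    p_hat_sims.append(complications / n)

# One-sided p-value: proportion of simulated p-hats at or below the observed value
p_value = sum(sim <= p_hat_obs for sim in p_hat_sims) / reps
print(p_value)  # close to the text's estimate of 0.1222 (simulation error varies)
```

Each pass through the loop plays the role of one bag-of-marbles simulation; the estimated p-value will differ slightly from 0.1222 because a different stream of random draws is used.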
Because the estimated p-value is 0.1222, which is larger than the significance level 0.05, we do not reject the null hypothesis. Explain what this means in plain language in the context of the problem.^{142}
Does the conclusion in the previous Guided Practice imply there is no real association between the surgical consultant’s work and the risk of complications? Explain.^{143}
Null distribution of \(\hat{p}\) with bootstrap simulation
Regardless of the statistical method chosen, the p-value is always derived by analyzing the null distribution of the test statistic. The normal model poorly approximates the null distribution for \(\hat{p}\) when the success-failure condition is not satisfied. As a substitute, we can generate the null distribution using simulated sample proportions and use this distribution to compute the tail area, i.e., the p-value.
In the previous Guided Practice, the p-value is estimated. It is not exact because the simulated null distribution itself is not exact, only a close approximation. An exact p-value can be generated using the binomial distribution, but that method will not be covered in this text.
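For the curious reader: the exact binomial tail mentioned above is just a finite sum, \(P(X \leq 3)\) for \(X \sim \text{Binomial}(62, 0.10).\) A short Python sketch (ours, not the text's) computes it directly:

```python
from math import comb

n, p0, x_obs = 62, 0.10, 3  # sample size, null value, observed complication count

# Exact one-sided p-value: P(X <= 3) under X ~ Binomial(n = 62, p = 0.10)
p_value = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(x_obs + 1))
print(round(p_value, 3))  # about 0.121, close to the simulated estimate of 0.1222
```

The agreement between the exact value (about 0.121) and the bootstrap estimate (0.1222) illustrates why 10,000 simulations are considered sufficient.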
6.1.2 Mathematical model
Conditions
In Section 5.3.2, we introduced the normal distribution and showed how it can be used as a mathematical model to describe the variability of a statistic. There are conditions under which a sample proportion \(\hat{p}\) is well modeled using a normal distribution. When the sample observations are independent and the sample size is sufficiently large, the normal model will describe the variability quite well; when the observations violate the conditions, the normal model can be inaccurate.
Sampling distribution of \(\hat{p}\)
The sampling distribution for \(\hat{p}\) based on a sample of size \(n\) from a population with a true proportion \(p\) is nearly normal when:
- The sample’s observations are independent, e.g., are from a simple random sample.
- We expected to see at least 10 successes and 10 failures in the sample, i.e., \(np\geq10\) and \(n(1-p)\geq10.\) This is called the success-failure condition.
When these conditions are met, then the sampling distribution of \(\hat{p}\) is nearly normal with mean \(p\) and standard error \(SE = \sqrt{\frac{p(1-p)}{n}}.\)
Typically we don’t know the true proportion \(p,\) so we substitute some value to check conditions and estimate the standard error. For confidence intervals, the sample proportion \(\hat{p}\) is used to check the successfailure condition and compute the standard error. For hypothesis tests, typically the null value – that is, the proportion claimed in the null hypothesis – is used in place of \(p.\)
The independence condition is a more nuanced requirement. When it isn’t met, it is important to understand how and why it isn’t met. For example, no statistical methods are available to truly correct the inherent biases of data from a convenience sample. On the other hand, if we took a cluster sample (see Section 1.3.6), the observations wouldn’t be independent, but suitable statistical methods are available for analyzing the data (though they are beyond the scope of even most second or third courses in statistics).
In the examples based on large sample theory, we modeled \(\hat{p}\) using the normal distribution. Why is this not appropriate for the case study on the medical consultant?
The independence assumption may be reasonable if each of the surgeries is from a different surgical team. However, the successfailure condition is not satisfied. Under the null hypothesis, we would anticipate seeing \(62\times 0.10=6.2\) complications, not the 10 required for the normal approximation.
While this book is scoped to wellconstrained statistical problems, do remember that this is just the first book in what is a large library of statistical methods that are suitable for a very wide range of data and contexts.
Confidence interval for \(p\)
A confidence interval provides a range of plausible values for the parameter \(p,\) and when \(\hat{p}\) can be modeled using a normal distribution, the confidence interval for \(p\) takes the form \[ \hat{p} \pm z^{\star} \times SE.\] We have seen \(\hat{p}\) to be the sample proportion. The value \(z^{\star}\) determines the confidence level (previously set to be 1.96) and will be discussed in detail in the examples following. The value of the standard error, \(SE,\) depends heavily on the sample size.
Standard Error of one proportion, \(\hat{p}\)
When the conditions are met so that the distribution of \(\hat{p}\) is nearly normal, the variability of a single proportion, \(\hat{p},\) is well described by:
\[SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}\]
Note that we almost never know the true value of \(p.\) A more helpful formula to use is:
\[SE(\hat{p}) \approx \sqrt{\frac{(\mbox{best guess of }p)(1 - \mbox{best guess of }p)}{n}}\]
For hypothesis testing, we often use \(p_0\) as the best guess of \(p.\) For confidence intervals, we typically use \(\hat{p}\) as the best guess of \(p.\)
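The standard error formula above can be wrapped in a tiny helper function. The Python sketch below is our own illustration; the function name `se_prop` is an assumption, not notation from the text.

```python
from math import sqrt

def se_prop(p_guess: float, n: int) -> float:
    """Standard error of a sample proportion, using a best guess of p."""
    return sqrt(p_guess * (1 - p_guess) / n)

# Hypothesis test: plug in the null value p0 (medical consultant example)
print(round(se_prop(0.10, 62), 4))   # about 0.0381

# Confidence interval: plug in the sample proportion p-hat (payday loan example)
print(round(se_prop(0.70, 826), 3))  # 0.016, matching the example below
```

Note how the same formula serves both purposes; only the "best guess" plugged in for \(p\) changes.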
Consider taking many polls of registered voters (i.e., random samples) of size 300 asking them if they support legalized marijuana. It is suspected that about 2/3 of all voters support legalized marijuana. To understand how the sample proportion \((\hat{p})\) would vary across the samples, calculate the standard error of \(\hat{p}.\)^{144}
Variability of the statistic
A simple random sample of 826 payday loan borrowers was surveyed to better understand their interests around regulation and costs. 70% of the responses supported new regulations on payday lenders.
Is it reasonable to model the variability of \(\hat{p}\) from sample to sample using a normal distribution?
Estimate the standard error of \(\hat{p}.\)
Construct a 95% confidence interval for \(p,\) the proportion of payday borrowers who support increased regulation for payday lenders.
- The data are a random sample, so the observations are independent and representative of the population of interest.

We also must check the success-failure condition, which we do using \(\hat{p}\) in place of \(p\) when computing a confidence interval: \[\begin{align*} \text{Support: } n p & \approx 826 \times 0.70 = 578\\ \text{Not: } n (1 - p) & \approx 826 \times (1 - 0.70) = 248 \end{align*}\] Since both values are at least 10, we can use the normal distribution to model \(\hat{p}.\)
- Because \(p\) is unknown and the standard error is for a confidence interval, use \(\hat{p}\) in place of \(p\) in the formula.

\(SE = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{0.70 (1 - 0.70)}{826}} = 0.016.\)
- Using the point estimate 0.70, \(z^{\star} = 1.96\) for a 95% confidence interval, and the standard error \(SE = 0.016\) from the previous Guided Practice, the confidence interval is \[ \text{point estimate} \ \pm\ z^{\star} \times SE \to 0.70 \ \pm\ 1.96 \times 0.016 \to (0.669, 0.731)\] We are 95% confident that the true proportion of payday borrowers who supported regulation at the time of the poll was between 0.669 and 0.731.
Constructing a confidence interval for a single proportion
There are three steps to constructing a confidence interval for \(p.\)
1. Check independence and the success-failure condition using \(\hat{p}.\) If the conditions are met, the sampling distribution of \(\hat{p}\) may be well-approximated by the normal model.
2. Construct the standard error using \(\hat{p}\) in place of \(p\) in the standard error formula.
3. Apply the general confidence interval formula.
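The three steps above can be sketched as a short Python function (our own illustration; `prop_ci` and its error message are assumptions, not from the text):

```python
from math import sqrt

def prop_ci(p_hat: float, n: int, z_star: float = 1.96):
    """Normal-model confidence interval for a single proportion p."""
    # Step 1: check the success-failure condition using p-hat
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("success-failure condition not met; consider simulation")
    # Step 2: standard error, using p-hat in place of p
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Step 3: point estimate +/- z* x SE
    return p_hat - z_star * se, p_hat + z_star * se

lo, hi = prop_ci(0.70, 826)
print(round(lo, 3), round(hi, 3))  # 0.669 0.731, as in the payday loan example
```

(The independence condition is about how the data were collected, so it must still be checked by thinking about the study design rather than by code.)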
For additional oneproportion confidence interval examples, see Section 5.2.3.
Changing the confidence level
Suppose we want to consider confidence intervals where the confidence level is somewhat higher than 95%: perhaps we would like a confidence level of 99%. Think back to the analogy about trying to catch a fish: if we want to be more sure that we will catch the fish, we should use a wider net. To create a 99% confidence level, we must also widen our 95% interval. On the other hand, if we want an interval with lower confidence, such as 90%, we could make our original 95% interval slightly slimmer.
The 95% confidence interval structure provides guidance in how to make intervals with new confidence levels. Below is a general 95% confidence interval for a point estimate that comes from a nearly normal distribution: \[\begin{eqnarray} \text{point estimate}\ \pm\ 1.96\times SE \end{eqnarray}\] There are three components to this interval: the point estimate, “1.96,” and the standard error. The choice of \(1.96\times SE\) was based on capturing 95% of the data since the estimate is within 1.96 standard errors of the true value about 95% of the time. The choice of 1.96 corresponds to a 95% confidence level.
If \(X\) is a normally distributed random variable, how often will \(X\) be within 2.58 standard deviations of the mean?^{145}
To create a 99% confidence interval, change 1.96 in the 95% confidence interval formula to be \(2.58.\) The previous Guided Practice highlights that 99% of the time a normal random variable will be within 2.58 standard deviations of its mean. This approach – using the Z scores in the normal model to compute confidence levels – is appropriate when the point estimate is associated with a normal distribution and we can properly compute the standard error. Thus, the formula for a 99% confidence interval is:
\[\begin{eqnarray*} \text{point estimate}\ \pm\ 2.58\times SE \end{eqnarray*}\]The normal approximation is crucial to the precision of the \(z^\star\) confidence intervals (in contrast to the bootstrap confidence intervals). When the normal model is not a good fit, we will use alternative distributions that better characterize the sampling distribution or we will use bootstrapping procedures.
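In general, \(z^{\star}\) can be found for any confidence level from the normal model's inverse CDF. A minimal Python sketch (our own; the helper name `z_star` is an assumption) recovers the familiar multipliers:

```python
from statistics import NormalDist

def z_star(confidence: float) -> float:
    """z* such that the central `confidence` area lies between -z* and z*."""
    tail = (1 - confidence) / 2          # split the leftover area between two tails
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 2))
# 0.90 -> 1.64, 0.95 -> 1.96, 0.99 -> 2.58
```

This matches the values used in the text: 1.96 for 95% confidence and 2.58 for 99% confidence.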
Create a 99% confidence interval for the impact of the stent on the risk of stroke using the data from Section 1.1. The point estimate is 0.090, and the standard error is \(SE = 0.028.\) It has been verified for you that the point estimate can reasonably be modeled by a normal distribution.^{146}
Mathematical model confidence interval for any confidence level.
If the point estimate follows the normal model with standard error \(SE,\) then a confidence interval for the population parameter is \[\begin{eqnarray*} \text{point estimate}\ \pm\ z^{\star} \times SE \end{eqnarray*}\] where \(z^{\star}\) corresponds to the confidence level selected.
Figure 6.2 provides a picture of how to identify \(z^{\star}\) based on a confidence level. We select \(z^{\star}\) so that the area between \(-z^{\star}\) and \(z^{\star}\) in the normal model corresponds to the confidence level.
Previously, we found that implanting a stent in the brain of a patient at risk for a stroke increased the risk of a stroke. The study estimated a 9% increase in the number of patients who had a stroke, and the standard error of this estimate was about \(SE = 2.8\%.\) Compute a 90% confidence interval for the effect.^{147}
Hypothesis test for \(H_0: p = p_0\)
The test statistic for assessing a single proportion is a Z score.
The Z score is a ratio of how the sample proportion differs from the hypothesized proportion as compared to the expected variability of the \(\hat{p}\) values.
\[\begin{align*} Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \end{align*}\]When the null hypothesis is true and the conditions are met, \(Z\) has a standard normal distribution.
Conditions:
- independently observed data
- large samples \((n p_0 \geq 10\) and \(n (1-p_0) \geq 10)\)
One possible regulation for payday lenders is that they would be required to do a credit check and evaluate debt payments against the borrower’s finances. We would like to know: would borrowers support this form of regulation?
Set up hypotheses to evaluate whether borrowers have a majority support for this type of regulation.^{148}
To apply the normal distribution framework in the context of a hypothesis test for a proportion, the independence and success-failure conditions must be satisfied. In a hypothesis test, the success-failure condition is checked using the null proportion: we verify \(np_0\) and \(n(1-p_0)\) are at least 10, where \(p_0\) is the null value.
Do payday loan borrowers support a regulation that would require lenders to pull their credit report and evaluate their debt payments? From a random sample of 826 borrowers, 51% said they would support such a regulation. Is it reasonable to use a normal distribution to model \(\hat{p}\) for a hypothesis test here?^{149}
Using the hypotheses and data from the previous Guided Practices, evaluate whether the poll on lending regulations provides convincing evidence that a majority of payday loan borrowers support a new regulation that would require lenders to pull credit reports and evaluate debt payments.
With hypotheses already set up and conditions checked, we can move onto calculations. The standard error in the context of a one-proportion hypothesis test is computed using the null value, \(p_0:\) \[\begin{align*} SE = \sqrt{\frac{p_0 (1 - p_0)}{n}} = \sqrt{\frac{0.5 (1 - 0.5)}{826}} = 0.017 \end{align*}\] A picture of the normal model is shown with the p-value represented by the shaded region.
Based on the normal model, the test statistic can be computed as the Z score of the point estimate: \[\begin{align*} Z = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{0.51 - 0.50}{0.017} = 0.59 \end{align*}\] The single tail area which represents the p-value is 0.2776. Because the p-value is larger than 0.05, we do not reject \(H_0.\) The poll does not provide convincing evidence that a majority of payday loan borrowers support regulations around credit checks and evaluation of debt payments.
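The full test can be carried out in a few lines of Python (a sketch of our own, not from the text). Because we keep the unrounded standard error here, the computed Z and p-value differ slightly from the hand-rounded 0.59 and 0.2776 in the worked example:

```python
from math import sqrt
from statistics import NormalDist

n, p0, p_hat = 826, 0.50, 0.51  # sample size, null value, observed proportion

se = sqrt(p0 * (1 - p0) / n)        # SE uses the null value p0, not p-hat
z = (p_hat - p0) / se               # test statistic
p_value = 1 - NormalDist().cdf(z)   # one-sided (upper-tail) p-value

print(round(z, 2), round(p_value, 2))  # roughly 0.57 and 0.28
```

Either way, the p-value is far above 0.05, so the conclusion is unchanged: we do not reject \(H_0.\)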
In Section 6.2.1 we discuss two-sided hypothesis tests, a framing that may have been better suited to the payday example.
That is, we might have wanted to ask whether borrowers support or oppose the regulations (to study opinion in either direction away from the 50% benchmark).
In that case, the p-value would have been doubled to 0.5552 (again, we would not reject \(H_0).\) In the two-sided hypothesis setting, the appropriate conclusion would be to claim that the poll does not provide convincing evidence that a majority of payday loan borrowers support or oppose regulations around credit checks and evaluation of debt payments.
In both the one-sided and two-sided settings, the conclusion is somewhat unsatisfactory because there is no resolution one way or the other about public opinion. We cannot claim that exactly 50% of people support the regulation, but we cannot claim a majority in either direction.
Mathematical model hypothesis test for a proportion.
Set up hypotheses and verify the conditions using the null value, \(p_0,\) to ensure \(\hat{p}\) is nearly normal under \(H_0.\) If the conditions hold, construct the standard error, again using \(p_0,\) and show the p-value in a drawing. Lastly, compute the p-value and evaluate the hypotheses.
For additional oneproportion hypothesis test examples, see Section 5.1.3.
Violating conditions
We’ve spent a lot of time discussing conditions for when \(\hat{p}\) can be reasonably modeled by a normal distribution. What happens when the success-failure condition fails? What about when the independence condition fails? In either case, the general ideas of confidence intervals and hypothesis tests remain the same, but the strategy or technique used to generate the interval or p-value changes.
When the success-failure condition isn’t met for a hypothesis test, we can simulate the null distribution of \(\hat{p}\) using the null value, \(p_0,\) as seen in Section 6.1.1. Unfortunately, methods for dealing with observations which are not independent are outside the scope of this book.
6.1.3 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
6.2 Difference of two proportions
We now extend the methods from Section 6.1 to apply confidence intervals and hypothesis tests to differences in population proportions that come from two groups, Group 1 and Group 2: \(p_1 - p_2.\)
In our investigations, we’ll identify a reasonable point estimate of \(p_1 - p_2\) based on the sample, and you may have already guessed its form: \(\hat{p}_1 - \hat{p}_2.\) Then we’ll look at the inferential analysis in three different ways: using a randomization test, applying bootstrapping for interval estimates, and, if we verify that the point estimate can be modeled using a normal distribution, we compute the estimate’s standard error, and we apply the mathematical framework.
6.2.1 Randomization test for \(H_0: p_1 - p_2 = 0\)
Observed data
We consider a study on a new malaria vaccine called PfSPZ. In this study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed to a drug-sensitive malaria virus strain; the motivation of using a drug-sensitive strain of the virus here is for ethical considerations, allowing any infections to be treated effectively. The results are summarized in Table 6.1, where 9 of the 14 treatment patients remained free of signs of infection while all 6 of the patients in the control group showed some baseline signs of infection.
| treatment | infection | no infection | Total |
|-----------|-----------|--------------|-------|
| vaccine   | 5         | 9            | 14    |
| placebo   | 6         | 0            | 6     |
| Total     | 11        | 9            | 20    |
Is this an observational study or an experiment? What implications does the study type have on what can be inferred from the results?^{150}
In this study, a smaller proportion of patients who received the vaccine showed signs of an infection (35.7% versus 100%). However, the sample is very small, and it is unclear whether the difference provides convincing evidence that the vaccine is effective.
As we saw in Section 5.1, we can randomize the responses (infection or no infection) to the treatment conditions under the null hypothesis of independence and compute possible differences in proportions.
The process by which we randomize observations to two groups is summarized and visualized in Figure 5.8.
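The randomization procedure is easy to program. Below is a minimal Python sketch of our own (seed and variable names are assumptions): shuffle the 20 responses, deal 14 to the vaccine group and 6 to the placebo group, and record the difference in infection rates (control rate minus treatment rate).

```python
import random

random.seed(2021)  # fix the seed so the simulation is reproducible

# Pool of 20 responses; under H0, infection status is independent of treatment
responses = ["infection"] * 11 + ["no infection"] * 9

# Observed difference: placebo infection rate minus vaccine infection rate
obs_diff = 6 / 6 - 5 / 14  # about 0.643

reps = 10_000
count = 0
for _ in range(reps):
    random.shuffle(responses)                 # randomly re-assign responses
    vaccine, placebo = responses[:14], responses[14:]
    diff = placebo.count("infection") / 6 - vaccine.count("infection") / 14
    if diff >= obs_diff:
        count += 1

p_value = count / reps
print(p_value)  # rare: only a percent or two of shuffles reach a gap this large
```

With only 100 shuffles, as in the text's hand simulation, the estimate is coarse; with 10,000 shuffles the rarity of a 64.3% difference under the independence model becomes clear.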
Variability of the statistic
Figure 2.24 shows a stacked plot of the differences found from 100 randomization simulations (i.e., repeated iterations as described in Figure 5.8), where each dot represents a simulated difference between the infection rates (control rate minus treatment rate).
Observed statistic vs null statistics
Note that the distribution of these simulated differences is centered around 0. We simulated the differences assuming that the independence model was true, and under this condition, we expect the difference to be near zero with some random fluctuation, where near is pretty generous in this case since the sample sizes are so small in this study.
How often would you observe a difference of at least 64.3% (0.643) according to Figure 2.24? Often, sometimes, rarely, or never?
It appears that a difference of at least 64.3% due to chance alone would only happen about 2% of the time according to Figure 2.24. Such a low probability indicates a rare event.
The difference of 64.3% being a rare event suggests two possible interpretations of the results of the study:
- \(H_0\) Independence model. The vaccine has no effect on infection rate, and we just happened to observe a difference that would only occur on a rare occasion.
- \(H_A\) Alternative model. The vaccine has an effect on infection rate, and the difference we observed was actually due to the vaccine being effective at combating malaria, which explains the large difference of 64.3%.
Based on the simulations, we have two options:
1. We conclude that the study results do not provide strong evidence against the independence model. That is, we do not have sufficiently strong evidence to conclude the vaccine had an effect in this clinical setting.
2. We conclude the evidence is sufficiently strong to reject \(H_0\) and assert that the vaccine was useful. When we conduct formal studies, usually we reject the notion that we just happened to observe a rare event.^{151} In this case, we reject the independence model in favor of the alternative. That is, we are concluding the data provide strong evidence that the vaccine provides some protection against malaria in this clinical setting.
Statistical inference is built on evaluating whether such differences are due to chance. In statistical inference, data scientists evaluate which model is most reasonable given the data. Errors do occur, just like rare events, and we might choose the wrong model. While we do not always choose correctly, statistical inference gives us tools to control and evaluate how often these errors occur.
6.2.2 Decision errors
Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our framework allows us to quantify and control how often the data lead us to the incorrect conclusion.
In a hypothesis test, there are two competing hypotheses: the null and the alternative. We make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 6.2.
| Truth        | Reject \(H_0\) | Fail to reject \(H_0\) |
|--------------|----------------|------------------------|
| \(H_0\) true | Type 1 Error   | good decision          |
| \(H_A\) true | good decision  | Type 2 Error           |
A Type 1 Error is rejecting the null hypothesis when \(H_0\) is actually true. Since we rejected the null hypothesis in the gender discrimination and opportunity cost studies, it is possible that we made a Type 1 Error in one or both of those studies. A Type 2 Error is failing to reject the null hypothesis when the alternative is actually true.
In a US court, the defendant is either innocent \((H_0)\) or guilty \((H_A).\) What does a Type 1 Error represent in this context? What does a Type 2 Error represent? Table 6.2 may be useful.
If the court makes a Type 1 Error, this means the defendant is innocent \((H_0\) true) but wrongly convicted. A Type 2 Error means the court failed to reject \(H_0\) (i.e., failed to convict the person) when they were in fact guilty \((H_A\) true).
Consider the opportunity cost study where we concluded students were less likely to make a DVD purchase if they were reminded that money not spent now could be spent later. What would a Type 1 Error represent in this context?^{152}
How could we reduce the Type 1 Error rate in US courts? What influence would this have on the Type 2 Error rate?
To lower the Type 1 Error rate, we might raise our standard for conviction from “beyond a reasonable doubt” to “beyond a conceivable doubt” so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.
How could we reduce the Type 2 Error rate in US courts? What influence would this have on the Type 1 Error rate?^{153}
The example and guided practice above provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.
6.2.2.1 Significance level
The significance level provides the cutoff for the p-value which will lead to a decision of “reject the null hypothesis.” Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05. However, it is sometimes helpful to adjust the significance level based on the application. We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.
If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g., 0.01 or 0.001). If we want to be very cautious about rejecting the null hypothesis, we demand very strong evidence favoring the alternative \(H_A\) before we would reject \(H_0.\)
If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g., 0.10). Here we want to be cautious about failing to reject \(H_0\) when the null is actually false.
Significance levels should reflect consequences of errors.
The significance level selected for a test should reflect the real-world consequences associated with making a Type 1 or Type 2 Error.
Two-sided hypotheses
In Section 5.1 we explored whether women were discriminated against and whether a simple trick could make students a little thriftier. In these two case studies, we’ve actually ignored some possibilities:
- What if men are actually discriminated against?
- What if the money trick actually makes students spend more?
These possibilities weren’t considered in our original hypotheses or analyses. The disregard of the extra alternatives may have seemed natural since the data pointed in the directions in which we framed the problems. However, there are two dangers if we ignore possibilities that disagree with our data or that conflict with our world view:
Framing an alternative hypothesis simply to match the direction that the data point will generally inflate the Type 1 Error rate. After all the work we’ve done (and will continue to do) to rigorously control the error rates in hypothesis tests, careless construction of the alternative hypotheses can disrupt that hard work.
If we only use alternative hypotheses that agree with our worldview, then we’re going to be subjecting ourselves to confirmation bias, which means we are looking for data that supports our ideas. That’s not very scientific, and we can do better!
The original hypotheses we’ve seen are called one-sided hypothesis tests because they only explored one direction of possibilities. Such hypotheses are appropriate when we are exclusively interested in the single direction, but usually we want to consider all possibilities. To do so, let’s learn about two-sided hypothesis tests in the context of a new study that examines the impact of using blood thinners on patients who have undergone CPR.
Cardiopulmonary resuscitation (CPR) is a procedure used on individuals suffering a heart attack when other emergency resources are unavailable. This procedure is helpful in providing some blood circulation to keep a person alive, but CPR chest compressions can also cause internal injuries. Internal bleeding and other injuries that can result from CPR complicate additional treatment efforts. For instance, blood thinners may be used to help release a clot that is causing the heart attack once a patient arrives in the hospital. However, blood thinners negatively affect internal injuries.
Here we consider an experiment with patients who underwent CPR for a heart attack and were subsequently admitted to a hospital.^{154} Each patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group). The outcome variable of interest was whether the patient survived for at least 24 hours.
Form hypotheses for this study in plain and statistical language. Let \(p_C\) represent the true survival rate of people who do not receive a blood thinner (corresponding to the control group) and \(p_T\) represent the survival rate for people receiving a blood thinner (corresponding to the treatment group).
We want to understand whether blood thinners are helpful or harmful. We’ll consider both of these possibilities using a two-sided hypothesis test.
\(H_0:\) Blood thinners do not have an overall survival effect, i.e., the survival proportions are the same in each group. \(p_T - p_C = 0.\)
\(H_A:\) Blood thinners have an impact on survival, either positive or negative, but not zero. \(p_T - p_C \neq 0.\)
Note that if we had done a one-sided hypothesis test, the resulting hypotheses would have been:
\(H_0:\) Blood thinners do not have a positive overall survival effect, i.e., the survival proportion for the blood thinner group is the same as or lower than that of the control group. \(p_T - p_C \leq 0.\)
\(H_A:\) Blood thinners have a positive impact on survival. \(p_T - p_C > 0.\)
There were 50 patients in the experiment who did not receive a blood thinner and 40 patients who did. The study results are shown in Table 6.3.
|           | Survived | Died | Total |
|-----------|----------|------|-------|
| Control   | 11       | 39   | 50    |
| Treatment | 14       | 26   | 40    |
| Total     | 25       | 65   | 90    |
What is the observed survival rate in the control group? And in the treatment group? Also, provide a point estimate \((\hat{p}_T - \hat{p}_C)\) for the true difference in population survival proportions across the two groups: \(p_T - p_C.\)^{155}
According to the point estimate, for patients who have undergone CPR outside of the hospital, an additional 13% of these patients survive when they are treated with blood thinners. However, we wonder if this difference could be easily explainable by chance.
As we did in our past two studies this chapter, we will simulate what type of differences we might see from chance alone under the null hypothesis. By randomly assigning each of the patients’ files to a “simulated treatment” or “simulated control” allocation, we get a new grouping. If we repeat this simulation 10,000 times, we can build a null distribution of the differences shown in Figure 6.3.
The right tail area is 0.131. (Note: it is only a coincidence that we also have \(\hat{p}_T - \hat{p}_C=0.13.)\) However, contrary to how we calculated the p-value in previous studies, the p-value of this test is not 0.131!
The p-value is defined as the chance we observe a result at least as favorable to the alternative hypothesis as the result (i.e., the difference) we observe. In this case, any differences less than or equal to -0.13 would also provide equally strong evidence favoring the alternative hypothesis as a difference of +0.13 did. A difference of -0.13 would correspond to a 13% higher survival rate in the control group than the treatment group. In Figure 6.4 we’ve also shaded these differences in the left tail of the distribution. These two shaded tails provide a visual representation of the p-value for a two-sided test.
For a two-sided test, take the single tail (in this case, 0.131) and double it to get the p-value: 0.262. Since this p-value is larger than 0.05, we do not reject the null hypothesis. That is, we do not find statistically significant evidence that the blood thinner has any influence on survival of patients who undergo CPR prior to arriving at the hospital.
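The simulation just described can be sketched in a few lines. Below is a hypothetical Python version (the book’s own computations are done in R): it shuffles the 90 patient outcomes into simulated treatment and control groups, then doubles the right tail area to obtain the two-sided p-value.

```python
import numpy as np

rng = np.random.default_rng(1)

# 90 patients in total: 25 survived (1), 65 died (0); 40 treatment, 50 control.
outcomes = np.array([1] * 25 + [0] * 65)
obs_diff = 14 / 40 - 11 / 50          # observed p_hat_T - p_hat_C = 0.13

diffs = np.empty(10_000)
for i in range(10_000):
    shuffled = rng.permutation(outcomes)
    # First 40 shuffled files play the role of the "simulated treatment" group.
    diffs[i] = shuffled[:40].mean() - shuffled[40:].mean()

one_tail = np.mean(diffs >= obs_diff)  # right tail area, near 0.131
p_value = min(1.0, 2 * one_tail)       # double it for the two-sided p-value
print(round(p_value, 3))
```

With a different seed the p-value will wobble slightly around the text’s 0.262, as expected from simulation error.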
Default to a two-sided test.
We want to be rigorous and keep an open mind when we analyze data and evidence. Use a one-sided hypothesis test only if you truly have interest in only one direction.
Computing a p-value for a two-sided test.
First compute the p-value for one tail of the distribution, then double that value to get the two-sided p-value. That’s it!
Consider the situation of the medical consultant. Now that you know about one-sided and two-sided tests, which type of test do you think is more appropriate?
The setting has been framed in the context of the consultant being helpful (which is what led us to a one-sided test originally), but what if the consultant actually performed worse than the average? Would we care? More than ever! Since it turns out that we care about a finding in either direction, we should run a two-sided test. The p-value for the two-sided test is double that of the one-sided test; here the simulated p-value would be 0.2444.
Generally, to find a two-sided p-value we double the single tail area, which remains a reasonable approach even when the sampling distribution is asymmetric. However, the approach can result in p-values larger than 1 when the point estimate is very near the mean in the null distribution; in such cases, we write that the p-value is 1. Also, very large p-values computed in this way (e.g., 0.85) may be slightly inflated. Typically, we do not worry too much about the precision of very large p-values because they lead to the same analysis conclusion, even if the value is slightly off.
Controlling the Type 1 Error rate
Now that we understand the difference between one-sided and two-sided tests, we must recognize when to use each type of test. Because it inflates the Type 1 Error rate, it is never okay to change two-sided tests to one-sided tests after observing the data. We explore the consequences of ignoring this advice in the next example.
Using \(\alpha=0.05,\) we show that freely switching from two-sided tests to one-sided tests will lead us to make twice as many Type 1 Errors as intended.
Suppose we are interested in finding any difference from 0. We’ve created a smooth-looking null distribution representing differences due to chance in Figure 6.5.
Suppose the sample difference was larger than 0. Then if we can flip to a one-sided test, we would use \(H_A:\) difference \(> 0.\) Now if we obtain any observation in the upper 5% of the distribution, we would reject \(H_0\) since the p-value would just be the single tail area. Thus, if the null hypothesis is true, we incorrectly reject the null hypothesis about 5% of the time when the sample mean is above the null value, as shown in Figure 6.5.
Suppose the sample difference was smaller than 0. Then if we change to a one-sided test, we would use \(H_A:\) difference \(< 0.\) If the observed difference falls in the lower 5% of the figure, we would reject \(H_0.\) That is, if the null hypothesis is true, then we would observe this situation about 5% of the time.
By examining these two scenarios, we can determine that we will make a Type 1 Error \(5\%+5\%=10\%\) of the time if we are allowed to swap to the “best” one-sided test for the data. This is twice the error rate we prescribed with our significance level: \(\alpha=0.05\) (!).
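A quick simulation makes the 10% figure concrete. The sketch below is hypothetical Python (not from the book), with a standard normal standing in for the null distribution of the statistic: the data-snooping analyst rejects whenever the observation lands in either the top 5% or the bottom 5%.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims = 100_000

# Null world: the test statistic is pure noise around 0.
z = rng.standard_normal(n_sims)

# Data-snooped "one-sided" testing: reject in the upper 5% when z > 0,
# and in the lower 5% when z < 0 -- i.e., reject whenever |z| > 1.645.
upper_cut = 1.645  # 95th percentile of the standard normal
rejects = (z > upper_cut) | (z < -upper_cut)

print(round(rejects.mean(), 3))  # close to 0.10, double the intended 0.05
```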
Hypothesis tests should be set up before seeing the data.
After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses should be set up before observing the data.
6.2.2.3 Power
Although we won’t go into extensive detail here, power is an important topic for follow-up consideration after understanding the basics of hypothesis testing. A good power analysis is a vital preliminary step to any study as it will inform whether the data you collect are sufficient for drawing the conclusions you hope to make.
Oftentimes in experiment planning, there are two competing considerations:
- We want to collect enough data that we can detect important effects.
- Collecting data can be expensive, and in experiments involving people, there may be some risk to patients.
When planning a study, we want to know how likely we are to detect an effect we care about. In other words, if there is a real effect, and that effect is large enough that it has practical value, then what is the probability that we detect that effect? This probability is called the power, and we can compute it for different sample sizes or different effect sizes.
Power is the probability of rejecting the null claim when the alternative claim is true.
How easy it is to detect the effect depends on both how big the effect is (e.g., how good the medical treatment is) as well as the sample size.
We think of power as the probability that you will become rich and famous from your science. In order for your science to make a splash, you need to have good ideas! That is, you won’t become famous if you happen to find a single Type 1 Error which rejects the null hypothesis. Instead, you’ll become famous if your science is very good and important (that is, if the alternative hypothesis is true). The better your science is (i.e., the better the medical treatment), the larger the effect size and the easier it will be for you to convince people of your work.
Not only does your science need to be solid, but you also need to have evidence (i.e., data) that shows the effect. The data come from an experiment or an observational study. A few observations (e.g., \(n=2)\) are unlikely to be convincing because of well-known ideas of natural variability. Indeed, the larger the dataset which provides evidence for your scientific claim, the more likely you are to convince the community that your idea is correct.
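Power can be estimated by simulation. The hypothetical Python sketch below (not from the book, whose computations use R) treats the observed CPR rates of 0.35 and 0.22 as if they were the true proportions, an assumption made only for illustration, and counts how often a two-sided pooled Z test at \(\alpha = 0.05\) rejects for various sample sizes.

```python
import numpy as np

rng = np.random.default_rng(42)

def power(n_per_group, p1=0.35, p2=0.22, n_sims=5_000):
    """Estimate the power of a two-sided pooled two-proportion Z test."""
    rejections = 0
    for _ in range(n_sims):
        x1 = rng.binomial(n_per_group, p1)  # successes in group 1
        x2 = rng.binomial(n_per_group, p2)  # successes in group 2
        p_hat1, p_hat2 = x1 / n_per_group, x2 / n_per_group
        p_pool = (x1 + x2) / (2 * n_per_group)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_per_group))
        if se == 0:
            continue  # degenerate sample; cannot reject
        z = (p_hat1 - p_hat2) / se
        if abs(z) > 1.96:  # two-sided test at the 0.05 level
            rejections += 1
    return rejections / n_sims

for n in [50, 100, 200]:
    print(n, round(power(n), 2))  # power grows with the sample size
```

As the output suggests, detecting a 13 percentage point difference reliably takes far more than a few dozen patients per group.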
6.2.3 Bootstrap confidence interval for \(p_1 - p_2\)
In Section 6.2.1, we worked with the randomization distribution to understand the distribution of \(\hat{p}_1 - \hat{p}_2\) when the null hypothesis \(H_0: p_1 - p_2 = 0\) is true. Now, through bootstrapping, we study the variability of \(\hat{p}_1 - \hat{p}_2\) without the null assumption.
Observed data
Reconsider the CPR data from Section 6.2.1 which is provided in Table 6.3. The experiment consisted of two treatments on patients who underwent CPR for a heart attack and were subsequently admitted to a hospital. Each patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group). The outcome variable of interest was whether the patient survived for at least 24 hours.
Again, we use the difference in sample proportions as the observed statistic of interest. Here, the value of the statistic is: \(\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13.\)
Variability of the statistic
The bootstrap method applied to two samples is an extension of the method described in Section 5.2.
Now, we have two samples, so each sample estimates the population from which it came. In the CPR setting, the treatment sample estimates the population of all individuals who have gotten (or will get) the treatment; the control sample estimates the population of all individuals who do not get the treatment and are controls.
Figure 6.6 extends Figure 5.9 to show the bootstrapping process from two samples simultaneously.
As before, once the population is estimated, we can randomly resample observations to create bootstrap samples, as seen in Figure 6.7.
The variability of the statistic (the difference in sample proportions) can be calculated by taking one bootstrap resample from Sample 1 and one bootstrap resample from Sample 2 and calculating the difference of the bootstrap proportions. One resample from each of the estimated populations has been taken with the bootstrap proportions calculated for each of the bootstrap resamples.
As always, the variability of the difference in proportions can only be estimated by repeated simulations, in this case, repeated bootstrap resamples. Figure 6.8 shows multiple bootstrap differences calculated for each of the repeated bootstrap samples.
Repeated bootstrap simulations lead to a bootstrap sampling distribution of the statistic of interest, here the difference in sample proportions. Figure 6.10 visualizes the process in the toy example, and Figure 6.11 shows 1000 bootstrap differences in proportions for the CPR data. Note that the CPR data includes 40 and 50 people in the respective groups, and the toy example includes 7 and 9 people in the two groups. Accordingly, the variability in the distribution of sample proportions is higher for the toy example. When using the mathematical model (see Section 6.2.4), the standard error for the difference in proportions is inversely related to the sample size.
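As a sketch of the resampling scheme just described (hypothetical Python; the book’s computations use R), we can resample each CPR group separately and collect 1000 bootstrap differences in proportions.

```python
import numpy as np

rng = np.random.default_rng(3)

# CPR data: 14 of 40 treatment patients and 11 of 50 control patients survived.
treatment = np.array([1] * 14 + [0] * 26)
control   = np.array([1] * 11 + [0] * 39)

# Resample each group with replacement, keeping the original group sizes.
boot_diffs = np.array([
    rng.choice(treatment, size=40, replace=True).mean()
    - rng.choice(control, size=50, replace=True).mean()
    for _ in range(1000)
])

# Bootstrap SE of the difference, near the 0.0975 reported in the text.
print(round(boot_diffs.std(ddof=1), 3))

# Endpoints of a 90% bootstrap percentile interval.
print(np.round(np.percentile(boot_diffs, [5, 95]), 3))
```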
Bootstrap percentile vs. SE confidence intervals
Figure 6.11 provides an estimate for the variability of the difference in survival proportions from sample to sample. The values in the histogram can be used in two different ways to create a confidence interval for the parameter of interest: \(p_1 - p_2.\)
Bootstrap percentile confidence interval
As in Section 5.2, the bootstrap confidence interval can be calculated directly from the bootstrapped differences in Figure 6.11. The interval created from the percentiles of the distribution is called the percentile interval. Note that here we calculate the 90% confidence interval by finding the \(5^{th}\) and \(95^{th}\) percentile values from the bootstrapped differences. The bootstrap 5 percentile proportion is -0.155 and the 95 percentile is 0.167. The result is: we are 90% confident that, in the population, the true difference in probability of survival is between -0.155 and 0.167. The interval shows that we do not have much definitive evidence of the effect of blood thinners, one way or another.
Bootstrap SE confidence interval
Alternatively, we can use the variability in the bootstrapped differences to calculate a standard error of the difference. The resulting interval is called the SE interval. Section 6.2.4 details the mathematical model for the standard error of the difference in sample proportions, but the bootstrap distribution typically does an excellent job of estimating the variability.
\[SE(\hat{p}_T - \hat{p}_C) \approx SE(\hat{p}_{T, boot} - \hat{p}_{C, boot}) = 0.0975\]
The variability of the difference in proportions was calculated in R using the sd() function, but any statistical software will calculate the standard deviation of the differences, here, the exact quantity we hope to approximate.
Note that we do not know the true distribution of \(\hat{p}_T - \hat{p}_C,\) so we will use a rough approximation to find a confidence interval for \(p_T - p_C.\) As seen in the bootstrap histograms, the shape of the distribution is roughly symmetric and bell-shaped. So for a rough approximation, we will apply the 68-95-99.7 rule which tells us that 95% of observed differences should be roughly no farther than 2 SE from the true parameter difference. A 95% confidence interval for \(p_T - p_C\) is given by:
\[\begin{align*} \hat{p}_T - \hat{p}_C \pm 2 \cdot SE \ \ \ \rightarrow \ \ \ 14/40 - 11/50 \pm 2 \cdot 0.0975 \ \ \ \rightarrow \ \ \ (-0.065, 0.325) \end{align*}\]We are 95% confident that the true value of \(p_T - p_C\) is between -0.065 and 0.325. Again, the wide confidence interval that overlaps zero indicates that the study provides very little evidence about the effectiveness of blood thinners. For other percentages, e.g., a 90% bootstrap SE confidence interval, we will use quantiles given by the standard normal distribution, as seen in Section 5.3.2 and Figure 5.25.
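A quick numeric check of the SE interval arithmetic, plugging in the bootstrap standard error 0.0975 reported in the text:

```python
# Point estimate and the 2-SE interval for p_T - p_C.
point_estimate = 14 / 40 - 11 / 50   # 0.13
se_boot = 0.0975                      # bootstrap SE taken from the text

lo = point_estimate - 2 * se_boot
hi = point_estimate + 2 * se_boot
print(round(lo, 3), round(hi, 3))  # -0.065 0.325
```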
What does 95% mean?
Recall that the goal of a confidence interval is to find a plausible range of values for a parameter of interest. The estimated statistic is not the value of interest, but it is typically the best guess for the unknown parameter. The confidence level (often 95%) is a number that takes a while to get used to. Surprisingly, the percentage doesn’t describe the dataset at hand; it describes many possible datasets. One way to understand a confidence interval is to think about all the confidence intervals that you have ever made or that you will ever make as a scientist; the confidence level describes those intervals.
Figure 6.13 demonstrates a hypothetical situation in which 25 different studies are performed on the exact same population (with the same goal of estimating the true parameter value of \(p_1 - p_2 = 0.47).\) The study at hand represents one point estimate (a dot) and a corresponding interval. It is not possible to know whether the interval at hand is to the right of the unknown true parameter value (the black line) or to the left of that line. It is also impossible to know whether the interval captures the true parameter (is blue) or doesn’t (is red). If we are making 95% intervals, then 5% of the intervals we create over our lifetime will not capture the parameter of interest (e.g., will be red as in Figure 6.13). What we know is that over our lifetimes as scientists, 95% of the intervals created and reported on will capture the parameter value of interest: thus the language “95% confident.”
The choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however, it is related to the logic we used when deciding that a p-value should be declared as significant if it is lower than 0.05 (or 0.10 or 0.01, respectively). Indeed, one can show mathematically that a 95% confidence interval and a two-sided hypothesis test at a cutoff of 0.05 will provide the same conclusion when the same data and mathematical tools are applied for the analysis. A full derivation of the explicit connection between confidence intervals and hypothesis tests is beyond the scope of this text.
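The lifetime-of-intervals interpretation can be illustrated by simulation. In this hypothetical Python sketch, the true proportions are invented so that \(p_1 - p_2 = 0.47\) (matching the parameter value used in Figure 6.13), and we count how many of 2000 independently constructed 95% intervals capture the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

p1_true, p2_true = 0.67, 0.20   # invented truths with difference 0.47
n1 = n2 = 500                   # assumed study size, for illustration only
n_studies = 2_000
captured = 0

for _ in range(n_studies):
    # Each simulated "study" draws its own two samples...
    p_hat1 = rng.binomial(n1, p1_true) / n1
    p_hat2 = rng.binomial(n2, p2_true) / n2
    # ...and builds its own 95% interval for p1 - p2.
    se = np.sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)
    lo = (p_hat1 - p_hat2) - 1.96 * se
    hi = (p_hat1 - p_hat2) + 1.96 * se
    if lo <= p1_true - p2_true <= hi:
        captured += 1

print(round(captured / n_studies, 3))  # close to 0.95
```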
6.2.4 Mathematical model
Variability of \(\hat{p}_1 - \hat{p}_2\)
Like with \(\hat{p},\) the difference of two sample proportions \(\hat{p}_1 - \hat{p}_2\) can be modeled using a normal distribution when certain conditions are met. First, we require a broader independence condition, and secondly, the success-failure condition must be met by both groups.
Conditions for the sampling distribution of \(\hat{p}_1 - \hat{p}_2\) to be normal.
The difference \(\hat{p}_1 - \hat{p}_2\) can be modeled using a normal distribution when
- Independence (extended). The data are independent within and between the two groups. Generally this is satisfied if the data come from two independent random samples or if the data come from a randomized experiment.
- Success-failure condition. The success-failure condition holds for both groups, where we check successes and failures in each group separately. That is, we should have at least 10 successes and 10 failures in each of the two groups.
When these conditions are satisfied, the standard error of \(\hat{p}_1 - \hat{p}_2\) is:
\[\begin{eqnarray*} SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \end{eqnarray*}\] where \(p_1\) and \(p_2\) represent the population proportions, and \(n_1\) and \(n_2\) represent the sample sizes.
Note that in most cases, the standard error is approximated using the observed data:
\[\begin{eqnarray*} SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \end{eqnarray*}\] where \(\hat{p}_1\) and \(\hat{p}_2\) represent the observed sample proportions, and \(n_1\) and \(n_2\) represent the sample sizes.
Confidence interval for \(p_1 - p_2\)
We can apply the generic confidence interval formula for a difference of two proportions, where we use \(\hat{p}_1 - \hat{p}_2\) as the point estimate and substitute the \(SE\) formula:
\[\text{point estimate} \ \pm\ z^{\star} \times SE \to \hat{p}_1 - \hat{p}_2 \ \pm\ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
Standard Error of the difference in two proportions, \(\hat{p}_1 - \hat{p}_2.\)
When the conditions are met so that the distributions of \(\hat{p}_1\) and \(\hat{p}_2\) are both nearly normal, the variability of the difference in proportions, \(\hat{p}_1 - \hat{p}_2,\) is well described by:
\[ SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
We reconsider the experiment for patients who underwent cardiopulmonary resuscitation (CPR) for a heart attack and were subsequently admitted to a hospital. These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The outcome variable of interest was whether the patients survived for at least 24 hours. The results are shown in Table 6.3. Check whether we can model the difference in sample proportions using the normal distribution.
We first check for independence: since this is a randomized experiment, this condition is satisfied.
Next, we check the success-failure condition for each group. We have at least 10 successes and 10 failures in each experiment arm (11, 14, 39, 26), so this condition is also satisfied.
With both conditions satisfied, the difference in sample proportions can be reasonably modeled using a normal distribution for these data.
Create and interpret a 90% confidence interval for the difference in survival rates in the CPR study.
We’ll use \(p_T\) for the survival rate in the treatment group and \(p_C\) for the control group:
\[ \hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13\] We use the standard error formula previously provided. As with the one-sample proportion case, we use the sample estimates of each proportion in the formula in the confidence interval context:
\[ SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} + \frac{0.22 (1 - 0.22)}{50}} = 0.095 \]
For a 90% confidence interval, we use \(z^{\star} = 1.65:\)
\[ \text{point estimate} \ \pm\ z^{\star} \times SE \to \] \[0.13 \ \pm\ 1.65 \times 0.095 \to (-0.027, 0.287) \]
We are 90% confident that, for patients like those in the study, blood thinners have an impact on the survival rate of between -2.7 and +28.7 percentage points. Because 0% is contained in the interval, we do not have enough information to say whether blood thinners help or harm heart attack patients who have been admitted after they have undergone CPR.
Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95% or 99%). A lower degree of confidence increases potential for error, but it also produces a narrower interval.
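The computation in the example above can be reproduced directly from the formulas. This is a Python sketch (the book works in R); it keeps full precision, so the endpoints differ from the text’s (-0.027, 0.287) in the third decimal because the text rounds the SE to 0.095 first.

```python
import math

# CPR data: survival proportions and group sizes.
p_t, n_t = 14 / 40, 40   # treatment
p_c, n_c = 11 / 50, 50   # control

point_estimate = p_t - p_c
se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
z_star = 1.65  # for 90% confidence, as in the text

lo = point_estimate - z_star * se
hi = point_estimate + z_star * se
print(round(se, 3), round(lo, 3), round(hi, 3))
```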
A 5-year experiment was conducted to evaluate the effectiveness of fish oils on reducing cardiovascular events, where each subject was randomized into one of two treatment groups. We’ll consider heart attack outcomes in the patients listed in Table 6.4.
Create a 95% confidence interval for the effect of fish oils on heart attacks for patients who are wellrepresented by those in the study. Also interpret the interval in the context of the study.^{156}
|          | heart attack | no event | Total |
|----------|--------------|----------|-------|
| fish oil | 145          | 12788    | 12933 |
| placebo  | 200          | 12738    | 12938 |
Hypothesis test for \(H_0: p_1 - p_2 = 0\)
The details for calculating a SE and for checking technical conditions are very similar to those of confidence intervals. However, when the null hypothesis is that \(p_1 - p_2 = 0,\) we use a special proportion called the pooled proportion to estimate the SE and to check the success-failure condition.
Use the pooled proportion when \(H_0\) is \(p_1 - p_2 = 0.\)
When the null hypothesis is that the proportions are equal, use the pooled proportion \((\hat{p}_{\textit{pool}})\) of successes to verify the success-failure condition and estimate the standard error:
\[ \hat{p}_{\textit{pool}} = \frac{\mbox{number of } ``\mbox{successes}"} {\mbox{number of cases}} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}\]
Here \(\hat{p}_1 n_1\) represents the number of successes in sample 1 because \[ \hat{p}_1 = \frac{\text{number of successes in sample 1}}{n_1} \]
Similarly, \(\hat{p}_2 n_2\) represents the number of successes in sample 2.
The test statistic for assessing two proportions is a Z.
The Z score is a ratio of how much the two sample proportions differ, as compared to the expected variability of the two \(\hat{p}\) values.
\[ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool}) \bigg(\frac{1}{n_1} + \frac{1}{n_2} \bigg)}} \]
When the null hypothesis is true and the conditions are met, Z has a standard normal distribution. See the box below for calculation of the pooled proportion of successes.
Conditions:
- independently observed data
- large samples: \((n_1 p_1 \geq 10\) and \(n_1 (1-p_1) \geq 10\) and \(n_2 p_2 \geq 10\) and \(n_2 (1-p_2) \geq 10)\)
- check conditions using: \((n_1 \hat{p}_{\textit{pool}} \geq 10\) and \(n_1 (1-\hat{p}_{\textit{pool}}) \geq 10\) and \(n_2 \hat{p}_{\textit{pool}} \geq 10\) and \(n_2 (1-\hat{p}_{\textit{pool}}) \geq 10)\)
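As a sketch, the pooled Z statistic defined above can be wrapped in a small helper. This is hypothetical Python (the function name is our own, and the book computes in R), applied here to the CPR counts for illustration.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 - p2 = 0, using the pooled proportion."""
    p_hat1, p_hat2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)           # pooled proportion of successes
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p_hat1 - p_hat2) / se

# CPR data: 14 of 40 treatment and 11 of 50 control patients survived.
print(round(two_prop_z(14, 40, 11, 50), 2))  # 1.37
```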
A mammogram is an X-ray procedure used to check for breast cancer. Whether mammograms should be used is part of a controversial discussion, and it’s the topic of our next example where we learn about two-proportion hypothesis tests when \(H_0\) is \(p_1 - p_2 = 0\) (or equivalently, \(p_1 = p_2).\)
A 30-year study was conducted with nearly 90,000 female participants. During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full 30-year period. Results from the study are summarized in Table 6.5.
If mammograms are much more effective than nonmammogram breast cancer exams, then we would expect to see additional deaths from breast cancer in the control group. On the other hand, if mammograms are not as effective as regular breast cancer exams, we would expect to see an increase in breast cancer deaths in the mammogram group.
|           | Yes | No     |
|-----------|-----|--------|
| Mammogram | 500 | 44,425 |
| Control   | 505 | 44,405 |
Is this study an experiment or an observational study?^{157}
Set up hypotheses to test whether there was a difference in breast cancer deaths in the mammogram and control groups.^{158}
The research question describing mammograms is set up to address specific hypotheses (in contrast to a confidence interval for a parameter). In order to fully take advantage of the hypothesis testing structure, we assess the randomness under the condition that the null hypothesis is true (as we always do for hypothesis testing). Using the data from Table 6.5, we will check the conditions for using a normal distribution to analyze the results of the study using a hypothesis test.
\[\begin{align*} \hat{p}_{\textit{pool}} &= \frac {\text{# of patients who died from breast cancer in the entire study}} {\text{# of patients in the entire study}} \\ &= \frac{500 + 505}{500 + \text{44,425} + 505 + \text{44,405}} \\ &= 0.0112 \end{align*}\] This proportion is an estimate of the breast cancer death rate across the entire study, and it’s our best estimate of the proportions \(p_{MGM}\) and \(p_{C}\) if the null hypothesis is true that \(p_{MGM} = p_{C}.\) We will also use this pooled proportion when computing the standard error.
Is it reasonable to model the difference in proportions using a normal distribution in this study?
Because the patients are randomized, they can be treated as independent, both within and between groups. We also must check the success-failure condition for each group. Under the null hypothesis, the proportions \(p_{MGM}\) and \(p_{C}\) are equal, so we check the success-failure condition with our best estimate of these values under \(H_0,\) the pooled proportion from the two samples, \(\hat{p}_{\textit{pool}} = 0.0112:\)
\[\begin{align*} \hat{p}_{\textit{pool}} \times n_{MGM} &= 0.0112 \times \text{44,925} = 503\\ (1 - \hat{p}_{\textit{pool}}) \times n_{MGM} &= 0.9888 \times \text{44,925} = \text{44,422} \\ \hat{p}_{\textit{pool}} \times n_{C} &= 0.0112 \times \text{44,910} = 503\\ (1 - \hat{p}_{\textit{pool}}) \times n_{C} &= 0.9888 \times \text{44,910} = \text{44,407} \end{align*}\]The success-failure condition is satisfied since all values are at least 10. With both conditions satisfied, we can safely model the difference in proportions using a normal distribution.
In the previous example, the pooled proportion was used to check the success-failure condition.^{159} In the next example, we see an additional place where the pooled proportion comes into play: the standard error calculation.
Compute the point estimate of the difference in breast cancer death rates in the two groups, and use the pooled proportion \(\hat{p}_{\textit{pool}} = 0.0112\) to calculate the standard error.
The point estimate of the difference in breast cancer death rates is \[\begin{align*} \hat{p}_{MGM} - \hat{p}_{C} &= \frac{500}{500 + 44,425} - \frac{505}{505 + 44,405} \\ &= 0.01113 - 0.01125 \\ &= -0.00012 \end{align*}\] The breast cancer death rate in the mammogram group was 0.012% less than in the control group. Next, the standard error is calculated using the pooled proportion, \(\hat{p}_{\textit{pool}}:\)
\[ SE = \sqrt{ \frac{\hat{p}_{\textit{pool}}(1\hat{p}_{\textit{pool}})} {n_{MGM}} + \frac{\hat{p}_{\textit{pool}}(1\hat{p}_{\textit{pool}})} {n_{C}} } = 0.00070\]
Using the point estimate \(\hat{p}_{MGM}  \hat{p}_{C} = 0.00012\) and standard error \(SE = 0.00070,\) calculate a pvalue for the hypothesis test and write a conclusion.
Just like in past tests, we first compute a test statistic and draw a picture: \[ Z = \frac{\text{point estimate}  \text{null value}}{SE} = \frac{0.00012  0}{0.00070} = 0.17 \]
The lower tail area is 0.4325, which we double to get the p-value: 0.8650. Because this p-value is larger than 0.05, we do not reject the null hypothesis. That is, the difference in breast cancer death rates is reasonably explained by chance, and we do not observe benefits or harm from mammograms relative to a regular breast exam.
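The whole calculation can be reproduced from the raw counts; a sketch in Python (an illustration alongside the book's R workflow; expect small differences from the hand calculation, which uses rounded intermediate values):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Unrounded counts from the mammogram study
p_mgm = 500 / 44925
p_c = 505 / 44910
p_pool = (500 + 505) / (44925 + 44910)

se = math.sqrt(p_pool * (1 - p_pool) / 44925
               + p_pool * (1 - p_pool) / 44910)
z = (p_mgm - p_c) / se             # point estimate minus the null value (0)
p_value = 2 * normal_cdf(-abs(z))  # two-sided p-value
```

The standard error matches 0.00070, the Z score is close to \(-0.17\), and the p-value is close to 0.865, well above 0.05.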
Can we conclude that mammograms have no benefits or harm? Here are a few considerations to keep in mind when reviewing the mammogram study as well as any other medical study:
- We do not accept the null hypothesis, which means we don't have sufficient evidence to conclude that mammograms reduce or increase breast cancer deaths.
- If mammograms are helpful or harmful, the data suggest the effect isn't very large.
- Are mammograms more or less expensive than a non-mammogram breast exam? If one option is much more expensive than the other and doesn't offer clear benefits, then we should lean towards the less expensive option.
- The study's authors also found that mammograms led to overdiagnosis of breast cancer, which means some breast cancers were found (or thought to be found) but that these cancers would not cause symptoms during patients' lifetimes. That is, something else would kill the patient before breast cancer symptoms appeared. This means some patients may have been treated for breast cancer unnecessarily, and this treatment is another cost to consider. It is also important to recognize that overdiagnosis can cause unnecessary physical or emotional harm to patients.
These considerations highlight the complexity around medical care and treatment recommendations. Experts and medical boards who study medical treatments use considerations like those above to provide their best recommendation based on the current evidence.
6.2.5 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
6.3 Independence in two-way tables
In Section 6.2 our focus was on the difference in proportions, a statistic calculated from finding the success proportions (from the binary response variable) measured across two groups (the binary explanatory variable). As we will see in the examples below, sometimes the explanatory or response variables have more than two possible options. In that setting, a difference across two groups is not sufficient, and the proportion of "success" is not well defined if there are 3 or 4 or more possible response levels. The primary way to summarize categorical data where the explanatory and response variables both have 2 or more levels is through a two-way table as in Table 6.6.
Note that with two-way tables, there is not an obvious single parameter of interest. Instead, research questions usually focus on how the proportions of the response variable change (or not) across the different levels of the explanatory variable. Because there is not a population parameter to estimate, bootstrapping to find the standard error of the estimate is not meaningful. As such, for two-way tables, we will focus on the randomization test and corresponding mathematical approximation (and not bootstrapping).
6.3.1 Randomization test of \(H_0:\) independence
We all buy used products – cars, computers, textbooks, and so on – and we sometimes assume the sellers of those products will be forthright about any underlying problems with what they’re selling. This is not something we should take for granted. Researchers recruited 219 participants in a study where they would sell a used iPod^{160} that was known to have frozen twice in the past. The participants were incentivized to get as much money as they could for the iPod since they would receive a 5% cut of the sale on top of $10 for participating. The researchers wanted to understand what types of questions would elicit the seller to disclose the freezing issue.
Unbeknownst to the participants who were the sellers in the study, the buyers were collaborating with the researchers to evaluate the influence of different questions on the likelihood of getting the sellers to disclose the past issues with the iPod. The scripted buyers started with “Okay, I guess I’m supposed to go first. So you’ve had the iPod for 2 years …” and ended with one of three questions:
- General: What can you tell me about it?
- Positive Assumption: It doesn't have any problems, does it?
- Negative Assumption: What problems does it have?
The question is the treatment given to the sellers, and the response is whether the question prompted them to disclose the freezing issue with the iPod. The results are shown in Table 6.6, and the data suggest that asking the Negative Assumption question, "What problems does it have?", was the most effective at getting the seller to disclose the past freezing issues. However, you should also be asking yourself: could we see these results due to chance alone, or is this in fact evidence that some questions are more effective for getting at the truth?
|                  | General | Positive Assumptions | Negative Assumptions | Total |
|------------------|---------|----------------------|----------------------|-------|
| Disclose Problem | 2       | 23                   | 36                   | 61    |
| Hide Problem     | 71      | 50                   | 37                   | 158   |
| Total            | 73      | 73                   | 73                   | 219   |
The hypothesis test for the iPod experiment is really about assessing whether there is statistically significant evidence of a difference in the rates at which each question led the participant to disclose the problem with the iPod. In other words, the goal is to check whether the buyer's question was independent of whether the seller disclosed a problem.
Expected counts in two-way tables
While we would not expect the number of disclosures to be exactly the same across the three groups, the rates of disclosure seem substantially different. In order to investigate whether the differences in rates are due to natural variability or due to a treatment effect (i.e., the question causing the differences), we need to compute estimated counts for each cell in a two-way table.
From the experiment, we can compute the proportion of all sellers who disclosed the freezing problem as \(61/219 = 0.2785.\) If there really is no difference among the questions and 27.85% of sellers were going to disclose the freezing problem no matter the question that was put to them, how many of the 73 people in the General group would we have expected to disclose the freezing problem?
We would predict that \(0.2785 \times 73 = 20.33\) sellers would disclose the problem. Obviously we observed fewer than this, though it is not yet clear if that is due to chance variation or whether that is because the questions vary in how effective they are at getting to the truth.
If the questions were actually equally effective, meaning about 27.85% of respondents would disclose the freezing issue regardless of what question they were asked, about how many sellers would we expect to hide the freezing problem from the Positive Assumption group?^{161}
We can compute the expected number of sellers who we would expect to disclose or hide the freezing issue for all groups, if the questions had no impact on what they disclosed, using the same strategies employed in the previous Example and Guided Practice to compute expected counts. These expected counts were used to construct Table 6.7, which is the same as Table 6.6, except now the expected counts have been added in parentheses.
|                  | General    | Positive Assumptions | Negative Assumptions | Total |
|------------------|------------|----------------------|----------------------|-------|
| Disclose Problem | 2 (20.33)  | 23 (20.33)           | 36 (20.33)           | 61    |
| Hide Problem     | 71 (52.67) | 50 (52.67)           | 37 (52.67)           | 158   |
| Total            | 73         | 73                   | 73                   | 219   |
The examples and exercises above provided some help in computing expected counts. In general, expected counts for a two-way table may be computed using the row totals, column totals, and the table total. For instance, if there was no difference between the groups, then about 27.85% of each column should be in the first row:
\[\begin{align*} 0.2785\times (\text{column 1 total}) &= 20.33 \\ 0.2785\times (\text{column 2 total}) &= 20.33 \\ 0.2785\times (\text{column 3 total}) &= 20.33 \end{align*}\] Looking back to how 0.2785 was computed – as the fraction of sellers who disclosed the freezing issue \((61/219)\) – these three expected counts could have been computed as \[\begin{align*} \left(\frac{\text{row 1 total}}{\text{table total}}\right) \text{(column 1 total)} &= 20.33 \\ \left(\frac{\text{row 1 total}}{\text{table total}}\right) \text{(column 2 total)} &= 20.33 \\ \left(\frac{\text{row 1 total}}{\text{table total}}\right) \text{(column 3 total)} &= 20.33 \end{align*}\] This leads us to a general formula for computing expected counts in a two-way table when we would like to test whether there is strong evidence of an association between the column variable and row variable.
Computing expected counts in a two-way table.
To identify the expected count for the \(i^{th}\) row and \(j^{th}\) column, compute \[\begin{align*} \text{Expected Count}_{\text{row }i,\text{ col }j} = \frac{(\text{row $i$ total}) \times (\text{column $j$ total})}{\text{table total}} \end{align*}\]
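The formula above translates directly into code; a minimal sketch in Python (an illustration alongside the book's R workflow), applied to the observed counts from Table 6.6:

```python
def expected_counts(table):
    """Expected count for each cell: (row total x column total) / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Observed counts from the iPod experiment (rows: disclose, hide)
ipod = [[2, 23, 36],
        [71, 50, 37]]
expected = expected_counts(ipod)
```

The first row of `expected` holds the three values of \(61 \times 73 / 219 \approx 20.33\) and the second row the three values of \(158 \times 73 / 219 \approx 52.67\), matching Table 6.7.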
The chi-square statistic
Observed data
The chi-square test statistic for a two-way table is found by comparing the observed and expected counts for each cell in the table. For each table count, compute:
\[\begin{align*} &\text{General formula} && \frac{(\text{observed count} - \text{expected count})^2} {\text{expected count}} \\ &\text{Row 1, Col 1} && \frac{(2 - 20.33)^2}{20.33} = 16.53 \\ &\text{Row 1, Col 2} && \frac{(23 - 20.33)^2}{20.33} = 0.35 \\ & \hspace{9mm}\vdots && \hspace{13mm}\vdots \\ &\text{Row 2, Col 3} && \frac{(37 - 52.67)^2}{52.67} = 4.66 \end{align*}\] Adding the computed value for each cell gives the chi-square test statistic \(X^2:\) \[\begin{align*} X^2 = 16.53 + 0.35 + \dots + 4.66 = 40.13 \end{align*}\]
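The full sum over all six cells can be checked in a few lines; a sketch in Python (an illustration, not the book's R code), using exact rather than rounded expected counts:

```python
# Observed counts and expected counts (under H0) for the iPod experiment
observed = [[2, 23, 36],
            [71, 50, 37]]
expected = [[61 * 73 / 219] * 3,    # disclose row: 20.33 each
            [158 * 73 / 219] * 3]   # hide row: 52.67 each

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))
print(round(x2, 2))  # 40.13
```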
Randomization distribution of the chi-square statistic
Variability of the statistic
Is 40.13 a big number? That is, does it indicate that the observed and expected values are really different? Or is 40.13 a value of the statistic that we'd expect to see just due to natural variability? Previously, we applied the randomization test to the setting where the research question investigated a difference in proportions. The same idea of shuffling the data under the null hypothesis can be used in the setting of the two-way table.
Assuming that the individuals would disclose or hide the problems regardless of the question they are given (i.e., that the null hypothesis is true), we can randomize the data by reassigning the 61 disclosed problems and 158 hidden problems to the three groups at random. Table 6.8 shows a possible randomization of the observed data under the condition that the null hypothesis is true (in contrast to the original observed data in Table 6.6).
|                  | General | Positive Assumptions | Negative Assumptions | Total |
|------------------|---------|----------------------|----------------------|-------|
| Disclose Problem | 15      | 26                   | 20                   | 61    |
| Hide Problem     | 58      | 47                   | 53                   | 158   |
| Total            | 73      | 73                   | 73                   | 219   |
As before, the randomized data is used to find a single value for the test statistic (here a chi-square statistic). The chi-square statistic for the randomized two-way table is found by comparing the observed and expected counts for each cell in the randomized table. For each cell, compute:
\[\begin{align*} &\text{General formula} && \frac{(\text{observed count} - \text{expected count})^2} {\text{expected count}} \\ &\text{Row 1, Col 1} && \frac{(15 - 20.33)^2}{20.33} = 1.399 \\ &\text{Row 1, Col 2} && \frac{(26 - 20.33)^2}{20.33} = 1.579 \\ & \hspace{9mm}\vdots && \hspace{13mm}\vdots \\ &\text{Row 2, Col 3} && \frac{(53 - 52.67)^2}{52.67} = 0.002 \end{align*}\] Adding the computed value for each cell gives the chi-square test statistic \(X^2:\) \[\begin{align*} X^2 = 1.399 + 1.579 + \dots + 0.002 = 4.136 \end{align*}\]
Observed statistic vs null statistics
As before, a single randomization is not sufficient for understanding how the observed statistic compares to the chi-square statistics we would expect when \(H_0\) is true.
To investigate whether 40.13 is large enough to indicate that the observed and expected counts are significantly different, we need to understand what values of the chi-square statistic would happen just due to chance.
Figure 6.14 plots 1000 chi-square statistics generated under the null hypothesis.
We can see that the observed value is so far from the null statistics that the simulated p-value is zero.
That is, the probability of seeing the observed statistic when the null hypothesis is true is virtually zero.
In this case we can conclude that the decision of whether or not to disclose the iPod's problem is changed by the question asked. (We use the causal language of "changed" because the study was an experiment.) Note that with a chi-square test, we only know that the two variables (here: question and disclosure) are related (i.e., not independent). We are not able to claim which type of question causes which type of disclosure.
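A randomization distribution like the one in Figure 6.14 can be simulated by repeatedly shuffling the 61 disclosures and 158 non-disclosures into three groups of 73; a sketch in Python (an illustration of the idea, not the book's own code):

```python
import random

random.seed(0)  # for reproducibility of this sketch

def chisq_stat(observed, expected):
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Expected counts under H0 (the same for every shuffle, since margins are fixed)
expected = [[61 * 73 / 219] * 3, [158 * 73 / 219] * 3]

# 61 disclosures (1) and 158 non-disclosures (0)
outcomes = [1] * 61 + [0] * 158

null_stats = []
for _ in range(1000):
    random.shuffle(outcomes)
    disclose = [sum(outcomes[g * 73:(g + 1) * 73]) for g in range(3)]
    hide = [73 - d for d in disclose]
    null_stats.append(chisq_stat([disclose, hide], expected))

# Estimated p-value: how often does a shuffled statistic reach 40.13?
p_value = sum(s >= 40.13 for s in null_stats) / len(null_stats)
```

Essentially none of the shuffled statistics come anywhere near the observed 40.13, which is why the simulated p-value is (to simulation precision) zero.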
6.3.2 Mathematical model
The chi-square test of \(H_0:\) independence
Previously, in Section 6.2.4, we applied the Central Limit Theorem to the sampling variability of \(\hat{p}_1 - \hat{p}_2.\) The result was that we could use the normal distribution (e.g., \(z^*\) values (see Figure 6.2) and p-values from \(Z\) scores) to complete the mathematical inferential procedure. The chi-square test statistic follows a different mathematical distribution, called the chi-square distribution. The important specification to make in describing the chi-square distribution is something called degrees of freedom. The degrees of freedom change the shape of the chi-square distribution to fit the problem at hand. Figure 6.15 visualizes different chi-square distributions corresponding to different degrees of freedom.
Variability of the statistic
As it turns out, the chi-square test statistic follows a chi-square distribution when the null hypothesis is true. For two-way tables, the degrees of freedom are equal to: \[\begin{align*} df = \text{(number of rows minus 1)}\times \text{(number of columns minus 1)} \end{align*}\] In our example, the degrees of freedom parameter is \[\begin{align*} df = (2-1)\times (3-1) = 2 \end{align*}\]
Observed statistic vs. null statistics
The test statistic for assessing two categorical variables is \(X^2.\)
The \(X^2\) statistic quantifies how much the observed counts vary from the expected counts, relative to the expected counts (which reflect how large the sample is).
\[\begin{align*} X^2 = \sum_{i,j} \frac{(\text{observed count} - \text{expected count})^2} {\text{expected count}} \end{align*}\]When the null hypothesis is true and the conditions are met, \(X^2\) has a chi-square distribution with \(df = (r-1) \times (c-1).\)
Conditions:

- independently observed data
- large samples (at least 5 expected counts in each cell)
To bring it back to the example, if the null hypothesis is true (i.e., the questions had no impact on the sellers in the experiment), then the test statistic \(X^2 = 40.13\) closely follows a chi-square distribution with 2 degrees of freedom. Using this information, we can compute the p-value for the test, which is depicted in Figure 6.16.
Computing degrees of freedom for a two-way table. When applying the chi-square test to a two-way table, we use \[\begin{align*} df = (R-1)\times (C-1) \end{align*}\] where \(R\) is the number of rows in the table and \(C\) is the number of columns.
The software R can be used to find the p-value with the function pchisq(). Just like pnorm(), pchisq() always gives the area to the left of the cutoff value. Because, in this example, the p-value is represented by the area to the right of 40.13, we subtract the output of pchisq() from 1.
1 - pchisq(40.13, df = 2)
#> [1] 1.93e-09
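For a chi-square distribution with exactly 2 degrees of freedom, the upper tail area has the closed form \(e^{-x/2}\), so the pchisq() result can be cross-checked by hand; a sketch in Python (an illustration, not part of the book's R workflow):

```python
import math

x2 = 40.13
# Chi-square upper tail for df = 2 only: P(X^2 > x) = exp(-x / 2)
p_value = math.exp(-x2 / 2)
print(f"{p_value:.2e}")  # 1.93e-09, matching R's 1 - pchisq(40.13, df = 2)
```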
Find the p-value and draw a conclusion about whether the question affects the seller's likelihood of reporting the freezing problem.
Using a computer, we can compute a very precise value for the tail area above \(X^2 = 40.13\) for a chisquare distribution with 2 degrees of freedom: 0.000000002.
Using a significance level of \(\alpha=0.05,\) the null hypothesis is rejected since the p-value is smaller. That is, the data provide convincing evidence that the question asked did affect a seller's likelihood to tell the truth about problems with the iPod.
Table 6.9 summarizes the results of an experiment evaluating three treatments for Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The three treatments considered were continued treatment with metformin (met), treatment with metformin combined with rosiglitazone (rosi), or a lifestyle intervention program. Each patient had a primary outcome, which was either lacked glycemic control (failure) or did not lack that control (success).
What are appropriate hypotheses for this test?
- \(H_0:\) There is no difference in the effectiveness of the three treatments.
- \(H_A:\) There is some difference in effectiveness between the three treatments, e.g., perhaps the rosi treatment performed better than lifestyle.
|           | Failure | Success | Total |
|-----------|---------|---------|-------|
| lifestyle | 109     | 125     | 234   |
| met       | 120     | 112     | 232   |
| rosi      | 90      | 143     | 233   |
| Total     | 319     | 380     | 699   |
Typically we will use a computer to do the computational work of finding the chi-square statistic. However, it is always good to have a sense for what the computer is doing, and in particular, calculating the values that would be expected if the null hypothesis were true can help in understanding the null hypothesis claim. Additionally, comparing the expected and observed values by eye often gives the researcher some insight into why the result of the test is or is not significant.
A chi-square test for a two-way table may be used to test the hypotheses in the diabetes Example above. To get a sense for the statistic used in the chi-square test, first compute the expected values for each of the six table cells.^{162}
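To check your work on the guided practice, the six expected counts follow from the same row-total-times-column-total formula; a sketch in Python (an illustration, not the book's R code), using the counts from Table 6.9:

```python
# Observed counts from the diabetes experiment (rows: lifestyle, met, rosi)
diabetes = [[109, 125],
            [120, 112],
            [90, 143]]

row_totals = [sum(row) for row in diabetes]        # [234, 232, 233]
col_totals = [sum(col) for col in zip(*diabetes)]  # [319, 380]
total = sum(row_totals)                            # 699

# Expected count for each cell: (row total x column total) / table total
expected = [[r * c / total for c in col_totals] for r in row_totals]
for label, row in zip(["lifestyle", "met", "rosi"], expected):
    print(label, [round(e, 1) for e in row])
```

Each expected row sums to the observed row total, and each expected column sums to the observed column total, which is a quick sanity check on the arithmetic.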
Note, when analyzing 2-by-2 contingency tables (that is, when both variables only have two possible options), one guideline is to use the two-proportion methods introduced in Section 6.2.
6.3.3 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
6.4 Chapter review
In this chapter we extended the randomization / bootstrap / mathematical model paradigm to research questions involving categorical variables. We continued working with one population proportion as well as the difference in population proportions, but the test of independence allowed for hypothesis testing on categorical variables with more than two levels. We note that the normal model was an excellent mathematical approximation to the sampling distribution of sample proportions (or differences in sample proportions), but that questions involving categorical variables with more than 2 levels required a new mathematical model, the chi-square distribution. As seen in Chapter 5, but fully realized here in Chapter 6, almost all the research questions can be approached using computational methods (e.g., randomization tests or bootstrapping) or using mathematical models. We continue to emphasize the importance of experimental design in making conclusions about research claims. In particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure 1.11).
|  | Randomization Test | Bootstrapping | Mathematical Model |
|---|---|---|---|
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment. | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population. | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples. |
| What is the random process described? | Randomized experiment. | Random sampling from a population. | Randomized experiment or random sampling. |
| What other random processes can be approximated? | Can also be used to describe random sampling in an observational study. | Can also be used to describe random allocation in an experiment. | Randomized experiment or random sampling. |
| What is it best for? | Hypothesis testing (can be used for confidence intervals, but not covered in this text). | Primarily confidence intervals (also the bootstrap hypothesis test for one proportion). | Quick analyses through, for example, calculating a Z score. |
| What physical object represents the simulation process? | Shuffling cards. | Pulling marbles from a bag. | Not applicable. |
| What are the technical conditions? | Independence. | Independence, large n. | Independence, large n. |
6.4.1 Terms
We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However you should be able to easily spot them as bolded text.
categorical data, confirmation bias, null distribution, one-sided hypothesis test, parametric bootstrap, percentile interval, point estimate, pooled proportion, power, SE interval, significance level, standard error for difference in proportions, standard error of single proportion, success-failure condition, two-sided hypothesis test, Type 1 Error, Type 2 Error
6.4.2 Chapter exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
6.4.3 Interactive R tutorials
Navigate the concepts you’ve learned in this chapter in R using the following selfpaced tutorials. All you need is your browser to get started!
You can also access the full list of tutorials supporting this book here.