8 Inference for regression
We now bring together ideas of inferential analyses from Chapter 5 with the descriptive models seen in Chapters 3 and 4. The setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample. The sampling variability will inform the conclusions about the population that can be drawn.
Many of the inferential ideas are remarkably similar to those covered in previous chapters. The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.
We encourage the reader to think broadly about the models at hand without putting too much dependence on the exact pvalues that are reported from the statistical software. Inference on models with multiple explanatory variables can suffer from data snooping which result in false positive claims. We provide some guidance and hope the reader will further their statistical learning after working through the material in this text.
8.1 Inference for linear regression
In this chapter, we bring together the inferential ideas (see Chapter 5) used to make claims about a population from information in a sample and the modeling ideas seen in Chapters 3 and 4. In particular, we will use the least squares regression line to test whether or not there is a relationship between two continuous variables. Additionally, we will build confidence intervals which quantify the slope of the linear regression line.
Observed data
We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue. The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample. Because we have made up the example and the data (and the entire population), we can take many many samples from the population to visualize the variability. Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample. The change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.
Consider the following hypothetical population of all of the sandwich stores of a particular chain seen in Figure 8.1. In this madeup world, the CEO actually has all the relevant data, which is why they can plot it here. The CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue. There appears to be linear relationship between advertising dollars and revenue (both in$000).
You may remember from Chapter 3 that the population model is: \[y = \beta_0 + \beta_1 x + \varepsilon.\]
Again, the omniscient CEO (with the full population information) can write down the true population model as: \[\mbox{expected revenue} = 11.23 + 4.8 \times \mbox{advertising}.\]
Variability of the statistic
Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether or not set up a new franchise. The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.
As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample. Consider the linear model which describes revenue (in $000) based on advertising dollars (in $0000).
The least squares regression model uses the data to find a sample linear fit: \[\hat{y} = b_0 + b_1 x.\]
A random sample of 20 stores shows a different least square regression line depending on which observations are selected. A subset of size 20 stores shows a similar positive trend between advertising and revenue (to what we saw in Figure 8.1 which described the population) despite having fewer observations on the plot.
A second sample of size 20 also shows a positive trend!
But the line is slightly different!
That is, there is variability in the regression line from sample to sample. The concept of the sampling variability is something you’ve seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: the slope of the line.
You might notice in Figure 8.5 that the \(\hat{y}\) values given by the lines are much more consistent in the middle of the dataset than at the ends. The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud. The effect of the fanshaped lines is that predicted revenue for advertising close to $4,000 will be much more precise than the revenue predictions made for $1,000 or $7,000 of advertising.
The distribution of slopes (for samples of size \(n=20\)) can be seen in a histogram, as in Figure 8.6.
Recall, the example described in this introduction is hypothetical.
That is, we created an entire population in order demonstrate how the slope of a line would vary from sample to sample.
The tools in this textbook are designed to evaluate only one single sample of data.
With actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes.
We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.
However, it is nice to visually consider the linear models produced by different slopes.
Additionally, as with measuring the variability of previous statistics (e.g., \(\overline{X}_1  \overline{X}_2\) or \(\hat{p}_1  \hat{p}_2\)), the histogram of the sample statistics can provide information related to inferential considerations.
In the following sections, the distribution (i.e., histogram) of \(b_1\) (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you. First (in Section 8.1.1), the distribution of \(b_1\) when \(\beta_1 = 0\) is constructed by randomizing (permuting) the response variable. Next (in Section 8.1.2), we can bootstrap the data by taking random samples of size n from the original dataset. And last (in Section 8.1.3), we use mathematical tools to describe the variability using the \(t\)distribution that was first encountered in Section 7.1.2.
8.1.1 Randomization test for \(H_0: \beta_1= 0\)
Consider the data on Global Crop Yields compiled by Our World in Data and presented as part of the TidyTuesday series seen in Figure 8.7. The scientific research interest at hand will be in determining the linear relationship between wheat yield (for a countryyear) and other crop yields. The dataset is quite rich and deserves exploring, but for this example, we will focus only on the annual crop yield in the United States.
As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.
In the linear model setting, we might hope to have a linear relationship between maize
and wheat
in settings where maize
production is known and wheat
production needs to be predicted.
The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.
Here the population refers to a larger set of years where maize
and wheat
are both grown in the US.

\(H_0: \beta_1= 0\), there is no linear relationship between
wheat
andmaize
. 
\(H_A: \beta_1 \ne 0\), there is some linear relationship between
wheat
andmaize
.
Recall that for the randomization test, we permute one variables to eliminate any existing relationship between the variables.
That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but not due to variables being correlated.
Figure 8.8 shows the observed data and a scatterplot of one permutation of the wheat
variable.
The careful observer can see that each of the observed the values for wheat
(and for maize
) exist in both the original data plot as well as the permuted wheat
plot, but the given wheat
and maize
yields are no longer matched for a given year.
That is, each wheat
yield is randomly assigned to a new maize
yield.
By repeatedly permuting the response variable any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship). The randomization test compares the slopes calculated from the permuted response variable with the observed slope. If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).
Observed data
We will continue to use the crop data to investigate the linear relationship between wheat
and maize
.
Note that the least squares model (see Chapter 3 ) describing the relationship is given in Table 8.1.
The columns in Table 8.1 are further described in Section 8.1.3.
term  estimate  std.error  statistic  p.value 

(Intercept)  1.033  0.091  11.3  0 
maize  0.195  0.012  16.4  0 
Variability of the statistic
After permuting the data, the least squares estimate of the line can be computed.
Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between wheat
and maize
.
Figure 8.9 shows two different permutations of wheat
and the resulting linear models.
As you can see, sometimes the slope of the permuted data is positive, sometimes it is negative. Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.
Observed statistic vs. null statistics
As we can see from Figure 8.10, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the wheat
variable.
That is, if indeed there were no linear relationship between wheat
and maize
, the natural variability of the slopes would produce estimates between approximately 0.1 and +0.1.
We reject the null hypothesis.
Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between wheat
and maize
crop yield in the US.
8.1.2 Bootstrap confidence interval for \(\beta_1\)
As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test). Because interest is now in creating a CI, there is no null hypothesis, so there won’t be any reason to permute either of the variables.
Observed data
Returning to the crop data, we may want to consider the relationship between peas
and wheat
.
Are peas
a good predictor of wheat
?
And if so, what is their relationship?
That is, what is the slope that models average wheat yield
as a function of peas
?
Variability of the statistic
Because we are not focused on a null distribution, we sample with replacement \(n=58\) observations from the original dataset.
Recall that with bootstrapping we always resample the same number of observations as we start with in order to mimic the process of taking a sample from the population.
When sampling in the linear model case, consider each observation to be a single dot.
If the dot is resampled, both the wheat
and the peas
measurement are observed.
The measurements are linked to the dot (i.e., to the year in which the measurements were taken).
Figure 8.12 shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models. The orange circle represent points in the original data which were not included in the bootstrap sample. The blue circle represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circle represents a particular structure to the data which is observed in both the original and bootstrap samples. By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure 8.13.
Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.
Figure 8.14 is a histogram of the relevant bootstrapped slopes.
We can see that a 95% bootstrap percentile interval for the true population slope is given by (0.061, 0.52).
We are 95% confident that for the model describing the population of crops of peas
and wheat
, a one unit increase in peas
yield (in tonnes per hectare) will be associated with an increase in predicted average wheat
yield of between 0.061 and 0.52 tonnes per hectare.
8.1.3 Mathematical model
When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter. The approximations will build on the tdistribution which were described in Chapter 7. The mathematical model is often correct and is usually easy to implement computationally. The validity of the technical conditions will be considered in detail in Section 8.2.
In this section, we discuss uncertainty in the estimates of the slope and yintercept for a regression line. Just as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.
Midterm elections and unemployment
Observed data
Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election. The set of House elections occurring during the middle of a Presidential term are called midterm elections. In America’s twoparty system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President’s party will do in the midterm elections. In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.
To assess the validity of this claim, we can compile historical data and look for a connection. We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression. The House of Representatives is made up of 435 voting members
Figure 8.15 shows these data and the leastsquares regression line:
\[ &\text{% change in House seats for President's party} \\ &\qquad\qquad= 7.36  0.89 \times \text{(unemployment rate)} \]
We consider the percent change in the number of seats of the President’s party (e.g. percent change in the number of seats for Republicans in 2018) against the unemployment rate.
Examining the data, there are no clear deviations from linearity or substantial outliers (see Section 3.1.3 for a discussion on using residuals to visualize how well a linear model fits the data). While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.
The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively. Do you agree that they should be removed for this investigation? Why or why not?^{187}
There is a negative slope in the line shown in Figure 8.15. However, this slope (and the yintercept) are only estimates of the parameter values. We might wonder, is this convincing evidence that the “true” linear model has a negative slope? That is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of the midterm election? We can frame this investigation into a statistical hypothesis test:
 \(H_0\): \(\beta_1 = 0\). The true linear model has slope zero.
 \(H_A\): \(\beta_1 \neq 0\). The true linear model has a slope different than zero. The unemployment is predictive of whether the President’s party wins or loses seats in the House of Representatives.
We would reject \(H_0\) in favor of \(H_A\) if the data provide strong evidence that the true slope parameter is different than zero. To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the pvalue.
Regression output from software
Variability of the statistic
Just like other point estimates we have seen before, we can compute a standard error and test statistic for \(b_1\). We will generally label the test statistic using a \(T\), since it follows the \(t\)distribution.
We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.
Table 8.2 shows software output for the least squares regression line in Figure 8.15.
The row labeled unemp
includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).
term  estimate  std.error  statistic  p.value 

(Intercept)  7.36  5.155  1.43  0.165 
unemp  0.89  0.835  1.07  0.296 
What do the first and second columns of Table 8.2 represent?
The entries in the first column represent the least squares estimates, \(b_0\) and \(b_1\), and the values in the second column correspond to the standard errors of each estimate. Using the estimates, we could write the equation for the least square regression line as
\[ \hat{y} = 7.36  0.89 x \]
where \(\hat{y}\) in this case represents the predicted change in the number of seats for the president’s party, and \(x\) represents the unemployment rate.
We previously used a \(t\)test statistic for hypothesis testing in the context of numerical data. Regression is very similar. In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the Tscore formula:
\[ T \ = \ \frac{\text{estimate}  \text{null value}}{\text{SE}} = \ \frac{0.89  0}{0.835} = \ 1.07 \]
This corresponds to the third column of Table 8.2 .
Use Table 8.2 to determine the pvalue for th hypothesis test
The last column of the table gives the pvalue for th twosided hypothesis test for the coefficient of the unemployment rate 0.2961 That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President’s party in the House of Representatives in midterm elections.
Observed statistic vs. null statistics
As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether or not the data could have come from a population where the true slope was zero (i.e., \(\beta_1 = 0\)). Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition. Based on everything we’ve seen in the examples above describing the variability of a line from sample to sample, as yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.
Examine Figure 5.20, which relates the Elmhurst College aid and student family income. How sure are you that the slope is statistically significantly different from zero? That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?
While the relationship between the variables is not perfect, there is an evident decreasing trend in the data. This suggests the hypothesis test will reject the null claim that the slope is zero.
The point of the tools in this section are to go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about the statistical significance of the slope estimate.
term  estimate  std.error  statistic  p.value 

(Intercept)  24319.329  1291.450  18.83  0 
family_income  0.043  0.011  3.98  0 
Table 8.3 shows statistical software output from fitting the least squares regression line shown in Figure 5.20. Use the output to formally evaluate the following hypotheses.^{188}
 \(H_0\): The true coefficient for family income is zero.
 \(H_A\): The true coefficient for family income is not zero.
Inference for regression We usually rely on statistical software to identify point estimates, standard errors, test statistics, and pvalues in practice. However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met. See Section 8.2.
Confidence interval for a coefficient
Observed data
Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.
Compute the 95% confidence interval for the coefficient using the regression output from Table 8.3.
The point estimate is 0.0431 and the standard error is \(SE = 0.0108\). When constructing a confidence interval for a model coefficient, we generally use a \(t\)distribution. The degrees of freedom for the distribution are noted in the regression output, \(df = 48\), allowing us to identify \(t_{48}^{\star} = 2.01\) for use in the confidence interval.
We can now construct the confidence interval in the usual way:
\[ \begin{align*} \text{point estimate} &\pm t_{48}^{\star} \times SE \\ 0.0431 &\pm 2.01 \times 0.0108 \\ (0.0648 &, 0.0214) \end{align*} \]
We are 95% confident that with each dollar increase in , the university’s gift aid is predicted to decrease on average by $0.0214 to $0.0648.
Variability of the statistic
Confidence intervals for coefficients
Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the \(t\)distribution:
\[ b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} \]
where \(t_{df}^{\star}\) is the appropriate \(t\)value corresponding to the confidence level with the model’s degrees of freedom.
On the topic of intervals in this book, we’ve focused exclusively on confidence intervals for model parameters. However, there are other types of intervals that may be of interest, including prediction intervals for a response value and also confidence intervals for a mean response value in the context of regression.
8.1.4 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
8.2 Checking model conditions
In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions. In this section, we’ll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures. .
What are the technical conditions for the mathematical model?
When fitting a least squares line, we generally require
Linearity. The data should show a linear trend. If there is a nonlinear trend (e.g. first panel of Figure 8.16) an advanced regression method from another book or later course should be applied.
Independent observations. Be cautious about applying regression to data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis. An example of a data set where successive observations are not independent is shown in the fourth panel of Figure 8.16. There are also other instances where correlations within the data are important, which is further discussed in Chapter 4.
Nearly normal residuals. Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we’ll talk about more in Section 3.3. An example of a residual that would be a potentially concern is shown in the second panel of Figure 8.16, where one observation is clearly much further from the regression line than the others.
Constant or equal variability. The variability of points around the least squares line remains roughly constant. An example of nonconstant variability is shown in the third panel of Figure 8.16, which represents the most common pattern observed when this condition fails: the variability of \(y\) is larger when \(x\) is larger.
Should we have concerns about applying least squares regression to the Elmhurst data in Figure 3.14?^{189}
The technical conditions are often remembered using the LINE mnemonic. The linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure 8.16. A careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.
 L: linear model
 I: independent observations
 N: points are normally distributed around the line
 E: equal variability around the line for all values of the explanatory variable
Why do we need technical conditions?
As with other inferential techniques we have covered in this text, if the technical conditions above don’t hold, then it is not possible to make concluding claims about the population. That is, without the technical conditions, the Tscore (or Zscore) will not have the assumed tdistribution (or standard normal Z distribution). That said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions. If there are large deviations, we will be unable to trust the calculated pvalue or the endpoints of the resulting confidence interval.
The model based on Linearity
The linearity condition is among the most important if your goal is to understand a linear model between \(x\) and \(y\). For example, the value of the slope will not be at all meaningful if the true relationship between \(x\) and \(y\) is quadratic, as in Figure 3.3. Not only should we be cautious about the inference, but the model itself is also not an accurate portrayal of the relationship between the variables.
In Section 8.3 we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones. However, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.
The importance of Independence
The technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose. It is also extremely difficult to gather a dataset which is a true random sample from the population of interest. (Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)
Dependent observations can bias results in ways that produce fundamentally flawed analyses. That is, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university. At best it is a model describing students who use the gym (but also who are willing to talk to you, that use the gym at the times you were there measuring, etc.).
In lieu of trying to answer whether or not your observations are a true random sample, you might instead focus on whether or not you believe your observations are representative of the populations. Humans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.
Some thoughts on Normality
The normality condition requires that points vary symmetrically around the line, spreading out in a bellshaped fashion. You should consider the “bell” of the normal distribution as sitting on top of the line (coming off the paper in a 3D sense) so as to indicate that the points are dense close to the line and disperse gradually as they get farther from the line.
The normality condition is less important than linearity or independence for a few reasons. First, the linear model fit with least squares will still be an unbiased estimate of the true population model. However, the standard errors associated with variability of the line will not be well estimated. Fortunately the Central Limit Theorem tells us that most of the analyses (e.g., SEs, pvalues, confidence intervals) done using the mathematical model will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough. One analysis method that does require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given \(x\) value, using the linear model. One additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.
Equal variability for prediction in particular
As with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of \(x\)) will not cause problems for the estimate of the linear model. That said, the inference on the model (e.g., computing pvalues) will be incorrect if the variability around the line is heterogeneous. Data that exhibit nonequal variance across the range of xvalues will have the potential to seriously misestimate the variability of the slope which will have consequences for the inference results (i.e., hypothesis tests and confidence intervals).
The inference results for both a randomization test or a bootstrap confidence interval are robust to the equal variability condition, so they give the analyst methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the regression line). Although randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observation). When the equal variability condition is violated and a mathematical analysis (e.g., pvalue from Tscore) is needed, there are other existing methods (outside the scope of this text) which can easily handle the unequal variance (e.g., weighted least squares analysis).
What if all the technical conditions are met?
When the technical conditions are met, the least squares regression model and inference is provided by virtually all statistical software. In addition to being ubiquitous, however, an additional advantage to the least squares regression model (and related inference) is that the linear model has important extensions (which are not trivial to implement with bootstrapping and randomization tests). In particular, random effects models, repeated measures, and interaction are all linear model extensions which require the above technical conditions. When the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand. We will discuss some of the extended modeling and associated inference in Section 8.3 and Section 8.4. Many of the techniques used to deal with technical condition violations are outside the scope of this text, but they are taught in universities in the very next class after this one. If you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.
8.2.1 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
8.3 Inference for multiple regression
In Chapter 4, the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable. Here, we discuss whether each of the variables individually is a significant predictor or whether the model might be just as strong without that variable. That is, as before, we apply inferentially methods to ask whether a variable could have come from a population where the particular coefficient at hand was zero. If one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero. The inference task at hand is to decide whether the coefficient’s difference from zero is large enough to decide that the data cannot possibly have come from a model where the true population coefficient is zero. Both the derivations from the mathematical model and the randomization model are beyond the scope of this book, but we are able to calculate pvalues using statistical software. We will discuss interpreting pvalues in the multiple regression setting and note some scenarios where careful understanding of the context and the relationship between variables is important.
8.3.1 Multiple regression output from software
Recall the loans
data from Chapter 4.
The data can be found in the openintro package: loans_full_schema
.
Based on the data in this dataset we have created two new variables: credit_util
which is calculated as the total credit utilized divided by the total credit limit and bankruptcy
which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcies).
We will refer to this modified dataset as loans
.
Now, our goal is to create a model where interest_rate
can be predicted using the variables debt_to_income
, term
, and credit_checks
.
As learned in Chapter 4, least squares can be used to find the coefficient estimates for the linear model.
The unknown population model can be written as: \[E[\mbox{interest_rate}] = \beta_0 + \beta_1\times \mbox{debt_to_income} + \beta_2 \times \mbox{term} + \beta_3 \times \mbox{credit_checks}\]
term  estimate  std.error  statistic  p.value 

(Intercept)  4.309  0.195  22.1  <0.0001 
debt_to_income  0.041  0.003  13.3  <0.0001 
term  0.158  0.004  37.9  <0.0001 
credit_checks  0.247  0.019  12.8  <0.0001 
The estimated equation for the regression model may be written as a model with three predictor variables:
\[ \widehat{\texttt{interest_rate}} = 4.31 + 0.041 \times \texttt{debt_to_income} + 0.16 \times \texttt{term} + 0.25 \times \texttt{credit_checks} \]
Not only does Table 8.4 provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which are the focus of this chapter.
In Section 8.1, we learned that the hypothesis test for a linear model with one predictor can be written as:
if only one predictor, \(H_0: \beta_1 = 0.\)
That is, if the true population slope is zero, the pvalue measures how likely it would be to select data which produced the observed slope (\(b_1\)) value.
With multiple predictors, the hypothesis is similar, however, it is now conditioned on each of the other variables remaining in the model.
if multiple predictors, \(H_0: \beta_i = 0\) given other variables in the model
Using the example above and focusing on each of the variable pvalues (here we won’t discuss the pvalue associated with the intercept), we can write out the three different hypotheses:

\(H_0: \beta_1 = 0\), given
term
andcredit_checks
are included in the model 
\(H_0: \beta_2 = 0\), given
debt_to_income
andcredit_checks
are included in the model 
\(H_0: \beta_3 = 0\), given
debt_to_income
andterm
are included in the model
The very low pvalues from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of the other two.
Consider the pvalue on \(H_0: \beta_1\).
The low pvalue says that it would be extremely unlikely to see data that produce a coefficient on debt_to_income
as large as 0.041 if the true relationship between debt_to_income
and interest_rate
was nonexistent (i.e., if \(\beta_1 = 0\)) and the model also included term
and credit_checks
.
You might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the problem, 0.041 turns out to be far away from zero, it’s all about context!
The pvalues on term
and on credit_checks
are interpreted similarly.
Sometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor variables themselves being correlated.
8.3.2 Multicollinearity
In practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model. For regression models, it is important to understand the entire context of the model, particularly for correlated variables. Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the significance (i.e., the pvalue) of each coefficient.
Consider an example where we’d like to predict how much money is in a coin dish based only on the number of coins in the dish. We ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar amount, the total number of coins, and the total number of low coins.^{190} The number of low coins is the number of coins minus the number of quarters (a quarter is the largest commonly used US coin, at US$0.25). Figure 8.17 illustrates a sample of U.S. coins, their total worth (amount), the number of total coins, and the number of low coins.
The collected data is given in Figure 8.18 and shows that the total amount of money is more highly correlated with the total number of coins than it is with the number of low coins. We also note that the number of high coins and the number of low coins are positively correlated.
Using the total number of coins as the predictor variable, Table 8.5 provides the least squares estimate of the coefficient is 0.13. For every additional coin in the dish, we would predict that the student had US$0.13 more. The \(b_1 = 0.13\) coefficient is highly significant, suggesting we would not have seen data like this if number of coins and amount of money were not linearly related.
\[ \widehat{\mbox{amount}} = 0.55 + 0.13 \times \mbox{number of coins} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  0.55  0.44  1.23  0.2301 
number.of.coins  0.13  0.02  5.54  <0.0001 
Using the number of low coins as the predictor variable, Table 8.6 provides the least squares estimate of the coefficient is 0.02. For every additional coin in the dish, we would predict that the student had US$0.02 more. The \(b_1 = 0.02\) coefficient is not at all significant, suggesting we could easily have seen data like ours even if the number of low coins and amount of money are not at all linearly related.
\[ \widehat{\mbox{amount}} = 2.28 + 0.02 \times \mbox{number of low coins} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  2.28  0.58  3.9  0.0 
number.of.low.coins  0.02  0.05  0.4  0.7 
Come up with an example of two observations that have the same number of low coins but the number of total coins differs by one. What is the difference in amount?
Come up with an example of two observations that have the same total number of coins but a different number of low coins. What is the difference in amount?
Using both the total number of coins and the number of low coins as predictor variables, Table 8.7 provides the least squares estimates of both coefficients as 0.21 and 0.16. Now, with two variables in the model, the interpretation is more nuanced.
 The coefficient indicates a change in one variable while keeping the other variable constant.
For every additional coin in the dish while the number of low coins stays constant, we would predict that the student had US$0.21 more. Reconsidering the phrase “every additional coin in the dish while the number of low coins stays constant” makes us realize that each increase is a single additional quarter (larger samples sizes would have led to a \(b_1\) coefficient closer to 0.25 because of the deterministic relationship described here).
 For every additional low coin in the dish while the number of total coins stays constant, we would predict that the student had US$0.16 less. Reconsidering the phrase “every additional low coin in the dish while the number of total coins stays constant” makes us realize that a quarter is being swapped out for a penny, nickel, or dime.
Considering the coefficients across Tables 8.5, 8.6, and 8.7 within the context and knowledge we have of US coins allows us to understand the correlation between variables and why the signs of the coefficients would change depending on the model. Note also, however, that the significance on the low coin coefficient changed from Table 8.6 to Table 8.7. It makes sense that the variable describing the number of low coins provides more information about the amount of money when it is part of a model which also includes the total number of coins than it does when it is used as a single variable in a simple linear regression model.
\[ \widehat{\text{amount}} = 0.80 + 0.21 \times \text{number of coins}  0.16 \times \text{number of low coins} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  0.80  0.30  2.65  0.0142 
number.of.coins  0.21  0.02  9.89  <0.0001 
number.of.low.coins  0.16  0.03  5.51  <0.0001 
When working with multiple regression models, understanding the significance and sign of the coefficient is not always as straightforward as it was with the coin example. However, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.
Although diving into the details are beyond the scope of this text, we will provide one more reflection about multicollinearity. If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or its significance. However, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable. So if the task at hand is only to do prediction, mutlicollinearity is likely to not cause you substantial problems.
8.3.3 Cross validation for prediction error
In Section 8.3.1, pvalues were calculated on each of the model coefficients. The pvalue gives a sense of which variables are important to the model; however, a more extensive treatment of variable selection is warranted in a followup course or textbook. Here, we use cross validation prediction error to focus on which variable(s) are important for predicting the response variable of interest. In general, linear models are also used to make predictions of individual observations. In addition to model building, cross validation provides a method for generating predictions that are not overfit to the particular dataset at hand. We continue to encourage you to take up further study on the topic of cross validation, as it is among the most important ideas in modern data analysis, and we are only able to scratch the surface here.
Cross validation is a computational technique which removes some observations before a model is run, then assesses the model accuracy on the heldout sample. By removing some observations, we provide ourselves with an independent evaluation of the model (that is, the removed observations do not contribute to finding the parameters which minimize the least squares equation). Cross validation can be used in many different ways (as an independent assessment), and here we will just scratch the surface with respect to one way the technique can be used to compare models. See Figure 8.21 for a visual representation of the cross validation process.
The data can be found in the palmerpenguings package: penguins
.
The observations of three different penguin species include measurements on body size and sex.
The data were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER as part of the Long Term Ecological Research Network.
Our goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams.
Although not exactly aligned with this research project, you might be able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial photographs) but the mass is not known.
The first model will predict body_mass_g
by using only the bill_length_mm
, a variable denoting the length of a penguin’s bill, in mm.
The second model will predict body_mass_g
by using bill_length_mm
, bill_depth_mm
, flipper_length_mm
, sex
, and species
.
The presentation below (see the comparison of Figures 8.23 and 8.25) shows that the model with more variables predicts body_mass_g
with much smaller errors (predicted minus actual body mass) than the model which uses only bill_length_g
.
We have deliberately used a model that intuitively makes sense (the more body measurements, the more predictable mass is).
However, in many settings, it is not obvious which variables or which models contribute most to accurate predictions.
Cross validation is one way to get accurate independent predictions with which to compare different models.
Comparing two models to predict body_mass_g
in penguins
The question we will seek to answer is whether the predictions of body_mass_g
are substantially better when bill_length_mm
, bill_depth_mm
, flipper_length_mm
, sex
, and species
are used in the model, as compared with a model on bill_length_mm
only.
We refer to the model given with only bill_lengh_mm
as the smaller model.
It is seen in Table 8.8 with coefficient estimates of the parameters as well as standard errors and pvalues.
We refer to the model given with bill_lengh_mm
, bill_depth_mm
, flipper_length_mm
, sex
, and species
as the larger model.
It is seen in Table 8.9 with coefficient estimates of the parameters as well as standard errors and pvalues.
Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables are significant (small pvalues) predictors of body_mass_g
.
However, in this section, we will go beyond the use of pvalues to consider independent predictions of body_mass_g
as a way to compare the smaller and larger models.
The smaller model:
\[ \begin{align*} E[\texttt{body_mass_g}] &= \ \beta_0 + \beta_1 \times \texttt{bill_length_mm}\\ \widehat{\texttt{body_mass_g}} &= \ 362.31 + 87.42 \times \texttt{bill_length_mm}\\ \end{align*} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  362.3  283.4  1.28  0.202 
bill_length_mm  87.4  6.4  13.65  <0.0001 
The larger model:
\[ \begin{align*} E[\texttt{body_mass_g}] = \beta_0 &+ \beta_1 \times \texttt{bill_length_mm} + \beta_2 \times \texttt{bill_depth_mm} \\ &+ \beta_3 \times \texttt{flipper_length_mm} + \beta_4 \times \texttt{sex}_{male} \\ &+ \beta_5 \times \texttt{species}_{Chinstrap} + \beta_6 \times \texttt{species}_{Gentoo}\\ \widehat{\texttt{body_mass_g}} = 1460.99 &+ 18.20 \times \texttt{bill_length_mm} + 67.22 \times \texttt{bill_depth_mm} \\ &+ 15.95 \times \texttt{flipper_length_mm} + 389.89 \times \texttt{sex}_{male} \\ & 251.48 \times \texttt{species}_{Chinstrap} + 1014.63 \times \texttt{species}_{Gentoo}\\ \end{align*} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  1461.0  571.31  2.56  0.011 
bill_length_mm  18.2  7.11  2.56  0.011 
bill_depth_mm  67.2  19.74  3.40  0.001 
flipper_length_mm  15.9  2.91  5.48  <0.0001 
sexmale  389.9  47.85  8.15  <0.0001 
speciesChinstrap  251.5  81.08  3.10  0.002 
speciesGentoo  1014.6  129.56  7.83  <0.0001 
In order to compare the smaller and larger models in terms of their ability to predict penguin mass, we need to build models that can provide independent predictions based on the penguins in the holdout samples created by cross validation.
To reiterate, each of the predictions that (when combined together) will allow us to distinguish between the smaller and larger are independent of the data which were used to build the model.
In this example, using cross validation, we remove one quarter of the data before running the least squares calculations.
Then the least squares model is used to predict the body_mass_g
of the penguins in the holdout sample.
Here we use a 4fold cross validation (meaning that one quarter of the data is removed each time) to produce four different versions of each model (other times it might be more appropriate to use 2fold or 10fold or even run the model separately after removing each individual data point one at a time).
Figure 8.22 displays how a model is fit to 3/4 of the data (note the slight differences in coefficients as compared to Table 8.8), and then predictions are made on the holdout sample.
By repeating the process for each holdout quarter sample, the residuals from the model can be plotted against the predicted values. We see that the predictions are scattered which shows a good model fit but that the prediction errors vary \(\pm\) 1000g of the true body mass.
The same process is repeated for the larger number of explanatory variables. Note that the coefficients estimated for the first cross validation model (in Figure 8.24) are slightly different from the estimates computed on the entire dataset (seen in Table 8.9). Figure 8.24 displays the cross validation process for the multivariable model with a full set of residual plots given in Figure 8.25. Note that the residuals are mostly within \(\pm\) 500g, providing much more precise predictions for the independent body mass values of the individual penguins.
Figure 8.23 shows that the independent predictions are centered around the true values (i.e., errors are centered around zero), but that the predictions can be as much as 1000g off when using only bill_length_mm
to predict body_mass_g
.
On the other hand, when using bill_length_mm
, bill_depth_mm
, flipper_length_mm
, sex
, and species
to predict body_mass_g
, the prediction errors seem to be about half as big, as seen in Figure 8.25.
We have provided a very brief overview to and example using cross validation. Cross validation is a computational approach to model building and model validation as an alternative to reliance on pvalues. While pvalues have a role to play in understanding model coefficients, throughout this text, we have continued to present computational methods that broaden statistical approaches to data analysis. Cross validation will be used again in Section 8.4 with logistic regression. We encourage you to consider both standard inferential methods (such as pvalues) and computational approaches (such as cross validation) as you build and use multivariable models of all varieties.
8.3.4 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
8.4 Inference for logistic regression
As with multiple linear regression, the inference aspect for logistic regression will focus on interpretation of coefficients and relationships between explanatory variables. Both pvalues and cross validation will be used for assessing a logistic regression model.
Consider the email
data which describes email characteristics which can be used to predict whether a particular incoming email is SPAM (unsolicited bulk email).
Without reading every incoming message, it might be nice to have an automated way to identify SPAM emails.
Which of the variables describing each email are important for predicting the status of the email?
variable  description 

spam  Indicator for whether the email was spam. 
to_multiple  Indicator for whether the email was addressed to more than one recipient. 
from  Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). 
cc  Number of people cc’ed. 
sent_email  Indicator for whether the sender had been sent an email in the last 30 days. 
attach  The number of attached files. 
dollar  The number of times a dollar sign or the word “dollar” appeared in the email. 
winner  Indicates whether “winner” appeared in the email. 
format  Indicates whether the email was written using HTML (e.g. may have included bolding or active links). 
re_subj  Whether the subject started with “Re:” “RE:” “re:” or “rE:” 
exclaim_subj  Whether there was an exclamation point in the subject. 
urgent_subj  Whether the word “urgent” was in the email subject. 
exclaim_mess  The number of exclamation points in the email message. 
number  Factor variable saying whether there was no number, a small number (under 1 million), or a big number. 
8.4.1 Multiple logistic regression output from software
As learned in Chapter 4, optimization can be used to find the coefficient estimates for the logistic model. The unknown population model can be written as:
\[ \begin{align*} \log_e\bigg(\frac{p}{1p}\bigg) = \beta_0 &+ \beta_1 \times \texttt{to_multiple} + \beta_2 \times \texttt{cc} \\ &+ \beta_3 \times \texttt{dollar} + \beta_4 \times \texttt{urgent_subj} \end{align*} \]
The estimated equation for the regression model may be written as a model with four predictor variables (where \(\hat{p}\) is the estimated probability of being a SPAM email message):
\[ \begin{align*} \log_e\bigg(\frac{\hat{p}}{1\hat{p}}\bigg) = 2.05 &+ 1.91 \times \texttt{to_multiple} + 0.02 \times \texttt{cc} \\ & 0.07 \times \texttt{dollar} + 2.66 \times \texttt{urgent_subj} \end{align*} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  2.05  0.06  34.67  <0.0001 
to_multiple1  1.91  0.30  6.37  <0.0001 
cc  0.02  0.02  1.16  0.24502 
dollar  0.07  0.02  3.38  0.00071 
urgent_subj1  2.66  0.80  3.32  0.00091 
Not only does Table 8.10 provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which are the focus of this chapter.
As in Section 8.3, with multiple predictors, each hypothesis test (for each of the explanatory variables) is conditioned on each of the other variables remaining in the model.
if multiple predictors \(H_0: \beta_i = 0\) given other variables in the model
Using the example above and focusing on each of the variable pvalues (here we won’t discuss the pvalue associated with the intercept), we can write out the four different hypotheses (associated the pvalue corresponding to each of the coefficients / row in Table 8.10):

\(H_0: \beta_1 = 0\) given
cc
,dollar
, andurgent_subj
are included in the model 
\(H_0: \beta_2 = 0\) given
to_multiple
,dollar
, andurgent_subj
are included in the model 
\(H_0: \beta_3 = 0\) given
to_multiple
,cc
, andurgent_subj
are included in the model 
\(H_0: \beta_4 = 0\) given
to_multiple
,dollar
, anddollar
are included in the model
The very low pvalues from the software output tell us that three of the variables (that is, not cc
) act as important predictors in the model, despite the inclusion of any of the other variables.
Consider the pvalue on \(H_0: \beta_1\).
The low pvalue says that it would be extremely unlikely to see data that produce a coefficient on to_multiple
as large as 1.91 if the true relationship between to_multiple
and spam
was nonexistent (i.e., if \(\beta_1 = 0\)) and the model also included cc
and dollar
and urgent_subj
.
Note also that the coefficient on dollar
is quite significant, but the magnitude of the coefficient is small (0.07).
It turns out that in units of standard errors (0.02 here), 0.07 is actually quite far from zero, it’s all about context!
The pvalues on the remaining variables are be interpreted similarly.
From the initial output (pvalues) in Table 8.10 it seems as though to_multiple
, dollar
, and urgent_subj
are important variables for modeling whether an email is SPAM.
We remind you that although pvalues provide some information about the model, there are many aspects to consider when choosing the best model.
As with linear regression (see Section 8.3.2), correlated explanatory variables can impact both the coefficient estimates and the associated pvalues. Investigating multicollinearity in a logistic regression model is saved for a text which provides more detail about logistic regression. Next, as a model building alternative (or enhancement) to pvalues, we revisit cross validation within the context of predicting SPAM status for each of the individual emails.
8.4.2 Cross validation for prediction error
The pvalue is a probability measure under a setting of no relationship.
That pvalue provides information about the degree of the relationship (e.g., above we measure the relationship between spam
and to_multiple
using a pvalue), but the pvalue does not measure how well the model will predict the individual emails (e.g., the accuracy of the model).
Depending on the goal of the research project, you might be inclined to focus on variable importance (through pvalues) or you might be inclined to focus on prediction accuracy (through cross validation).
Here we present a method for using cross validation accuracy to determine which variables (if any) should be used in a model which predicts whether an email is SPAM.
A full treatment of cross validation and logistic regression models is beyond the scope of this text.
Using cross validation, we can build \(k\) different models which are used to predict the observations in each of the \(k\) holdout samples.
The smaller model uses only the to_multiple
variable, see the complete dataset (not cross validated) model output in Table 8.11.
The logistic regression model can be written as (where \(\hat{p}\) is the estimated probability of being a SPAM email message):
\[ \log_e\bigg(\frac{\hat{p}}{1\hat{p}}\bigg) = 2.12 + 1.81\times \mbox{to_multiple} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  2.12  0.06  37.67  <0.0001 
to_multiple1  1.81  0.30  6.09  <0.0001 
For each cross validated model, the coefficients change slightly, and the model is used to make independent predictions on the holdout sample. The model from the first cross validation sample is given in Table 8.26 and can be compared to the coefficients in Table 8.11.
fold  count  accuracy  notspamTP  spamTP 

1st quarter  980  0.26  0.19  0.98 
2nd quarter  981  0.23  0.15  0.96 
3rd quarter  979  0.25  0.18  0.96 
4th quarter  981  0.24  0.17  0.98 
Because the SPAM dataset has a ratio of 90% nonSPAM and 10% SPAM emails, a model which randomly guessed all nonSPAM would have an overall accuracy of 90%! Clearly, we’d like to capture the information with the SPAM emails, so our interest is in the percent of SPAM emails which are identified as SPAM (see Table 8.12). Additionally, in the logistic regression model, we use a 10% cutoff to predict whether or not the email is SPAM. Fortunately, we’ve done a great job of predicting SPAM! However, the tradeoff was that most of the nonSPAM emails are now predicted to be SPAM which is not acceptable for a SPAM prediction algorithm. Adding more variables to the model may help with both the SPAM and notSPAM predictions.
The larger model uses to_multiple
, attach
, winner
, format
, re_subj
, exclaim_mess
, and number
as the set of seven predictor variables, see the complete dataset (not cross validated) model output in Table 8.13.
The logistic regression model can be written as (where \(\hat{p}\) is the estimated probability of being a SPAM email message):
\[ \begin{align*} \log_e\bigg(\frac{\hat{p}}{1\hat{p}}\bigg) = 0.34 & 2.56 \times \texttt{to_multiple} + 0.20 \times \texttt{attach} \\ &+ 1.73 \times \texttt{winner}_{yes}  1.28 \times \texttt{format} \\ & 2.86 \times \texttt{re_subj} + 0 \times \texttt{exclaim_mess} \\ & 1.07 \times \texttt{number}_{small}  0.42 \times \texttt{number}_{big} \end{align*} \]
term  estimate  std.error  statistic  p.value 

(Intercept)  0.34  0.11  3.02  0.00255 
to_multiple1  2.56  0.31  8.28  <0.0001 
attach  0.20  0.06  3.29  0.00099 
winneryes  1.73  0.33  5.33  <0.0001 
format1  1.28  0.13  9.80  <0.0001 
re_subj1  2.86  0.37  7.83  <0.0001 
exclaim_mess  0.00  0.00  0.26  0.79247 
numbersmall  1.07  0.14  7.54  <0.0001 
numberbig  0.42  0.20  2.10  0.03569 
fold  count  accuracy  notspamTP  spamTP 

1st quarter  980  0.77  0.77  0.71 
2nd quarter  981  0.80  0.81  0.70 
3rd quarter  979  0.76  0.77  0.65 
4th quarter  981  0.78  0.79  0.75 
Somewhat expected, the larger model (see Table 8.14) was able to capture more nuance in the emails which lead to better predictions. However, it is not true that adding variables will always lead to better predictions, as correlated or noise variables may dampen the signal from those variables which truly predict the SPAM status. We encourage you to learn more about multiple variable models and cross validation in your future exploration of statistical topics.
8.4.3 Exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
8.5 Chapter review
Throughout the text, we have presented a modern view to introduction to statistics. Early we presented graphical techniques which communicated relationships across multiple variables. We also used modeling to formalize the relationships. Many chapters were dedicated to inferential methods which allowed claims about the population to be made based on samples of data. Not only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods. In Chapter @ref{inferencereg} we brought many of the ideas together by considering inferential claims on models which include many variables. We continue to emphasize the importance of experimental design in making conclusions about research claims. In particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure 1.11).
As you might guess, this text has only scratched the surface of the world of statistical analyses that can be applied to different datasets. In particular, to do justice to the topic, the linear models and generalized linear models we have introduced can each be covered with their own course or book. Hierarchical models, alternative methods for fitting parameters (e.g., Ridge Regression or LASSO), and advanced computational methods applied to models (e.g., permuting the response variable? one explanatory variable? all the explanatory variables?) are all beyond the scope of this book. However, your successful understanding of the ideas we have covered has set you up perfectly to move on to a higher level of statistical modeling and inference. Enjoy!
8.5.1 Terms
We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However you should be able to easily spot them as bolded text.
8.5.2 Chapter exercises
Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.
8.5.3 Interactive R tutorials
Navigate the concepts you’ve learned in this chapter in R using the following selfpaced tutorials. All you need is your browser to get started!
You can also access the full list of tutorials supporting this book here.