8 Inference for regression

We now bring together ideas of inferential analyses from Chapter 5 with the descriptive models seen in Chapters 3 and 4. While the setting now focuses on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample. The sampling variability will inform the conclusions that can be drawn about the population.

Many of the inferential ideas are remarkably similar to those covered in previous chapters. The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.

We encourage the reader to think broadly about the models at hand without placing too much dependence on the exact p-values reported by statistical software. Inference on models with multiple explanatory variables can suffer from data snooping, which results in false positive claims. We provide some guidance and hope the reader will further their statistical learning after working through the material in this text.

8.1 Inference for linear regression

In this chapter, we bring together the inferential ideas (see Chapter 5) used to make claims about a population from information in a sample and the modeling ideas seen in Chapters 3 and 4. In particular, we will use the least squares regression line to test whether or not there is a relationship between two continuous variables. Additionally, we will build confidence intervals which quantify the slope of the linear regression line.

Observed data

We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue. The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample. Because we have made up the example and the data (and the entire population), we can take many, many samples from the population to visualize the variability. Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample. The change from sample to sample leads to an understanding of how the single observed dataset differs from the population of values, which is typically the fundamental goal of inference.

Consider the following hypothetical population of all of the sandwich stores of a particular chain seen in Figure 8.1. In this made-up world, the CEO actually has all the relevant data, which is why they can plot it here. The CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue. There appears to be a linear relationship between advertising dollars and revenue (both in $000).


Figure 8.1: Revenue as a linear model of advertising dollars for a population of sandwich stores, in $000.

You may remember from Chapter 3 that the population model is: \[y = \beta_0 + \beta_1 x + \varepsilon.\]

Again, the omniscient CEO (with the full population information) can write down the true population model as: \[\mbox{expected revenue} = 11.23 + 4.8 \times \mbox{advertising}.\]

Variability of the statistic

Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether or not to set up a new franchise. The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.

As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample. Consider the linear model which describes revenue (in $000) based on advertising dollars (in $000).

The least squares regression model uses the data to find a sample linear fit: \[\hat{y} = b_0 + b_1 x.\]

A random sample of 20 stores shows a different least squares regression line depending on which observations are selected. A subset of 20 stores shows a similar positive trend between advertising and revenue (to what we saw in Figure 8.1, which described the population) despite having fewer observations on the plot.


Figure 8.2: A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.

A second sample of size 20 also shows a positive trend!


Figure 8.3: A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.

But the line is slightly different!


Figure 8.4: The linear models from the two different random samples are quite similar, but they are not the same line.

That is, there is variability in the regression line from sample to sample. The concept of sampling variability is something you’ve seen before, but in this lesson, you will focus on the variability of the line, often measured through the variability of a single statistic: the slope of the line.


Figure 8.5: If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population.

You might notice in Figure 8.5 that the \(\hat{y}\) values given by the lines are much more consistent in the middle of the dataset than at the ends. The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud. The effect of the fan-shaped lines is that predicted revenue for advertising close to $4,000 will be much more precise than the revenue predictions made for $1,000 or $7,000 of advertising.

The distribution of slopes (for samples of size \(n=20\)) can be seen in a histogram, as in Figure 8.6.


Figure 8.6: Variability of slope estimates taken from many different samples of stores, each of size 20.
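The process of repeatedly sampling from a known population and refitting the line, as shown in Figures 8.5 and 8.6, can be imitated with a short simulation. The R sketch below is illustrative only: the population size, the spread of the errors, and the range of advertising values are all invented here, with the intercept and slope taken from the CEO's population model above.

```r
set.seed(470)

# A made-up population of sandwich stores (values invented for illustration)
population <- data.frame(advertising = runif(1000, min = 0.5, max = 7))
population$revenue <- 11.23 + 4.8 * population$advertising +
  rnorm(1000, mean = 0, sd = 4)

# Repeatedly draw samples of size 20 and record the fitted slope
slopes <- replicate(500, {
  store_sample <- population[sample(nrow(population), 20), ]
  coef(lm(revenue ~ advertising, data = store_sample))["advertising"]
})

hist(slopes)                           # variability of the slope, as in Figure 8.6
abline(v = 4.8, col = "red", lwd = 2)  # the true population slope
```

Each run of the replicate() call mimics handing a different potential buyer their own sample of 20 stores.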

Recall, the example described in this introduction is hypothetical. That is, we created an entire population in order to demonstrate how the slope of a line would vary from sample to sample. The tools in this textbook are designed to evaluate only one single sample of data.
With actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes. We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models. However, it is nice to visually consider the linear models produced by different slopes. Additionally, as with measuring the variability of previous statistics (e.g., \(\overline{X}_1 - \overline{X}_2\) or \(\hat{p}_1 - \hat{p}_2\)), the histogram of the sample statistics can provide information related to inferential considerations.

In the following sections, the distribution (i.e., histogram) of \(b_1\) (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you. First (in Section 8.1.1), the distribution of \(b_1\) when \(\beta_1 = 0\) is constructed by randomizing (permuting) the response variable. Next (in Section 8.1.2), we can bootstrap the data by taking random samples of size n from the original dataset. And last (in Section 8.1.3), we use mathematical tools to describe the variability using the \(t\)-distribution that was first encountered in Section 7.1.2.

8.1.1 Randomization test for \(H_0: \beta_1= 0\)

Consider the data on Global Crop Yields compiled by Our World in Data and presented as part of the TidyTuesday series seen in Figure 8.7. The scientific research interest at hand will be in determining the linear relationship between wheat yield (for a country-year) and other crop yields. The dataset is quite rich and deserves exploring, but for this example, we will focus only on the annual crop yield in the United States.


Figure 8.7: Yield (in tonnes per hectare) for six different crops in the US. The color of the dot indicates the year.

As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected. In the linear model setting, we might hope to have a linear relationship between maize and wheat in settings where maize production is known and wheat production needs to be predicted.

The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter. Here the population refers to a larger set of years where maize and wheat are both grown in the US.

  • \(H_0: \beta_1= 0\), there is no linear relationship between wheat and maize.
  • \(H_A: \beta_1 \ne 0\), there is some linear relationship between wheat and maize.

Recall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables. That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but not due to variables being correlated. Figure 8.8 shows the observed data and a scatterplot of one permutation of the wheat variable. The careful observer can see that each of the observed values for wheat (and for maize) exists in both the original data plot as well as the permuted wheat plot, but the given wheat and maize yields are no longer matched for a given year. That is, each wheat yield is randomly assigned to a new maize yield.


Figure 8.8: Original (left) and permuted (right) data. The permutation removes the linear relationship between wheat and maize. Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true).

By repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship). The randomization test compares the slopes calculated from the permuted response variable with the observed slope. If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).

Observed data

We will continue to use the crop data to investigate the linear relationship between wheat and maize. Note that the least squares model (see Chapter 3) describing the relationship is given in Table 8.1. The columns in Table 8.1 are further described in Section 8.1.3.

Table 8.1: The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.195.
term estimate std.error statistic p.value
(Intercept) 1.033 0.091 11.3 0
maize 0.195 0.012 16.4 0

Variability of the statistic

After permuting the data, the least squares estimate of the line can be computed. Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between wheat and maize. Figure 8.9 shows two different permutations of wheat and the resulting linear models.


Figure 8.9: Two different permutations of the wheat variable with slightly different least squares regression lines.

As you can see, sometimes the slope of the permuted data is positive, and sometimes it is negative. Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect the center of the randomized slope distribution to be zero.

Observed statistic vs. null statistics


Figure 8.10: Histogram of slopes given different permutations of the wheat variable. The vertical red line is at the observed value of the slope, 0.195.

As we can see from Figure 8.10, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the wheat variable. That is, if indeed there were no linear relationship between wheat and maize, the natural variability of the slopes would produce estimates between approximately -0.1 and +0.1. We reject the null hypothesis. Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between wheat and maize crop yield in the US.
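The permutation scheme above can be written in a few lines of R. The sketch below is a rough outline under stated assumptions: the US yields are assumed to sit in a data frame called us_crops with one row per year and columns wheat and maize (the text does not name the object), and 1,000 permutations are used.

```r
set.seed(470)

# Observed slope (0.195 in Table 8.1); us_crops is an assumed data frame with
# one row per year and columns wheat and maize (yields in tonnes per hectare)
obs_slope <- coef(lm(wheat ~ maize, data = us_crops))["maize"]

# Null distribution: permute the response so any wheat-maize link is broken
null_slopes <- replicate(1000, {
  shuffled <- us_crops
  shuffled$wheat <- sample(shuffled$wheat)
  coef(lm(wheat ~ maize, data = shuffled))["maize"]
})

hist(null_slopes)                           # compare with Figure 8.10
abline(v = obs_slope, col = "red", lwd = 2)

# Two-sided p-value: proportion of permuted slopes at least as extreme as observed
mean(abs(null_slopes) >= abs(obs_slope))
```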

8.1.2 Bootstrap confidence interval for \(\beta_1\)

As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test). Because interest is now in creating a CI, there is no null hypothesis, so there won’t be any reason to permute either of the variables.

Observed data

Returning to the crop data, we may want to consider the relationship between peas and wheat. Are peas a good predictor of wheat? And if so, what is their relationship? That is, what is the slope that models average wheat yield as a function of peas?


Figure 8.11: Original data: wheat yield as a linear model of peas yield, in tonnes per hectare. Notice that the relationship between peas and wheat is not as strong as the relationship we saw previously between maize and wheat.

Variability of the statistic

Because we are not focused on a null distribution, we sample with replacement \(n=58\) observations from the original dataset. Recall that with bootstrapping we always resample the same number of observations as we start with in order to mimic the process of taking a sample from the population. When sampling in the linear model case, consider each observation to be a single dot. If the dot is resampled, both the wheat and the peas measurement are observed. The measurements are linked to the dot (i.e., to the year in which the measurements were taken).


Figure 8.12: Original and one bootstrap sample of the crop data. Note that it is difficult to differentiate the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are plotted as points on top of one another. The orange circle represents points in the original data which were not included in the bootstrap sample. The blue circle represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circle represents a particular structure to the data which is observed in both the original and bootstrap samples.

Figure 8.12 shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models. The orange circle represents points in the original data which were not included in the bootstrap sample. The blue circle represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circle represents a particular structure to the data which is observed in both the original and bootstrap samples. By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure 8.13.


Figure 8.13: Repeated bootstrap resamples of size 58 are taken from the original data. Each of the bootstrapped linear models is slightly different.

Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples. Figure 8.14 is a histogram of the relevant bootstrapped slopes. We can see that a 95% bootstrap percentile interval for the true population slope is given by (0.061, 0.52). We are 95% confident that for the model describing the population of crops of peas and wheat, a one unit increase in peas yield (in tonnes per hectare) will be associated with an increase in predicted average wheat yield of between 0.061 and 0.52 tonnes per hectare.


Figure 8.14: The original crop data on wheat and peas is bootstrapped 1,000 times. The histogram provides a sense for the variability of the linear model slope from sample to sample.
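A bootstrap percentile interval like the one above can be computed with a short loop. The sketch below is only an outline and makes assumptions: the data are again assumed to live in a data frame called us_crops, here with columns peas and wheat, and whole rows (years) are resampled so that each wheat value stays attached to its peas value.

```r
set.seed(470)

# Resample n = 58 whole rows (years) with replacement and refit the line each time
boot_slopes <- replicate(1000, {
  rows <- sample(nrow(us_crops), replace = TRUE)
  coef(lm(wheat ~ peas, data = us_crops[rows, ]))["peas"]
})

hist(boot_slopes)                               # compare with Figure 8.14

# 95% bootstrap percentile interval for the slope
quantile(boot_slopes, probs = c(0.025, 0.975))
```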

8.1.3 Mathematical model

When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter. The approximations build on the t-distribution, which was described in Chapter 7. The mathematical model is often appropriate and is usually easy to implement computationally. The validity of the technical conditions will be considered in detail in Section 8.2.

In this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line. Just as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.

Midterm elections and unemployment

Observed data

Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election. The set of House elections occurring during the middle of a Presidential term are called midterm elections. In America’s two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President’s party will do in the midterm elections. In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.

To assess the validity of this claim, we can compile historical data and look for a connection. We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression. The House of Representatives is made up of 435 voting members.

Figure 8.15 shows these data and the least-squares regression line:

\[ \begin{align*} &\text{\% change in House seats for President's party} \\ &\qquad\qquad= -7.36 - 0.89 \times \text{(unemployment rate)} \end{align*} \]

We consider the percent change in the number of seats of the President’s party (e.g. percent change in the number of seats for Republicans in 2018) against the unemployment rate.

Examining the data, there are no clear deviations from linearity or substantial outliers (see Section 3.1.3 for a discussion on using residuals to visualize how well a linear model fits the data). While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.


Figure 8.15: The percent change in House seats for the President’s party in each election from 1898 to 2010 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data.

The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively. Do you agree that they should be removed for this investigation? Why or why not?187

There is a negative slope in the line shown in Figure 8.15. However, this slope (and the y-intercept) are only estimates of the parameter values. We might wonder, is this convincing evidence that the “true” linear model has a negative slope? That is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of the midterm election? We can frame this investigation into a statistical hypothesis test:

  • \(H_0\): \(\beta_1 = 0\). The true linear model has slope zero.
  • \(H_A\): \(\beta_1 \neq 0\). The true linear model has a slope different than zero. The unemployment is predictive of whether the President’s party wins or loses seats in the House of Representatives.

We would reject \(H_0\) in favor of \(H_A\) if the data provide strong evidence that the true slope parameter is different than zero. To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.

Regression output from software

Variability of the statistic

Just like other point estimates we have seen before, we can compute a standard error and test statistic for \(b_1\). We will generally label the test statistic using a \(T\), since it follows the \(t\)-distribution.

We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course. Table 8.2 shows software output for the least squares regression line in Figure 8.15. The row labeled unemp includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).

Table 8.2: Output from statistical software for the regression line modeling the midterm election losses for the President’s party as a response to unemployment.
term estimate std.error statistic p.value
(Intercept) -7.36 5.155 -1.43 0.165
unemp -0.89 0.835 -1.07 0.296

What do the first and second columns of Table 8.2 represent?


The entries in the first column represent the least squares estimates, \(b_0\) and \(b_1\), and the values in the second column correspond to the standard errors of each estimate. Using the estimates, we could write the equation for the least squares regression line as

\[ \hat{y} = -7.36 - 0.89 x \]

where \(\hat{y}\) in this case represents the predicted change in the number of seats for the president’s party, and \(x\) represents the unemployment rate.

We previously used a \(t\)-test statistic for hypothesis testing in the context of numerical data. Regression is very similar. In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T-score formula:

\[ T \ = \ \frac{\text{estimate} - \text{null value}}{\text{SE}} = \ \frac{-0.89 - 0}{0.835} = \ -1.07 \]

This corresponds to the third column of Table 8.2.

Use Table 8.2 to determine the p-value for the hypothesis test.


The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.296. That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President’s party in the House of Representatives in midterm elections.
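The statistic and p-value columns of Table 8.2 can be reproduced directly from the estimate and standard error. The R sketch below is illustrative and relies on assumed names: a data frame called midterms with the unemployment rate in unemp and the percent change in House seats in house_change (only unemp is named in the text).

```r
# Fit the least squares line; midterms and house_change are assumed names
fit <- lm(house_change ~ unemp, data = midterms)
summary(fit)$coefficients   # estimate, std.error, statistic, p.value as in Table 8.2

# The statistic column is (estimate - 0) / std.error, e.g. for the slope:
(-0.89 - 0) / 0.835         # approximately -1.07

# and its two-sided p-value comes from a t-distribution with df = n - 2
2 * pt(-1.07, df = nrow(midterms) - 2)
```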

Observed statistic vs. null statistics

As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether or not the data could have come from a population where the true slope was zero (i.e., \(\beta_1 = 0\)). Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition. Based on everything we’ve seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.

Examine Figure 5.20, which relates the Elmhurst College aid and student family income. How sure are you that the slope is statistically significantly different from zero? That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?


While the relationship between the variables is not perfect, there is an evident decreasing trend in the data. This suggests the hypothesis test will reject the null claim that the slope is zero.

The point of the tools in this section is to go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about the statistical significance of the slope estimate.

Table 8.3: Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.
term estimate std.error statistic p.value
(Intercept) 24319.329 1291.450 18.83 0
family_income -0.043 0.011 -3.98 0

Table 8.3 shows statistical software output from fitting the least squares regression line shown in Figure 5.20. Use the output to formally evaluate the following hypotheses.188

  • \(H_0\): The true coefficient for family income is zero.
  • \(H_A\): The true coefficient for family income is not zero.

Inference for regression

We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice. However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met. See Section 8.2.

Confidence interval for a coefficient

Observed data

Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.

Compute the 95% confidence interval for the coefficient using the regression output from Table 8.3.


The point estimate is -0.0431 and the standard error is \(SE = 0.0108\). When constructing a confidence interval for a model coefficient, we generally use a \(t\)-distribution. The degrees of freedom for the distribution are noted in the regression output, \(df = 48\), allowing us to identify \(t_{48}^{\star} = 2.01\) for use in the confidence interval.

We can now construct the confidence interval in the usual way:

\[ \begin{align*} \text{point estimate} &\pm t_{48}^{\star} \times SE \\ -0.0431 &\pm 2.01 \times 0.0108 \\ (-0.0648 &, -0.0214) \end{align*} \]

We are 95% confident that with each additional dollar of family income, the university’s gift aid is predicted to decrease on average by $0.0214 to $0.0648.

Variability of the statistic

Confidence intervals for coefficients

Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the \(t\)-distribution:

\[ b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} \]

where \(t_{df}^{\star}\) is the appropriate \(t\)-value corresponding to the confidence level with the model’s degrees of freedom.
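The critical value and the interval itself are easy to reproduce. The sketch below uses the numbers from Table 8.3; the model object name elmhurst_fit is an assumption, used only to point out that confint() performs the same calculation.

```r
# Critical value for 95% confidence with the model's 48 degrees of freedom
t_star <- qt(0.975, df = 48)            # approximately 2.01

# Interval for the family_income coefficient, using the values in Table 8.3
-0.0431 + c(-1, 1) * t_star * 0.0108    # roughly (-0.0648, -0.0214)

# Equivalently, for a fitted model object (elmhurst_fit is an assumed name):
# confint(elmhurst_fit, level = 0.95)
```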

On the topic of intervals in this book, we’ve focused exclusively on confidence intervals for model parameters. However, there are other types of intervals that may be of interest, including prediction intervals for a response value and also confidence intervals for a mean response value in the context of regression.
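As a rough sketch of how those other intervals are obtained in R, predict() can return either a prediction interval for an individual response or a confidence interval for the mean response at a new value of the explanatory variable. The object fit and the column name x below are placeholders for any fitted simple linear regression.

```r
# fit is a fitted model, e.g. fit <- lm(y ~ x, data = dat); x = 5 is a new value
new_obs <- data.frame(x = 5)

predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)  # individual response
predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)  # mean response
```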

8.1.4 Exercises

Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.

8.2 Checking model conditions

In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions. In this section, we’ll provide details for when the mathematical model is appropriate, along with a discussion of the technical conditions needed for the randomization and bootstrapping procedures.

What are the technical conditions for the mathematical model?

When fitting a least squares line, we generally require

  • Linearity. The data should show a linear trend. If there is a nonlinear trend (e.g., first panel of Figure 8.16), an advanced regression method from another book or later course should be applied.

  • Independent observations. Be cautious about applying regression to data that are sequential observations in time, such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis. An example of a dataset where successive observations are not independent is shown in the fourth panel of Figure 8.16. There are also other instances where correlations within the data are important, which is further discussed in Chapter 4.

  • Nearly normal residuals. Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we’ll talk about more in Section 3.3. An example of a residual that would be a potential concern is shown in the second panel of Figure 8.16, where one observation is clearly much further from the regression line than the others.

  • Constant or equal variability. The variability of points around the least squares line remains roughly constant. An example of non-constant variability is shown in the third panel of Figure 8.16, which represents the most common pattern observed when this condition fails: the variability of \(y\) is larger when \(x\) is larger.


Figure 8.16: Four examples showing when the methods in this chapter are insufficient to apply to the data. The top set of graphs represents the \(x\) and \(y\) relationship. The bottom set of graphs is a residual plot. First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of \(x\). Fourth panel: a time series data set is shown, where successive observations are highly correlated.

Should we have concerns about applying least squares regression to the Elmhurst data in Figure 3.14?189

The technical conditions are often remembered using the LINE mnemonic. The linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure 8.16. A careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.

  • L: linear model
  • I: independent observations
  • N: points are normally distributed around the line
  • E: equal variability around the line for all values of the explanatory variable
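For a fitted least squares model, the residual plots used to assess the L, N, and E conditions can be produced with a few lines of R. The sketch below is generic: fit stands in for any lm() object.

```r
# fit is any least squares fit, e.g. fit <- lm(y ~ x, data = dat)
res <- residuals(fit)

# Linearity and equal variability: residuals vs. fitted values should show
# no curvature and a roughly constant spread around zero
plot(fitted(fit), res)
abline(h = 0, lty = 2)

# Normality: histogram and normal quantile-quantile plot of the residuals
hist(res)
qqnorm(res)
qqline(res)
```

Independence, the I condition, cannot be checked from residual plots alone; it comes from knowing how the data were collected.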

Why do we need technical conditions?

As with other inferential techniques we have covered in this text, if the technical conditions above don’t hold, then it is not possible to make reliable claims about the population. That is, without the technical conditions, the T-score (or Z-score) will not have the assumed t-distribution (or standard normal Z distribution). That said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions. If there are large deviations, we will be unable to trust the calculated p-value or the endpoints of the resulting confidence interval.

The model based on Linearity

The linearity condition is among the most important if your goal is to understand a linear model between \(x\) and \(y\). For example, the value of the slope will not be at all meaningful if the true relationship between \(x\) and \(y\) is quadratic, as in Figure 3.3. Not only should we be cautious about the inference, but the model itself is also not an accurate portrayal of the relationship between the variables.

In Section 8.3 we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones. However, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.

The importance of Independence

The technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose. It is also extremely difficult to gather a dataset which is a true random sample from the population of interest. (Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)

Dependent observations can bias results in ways that produce fundamentally flawed analyses. For example, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university. At best it is a model describing students who use the gym (but also who are willing to talk to you, who use the gym at the times you were there measuring, etc.).

In lieu of trying to answer whether or not your observations are a true random sample, you might instead focus on whether or not you believe your observations are representative of the population. Humans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.

Some thoughts on Normality

The normality condition requires that points vary symmetrically around the line, spreading out in a bell-shaped fashion. You should consider the “bell” of the normal distribution as sitting on top of the line (coming off the paper in a 3-D sense) so as to indicate that the points are dense close to the line and disperse gradually as they get farther from the line.

The normality condition is less important than linearity or independence for a few reasons. First, the linear model fit with least squares will still be an unbiased estimate of the true population model. However, the standard errors associated with variability of the line will not be well estimated. Fortunately the Central Limit Theorem tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough. One analysis method that does require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given \(x\) value, using the linear model. One additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.

Equal variability for prediction in particular

As with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of \(x\)) will not cause problems for the estimate of the linear model. That said, the inference on the model (e.g., computing p-values) will be incorrect if the variability around the line is heterogeneous. Data that exhibit non-equal variance across the range of x-values have the potential to seriously mis-estimate the variability of the slope, which will have consequences for the inference results (i.e., hypothesis tests and confidence intervals).

The inference results for both a randomization test and a bootstrap confidence interval are robust to the equal variability condition, so they give the analyst methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the regression line). Although randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observations). When the equal variability condition is violated and a mathematical analysis (e.g., p-value from T-score) is needed, there are other existing methods (outside the scope of this text) which can easily handle the unequal variance (e.g., weighted least squares analysis).

What if all the technical conditions are met?

When the technical conditions are met, the least squares regression model and the associated inference are provided by virtually all statistical software. In addition to being ubiquitous, however, an additional advantage of the least squares regression model (and related inference) is that the linear model has important extensions (which are not trivial to implement with bootstrapping and randomization tests). In particular, random effects models, repeated measures, and interaction are all linear model extensions which require the above technical conditions. When the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand. We will discuss some of the extended modeling and associated inference in Section 8.3 and Section 8.4. Many of the techniques used to deal with technical condition violations are outside the scope of this text, but they are taught in universities in the very next class after this one. If you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.

8.2.1 Exercises

Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.

8.3 Inference for multiple regression

In Chapter 4, the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable. Here, we discuss whether each of the variables individually is a significant predictor or whether the model might be just as strong without that variable. That is, as before, we apply inferential methods to ask whether the data could have come from a population where the particular coefficient at hand was zero. If one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero. The inference task at hand is to decide whether the coefficient’s difference from zero is large enough to conclude that the data could not plausibly have come from a model where the true population coefficient is zero. Both the derivations from the mathematical model and the randomization model are beyond the scope of this book, but we are able to calculate p-values using statistical software. We will discuss interpreting p-values in the multiple regression setting and note some scenarios where careful understanding of the context and the relationship between variables is important.

8.3.1 Multiple regression output from software

Recall the loans data from Chapter 4.

The data can be found in the openintro package: loans_full_schema. Based on the data in this dataset we have created two new variables: credit_util, which is calculated as the total credit utilized divided by the total credit limit, and bankruptcy, which turns the number of bankruptcies into an indicator variable (0 for no bankruptcies and 1 for at least one bankruptcy). We will refer to this modified dataset as loans.

Now, our goal is to create a model where interest_rate can be predicted using the variables debt_to_income, term, and credit_checks.
As learned in Chapter 4, least squares can be used to find the coefficient estimates for the linear model. The unknown population model can be written as: \[E[\mbox{interest_rate}] = \beta_0 + \beta_1\times \mbox{debt_to_income} + \beta_2 \times \mbox{term} + \beta_3 \times \mbox{credit_checks}\]
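Fitting this model in R takes a single call to lm(); the sketch below assumes the modified loans data frame described above is already in hand (the construction of credit_util and bankruptcy is omitted).

```r
# loans is the modified dataset built from openintro::loans_full_schema
rate_fit <- lm(interest_rate ~ debt_to_income + term + credit_checks,
               data = loans)

# Estimates, standard errors, t statistics, and p-values, as summarized in Table 8.4
summary(rate_fit)$coefficients
```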

Table 8.4: Summary of a linear model for predicting interest rate based on the variables debt_to_income, term, and credit_checks. Each of the variables has its own coefficient estimate as well as p-value significance.
term estimate std.error statistic p.value
(Intercept) 4.309 0.195 22.1 <0.0001
debt_to_income 0.041 0.003 13.3 <0.0001
term 0.158 0.004 37.9 <0.0001
credit_checks 0.247 0.019 12.8 <0.0001

The estimated equation for the regression model may be written as a model with three predictor variables:

\[ \widehat{\texttt{interest_rate}} = 4.31 + 0.041 \times \texttt{debt_to_income} + 0.16 \times \texttt{term} + 0.25 \times \texttt{credit_checks} \]

Not only does Table 8.4 provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.

In Section 8.1, we learned that the hypothesis test for a linear model with one predictor can be written as:

if only one predictor, \(H_0: \beta_1 = 0.\)

That is, if the true population slope is zero, the p-value measures how likely it would be to select data which produced a slope (\(b_1\)) at least as far from zero as the one observed.

With multiple predictors, the hypothesis is similar; however, it is now conditioned on each of the other variables remaining in the model.

if multiple predictors, \(H_0: \beta_i = 0\) given other variables in the model

Using the example above and focusing on each of the variable p-values (here we won’t discuss the p-value associated with the intercept), we can write out the three different hypotheses:

  • \(H_0: \beta_1 = 0\), given term and credit_checks are included in the model
  • \(H_0: \beta_2 = 0\), given debt_to_income and credit_checks are included in the model
  • \(H_0: \beta_3 = 0\), given debt_to_income and term are included in the model

The very low p-values from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of the other two. Consider the p-value on \(H_0: \beta_1 = 0\). The low p-value says that it would be extremely unlikely to see data that produce a coefficient on debt_to_income as large as 0.041 if the true relationship between debt_to_income and interest_rate was non-existent (i.e., if \(\beta_1 = 0\)) and the model also included term and credit_checks. You might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the problem, 0.041 turns out to be far away from zero; it’s all about context! The p-values on term and on credit_checks are interpreted similarly.

Sometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor variables themselves being correlated.

8.3.2 Multicollinearity

In practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model. For regression models, it is important to understand the entire context of the model, particularly for correlated variables. Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the significance (i.e., the p-value) of each coefficient.

Consider an example where we’d like to predict how much money is in a coin dish based only on the number of coins in the dish. We ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar amount, the total number of coins, and the total number of low coins.190 The number of low coins is the number of coins minus the number of quarters (a quarter is the largest commonly used US coin, at US$0.25). Figure 8.17 illustrates a sample of U.S. coins, their total worth (amount), the number of total coins, and the number of low coins.


Figure 8.17: A sample of coins with 16 total coins, 10 low coins, and a net worth of $1.90.

The collected data is given in Figure 8.18 and shows that the total amount of money is more highly correlated with the total number of coins than it is with the number of low coins. We also note that the total number of coins and the number of low coins are positively correlated.


Figure 8.18: Plot describing the amount of money (US$) as a function of the number of coins and the number of low coins. As you might expect, the amount of money is more highly positively correlated with the total number of coins than with the number of low coins.

Using the total number of coins as the predictor variable, Table 8.5 shows that the least squares estimate of the coefficient is 0.13. For every additional coin in the dish, we would predict that the student had US$0.13 more. The \(b_1 = 0.13\) coefficient is highly significant, suggesting we would not have seen data like this if the number of coins and the amount of money were not linearly related.

\[ \widehat{\mbox{amount}} = 0.55 + 0.13 \times \mbox{number of coins} \]

Table 8.5: Linear model output predicting the total amount of money based on the total number of coins.
term estimate std.error statistic p.value
(Intercept) 0.55 0.44 1.23 0.2301
number.of.coins 0.13 0.02 5.54 <0.0001

Using the number of low coins as the predictor variable, Table 8.6 shows that the least squares estimate of the coefficient is 0.02. For every additional low coin in the dish, we would predict that the student had US$0.02 more. The \(b_1 = 0.02\) coefficient is not at all significant, suggesting we could easily have seen data like ours even if the number of low coins and the amount of money are not at all linearly related.

\[ \widehat{\mbox{amount}} = 2.28 + 0.02 \times \mbox{number of low coins} \]

Table 8.6: Linear model output predicting the total amount of money based on the number of low coins.
term estimate std.error statistic p.value
(Intercept) 2.28 0.58 3.9 0.0
number.of.low.coins 0.02 0.05 0.4 0.7

Come up with an example of two observations that have the same number of low coins but the number of total coins differs by one. What is the difference in amount?



Figure 8.19: Two samples of coins with the same number of low coins (3), but a different number of total coins (4 vs 5) and a different total amount ($0.41 vs $0.66).

Come up with an example of two observations that have the same total number of coins but a different number of low coins. What is the difference in amount?



Figure 8.20: Two samples of coins with the same total number of coins (4), but a different number of low coins (3 vs 4) and a different total amount ($0.41 vs $0.17).

Using both the total number of coins and the number of low coins as predictor variables, Table 8.7 provides the least squares estimates of both coefficients as 0.21 and -0.16. Now, with two variables in the model, the interpretation is more nuanced.

  • The coefficient of each variable describes the change in the response associated with a change in that variable while keeping the other variable constant.
    For every additional coin in the dish while the number of low coins stays constant, we would predict that the student had US$0.21 more. Re-considering the phrase “every additional coin in the dish while the number of low coins stays constant” makes us realize that each increase is a single additional quarter (larger sample sizes would have led to a \(b_1\) coefficient closer to 0.25 because of the deterministic relationship described here).
  • For every additional low coin in the dish while the number of total coins stays constant, we would predict that the student had US$0.16 less. Re-considering the phrase “every additional low coin in the dish while the number of total coins stays constant” makes us realize that a quarter is being swapped out for a penny, nickel, or dime.

Considering the coefficients across Tables 8.5, 8.6, and 8.7 within the context and knowledge we have of US coins allows us to understand the correlation between variables and why the signs of the coefficients would change depending on the model. Note also, however, that the significance on the low coin coefficient changed from Table 8.6 to Table 8.7. It makes sense that the variable describing the number of low coins provides more information about the amount of money when it is part of a model which also includes the total number of coins than it does when it is used as a single variable in a simple linear regression model.

\[ \widehat{\text{amount}} = 0.80 + 0.21 \times \text{number of coins} - 0.16 \times \text{number of low coins} \]

Table 8.7: Linear model output predicting the total amount of money based on both the total number of coins and the number of low coins.
term estimate std.error statistic p.value
(Intercept) 0.80 0.30 2.65 0.0142
number.of.coins 0.21 0.02 9.89 <0.0001
number.of.low.coins -0.16 0.03 -5.51 <0.0001

When working with multiple regression models, understanding the significance and sign of the coefficient is not always as straightforward as it was with the coin example. However, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.

Although diving into the details is beyond the scope of this text, we will provide one more reflection about multicollinearity. If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or its significance. However, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable. So if the task at hand is only to do prediction, multicollinearity is not likely to cause you substantial problems.
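The coin example can be imitated with a tiny simulation, which makes the sign flip easy to see. The sketch below does not use the classroom data (which is not provided); the counts and the US$0.05 value of a low coin are invented for illustration.

```r
set.seed(470)

# Simulated coin dishes: quarters plus low coins (pennies, nickels, dimes)
n        <- 26
quarters <- rpois(n, lambda = 8)
low      <- rpois(n, lambda = 10)
coins    <- quarters + low                  # total number of coins
amount   <- 0.25 * quarters + 0.05 * low    # each low coin worth $0.05 here, for simplicity

# Alone, the number of low coins says little about the amount
coef(lm(amount ~ low))

# Alongside the total number of coins, its coefficient becomes negative:
# holding the total fixed, each extra low coin replaces a quarter
coef(lm(amount ~ coins + low))
```

Because the amount is determined exactly by the coin counts in this simulation, the two-predictor fit recovers 0.25 and -0.20 exactly; with real dishes, where low coins are a mix of pennies, nickels, and dimes, the estimates vary, as in Table 8.7.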

8.3.3 Cross validation for prediction error

In Section 8.3.1, p-values were calculated on each of the model coefficients. The p-value gives a sense of which variables are important to the model; however, a more extensive treatment of variable selection is warranted in a follow-up course or textbook. Here, we use cross validation prediction error to focus on which variable(s) are important for predicting the response variable of interest. In general, linear models are also used to make predictions of individual observations. In addition to model building, cross validation provides a method for generating predictions that are not overfit to the particular dataset at hand. We continue to encourage you to take up further study on the topic of cross validation, as it is among the most important ideas in modern data analysis, and we are only able to scratch the surface here.

Cross validation is a computational technique which removes some observations before a model is run, then assesses the model accuracy on the held-out sample. By removing some observations, we provide ourselves with an independent evaluation of the model (that is, the removed observations do not contribute to finding the parameters which minimize the least squares equation). Cross validation can be used in many different ways (as an independent assessment), and here we will just scratch the surface with respect to one way the technique can be used to compare models. See Figure 8.21 for a visual representation of the cross validation process.


Figure 8.21: The dataset is broken into k folds (here k=4). One at a time, a model is built using k-1 of the folds, and predictions are calculated on the single held out sample which will be completely independent of the model estimation.

The data can be found in the palmerpenguins package: penguins. The observations of three different penguin species include measurements on body size and sex. The data were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER as part of the Long Term Ecological Research Network.

Our goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams. Although not exactly aligned with this research project, you might be able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial photographs) but the mass is not known. The first model will predict body_mass_g by using only bill_length_mm, a variable denoting the length of a penguin’s bill, in mm. The second model will predict body_mass_g by using bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.

The presentation below (see the comparison of Figures 8.23 and 8.25) shows that the model with more variables predicts body_mass_g with much smaller errors (predicted minus actual body mass) than the model which uses only bill_length_mm. We have deliberately used a model that intuitively makes sense (the more body measurements, the more predictable mass is). However, in many settings, it is not obvious which variables or which models contribute most to accurate predictions. Cross validation is one way to get accurate independent predictions with which to compare different models.

Comparing two models to predict body_mass_g in penguins

The question we will seek to answer is whether the predictions of body_mass_g are substantially better when bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species are used in the model, as compared with a model on bill_length_mm only.

We refer to the model with only bill_length_mm as the smaller model. It is seen in Table 8.8 with coefficient estimates of the parameters as well as standard errors and p-values. We refer to the model with bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species as the larger model. It is seen in Table 8.9 with coefficient estimates of the parameters as well as standard errors and p-values. Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables are significant (small p-values) predictors of body_mass_g. However, in this section, we will go beyond the use of p-values to consider independent predictions of body_mass_g as a way to compare the smaller and larger models.

The smaller model:

\[ \begin{align*} E[\texttt{body_mass_g}] &= \ \beta_0 + \beta_1 \times \texttt{bill_length_mm}\\ \widehat{\texttt{body_mass_g}} &= \ 362.31 + 87.42 \times \texttt{bill_length_mm}\\ \end{align*} \]

Table 8.8: The smaller model: least squares estimates of the regression model predicting body_mass_g from bill_length_mm.
term estimate std.error statistic p.value
(Intercept) 362.3 283.4 1.28 0.202
bill_length_mm 87.4 6.4 13.65 <0.0001
The larger model:

\[ \begin{align*} E[\texttt{body_mass_g}] = \beta_0 &+ \beta_1 \times \texttt{bill_length_mm} + \beta_2 \times \texttt{bill_depth_mm} \\ &+ \beta_3 \times \texttt{flipper_length_mm} + \beta_4 \times \texttt{sex}_{male} \\ &+ \beta_5 \times \texttt{species}_{Chinstrap} + \beta_6 \times \texttt{species}_{Gentoo}\\ \widehat{\texttt{body_mass_g}} = -1460.99 &+ 18.20 \times \texttt{bill_length_mm} + 67.22 \times \texttt{bill_depth_mm} \\ &+ 15.95 \times \texttt{flipper_length_mm} + 389.89 \times \texttt{sex}_{male} \\ &- 251.48 \times \texttt{species}_{Chinstrap} + 1014.63 \times \texttt{species}_{Gentoo}\\ \end{align*} \]

Table 8.9: The larger model: least squares estimates of the regression model predicting body_mass_g from bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.
term estimate std.error statistic p.value
(Intercept) -1461.0 571.31 -2.56 0.011
bill_length_mm 18.2 7.11 2.56 0.011
bill_depth_mm 67.2 19.74 3.40 0.001
flipper_length_mm 15.9 2.91 5.48 <0.0001
sexmale 389.9 47.85 8.15 <0.0001
speciesChinstrap -251.5 81.08 -3.10 0.002
speciesGentoo 1014.6 129.56 7.83 <0.0001

In order to compare the smaller and larger models in terms of their ability to predict penguin mass, we need to build models that can provide independent predictions based on the penguins in the holdout samples created by cross validation. To reiterate, each of the predictions that (when combined) allow us to distinguish between the smaller and larger models is independent of the data which were used to build that model. In this example, using cross validation, we remove one quarter of the data before running the least squares calculations. Then the least squares model is used to predict the body_mass_g of the penguins in the holdout sample. Here we use 4-fold cross validation (meaning that one quarter of the data is removed each time) to produce four different versions of each model (at other times it might be more appropriate to use 2-fold or 10-fold cross validation, or even to refit the model after removing each individual data point one at a time).
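The sketch below carries out this 4-fold procedure for the smaller model, continuing from the `peng` data frame created earlier; because the seed and fold assignment are our own, the specific residuals will not match the figures exactly.

```r
# Sketch: 4-fold cross validation for the smaller model (bill length only).
# Continues from the `peng` data frame above; the seed and folds are our own.
k <- 4
set.seed(47)
folds <- sample(rep(1:k, length.out = nrow(peng)))

cv_pred <- numeric(nrow(peng))
for (i in 1:k) {
  train <- peng[folds != i, ]      # fit the model on three quarters of the data
  test  <- peng[folds == i, ]      # hold out the remaining quarter
  fit   <- lm(body_mass_g ~ bill_length_mm, data = train)
  cv_pred[folds == i] <- predict(fit, newdata = test)
}

cv_error <- peng$body_mass_g - cv_pred   # held-out prediction errors
plot(cv_pred, cv_error,
     xlab = "Predicted body mass (g)", ylab = "Prediction error (g)")
```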

Figure 8.22 displays how a model is fit to 3/4 of the data (note the slight differences in coefficients as compared to Table 8.8), and then predictions are made on the holdout sample.


Figure 8.22: The coefficients are estimated using the least squares model on 3/4 of the dataset with only a single predictor variable. Predictions are made on the remaining 1/4 of the observations. The y-axis in the scatterplot represents the residual: true observed value minus the predicted value. Note that the predictions are independent of the estimated model coefficients.

By repeating the process for each holdout quarter, the residuals from the model can be plotted against the predicted values. We see that the prediction errors are scattered around zero, which shows a reasonably good model fit, but that the predictions can be off by as much as \(\pm\) 1000 g from the true body mass.


Figure 8.23: One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length only. The x-axis represents the predicted value, the y-axis represents the error (difference between predicted value and actual value).

The same process is repeated for the larger number of explanatory variables. Note that the coefficients estimated for the first cross validation model (in Figure 8.24) are slightly different from the estimates computed on the entire dataset (seen in Table 8.9). Figure 8.24 displays the cross validation process for the multivariable model with a full set of residual plots given in Figure 8.25. Note that the residuals are mostly within \(\pm\) 500g, providing much more precise predictions for the independent body mass values of the individual penguins.


Figure 8.24: The coefficients are estimated using the least squares model on 3/4 of the dataset with the five specified predictor variables. Predictions are made on the remaining 1/4 of the observations. The y-axis in the scatterplot represents the residual: true observed value minus the predicted value. Note that the predictions are independent of the estimated model coefficients.


Figure 8.25: One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length, bill depth, flipper length, sex, and species. The x-axis represents the predicted value, the y-axis represents the error (difference between predicted value and actual value).

Figure 8.23 shows that the independent predictions are centered around the true values (i.e., errors are centered around zero), but that the predictions can be as much as 1000g off when using only bill_length_mm to predict body_mass_g. On the other hand, when using bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species to predict body_mass_g, the prediction errors seem to be about half as big, as seen in Figure 8.25.
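One way to condense this visual comparison into a single number is the root mean squared error (RMSE) of the held-out predictions. The helper function below is our own summary under the same assumptions and objects as the earlier sketches (it reuses the `peng` data frame); a smaller cross validated RMSE indicates more precise independent predictions.

```r
# Sketch: summarize held-out prediction error with RMSE for each model.
# Reuses the `peng` data frame from above; the helper function is our own.
cv_rmse <- function(form, data, k = 4, seed = 47) {
  set.seed(seed)
  folds <- sample(rep(1:k, length.out = nrow(data)))
  pred  <- numeric(nrow(data))
  for (i in 1:k) {
    fit <- lm(form, data = data[folds != i, ])                 # fit on k - 1 folds
    pred[folds == i] <- predict(fit, newdata = data[folds == i, ])
  }
  sqrt(mean((data$body_mass_g - pred)^2))                      # root mean squared error
}

cv_rmse(body_mass_g ~ bill_length_mm, peng)
cv_rmse(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
          sex + species, peng)
```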

We have provided a very brief overview of cross validation, along with one worked example. Cross validation is a computational approach to model building and model validation, offered as an alternative to reliance on p-values. While p-values have a role to play in understanding model coefficients, throughout this text we have continued to present computational methods that broaden statistical approaches to data analysis. Cross validation will be used again in Section 8.4 with logistic regression. We encourage you to consider both standard inferential methods (such as p-values) and computational approaches (such as cross validation) as you build and use multivariable models of all varieties.

8.3.4 Exercises

Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.

8.4 Inference for logistic regression

As with multiple linear regression, the inference aspect for logistic regression will focus on interpretation of coefficients and relationships between explanatory variables. Both p-values and cross validation will be used for assessing a logistic regression model.

Consider the email data which describes email characteristics which can be used to predict whether a particular incoming email is SPAM (unsolicited bulk email). Without reading every incoming message, it might be nice to have an automated way to identify SPAM emails. Which of the variables describing each email are important for predicting the status of the email?

The data from this study can be found in the openintro package: email.

Table 4.9: Descriptions for the spam variable along with 13 other variables in the email data set. Many of the variables are indicator variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
variable description
spam Indicator for whether the email was spam.
to_multiple Indicator for whether the email was addressed to more than one recipient.
from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc Number of people cc’ed. 
sent_email Indicator for whether the sender had been sent an email in the last 30 days.
attach The number of attached files.
dollar The number of times a dollar sign or the word “dollar” appeared in the email.
winner Indicates whether “winner” appeared in the email.
format Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
re_subj Whether the subject started with “Re:” “RE:” “re:” or “rE:”
exclaim_subj Whether there was an exclamation point in the subject.
urgent_subj Whether the word “urgent” was in the email subject.
exclaim_mess The number of exclamation points in the email message.
number Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

8.4.1 Multiple logistic regression output from software

As learned in Chapter 4, optimization can be used to find the coefficient estimates for the logistic model. The unknown population model can be written as:

\[ \begin{align*} \log_e\bigg(\frac{p}{1-p}\bigg) = \beta_0 &+ \beta_1 \times \texttt{to_multiple} + \beta_2 \times \texttt{cc} \\ &+ \beta_3 \times \texttt{dollar} + \beta_4 \times \texttt{urgent_subj} \end{align*} \]

The estimated equation for the regression model may be written as a model with four predictor variables (where \(\hat{p}\) is the estimated probability of being a SPAM email message):

\[ \begin{align*} \log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -2.05 &- 1.91 \times \texttt{to_multiple} + 0.02 \times \texttt{cc} \\ &- 0.07 \times \texttt{dollar} + 2.66 \times \texttt{urgent_subj} \end{align*} \]

Table 8.10: Summary of a logistic model for predicting whether an email is SPAM based on the variables to_multiple, cc, dollar, and urgent_subj. Each of the variables has its own coefficient estimate as well as p-value significance.
term estimate std.error statistic p.value
(Intercept) -2.05 0.06 -34.67 <0.0001
to_multiple1 -1.91 0.30 -6.37 <0.0001
cc 0.02 0.02 1.16 0.24502
dollar -0.07 0.02 -3.38 0.00071
urgent_subj1 2.66 0.80 3.32 0.00091

Not only does Table 8.10 provide the estimates for the coefficients, it also provides information for the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.
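As a sketch of how output like Table 8.10 could be generated, the code below fits the four-predictor logistic regression model in R, assuming the openintro package (which contains the email data mentioned above) is installed.

```r
# Sketch: fit the four-predictor logistic regression model for SPAM status.
# Assumes the openintro package, which provides the email data, is installed.
library(openintro)

fit_spam <- glm(spam ~ to_multiple + cc + dollar + urgent_subj,
                data = email, family = binomial)
summary(fit_spam)$coefficients   # estimates, standard errors, z statistics, p-values
```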

As in Section 8.3, with multiple predictors, each hypothesis test (for each of the explanatory variables) is conditioned on each of the other variables remaining in the model.

If there are multiple predictors: \(H_0: \beta_i = 0\) given the other variables in the model.

Using the example above and focusing on each of the variable p-values (here we won't discuss the p-value associated with the intercept), we can write out the four different hypotheses (each associated with the p-value corresponding to one of the coefficients / rows in Table 8.10):

  • \(H_0: \beta_1 = 0\) given cc, dollar, and urgent_subj are included in the model
  • \(H_0: \beta_2 = 0\) given to_multiple, dollar, and urgent_subj are included in the model
  • \(H_0: \beta_3 = 0\) given to_multiple, cc, and urgent_subj are included in the model
  • \(H_0: \beta_4 = 0\) given to_multiple, cc, and dollar are included in the model

The very low p-values from the software output tell us that three of the variables (that is, all except cc) act as important predictors in the model, even when the other variables are included. Consider the p-value on \(H_0: \beta_1 = 0\). The low p-value says that it would be extremely unlikely to observe data that produce a coefficient on to_multiple as far from zero as -1.91 if the true relationship between to_multiple and spam were non-existent (i.e., if \(\beta_1 = 0\)) and the model also included cc, dollar, and urgent_subj. Note also that the coefficient on dollar is statistically significant even though its magnitude is small (-0.07). It turns out that, measured in units of its standard error (0.02 here), 0.07 is actually quite far from zero; it's all about context! The p-values on the remaining variables are interpreted similarly. From the initial output (p-values) in Table 8.10, it seems as though to_multiple, dollar, and urgent_subj are important variables for modeling whether an email is SPAM. We remind you that although p-values provide some information about the model, there are many other aspects to consider when choosing the best model.
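To make the dollar calculation concrete, the test statistic is the estimate divided by its standard error. Using the rounded values reported in Table 8.10,

\[ Z = \frac{\hat{\beta}_3 - 0}{SE\left(\hat{\beta}_3\right)} \approx \frac{-0.07}{0.02} = -3.5, \]

which agrees with the reported statistic of -3.38 up to the rounding of the estimate and standard error.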

As with linear regression (see Section 8.3.2), correlated explanatory variables can impact both the coefficient estimates and the associated p-values. Investigating multicollinearity in a logistic regression model is saved for a text which provides more detail about logistic regression. Next, as a model building alternative (or enhancement) to p-values, we revisit cross validation within the context of predicting SPAM status for each of the individual emails.

8.4.2 Cross validation for prediction error

The p-value is a probability measure computed under a setting of no relationship. The p-value provides information about the evidence for a relationship (e.g., above we measured the relationship between spam and to_multiple using a p-value), but the p-value does not measure how well the model will predict the individual emails (e.g., the accuracy of the model). Depending on the goal of the research project, you might be inclined to focus on variable importance (through p-values) or you might be inclined to focus on prediction accuracy (through cross validation).

Here we present a method for using cross validation accuracy to determine which variables (if any) should be used in a model which predicts whether an email is SPAM. A full treatment of cross validation and logistic regression models is beyond the scope of this text. Using cross validation, we can build \(k\) different models which are used to predict the observations in each of the \(k\) holdout samples. The smaller model uses only the to_multiple variable, see the complete dataset (not cross validated) model output in Table 8.11. The logistic regression model can be written as (where \(\hat{p}\) is the estimated probability of being a SPAM email message):

\[ \log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -2.12 - 1.81 \times \texttt{to_multiple} \]

Table 8.11: Summary of a logistic model for predicting whether an email is SPAM based on only the predictor variable to_multiple. Each of the variables has its own coefficient estimate as well as p-value significance.
term estimate std.error statistic p.value
(Intercept) -2.12 0.06 -37.67 <0.0001
to_multiple1 -1.81 0.30 -6.09 <0.0001

For each cross validated model, the coefficients change slightly, and the model is used to make independent predictions on the holdout sample. The model from the first cross validation sample is given in Figure 8.26 and can be compared to the coefficients in Table 8.11.


Figure 8.26: The coefficients are estimated by fitting the logistic regression model to 3/4 of the dataset with a single predictor variable. Predictions are made on the remaining 1/4 of the observations. Note that the predictions are independent of the estimated model coefficients, and the prediction error rate is quite high.

Table 8.12: One quarter at a time, the data were removed from the model building, and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was fit independently of the removed emails. Only to_multiple is used to predict whether the email is spam. Because we used a cutoff designed to identify SPAM emails, the accuracy of the non-SPAM email predictions is very low.
fold count accuracy notspamTP spamTP
1st quarter 980 0.26 0.19 0.98
2nd quarter 981 0.23 0.15 0.96
3rd quarter 979 0.25 0.18 0.96
4th quarter 981 0.24 0.17 0.98

Because the SPAM dataset has a ratio of roughly 90% non-SPAM to 10% SPAM emails, a model which simply predicted every email to be non-SPAM would have an overall accuracy of 90%! Clearly, we would like to capture the information in the SPAM emails, so our interest is in the percent of SPAM emails which are correctly identified as SPAM (see Table 8.12). Additionally, in the logistic regression model, we use a 10% cutoff on the predicted probability to classify an email as SPAM. Fortunately, we have done a great job of predicting SPAM! However, the trade-off is that most of the non-SPAM emails are now also predicted to be SPAM, which is not acceptable for a SPAM prediction algorithm. Adding more variables to the model may help with both the SPAM and non-SPAM predictions.
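The sketch below shows how the 10% cutoff could be applied to cross validated predictions from the smaller logistic model; the seed, fold assignment, and accuracy summaries are our own, so the values will differ slightly from Table 8.12.

```r
# Sketch: 4-fold cross validation for the one-variable logistic model, classifying
# an email as SPAM when the predicted probability exceeds 0.10.
# Assumes the openintro package is installed; the seed and folds are our own.
library(openintro)

k <- 4
set.seed(47)
folds <- sample(rep(1:k, length.out = nrow(email)))

phat <- numeric(nrow(email))
for (i in 1:k) {
  fit <- glm(spam ~ to_multiple, data = email[folds != i, ], family = binomial)
  phat[folds == i] <- predict(fit, newdata = email[folds == i, ], type = "response")
}

pred_spam <- phat > 0.10              # apply the 10% cutoff
is_spam   <- email$spam == 1          # observed status (works for 0/1 or factor coding)
mean(pred_spam == is_spam)            # overall accuracy
mean(pred_spam[is_spam])              # proportion of SPAM emails flagged as SPAM
mean(!pred_spam[!is_spam])            # proportion of non-SPAM emails kept as non-SPAM
```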

The larger model uses to_multiple, attach, winner, format, re_subj, exclaim_mess, and number as the set of seven predictor variables, see the complete dataset (not cross validated) model output in Table 8.13. The logistic regression model can be written as (where \(\hat{p}\) is the estimated probability of being a SPAM email message):

\[ \begin{align*} \log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -0.34 &- 2.56 \times \texttt{to_multiple} + 0.20 \times \texttt{attach} \\ &+ 1.73 \times \texttt{winner}_{yes} - 1.28 \times \texttt{format} \\ &- 2.86 \times \texttt{re_subj} + 0 \times \texttt{exclaim_mess} \\ &- 1.07 \times \texttt{number}_{small} - 0.42 \times \texttt{number}_{big} \end{align*} \]

Table 8.13: Summary of a logistic model for predicting whether an email is SPAM based on the predictor variables to_multiple, attach, winner, format, re_subj, exclaim_mess, and number. Each of the variables has its own coefficient estimate as well as p-value significance.
term estimate std.error statistic p.value
(Intercept) -0.34 0.11 -3.02 0.00255
to_multiple1 -2.56 0.31 -8.28 <0.0001
attach 0.20 0.06 3.29 0.00099
winneryes 1.73 0.33 5.33 <0.0001
format1 -1.28 0.13 -9.80 <0.0001
re_subj1 -2.86 0.37 -7.83 <0.0001
exclaim_mess 0.00 0.00 0.26 0.79247
numbersmall -1.07 0.14 -7.54 <0.0001
numberbig -0.42 0.20 -2.10 0.03569

Figure 8.27: The coefficients are estimated by fitting the logistic regression model to 3/4 of the dataset with the seven specified predictor variables. Predictions are made on the remaining 1/4 of the observations. Note that the predictions are independent of the estimated model coefficients. The predictions are now much better for both the SPAM and the non-SPAM emails (than they were with a single predictor variable).

Table 8.14: One quarter at a time, the data were removed from the model building, and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was fit independently of the removed emails. Now, the variables to_multiple, attach, winner, format, re_subj, exclaim_mess, and number are used to predict whether the email is spam.
fold count accuracy notspamTP spamTP
1st quarter 980 0.77 0.77 0.71
2nd quarter 981 0.80 0.81 0.70
3rd quarter 979 0.76 0.77 0.65
4th quarter 981 0.78 0.79 0.75

As might be expected, the larger model (see Table 8.14) was able to capture more nuance in the emails, which led to better predictions. However, it is not true that adding variables will always lead to better predictions: correlated or noise variables can dampen the signal from those variables which truly predict SPAM status. We encourage you to learn more about multiple variable models and cross validation in your future exploration of statistical topics.

8.4.3 Exercises

Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.

8.5 Chapter review

Throughout the text, we have presented a modern view of introductory statistics. Early on, we presented graphical techniques which communicate relationships across multiple variables. We also used modeling to formalize those relationships. Many chapters were dedicated to inferential methods which allow claims about the population to be made based on a sample of data. Not only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods. In this chapter, we brought many of those ideas together by considering inferential claims on models which include many variables. We continue to emphasize the importance of experimental design in making conclusions about research claims. In particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see Figure 1.11).

As you might guess, this text has only scratched the surface of the world of statistical analyses that can be applied to different datasets. In particular, to do justice to the topic, the linear models and generalized linear models we have introduced can each be covered with their own course or book. Hierarchical models, alternative methods for fitting parameters (e.g., Ridge Regression or LASSO), and advanced computational methods applied to models (e.g., permuting the response variable? one explanatory variable? all the explanatory variables?) are all beyond the scope of this book. However, your successful understanding of the ideas we have covered has set you up perfectly to move on to a higher level of statistical modeling and inference. Enjoy!

8.5.1 Terms

We introduced the following terms in the chapter. If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However, you should be able to easily spot them as bolded text.

8.5.2 Chapter exercises

Exercises for this section will be available in the 1st edition of this book, which will be available in Summer 2021. In the meantime, OpenIntro::Introduction to Statistics with Randomization and Simulation and OpenIntro::Statistics, both of which are available for free, have many exercises you can use alongside this book.

8.5.3 Interactive R tutorials

Navigate the concepts you’ve learned in this chapter in R using the following self-paced tutorials. All you need is your browser to get started!

You can also access the full list of tutorials supporting this book here.

8.5.4 R labs

Further apply the concepts you’ve learned in this chapter in R with computational labs that walk you through a data analysis case study.