Chapter 8 Inference for regression

This chapter is currently under construction; however, the content to be presented in this chapter is covered in the R tutorials. We encourage you to review the content there in the meantime.

We now bring together ideas of inferential analyses from Chapter 5 with the descriptive models seen in Chapters 3 and 4. The setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), but we continue to ask questions about the variability of the model from sample to sample. The sampling variability will inform the conclusions that can be drawn about the population.

Many of the inferential ideas are remarkably similar to those covered in previous chapters. The technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.

We encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from statistical software. Inference on models with multiple explanatory variables can suffer from data snooping, which can result in false positive claims. We provide some guidance, and we hope the reader will further their statistical learning after working through the material in this text.

8.1 Inference for linear regression

In this chapter, we bring together the inferential ideas (see Chapter 5) used to make claims about a population from information in a sample and the modeling ideas seen in Chapters 3 and 4. In particular, we will use the least squares regression line to test whether or not there is a relationship between two continuous variables. Additionally, we will build confidence intervals which quantify the uncertainty in the slope of the linear regression line.

Observed data

We start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue. The hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample. Because we have made up the example and the data (and the entire population), we can take many, many samples from the population to visualize the variability. Note that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample. The change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.

Consider the following hypothetical population of all of the sandwich stores of a particular chain, seen in Figure 8.1. In this made-up world, the CEO actually has all the relevant data, which is why they can plot it here. The CEO is omniscient and can write down the population model which describes the true population relationship between advertising dollars and revenue. There appears to be a linear relationship between advertising dollars and revenue (both in $000).

Figure 8.1: Revenue as a linear model of advertising dollars for a population of sandwich stores, in $000.

You may remember from Chapter 3 that the population model is: \[y = \beta_0 + \beta_1 x + \varepsilon.\]

Again, the omniscient CEO (with the full population information) can write down the true population model as: \[\mbox{expected revenue} = 11.23 + 4.8 \cdot \mbox{advertising}.\]
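
If you would like to experiment with a population like this yourself, the short sketch below simulates one in R. The intercept (11.23) and slope (4.8) come from the CEO's model above; the advertising range, the error standard deviation, and the population size are our own assumptions, not values from the text.

```r
# A minimal sketch of a hypothetical population of sandwich stores.
# Intercept and slope are taken from the text; everything else is assumed.
set.seed(47)
stores <- data.frame(advertising = runif(1000, min = 1, max = 7))
stores$revenue <- 11.23 + 4.8 * stores$advertising + rnorm(1000, sd = 2)
```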

Variability of the statistic

Unfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether or not to set up a new franchise. The CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.

As with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample. Consider the linear model which describes revenue (in $000) based on advertising dollars (in $000).

The least squares regression model uses the data to find a sample linear fit: \[\hat{y} = b_0 + b_1 x.\]

A random sample of 20 stores shows a different least squares regression line depending on which observations are selected. A sample of 20 stores shows a similar positive trend between advertising and revenue to what we saw in Figure 8.1 (which described the population), despite having fewer observations on the plot.

Figure 8.2: A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.

A second sample of size 20 also shows a positive trend!

Figure 8.3: A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.

But the line is slightly different!

Figure 8.4: The linear models from the two different random samples are quite similar, but they are not the same line.

That is, there is variability in the regression line from sample to sample. The concept of the sampling variability is something you’ve seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: the slope of the line.

Figure 8.5: If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population.

You might notice in Figure 8.5 that the \(\hat{y}\) values given by the lines are much more consistent in the middle of the dataset than at the ends. The reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud. The effect of the fan-shaped lines is that predicted revenue for advertising close to $4,000 will be much more precise than the revenue predictions made for $1,000 or $7,000 of advertising.

The distribution of slopes (for samples of size \(n=20\)) can be seen in a histogram, as in Figure 8.6.
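
The repeated-sampling simulation behind a histogram like Figure 8.6 can be sketched in a few lines of R, reusing the hypothetical stores population sketched earlier (the number of repetitions is arbitrary):

```r
# Draw many samples of size 20 from the hypothetical population and
# record the fitted slope from each sample.
set.seed(47)
slopes <- replicate(500, {
  samp <- stores[sample(nrow(stores), size = 20), ]
  coef(lm(revenue ~ advertising, data = samp))[["advertising"]]
})
hist(slopes, main = "Slopes from repeated samples of size 20")
```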

Figure 8.6: Variability of slope estimates taken from many different samples of stores, each of size 20.

Recall, the example described in this introduction is hypothetical. That is, we created an entire population in order to demonstrate how the slope of a line would vary from sample to sample. The tools in this textbook are designed to evaluate a single sample of data.
With actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes. We have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models. However, it is nice to visually consider the linear models produced by different slopes. Additionally, as with measuring the variability of previous statistics (e.g., \(\overline{X}_1 - \overline{X}_2\) or \(\hat{p}_1 - \hat{p}_2\)), the histogram of the sample statistics can provide information related to inferential considerations.

In the following sections, the distribution (i.e., histogram) of \(b_1\) (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you. First (in Section 8.1.1), the distribution of \(b_1\) when \(\beta_1 = 0\) is constructed by randomizing (permuting) the response variable. Next (in Section 8.1.2), we can bootstrap the data by taking random samples of size n from the original dataset. And last (in Section 8.1.3), we use mathematical tools to describe the variability using the \(t\)-distribution that was first encountered in Section 7.1.2.

8.1.1 Randomization test for \(H_0: \beta_1= 0\)

Consider the data on Global Crop Yields compiled by Our World in Data and presented as part of the TidyTuesday series seen in Figure 8.7. The scientific research interest at hand will be in determining the linear relationship between wheat yield (for a country-year) and other crop yields. The dataset is quite rich and deserves exploring, but for this example, we will focus only on the annual crop yield in the United States.

Figure 8.7: Yield (in tonnes per hectare) for six different crops in the US. The color of the dot indicates the year.

As you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected. In the linear model setting, we might hope to have a linear relationship between maize and wheat in settings where maize production is known and wheat production needs to be predicted.

The relevant hypotheses for the linear model setting can be written in terms of the population slope parameter. Here the population refers to a larger set of years where maize and wheat are both grown in the US.

  • \(H_0: \beta_1= 0\), there is no linear relationship between wheat and maize.
  • \(H_A: \beta_1 \ne 0\), there is some linear relationship between wheat and maize.

Recall that for the randomization test, we permute one of the variables to eliminate any existing relationship between the variables. That is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but not due to variables being correlated. Figure 8.8 shows the observed data and a scatterplot of one permutation of the wheat variable. The careful observer can see that each of the observed values for wheat (and for maize) exists in both the original data plot and the permuted wheat plot, but the wheat and maize yields are no longer matched for a given year. That is, each wheat yield is randomly assigned to a new maize yield.

Figure 8.8: Original (left) and permuted (right) data. The permutation removes the linear relationship between wheat and maize. Repeated permutations allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e., that the null hypothesis is true).

By repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship). The randomization test compares the slopes calculated from the permuted response variable with the observed slope. If the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).
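
A minimal sketch of this randomization test in R is given below. We assume the US yields sit in a data frame called us_crops with numeric columns wheat and maize; the object name and the number of permutations are our choices, not values from the text.

```r
# Observed slope from the original data.
obs_slope <- coef(lm(wheat ~ maize, data = us_crops))[["maize"]]

# Permute the response to break any wheat-maize relationship,
# refit the line, and record the slope; repeat many times.
set.seed(47)
null_slopes <- replicate(1000, {
  shuffled <- us_crops
  shuffled$wheat <- sample(shuffled$wheat)
  coef(lm(wheat ~ maize, data = shuffled))[["maize"]]
})

# Two-sided p-value: how often is a permuted slope as extreme as the observed one?
mean(abs(null_slopes) >= abs(obs_slope))
```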

Observed data

We will continue to use the crop data to investigate the linear relationship between wheat and maize. Note that the least squares model (see Chapter 3) describing the relationship is given in Table 8.1. The columns in Table 8.1 are further described in Section 8.1.3.

Table 8.1: The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.195.
term         estimate  std.error  statistic  p.value
(Intercept)     1.033      0.091       11.3        0
maize           0.195      0.012       16.4        0

Variability of the statistic

After permuting the data, the least squares estimate of the line can be computed. Repeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between wheat and maize. Figure 8.9 shows two different permutations of wheat and the resulting linear models.

Figure 8.9: Two different permutations of the wheat variable with slightly different least squares regression lines.

As you can see, sometimes the slope of the permuted data is positive and sometimes it is negative. Because the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect the center of the randomized slope distribution to be zero.

Observed statistic vs. null statistics

Figure 8.10: Histogram of slopes given different permutations of the wheat variable. The vertical red line is at the observed value of the slope, 0.195.

As we can see from Figure 8.10, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the wheat variable. That is, if indeed there were no linear relationship between wheat and maize, the natural variability of the slopes would produce estimates between approximately -0.1 and +0.1. We reject the null hypothesis. Therefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between wheat and maize crop yield in the US.

8.1.2 Bootstrap confidence interval for \(\beta_1\)

As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test). Because interest is now in creating a CI, there is no null hypothesis, so there won’t be any reason to permute either of the variables.

Observed data

Returning to the crop data, we may want to consider the relationship between peas and wheat. Are peas a good predictor of wheat? And if so, what is their relationship? That is, what is the slope that models average wheat yield as a function of peas?

Figure 8.11: Original data: wheat yield as a linear model of peas yield, in tonnes per hectare. Notice that the relationship between peas and wheat is not as strong as the relationship we saw previously between maize and wheat.

Variability of the statistic

Because we are not focused on a null distribution, we sample with replacement \(n=58\) observations from the original dataset. Recall that with bootstrapping we always resample the same number of observations as we start with in order to mimic the process of taking a sample from the population. When sampling in the linear model case, consider each observation to be a single dot. If the dot is resampled, both the wheat and the peas measurement are observed. The measurements are linked to the dot (i.e., to the year in which the measurements were taken).

Figure 8.12: Original and one bootstrap sample of the crop data. Note that it is difficult to differentiate the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are plotted as points on top of one another. The orange circle represents points in the original data which were not included in the bootstrap sample. The blue circle represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circle represents a particular structure to the data which is observed in both the original and bootstrap samples.

Figure 8.12 shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models. The orange circle represents points in the original data which were not included in the bootstrap sample. The blue circle represents a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample. The green circle represents a particular structure to the data which is observed in both the original and bootstrap samples. By repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in Figure 8.13.

Figure 8.13: Repeated bootstrap resamples of size 58 are taken from the original data. Each of the bootstrapped linear models is slightly different.

Recall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples. Figure 8.14 is a histogram of the relevant bootstrapped slopes. We can see that a 95% bootstrap percentile interval for the true population slope is given by (0.061, 0.52). We are 95% confident that for the model describing the population of crops of peas and wheat, a one unit increase in peas yield (in tonnes per hectare) will be associated with an increase in predicted average wheat yield of between 0.061 and 0.52 tonnes per hectare.
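
A minimal sketch of the bootstrap percentile interval in R, again assuming a data frame us_crops, now with columns wheat and peas (the names are our assumption):

```r
# Resample whole rows (year-level observations) with replacement,
# refit the line, and keep the slope each time.
set.seed(47)
boot_slopes <- replicate(1000, {
  idx <- sample(nrow(us_crops), replace = TRUE)
  coef(lm(wheat ~ peas, data = us_crops[idx, ]))[["peas"]]
})

# 95% bootstrap percentile interval for the slope.
quantile(boot_slopes, probs = c(0.025, 0.975))
```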

Figure 8.14: The original crop data on wheat and peas is bootstrapped 1,000 times. The histogram provides a sense of the variability of the linear model slope from sample to sample.

8.1.3 Mathematical model

When certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter. The approximations will build on the \(t\)-distribution, which was described in Chapter 7. The mathematical model is often correct and is usually easy to implement computationally. The validity of the technical conditions will be considered in detail in Section 8.2.

In this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line. Just as we identified standard errors for point estimates in previous chapters, we first discuss standard errors for these new estimates.

Midterm elections and unemployment

Observed data

Elections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S. Presidential election. The set of House elections occurring during the middle of a Presidential term are called midterm elections. In America’s two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President’s party will do in the midterm elections. In 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.

To assess the validity of this claim, we can compile historical data and look for a connection. We consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression. The House of Representatives is made up of 435 voting members.

Figure 8.15 shows these data and the least-squares regression line: \[\begin{aligned} &\mbox{% change in House seats for President's party} \\ &\qquad\qquad= -7.36 - 0.89 \times \mbox{(unemployment rate)}\end{aligned}\] We consider the percent change in the number of seats of the President’s party (e.g. percent change in the number of seats for Republicans in 2018) against the unemployment rate.

Examining the data, there are no clear deviations from linearity or substantial outliers (see Section 3.1.3 for a discussion on using residuals to visualize how well a linear model fits the data). While the data are collected sequentially, a separate analysis was used to check for any apparent correlation between successive observations; no such correlation was found.

Figure 8.15: The percent change in House seats for the President’s party in each election from 1898 to 2010 plotted against the unemployment rate. The two points for the Great Depression have been removed, and a least squares regression line has been fit to the data.

The data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively. Do you agree that they should be removed for this investigation? Why or why not?

There is a negative slope in the line shown in Figure 8.15. However, this slope (and the y-intercept) are only estimates of the parameter values. We might wonder, is this convincing evidence that the “true” linear model has a negative slope? That is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of the midterm election? We can frame this investigation into a statistical hypothesis test:

  • \(H_0\): \(\beta_1 = 0\). The true linear model has slope zero.
  • \(H_A\): \(\beta_1 \neq 0\). The true linear model has a slope different than zero. The unemployment rate is predictive of whether the President’s party wins or loses seats in the House of Representatives.

We would reject \(H_0\) in favor of \(H_A\) if the data provide strong evidence that the true slope parameter is different than zero. To assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.

Regression output from software

Variability of the statistic

Just like other point estimates we have seen before, we can compute a standard error and test statistic for \(b_1\). We will generally label the test statistic using a \(T\), since it follows the \(t\)-distribution.

We will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course. Table 8.2 shows software output for the least squares regression line in Figure 8.15. The row labeled unemp includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable).

Table 8.2: Output from statistical software for the regression line modeling the midterm election losses for the President’s party as a response to unemployment.
term         estimate  std.error  statistic  p.value
(Intercept)     -7.36      5.155      -1.43    0.165
unemp           -0.89      0.835      -1.07    0.296

What do the first and second columns of Table 8.2 represent?


The entries in the first column represent the least squares estimates, \(b_0\) and \(b_1\), and the values in the second column correspond to the standard errors of each estimate. Using the estimates, we could write the equation for the least squares regression line as \[\begin{aligned} \hat{y} = -7.36 - 0.89 x \end{aligned}\] where \(\hat{y}\) in this case represents the predicted change in the number of seats for the President’s party, and \(x\) represents the unemployment rate.

We previously used a \(t\)-test statistic for hypothesis testing in the context of numerical data. Regression is very similar. In the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T-score formula: \[\begin{aligned} T = \frac{\text{estimate} - \text{null value}}{\text{SE}} = \frac{-0.89 - 0}{0.835} = -1.07\end{aligned}\] This corresponds to the third column of Table 8.2.
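
The same coefficient table and test statistic can be reproduced in R. The sketch below assumes the election data are in a data frame midterms with columns percent_change and unemp; these names are our assumption, not necessarily those of the original dataset.

```r
# Fit the model and show the coefficient table (estimate, SE, t statistic, p-value).
fit <- lm(percent_change ~ unemp, data = midterms)
summary(fit)$coefficients

# The T-score for the slope, computed by hand from Table 8.2.
(-0.89 - 0) / 0.835
```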

Use Table 8.2 to determine the p-value for the hypothesis test.


The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961. That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President’s party in the House of Representatives in midterm elections.

Observed statistic vs. null statistics

As the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether or not the data could have come from a population where the true slope was zero (i.e., \(\beta_1 = 0\)). Before evaluating the formal hypothesis claim, sometimes it is important to check your intuition. Based on everything we’ve seen in the examples above describing the variability of a line from sample to sample, ask yourself whether the linear relationship given by the data could have come from a population in which the slope was truly zero.

Examine Figure 5.19, which relates the Elmhurst College aid and student family income. How sure are you that the slope is statistically significantly different from zero? That is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?


While the relationship between the variables is not perfect, there is an evident decreasing trend in the data. This suggests the hypothesis test will reject the null claim that the slope is zero.

The point of the tools in this section is to go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about the statistical significance of the slope estimate.

Table 8.3: Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.
term            estimate  std.error  statistic  p.value
(Intercept)    24319.329   1291.450      18.83        0
family_income     -0.043      0.011      -3.98        0

Table 8.3 shows statistical software output from fitting the least squares regression line shown in Figure 5.19. Use the output to formally evaluate the following hypotheses.

  • \(H_0\): The true coefficient for family income is zero.
  • \(H_A\): The true coefficient for family income is not zero.

Inference for regression: We usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice. However, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met. See Section 8.2.

Confidence interval for a coefficient

Observed data

Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct a confidence interval for that coefficient.

Compute the 95% confidence interval for the coefficient using the regression output from Table 8.3.


The point estimate is -0.0431 and the standard error is \(SE = 0.0108\). When constructing a confidence interval for a model coefficient, we generally use a \(t\)-distribution. The degrees of freedom for the distribution are noted in the regression output, \(df = 48\), allowing us to identify \(t_{48}^{\star} = 2.01\) for use in the confidence interval.

We can now construct the confidence interval in the usual way: \[\begin{aligned} \text{point estimate} \pm t_{48}^{\star} \times SE \qquad\to\qquad -0.0431 \pm 2.01 \times 0.0108 \qquad\to\qquad (-0.0648, -0.0214) \end{aligned}\] We are 95% confident that with each dollar increase in family income, the university’s gift aid is predicted to decrease on average by $0.0214 to $0.0648.
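
In R, the same interval can be computed with confint(). The sketch below assumes the Elmhurst data are in a data frame elmhurst with columns gift_aid and family_income measured in dollars; the object name and units are assumptions to check against the data you are using.

```r
# 95% confidence interval for the family income coefficient.
fit <- lm(gift_aid ~ family_income, data = elmhurst)
confint(fit, parm = "family_income", level = 0.95)

# Or by hand, using the estimate, SE, and df = 48 from Table 8.3.
-0.0431 + c(-1, 1) * qt(0.975, df = 48) * 0.0108
```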

Variability of the statistic

Confidence intervals for coefficients

Confidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the \(t\)-distribution: \[\begin{aligned} b_i \ \pm\ t_{df}^{\star} \times SE_{b_{i}} \end{aligned}\] where \(t_{df}^{\star}\) is the appropriate \(t\)-value corresponding to the confidence level with the model’s degrees of freedom.

On the topic of intervals in this book, we’ve focused exclusively on confidence intervals for model parameters. However, there are other types of intervals that may be of interest, including prediction intervals for a response value and also confidence intervals for a mean response value in the context of regression.

8.1.4 Exercises

Exercises for this section are under construction.

8.2 Checking model conditions

In the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions. In this section, we’ll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.

What are the technical conditions for the mathematical model?

When fitting a least squares line, we generally require

  • Linearity. The data should show a linear trend. If there is a nonlinear trend (e.g., first panel of Figure 8.16), an advanced regression method from another book or later course should be applied.

  • Independent observations. Be cautious about applying regression to data which are sequential observations in time, such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis. An example of a data set where successive observations are not independent is shown in the fourth panel of Figure 8.16. There are also other instances where correlations within the data are important, which is further discussed in Chapter 4.

  • Nearly normal residuals. Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points, which we’ll talk about more in Section 3.3. An example of a residual that would be a potential concern is shown in the second panel of Figure 8.16, where one observation is clearly much further from the regression line than the others.

  • Constant or equal variability. The variability of points around the least squares line remains roughly constant. An example of non-constant variability is shown in the third panel of Figure 8.16, which represents the most common pattern observed when this condition fails: the variability of \(y\) is larger when \(x\) is larger.

Figure 8.16: Four examples showing when the methods in this chapter are insufficient to apply to the data. The top set of graphs represents the \(x\) and \(y\) relationship. The bottom set of graphs is a residual plot. First panel: linearity fails. Second panel: there are outliers, most especially one point that is very far away from the line. Third panel: the variability of the errors is related to the value of \(x\). Fourth panel: a time series data set is shown, where successive observations are highly correlated.

Should we have concerns about applying least squares regression to the Elmhurst data in Figure 3.13?

The technical conditions are often remembered using the LINE mnemonic. The linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in Figure 8.16 (a short code sketch of such residual plots follows the list below). A careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.

  • L: linear model
  • I: independent observations
  • N: points are normally distributed around the line
  • E: equal variability around the line for all values of the explanatory variable
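
As a rough guide, the residual plots used to check the L, N, and E conditions can be produced with a couple of lines of base R. The sketch assumes a fitted model object fit returned by lm().

```r
# Residuals vs. fitted values: look for curvature (L) and fanning (E).
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal quantile-quantile plot of the residuals: look for a straight line (N).
qqnorm(resid(fit))
qqline(resid(fit))
```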

Why do we need technical conditions?

As with other inferential techniques we have covered in this text, if the technical conditions above don’t hold, then it is not possible to make concluding claims about the population. That is, without the technical conditions, the T-score (or Z-score) will not have the assumed t-distribution (or standard normal Z distribution). That said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions. If there are large deviations, we will be unable to trust the calculated p-value or the endpoints of the resulting confidence interval.

The model based on Linearity

The linearity condition is among the most important if your goal is to understand a linear model between \(x\) and \(y\). For example, the value of the slope will not be at all meaningful if the true relationship between \(x\) and \(y\) is quadratic, as in Figure 3.3. Not only should we be cautious about the inference, but the model itself is also not an accurate portrayal of the relationship between the variables.

In Section 8.3 we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones. However, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.

The importance of Independence

The technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose. It is also extremely difficult to gather a dataset which is a true random sample from the population of interest. (Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)

Dependent observations can bias results in ways that produce fundamentally flawed analyses. For example, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university. At best it is a model describing students who use the gym (but also who are willing to talk to you, that use the gym at the times you were there measuring, etc.).

In lieu of trying to answer whether or not your observations are a true random sample, you might instead focus on whether or not you believe your observations are representative of the population. Humans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.

Some thoughts on Normality

The normality condition requires that points vary symmetrically around the line, spreading out in a bell-shaped fashion. You should consider the “bell” of the normal distribution as sitting on top of the line (coming off the paper in a 3-D sense) so as to indicate that the points are dense close to the line and disperse gradually as they get farther from the line.

The normality condition is less important than linearity or independence for a few reasons. First, the linear model fit with least squares will still be an unbiased estimate of the true population model. However, the standard errors associated with variability of the line will not be well estimated. Fortunately the Central Limit Theorem tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough. One analysis method that does require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given \(x\) value, using the linear model. One additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.

Equal variability for prediction in particular

As with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of \(x\)) will not cause problems for the estimate of the linear model. That said, the inference on the model (e.g., computing p-values) will be incorrect if the variability around the line is heterogeneous. Data that exhibit non-equal variance across the range of x-values will have the potential to seriously mis-estimate the variability of the slope which will have consequences for the inference results (i.e., hypothesis tests and confidence intervals).

The inference results for both a randomization test and a bootstrap confidence interval are robust to violations of the equal variability condition, so they give the analyst methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the regression line). Although randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observations). When the equal variability condition is violated and a mathematical analysis (e.g., p-value from T-score) is needed, there are other existing methods (outside the scope of this text) which can easily handle the unequal variance (e.g., weighted least squares analysis).

What if all the technical conditions are met?

When the technical conditions are met, the least squares regression model and the related inference are provided by virtually all statistical software. Beyond being ubiquitous, an additional advantage of the least squares regression model (and related inference) is that the linear model has important extensions (which are not trivial to implement with bootstrapping and randomization tests). In particular, random effects models, repeated measures, and interaction are all linear model extensions which require the above technical conditions. When the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand. We will discuss some of the extended modeling and associated inference in Section 8.3 and Section 8.4. Many of the techniques used to deal with technical condition violations are outside the scope of this text, but they are taught in universities in the very next class after this one. If you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.

8.2.1 Exercises

8.3 Inference for multiple regression

In Chapter 4, the least squares regression method was used to estimate linear models which predicted a particular response variable given more than one explanatory variable. Here, we discuss whether each of the variables individually is a significant predictor or whether the model might be just as strong without that variable. That is, as before, we apply inferential methods to ask whether a variable could have come from a population where the particular coefficient at hand was zero. If one of the linear model coefficients is truly zero (in the population), then the estimate of the coefficient (using least squares) will vary around zero. The inference task at hand is to decide whether the coefficient’s difference from zero is large enough to decide that the data cannot possibly have come from a model where the true population coefficient is zero. Both the derivations from the mathematical model and the randomization model are beyond the scope of this book, but we are able to calculate p-values using statistical software. We will discuss interpreting p-values in the multiple regression setting and note some scenarios where careful understanding of the context and the relationship between variables is important.

8.3.1 Multiple regression output from software

Recall the loans data from Chapter 4.

The data can be found in the openintro package: loans_full_schema. Based on the data in this dataset we have created two new variables: credit_util, which is calculated as the total credit utilized divided by the total credit limit, and bankruptcy, which turns the number of bankruptcies into an indicator variable (0 for no bankruptcies and 1 for at least one bankruptcy). We will refer to this modified dataset as loans.
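
A sketch of that data preparation is below. The column names we use from loans_full_schema (total_credit_utilized, total_credit_limit, public_record_bankrupt) are our best guess at the relevant variables, so treat them as assumptions to check against the package documentation.

```r
library(openintro)
library(dplyr)

# Build the modified `loans` data frame described above.
loans <- loans_full_schema %>%
  mutate(
    credit_util = total_credit_utilized / total_credit_limit,
    bankruptcy  = if_else(public_record_bankrupt >= 1, 1, 0)
  )
```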

Now, our goal is to create a model where interest_rate can be predicted using the variables debt_to_income, term, and credit_checks.
As learned in Chapter 4, least squares can be used to find the coefficient estimates for the linear model. The unknown population model can be written as: \[E[\mbox{interest_rate}] = \beta_0 + \beta_1\times \mbox{debt_to_income} + \beta_2 \times \mbox{term} + \beta_3 \times \mbox{credit_checks}\]

The estimated equation for the regression model may be written as a model with three predictor variables:

Table 8.4: Summary of a linear model for predicting interest rate based on the variables debt_to_income, term, and credit_checks. Each of the variables has its own coefficient estimate as well as p-value significance.
term            estimate  std.error  statistic  p.value
(Intercept)        4.309      0.195       22.1  <0.0001
debt_to_income     0.041      0.003       13.3  <0.0001
term               0.158      0.004       37.9  <0.0001
credit_checks      0.247      0.019       12.8  <0.0001

\[\widehat{\mbox{interest_rate}} = 4.31 + 0.041 \times \mbox{debt_to_income} + 0.16 \times \mbox{term} + 0.25 \times \mbox{credit_checks}\]

Not only does Table 8.4 provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which are the focus of this chapter.
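
The model itself takes one call to lm(). The sketch below assumes the loans data frame built above contains columns named interest_rate, debt_to_income, term, and credit_checks; the last name in particular is an assumption about how the variable is stored.

```r
# Fit the three-predictor model and display the coefficient table,
# as summarized in Table 8.4.
fit <- lm(interest_rate ~ debt_to_income + term + credit_checks, data = loans)
summary(fit)$coefficients
```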

In Section 8.1, we learned that the hypothesis test for a linear model with one predictor can be written as:

\[\mbox{if only one predictor } H_0: \beta_1 = 0.\]

That is, if the true population slope is zero, the p-value measures how likely it would be to select data which produced the observed slope (\(b_1\)) value.

With multiple predictors, the hypothesis is similar; however, it is now conditioned on each of the other variables remaining in the model.

\[\mbox{if multiple predictors } H_0: \beta_i = 0 \mbox{ given other variables in the model}\]

Using the example above and focusing on each of the variable p-values (here we won’t discuss the p-value associated with the intercept), we can write out the three different hypotheses:

\[\begin{eqnarray*} H_0: \beta_1 = 0 && \mbox{given term and credit_checks are included in the model}\\ H_0: \beta_2 = 0 && \mbox{given debt_to_income and credit_checks are included in the model}\\ H_0: \beta_3 = 0 && \mbox{given debt_to_income and term are included in the model} \end{eqnarray*}\]

The very low p-values from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of the other two. Consider the p-value for \(H_0: \beta_1 = 0\). The low p-value says that it would be extremely unlikely to see data that produce a coefficient on debt_to_income as large as 0.041 if the true relationship between debt_to_income and interest_rate was non-existent (i.e., if \(\beta_1 = 0\)) and the model also included term and credit_checks. You might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the problem, 0.041 turns out to be far away from zero; it’s all about context! The p-values on term and on credit_checks are interpreted similarly.

Sometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor variables themselves being correlated.

8.3.2 Multicollinearity

In practice, there will almost always be some degree of correlation between the explanatory variables in a multiple regression model. For regression models, it is important to understand the entire context of the model, particularly for correlated variables. Our discussion will focus on interpreting coefficients (and their signs) in relationship to other variables as well as the significance (i.e., the p-value) of each coefficient.

Consider an example where we’d like to predict how much money is in a coin dish based only on the number of coins in the dish. We ask 26 students to tell us about their individual coin dishes, collecting data on the total dollar amount, the total number of coins, and the total number of low coins. The number of low coins is the number of coins minus the number of quarters (a quarter is the largest commonly used US coin, at US$0.25). Figure 8.17 illustrates a sample of U.S. coins, their total worth (amount), the number of total coins, and the number of low coins.

Figure 8.17: A sample of coins with 16 total coins, 10 low coins, and a net worth of $1.90.

The collected data is given in Figure 8.18 and shows that the total amount of money is more highly correlated with the total number of coins than it is with the number of low coins. We also note that the number of high coins and the number of low coins are positively correlated.

Figure 8.18: Plot describing the amount of money (US$) as a function of the number of coins and the number of low coins. As you might expect, the amount of money is more highly positively correlated with the total number of coins than with the number of low coins.

Using the total number of coins as the predictor variable, Table 8.5 shows that the least squares estimate of the coefficient is 0.13. For every additional coin in the dish, we would predict that the student had US$0.13 more. The \(b_1 = 0.13\) coefficient is highly significant, suggesting we would not have seen data like this if the number of coins and the amount of money were not linearly related.

\[\widehat{\mbox{amount}} = 0.55 + 0.13 \times \mbox{number of coins} \]

Table 8.5: Linear model output predicting the total amount of money based on the total number of coins.
term             estimate  std.error  statistic  p.value
(Intercept)          0.55       0.44       1.23   0.2301
number.of.coins      0.13       0.02       5.54  <0.0001

Using the number of low coins as the predictor variable, Table 8.6 shows that the least squares estimate of the coefficient is 0.02. For every additional low coin in the dish, we would predict that the student had US$0.02 more. The \(b_1 = 0.02\) coefficient is not at all significant, suggesting we could easily have seen data like ours even if the number of low coins and the amount of money are not at all linearly related.

\[\widehat{\mbox{amount}} = 2.28 + 0.02 \times \mbox{number of low coins} \]

Table 8.6: Linear model output predicting the total amount of money based on the number of low coins.
term                 estimate  std.error  statistic  p.value
(Intercept)              2.28       0.58        3.9      0.0
number.of.low.coins      0.02       0.05        0.4      0.7

Using both the total number of coins and the number of low coins as predictor variables, Table 8.7 provides the least squares estimates of both coefficients as 0.21 and -0.16. Now, with two variables in the model, the interpretation is more nuanced.

  • The coefficient indicates a change in one variable while keeping the other variable constant.
    For every additional coin in the dish while the number of low coins stays constant, we would predict that the student had US$0.21 more. Re-considering the phrase “every additional coin in the dish while the number of low coins stays constant” makes us realize that each increase is a single additional quarter (larger sample sizes would have led to a \(b_1\) coefficient closer to 0.25 because of the deterministic relationship described here).
  • For every additional low coin in the dish while the number of total coins stays constant, we would predict that the student had US$0.16 less. Re-considering the phrase “every additional low coin in the dish while the number of total coins stays constant” makes us realize that a quarter is being swapped out for a penny, nickel, or dime.

Considering the coefficients across Tables 8.5, 8.6, and 8.7 within the context and knowledge we have of US coins allows us to understand the correlation between variables and why the signs of the coefficients would change depending on the model. Note also, however, that the significance on the low coin coefficient changed from Table 8.6 to Table 8.7. It makes sense that the variable describing the number of low coins provides more information about the amount of money when it is part of a model which also includes the total number of coins than it does when it is used as a single variable in a simple linear regression model.

\[\widehat{\mbox{amount}} = 0.80 + 0.21 \times \mbox{number of coins} - 0.16 \times \mbox{number of low coins}\]

Table 8.7: Linear model output predicting the total amount of money based on both the total number of coins and the number of low coins.
term                 estimate  std.error  statistic  p.value
(Intercept)              0.80       0.30       2.65   0.0142
number.of.coins          0.21       0.02       9.89  <0.0001
number.of.low.coins     -0.16       0.03      -5.51  <0.0001
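
To see how the sign and significance of the low-coin coefficient change between the single-variable and two-variable models, each of the three fits takes one line. The sketch assumes a data frame coins with columns amount, number.of.coins, and number.of.low.coins (hypothetical names chosen to match the tables above).

```r
# Model in Table 8.5: total number of coins only.
summary(lm(amount ~ number.of.coins, data = coins))$coefficients

# Model in Table 8.6: number of low coins only.
summary(lm(amount ~ number.of.low.coins, data = coins))$coefficients

# Model in Table 8.7: both predictors; the low-coin coefficient changes sign.
summary(lm(amount ~ number.of.coins + number.of.low.coins, data = coins))$coefficients
```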

When working with multiple regression models, understanding the significance and sign of the coefficient is not always as straightforward as it was with the coin example. However, we encourage you to always think carefully about the variables in the model, consider how they might be correlated among themselves, and work through different models to see how using different sets of variables might produce different relationships for predicting the response variable of interest.

Although diving into the details is beyond the scope of this text, we will provide one more reflection about multicollinearity. If the predictor variables have some degree of correlation, it can be quite difficult to interpret the value of the coefficient or its significance. However, even a model that suffers from high multicollinearity will likely lead to unbiased predictions of the response variable. So if the task at hand is only to do prediction, multicollinearity is unlikely to cause you substantial problems.

8.3.3 Cross validation for prediction error

In Section 8.3.1, p-values were calculated on each of the model coefficients. The p-value gives a sense of which variables are important to the model; however, a more extensive treatment of variable selection is warranted in a follow-up course or textbook. Here, we use cross validation prediction error to focus on which variable(s) are important for predicting the response variable of interest. In general, linear models are also used to make predictions of individual observations. In addition to model building, cross validation provides a method for generating predictions that are not overfit to the particular dataset at hand. We continue to encourage you to take up further study on the topic of cross validation, as it is among the most important ideas in modern data analysis, and we are only able to scratch the surface here.

Cross validation is a computational technique which removes some observations before a model is run, then assesses the model accuracy on the held-out sample. By removing some observations, we provide ourselves with an independent evaluation of the model (that is, the removed observations do not contribute to finding the parameters which minimize the least squares equation). Cross validation can be used in many different ways (as an independent assessment), and here we will just scratch the surface with respect to one way the technique can be used to compare models. See Figure 8.19 for an image describing how cross validation works.

Figure 8.19: This isn’t the right picture, but I’d love to keep CV in the book and be able to explain it in a straightforward way.

The data can be found in the palmerpenguins package: penguins. The observations of three different penguin species include measurements on body size and sex. The data were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER as part of the Long Term Ecological Research Network.

Our goal in this section is to compare two different regression models which both seek to predict the mass of an individual penguin in grams. The first model will predict body_mass_g by using only the bill_length_mm, a variable denoting the length of a penguin’s bill, in mm. The second model will predict body_mass_g by using bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.

The presentation below (see the comparison of Figures 8.20 and 8.21) shows that the model with more variables predicts body_mass_g with much smaller errors (predicted minus actual body mass) than the model which uses only bill_length_mm. We have deliberately used a model that intuitively makes sense (the more body measurements, the more predictable mass is). However, in many settings, it is not obvious which variables or which models contribute most to accurate predictions. Cross validation is one way to get accurate independent predictions with which to compare different models.

Comparing two models to predict body_mass_g in penguins

The question we will seek to answer is whether the predictions of body_mass_g are substantially better when bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species are used in the model, as compared with a model on bill_length_mm only.

The population model \(A\) is given below, and the estimates of the parameters (on the entire dataset) are provided in Table 8.8. The population model \(B\) is given below, and the estimates of the parameters (on the entire dataset) are provided in Table 8.9. Given what we know about high correlations between body measurements, it is somewhat unsurprising that all of the variables are significant (small p-values) predictors of body_mass_g.

\[\begin{eqnarray*} \mbox{Model } A: && E[\mbox{body_mass_g}] = \beta_0 + \beta_1 \times \mbox{bill_length_mm}\\ \mbox{Model } B: && E[\mbox{body_mass_g}] = \beta_0 + \beta_1 \times \mbox{bill_length_mm} + \beta_2 \times \mbox{bill_depth_mm} + \\ && \beta_3 \times \mbox{flipper_length_mm} + \beta_4 \times \mbox{sex}_{male} + \\ && \beta_5 \times \mbox{species}\\ \end{eqnarray*}\]

Table 8.8: The least squares estimates of the regression model predicting body_mass_g from bill_length_mm.
term estimate std.error statistic p.value
(Intercept) 362.3 283.3 1.28 0.202
bill_length_mm 87.4 6.4 13.65 0.000
Table 8.9: The least squares estimates of the regression model predicting body_mass_g from bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.
term estimate std.error statistic p.value
(Intercept) -1461.0 571.31 -2.56 0.011
bill_length_mm 18.2 7.11 2.56 0.011
bill_depth_mm 67.2 19.74 3.40 0.001
flipper_length_mm 15.9 2.91 5.48 0.000
sexmale 389.9 47.85 8.15 0.000
speciesChinstrap -251.5 81.08 -3.10 0.002
speciesGentoo 1014.6 129.56 7.83 0.000
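For reference, estimates like those in Tables 8.8 and 8.9 could be obtained along the following lines. This is a sketch rather than the authors' code, and it assumes rows with missing values are dropped, so the values may differ slightly.

library(palmerpenguins)  # penguins data
library(broom)           # tidy() summaries of fitted models

penguins_complete <- na.omit(penguins)   # drop rows with missing measurements

model_A <- lm(body_mass_g ~ bill_length_mm, data = penguins_complete)
model_B <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm +
                flipper_length_mm + sex + species,
              data = penguins_complete)

tidy(model_A)   # compare with Table 8.8
tidy(model_B)   # compare with Table 8.9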

The predictions that will allow us to distinguish between models A and B must be independent of the data which were used to build the model. So, using cross validation, we remove one quarter of the data before running the least squares calculations. Here we use a 4-fold cross validation (meaning that one quarter of the data is removed each time) to produce four different versions of each model (other times it might be more appropriate to use 2-fold or 10-fold or even run the model separately after removing each individual data point one at a time).

  • Model \(A_1\) built on 3/4 of the data, after the first random quarter of observations has been removed.
  • Model \(A_2\) built on 3/4 of the data, after the second random quarter of observations has been removed.
  • Model \(A_3\) built on 3/4 of the data, after the third random quarter of observations has been removed.
  • Model \(A_4\) built on 3/4 of the data, after the fourth random quarter of observations has been removed.

Each of models \(A_1\) through \(A_4\) produces its own set of least squares estimates. Each fitted model is then used to predict body_mass_g for the quarter of penguins that was withheld from its fitting, so that the predictions are independent of the model building; a sketch of the computation follows.
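One way the four folds could be constructed and used is sketched below. This is an illustration under the assumption that rows with missing values have been dropped, not the code used to produce Figure 8.20; the objects peng and fold and the function cv_errors() are names introduced here, and the random fold assignment means the errors will differ from run to run.

library(palmerpenguins)

set.seed(470)
peng <- na.omit(penguins)
peng$fold <- sample(rep(1:4, length.out = nrow(peng)))   # assign random quarters

# For each fold: fit on the other 3/4, predict the held-out 1/4
# (the response is assumed to be body_mass_g)
cv_errors <- function(formula, data) {
  errors <- numeric(nrow(data))
  for (k in 1:4) {
    fit  <- lm(formula, data = data[data$fold != k, ])
    pred <- predict(fit, newdata = data[data$fold == k, ])
    errors[data$fold == k] <- pred - data$body_mass_g[data$fold == k]
  }
  errors
}

errors_A <- cv_errors(body_mass_g ~ bill_length_mm, peng)

Plotting errors_A against the corresponding predicted values gives a display in the spirit of Figure 8.20.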

Figure 8.20: One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length only. The x-axis represents the predicted value, the y-axis represents the error (difference between predicted value and actual value).

  • Model \(B_1\) built on 3/4 of the data, after the first random quarter of observations has been removed.
  • Model \(B_2\) built on 3/4 of the data, after the second random quarter of observations has been removed.
  • Model \(B_3\) built on 3/4 of the data, after the third random quarter of observations has been removed.
  • Model \(B_4\) built on 3/4 of the data, after the fourth random quarter of observations has been removed.

Likewise, each of models \(B_1\) through \(B_4\) produces its own set of least squares estimates and is used to predict body_mass_g for the quarter of penguins withheld from its fitting (see the sketch below, which reuses the function defined above).
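The same cv_errors() function and peng data frame from the sketch above can be reused for model B; again, this is an illustration rather than the code behind Figure 8.21.

errors_B <- cv_errors(body_mass_g ~ bill_length_mm + bill_depth_mm +
                        flipper_length_mm + sex + species, peng)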

Figure 8.21: One quarter at a time, the data were removed from the model building, and the body mass of the removed penguins was predicted. The least squares regression model was fit independently of the removed penguins. The predictions of body mass are based on bill length, bill depth, flipper length, sex, and species. The x-axis represents the predicted value, the y-axis represents the error (difference between predicted value and actual value).

Figure 8.20 shows that the independent predictions are centered around the true values (i.e., errors are centered around zero), but that the predictions can be as much as 1000g off when using only bill_length_mm to predict body_mass_g. On the other hand, when using bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species to predict body_mass_g, the prediction errors seem to be about half as big, as seen in Figure 8.21.
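Continuing the sketch above, a simple one-number comparison of the two models is the root mean squared error of the held-out predictions; if the pattern in Figures 8.20 and 8.21 holds, the second value should be roughly half the first.

sqrt(mean(errors_A^2))   # typical size of a held-out prediction error, model A
sqrt(mean(errors_B^2))   # typical size of a held-out prediction error, model B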

We have provided a very brief overview of, and an example using, cross validation. Cross validation is a computational approach to model building and model validation, and it serves as an alternative to reliance on p-values. While p-values have a role to play in understanding model coefficients, throughout this text we have continued to present computational methods that broaden statistical approaches to data analysis. Cross validation will be used again in Section 8.4 with logistic regression. We encourage you to consider both standard inferential methods (such as p-values) and computational approaches (such as cross validation) as you build and use multivariable models of all varieties.

8.3.4 Exercises

Exercises for this section are under construction.

8.4 Inference for logistic regression

As with multiple linear regression, the inference aspect for logistic regression will focus on interpretation of coefficients and relationships between explanatory variables. Both p-values and cross validation will be used for assessing a logistic regression model.

We continue to work with the Palmer Penguins data collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. Now, however, the research goal changes to focus on a model which predicts the island from which a particular penguin came. The logistic model (see Section 4.5) can be used both as a way to understand which variables are most important for modeling island origin (inference on the logistic coefficients) and as a way to predict island origin for individual penguins (cross validation).

8.4.1 Multiple logistic regression output from software

Recall the penguins data introduced in Section 8.3.3.

As learned in Chapter 4, optimization can be used to find the coefficient estimates for the logistic model. The unknown population model can be written as: \[\log_e\bigg(\frac{p}{1-p}\bigg) = \beta_0 + \beta_1\times \mbox{bill_length_mm} + \beta_2 \times \mbox{bill_depth_mm} + \beta_3 \times \mbox{flipper_length_mm} + \beta_4 \times \mbox{sex}_{male}\]

The estimated equation for the regression model may be written as a model with four predictor variables:

Table 8.10: Summary of a logistic model for predicting island of origin based on the variables bill_length_mm, bill_depth_mm, flipper_length_mm, and sex. Each variable has its own coefficient estimate and p-value.
term estimate std.error statistic p.value
(Intercept) -0.75 5.52 -0.14 <0.0001
bill_length_mm 0.22 0.05 4.68 <0.0001
bill_depth_mm 0.76 0.15 4.96 <0.0001
flipper_length_mm -0.11 0.02 -4.63 <0.0001
sexmale -1.29 0.46 -2.80 <0.0001

\[\log_e\bigg(\frac{\hat{p}}{1-\hat{p}}\bigg) = -0.75 + 0.22\times \mbox{bill_length_mm} + 0.76 \times \mbox{bill_depth_mm} - 0.11 \times \mbox{flipper_length_mm} - 1.29 \times \mbox{sex}_{male}\]
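Estimates like those in Table 8.10 could be obtained with glm() along the following lines. This is a sketch only: it assumes the analysis is restricted to the two islands that appear in the predictions (Biscoe and Dream) and that rows with missing values are dropped, neither of which is stated explicitly in the text, so the values (and the sign convention, which depends on which island is treated as the "success" category) may differ.

library(palmerpenguins)
library(broom)

# Assumption: keep only Biscoe and Dream so the response is binary
peng2 <- droplevels(na.omit(subset(penguins, island != "Torgersen")))

island_fit <- glm(island ~ bill_length_mm + bill_depth_mm +
                    flipper_length_mm + sex,
                  data = peng2, family = binomial)

tidy(island_fit)   # compare with Table 8.10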

Not only does Table 8.10 provide the estimates for the coefficients, it also provides information on the inference analysis (i.e., hypothesis testing) which is the focus of this chapter.

As in Section 8.3, with multiple predictors, each hypothesis test (for each of the explanatory variables) is conditioned on each of the other variables remaining in the model.

\[\mbox{if multiple predictors } H_0: \beta_i = 0 \mbox{ given other variables in the model}\]

Using the example above and focusing on each of the variable p-values (here we won’t discuss the p-value associated with the intercept), we can write out the four different hypotheses:

\[\begin{eqnarray*} H_0: \beta_1 = 0 && \mbox{ given bill_depth_mm and flipper_length_mm and sex are included in the model}\\ H_0: \beta_2 = 0 && \mbox{ given bill_length_mm and flipper_length_mm and sex are included in the model}\\ H_0: \beta_3 = 0 && \mbox{ given bill_length_mm and bill_depth_mm and sex are included in the model}\\ H_0: \beta_4 = 0 && \mbox{ given bill_length_mm and bill_depth_mm and flipper_length_mm are included in the model} \end{eqnarray*}\]

The very low p-values from the software output tell us that each of the variables acts as an important predictor in the model, despite the inclusion of any of the other three. Consider the p-value on \(H_0: \beta_1 = 0\). The low p-value says that it would be extremely unlikely to see data that produce a coefficient on bill_length_mm as large as 0.22 if the true relationship between bill_length_mm and island were non-existent (i.e., if \(\beta_1 = 0\)) and the model also included bill_depth_mm, flipper_length_mm, and sex. You might have thought that the value 0.22 is a small number (i.e., close to zero), but in the units of the problem, 0.22 turns out to be far away from zero; it’s all about context! The p-values on the remaining variables are interpreted similarly.

As with linear regression (see Section 8.3.2), correlated explanatory variables can impact both the coefficient estimates and the associated p-values. Investigating multicollinearity in a logistic regression model is saved for a text which provides more detail about logistic regression. Here, we revisit cross validation within the context of predicting island of origin for each of the individual penguins.

8.4.2 Cross validation for prediction error

As expected, when only a single explanatory variable is used, the cross validation prediction accuracy rates are noticeably lower (roughly 56% to 60%, first set of output below) than the accuracy rates from the model that uses all four explanatory variables (roughly 75% to 89%, second set of output below).

#> , , fold = 1st quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        35     6
#>   Dream         23     8
#>   Torgersen      0     0
#> 
#> , , fold = 2nd quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        41     0
#>   Dream         30     1
#>   Torgersen      0     0
#> 
#> , , fold = 3rd quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        36     5
#>   Dream         24     6
#>   Torgersen      0     0
#> 
#> , , fold = 4th quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        40     0
#>   Dream         31     0
#>   Torgersen      0     0
#> # A tibble: 4 x 3
#>   fold        count accuracy
#> * <chr>       <int>    <dbl>
#> 1 1st quarter    72    0.597
#> 2 2nd quarter    72    0.583
#> 3 3rd quarter    71    0.592
#> 4 4th quarter    71    0.563
#> , , fold = 1st quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        32     8
#>   Dream          9    22
#>   Torgersen      0     0
#> 
#> , , fold = 2nd quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        34     7
#>   Dream          8    23
#>   Torgersen      0     0
#> 
#> , , fold = 3rd quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        36     5
#>   Dream          3    28
#>   Torgersen      0     0
#> 
#> , , fold = 4th quarter
#> 
#>            predIsland
#> obs         Biscoe Dream
#>   Biscoe        30    11
#>   Dream          7    23
#>   Torgersen      0     0
#> # A tibble: 4 x 3
#>   fold        count accuracy
#> * <chr>       <int>    <dbl>
#> 1 1st quarter    71    0.761
#> 2 2nd quarter    72    0.792
#> 3 3rd quarter    72    0.889
#> 4 4th quarter    71    0.746
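Accuracy rates like those printed above could be computed along these lines. This is a self-contained sketch that repeats the two-island and complete-case assumptions from earlier; the objects peng2 and fold and the function cv_accuracy() are names introduced here, the single-variable model is shown with bill_length_mm as a placeholder since the text does not name the variable, and the random fold assignment means the exact counts and accuracies will differ.

library(palmerpenguins)

set.seed(470)
peng2 <- droplevels(na.omit(subset(penguins, island != "Torgersen")))
peng2$fold <- sample(rep(1:4, length.out = nrow(peng2)))   # assign random quarters

# For each fold: fit the logistic model without it, classify the held-out penguins
cv_accuracy <- function(formula, data) {
  acc <- numeric(4)
  for (k in 1:4) {
    fit  <- glm(formula, data = data[data$fold != k, ], family = binomial)
    prob <- predict(fit, newdata = data[data$fold == k, ], type = "response")
    pred <- ifelse(prob > 0.5, levels(data$island)[2], levels(data$island)[1])
    acc[k] <- mean(pred == data$island[data$fold == k])
  }
  acc
}

cv_accuracy(island ~ bill_length_mm, peng2)                   # single-variable model
cv_accuracy(island ~ bill_length_mm + bill_depth_mm +
              flipper_length_mm + sex, peng2)                 # four-variable model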

8.4.3 Exercises

Exercises for this section are under construction.

8.5 Chapter review

Throughout the text, we have presented a modern view of introductory statistics. Early on, we presented graphical techniques which communicated relationships across multiple variables. We also used modeling to formalize the relationships. Many chapters were dedicated to inferential methods which allowed claims about the population to be made based on samples of data. Not only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods. In this chapter, we brought many of the ideas together by considering inferential claims on models which include many variables.

As you might guess, this text has only scratched the surface of the world of statistical analyses that can be applied to different datasets. In particular, to do justice to the topic, the linear models and generalized linear models we have introduced could each be covered by their own course or book. Hierarchical models, alternative methods for fitting parameters (e.g., ridge regression or the LASSO), and advanced computational methods applied to models (e.g., deciding whether to permute the response variable, one explanatory variable, or all of the explanatory variables) are all beyond the scope of this book. However, your successful understanding of the ideas we have covered has set you up perfectly to move on to a higher level of statistical modeling and inference. Enjoy!

8.5.1 Terms

We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However, you should be able to easily spot them as bolded text.

8.5.2 Chapter exercises

Exercises for this section are under construction.

8.5.3 Interactive R tutorials

Navigate the concepts you’ve learned in this chapter in R using the following self-paced tutorials. All you need is your browser to get started!

You can also access the full list of tutorials supporting this book here.

8.5.4 R labs

Further apply the concepts you’ve learned in this chapter in R with computational labs that walk you through a data analysis case study.


  1. The answer to this question relies on the idea that statistical data analysis is somewhat of an art. That is, in many situations, there is no “right” answer. As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset. In terms of the Great Depression, we will provide two contrasting considerations. Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high. On the other hand, these are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.↩︎

  2. We look in the second row corresponding to the family income variable. We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the \(t\)-test statistic is \(T = -3.98\). The p-value corresponds exactly to the two-sided test we are interested in: 0.0002. The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of Figure 5.19.↩︎

  3. The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant. These are also not time series observations. Least squares regression can be applied to these data.↩︎

  4. In all honesty, this particular dataset is fabricated, and the original idea for the problem comes from Jeff Witmer at Oberlin College.↩︎