4 Multivariable and logistic models

The principles of simple linear regression lay the foundation for more sophisticated regression models used in a wide range of challenging settings. In this chapter, we explore multiple regression, which introduces the possibility of more than one predictor in a linear model, and logistic regression, a technique for predicting categorical outcomes with two levels.

4.1 Regression with multiple predictors

Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \(x_1\), \(x_2\), \(x_3\), …). The method is motivated by scenarios where many variables may be simultaneously connected to an output.

We will consider data about loans from the peer-to-peer lender, Lending Club, which is a data set we first encountered in Chapters 1. The loan data includes terms of the loan as well as information about the borrower. The outcome variable we would like to better understand is the interest rate assigned to the loan. For instance, all other characteristics held constant, does it matter how much debt someone already has? Does it matter if their income has been verified? Multiple regression will help us answer these and other questions.

The data set includes results from 10,000 loans, and we’ll be looking at a subset of the available variables, some of which will be new from those we saw in earlier chapters. The first six observations in the data set are shown in Table 4.1, and descriptions for each variable are shown in Table 4.2. Notice that the past bankruptcy variable (bankruptcy) is an indicator variable, where it takes the value 1 if the borrower had a past bankruptcy in their record and 0 if not. Using an indicator variable in place of a category name allows for these variables to be directly used in regression. Two of the other variables are categorical (verified_income and issue_month), each of which can take one of a few different non-numerical values; we’ll discuss how these are handled in the model in Section 4.1.1.

The data can be found in the openintro package: loans_full_schema. Based on the data in this dataset we have created two new variables: credit_util which is calculated as the total credit utilized divided by the total credit limit and bankruptcy which turns the number of bankruptcies to an indicator variable (0 for no bankruptcies and 1 for at least 1 bankruptcies). We will refer to this modified dataset as loans.

Table 4.1: First six rows from the loans_full_schema data set.
interest_rate verified_income debt_to_income credit_util bankruptcy term credit_checks issue_month
14.07 Verified 18.01 0.548 0 60 6 Mar-2018
12.61 Not Verified 5.04 0.150 1 36 1 Feb-2018
17.09 Source Verified 21.15 0.661 0 36 4 Feb-2018
6.72 Not Verified 10.16 0.197 0 36 0 Jan-2018
14.07 Verified 57.96 0.755 0 36 7 Mar-2018
6.72 Not Verified 6.46 0.093 0 36 6 Jan-2018
Table 4.2: Variables and their descriptions for the loans data set.
variable description
interest_rate Interest rate on the loan, in an annual percentage.
verified_income Categorical variable describing whether the borrower’s income source and amount have been verified, with levels Verified, Source Verified, and Not Verified.
debt_to_income Debt-to-income ratio, which is the percentage of total debt of the borrower divided by their total income.
credit_util Of all the credit available to the borrower, what fraction are they utilizing. For example, the credit utilization on a credit card would be the card’s balance divided by the card’s credit limit.
bankruptcy An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of 1 if the answer is yes and 0 if the answer is no.
term The length of the loan, in months.
issue_month The month and year the loan was issued, which for these loans is always during the first quarter of 2018.
credit_checks Number of credit checks in the last 12 months. For example, when filing an application for a credit card, it is common for the company receiving the application to run a credit check.

4.1.1 Indicator and categorical predictors

Let’s start by fitting a linear regression model for interest rate with a single predictor indicating whether or not a person has a bankruptcy in their record:

\[\widehat{\texttt{interest_rate}} = 12.33 + 0.74 \times \texttt{bankruptcy}\]

Results of this model are shown in Table 4.3.

Table 4.3: Summary of a linear model for predicting interest rate based on whether the borrower has a bankruptcy in their record. Degrees of freedom for this model is 9998.
term estimate std.error statistic p.value
(Intercept) 12.338 0.0533 231.49 <0.0001
bankruptcy1 0.737 0.1529 4.82 <0.0001

Interpret the coefficient for the past bankruptcy variable in the model. Is this coefficient significantly different from 0?

The variable takes one of two values: 1 when the borrower has a bankruptcy in their history and 0 otherwise. A slope of 0.74 means that the model predicts a 0.74% higher interest rate for those borrowers with a bankruptcy in their record. (See Section 3.2.6 for a review of the interpretation for two-level categorical predictor variables.) Examining the regression output in Table 4.3, we can see that the p-value for is very close to zero, indicating there is strong evidence the coefficient is different from zero when using this simple one-predictor model.

Suppose we had fit a model using a 3-level categorical variable, such as verified_income. The output from software is shown in Table 4.4. This regression output provides multiple rows for the variable. Each row represents the relative difference for each level of verified_income. However, we are missing one of the levels: Not Verified. The missing level is called the reference level and it represents the default level that other levels are measured against.

Table 4.4: Summary of a linear model for predicting interest rate based on whether the borrower’s income source and amount has been verified. This predictor has three levels, which results in 2 rows in the regression output.
term estimate std.error statistic p.value
(Intercept) 11.10 0.081 137.2 <0.0001
verified_incomeSource Verified 1.42 0.111 12.8 <0.0001
verified_incomeVerified 3.25 0.130 25.1 <0.0001

How would we write an equation for this regression model?

The equation for the regression model may be written as a model with two predictors:

\[ \begin{align} \widehat{\texttt{interest_rate}} = 11.10 &+ 1.42 \times \texttt{verified_income}_{\texttt{Source Verified}} \\ &+ 3.25 \times \texttt{verified_income}_{\texttt{Verified}} \end{align} \]

We use the notation \(\texttt{variable}_{\texttt{level}}\) to represent indicator variables for when the categorical variable takes a particular value. For example, \(\texttt{verified_income}_{\texttt{Source Verified}}\) would take a value of 1 if was for a loan, and it would take a value of 0 otherwise. Likewise, \(\texttt{verified_income}_{\texttt{Verified}}\) would take a value of 1 if took a value of verified and 0 if it took any other value.

The notation \(\texttt{variable}_{\texttt{level}}\) may feel a bit confusing. Let’s figure out how to use the equation for each level of the verified_income variable.

Using the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source and amount are both unverified.

When verified_income takes a value of Not Verified, then both indicator functions in the equation for the linear model are set to 0:

\[\widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 0 + 3.25 \times 0 = 11.10\]

The average interest rate for these borrowers is 11.1%. Because the level does not have its own coefficient and it is the reference value, the indicators for the other levels for this variable all drop out.

Using the model for predicting interest rate from income verification type, compute the average interest rate for borrowers whose income source and amount are both unverified.

When verified_income takes a value of Source Verified, then the corresponding variable takes a value of 1 while the other is 0:

\[\widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 1 + 3.25 \times 0 = 12.52\]

The average interest rate for these borrowers is 12.52%.

Compute the average interest rate for borrowers whose income source and amount are both verified.94

Predictors with several categories.

When fitting a regression model with a categorical variable that has \(k\) levels where \(k > 2\), software will provide a coefficient for \(k - 1\) of those levels. For the last level that does not receive a coefficient, this is the , and the coefficients listed for the other levels are all considered relative to this reference level.

Interpret the coefficients in the model.95

The higher interest rate for borrowers who have verified their income source or amount is surprising. Intuitively, we’d think that a loan would look less risky if the borrower’s income has been verified. However, note that the situation may be more complex, and there may be confounding variables that we didn’t account for. For example, perhaps lender require borrowers with poor credit to verify their income. That is, verifying income in our data set might be a signal of some concerns about the borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding variables might explain this counter-intuitive relationship suggested by the model?)

How much larger of an interest rate would we expect for a borrower who has verified their income source and amount vs a borrower whose income source has only been verified?96

4.1.2 Many predictors in a model

The world is complex, and it can be helpful to consider many factors at once in statistical modeling. For example, we might like to use the full context of borrower to predict the interest rate they receive rather than using a single variable. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression on observational data, such models are a common first step in gaining insights or providing some evidence of a causal connection.

We want to construct a model that accounts for not only for any past bankruptcy or whether the borrower had their income source or amount verified, but simultaneously accounts for all the variables in the loans data set: verified_income, debt_to_income, credit_util, bankruptcy, term, issue_month, and credit_checks.

\[\begin{align*} \widehat{\texttt{interest_rate}} = b_0 &+ b_1 \times \texttt{verified_income}_{\texttt{Source Verified}} \\ &+ b_2 \times \texttt{verified_income}_{\texttt{Verified}} \\ &+ b_3 \times \texttt{debt_to_income} \\ &+ b_4 \times \texttt{credit_util} \\ &+ b_5 \times \texttt{bankruptcy} \\ &+ b_6 \times \texttt{term} \\ &+ b_7 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\ &+ b_8 \times \texttt{issue_month}_{\texttt{Mar-2018}} \\ &+ b_9 \times \texttt{credit_checks} \end{align*}\]

This equation represents a holistic approach for modeling all of the variables simultaneously. Notice that there are two coefficients for verified_income and also two coefficients for issue_month, since both are 3-level categorical variables.

We calculate \(b_0\), \(b_1\), \(b_2\), \(\cdots\), \(b_9\) the same way as we did in the case of a model with a single predictor – we select values that minimize the sum of the squared residuals:

\[SSE = e_1^2 + e_2^2 + \dots + e_{10000}^2 = \sum_{i=1}^{10000} e_i^2 = \sum_{i=1}^{10000} \left(y_i - \hat{y}_i\right)^2\]

where \(y_i\) and \(\hat{y}_i\) represent the observed interest rates and their estimated values according to the model, respectively. 10,000 residuals are calculated, one for each observation. Note that these values are sample statistics and in the case where the observed data is a random sample from a target population that we are interested in making inferences about, they are estimates of the population parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\cdots\), \(\beta_9\). We will discuss inference based on linear models in Chapter 8, for now we will focus on calculating sample statistics \(b_i\).

We typically use a computer to minimize the sum of squares and compute point estimates, as shown in the sample output in Table 4.5. Using this output, we identify \(b_i\), just as we did in the one-predictor case.

Table 4.5: Output for the regression model, where interest rate is the outcome and the variables listed are the predictors. Degrees of freedom for this model is 9990.
term estimate std.error statistic p.value
(Intercept) 1.894 0.210 9.008 <0.0001
verified_incomeSource Verified 0.997 0.099 10.056 <0.0001
verified_incomeVerified 2.563 0.117 21.873 <0.0001
debt_to_income 0.022 0.003 7.434 <0.0001
credit_util 4.897 0.162 30.249 <0.0001
bankruptcy1 0.391 0.132 2.957 0.0031
term 0.153 0.004 38.889 <0.0001
credit_checks 0.228 0.018 12.516 <0.0001
issue_monthJan-2018 0.046 0.108 0.421 0.6736
issue_monthMar-2018 -0.042 0.107 -0.391 0.696

Multiple regression model.

A multiple regression model is a linear model with many predictors. In general, we write the model as

\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\]

when there are \(k\) predictors. We always calculate \(b_i\) using statistical software.

Write out the regression model using the point estimates from Table 4.5. How many predictors are there in this model?

The fitted model for the interest rate is given by:

\[ \begin{align} \widehat{\texttt{interest_rate}} = 1.925 &+ 0.975 \times \texttt{verified_income}_{\texttt{Source Verified}} \\ &+ 2.537 \times \texttt{verified_income}_{\texttt{Verified}} \\ &+ 0.021 \times \texttt{debt_to_income} \\ &+ 4.896 \times \texttt{credit_util} \\ &+ 0.386 \times \texttt{bankruptcy} \\ &+ 0.154 \times \texttt{term} \\ &+ 0.028 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\ &- 0.040 \times \texttt{issue_month}_{\texttt{Mar-2018}} \\ &+ 0.228 \times \texttt{credit_checks} \end{align} \]

If we count up the number of predictor coefficients, we get the effective number of predictors in the model: \(k = 9\). Notice that the categorical predictor counts as two, once for the two levels shown in the model. In general, a categorical predictor with \(p\) different levels will be represented by \(p - 1\) terms in a multiple regression model.

What does \(b_4\), the coefficient of variable credit_util, represent?

Compute the residual of the first observation in Table 4.1 on page using the full model.97

We calculated a slope coefficient of 0.74 for bankruptcy in Section 4.1.1 while the coefficient is 0.386 here. Why is there a difference between the coefficient values between the models with single and multiple predictors?

If we examined the data carefully, we would see that some predictors are correlated. For instance, when we modeled the relationship of the outcome interest_rate and predictor bankruptcy using simple linear regression, we were unable to control for other variables like whether the borrower had her income verified, the borrower’s debt-to-income ratio, and other variables. That original model was constructed in a vacuum and did not consider the full context. When we include all of the variables, underlying and unintentional bias that was missed by these other variables is reduced or eliminated. Of course, bias can still exist from other confounding variables.

The previous example describes a common issue in multiple regression: correlation among predictor variables. We say the two predictor variables are (pronounced as co-linear) when they are correlated, and this collinearity complicates model estimation. While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent predictors from being collinear.

The estimated value of the intercept is 1.925, and one might be tempted to make some interpretation of this coefficient, such as, it is the model’s predicted price when each of the variables take value zero: income source is not verified, the borrower has no debt (debt-to-income and credit utilization are zero), and so on. Is this reasonable? Is there any value gained by making this interpretation?98

4.1.3 Adjusted R-squared

We first used \(R^2\) in Section 3.2.5 to determine the amount of variability in the response that was explained by the model: \[ R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}} = 1 - \frac{Var(e_i)}{Var(y_i)} \]where \(e_i\) represents the residuals of the model and \(y_i\) the outcomes. This equation remains valid in the multiple regression framework, but a small enhancement can make it even more informative when comparing models.

The variance of the residuals for the model given in the earlier Guided Practice is 18.53, and the variance of the total price in all the auctions is 25.01. Calculate \(R^2\) for this model.99

This strategy for estimating \(R^2\) is acceptable when there is just a single variable. However, it becomes less helpful when there are many variables. The regular \(R^2\) is a biased estimate of the amount of variability explained by the model when applied to a new sample of data. To get a better estimate, we use the adjusted \(R^2\).

Adjusted R-squared as a tool for model assessment

The adjusted R-squared is computed as \[\begin{aligned} R_{adj}^{2} = 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} = 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned}\]

where \(n\) is the number of cases used to fit the model and \(k\) is the number of predictor variables in the model. Remember that a categorical predictor with \(p\) levels will contribute \(p - 1\) to the number of variables in the model.

Because \(k\) is never negative, the adjusted \(R^2\) will be smaller – often times just a little smaller – than the unadjusted \(R^2\). The reasoning behind the adjusted \(R^2\) lies in the associated with each variance, which is equal to \(n - k - 1\) for the multiple regression context. If we were to make predictions for new data using our current model, we would find that the unadjusted \(R^2\) would tend to be slightly overly optimistic, while the adjusted \(R^2\) formula helps correct this bias.

There were \(n=10000\) auctions in the data set and \(k=9\) predictor variables in the model. Use \(n\), \(k\), and the variances from the earlier Guided Practice to calculate \(R_{adj}^2\) for the interest rate model.100

Suppose you added another predictor to the model, but the variance of the errors \(Var(e_i)\) didn’t go down. What would happen to the \(R^2\)? What would happen to the adjusted \(R^2\)?101

Adjusted \(R^2\) could have been used in Chapter 3. However, when there is only \(k = 1\) predictors, adjusted \(R^2\) is very close to regular \(R^2\), so this nuance isn’t typically important when the model has only one predictor.

4.1.4 Exercises

  1. Absenteeism, Part I Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

    eth sex lrn days
    1 0 1 1 2
    2 0 1 1 11
    \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
    146 1 0 0 37

    The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner). (Venables and Ripley 2002)

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) 18.93 2.57 7.37 0.0000
    eth -9.11 2.60 -3.51 0.0000
    sex 3.10 2.64 1.18 0.2411
    lrn 2.15 2.65 0.81 0.4177
    1. Write the equation of the regression model.

    2. Interpret each one of the slopes in this context.

    3. Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.

    4. The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the \(R^2\) and the adjusted \(R^2\). Note that there are 146 observations in the data set.

  2. Baby weights, Part III We considered the variables smoke and parity, one at a time, in modeling birth weights of babies in Exercises \[baby_weights_smoke\] and \[baby_weights_parity\]. A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (gestation), mother’s age in years (age), mother’s height in inches (height), and mother’s pregnancy weight in pounds (weight). Below are three observations from this data set.

    bwt gestation parity age height weight smoke
    1 120 284 0 27 62 100 0
    2 113 282 0 33 64 135 0
    \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
    1236 117 297 0 38 65 129 0

    The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables included in the data set.

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) -80.41 14.35 -5.60 0.0000
    gestation 0.44 0.03 15.26 0.0000
    parity -3.33 1.13 -2.95 0.0033
    age -0.01 0.09 -0.10 0.9170
    height 1.15 0.21 5.63 0.0000
    weight 0.05 0.03 1.99 0.0471
    smoke -8.40 0.95 -8.81 0.0000
    1. Write the equation of the regression model that includes all of the variables.

    2. Interpret the slopes of gestation and age in this context.

    3. The coefficient for parity is different than in the linear model shown in Exercise \[baby_weights_parity\]. Why might there be a difference?

    4. Calculate the residual for the first observation in the data set.

    5. The variance of the residuals is 249.28, and the variance of the birth weights of all babies in the data set is 332.57. Calculate the \(R^2\) and the adjusted \(R^2\). Note that there are 1,236 observations in the data set.

  3. Baby weights, Part II Exercise \[baby_weights_smoke\] introduces a data set on birth weight of babies. Another variable we consider is parity, which is 1 if the child is the first born, and 0 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) 120.07 0.60 199.94 0.0000
    parity -1.93 1.19 -1.62 0.1052
    1. Write the equation of the regression model.

    2. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

    3. Is there a statistically significant relationship between the average birth weight and parity?

  4. Baby weights, Part I The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother. (n.d.)

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) 123.05 0.65 189.60 0.0000
    smoke -8.94 1.03 -8.65 0.0000

    The variability within the smokers and non-smokers are about equal and the distributions are symmetric. With these conditions satisfied, it is reasonable to apply the model. (Note that we don’t need to check linearity since the predictor has only two levels.)

    1. Write the equation of the regression model.

    2. Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.

    3. Is there a statistically significant relationship between the average birth weight and smoking?

  5. Cherry trees Timber yield is approximately equal to the volume of a tree, however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet. (Hand 1994)

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) -57.99 8.64 -6.71 0.00
    height 0.34 0.13 2.61 0.01
    diameter 4.71 0.26 17.82 0.00
    1. Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.

    2. One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.

  6. GPA A survey of 55 Duke University students asked about their GPA, number of hours they study at night, number of nights they go out, and their gender. Summary output of the regression model is shown below. Note that male is coded as 1.

    Estimate Std. Error t value Pr(\(>\)\(|\)t\(|\))
    (Intercept) 3.45 0.35 9.85 0.00
    studyweek 0.00 0.00 0.27 0.79
    sleepnight 0.01 0.05 0.11 0.91
    outnight 0.05 0.05 1.01 0.32
    gender -0.08 0.12 -0.68 0.50
    1. Calculate a 95% confidence interval for the coefficient of gender in the model, and interpret it in the context of the data.

    2. Would you expect a 95% confidence interval for the slope of the remaining variables to include 0? Explain

4.2 Model selection

The best model is not always the most complicated. Sometimes including variables that are not evidently important can actually reduce the accuracy of predictions. In this section, we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important. It’s common (and hip, at least in the statistical world) to refer to models that have undergone such variable pruning as parsimonious.

In practice, the model that includes all available predictors is often referred to as the full model. The full model may not be the best model, and if it isn’t, we want to identify a smaller model that is preferable.

4.2.1 Stepwise selection

Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection. These techniques are often referred to as stepwise selection strategies, because they add or delete one variable at a time as they “step” through the candidate predictors.

Backward elimination starts with the full model (the model that includes all potential predictor variables. Variables are eliminated one-at-a-time from the model until we cannot improve the model any further.

Forward selection is the reverse of the backward elimination technique. Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we cannot find any variables that improve the model any further.

An important consideration in implementing either of these stepwise selection strategies is the criterion used to decide whether to eliminate or add a variable. One commonly used decision criterion is adjusted \(R^2\). When using adjusted \(R^2\) as the decision criterion, we seek to eliminate or add variables depending on whether they lead to the largest improvement in adjusted \(R^2\) and we stop when adding or elimination of another variable does not lead to further improvement in adjusted \(R^2\).

Adjusted \(R^2\) describes the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where adding value means they are (likely) improving the accuracy in predicting future outcomes.

Let’s consider two models, which are shown in Table 4.6 and Table 4.7. The first table summarizes the full model since it includes all predictors, while the second does not include the issue_month variable.

Table 4.6: The fit for the full regression model, including the adjusted \(R^2\).
term estimate std.error statistic p.value
(Intercept) 1.894 0.210 9.008 <0.0001
verified_incomeSource Verified 0.997 0.099 10.056 <0.0001
verified_incomeVerified 2.563 0.117 21.873 <0.0001
debt_to_income 0.022 0.003 7.434 <0.0001
credit_util 4.897 0.162 30.249 <0.0001
bankruptcy1 0.391 0.132 2.957 0.0031
term 0.153 0.004 38.889 <0.0001
credit_checks 0.228 0.018 12.516 <0.0001
issue_monthJan-2018 0.046 0.108 0.421 0.6736
issue_monthMar-2018 -0.042 0.107 -0.391 0.696
Adjusted \(R^2\) = 0.2597
df = 9964
Table 4.7: The fit for the regression model after dropping the issue_month variable.
term estimate std.error statistic p.value
(Intercept) 1.896 0.198 9.56 <0.0001
verified_incomeSource Verified 0.996 0.099 10.05 <0.0001
verified_incomeVerified 2.561 0.117 21.86 <0.0001
debt_to_income 0.022 0.003 7.44 <0.0001
credit_util 4.896 0.162 30.25 <0.0001
bankruptcy1 0.392 0.132 2.96 0.0031
term 0.153 0.004 38.89 <0.0001
credit_checks 0.228 0.018 12.52 <0.0001
Adjusted \(R^2\) = 0.2598
df = 9966

Which of the two models is better?

We compare the adjusted \(R^2\) of each model to determine which to choose. Since the second model has a higher \(R^2_{adj}\) compared to the first model, we prefer the second model to the first.

Will the model without issue_month be better than the model with issue_month? We cannot know for sure, but based on the adjusted \(R^2\), this is our best assessment.

Results corresponding to the full model for the loans data are shown in Table 4.6. How should we proceed under the backward elimination strategy?

Our baseline adjusted \(R^2\) from the full model is , and we need to determine whether dropping a predictor will improve the adjusted \(R^2\). To check, we fit models that each drop a different predictor, and we record the adjusted \(R^2\):

  • Excluding verified_income: 0.22380
  • Excluding debt_to_income: 0.25468
  • Excluding credit_util: 0.19063
  • Excluding bankruptcy: 0.25787
  • Excluding term: 0.14581
  • Excluding credit_checks: 0.25854
  • Excluding issue_month: 0.24689

The model without issue_month has the highest adjusted \(R^2\) of 0.25854, higher than the adjusted \(R^2\) for the full model. Because eliminating issue_month leads to a model with a higher adjusted \(R^2\), we drop issue_month from the model.

Since we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors. Our baseline adjusted \(R^2\) is now \(R^2_{adj} = 0.25854\). We now fit new models, which consider eliminating each of the remaining predictors in addition to issue_month:

  • Excluding issue_month and verified_income: 0.22395
  • Excluding issue_month and debt_to_income: 0.25479
  • Excluding issue_month and credit_util: 0.19074
  • Excluding issue_month and bankruptcy: 0.25798
  • Excluding issue_month and term: 0.14592
  • Excluding issue_month and credit_checks: 0.24701

None of these models lead to an improvement in adjusted $R^2$, so we do not eliminate any of the remaining predictors. That is, after backward elimination, we are left with the model that keeps all predictors except issue_month, which we can summarize using the coefficients from Table 4.7.

\[ \begin{align} \widehat{\texttt{interest_rate}} = 1.921 &+ 0.974 \times \texttt{verified_income}_\texttt{Source only} \\ &+ 2.535 \times \texttt{verified_income}_\texttt{Verified} \\ &+ 0.021 \times \texttt{debt_to_income} \\ &+ 4.896 \times \texttt{credit_util} \\ &+ 0.387 \times \texttt{bankruptcy} \\ &+ 0.154 \times \texttt{term} \\ &+ 0.228 \times \texttt{credit_check} \end{align} \]

Construct a model for predicting interest_rate from the loans data using forward selection.

We start with the model that includes no predictors. Then we fit each of the possible models with just one predictor. Then we examine the adjusted \(R^2\) for each of these models:

  • Including verified_income: 0.05926
  • Including debt_to_income: 0.01946
  • Including credit_util: 0.06452
  • Including bankruptcy: 0.00222
  • Including term: 0.12855
  • Including credit_checks: -0.0001
  • Including issue_month: 0.01711

In this first step, we compare the adjusted \(R^2\) against a baseline model that has no predictors. The no-predictors model always has \(R_{adj}^2 = 0\). The model with one predictor that has the largest adjusted \(R^2\) is the model with the term predictor, and because this adjusted \(R^2\) is larger than the adjusted \(R^2\) from the model with no predictors (\(R_{adj}^2 = 0\)), we will add this variable to our model.

We repeat the process again, this time considering 2-predictor models where one of the predictors is and with a new baseline of \(R^2_{adj} = 0.12855\):

  • Including term and verified_income: 0.16851
  • Including term and debt_to_income: 0.14368
  • Including term and credit_util: 0.20046
  • Including term and bankruptcy: 0.13070
  • Including term and credit_checks: 0.12840
  • Including term and issue_month: 0.14294

Adding credit_util yields a model with a higher adjusted \(R^2\) (0.20046) than the baseline (0.12855), so we also add credit_util to the model as a predictor.

Since we have again added a predictor to the model, we continue and see whether it would be beneficial to add a third predictor:

  • Including term, credit_util, and verified_income: 0.24183
  • Including term, credit_util, and debt_to_income: 0.20810
  • Including term, credit_util, and bankruptcy: 0.20169
  • Including term, credit_util, and credit_checks: 0.20031
  • Including term, credit_util, and issue_month: 0.21629

The model including verified_income improved adjusted \(R^2\) (0.24183 to 0.20046), so we verified_income add to the model as a predictor as well.

We continue on in this way, next adding debt_to_income, then credit_checks, and bankruptcy. At this point, we come again to the issue_month variable: adding this as a predictor leads to \(R_{adj}^2 = 0.25843\), while keeping all the other predictors but excluding issue_month leads to a higher \(R_{adj}^2 = 0.25854\). This means we do not add issue_month to the model as a predictor. In this example, we have arrived at the same model that we identified from backward elimination.

Stepwise selection strategies

Backward elimination begins with the model having the largest number of predictors and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model. Forward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables are found.

Backward elimination and forward selection sometimes arrive at different final models. If trying both techniques and this happens, it’s common to choose the model with the larger adjusted \(R^2\).

4.2.2 Other model selection strategies

Stepwise selection using adjusted \(R^2\) as the decision criteria is one of many widely accepted and commonly used model selection strategies. Stepwise selection can also be carried out with decision criteria other than adjusted \(R^2\), such as p-values, which you’ll learn about in Chapter 8, or AIC (Akaike information criterion) or BIC (Bayesian information criterion), which you might learn about in more advanced courses.

Additionally, one can choose to include or exclude variables from a model based on expert opinion or due to research focus.

4.2.3 Exercises

  1. Absenteeism, Part II Exercise \[absent_from_school_mlr\] considers a model that predicts the number of days absent using three predictors: ethnic background (), gender (), and learner status (). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

    Model Adjusted \(R^2\)
    1 Full model 0.0701
    2 No ethnicity -0.0033
    3 No sex 0.0676
    4 No learner status 0.0723

    Which, if any, variable should be removed from the model first?

  2. Absenteeism, Part III Exercise \[absent_from_school_mlr\] provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted \(R^2\) of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

    variable ethnicity sex learner status
    p-value 0.0007 0.3142 0.5870
    \(R_{adj}^2\) 0.0714 0.0001 0
  3. Baby weights, Part IV Exercise \[baby_weights_mlr\] considers a model that predicts a newborn’s weight using several predictors (gestation length, parity, age of mother, height of mother, weight of mother, smoking status of mother). The table below shows the adjusted R-squared for the full model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

    Model Adjusted \(R^2\)
    1 Full model 0.2541
    2 No gestation 0.1031
    3 No parity 0.2492
    4 No age 0.2547
    5 No height 0.2311
    6 No weight 0.2536
    7 No smoking status 0.2072

    Which, if any, variable should be removed from the model first?

  4. Baby weights, Part V Exercise \[baby_weights_mlr\] provides regression output for the full model (including all explanatory variables available in the data set) for predicting birth weight of babies. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted \(R^2\) of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

    variable gestation parity age height weight smoke
    p-value \(2.2 \times 10^{-16}\) 0.1052 0.2375 \(2.97 \times 10^{-12}\) \(8.2 \times 10^{-8}\) \(2.2 \times 10^{-16}\)
    \(R_{adj}^2\) 0.1657 0.0013 0.0003 0.0386 0.0229 0.0569
  5. Movie lovers, Part II Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers ( demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted \(R^2\) or the p-value approach in selecting variables for their recommendation system?

4.3 Logistic regression

In this section we introduce logistic regression as a tool for building models when there is a categorical response variable with two levels, e.g. yes and no. Logistic regression is a type of generalized linear model (GLM) for response variables where regular multiple regression does not work very well. In particular, the response variable in these settings often takes a form where residuals look completely different from the normal distribution.

GLMs can be thought of as a two-stage modeling approach. We first model the response variable using a probability distribution, such as the binomial or Poisson distribution. Second, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression. Ultimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.

4.3.1 Resume data

We will consider experiment data from a study that sought to understand the effect of race and sex on job application callback rates.(Bertrand and Mullainathan 2003) To evaluate which factors were important, job postings were identified in Boston and Chicago for the study, and researchers created many fake resumes to send off to these jobs to see which would elicit a callback.102 The researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate the resumes. Finally, they randomly assigned a name to each resume, where the name would imply the applicant’s sex and race.

The data from this study can be found in the openintro package: resume.

The first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to Black or White individuals; other races were not considered in this study. While no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment. You can find the full set of names that did pass the survey test and were ultimately used in the study in Table 4.8. For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted to be associated with a White male.

Table 4.8: List of all 36 unique names along with the commonly inferred race and sex associated with these names.
first_name race sex
Aisha black female
Allison white female
Anne white female
Brad white male
Brendan white male
Brett white male
Carrie white female
Darnell black male
Ebony black female
Emily white female
Geoffrey white male
Greg white male
Hakim black male
Jamal black male
Jay white male
Jermaine black male
Jill white female
Kareem black male
Keisha black female
Kenya black female
Kristen white female
Lakisha black female
Latonya black female
Latoya black female
Laurie white female
Leroy black male
Matthew white male
Meredith white female
Neil white male
Rasheed black male
Sarah white female
Tamika black female
Tanisha black female
Todd white male
Tremayne black male
Tyrone black male

The response variable of interest is whether or not there was a callback from the employer for the applicant, and there were 8 attributes that were randomly assigned that we’ll consider, with special interest in the race and sex variables. Race and sex are in protected classes the United States, meaning they are not legally permitted factors for hiring or employment decisions. The full set of attributes considered is provided in Table 4.9.

Table 4.9: Descriptions for the received_callback variable along with 8 other variables in the resume data set. Many of the variables are indicator variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
variable description
received_callback Specifies whether the employer called the applicant following submission of the application for the job.
job_city City where the job was located: Boston or Chicago.
college_degree An indicator for whether the resume listed a college degree.
years_experience Number of years of experience listed on the resume.
honors Indicator for the resume listing some sort of honors, e.g. employee of the month.
military Indicator for if the resume listed any military experience.
has_email_address Indicator for if the resume listed an email address for the applicant.
race Race of the applicant, implied by their first name listed on the resume.
sex Sex of the applicant (limited to only and in this study), implied by the first name listed on the resume.

All of the attributes listed on each resume were randomly assigned. This means that no attributes that might be favourable or detrimental to employment would favour one demographic over another on these resumes. Importantly, due to the experimental nature of this study, we can infer causation between these variables and the callback rate, if the variable is statistically significant. Our analysis will allow us to compare the practical importance of each of the variables relative to each other.

4.3.2 Modelling the probability of an event

Logistic regression is a generalized linear model where the outcome is a two-level categorical variable. The outcome, \(Y_i\), takes the value 1 (in our application, this represents a callback for the resume) with probability \(p_i\) and the value 0 with probability \(1 - p_i\). Because each observation has a slightly different context, e.g. different education level or a different number of years of experience, the probability \(p_i\) will differ for each observation. Ultimately, it is this probability that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.

Notation for a logistic regression model.

The outcome variable for a GLM is denoted by \(Y_i\), where the index \(i\) is used to represent observation \(i\). In the resume application, \(Y_i\) will be used to represent whether resume \(i\) received a callback (\(Y_i=1\)) or not (\(Y_i=0\)).

The predictor variables are represented as follows: \(x_{1,i}\) is the value of variable 1 for observation \(i\), \(x_{2,i}\) is the value of variable 2 for observation \(i\), and so on.

\[ transformation(p_i) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_k x_{k,i} \]

We want to choose a transformation in the equation that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left hand side of the equation equal to the range of possibilities for the right hand side; if there was no transformation for this equation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range.

A common transformation for \(p_i\) is the logit transformation, which may be written as

\[ logit(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right) \]

The logit transformation is shown in Figure 4.1. Below, we rewrite the equation relating \(Y_i\) to its predictors using the logit transformation of \(p_i\):

\[ \log_{e}\left( \frac{p_i}{1-p_i} \right) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_k x_{k,i} \]

Values of $p_i$ against values of $logit(p_i)$.

Figure 4.1: Values of \(p_i\) against values of \(logit(p_i)\).

In our resume example, there are 8 predictor variables, so \(k = 8\). While the precise choice of a logit function isn’t intuitive, it is based on theory that underpins generalized linear models, which is beyond the scope of this book. Fortunately, once we fit a model using software, it will start to feel like we’re back in the multiple regression context, even if the interpretation of the coefficients is more complex.

We start by fitting a model with a single predictor: honors. This variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month. The following logistic regression model was fit using statistical software:

\[\log_e \left( \frac{p_i}{1-p_i} \right) = -2.4998 + 0.8668 \times\text{\texttt{honors}}\]

  1. If a resume is randomly selected from the study and it does not have any honors listed, what is the probability resulted in a callback?

  2. What would the probability be if the resume did list some honors?

  1. If a randomly chosen resume from those sent out is considered, and it does not list honors, then takes value 0 and the right side of the model equation equals -2.4998. Solving for \(p_i\): \(\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076\). Just as we labelled a fitted value of \(y_i\) with a “hat” in single-variable and multiple regression, we do the same for this probability: \(\hat{p}_i = 0.076{}\).

  2. If the resume had listed some honors, then the right side of the model equation is \(-2.4998{} + 0.8668 \times 1 = -1.6330\), which corresponds to a probability \(\hat{p}_i = 0.163\). Notice that we could examine -2.4998 and -1.6330 in Figure 4.1 to estimate the probability before formally calculating the value.

To convert from values on the logistic regression scale (e.g. -2.4998 and -1.6330), use the following formula,

which is the result of solving for \(p_i\) in the regression model:

\[ p_i = \frac{e^{b_0 + b_1 x_{1,i}+\cdots+b_k x_{k,i}}}{1 + e^{b_0 + b_1 x_{1,i}+\cdots+b_k x_{k,i}}} \]

As with most applied data problems, we substitute the point estimates for the parameters (the \(b_i\)) so that we can make use of this formula. The probabilities can be calculated as follows.

\[ \frac{e^{-2.4998}}{1 + e^{-2.4998}}= 0.076 \]

\[ \frac{e^{-2.4998 + 0.8668}}{1 + e^{-2.4998 + 0.8668}} = 0.163 \]

While knowing whether a resume listed honors provides some signal when predicting whether or not the employer would call, we would like to account for many different variables at once to understand how each of the different resume characteristics affected the chance of a callback.

4.3.3 Logistic model with many variables

We used statistical software to fit the logistic regression model with all 8 predictors described in Table 4.9. Like multiple regression, the result may be presented in a summary table, which is shown in Table 4.10. The structure of this table is almost identical to that of multiple regression; the only notable difference is that the p-values are calculated using the normal distribution rather than the \(t\)-distribution.

Table 4.10: Summary table for the full logistic regression model for the resume callback example.
term estimate std.error statistic p.value
(Intercept) -2.663 0.182 -14.64 <0.0001
job_cityChicago -0.440 0.114 -3.85 <0.0001
college_degree1 -0.067 0.121 -0.55 <0.0001
years_experience 0.020 0.010 1.96 <0.0001
honors1 0.769 0.186 4.14 <0.0001
military1 -0.342 0.216 -1.59 <0.0001
has_email_address1 0.218 0.113 1.93 <0.0001
racewhite 0.442 0.108 4.09 <0.0001
sexman -0.182 0.138 -1.32 <0.0001

Just like multiple regression, we could trim some variables from the model. Here we’ll use a statistic called Akaike information criterion (AIC), which is an analogue to how we used adjusted R-squared in multiple regression, and we look for models with a lower AIC through a backward elimination strategy. After using this criteria, the variable college_degree is eliminated, giving the smaller model summarized in Table 4.11, which is what we’ll rely on for the remainder of this section.

Table 4.11: Summary table for the logistic regression model for the resume callback example, where variable selection has been performed using AIC.
term estimate std.error statistic p.value
(Intercept) -2.7162 0.1551 -17.51 <0.0001
job_cityChicago -0.4364 0.1141 -3.83 <0.0001
years_experience 0.0206 0.0102 2.02 <0.0001
honors1 0.7634 0.1852 4.12 <0.0001
military1 -0.3443 0.2157 -1.60 <0.0001
has_email_address1 0.2221 0.1130 1.97 <0.0001
racewhite 0.4429 0.1080 4.10 <0.0001
sexman -0.1959 0.1352 -1.45 <0.0001

The race variable had taken only two levels: black and white. Based on the model results, was race a meaningful factor for if a prospective employer would call back?

We see that the p-value for this coefficient is very small (very nearly zero), which implies that race played a statistically significant role in whether a candidate received a callback. Additionally, we see that the coefficient shown corresponds to the level of white, and it is positive. This positive coefficient reflects a positive gain in callback rate for resumes where the candidate’s first name implied they were White. The data provide very strong evidence of racism by prospective employers that favours resumes where the first name is typically interpreted to be White.

The coefficient of \(\texttt{race}_{\texttt{white}}\) in the full model in Table 4.10, is nearly identical to the model shown in Table 4.11. The predictors in this experiment were thoughtfully laid out so that the coefficient estimates would typically not be much influenced by which other predictors were in the model, which aligned with the motivation of the study to tease out which effects were important to getting a callback. In most observational data, it’s common for point estimates to change a little, and sometimes a lot, depending on which other variables are included in the model.

Use the model summarized in Table 4.11 to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male.

We can start by writing out the equation using the coefficients from the model, then we can add in the corresponding values of each variable for this individual:

\[ \begin{aligned} &log_e \left(\frac{p}{1 - p}\right) \\ &\quad= - 2.7162 - 0.4364 \times \texttt{job_city}_{\texttt{Chicago}} + 0.0206 \times \texttt{years_experience} \\ &\quad\qquad + 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\ &\quad\qquad + 0.4429 \times \texttt{race}_{\texttt{white}} - 0.1959 \times \texttt{sex}_{\texttt{man}} \\ &\quad= - 2.7162 - 0.4364 \times 1 + 0.0206 \times 14 \\ &\quad\qquad + 0.7634 \times 0 - 0.3443 \times 0 + 0.2221 \times 1 \\ &\quad\qquad + 0.4429 \times 1 - 0.1959 \times 1 \\ &\quad= - 2.3955 \end{aligned} \]

We can now back-solve for \(p\): the chance such an individual will receive a callback is about 8.35%.

Compute the probability of a callback for an individual with a name commonly inferred to be from a Black male but who otherwise has the same characteristics as the one described in the previous example.

We can complete the same steps for an individual with the same characteristics who is Black, where the only difference in the calculation is that the indicator variable \(\texttt{race}_{\texttt{white}}\) will take a value of 0. Doing so yields a probability of 0.0553. Let’s compare the results with those of the previous example..

In practical terms, an individual perceived as White based on their first name would need to apply to \(\frac{1}{0.0835} \approx 12\) jobs on average to receive a callback, while an individual perceived as Black based on their first name would need to apply to \(\frac{1}{0.0553} \approx 18\) jobs on average to receive a callback. That is, applicants who are perceived as Black need to apply to 50% more employers to receive a callback than someone who is perceived as White based on their first name for jobs like those in the study.

What we’ve quantified in this section is alarming and disturbing. However, one aspect that makes this racism so difficult to address is that the experiment, as well-designed as it is, cannot send us much signal about which employers are discriminating. It is only possible to say that discrimination is happening, even if we cannot say which particular callbacks — or non-callbacks — represent discrimination. Finding strong evidence of racism for individual cases is a persistent challenge in enforcing anti-discrimination laws.

4.3.4 Model diagnostics

Logistic regression conditions.

There are two key conditions for fitting a logistic regression model:

  1. Each outcome \(Y_i\) is independent of the other outcomes.
  2. Each predictor \(x_i\) is linearly related to logit\((p_i)\) if all other predictors are held constant.

The first logistic regression model condition — independence of the outcomes — is reasonable for the experiment since characteristics of resumes were randomly assigned to the resumes that were sent out.

The second condition of the logistic regression model is not easily checked without a fairly sizable amount of data. Luckily, we have 4870 resume submissions in the data set! Let’s first visualize these data by plotting the true classification of the resumes against the model’s fitted probabilities, as shown in Figure 4.2.

The predicted probability that each of the 4870 resumes results in a callback. Noise (small, random vertical shifts) have been added to each point so points with nearly identical values aren’t plotted exactly on top of one another.

Figure 4.2: The predicted probability that each of the 4870 resumes results in a callback. Noise (small, random vertical shifts) have been added to each point so points with nearly identical values aren’t plotted exactly on top of one another.

4.3.5 Groups of different sizes

Any form of discrimination is concerning, and this is why we decided it was so important to discuss this topic using data. The resume study also only examined discrimination in a single aspect: whether a prospective employer would call a candidate who submitted their resume. There was a 50% higher barrier for resumes simply when the candidate had a first name that was perceived to be from a Black individual. It’s unlikely that discrimination would stop there.

Let’s consider a sex-imbalanced company that consists of 20% women and 80%103 men, and we’ll suppose that the company is very large, consisting of perhaps 20,000 employees. Suppose when someone goes up for promotion at this company, 5 of their colleagues are randomly chosen to provide feedback on their work.

Now let’s imagine that 10% of the people in the company are prejudiced against the other sex. That is, 10% of men are prejudiced against women, and similarly, 10% of women are prejudiced against men.

Who is discriminated against more at the company, men or women?

Let’s suppose we took 100 men who have gone up for promotion in the past few years. For these men, \(5 \times 100 = 500\) random colleagues will be tapped for their feedback, of which about 20% will be women (100 women). Of these 100 women, 10 are expected to be biased against the man they are reviewing. Then, of the 500 colleagues reviewing them, men will experience discrimination by about 2% of their colleagues when they go up for promotion.

Let’s do a similar calculation for 100 women who have gone up for promotion in the last few years. They will also have 500 random colleagues providing feedback, of which about 400 (80%) will be men. Of these 400 men, about 40 (10%) hold a bias against women. Of the 500 colleagues providing feedback on the promotion packet for these women, 8% of the colleagues hold a bias against the women.

This example highlights something profound: even in a hypothetical setting where each demographic has the same degree of prejudice against the other demographic, the smaller group experiences the negative effects more frequently. Additionally, if we would complete a handful of examples like the one above with different numbers, we’d learn that the greater the imbalance in the population groups, the more the smaller group is disproportionately impacted.104

Of course, there are other considerable real-world omissions from the hypothetical example. For example, studies have found instances where people from an oppressed group also discriminate against others within their own oppressed group. As another example, there are also instances where a majority group can be oppressed, with apartheid in South Africa being one such historic example. Ultimately, discrimination is complex, and there are many factors at play beyond the mathematics property we observed in the previous example.

We close this book on this serious topic, and we hope it inspires you to think about the power of reasoning with data. Whether it is with a formal statistical model or by using critical thinking skills to structure a problem, we hope the ideas you have learned will help you do more and do better in life.

4.3.6 Exercises

  1. Challenger disaster, Part I On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch. Temp gives the temperature in Fahrenheit, Damaged represents the number of damaged O- rings, and Undamaged represents the number of O-rings that were not damaged.

    Shuttle Mission 1 2 3 4 5 6 7 8 9 10 11 12

    Temperature 53 57 58 63 66 67 67 67 68 69 70 70
    Damaged 5 1 1 1 0 0 0 0 0 0 1 0
    Undamaged 1 5 5 5 6 6 6 6 6 6 5 6

    Shuttle Mission 13 14 15 16 17 18 19 20 21 22 23

    Temperature 70 70 72 73 75 75 76 76 78 79 81
    Damaged 1 0 0 0 0 1 0 0 0 0 0
    Undamaged 5 6 6 6 6 5 6 6 6 6 6

    1. Each column of the table above represents a different shuttle mission. Examine these data and describe what you observe with respect to the relationship between temperatures and damaged O-rings.

    2. Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and a logistic regression model was fit to these data. A summary of this model is given below. Describe the key components of this summary table in words.

      Estimate Std. Error z value Pr(\(>\)\(|\)z\(|\))
      (Intercept) 11.6630 3.2963 3.54 0.0004
      Temperature -0.2162 0.0532 -4.07 0.0000
    3. Write out the logistic model using the point estimates of the model parameters.

    4. Based on the model, do you think concerns regarding O-rings are justified? Explain.

  2. Challenger disaster, Part II Exercise \[challenger_disaster_model_select\] introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.

    1. The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as \[\begin{aligned} \log\left( \frac{\hat{p}}{1 - \hat{p}} \right) = 11.6630 - 0.2162\times Temperature\end{aligned}\] where \(\hat{p}\) is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature: \[\begin{aligned} &\hat{p}_{57} = 0.341 && \hat{p}_{59} = 0.251 && \hat{p}_{61} = 0.179 && \hat{p}_{63} = 0.124 \\ &\hat{p}_{65} = 0.084 && \hat{p}_{67} = 0.056 && \hat{p}_{69} = 0.037 && \hat{p}_{71} = 0.024\end{aligned}\]

    2. Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

    3. Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.

  3. Possum classification, Part I The common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum (see Figure \[brushtail_possum\]). We consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. The second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia. We use logistic regression to differentiate between possums in these two regions. The outcome variable, called , takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland. We consider five predictors: (an indicator for a possum being male), , , , and . Each variable is summarized in a histogram. The full logistic regression model and a reduced model after variable selection are summarized in the table.

                   Estimate        SE       Z   Pr($>$$|$Z$|$)      Estimate       SE       Z   Pr($>$$|$Z$|$)
     (Intercept)    39.2349   11.5368    3.40           0.0007       33.5095   9.9053    3.38           0.0007
        sex_male    -1.2376    0.6662   -1.86           0.0632       -1.4207   0.6457   -2.20           0.0278
     head_length    -0.1601    0.1386   -1.16           0.2480                                
     skull_width    -0.2012    0.1327   -1.52           0.1294       -0.2787   0.1226   -2.27           0.0231
    total_length     0.6488    0.1531    4.24           0.0000        0.5687   0.1322    4.30           0.0000
     tail_length    -1.8708    0.3741   -5.00           0.0000       -1.8057   0.3599   -5.02           0.0000

    1. Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?

    2. The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: . The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.

  4. Possum classification, Part II A logistic regression model was proposed for classifying common brushtail possums into their two regions in Exercise \[possum_classification_model_select\]. The outcome variable took value 1 if the possum was from Victoria and 0 otherwise.

                   Estimate       SE       Z   Pr($>$$|$Z$|$)
     (Intercept)    33.5095   9.9053    3.38           0.0007
        sex_male    -1.4207   0.6457   -2.20           0.0278
     skull_width    -0.2787   0.1226   -2.27           0.0231
    total_length     0.5687   0.1322    4.30           0.0000
     tail_length    -1.8057   0.3599   -5.02           0.0000

    1. Write out the form of the model. Also identify which of the variables are positively associated when controlling for other variables.

    2. Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it doesn’t say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model’s computed probability that this possum is from Victoria? How confident are you in the model’s accuracy of this probability calculation?

4.4 Chapter review

4.4.1 Terms

We introduced the following terms in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However you should be able to easily spot them as bolded text.

adjusted R-squared collinearity generalized linear model parsimonious
Akaike information criterion forward selection logistic regression reference level
backward elimination full model multiple regression stepwise selection

4.4.2 Chapter exercises

  1. Logistic regression fact checking Determine which of the following statements are true and false. For each statement that is false, explain why it is false.

    1. Suppose we consider the first two observations based on a logistic regression model, where the first variable in observation 1 takes a value of \(x_1 = 6\) and observation 2 has \(x_1 = 4\). Suppose we realized we made an error for these two observations, and the first observation was actually \(x_1 = 7\) (instead of 6) and the second observation actually had \(x_1 = 5\) (instead of 4). Then the predicted probability from the logistic regression model would increase the same amount for each observation after we correct these variables.

    2. When using a logistic regression model, it is impossible for the model to predict a probability that is negative or a probability that is greater than 1.

    3. Because logistic regression predicts probabilities of outcomes, observations used to build a logistic regression model need not be independent.

    4. When fitting logistic regression, we typically complete model selection using adjusted \(R^2\).

  2. Movie returns, Part II The student from Exercise \[movie_returns_altogether\] analyzed return-on-investment (ROI) for movies based on release year and genre of movies. The plots below show the predicted ROI vs. actual ROI for each of the genres separately. Do these figures support the comment in the FiveThirtyEight.com article that states, “The return-on-investment potential for horror movies is absurd.” Note that the x-axis range varies for each plot.

  3. Multiple regression fact checking Determine which of the following statements are true and false. For each statement that is false, explain why it is false.

    1. If predictors are collinear, then removing one variable will have no influence on the point estimate of another variable’s coefficient.

    2. Suppose a numerical variable \(x\) has a coefficient of \(b_1 = 2.5\) in the multiple regression model. Suppose also that the first observation has \(x_1 = 7.2\), the second observation has a value of \(x_1 = 8.2\), and these two observations have the same values for all other predictors. Then the predicted value of the second observation will be 2.5 higher than the prediction of the first observation based on the multiple regression model.

    3. If a regression model’s first variable has a coefficient of \(b_1 = 5.7\), then if we are able to influence the data so that an observation will have its \(x_1\) be 1 larger than it would otherwise, the value \(y_1\) for this observation would increase by 5.7.

    4. Suppose we fit a multiple regression model based on a data set of 472 observations. We also notice that the distribution of the residuals includes some skew but does not include any particularly extreme outliers. Because the residuals are not nearly normal, we should not use this model and require more advanced methods to model these data.

  4. Spam filtering, Part I Spam filters are built on principles similar to those used in logistic regression. We fit a probability that each message is spam or not spam. We have several email variables for this problem: , , , , , , , , , , and . We won’t describe what each variable means here for the sake of brevity, but each is either a numerical or indicator variable.

    1. For variable selection, we fit the full model, which includes all variables, and then we also fit each model where we’ve dropped exactly one of the variables. In each of these reduced models, the AIC value for the model is reported below. Based on these results, which variable, if any, should we drop as part of model selection? Explain.

      Variable Dropped AIC
      None Dropped 1863.50
    2. Consider the following model selection stage. Here again we’ve computed the AIC for each leave-one-variable-out model. Based on the results, which variable, if any, should we drop as part of model selection? Explain.

      Variable Dropped AIC
      None Dropped 1862.41
  5. Spam filtering, Part II In Exercise \[spam_filtering_model_sel\], we encountered a data set where we applied logistic regression to aid in spam classification for individual emails. In this exercise, we’ve taken a small set of these variables and fit a formal model with the following output:

    Estimate Std. Error z value Pr(\(>\)\(|\)z\(|\))
    (Intercept) -0.8124 0.0870 -9.34 0.0000
    tomultiple -2.6351 0.3036 -8.68 0.0000
    winner 1.6272 0.3185 5.11 0.0000
    format -1.5881 0.1196 -13.28 0.0000
    resubj -3.0467 0.3625 -8.40 0.0000
    1. Write down the model using the coefficients from the model fit.

    2. Suppose we have an observation where \(\var{to\us{}multiple} = 0\), \(\var{winner} = 1\), \(\var{format} = 0\), and \(\var{re\us{}subj} = 0\). What is the predicted probability that this message is spam?

    3. Put yourself in the shoes of a data scientist working on a spam filter. For a given message, how high must the probability a message is spam be before you think it would be reasonable to put it in a spambox (which the user is unlikely to check)? What tradeoffs might you consider? Any ideas about how you might make your spam-filtering system even better from the perspective of someone using your email service?

4.4.3 Interactive R tutorials

Navigate the concepts you’ve learned in this chapter in R using the following self-paced tutorials. All you need is your browser to get started!

You can also access the full list of tutorials supporting this book here.

4.4.4 R labs

Further apply the concepts you’ve learned in this chapter in R with computational labs that walk you through a data analysis case study.