Multiple linear regression (MLR)

Prof. Maria Tackett

Sep 10, 2024

Topics

  • Exploratory data analysis for multiple linear regression

  • Fitting the least squares line

  • Interpreting coefficients for quantitative predictors

  • Prediction

Computing setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Considering multiple variables

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 4
   annual_income debt_to_income verified_income interest_rate
           <dbl>          <dbl> <fct>                   <dbl>
 1         59000         0.558  Not Verified            10.9 
 2         60000         1.31   Not Verified             9.92
 3         75000         1.06   Verified                26.3 
 4         75000         0.574  Not Verified             9.92
 5        254000         0.238  Not Verified             9.43
 6         67000         1.08   Source Verified          9.92
 7         28800         0.0997 Source Verified         17.1 
 8         80000         0.351  Not Verified             6.08
 9         34000         0.698  Not Verified             7.97
10         80000         0.167  Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

  • annual_income: Annual income
  • debt_to_income: Debt-to-income ratio, i.e., a borrower’s total debt divided by their total income
  • verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Outcome: interest_rate: Interest rate for the loan

Outcome: interest_rate

| Min  | Median | Max  | IQR   |
|-----:|-------:|-----:|------:|
| 5.31 | 9.93   | 26.3 | 5.755 |

Predictors

Data manipulation 1: Rescale income

loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000)

Why did we rescale income?
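
One way to explore this question is to fit the same regression with income in dollars and in thousands of dollars. The sketch below is only an illustration: it uses the ten rows printed above (values as displayed, so rounded) rather than the full loan50 data, it uses a simple one-predictor model, and the object names are made up for this example.

```r
# Ten rows printed earlier (rounded as displayed); NOT the full loan50 data
loans <- data.frame(
  annual_income = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000),
  interest_rate = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)
loans$annual_income_th <- loans$annual_income / 1000

# Same model, two different units for income
fit_dollars   <- lm(interest_rate ~ annual_income, data = loans)
fit_thousands <- lm(interest_rate ~ annual_income_th, data = loans)

slope_dollars   <- coef(fit_dollars)[["annual_income"]]       # change per $1
slope_thousands <- coef(fit_thousands)[["annual_income_th"]]  # change per $1,000

# The slope is scaled by 1000; the fitted values do not change
all.equal(slope_thousands, 1000 * slope_dollars)
all.equal(fitted(fit_dollars), fitted(fit_thousands))
```

Rescaling changes the units of the coefficient, not the fit, so the slope per $1,000 is easier to read than a very small slope per $1.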

Outcome vs. predictors

Goal: Use these predictors in a single model to understand variability in interest rate.

Why do we want to use a single model versus 3 separate simple linear regression models?

Multiple linear regression

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

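$$\text{interest\_rate} = \beta_0 + \beta_1\,\text{debt\_to\_income} + \beta_2\,\text{verified\_income} + \beta_3\,\text{annual\_income\_th} + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)$$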

Multiple linear regression

Recall: The simple linear regression model

$$Y = \beta_0 + \beta_1 X + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)$$

The form of the multiple linear regression model is

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)$$


Therefore,

$$E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

Fitting the least squares line

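Similar to simple linear regression, we want to find estimates for $\beta_0, \beta_1, \ldots, \beta_p$ that minimize

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} [y_i - \hat{y}_i]^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})]^2$$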


The calculations can be very tedious, especially if p is large
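
To make the least squares criterion concrete, the sketch below evaluates the sum of squared residuals at the `lm()` estimates and at slightly shifted coefficients. It is an illustration under stated assumptions: it uses only the ten rows printed earlier (rounded as displayed), only the two quantitative predictors, and a hypothetical helper `ssr()` written for this example.

```r
# Ten rows printed earlier (rounded as displayed); NOT the full loan50 data
loans <- data.frame(
  annual_income_th = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000) / 1000,
  debt_to_income   = c(0.558, 1.31, 1.06, 0.574, 0.238, 1.08, 0.0997, 0.351, 0.698, 0.167),
  interest_rate    = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

# Hypothetical helper: sum of squared residuals for candidate coefficients (b0, b1, b2)
ssr <- function(beta, data) {
  y_hat <- beta[1] + beta[2] * data$debt_to_income + beta[3] * data$annual_income_th
  sum((data$interest_rate - y_hat)^2)
}

fit      <- lm(interest_rate ~ debt_to_income + annual_income_th, data = loans)
beta_hat <- unname(coef(fit))

ssr_ols       <- ssr(beta_hat, loans)                   # at the least squares estimates
ssr_intercept <- ssr(beta_hat + c(0.5, 0, 0), loans)    # intercept shifted
ssr_slope     <- ssr(beta_hat + c(0, 0.1, 0), loans)    # one slope shifted

c(ols = ssr_ols, intercept_shifted = ssr_intercept, slope_shifted = ssr_slope)
```

Any move away from the least squares estimates increases the sum of squared residuals, which is the property the estimates are chosen to satisfy.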

Matrix form of multiple linear regression

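Suppose we have n observations, a quantitative response variable, and p > 1 predictors:

$$\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\boldsymbol{\epsilon}}$$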

What are the dimensions of y, X, β, ϵ?
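
One way to check the dimensions in R is `model.matrix()`, which builds the design matrix X from a model formula. The sketch below is an illustration only: it uses the ten rows printed earlier (rounded as displayed) instead of the full loan50 data, so n = 10 here rather than 50.

```r
# Ten rows printed earlier (rounded as displayed); NOT the full loan50 data
loans <- data.frame(
  annual_income_th = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000) / 1000,
  debt_to_income   = c(0.558, 1.31, 1.06, 0.574, 0.238, 1.08, 0.0997, 0.351, 0.698, 0.167),
  verified_income  = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified", "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  ),
  interest_rate    = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

# Design matrix: a column of 1s for the intercept plus one column per model term
X <- model.matrix(~ debt_to_income + verified_income + annual_income_th, data = loans)
y <- loans$interest_rate

dim(X)       # 10 5: n rows, one column per coefficient
length(y)    # 10: one response value per observation
colnames(X)  # note the two indicator columns created for verified_income
```

In general y and ϵ have n rows and one column, X has n rows and p + 1 columns (the extra column is the intercept), and β has p + 1 rows and one column.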

Matrix form of multiple linear regression

As with simple linear regression, we have

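$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Generalizing the derivations from SLR to p > 1, we have

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$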

as before.
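
The closed-form estimator can be checked numerically against `lm()`. The sketch below is an illustration, not the course's own code: it uses only the ten rows printed earlier (rounded as displayed), so the estimates will not match the full-data results on the next slide.

```r
# Ten rows printed earlier (rounded as displayed); NOT the full loan50 data
loans <- data.frame(
  annual_income_th = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000) / 1000,
  debt_to_income   = c(0.558, 1.31, 1.06, 0.574, 0.238, 1.08, 0.0997, 0.351, 0.698, 0.167),
  verified_income  = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified", "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  ),
  interest_rate    = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

X <- model.matrix(~ debt_to_income + verified_income + annual_income_th, data = loans)
y <- loans$interest_rate

# Solve the normal equations (X'X) beta = X'y rather than forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the estimates from lm()
fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th, data = loans)
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

`lm()` itself uses a QR decomposition rather than the normal equations, which is numerically more stable, but both routes give the same least squares estimates here.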

Model fit in R

int_fit <- lm(interest_rate ~ debt_to_income + verified_income  + annual_income_th,
              data = loan50)

tidy(int_fit) |>
  kable(digits = 3)
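| term                           | estimate | std.error | statistic | p.value |
|:-------------------------------|---------:|----------:|----------:|--------:|
| (Intercept)                    |   10.726 |     1.507 |     7.116 |   0.000 |
| debt_to_income                 |    0.671 |     0.676 |     0.993 |   0.326 |
| verified_incomeSource Verified |    2.211 |     1.399 |     1.581 |   0.121 |
| verified_incomeVerified        |    6.880 |     1.801 |     3.820 |   0.000 |
| annual_income_th               |   -0.021 |     0.011 |    -1.804 |   0.078 |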

Model equation

$$\widehat{\text{interest\_rate}} = 10.726 + 0.671 \times \text{debt\_to\_income} + 2.211 \times \text{source\_verified} + 6.880 \times \text{verified} - 0.021 \times \text{annual\_income\_th}$$

Note

We will talk about why there are two terms in the model for verified_income soon!

Interpreting $\hat{\beta}_j$

  • The estimated coefficient $\hat{\beta}_j$ is the expected change in the mean of $Y$ when $X_j$ increases by one unit, holding the values of all other predictor variables constant.
  • Example: The estimated coefficient for debt_to_income is 0.671. This means that for each one-unit increase in a borrower’s debt-to-income ratio, the interest rate on the loan is expected to be higher by 0.671 percentage points, on average, holding annual income and income verification constant.
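
The "holding all else constant" interpretation can be checked with simple arithmetic. The sketch below wraps the rounded estimates from the model equation in a hypothetical helper, `predict_rate()`, written only for this illustration, and compares two borrowers who differ by one unit in debt_to_income and are otherwise identical.

```r
# Hypothetical helper using the ROUNDED estimates from the model equation
predict_rate <- function(debt_to_income, source_verified, verified, annual_income_th) {
  10.726 + 0.671 * debt_to_income + 2.211 * source_verified +
    6.880 * verified - 0.021 * annual_income_th
}

# Two borrowers: same verification status and income, debt_to_income differs by 1
base   <- predict_rate(debt_to_income = 0.558, source_verified = 0,
                       verified = 0, annual_income_th = 59)
bumped <- predict_rate(debt_to_income = 1.558, source_verified = 0,
                       verified = 0, annual_income_th = 59)

# The difference equals the debt_to_income coefficient
bumped - base
```

The difference is 0.671 no matter which values are chosen for the other predictors, as long as they are the same for both borrowers.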

Interpreting $\hat{\beta}_j$

The estimated coefficient for annual_income_th is -0.021. Interpret this coefficient in the context of the data.


Why do we need to include a statement about holding all other predictors constant?

Interpreting $\hat{\beta}_0$

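| term                           | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:-------------------------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept)                    |   10.726 |     1.507 |     7.116 |   0.000 |    7.690 |    13.762 |
| debt_to_income                 |    0.671 |     0.676 |     0.993 |   0.326 |   -0.690 |     2.033 |
| verified_incomeSource Verified |    2.211 |     1.399 |     1.581 |   0.121 |   -0.606 |     5.028 |
| verified_incomeVerified        |    6.880 |     1.801 |     3.820 |   0.000 |    3.253 |    10.508 |
| annual_income_th               |   -0.021 |     0.011 |    -1.804 |   0.078 |   -0.043 |     0.002 |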

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

Prediction

What is the predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000?


10.726 + 0.671 * 0.558 + 2.211 * 0 + 6.880 * 0 - 0.021 * 59
[1] 9.861418

The predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000 is 9.86%.

Prediction in R

Just like with simple linear regression, we can use the predict() function in R to calculate predicted values; its interval argument additionally returns the appropriate confidence or prediction intervals:

new_borrower <- tibble(
  debt_to_income  = 0.558, 
  verified_income = "Not Verified", 
  annual_income_th = 59
)

predict(int_fit, new_borrower)
       1 
9.890888 

Note

The predicted value differs from the hand calculation on the previous slide because the coefficients used there were rounded.
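
Intervals for a prediction can be requested through the `interval` argument of `predict()`. The sketch below is an illustration only: it refits the model on the ten rows printed earlier (rounded as displayed), so its numbers will not match the slides; with the full-data `int_fit` the same `predict()` calls would apply.

```r
# Ten rows printed earlier (rounded as displayed); NOT the full loan50 data
loans <- data.frame(
  annual_income_th = c(59000, 60000, 75000, 75000, 254000, 67000, 28800, 80000, 34000, 80000) / 1000,
  debt_to_income   = c(0.558, 1.31, 1.06, 0.574, 0.238, 1.08, 0.0997, 0.351, 0.698, 0.167),
  verified_income  = factor(
    c("Not Verified", "Not Verified", "Verified", "Not Verified", "Not Verified",
      "Source Verified", "Source Verified", "Not Verified", "Not Verified", "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  ),
  interest_rate    = c(10.9, 9.92, 26.3, 9.92, 9.43, 9.92, 17.1, 6.08, 7.97, 12.6)
)

fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th, data = loans)

new_borrower <- data.frame(
  debt_to_income   = 0.558,
  verified_income  = "Not Verified",
  annual_income_th = 59
)

# Interval for the MEAN interest rate among borrowers with these characteristics
conf_int <- predict(fit, new_borrower, interval = "confidence", level = 0.95)

# Interval for the interest rate of ONE new borrower with these characteristics
pred_int <- predict(fit, new_borrower, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same point prediction in the `fit` column; the prediction interval is wider because it also accounts for the variability of an individual observation around the mean.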

Cautions

  • Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
  • The multiple regression model only shows association, not causality
    • To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study

Recap

  • Showed exploratory data analysis for multiple linear regression

  • Used least squares to fit the regression line

  • Interpreted the coefficients for quantitative predictors

  • Predicted the response for new observations

Next class

  • More on multiple linear regression

    • Categorical predictors

    • Model assessment

    • Geometric interpretation (as time permits)

  • See Sep 12 prepare

🔗 STA 221 - Fall 2024
