STA 221 - Fall 2024 – Multiple linear regression (MLR)

Min	Median	Max	IQR
5.31	9.93	26.3	5.755

Multiple linear regression (MLR)

Based on the analysis goals, we will use a multiple linear regression model of the following form

$\begin{aligned} \text{interest\_rate} = β_{0} &+ β_{1} \, \text{debt\_to\_income} \\ &+ β_{2} \, \text{verified\_income} \\ &+ β_{3} \, \text{annual\_income\_th} \\ &+ ϵ, \quad ϵ \sim N(0, σ_{ϵ}^{2}) \end{aligned}$

Multiple linear regression

Recall: The simple linear regression model

$Y = β_{0} + β_{1} X + ϵ, ϵ \sim N (0, σ_{ϵ}^{2})$

The form of the multiple linear regression model is

$Y = β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p} + ϵ, ϵ \sim N (0, σ_{ϵ}^{2})$

Therefore,

$E (Y | X_{1}, \dots, X_{p}) = β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p}$

Fitting the least squares line

Similar to simple linear regression, we want to find estimates for $β_{0}, β_{1}, \dots, β_{p}$ that minimize

$\sum_{i = 1}^{n} e_{i}^{2} = \sum_{i = 1}^{n} [y_{i} - {\hat{y}}_{i}]^{2} = \sum_{i = 1}^{n} [y_{i} - (β_{0} + β_{1} x_{i 1} + \dots + β_{p} x_{i p})]^{2}$

The calculations can be very tedious, especially if $p$ is large
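
In practice, software carries out this minimization. The sketch below uses simulated data (hypothetical, not the loan data from these slides) to illustrate that the estimates returned by lm() give a smaller sum of squared residuals than nearby coefficient values. The helper `ssr()` and all variable names are assumptions made for illustration.

```r
# Simulated data (hypothetical; not the loan data from the slides)
set.seed(221)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Sum of squared residuals for a candidate coefficient vector b = (b0, b1, b2)
ssr <- function(b) {
  y_hat <- b[1] + b[2] * x1 + b[3] * x2
  sum((y - y_hat)^2)
}

# Least squares estimates from lm()
fit   <- lm(y ~ x1 + x2)
b_hat <- unname(coef(fit))

# The SSR at the least squares estimates is smaller than at nearby values
ssr(b_hat)
ssr(b_hat + c(0.1, 0, 0))
ssr(b_hat + c(0, -0.1, 0.1))

# A general-purpose numerical optimizer lands on approximately the same estimates
opt <- optim(c(0, 0, 0), ssr, method = "BFGS")
opt$par
b_hat
```

The call to optim() is only a check on the idea; lm() does not search numerically but solves the least squares problem directly.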

Matrix form of multiple linear regression

Suppose we have $n$ observations, a quantitative response variable, and $p > 1$ predictors. Then

$\underbrace{\begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} β_{0} \\ β_{1} \\ \vdots \\ β_{p} \end{bmatrix}}_{β} + \underbrace{\begin{bmatrix} ϵ_{1} \\ \vdots \\ ϵ_{n} \end{bmatrix}}_{ϵ}$

What are the dimensions of $y$, $X$, $β$, and $ϵ$?
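
One way to check your answer is to build the pieces in R and ask for their dimensions. This is a small sketch with simulated values; the choices $n = 6$ and $p = 3$, the coefficient values, and all object names are assumptions for illustration only.

```r
# Small simulated example (hypothetical values): n = 6 observations, p = 3 predictors
set.seed(221)
n <- 6
p <- 3
preds <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Design matrix: a column of 1s for the intercept plus one column per predictor
X    <- model.matrix(~ x1 + x2 + x3, data = preds)
beta <- matrix(c(1, 0.5, -0.2, 2), ncol = 1)  # intercept and p slopes
eps  <- matrix(rnorm(n), ncol = 1)            # one error term per observation
y    <- X %*% beta + eps                      # response built from the model

dim(y)     # rows = n,     columns = 1
dim(X)     # rows = n,     columns = p + 1
dim(beta)  # rows = p + 1, columns = 1
dim(eps)   # rows = n,     columns = 1
```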

Matrix form of multiple linear regression

As with simple linear regression, we have

$y = X β + ϵ$

Generalizing the derivations from SLR to $p > 1$, we have

$\hat{β} = (X^{T} X)^{- 1} X^{T} y$

as before.
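
The formula can be checked numerically. The sketch below uses simulated data (hypothetical, not the loan data) and assumes $X^{T} X$ is invertible; it computes $\hat{β}$ from the matrix formula and compares it with the coefficients from lm(). Object names are chosen for illustration.

```r
# Simulated data (hypothetical; not the loan data from the slides)
set.seed(221)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(1, x1, x2)

# beta-hat = (X^T X)^{-1} X^T y, written as in the formula above
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Same estimates from lm()
fit <- lm(y ~ x1 + x2)

# Side-by-side comparison
cbind(matrix_form = drop(beta_hat), lm = coef(fit))
```

Writing out the inverse mirrors the formula, but it is not how lm() computes the estimates internally; solve(crossprod(X), crossprod(X, y)) gives the same result without forming the inverse explicitly.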

Model fit in R

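# Fit the MLR model by least squares
int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loan50)

# Summarize the coefficient estimates: tidy() is from broom, kable() is from knitr
tidy(int_fit) |>
  kable(digits = 3)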

term	estimate	std.error	statistic	p.value
(Intercept)	10.726	1.507	7.116	0.000
debt_to_income	0.671	0.676	0.993	0.326
verified_incomeSource Verified	2.211	1.399	1.581	0.121
verified_incomeVerified	6.880	1.801	3.820	0.000
annual_income_th	-0.021	0.011	-1.804	0.078

Model equation

$\begin{aligned} \widehat{\text{interest\_rate}} = 10.726 &+ 0.671 \times \text{debt\_to\_income} \\ &+ 2.211 \times \text{source\_verified} \\ &+ 6.880 \times \text{verified} \\ &- 0.021 \times \text{annual\_income\_th} \end{aligned}$

Note

We will talk about why there are two terms in the model for verified_income soon!

Interpreting ${\hat{β}}_{j}$

The estimated coefficient ${\hat{β}}_{j}$ is the expected change in the mean of $Y$ when $X_{j}$ increases by one unit, holding the values of all other predictor variables constant.

Example: The estimated coefficient for debt_to_income is 0.671. This means that for each additional point in a borrower’s debt-to-income ratio, the interest rate on the loan is expected to be greater by 0.671 percentage points, holding annual income and income verification constant.
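
To see what "holding constant" means numerically, the sketch below compares two hypothetical borrowers who differ only in debt_to_income. It uses the rounded coefficients from the model output; the helper `predict_rate()` and the borrower values are assumptions made for illustration.

```r
# Rounded coefficient estimates copied from the model output above
b <- c(intercept = 10.726, debt_to_income = 0.671,
       source_verified = 2.211, verified = 6.880, annual_income_th = -0.021)

# Hypothetical helper: predicted interest rate from the rounded model equation
predict_rate <- function(debt_to_income, source_verified, verified, annual_income_th) {
  unname(
    b["intercept"] +
      b["debt_to_income"] * debt_to_income +
      b["source_verified"] * source_verified +
      b["verified"] * verified +
      b["annual_income_th"] * annual_income_th
  )
}

# Two hypothetical borrowers: same verification status and income,
# debt-to-income ratio one unit apart
rate_a <- predict_rate(0.5, source_verified = 1, verified = 0, annual_income_th = 60)
rate_b <- predict_rate(1.5, source_verified = 1, verified = 0, annual_income_th = 60)

# The difference equals the debt_to_income coefficient (up to floating point)
rate_b - rate_a
```

Changing the shared verification status or income shifts both predictions by the same amount, so the difference between them stays at the coefficient value.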

Interpreting ${\hat{β}}_{j}$

The estimated coefficient for annual_income_th is -0.021. Interpret this coefficient in the context of the data.

Why do we need to include a statement about holding all other predictors constant?

Interpreting ${\hat{β}}_{0}$

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	10.726	1.507	7.116	0.000	7.690	13.762
debt_to_income	0.671	0.676	0.993	0.326	-0.690	2.033
verified_incomeSource Verified	2.211	1.399	1.581	0.121	-0.606	5.028
verified_incomeVerified	6.880	1.801	3.820	0.000	3.253	10.508
annual_income_th	-0.021	0.011	-1.804	0.078	-0.043	0.002

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

Prediction

What is the predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000?

10.726 + 0.671 * 0.558 + 2.211 * 0 + 6.880 * 0 - 0.021 * 59

[1] 9.861418

The predicted interest rate for a borrower with a debt-to-income ratio of 0.558, whose income is not verified, and who has an annual income of $59,000 is 9.86%.
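
The same calculation can be written in the matrix notation from earlier, as the product of a row of predictor values and the coefficient vector. This is a sketch using the rounded coefficients; the names `beta_hat` and `x_new` are assumptions for illustration.

```r
# Rounded coefficients, in the order: intercept, debt_to_income,
# source_verified, verified, annual_income_th
beta_hat <- c(10.726, 0.671, 2.211, 6.880, -0.021)

# Predictor values for the new borrower, with a leading 1 for the intercept;
# both verification indicators are 0 because income is not verified
x_new <- c(1, 0.558, 0, 0, 59)

# Predicted interest rate: x_new^T beta_hat
pred <- drop(t(x_new) %*% beta_hat)
pred
```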

Prediction in R

Just like with simple linear regression, we can use the predict() function in R to calculate predicted values (and, if requested, the appropriate intervals):

new_borrower <- tibble(
  debt_to_income  = 0.558, 
  verified_income = "Not Verified", 
  annual_income_th = 59
)

predict(int_fit, new_borrower)

       1 
9.890888

Note

The predicted value differs from the one on the previous slide because the coefficients used there were rounded.
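
To request intervals, predict() accepts an `interval` argument. The sketch below is self-contained, so it fits the same model form to simulated data (hypothetical values, not loan50); the numbers it produces will not match the loan data results. The names `sim`, `sim_fit`, `conf_int`, and `pred_int` are assumptions for illustration.

```r
# Simulated stand-in for the loan data (hypothetical values, not loan50)
set.seed(221)
n <- 50
sim <- data.frame(
  debt_to_income   = runif(n, 0, 3),
  verified_income  = factor(sample(c("Not Verified", "Source Verified", "Verified"),
                                   n, replace = TRUE)),
  annual_income_th = runif(n, 30, 150)
)
sim$interest_rate <- 10 + 0.7 * sim$debt_to_income +
  2 * (sim$verified_income == "Source Verified") +
  6 * (sim$verified_income == "Verified") -
  0.02 * sim$annual_income_th + rnorm(n, sd = 3)

sim_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = sim)

new_borrower <- data.frame(
  debt_to_income   = 0.558,
  verified_income  = "Not Verified",
  annual_income_th = 59
)

# Confidence interval for the mean response at these predictor values
conf_int <- predict(sim_fit, new_borrower, interval = "confidence", level = 0.95)

# Prediction interval for an individual new observation
pred_int <- predict(sim_fit, new_borrower, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return the same fitted value; the prediction interval is wider because it also accounts for the variability of an individual observation around the mean.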

Cautions

Do not extrapolate! Because there are multiple predictor variables, there is the potential to extrapolate in many directions
The multiple regression model only shows association, not causality
- To show causality, you must have a carefully designed experiment or carefully account for confounding variables in an observational study
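
A rough first check for extrapolation is to compare each new value against the observed range of that predictor. The sketch below uses made-up data and a hypothetical helper, `within_range()`; it is a minimal illustration, not a complete diagnostic.

```r
# Made-up observed data for two quantitative predictors (hypothetical values)
obs <- data.frame(
  debt_to_income   = seq(0, 3, length.out = 50),
  annual_income_th = seq(30, 150, length.out = 50)
)

# Three hypothetical new borrowers
new_obs <- data.frame(
  debt_to_income   = c(0.558, 2.5, 4.0),
  annual_income_th = c(59, 400, 100)
)

# Hypothetical helper: TRUE if every predictor value for a new observation
# lies inside the observed range of that predictor
within_range <- function(new, old) {
  checks <- lapply(names(old), function(v) {
    new[[v]] >= min(old[[v]]) & new[[v]] <= max(old[[v]])
  })
  Reduce(`&`, checks)
}

flags <- within_range(new_obs, obs)
flags  # TRUE FALSE FALSE: the 2nd income and the 3rd debt-to-income are out of range
```

This check looks at one variable at a time. A new observation can pass it and still be an unusual combination of predictor values, which is one of the "many directions" in which extrapolation can happen.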

Recap

Showed exploratory data analysis for multiple linear regression
Used least squares to fit the regression line
Interpreted the coefficients for quantitative predictors
Predicted the response for new observations

Next class

More on multiple linear regression
- Categorical predictors
- Model assessment
- Geometric interpretation (as time permits)
See Sep 12 prepare
