Categorical predictors + Assessment
Sep 12, 2024
Lab 01 due TODAY at 11:59pm
Push work to GitHub repo
Submit final PDF on Gradescope + mark pages for each question
HW 01 due Thursday, September 19 at 11:59pm
Team labs start on Monday
Homework will generally be split into two sections:
1️⃣ Conceptual exercises
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.
2️⃣ Applied exercises
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
Categorical predictors and interaction terms
Assess model fit using RMSE and \(R^2\)
Compare models using \(Adj. R^2\)
Introduce LaTeX
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
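The setup code is not shown on the slide; the sketch below is one way to reproduce the data frame printed next, assuming the usual course packages and that annual_income_th is annual_income rescaled to thousands of dollars.

library(tidyverse)
library(openintro)   # loan50 data
library(broom)       # tidy(), glance()
library(knitr)       # kable()

# Assumed preprocessing: rescale annual income to $1000s and keep the
# four variables used today
loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000) |>
  select(annual_income_th, debt_to_income, verified_income, interest_rate)

loan50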
# A tibble: 50 × 4
annual_income_th debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59 0.558 Not Verified 10.9
2 60 1.31 Not Verified 9.92
3 75 1.06 Verified 26.3
4 75 0.574 Not Verified 9.92
5 254 0.238 Not Verified 9.43
6 67 1.08 Source Verified 9.92
7 28.8 0.0997 Source Verified 17.1
8 80 0.351 Not Verified 6.08
9 34 0.698 Not Verified 7.97
10 80 0.167 Source Verified 12.6
# ℹ 40 more rows
Predictors:
annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e., the borrower's total debt divided by their total income
verified_income: Whether the borrower's income source and amount have been verified (Not Verified, Source Verified, Verified)
Response:
interest_rate: Interest rate for the loan
Goal: Use these predictors in a single model to understand variability in interest rate.
int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loan50)

tidy(int_fit) |>
  kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 |
\[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_ {\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 &x_{11} & \dots & x_{1p}\\ \vdots & \vdots &\ddots & \vdots \\ 1 & x_{n1} & \dots &x_{np} \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_\boldsymbol{\epsilon} \]
Suppose we want to predict the amount of sleep a Duke student gets based on whether they are in Pratt (Pratt Yes/ No are the only two options). Consider the model
\[ Sleep_i = \beta_0 + \beta_1\mathbf{1}(Pratt_i = \texttt{Yes}) + \beta_2\mathbf{1}(Pratt_i = \texttt{No}) \]
Write out the design matrix for this hypothesized linear model.
Demonstrate that the design matrix is not of full column rank (that is, write one of the columns as a linear combination of the others).
Use this intuition to explain why when we include categorical predictors, we cannot include both indicators for every level of the variable and an intercept.
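As a quick numeric check (an aside, not part of the exercise solution), you can build a small design matrix in R with an intercept plus indicators for both levels of a two-level variable and inspect its rank; the toy vector below is made up for illustration.

pratt <- factor(c("Yes", "No", "Yes", "No", "No"))   # hypothetical toy data

# Design matrix with an intercept AND an indicator for every level
X <- cbind(intercept = 1,
           pratt_yes = as.numeric(pratt == "Yes"),
           pratt_no  = as.numeric(pratt == "No"))

qr(X)$rank   # 2, not 3: intercept = pratt_yes + pratt_no, so the columns are dependent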
Suppose there is a categorical variable with \(k\) levels
We can make \(k\) indicator variables from the data - one indicator for each level
An indicator (dummy) variable takes values 1 or 0
1 if the observation belongs to that level
0 if the observation does not belong to that level
verified_income
# A tibble: 3 × 4
verified_income not_verified source_verified verified
<fct> <dbl> <dbl> <dbl>
1 Not Verified 1 0 0
2 Verified 0 0 1
3 Source Verified 0 1 0
# A tibble: 3 × 3
verified_income source_verified verified
<fct> <dbl> <dbl>
1 Not Verified 0 0
2 Verified 0 1
3 Source Verified 1 0
Take a look at the design matrix in AE 02
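If you want to see the design matrix R builds for int_fit (a quick sketch), model.matrix() shows one indicator column for each non-baseline level of verified_income:

# First few rows of the design matrix R constructs for int_fit
model.matrix(int_fit) |>
  head()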
verified_income
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
The baseline level of verified_income is Not Verified.

What is the expected interest rate for someone whose income is Verified, who has a debt-to-income ratio of 0 and an annual income of $0?
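One way to check your answer in R (a sketch using int_fit from above):

# Predicted interest rate for Verified income, debt_to_income = 0,
# and annual income of $0
predict(int_fit,
        newdata = data.frame(debt_to_income = 0,
                             verified_income = "Verified",
                             annual_income_th = 0))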
The fitted lines (interest rate versus annual income, by income verification) are not parallel, indicating a potential interaction effect: the slope of annual income differs based on the income verification status.
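The output below can be reproduced with a model that interacts verified_income and annual_income_th; this is a sketch, and the object name int_x_fit is illustrative.

# Interaction model: the slope of annual_income_th can vary by verified_income
int_x_fit <- lm(interest_rate ~ debt_to_income + verified_income * annual_income_th,
                data = loan50)

tidy(int_x_fit) |>
  kable(digits = 3)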
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.560 | 2.034 | 4.700 | 0.000 |
debt_to_income | 0.691 | 0.685 | 1.009 | 0.319 |
verified_incomeSource Verified | 3.577 | 2.539 | 1.409 | 0.166 |
verified_incomeVerified | 9.923 | 3.654 | 2.716 | 0.009 |
annual_income_th | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified:annual_income_th | -0.016 | 0.026 | -0.643 | 0.523 |
verified_incomeVerified:annual_income_th | -0.032 | 0.033 | -0.979 | 0.333 |
annual_income_th for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023% (-0.007 + -0.016 = -0.023) for each additional thousand dollars in annual income, holding all else constant.

Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model
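A minimal sketch of computing both quantities for int_fit (glance() is from broom; the manual RMSE line uses the model residuals):

# R^2 and adjusted R^2 reported by broom
glance(int_fit) |>
  select(r.squared, adj.r.squared)

# RMSE: square root of the average squared residual
sqrt(mean(residuals(int_fit)^2))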
When comparing models, do we prefer the model with the lower or higher RMSE?
Though we use \(R^2\) to assess the model fit, it is generally unreliable for comparing models with different numbers of predictors. Why?
\(R^2\) will stay the same or increase as we add more variables to the model. Let’s show why this is true.
If we only use \(R^2\) to choose the best-fit model, we will be prone to choosing the model with the most predictor variables (see the comparison sketch after the adjusted \(R^2\) formula below).
\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]
\[R^2_{adj} = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]
where
\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model
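To see the penalty in action, here is a hedged sketch comparing the additive fit int_fit with the interaction fit int_x_fit from the earlier sketch:

# Adding the interaction terms raises R^2, but adjusted R^2 penalizes the
# extra terms if they do not explain enough additional variability
bind_rows(
  glance(int_fit)   |> select(r.squared, adj.r.squared),
  glance(int_x_fit) |> select(r.squared, adj.r.squared)
) |>
  mutate(model = c("additive", "interaction"), .before = 1) |>
  kable(digits = 3)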
For this class you will need to be able to…
Properly write mathematical symbols, e.g., \(\beta_1\) not B1, \(R^2\) not R2
Write basic regression equations, e.g., \(\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2\)
Write matrix equations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
Write hypotheses (we’ll start this next week), e.g., \(H_0: \beta = 0\)
You are welcome, but not required, to write math proofs using LaTeX.
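As a small illustration (not from the slides), this is what you would type in your Quarto document: single dollar signs produce inline math and double dollar signs produce display math.

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

$$
H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0
$$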
Interpreted categorical predictors and interaction terms
Assessed model fit using RMSE and \(R^2\)
Compared models using \(Adj. R^2\)
Introduced LaTeX
Geometric interpretation
Inference for regression
See Sep 17 prepare