AE 02: Multiple linear regression
Peer-to-peer lending
Go to the course GitHub organization and locate your ae-02 repo to get started.
Render, commit, and push your responses to GitHub by the end of class to submit your AE.
Packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(knitr)
Data
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
We will focus on the following variables:
- annual_income_th: Annual income (in $1000s)
- debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
- interest_rate: Interest rate for the loan
The goal of this analysis is to use the annual income, debt-to-income ratio, and income verification to understand variability in the interest rate on the loan.
We’ll start with data prep to rescale annual income to $1000s and recode verified_income to fix an issue with the underlying data.
loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000,
         verified_income =
           case_when(verified_income == "Not Verified" ~ "Not Verified",
                     verified_income == "Source Verified" ~ "Source Verified",
                     verified_income == "Verified" ~ "Verified"),
         verified_income = as_factor(verified_income)
  )
glimpse(loan50)
Rows: 50
Columns: 19
$ state <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI…
$ emp_length <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
$ term <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
$ homeownership <fct> rent, rent, mortgage, rent, mortgage, mortgage…
$ annual_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ verified_income <fct> Not Verified, Not Verified, Verified, Not Veri…
$ debt_to_income <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
$ total_credit_limit <int> 95131, 51929, 301373, 59890, 422619, 349825, 1…
$ total_credit_utilized <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
$ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
$ loan_purpose <fct> debt_consolidation, credit_card, debt_consolid…
$ loan_amount <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
$ grade <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E…
$ interest_rate <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
$ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ loan_status <fct> Current, Current, Current, Current, Current, C…
$ has_second_income <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ total_income <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ annual_income_th <dbl> 59.0, 60.0, 75.0, 75.0, 254.0, 67.0, 28.8, 80.…
Categorical predictors
Let’s take a look at the design matrix for the model with predictors debt_to_income, annual_income_th, and verified_income.
How does R choose the baseline level by default?
## add code here
[Add response here]
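As a rough illustration of what a design matrix looks like, here is a minimal sketch on a small made-up data frame. The name toy_loans and all of its values are hypothetical and are not taken from loan50. It assumes R's default treatment contrasts, under which the first level of a factor serves as the baseline.

```r
# Hypothetical toy data (NOT the loan50 data), used only to show the coding
toy_loans <- data.frame(
  interest_rate = c(7.5, 10.2, 12.8, 9.1, 15.3, 11.0),
  debt_to_income = c(0.4, 1.1, 0.9, 0.6, 1.5, 0.8),
  verified_income = factor(
    c("Not Verified", "Verified", "Source Verified",
      "Not Verified", "Verified", "Source Verified"),
    levels = c("Not Verified", "Source Verified", "Verified")
  )
)

# The first factor level is the baseline under the default treatment contrasts
levels(toy_loans$verified_income)

# Design matrix: an intercept column, the numeric predictor, and one
# indicator column for each non-baseline level of the factor
design <- model.matrix(~ debt_to_income + verified_income, data = toy_loans)
colnames(design)
head(design)
```

The baseline level gets no column of its own: its rows have zeros in every indicator column. Note that the order of the levels depends on how the factor was created, so it is worth checking levels() on your own data.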
Fit the model with the predictors debt_to_income, annual_income_th, verified_income, and the interaction between annual_income_th and verified_income.
Neatly display the model results using 3 digits.
# add code here
Write the estimated regression equation for people with Not Verified income.
Write the estimated regression equation for people with Verified income.
[add response here]
In general, how do
- indicators for categorical predictors impact the model equation?
- interaction terms impact the model equation?
[Add response here]
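To see these two effects in isolation, here is a small sketch on simulated, noise-free data. The variables x, g, and y and the two group equations are made up for illustration and have nothing to do with loan50. Because there is no noise, lm() recovers the generating coefficients exactly (up to floating-point error).

```r
# Simulated, noise-free toy data (hypothetical; not loan50)
x <- rep(c(20, 40, 60, 80), times = 2)
g <- factor(rep(c("A", "B"), each = 4))

# Group A (baseline): y = 10 + 0.5 * x
# Group B:            y = 12 + 0.2 * x
y <- ifelse(g == "A", 10 + 0.5 * x, 12 + 0.2 * x)
toy <- data.frame(x, g, y)

# x * g expands to x + g + x:g (main effects plus interaction)
fit <- lm(y ~ x * g, data = toy)
b <- coef(fit)
round(b, 3)

# Baseline group A uses the intercept and the x coefficient directly.
# For group B, the indicator coefficient shifts the intercept and the
# interaction coefficient shifts the slope.
c(intercept_B = unname(b["(Intercept)"] + b["gB"]),
  slope_B     = unname(b["x"] + b["x:gB"]))
```

In this toy example the gB coefficient is the difference in intercepts (12 - 10) and the x:gB coefficient is the difference in slopes (0.2 - 0.5), which is why adding them to the baseline terms reproduces the group B line.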
Model assessment
Let’s compare the original model without interaction effects to the model you fit in Exercise 2.
Calculate \(R^2\) and \(Adj. R^2\) for each model. You can find \(Adj. R^2\) from the glance function:
glance(model_name)$adj.r.squared
int_fit <- lm(interest_rate ~ debt_to_income + annual_income_th +
                verified_income, data = loan50)
# add code here
Which model would you choose based on
- \(R^2\)?
- \(Adj. R^2\)?
[add response here]
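As a sketch of how the two measures behave for nested models, the example below uses the built-in mtcars data as a stand-in (it is not part of this AE, and the model names small_fit and big_fit are made up). It reads both values from base R's summary(); glance() should report the same quantities in its r.squared and adj.r.squared columns.

```r
# mtcars ships with base R; it stands in for loan50 here
small_fit <- lm(mpg ~ wt, data = mtcars)
big_fit   <- lm(mpg ~ wt + qsec, data = mtcars)

r2     <- c(small = summary(small_fit)$r.squared,
            big   = summary(big_fit)$r.squared)
adj_r2 <- c(small = summary(small_fit)$adj.r.squared,
            big   = summary(big_fit)$adj.r.squared)

# Adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors
n <- nrow(mtcars)
p <- 2  # predictors in big_fit: wt and qsec
adj_by_hand <- 1 - (1 - r2[["big"]]) * (n - 1) / (n - p - 1)

round(rbind(r2, adj_r2), 3)
```

\(R^2\) can never decrease when a predictor is added to a model fit on the same data, while \(Adj. R^2\) includes a penalty for the number of predictors, so the two measures can favor different models.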
LaTeX
Sometimes, you will need to include mathematical notation in your document. There are two ways you can display mathematics in your document:
Inline: Your mathematics will display within the line of text.
Use $ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Inline Math.
Example: The text
The simple linear regression model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
produces
The simple linear regression model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
Displayed: Your mathematics will display outside the line of text.
Use $$ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Display Math.
Example: The text
The estimated regression equation is $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$
produces
The estimated regression equation is
\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \]
Click here for a quick reference of LaTeX code.
Submission
To submit the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your AE repo on GitHub. You’re done! 🎉