AE 02: Multiple linear regression

Peer-to-peer lending

Published

September 12, 2024

Important

Go to the course GitHub organization and locate your ae-02 repo to get started.

Render, commit, and push your responses to GitHub by the end of class to submit your AE.

Packages

library(tidyverse)   
library(tidymodels)   
library(openintro)    
library(knitr)

Data

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

We will focus on the following variables:

annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
interest_rate: Interest rate for the loan

The goal of this analysis is to use the annual income, debt-to-income ratio, and income verification to understand variability in the interest rate on the loan.

We’ll start with data prep to rescale annual income to $1000s and recode verified_income to fix an issue with the underlying data.

loan50 <- loan50 |>
   mutate(annual_income_th = annual_income / 1000, 
          verified_income = 
            case_when(verified_income == "Not Verified" ~ "Not Verified",
                      verified_income == "Source Verified" ~ "Source Verified",
                      verified_income == "Verified" ~ "Verified"),
          verified_income = as_factor(verified_income)
   )

glimpse(loan50)

Rows: 50
Columns: 19
$ state                   <fct> NJ, CA, SC, CA, OH, IN, NY, MO, FL, FL, MD, HI…
$ emp_length              <dbl> 3, 10, NA, 0, 4, 6, 2, 10, 6, 3, 8, 10, 10, 2,…
$ term                    <dbl> 60, 36, 36, 36, 60, 36, 36, 36, 60, 60, 36, 36…
$ homeownership           <fct> rent, rent, mortgage, rent, mortgage, mortgage…
$ annual_income           <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ verified_income         <fct> Not Verified, Not Verified, Verified, Not Veri…
$ debt_to_income          <dbl> 0.55752542, 1.30568333, 1.05628000, 0.57434667…
$ total_credit_limit      <int> 95131, 51929, 301373, 59890, 422619, 349825, 1…
$ total_credit_utilized   <int> 32894, 78341, 79221, 43076, 60490, 72162, 2872…
$ num_cc_carrying_balance <int> 8, 2, 14, 10, 2, 4, 1, 3, 10, 4, 3, 4, 3, 2, 3…
$ loan_purpose            <fct> debt_consolidation, credit_card, debt_consolid…
$ loan_amount             <int> 22000, 6000, 25000, 6000, 25000, 6400, 3000, 1…
$ grade                   <fct> B, B, E, B, B, B, D, A, A, C, D, A, A, A, A, E…
$ interest_rate           <dbl> 10.90, 9.92, 26.30, 9.92, 9.43, 9.92, 17.09, 6…
$ public_record_bankrupt  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ loan_status             <fct> Current, Current, Current, Current, Current, C…
$ has_second_income       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ total_income            <dbl> 59000, 60000, 75000, 75000, 254000, 67000, 288…
$ annual_income_th        <dbl> 59.0, 60.0, 75.0, 75.0, 254.0, 67.0, 28.8, 80.…

Categorical predictors

Exercise 1

Let’s take a look at the design matrix for the model with predictors debt_to_income, annual_income_th, and verified_income.

How does R choose the baseline level by default?
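As a warm-up before working with loan50, here is a minimal sketch of how to inspect a design matrix with model.matrix() from base R. The toy data frame and its column names (y, x, group) are hypothetical and exist only for illustration. Comparing the factor levels with the column names of the matrix is one way to see which level R treats as the baseline.

```r
# Hypothetical toy data, not the loan50 data
toy <- data.frame(
  y = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9),
  x = c(10, 20, 15, 30, 25, 18),
  group = factor(c("b", "a", "c", "a", "b", "c"))
)

# Order in which R stores the levels of the factor
levels(toy$group)

# Design matrix for a model with one numeric and one categorical predictor
X <- model.matrix(y ~ x + group, data = toy)

# One column per term; compare these names with the factor levels above
colnames(X)
head(X)
```

The same call should work with the loan50 formula and data in place of the toy inputs.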

# add code here

[Add response here]

Exercise 2

Fit the model with the predictors debt_to_income, annual_income_th, verified_income, and the interaction between annual_income_th and verified_income.

Neatly display the model results using 3 digits.
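The formula syntax and display step can be sketched on a different data set. The example below uses mtcars, which ships with R, and the object names cars and toy_fit are made up for illustration. It assumes the broom and knitr packages are installed (broom is loaded as part of tidymodels).

```r
library(broom)
library(knitr)

# mtcars ships with R; cyl is converted to a factor so it acts as a
# categorical predictor
cars <- transform(mtcars, cyl = factor(cyl))

# wt:cyl adds the interaction between a numeric and a categorical predictor
toy_fit <- lm(mpg ~ hp + wt + cyl + wt:cyl, data = cars)

# tidy() returns one row per model term; kable() formats the table and
# digits = 3 rounds the numeric columns to 3 decimal places
tidy(toy_fit) |>
  kable(digits = 3)
```

The same pattern should carry over to the loan50 model by swapping in its response, predictors, and data.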

# add code here

Exercise 3

Write the estimated regression equation for people with Not Verified income.
Write the estimated regression equation for people with Verified income.

[add response here]

Exercise 4

In general, how do

indicators for categorical predictors impact the model equation?
interaction terms impact the model equation?

[Add response here]

Model assessment

Exercise 5

Let’s compare the original model without interaction effects to the model you fit in Exercise 2.

Calculate $R^2$ and $Adj. R^2$ for each model. You can find $Adj. R^2$ from the glance function:

glance(model_name)$adj.r.squared
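As a rough illustration of this pattern, the sketch below compares two nested models fit to mtcars rather than loan50. The object names small_fit and large_fit are hypothetical. glance() comes from the broom package and returns a one-row summary that includes both r.squared and adj.r.squared.

```r
library(broom)

# Two nested models on mtcars: the larger one adds a single predictor
small_fit <- lm(mpg ~ wt, data = mtcars)
large_fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared for each model
glance(small_fit)$r.squared
glance(large_fit)$r.squared

# Adjusted R-squared for each model
glance(small_fit)$adj.r.squared
glance(large_fit)$adj.r.squared
```

Looking at how each statistic changes when a predictor is added is a useful check before answering the exercises that follow.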

# model without the interaction term (main effects only)
int_fit <- lm(interest_rate ~ debt_to_income + annual_income_th +
                verified_income, data = loan50)

# add code here

Exercise 6

Which model would you choose based on

$R^2$?
$Adj. R^2$?

[add response here]

LaTeX

Sometimes, you will need to include mathematical notation in your document. There are two ways you can display mathematics in your document:

Inline: Your mathematics will display within the line of text.

Use $ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Inline Math.
Example: The text The simple linear regression model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ produces

The simple linear regression model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$

Displayed: Your mathematics will display outside the line of text

Use $$ to start and end your LaTeX syntax. You can also use the menu: Insert -> LaTeX Math -> Display Math.
Example: The text The estimated regression equation is $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$ produces

The estimated regression equation is

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \]
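The same displayed-math syntax can be used to write out an estimated regression equation term by term. The sketch below uses generic placeholder symbols ($x_1$, $x_2$, and the $\hat{\beta}$ coefficients), not estimates from any fitted model. Replace the symbols with your own variable names and the numeric estimates from your model output.

```latex
$$\widehat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$
```

Here \hat{} places a hat over a single symbol, \widehat{} stretches over a longer expression, and the underscore creates a subscript.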

Tip

Click here for a quick reference of LaTeX code.

Submission

Important

To submit the AE:

Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉