library(tidyverse)
library(knitr)
library(tidymodels)
library(pROC)
library(Stat2Data)
AE 09: Exam 02 review
Go to the course GitHub organization and locate your ae-09 repo to get started.
Render, commit, and push your responses to GitHub by the end of class to submit your AE.
Exercise 1
Suppose you fit a simple linear regression model.
Draw a scatterplot that contains an observation with large leverage but low Cook’s distance.
Draw a scatterplot that contains an observation with large leverage and high Cook’s distance.
Draw a scatterplot that contains an observation with a large studentized residual.
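One way to check a hand-drawn sketch is to compute the three diagnostics in R. The following is a minimal sketch on simulated data; the data, the seed, and the helper name diagnose_added_point are assumptions for illustration and are not part of the AE. It appends a single point to a simulated data set in three different positions and reports that point's leverage, Cook's distance, and studentized residual.

```r
# Simulated data for illustration only; none of these values come from the AE.
set.seed(210)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 2)
base_fit <- lm(y ~ x)

# Hypothetical helper: append one point, refit, and return that point's diagnostics
diagnose_added_point <- function(x_new, y_new) {
  dat <- data.frame(x = c(x, x_new), y = c(y, y_new))
  fit <- lm(y ~ x, data = dat)
  data.frame(
    leverage   = unname(hatvalues(fit)[n + 1]),
    cooks_d    = unname(cooks.distance(fit)[n + 1]),
    stud_resid = unname(rstudent(fit)[n + 1])
  )
}

# Value on the line fitted to the original 30 points, at an extreme x
on_line <- unname(predict(base_fit, newdata = data.frame(x = 30)))

case_a <- diagnose_added_point(30, on_line)        # far in x, on the fitted line
case_b <- diagnose_added_point(30, 20)             # far in x, far off the line
case_c <- diagnose_added_point(5, 2 + 3 * 5 + 15)  # typical x, far off the line

rbind(case_a, case_b, case_c)
```

In this setup, case_a and case_b share the same leverage because leverage depends only on x, while Cook's distance also depends on how far the response falls from the fitted line.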
Exercise 2
Describe what it means for \(\tilde{\boldsymbol{\beta}}\) to be the maximum likelihood estimator.
What are properties of MLEs?
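As a concrete illustration of what "maximizing the likelihood" means, the sketch below simulates a binary response, writes the logistic regression log-likelihood by hand, and maximizes it numerically with optim(). The simulated data, the true coefficients, and the function name neg_log_lik are assumptions for illustration. The numerical maximizer should land close to the estimates reported by glm().

```r
# Simulated data for illustration only
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood for logistic regression with one predictor:
# log L(beta) = sum( y_i * eta_i - log(1 + exp(eta_i)) ), eta_i = beta_0 + beta_1 * x_i
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Minimizing the negative log-likelihood is the same as maximizing the likelihood
opt <- optim(c(0, 0), neg_log_lik, method = "BFGS")
glm_fit <- glm(y ~ x, family = binomial)

cbind(optim = opt$par, glm = unname(coef(glm_fit)))
```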
Use this data analysis scenario for Exercises 3 - 6.
The data for this analysis is about credit card customers. It can be found in the file credit.csv. The following variables are in the data set:
- income: Income in $1,000's
- limit: Credit limit
- rating: Credit rating
- cards: Number of credit cards
- age: Age in years
- education: Number of years of education
- own: Whether an individual owns their home (No or Yes)
- student: Whether the individual was a student (No or Yes)
- married: Whether the individual was married (No or Yes)
- region: Region the individual is from (South, East, and West)
- balance: Average credit card balance in $
credit <- read_csv("data/credit.csv") |>
  mutate(maxed = factor(if_else(balance == 0, 1, 0)))
The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had $0 average card balance.
We’ll start with a model predicting the odds of maxed = 1 using income, rating, and region.
credit_fit <- glm(maxed ~ income + rating + region, data = credit,
                  family = "binomial")
tidy(credit_fit) |>
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9.898 | 1.449 | 6.829 | 0.000 |
| income | 0.113 | 0.021 | 5.273 | 0.000 |
| rating | -0.057 | 0.008 | -7.397 | 0.000 |
| regionSouth | -0.595 | 0.604 | -0.985 | 0.324 |
| regionWest | -0.082 | 0.649 | -0.126 | 0.900 |
Exercise 3
The logistic regression model takes the following form:
\[ \log(\frac{\pi_i}{1 - \pi_i}) = \beta_0 + \beta_1 ~ income + \beta_2 ~ rating + \beta_3 ~ regionSouth + \beta_4 ~ regionWest \]
Use the equation above to show the expected change in the odds of maxing out a credit card when the credit rating increases by 10 points. Assume income and region are held constant. Write your answer in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).
Suppose there are two individuals. Individual 1 has an income of $64,000, a credit rating of 590, and is from the South region. Individual 2 has an income of $135,000, a credit rating of 695, and is from the East region. Use the equation above to show how the odds of maxing out a credit card differ between Individual 1 and Individual 2. Write your answer in terms of \(\beta_0, \beta_1, \beta_2\), etc.
Use R to compute the odds for each individual. How do the odds compare? Is this consistent with your response to part (b)?
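The arithmetic behind this exercise can be sketched with the rounded estimates from the coefficient table. This is only a sketch under stated assumptions: the object names are hypothetical, the coefficients are copied at three decimal places (so results will differ somewhat from those computed on the fitted credit_fit object), East is treated as the baseline region because only regionSouth and regionWest appear in the table, and income is entered in $1,000's to match the variable description.

```r
# Rounded estimates copied from the coefficient table (assumption: 3 decimals is enough
# for illustration; use the fitted model object for actual answers)
beta <- c(intercept = 9.898, income = 0.113, rating = -0.057,
          regionSouth = -0.595, regionWest = -0.082)

# Multiplicative change in the odds for a 10-point increase in rating,
# holding income and region constant: exp(10 * beta_2)
or_rating_10 <- exp(10 * beta[["rating"]])

# Predictor values in the order of beta: 1, income ($1,000's), rating, South, West
ind1 <- c(1, 64, 590, 1, 0)   # Individual 1: South
ind2 <- c(1, 135, 695, 0, 0)  # Individual 2: East (baseline)

log_odds1 <- sum(beta * ind1)
log_odds2 <- sum(beta * ind2)
odds1 <- exp(log_odds1)
odds2 <- exp(log_odds2)

# Odds of Individual 1 relative to Individual 2
odds_ratio_12 <- odds1 / odds2

# Same ratio from differences in predictors; the intercept cancels
odds_ratio_check <- exp(sum(beta * (ind1 - ind2)))
```

With the fitted object available, predict(credit_fit, newdata = ..., type = "link") returns the log-odds from the unrounded estimates, which is the more reliable route for the final answer.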
Exercise 4
We consider adding the interaction between region and income to the current model. We’ll use a drop-in-deviance test to determine whether or not to add the interaction term.
- State the null and alternative hypotheses in words and using mathematical notation.
- Describe what the test statistic \(G\) means in the context of the data.
- Show why the degrees of freedom for the test statistic are equal to 2.
- Conduct the drop-in-deviance test and state your conclusion in the context of the data.
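The mechanics of the test can be sketched on simulated data. Everything about the data frame sim below (its values, the seed, and the coefficients used to generate it) is an assumption for illustration; only the variable names mirror the credit data. The sketch computes the statistic as the difference in residual deviances between the reduced and full models, takes the degrees of freedom as the number of added coefficients, and compares the result with anova().

```r
# Simulated stand-in for the credit data (illustration only)
set.seed(2)
n <- 400
sim <- data.frame(
  income = runif(n, 10, 150),
  rating = runif(n, 100, 900),
  region = factor(sample(c("East", "South", "West"), n, replace = TRUE))
)
sim$maxed <- rbinom(n, 1, plogis(2 + 0.02 * sim$income - 0.01 * sim$rating))

reduced <- glm(maxed ~ income + rating + region, data = sim, family = binomial)
full <- glm(maxed ~ income + rating + region + income:region,
            data = sim, family = binomial)

# Test statistic: drop in residual deviance from reduced to full model
G <- deviance(reduced) - deviance(full)

# Degrees of freedom: number of coefficients added by the interaction
# (income:regionSouth and income:regionWest)
df_test <- length(coef(full)) - length(coef(reduced))

p_value <- pchisq(G, df = df_test, lower.tail = FALSE)

# The same test in one call
dev_table <- anova(reduced, full, test = "Chisq")
dev_table
```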
Exercise 5
Use the model credit_fit that includes the main effects for income, rating, and region.
- Compute the predicted probabilities, then use those to predict whether individuals maxed out their credit card using a threshold of \(\hat{\pi} = 0.5\).
- What is the accuracy? What does it mean in the context of the data?
- What is the sensitivity? What does it mean in the context of the data?
- What is the false positive rate? How is it computed given the specificity? What does it mean in the context of the data?
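The quantities in this exercise can be computed directly from the four cells of the confusion matrix. The sketch below uses simulated data and a simplified model (the data frame sim, the seed, and the object names are assumptions for illustration), so the numbers it produces say nothing about the credit data; it only shows how the pieces fit together.

```r
# Simulated stand-in for the credit data (illustration only)
set.seed(3)
n <- 400
sim <- data.frame(income = runif(n, 10, 150), rating = runif(n, 100, 900))
sim$maxed <- rbinom(n, 1, plogis(2 + 0.02 * sim$income - 0.01 * sim$rating))
fit <- glm(maxed ~ income + rating, data = sim, family = binomial)

# Predicted probabilities, then predicted classes at the 0.5 threshold
pi_hat <- predict(fit, type = "response")
pred <- as.integer(pi_hat > 0.5)

# Cells of the confusion matrix
tp <- sum(pred == 1 & sim$maxed == 1)  # true positives
tn <- sum(pred == 0 & sim$maxed == 0)  # true negatives
fp <- sum(pred == 1 & sim$maxed == 0)  # false positives
fn <- sum(pred == 0 & sim$maxed == 1)  # false negatives

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)  # P(predicted 1 | actual 1)
specificity <- tn / (tn + fp)  # P(predicted 0 | actual 0)
fpr         <- fp / (fp + tn)  # P(predicted 1 | actual 0) = 1 - specificity

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, fpr = fpr)
```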
Exercise 6
Use the model credit_fit that includes the main effects for income, rating, and region.
- Construct the ROC curve and compute the area under the curve (AUC).
- Based on the AUC, do you think this model sufficiently identifies those who will max out their credit card vs. those who will not? Explain.
- Suppose a credit card company uses your model to inform the credit limit to give to new customers. Do you think they would prioritize sensitivity, specificity, or regard both equally? Briefly explain.
- Based on your response to part (c), select a threshold for classifying observations into those likely to max out the credit card and those who are not. What is your threshold in terms of probability? What is the sensitivity? What is the specificity?
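To make the ROC construction concrete, the sketch below builds the curve by hand in base R: it sweeps a grid of thresholds, records sensitivity and specificity at each one, and computes the AUC from the rank-sum formula. The simulated data, the simplified model, and all object names are assumptions for illustration. The final step picks the threshold that maximizes sensitivity + specificity, which is just one possible rule and may not match the priorities argued for in part (c). The pROC package loaded at the top of the document offers roc() and auc() as a packaged alternative.

```r
# Simulated stand-in for the credit data (illustration only)
set.seed(4)
n <- 400
sim <- data.frame(income = runif(n, 10, 150), rating = runif(n, 100, 900))
sim$maxed <- rbinom(n, 1, plogis(2 + 0.02 * sim$income - 0.01 * sim$rating))
fit <- glm(maxed ~ income + rating, data = sim, family = binomial)
pi_hat <- predict(fit, type = "response")

is_pos <- sim$maxed == 1

# Sensitivity and specificity when classifying as 1 if pi_hat >= threshold
thresholds <- seq(0, 1, by = 0.01)
roc_pts <- data.frame(
  threshold   = thresholds,
  sensitivity = sapply(thresholds, function(t) mean(pi_hat[is_pos] >= t)),
  specificity = sapply(thresholds, function(t) mean(pi_hat[!is_pos] < t))
)

# AUC via the rank-sum formula: the probability that a randomly chosen
# maxed = 1 case has a higher predicted probability than a maxed = 0 case
n1 <- sum(is_pos)
n0 <- sum(!is_pos)
auc <- (sum(rank(pi_hat)[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# One possible threshold rule (an assumption): maximize sensitivity + specificity
best <- roc_pts[which.max(roc_pts$sensitivity + roc_pts$specificity), ]
best
```

Calling plot(1 - roc_pts$specificity, roc_pts$sensitivity, type = "l") draws the curve from these points. A company that cares more about one type of error could instead filter roc_pts for a minimum acceptable sensitivity or specificity and choose among the remaining thresholds.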
To submit the AE:
Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉