AE 09: Exam 02 review

Published

December 2, 2024

Important

Go to the course GitHub organization and locate your ae-09 repo to get started.

Render, commit, and push your responses to GitHub by the end of class to submit your AE.

library(tidyverse)
library(knitr)
library(tidymodels)
library(pROC)
library(Stat2Data)

Exercise 1

Suppose you fit a simple linear regression model.

Draw a scatterplot that contains an observation with large leverage but low Cook’s distance.
Draw a scatterplot that contains an observation with large leverage and high Cook’s distance.
Draw a scatterplot that contains an observation with a large studentized residual.

Exercise 2

Describe what it means for $\tilde{\boldsymbol{\beta}}$ to be the maximum likelihood estimator.
What are properties of MLEs?

Note

Use this data analysis scenario for Exercises 3 - 6.

The data for this analysis is about credit card customers. It can be found in the file credit.csv. The following variables are in the data set:

income: Income in $1,000’s
limit: Credit limit
rating: Credit rating
cards: Number of credit cards
age: Age in years
education: Number of years of education
own: Whether an individual owns their home ( No or Yes )
student: Whether the individual was a student ( No or Yes )
married: Whether the individual was married (No and Yes)
region: Region the individual is from ( South, East, and West)
balance: Average credit card balance in $.

credit <- read_csv("data/credit.csv") |>
  mutate(maxed = factor(if_else(balance == 0, 1, 0)))

The objective of this analysis is to predict whether a person has maxed out their credit card, i.e., had $0 average card balance.

We’ll start with a model predicting the odds of maxed = 1 using income, rating, and region.

credit_fit <- glm(maxed ~ income + rating + region, data = credit, 
                  family = "binomial")

tidy(credit_fit) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	9.898	1.449	6.829	0.000
income	0.113	0.021	5.273	0.000
rating	-0.057	0.008	-7.397	0.000
regionSouth	-0.595	0.604	-0.985	0.324
regionWest	-0.082	0.649	-0.126	0.900

Exercise 3

The logistic regression model takes the following form:

\[ \log(\frac{\pi_i}{1 - \pi_i}) = \beta_0 + \beta_1 ~ income + \beta_2 ~ rating + \beta_3 ~ regionSouth + \beta_4 ~ regionWest \]

Use the equation above to show the expected change in the odds of maxing out a credit card when the credit rating increases by 10 points. Assume income and region are constant. Write your answer in terms of $\beta_0, \beta_1, \beta_2, \beta_3, \beta_4$
Suppose there are two individuals. Individual 1 has an income of $64,000, a credit rating of 590, and is from the South region. Individual 2 has an income of $135,000, a credit rating of 695, and is from the East region. Use the equation above to show how the odds of maxing out a credit card differ between Individual 1 and Individual 2. Write your answer in terms of $\beta_0, \beta_1, \beta_2$, etc.
Use R to compute the odds for each individual. How do the odds compare? Is this consistent with your response to part (b)?

Exercise 4

We consider adding the interaction between region and income to the current model. We’ll use a drop-in-deviance test to determine whether or not to add the interaction term.

State the null and alternative hypotheses in words and using mathematical notation.
Describe what the test statistic $G$ means in the context of the data.
Show why the degrees of freedom for the test statistic are equal to 2.
Conduct the drop-in-deviance test and state your conclusion in the context of the data.

Exercise 5

Use the model credit_fit that includes the main effects for income, rating, and region.

Compute the predicted probabilities, then use those to predict whether individuals maxed out their credit card using a threshold of $\hat{\pi} = 0.5$.
What is the accuracy? What does it mean in the context of the data?
What is the sensitivity? What does it mean in the context of the data ?
What is the false positive rate? How it is computed given the specificity? What does it mean in the context of the data?

Exercise 6

Use the model credit_fit that includes the main effects for income, rating, and region.

Construct the ROC curve and compute the area under the curve (AUC).
Based on the AUC, do you think this model sufficiently identifies those who will max out their credit card vs. those who will not? Explain.
Suppose a credit card company uses your model to inform the credit limit to give to new customers. Do you think they would prioritize sensitivity, specificity, or regard both equally? Briefly explain.
Based on your response to part(c), select a threshold for classifying observations into those likely to max out the credit card and those who are not. What is your threshold in terms of probability? What is the sensitivity? What is the specificity?

Submission

To submit the AE:

Render the document to produce the PDF with all of your work from today’s class.

Push all your work to your AE repo on GitHub. You’re done! 🎉