Model diagnostics

Prof. Maria Tackett

Oct 17, 2024

Announcements

  • Exam corrections (optional) due Thursday, October 24 at 11:59pm on Canvas

  • Labs resume on Monday

  • Project: Exploratory data analysis due October 31

  • Statistics experience due Tuesday, November 26

Computing set up

# load packages
library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # modeling + broom's tidy() and augment()
library(knitr)       # tables with kable()
library(patchwork)   # combining ggplots
library(viridis)     # color palettes

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

  • Review: Maximum likelihood estimation

  • Influential points

  • Model diagnostics

    • Leverage

    • Studentized residuals

    • Cook’s Distance

Maximum likelihood estimation

Likelihood

  • A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).

  • Note that this is not the same as the probability function.

    • Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)

    • Likelihood function: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value

Maximum likelihood estimation

  • Maximum likelihood estimation is the process of finding the values of the parameters that maximize the likelihood function, i.e., the parameter values under which the observed data are most likely.

  • There are three primary ways to find the maximum likelihood estimator:

    • Approximating from a graph
    • Using calculus
    • Using numerical approximation

Simple linear regression model

Suppose we have the simple linear regression (SLR) model

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \hspace{10mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]

such that \(\epsilon_i\) are independently and identically distributed.


We can write this model in the form below and use this to find the MLE

\[ y_i | x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2_{\epsilon}) \]

Likelihood for SLR

The likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is

\[ \begin{aligned} L&(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[5pt] &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\Big\{-\frac{1}{2\sigma_\epsilon^2}(y_i - [\beta_0 + \beta_1x_i])^2\Big\} \\[10pt] & = (2\pi\sigma^2_{\epsilon})^{-\frac{n}{2}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2\Big\} \end{aligned} \]

Log-likelihood for SLR

The log-likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is

\[ \begin{aligned} \log &L(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[8pt] & = -\frac{n}{2}\log(2\pi\sigma^2_{\epsilon}) -\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2 \end{aligned} \]


We will use the log-likelihood function to find the MLEs
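
For the third approach, the log-likelihood can be maximized directly with a general-purpose optimizer. Below is a minimal sketch using optim(); the simulated data, starting values, and bounds are illustrative assumptions, not the lecture's example.

# numerically maximize the SLR log-likelihood
# (simulated data; in practice x and y come from your data set)
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1.5)

neg_loglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- par[3]
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

# optim() minimizes, so pass the negative log-likelihood;
# the lower bound keeps sigma positive during the search
mle <- optim(par = c(0, 0, 1), fn = neg_loglik,
             method = "L-BFGS-B", lower = c(-Inf, -Inf, 1e-6))
mle$par          # close to the values used in the simulation
coef(lm(y ~ x))  # least-squares estimates match the first two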

MLE for \(\beta_0,\beta_1, \sigma^2_{\epsilon}\)

\[ \tilde{\beta}_0 = \frac{1}{n}\sum_{i=1}^ny_i - \frac{1}{n}\tilde{\beta}_1\sum_{i=1}^n x_i \]


\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{\sum_{i=1}^n(x_i - \bar{x})^2} \]


\[ \tilde{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^n(y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2}{n} = \frac{\sum_{i=1}^ne_i^2}{n} \]
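
These closed-form expressions are easy to verify numerically. A quick sketch, reusing the simulated x and y from the optim() example above:

# closed-form MLEs for the SLR parameters
n <- length(y)
beta1_mle <- sum(y * (x - mean(x))) / sum((x - mean(x))^2)
beta0_mle <- mean(y) - beta1_mle * mean(x)
sigma2_mle <- sum((y - beta0_mle - beta1_mle * x)^2) / n  # divides by n, not n - 2

c(beta0_mle, beta1_mle)  # matches coef(lm(y ~ x))
sigma2_mle               # slightly smaller than sigma(lm(y ~ x))^2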

MLE for linear regression in matrix form

\[ L(\boldsymbol{\beta}, \sigma^2_{\epsilon} | \mathbf{X}, \mathbf{y}) = \frac{1}{(2\pi)^{n/2}\sigma^n_{\epsilon}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\Big\} \]

\[ \log L(\boldsymbol{\beta}, \sigma^2_\epsilon | \mathbf{X}, \mathbf{y}) = -\frac{n}{2}\log(2\pi) - n \log(\sigma_{\epsilon}) - \frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

  1. For a fixed value of \(\sigma_\epsilon\), we know that \(\log L\) is maximized when what is true about \((\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\)?
  2. What does this tell us about the relationship between the MLE and least-squares estimator for \(\boldsymbol{\beta}\)?
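
One way to check question 2 numerically, again with the simulated x and y from above (a sketch, not the full derivation):

# least-squares / MLE for beta in matrix form
X <- cbind(1, x)  # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
coef(lm(y ~ x))   # identical, up to numerical precision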

Why maximum likelihood estimation?

  • “Maximum likelihood estimation is, by far, the most popular technique for deriving estimators.” (Casella and Berger 2024, 315)

  • MLEs have nice statistical properties. They are

    • Consistent

    • Efficient - have the smallest MSE among all consistent estimators

    • Asymptotically normal

Putting it all together

  • The MLE \(\tilde{\boldsymbol{\beta}}\) is equivalent to the least-squares estimator \(\hat{\boldsymbol{\beta}}\), when the errors follow independent and identical normal distributions

  • This means the least-squares estimator \(\hat{\boldsymbol{\beta}}\) inherits all the nice properties of MLEs

    • Consistency
    • Efficiency - minimum variance among all consistent estimators
    • Asymptotically normal

Putting it all together

  • From previous work, we also know \(\hat{\boldsymbol{\beta}}\) is unbiased and thus the MLE \(\tilde{\boldsymbol{\beta}}\) is unbiased
  • Note that the MLE \(\tilde{\sigma}^2_{\epsilon}\) is asymptotically unbiased
    • The estimate from least-squares \(\hat{\sigma}_{\epsilon}^2\) is unbiased

Model diagnostics

Data: Duke lemurs

Today’s data contains a subset of the original Duke Lemur data set available in the TidyTuesday GitHub repo. This data includes information on “young adult” lemurs from the Coquerel’s sifaka species (PCOQ), the largest species at the Duke Lemur Center. The analysis will focus on the following variables:

  • age_at_wt_mo: Age of the animal when the weight was taken, in months, computed as ((Weight_Date - DOB) / 365) * 12

  • weight_g: Animal weight, in grams. Weights under 500 g are generally recorded to the nearest 0.1-1 g; weights over 500 g to the nearest 1-20 g.

The goal of the analysis is to use the age of the lemurs to understand variability in the weight.

EDA
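
A scatterplot of the response against the predictor is a natural starting point. A minimal sketch, assuming the data frame is named lemurs as in the model-fitting code below:

# scatterplot of weight vs. age
ggplot(lemurs, aes(x = age_at_wt_mo, y = weight_g)) +
  geom_point() +
  labs(x = "Age (months)", y = "Weight (g)",
       title = "Weight vs. age, young adult Coquerel's sifakas")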

Fit model

# fit the SLR model of weight on age
lemurs_fit <- lm(weight_g ~ age_at_wt_mo, data = lemurs)

tidy(lemurs_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)   3133.284    353.499      8.864    0.000
age_at_wt_mo    19.558     10.083      1.940    0.056

Model conditions

  • Linearity

  • Constant variance

Model conditions

  • Normality

  • Independence

Model diagnostics

# add diagnostics (fitted values, residuals, leverage, etc.) to the data
lemurs_aug <- augment(lemurs_fit)

lemurs_aug |> slice(1:10)
# A tibble: 10 × 8
   weight_g age_at_wt_mo .fitted .resid   .hat .sigma    .cooksd .std.resid
      <dbl>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
 1     3400         32.0   3758. -358.  0.0158   516. 0.00396       -0.703 
 2     4143         46.2   4037.  106.  0.0655   517. 0.00159        0.213 
 3     3581         43.1   3977. -396.  0.0414   515. 0.0134        -0.787 
 4     3620         33.0   3778. -158.  0.0141   517. 0.000690      -0.310 
 5     3720         32.4   3768.  -47.9 0.0149   517. 0.0000668     -0.0940
 6     3540         35.4   3825. -285.  0.0134   516. 0.00212       -0.559 
 7     4440         37.3   3863.  577.  0.0161   513. 0.0105         1.13  
 8     4440         32.6   3770.  670.  0.0147   511. 0.0129         1.31  
 9     3770         31.8   3754.   15.6 0.0162   517. 0.00000767     0.0305
10     3920         31.9   3757.  163.  0.0159   517. 0.000828       0.320 

Model diagnostics in R

Use the augment() function in the broom package to output the model diagnostics (along with the predicted values and residuals)

  • response and predictor variables in the model
  • .fitted: predicted values
  • .se.fit: standard errors of predicted values
  • .resid: residuals
  • .hat: leverage
  • .sigma: estimate of the residual standard deviation when the corresponding observation is dropped from the model
  • .cooksd: Cook’s distance
  • .std.resid: standardized residuals

Influential Point

An observation is influential if removing it has a noticeable impact on the regression coefficients

Influential points

  • Influential points have a noticeable impact on the coefficients and standard errors used for inference
  • These points can sometimes be identified in a scatterplot if there is only one predictor variable
    • This is often not the case when there are multiple predictors
  • We will use measures to quantify an individual observation’s influence on the regression model
    • leverage, standardized & studentized residuals, and Cook’s distance

Leverage

Hat matrix

  • Recall the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

  • We’ve seen that \(\mathbf{H}\) is used to compute \(Var(\hat{\mathbf{y}}) = \sigma^2_{\epsilon}\mathbf{H}\) and \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)

  • An element \(h_{ij}\) of \(\mathbf{H}\) is the leverage of the observation \(y_j\) on the fitted value \(\hat{y}_{i}\)

Leverage

  • We focus on the diagonal elements

    \[ h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \] where \(\mathbf{x}^T_i\) is the \(i^{th}\) row of \(\mathbf{X}\)

  • \(h_{ii}\) is the leverage: a measure of the distance of the \(i^{th}\) observation from the center (or centroid) of the \(x\) space

  • Observations with large values of \(h_{ii}\) are far away from the typical value (or combination of values) of the predictors in the data
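
The .hat column from augment() can be reproduced directly from this formula. A minimal sketch, assuming the lemurs_fit and lemurs_aug objects from earlier:

# leverage from the hat matrix of the fitted model
X <- model.matrix(lemurs_fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
head(diag(H))          # h_ii for the first few observations
head(lemurs_aug$.hat)  # matches augment()'s .hat column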

Large leverage

  • The sum of the leverages for all points is \(p + 1\), where \(p\) is the number of predictors in the model. More specifically,

    \[ \sum_{i=1}^n h_{ii} = \text{rank}(\mathbf{H}) = \text{rank}(\mathbf{X}) = p+1 \]

  • The average value of leverage, \(h_{ii}\), is \(\bar{h} = \frac{(p+1)}{n}\)

  • An observation has large leverage if \[h_{ii} > \frac{2(p+1)}{n}\]

Lemurs: Leverage

# leverage threshold: 2(p + 1) / n, with p = 1 predictor
h_threshold <- 2 * 2 / nrow(lemurs)
h_threshold
[1] 0.05263158
lemurs_aug |>
  filter(.hat > h_threshold)
# A tibble: 7 × 8
  weight_g age_at_wt_mo .fitted .resid   .hat .sigma .cooksd .std.resid
     <dbl>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
1     4143         46.2   4037.  106.  0.0655   517. 0.00159     0.213 
2     4313         58.2   4272.   41.1 0.229    517. 0.00123     0.0910
3     4640         54.9   4208.  432.  0.173    514. 0.0895      0.925 
4     3677         47.4   4061. -384.  0.0770   515. 0.0253     -0.778 
5     4319         58.2   4272.   47.1 0.229    517. 0.00161     0.104 
6     3610         47.4   4061. -451.  0.0770   514. 0.0348     -0.914 
7     3597         48.6   4084. -487.  0.0889   514. 0.0480     -0.992 

Why do you think these points have large leverage?

Let’s look at the data

Large leverage

If there is a point with high leverage, ask

  • ❓ Is there a data entry error?

  • ❓ Is this observation within the scope of individuals for which you want to make predictions and draw conclusions?

  • ❓ Is this observation impacting the estimates of the model coefficients? (Need more information!)

Just because a point has high leverage does not necessarily mean it will have a substantial impact on the regression. Therefore we need to check other measures.

Scaled residuals

Scaled residuals

  • What is the best way to identify outlier points that don’t fit the pattern from the regression line?

    • Look for points that have large residuals
  • We can rescale residuals and put them on a common scale to more easily identify “large” residuals

  • We will consider two types of scaled residuals: standardized residuals and studentized residuals

Standardized residuals

  • The variance of the residuals can be estimated by the mean squared residuals (MSR) \(= \frac{SSR}{n - p - 1} = \hat{\sigma}^2_{\epsilon}\)

  • We can use MSR to compute standardized residuals

    \[ std.res_i = \frac{e_i}{\sqrt{MSR}} \]

  • Standardized residuals are produced by augment() in the column .std.resid

Studentized residuals

  • MSR is an approximation of the variance of the residuals.

  • The variance of the residuals is \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)

    • The variance of the \(i^{th}\) residual is \(Var(e_i) = \sigma^2_{\epsilon}(1 - h_{ii})\)
  • The studentized residual is the residual rescaled using this more exact calculation of its variance

\[ r_i = \frac{e_{i}}{\sqrt{\hat{\sigma}^2_{\epsilon}(1 - h_{ii})}} \]

  • Standardized and studentized residuals provide similar information about which points are outliers in the response.
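
A minimal sketch comparing the two versions side by side, using the lemurs_fit and lemurs_aug objects from earlier (in broom's output, .std.resid includes the leverage adjustment, so it matches the studentized formula here):

# compare standardized and studentized residuals
msr <- sigma(lemurs_fit)^2  # MSR = SSR / (n - p - 1)

lemurs_aug |>
  mutate(
    std_resid  = .resid / sqrt(msr),               # standardized
    stud_resid = .resid / sqrt(msr * (1 - .hat))   # studentized
  ) |>
  select(.resid, std_resid, stud_resid, .std.resid) |>
  slice(1:5)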

Using standardized residuals

We can examine the standardized residuals directly from the output from the augment() function

  • An observation is a potential outlier if its standardized residual is beyond \(\pm 3\)
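
For the lemur model, a quick check using the augment() output (a minimal sketch):

# flag observations with standardized residuals beyond +/- 3
lemurs_aug |>
  filter(abs(.std.resid) > 3)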

Digging in to the data

Let’s look at the value of the response variable to better understand potential outliers

Cook’s Distance

Motivating Cook’s Distance

  • An observation’s influence on the regression line depends on

    • How close it lies to the general trend of the data

    • Its leverage

  • Cook’s Distance is a statistic that includes both of these components to measure an observation’s overall impact on the model

Cook’s Distance

Cook’s distance for the \(i^{th}\) observation is

\[ D_i = \frac{r^2_i}{p + 1}\Big(\frac{h_{ii}}{1 - h_{ii}}\Big) \]

This measure is a combination of

  • How well the model fits the \(i^{th}\) observation (magnitude of residuals)

  • How far the \(i^{th}\) observation is from the rest of the data (where the point is in the \(x\) space)
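
\(D_i\) can be reconstructed from the augment() columns. A minimal sketch for the lemur model, where \(p = 1\):

# Cook's distance from scaled residuals and leverage
p <- 1  # one predictor in the lemur model
lemurs_aug |>
  mutate(cooksd_manual = .std.resid^2 / (p + 1) * .hat / (1 - .hat)) |>
  select(.cooksd, cooksd_manual) |>
  slice(1:5)  # the two columns agree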

Using Cook’s Distance

  • An observation with a large value of \(D_i\) is said to have a strong influence on the predicted values

  • General thresholds: An observation with

    • \(D_i > 0.5\) is moderately influential

    • \(D_i > 1\) is very influential

Cook’s Distance

Cook’s Distance is in the column .cooksd in the output from the augment() function
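
A quick check against the thresholds from the previous slide (a minimal sketch):

# flag observations with moderate or strong influence
lemurs_aug |>
  filter(.cooksd > 0.5)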

Using these measures

  • Standardized residuals, leverage, and Cook’s Distance should all be examined together

  • Examine plots of these measures to identify observations that are outliers, have high leverage, and potentially impact the model, as in the sketch below
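
One way to view all three measures at once (a sketch; the leverage threshold uses \(p = 1\) as before):

# leverage vs. standardized residuals, with point size showing Cook's distance
ggplot(lemurs_aug, aes(x = .hat, y = .std.resid, size = .cooksd)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = c(-3, 3), linetype = "dashed") +
  geom_vline(xintercept = 2 * 2 / nrow(lemurs_aug), linetype = "dashed") +
  labs(x = "Leverage", y = "Standardized residual", size = "Cook's D")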

What to do with outliers/influential points?

  • First consider if the outlier is a result of a data entry error.

  • If not, you may consider dropping an observation that is an outlier in the predictor variables if…

    • It is meaningful to drop the observation given the context of the problem

    • You intended to build a model on a smaller range of the predictor variables. Mention this in the write-up of the results, and be careful to avoid extrapolation when making predictions

What to do with outliers/influential points?

  • It is generally not good practice to drop observations that are outliers in the value of the response variable

    • These are legitimate observations and should be in the model

    • You can try transformations or increasing the sample size by collecting more data

  • A general strategy when there are influential points is to fit the model with and without the influential points and compare the outcomes

Recap

  • Reviewed maximum likelihood estimation

  • Influential points

  • Model diagnostics

    • Leverage

    • Studentized residuals

    • Cook’s Distance

References

Casella, George, and Roger Berger. 2024. Statistical Inference. CRC Press.