Model diagnostics

Prof. Maria Tackett

Oct 17, 2024

Announcements

  • Exam corrections (optional) due Thursday, October 24 at 11:59pm on Canvas

  • Labs resume on Monday

  • Project: Exploratory data analysis due October 31

  • Statistics experience due Tuesday, November 26

Computing set up

# load packages
library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # modeling + broom's tidy() and augment()
library(knitr)       # tables with kable()
library(patchwork)   # combining ggplots
library(viridis)     # color palettes

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Topics

  • Review: Maximum likelihood estimation

  • Influential points

  • Model diagnostics

    • Leverage

    • Studentized residuals

    • Cook’s Distance

Maximum likelihood estimation

Likelihood

  • A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).

  • Note that this is not the same as the probability function.

    • Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)

    • Likelihood function: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value

Maximum likelihood estimation

  • Maximum likelihood estimation is the process of finding the values of the parameters that maximize the likelihood function, i.e., the parameter values under which the observed data are most likely.

  • There are three primary ways to find the maximum likelihood estimator:

    • Approximating from a graph
    • Using calculus
    • Using numerical approximation

Simple linear regression model

Suppose we have the simple linear regression (SLR) model

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \hspace{10mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]

such that \(\epsilon_i\) are independently and identically distributed.


We can write this model in the form below and use this to find the MLE

\[ y_i | x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2_{\epsilon}) \]

Likelihood for SLR

The likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is

\[ \begin{aligned} L&(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[5pt] &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\Big\{-\frac{1}{2\sigma_\epsilon^2}(y_i - [\beta_0 + \beta_1x_i])^2\Big\} \\[10pt] & = (2\pi\sigma^2_{\epsilon})^{-\frac{n}{2}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2\Big\} \end{aligned} \]

Log-likelihood for SLR

The log-likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is

\[ \begin{aligned} \log &L(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[8pt] & = -\frac{n}{2}\log(2\pi\sigma^2_{\epsilon}) -\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2 \end{aligned} \]


We will use the log-likelihood function to find the MLEs
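
For the third approach, the log-likelihood can be maximized directly with a general-purpose optimizer. Below is a minimal sketch using optim(); the simulated data, starting values, and bounds are illustrative assumptions, not the lecture's example.

# numerically maximize the SLR log-likelihood
# (simulated data; in practice x and y come from your data set)
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1.5)

neg_loglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- par[3]
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

# optim() minimizes, so pass the negative log-likelihood;
# the lower bound keeps sigma positive during the search
mle <- optim(par = c(0, 0, 1), fn = neg_loglik,
             method = "L-BFGS-B", lower = c(-Inf, -Inf, 1e-6))
mle$par          # close to the values used in the simulation
coef(lm(y ~ x))  # least-squares estimates match the first two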

MLE for \(\beta_0,\beta_1, \sigma^2_{\epsilon}\)

\[ \tilde{\beta}_0 = \frac{1}{n}\sum_{i=1}^ny_i - \frac{1}{n}\tilde{\beta}_1\sum_{i=1}^n x_i \]


\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{\sum_{i=1}^n(x_i - \bar{x})^2} \]


\[ \tilde{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^n(y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2}{n} = \frac{\sum_{i=1}^ne_i^2}{n} \]
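
These closed-form expressions are easy to verify numerically. A quick sketch, reusing the simulated x and y from the optim() example above:

# closed-form MLEs for the SLR parameters
n <- length(y)
beta1_mle <- sum(y * (x - mean(x))) / sum((x - mean(x))^2)
beta0_mle <- mean(y) - beta1_mle * mean(x)
sigma2_mle <- sum((y - beta0_mle - beta1_mle * x)^2) / n  # divides by n, not n - 2

c(beta0_mle, beta1_mle)  # matches coef(lm(y ~ x))
sigma2_mle               # slightly smaller than sigma(lm(y ~ x))^2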

MLE for linear regression in matrix form

\[ L(\boldsymbol{\beta}, \sigma^2_{\epsilon} | \mathbf{X}, \mathbf{y}) = \frac{1}{(2\pi)^{n/2}\sigma^n_{\epsilon}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\Big\} \]

\[ \log L(\boldsymbol{\beta}, \sigma^2_\epsilon | \mathbf{X}, \mathbf{y}) = -\frac{n}{2}\log(2\pi) - n \log(\sigma_{\epsilon}) - \frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]

  1. For a fixed value of \(\sigma_\epsilon\), we know that \(\log L\) is maximized when what is true about \((\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\)?
  2. What does this tell us about the relationship between the MLE and least-squares estimator for \(\boldsymbol{\beta}\)?
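
One way to check question 2 numerically, again with the simulated x and y from above (a sketch, not the full derivation):

# least-squares / MLE for beta in matrix form
X <- cbind(1, x)  # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
coef(lm(y ~ x))   # identical, up to numerical precision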

Why maximum likelihood estimation?

  • “Maximum likelihood estimation is, by far, the most popular technique for deriving estimators.” (Casella and Berger 2024, 315)

  • MLEs have nice statistical properties. They are

    • Consistent

    • Efficient - have the smallest MSE among all consistent estimators

    • Asymptotically normal

Putting it all together

  • The MLE \(\tilde{\boldsymbol{\beta}}\) is equivalent to the least-squares estimator \(\hat{\boldsymbol{\beta}}\), when the errors follow independent and identical normal distributions

  • This means the least-squares estimator \(\hat{\boldsymbol{\beta}}\) inherits all the nice properties of MLEs

    • Consistency
    • Efficiency - minimum variance among all consistent estimators
    • Asymptotically normal

Putting it all together

  • From previous work, we also know \(\hat{\boldsymbol{\beta}}\) is unbiased and thus the MLE \(\tilde{\boldsymbol{\beta}}\) is unbiased
  • Note that the MLE \(\tilde{\sigma}^2_{\epsilon}\) is asymptotically unbiased
    • The estimate from least-squares \(\hat{\sigma}_{\epsilon}^2\) is unbiased

Model diagnostics

Data: Duke lemurs

Today’s data contains a subset of the original Duke Lemur data set available in the TidyTuesday GitHub repo. This data includes information on “young adult” lemurs from the Coquerel’s sifaka species (PCOQ), the largest species at the Duke Lemur Center. The analysis will focus on the following variables:

  • age_at_wt_mo: Age of the animal when the weight was taken, in months, computed as ((Weight_Date - DOB) / 365) * 12

  • weight_g: Animal weight, in grams. Weights under 500 g are generally recorded to the nearest 0.1-1 g; weights over 500 g to the nearest 1-20 g.

The goal of the analysis is to use the age of the lemurs to understand variability in the weight.

EDA
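
A scatterplot of the response against the predictor is a natural starting point. A minimal sketch, assuming the data frame is named lemurs as in the model-fitting code below:

# scatterplot of weight vs. age
ggplot(lemurs, aes(x = age_at_wt_mo, y = weight_g)) +
  geom_point() +
  labs(x = "Age (months)", y = "Weight (g)",
       title = "Weight vs. age, young adult Coquerel's sifakas")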

Fit model

# fit the SLR model of weight on age
lemurs_fit <- lm(weight_g ~ age_at_wt_mo, data = lemurs)

tidy(lemurs_fit) |>
  kable(digits = 3)
term          estimate  std.error  statistic  p.value
(Intercept)   3133.284    353.499      8.864    0.000
age_at_wt_mo    19.558     10.083      1.940    0.056

Model conditions

  • Linearity

  • Constant variance

Model conditions

  • Normality

  • Independence

Model diagnostics

# add diagnostics (fitted values, residuals, leverage, etc.) to the data
lemurs_aug <- augment(lemurs_fit)

lemurs_aug |> slice(1:10)
# A tibble: 10 × 8
   weight_g age_at_wt_mo .fitted .resid   .hat .sigma    .cooksd .std.resid
      <dbl>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
 1     3400         32.0   3758. -358.  0.0158   516. 0.00396       -0.703 
 2     4143         46.2   4037.  106.  0.0655   517. 0.00159        0.213 
 3     3581         43.1   3977. -396.  0.0414   515. 0.0134        -0.787 
 4     3620         33.0   3778. -158.  0.0141   517. 0.000690      -0.310 
 5     3720         32.4   3768.  -47.9 0.0149   517. 0.0000668     -0.0940
 6     3540         35.4   3825. -285.  0.0134   516. 0.00212       -0.559 
 7     4440         37.3   3863.  577.  0.0161   513. 0.0105         1.13  
 8     4440         32.6   3770.  670.  0.0147   511. 0.0129         1.31  
 9     3770         31.8   3754.   15.6 0.0162   517. 0.00000767     0.0305
10     3920         31.9   3757.  163.  0.0159   517. 0.000828       0.320 

Model diagnostics in R

Use the augment() function in the broom package to output the model diagnostics (along with the predicted values and residuals)

  • response and predictor variables in the model
  • .fitted: predicted values
  • .se.fit: standard errors of predicted values
  • .resid: residuals
  • .hat: leverage
  • .sigma: estimate of the residual standard deviation when the corresponding observation is dropped from the model
  • .cooksd: Cook’s distance
  • .std.resid: standardized residuals

Influential Point

An observation is influential if removing it has a noticeable impact on the regression coefficients

Influential points

  • Influential points have a noticeable impact on the coefficients and standard errors used for inference
  • These points can sometimes be identified in a scatterplot if there is only one predictor variable
    • This is often not the case when there are multiple predictors
  • We will use measures to quantify an individual observation’s influence on the regression model
    • leverage, standardized & studentized residuals, and Cook’s distance

Leverage

Hat matrix

  • Recall the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

  • We’ve seen that \(\mathbf{H}\) is used to compute \(Var(\hat{\mathbf{y}}) = \sigma^2_{\epsilon}\mathbf{H}\) and \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)

  • An element \(h_{ij}\) of \(\mathbf{H}\) is the leverage of the observation \(y_j\) on the fitted value \(\hat{y}_{i}\)

Leverage

  • We focus on the diagonal elements

    \[ h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \] where \(\mathbf{x}^T_i\) is the \(i^{th}\) row of \(\mathbf{X}\)

  • \(h_{ii}\) is the leverage: a measure of the distance of the \(i^{th}\) observation from the center (or centroid) of the \(x\) space

  • Observations with large values of \(h_{ii}\) are far away from the typical value (or combination of values) of the predictors in the data
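
The .hat column from augment() can be reproduced directly from this formula. A minimal sketch, assuming the lemurs_fit and lemurs_aug objects from earlier:

# leverage from the hat matrix of the fitted model
X <- model.matrix(lemurs_fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
head(diag(H))          # h_ii for the first few observations
head(lemurs_aug$.hat)  # matches augment()'s .hat column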

Large leverage

  • The sum of the leverages for all points is \(p + 1\), where \(p\) is the number of predictors in the model. More specifically,

    \[ \sum_{i=1}^n h_{ii} = \text{rank}(\mathbf{H}) = \text{rank}(\mathbf{X}) = p+1 \]

  • The average value of leverage, \(h_{ii}\), is \(\bar{h} = \frac{(p+1)}{n}\)

  • An observation has large leverage if \[h_{ii} > \frac{2(p+1)}{n}\]

Lemurs: Leverage

# leverage threshold: 2(p + 1) / n, with p = 1 predictor
h_threshold <- 2 * 2 / nrow(lemurs)
h_threshold
[1] 0.05263158
lemurs_aug |>
  filter(.hat > h_threshold)
# A tibble: 7 × 8
  weight_g age_at_wt_mo .fitted .resid   .hat .sigma .cooksd .std.resid
     <dbl>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
1     4143         46.2   4037.  106.  0.0655   517. 0.00159     0.213 
2     4313         58.2   4272.   41.1 0.229    517. 0.00123     0.0910
3     4640         54.9   4208.  432.  0.173    514. 0.0895      0.925 
4     3677         47.4   4061. -384.  0.0770   515. 0.0253     -0.778 
5     4319         58.2   4272.   47.1 0.229    517. 0.00161     0.104 
6     3610         47.4   4061. -451.  0.0770   514. 0.0348     -0.914 
7     3597         48.6   4084. -487.  0.0889   514. 0.0480     -0.992 

Why do you think these points have large leverage?

Let’s look at the data

Large leverage

If there is a point with high leverage, ask

  • ❓ Is there a data entry error?

  • ❓ Is this observation within the scope of individuals for which you want to make predictions and draw conclusions?

  • ❓ Is this observation impacting the estimates of the model coefficients? (Need more information!)

Just because a point has high leverage does not necessarily mean it will have a substantial impact on the regression. Therefore we need to check other measures.

Scaled residuals

Scaled residuals

  • What is the best way to identify outlier points that don’t fit the pattern from the regression line?

    • Look for points that have large residuals
  • We can rescale residuals and put them on a common scale to more easily identify “large” residuals

  • We will consider two types of scaled residuals: standardized residuals and studentized residuals

Standardized residuals

  • The variance of the residuals can be estimated by the mean squared residuals (MSR) \(= \frac{SSR}{n - p - 1} = \hat{\sigma}^2_{\epsilon}\)

  • We can use MSR to compute standardized residuals

    \[ std.res_i = \frac{e_i}{\sqrt{MSR}} \]

  • Standardized residuals are produced by augment() in the column .std.resid

Studentized residuals

  • MSR is an approximation of the variance of the residuals.

  • The variance of the residuals is \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)

    • The variance of the \(i^{th}\) residual is \(Var(e_i) = \sigma^2_{\epsilon}(1 - h_{ii})\)
  • The studentized residual is the residual rescaled using this more exact calculation of its variance

\[ r_i = \frac{e_{i}}{\sqrt{\hat{\sigma}^2_{\epsilon}(1 - h_{ii})}} \]

  • Standardized and studentized residuals provide similar information about which points are outliers in the response.
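
A minimal sketch comparing the two versions side by side, using the lemurs_fit and lemurs_aug objects from earlier (in broom's output, .std.resid includes the leverage adjustment, so it matches the studentized formula here):

# compare standardized and studentized residuals
msr <- sigma(lemurs_fit)^2  # MSR = SSR / (n - p - 1)

lemurs_aug |>
  mutate(
    std_resid  = .resid / sqrt(msr),               # standardized
    stud_resid = .resid / sqrt(msr * (1 - .hat))   # studentized
  ) |>
  select(.resid, std_resid, stud_resid, .std.resid) |>
  slice(1:5)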

Using standardized residuals

We can examine the standardized residuals directly from the output from the augment() function

  • An observation is a potential outlier if its standardized residual is beyond \(\pm 3\)
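
For the lemur model, a quick check using the augment() output (a minimal sketch):

# flag observations with standardized residuals beyond +/- 3
lemurs_aug |>
  filter(abs(.std.resid) > 3)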

Digging in to the data

Let’s look at the value of the response variable to better understand potential outliers

Cook’s Distance

Motivating Cook’s Distance

  • An observation’s influence on the regression line depends on

    • How close it lies to the general trend of the data

    • Its leverage

  • Cook’s Distance is a statistic that includes both of these components to measure an observation’s overall impact on the model

Cook’s Distance

Cook’s distance for the \(i^{th}\) observation is

\[ D_i = \frac{r^2_i}{p + 1}\Big(\frac{h_{ii}}{1 - h_{ii}}\Big) \]

This measure is a combination of

  • How well the model fits the \(i^{th}\) observation (magnitude of residuals)

  • How far the \(i^{th}\) observation is from the rest of the data (where the point is in the \(x\) space)
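
\(D_i\) can be reconstructed from the augment() columns. A minimal sketch for the lemur model, where \(p = 1\):

# Cook's distance from scaled residuals and leverage
p <- 1  # one predictor in the lemur model
lemurs_aug |>
  mutate(cooksd_manual = .std.resid^2 / (p + 1) * .hat / (1 - .hat)) |>
  select(.cooksd, cooksd_manual) |>
  slice(1:5)  # the two columns agree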

Using Cook’s Distance

  • An observation with a large value of \(D_i\) is said to have a strong influence on the predicted values

  • General thresholds: An observation with

    • \(D_i > 0.5\) is moderately influential

    • \(D_i > 1\) is very influential

Cook’s Distance

Cook’s Distance is in the column .cooksd in the output from the augment() function
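
A quick check against the thresholds from the previous slide (a minimal sketch):

# flag observations with moderate or strong influence
lemurs_aug |>
  filter(.cooksd > 0.5)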

Using these measures

  • Standardized residuals, leverage, and Cook’s Distance should all be examined together

  • Examine plots of these measures to identify observations that are outliers, have high leverage, and potentially impact the model, as in the sketch below
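
One way to view all three measures at once (a sketch; the leverage threshold uses \(p = 1\) as before):

# leverage vs. standardized residuals, with point size showing Cook's distance
ggplot(lemurs_aug, aes(x = .hat, y = .std.resid, size = .cooksd)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = c(-3, 3), linetype = "dashed") +
  geom_vline(xintercept = 2 * 2 / nrow(lemurs_aug), linetype = "dashed") +
  labs(x = "Leverage", y = "Standardized residual", size = "Cook's D")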

What to do with outliers/influential points?

  • First consider if the outlier is a result of a data entry error.

  • If not, you may consider dropping an observation that is an outlier in the predictor variables if…

    • It is meaningful to drop the observation given the context of the problem

    • You intended to build a model on a smaller range of the predictor variables. Mention this in the write-up of the results, and be careful to avoid extrapolation when making predictions

What to do with outliers/influential points?

  • It is generally not good practice to drop observations that are outliers in the value of the response variable

    • These are legitimate observations and should be in the model

    • You can try transformations or increasing the sample size by collecting more data

  • A general strategy when there are influential points is to fit the model with and without the influential points and compare the outcomes

Recap

  • Reviewed maximum likelihood estimation

  • Influential points

  • Model diagnostics

    • Leverage

    • Studentized residuals

    • Cook’s Distance

References

Casella, George, and Roger Berger. 2024. Statistical Inference. CRC Press.