Oct 17, 2024
Exam corrections (optional) due Thursday, October 24 at 11:59pm on Canvas
Labs resume on Monday
Project: Exploratory data analysis due October 31
Statistics experience due Tuesday, November 26
Review: Maximum likelihood estimation
Influential points
Model diagnostics
Leverage
Studentized residuals
Cook’s Distance
A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).
Note that this is not the same as the probability function.
Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)
Likelihood function: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value
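As a small illustration of the distinction (a sketch in R; the single observation y = 2 and the N(mu, 1) model are made up for this example):

```r
# Probability (density) function: fix the parameter, vary the possible outcomes
mu_fixed <- 0
outcomes <- c(-1, 0, 1, 2)
dnorm(outcomes, mean = mu_fixed, sd = 1)      # density of each outcome given mu = 0

# Likelihood function: fix the observed data, vary the parameter
y_obs <- 2
mu_candidates <- c(-1, 0, 1, 2)
dnorm(y_obs, mean = mu_candidates, sd = 1)    # likelihood of each mu given y = 2
```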
Maximum likelihood estimation is the process of finding the values of the parameters that maximize the likelihood function, i.e., the values that are most likely given the observed data.
There are three primary ways to find the maximum likelihood estimator
Suppose we have the simple linear regression (SLR) model
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \hspace{10mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]
where the \(\epsilon_i\) are independent and identically distributed.
We can write this model in the form below and use this to find the MLE
\[ y_i | x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2_{\epsilon}) \]
The likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is
\[ \begin{aligned} L&(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[5pt] &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\Big\{{-\frac{1}{2\sigma_\epsilon^2}(y_i - [\beta_0 + \beta_1x_i])^2}\Big\} \\[10pt] & = (2\pi\sigma^2_{\epsilon})^{-\frac{n}{2}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2\Big\} \end{aligned} \]
The log-likelihood function for \(\beta_0, \beta_1, \sigma^2_{\epsilon}\) is
\[ \begin{aligned} \log &L(\beta_0, \beta_1, \sigma^2_{\epsilon} | x_1, \dots, x_n, y_1, \dots, y_n) \\[8pt] & = -\frac{n}{2}\log(2\pi\sigma^2_{\epsilon}) -\frac{1}{2\sigma^2_{\epsilon}}\sum_{i=1}^n(y_i - \beta_0 - \beta_1x_i)^2 \end{aligned} \]
We will use the log-likelihood function to find the MLEs
\[ \tilde{\beta}_0 = \frac{1}{n}\sum_{i=1}^ny_i - \frac{1}{n}\tilde{\beta}_1\sum_{i=1}^n x_i \]
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{\sum_{i=1}^n(x_i - \bar{x})^2} \]
\[ \tilde{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^n(y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2}{n} = \frac{\sum_{i=1}^ne_i^2}{n} \]
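To see these formulas in action, here is a hedged sketch that maximizes the log-likelihood numerically with optim() on simulated data and compares the result to lm(); the data values and object names are made up for illustration:

```r
# Simulated data from an SLR model (values made up for illustration)
set.seed(123)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

# Negative log-likelihood in (beta0, beta1, log sigma)
neg_loglik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

mle <- optim(c(0, 0, 0), neg_loglik)
mle$par[1:2]         # numerical MLEs of beta0 and beta1
coef(lm(y ~ x))      # least-squares estimates: essentially the same values
exp(2 * mle$par[3])  # MLE of sigma^2 (divides the SSR by n rather than n - 2)
```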
\[ L(\boldsymbol{\beta}, \sigma^2_{\epsilon} | \mathbf{X}, \mathbf{y}) = \frac{1}{(2\pi)^{n/2}\sigma^n_{\epsilon}}\exp\Big\{-\frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\Big\} \]
\[ \log L(\boldsymbol{\beta}, \sigma^2_\epsilon | \mathbf{X}, \mathbf{y}) = -\frac{n}{2}\log(2\pi) - n \log(\sigma_{\epsilon}) - \frac{1}{2\sigma^2_{\epsilon}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
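A short sketch of this matrix form as an R function, reusing the simulated x and y from the previous sketch (the name loglik_matrix is illustrative):

```r
# Evaluate the matrix-form log-likelihood for a given beta and sigma
loglik_matrix <- function(beta, sigma, X, y) {
  n     <- length(y)
  resid <- y - X %*% beta
  -(n / 2) * log(2 * pi) - n * log(sigma) - drop(crossprod(resid)) / (2 * sigma^2)
}

# Example call: X includes a column of ones for the intercept
X <- cbind(1, x)
loglik_matrix(beta = c(2, 3), sigma = 1.5, X = X, y = y)
```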
“Maximum likelihood estimation is, by far, the most popular technique for deriving estimators.” (Casella and Berger 2024, 315)
MLEs have nice statistical properties. They are
Consistent
Efficient - Have the smallest MSE among all consistent estimators
Asymptotically normal
The MLE \(\tilde{\boldsymbol{\beta}}\) is equivalent to the least-squares estimator \(\hat{\boldsymbol{\beta}}\) when the errors are independent and identically normally distributed
This means the least-squares estimator \(\hat{\boldsymbol{\beta}}\) inherits all the nice properties of MLEs
Today’s data contains a subset of the original Duke Lemur data set available in the TidyTuesday GitHub repo. This data includes information on “young adult” lemurs from the Coquerel’s sifaka species (PCOQ), the largest species at the Duke Lemur Center. The analysis will focus on the following variables:
age_at_wt_mo: age of the animal when the weight was taken, in months, computed as ((Weight_Date - DOB)/365)*12
weight_g: animal weight, in grams. Weights under 500 g are generally recorded to the nearest 0.1-1 g; weights over 500 g generally to the nearest 1-20 g.
The goal of the analysis is to use the age of the lemurs to understand variability in the weight.
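A hedged sketch of how a similar subset could be constructed and the model fit; the TidyTuesday URL and the filtering columns taxon and age_category are assumptions, not the course's exact preprocessing:

```r
library(tidyverse)
library(broom)

# Read the full lemur data from the TidyTuesday repo (URL and column names assumed)
lemurs <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv")

# Restrict to young adult Coquerel's sifakas, similar to the subset described above
pcoq_young <- lemurs |>
  filter(taxon == "PCOQ", age_category == "young_adult")

# Model weight (grams) as a function of age at weighing (months)
lemur_fit <- lm(weight_g ~ age_at_wt_mo, data = pcoq_young)
tidy(lemur_fit)
```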
Before using the model for inference, we check the usual regression conditions:
Linearity
Constant variance
Normality
Independence
# A tibble: 10 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3400 32.0 3758. -358. 0.0158 516. 0.00396 -0.703
2 4143 46.2 4037. 106. 0.0655 517. 0.00159 0.213
3 3581 43.1 3977. -396. 0.0414 515. 0.0134 -0.787
4 3620 33.0 3778. -158. 0.0141 517. 0.000690 -0.310
5 3720 32.4 3768. -47.9 0.0149 517. 0.0000668 -0.0940
6 3540 35.4 3825. -285. 0.0134 516. 0.00212 -0.559
7 4440 37.3 3863. 577. 0.0161 513. 0.0105 1.13
8 4440 32.6 3770. 670. 0.0147 511. 0.0129 1.31
9 3770 31.8 3754. 15.6 0.0162 517. 0.00000767 0.0305
10 3920 31.9 3757. 163. 0.0159 517. 0.000828 0.320
Use the augment() function in the broom package to output the model diagnostics (along with the predicted values and residuals)
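For example, a call along these lines produces a table like the one above (a sketch; the model object name lemur_fit carries over from the earlier sketch):

```r
library(broom)

# Append fitted values, residuals, and diagnostic measures to the data
lemur_aug <- augment(lemur_fit)
lemur_aug
```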
.fitted: predicted values
.se.fit: standard errors of predicted values
.resid: residuals
.hat: leverage
.sigma: estimate of the residual standard deviation when the corresponding observation is dropped from the model
.cooksd: Cook's distance
.std.resid: standardized residuals
An observation is influential if removing it has a noticeable impact on the regression coefficients
Recall the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
We’ve seen that \(\mathbf{H}\) is used to compute \(Var(\hat{\mathbf{y}}) = \sigma^2_{\epsilon}\mathbf{H}\) and \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)
An element of \(\mathbf{H}\), \(h_{ij}\), is the leverage of the observation \(y_j\) on the fitted value \(\hat{y}_{i}\)
We focus on the diagonal elements
\[ h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \] where \(\mathbf{x}^T_i\) is the \(i^{th}\) row of \(\mathbf{X}\)
\(h_{ii}\) is the leverage: a measure of the distance of the \(i^{th}\) observation from the center (or centroid) of the \(x\) space
Observations with large values of \(h_{ii}\) are far away from the typical value (or combination of values) of the predictors in the data
The sum of the leverages for all points is \(p + 1\), where \(p\) is the number of predictors in the model. More specifically,
\[ \sum_{i=1}^n h_{ii} = \text{rank}(\mathbf{H}) = \text{rank}(\mathbf{X}) = p+1 \]
The average value of leverage, \(h_{ii}\), is \(\bar{h} = \frac{(p+1)}{n}\)
An observation has large leverage if \[h_{ii} > \frac{2(p+1)}{n}\]
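A sketch of flagging such points with this threshold (assuming the augmented data frame lemur_aug from the earlier sketch and that the tidyverse is loaded):

```r
# One predictor (age_at_wt_mo), so p = 1
p <- 1
n <- nrow(lemur_aug)

# Flag observations whose leverage exceeds 2(p + 1)/n
lemur_aug |>
  filter(.hat > 2 * (p + 1) / n)

# Check: the leverages sum to p + 1, the rank of X
sum(lemur_aug$.hat)
```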
# A tibble: 7 × 8
weight_g age_at_wt_mo .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4143 46.2 4037. 106. 0.0655 517. 0.00159 0.213
2 4313 58.2 4272. 41.1 0.229 517. 0.00123 0.0910
3 4640 54.9 4208. 432. 0.173 514. 0.0895 0.925
4 3677 47.4 4061. -384. 0.0770 515. 0.0253 -0.778
5 4319 58.2 4272. 47.1 0.229 517. 0.00161 0.104
6 3610 47.4 4061. -451. 0.0770 514. 0.0348 -0.914
7 3597 48.6 4084. -487. 0.0889 514. 0.0480 -0.992
Why do you think these points have large leverage?
If there is a point with high leverage, ask:
❓ Is there a data entry error?
❓ Is this observation within the scope of individuals for which you want to make predictions and draw conclusions?
❓ Is this observation impacting the estimates of the model coefficients? (Need more information!)
Just because a point has high leverage does not necessarily mean it will have a substantial impact on the regression. Therefore we need to check other measures.
What is the best way to identify outlier points that don’t fit the pattern from the regression line?
We can rescale residuals and put them on a common scale to more easily identify “large” residuals
We will consider two types of scaled residuals: standardized residuals and studentized residuals
The variance of the residuals can be estimated by the mean squared residuals (MSR) \(= \frac{SSR}{n - p - 1} = \hat{\sigma}^2_{\epsilon}\)
We can use MSR to compute standardized residuals
\[ std.res_i = \frac{e_i}{\sqrt{MSR}} \]
Standardized residuals are produced by augment() in the column .std.resid
MSR is an approximation of the variance of the residuals.
The variance of the residuals is \(Var(\mathbf{e}) = \sigma^2_{\epsilon}(\mathbf{I} - \mathbf{H})\)
The studentized residual is the residual rescaled by this exact variance, which accounts for the leverage \(h_{ii}\)
\[ r_i = \frac{e_{i}}{\sqrt{\hat{\sigma}^2_{\epsilon}(1 - h_{ii})}} \]
We can examine the standardized residuals directly from the output of the augment() function
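For comparison, a sketch that computes both scaled residuals by hand from the augmented data (names carry over from the earlier sketches); the two versions differ only by the \(\sqrt{1 - h_{ii}}\) factor, so they are close when leverage is small:

```r
# Mean squared residual (MSR) and the two scaled residuals, computed by hand
p   <- 1
n   <- nrow(lemur_aug)
msr <- sum(lemur_aug$.resid^2) / (n - p - 1)

lemur_aug |>
  mutate(
    std_res  = .resid / sqrt(msr),               # standardized residual
    stud_res = .resid / sqrt(msr * (1 - .hat))   # studentized residual
  ) |>
  select(weight_g, age_at_wt_mo, .resid, .hat, std_res, stud_res)
```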
Let’s look at the value of the response variable to better understand potential outliers
An observation’s influence on the regression line depends on
How close it lies to the general trend of the data
Its leverage
Cook’s Distance is a statistic that includes both of these components to measure an observation’s overall impact on the model
Cook’s distance for the \(i^{th}\) observation is
\[ D_i = \frac{r^2_i}{p + 1}\Big(\frac{h_{ii}}{1 - h_{ii}}\Big) \]
This measure is a combination of
How well the model fits the \(i^{th}\) observation (magnitude of residuals)
How far the \(i^{th}\) observation is from the rest of the data (where the point is in the \(x\) space)
An observation with a large value of \(D_i\) is said to have a strong influence on the predicted values
General thresholds. An observation with
\(D_i > 0.5\) is moderately influential
\(D_i > 1\) is very influential
Cook's Distance is in the column .cooksd in the output from the augment() function
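A sketch that checks the formula for \(D_i\) against the .cooksd column and flags points past the moderate-influence cutoff (names carry over from the earlier sketches):

```r
p   <- 1
n   <- nrow(lemur_aug)
msr <- sum(lemur_aug$.resid^2) / (n - p - 1)

lemur_aug |>
  mutate(
    stud_res   = .resid / sqrt(msr * (1 - .hat)),
    cooksd_man = stud_res^2 / (p + 1) * .hat / (1 - .hat)  # should match .cooksd
  ) |>
  filter(.cooksd > 0.5)   # moderately or very influential points, if any
```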
Standardized residuals, leverage, and Cook’s Distance should all be examined together
Examine plots of the measures to identify observations that are outliers, high leverage, and may potentially impact the model.
First consider if the outlier is a result of a data entry error.
If not, you may consider dropping an observation that is an outlier in the predictor variables if…
It is meaningful to drop the observation given the context of the problem
You intended to build a model on a smaller range of the predictor variables. Mention this in the write-up of the results, and be careful to avoid extrapolation when making predictions
It is generally not good practice to drop observations that are outliers in the value of the response variable
These are legitimate observations and should be in the model
You can try transformations or increasing the sample size by collecting more data
A general strategy when there are influential points is to fit the model with and without the influential points and compare the outcomes
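A sketch of that comparison, dropping points past the moderate-influence cutoff and refitting (names carry over from the earlier sketches):

```r
# Refit the model without the flagged points and compare coefficients
kept <- lemur_aug |> filter(.cooksd <= 0.5)

fit_all     <- lm(weight_g ~ age_at_wt_mo, data = lemur_aug)
fit_trimmed <- lm(weight_g ~ age_at_wt_mo, data = kept)

bind_rows(
  tidy(fit_all)     |> mutate(model = "all observations"),
  tidy(fit_trimmed) |> mutate(model = "influential points dropped")
)
```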
Reviewed Maximum likelihood estimation
Influential points
Model diagnostics
Leverage
Studentized residuals
Cook’s Distance