Lab 04: Maximum likelihood estimation

Return of the Palmer penguins

Due date

This lab is due on Thursday, October 24 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your team’s GitHub repo

  • Final .pdf file submitted on Gradescope

Introduction

In this lab you will compute maximum likelihood estimates for regression models looking at the relationship between features of penguins living in Palmer Archipelago in Antarctica. You will also explore properties of maximum likelihood estimators and how they are related to least-squares estimators.

Learning goals

By the end of the lab you will be able to…

  • compute estimates for \(\hat{\boldsymbol{\beta}}\) and \(\sigma^2_{\epsilon}\) using maximum likelihood estimation.
  • understand how assumptions of linear regression connect to maximum likelihood estimation.
  • describe the similarities and differences between maximum likelihood estimators and least-squares estimators.
  • evaluate which estimation procedure may be preferable in a given analysis scenario.

Getting started

  • A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

  • Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix lab-04. It contains the starter documents you need to complete the lab.

  • Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.

Workflow: Using Git and GitHub as a team

Important

There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 02. Only one person should type in the group’s .qmd file at a time to avoid merge conflicts. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.

Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.

Packages

You will use the following packages in today’s lab. Add other packages as needed.

library(tidyverse)
library(tidymodels)
library(knitr)

Data

Today’s dataset include information about characteristics of three species of penguins living in Palmer Archipelago in Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network (Gorman, Williams, and Fraser 2014).

The data are in the file palmer-penguins.csv in the data folder. This dataset is originally from the penguins data frame in the palmerpenguins R package with observations that have missing values removed. This analysis will focus on the following variables:

  • bill_depth_mm: bill depth in millimeters

  • bill_length_mm: bill length in millimeters

  • species: penguin species (Adélie, Chinstrap and Gentoo)

Click here to see the full data dictionary.

Exercises

Goal: The goal of this analysis is to use bill length to understand variability in bill depth, after accounting for species.

Exercise 1

We’ll start with exploratory data analysis focused on the relationship between the response and predictor variables.

  • Visualize the relationship between the response variable bill_depth_mm and predictor bill_length_mm .

  • Now, visualize the relationship between bill_depth_mm and bill_length_mm by species. Use geom_smooth(method = "lm", se = FALSE) to add lines and more clearly visualize the relationship for each species.

  • Based on these visualizations, why is it important to include species when in the model of the relationship between bill depth and length? Briefly explain.

  • Based on these visualizations, would you include an interaction term between the two predictors? Briefly explain?

Exercise 2

We will fit the main effects model using bill length and species to understand variability in the bill depth.

  • Write the form of the statistical (population-level) model in matrix form.

  • Write the dimensions for \(\mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\epsilon}\) specific for this problem.

Exercise 3

Consider the regression model described in Exercise 2.

  • Write the likelihood function \(L(\boldsymbol{\beta}, \sigma^2_{\epsilon} | \mathbf{y}, \mathbf{X})\) in matrix form.

  • Describe how each of the four model assumptions is necessary for the form of the likelihood function.

Exercise 4

Briefly explain how the process of finding the maximum likelihood estimators for the likelihood function in Exercise 3 is related to the process of finding the least-squares estimators for the model in Exercise 2.

Exercise 5

For the next few exercises, we will compare the results of the maximum likelihood and least-squares procedures.

  • Fit the least-squares regression model described in Exercise 2. Neatly display the results using three digits.

  • Describe the estimated effect of bill length on bill depth in the context of the data.

  • Describe the estimated effect of species on bill depth in the context of the data. Include discussion about whether there is statistical evidence of a difference between species.

Exercise 6

  • Use matrix/vector operations to compute the maximum likelihood estimators \(\tilde{\beta}\) for the model in Exercise 2.

  • How do these estimators compare to the least-squares estimators in the previous exercise?

Exercise 7

The maximum likelihood estimation procedure also produces an estimator for the variance about the regression line, \(\sigma^2_\epsilon\), which we can write as

\[ \tilde{\sigma}^2_{\epsilon} = \frac{1}{n} \mathbf{e}^T\mathbf{e} \]

We know that the maximum likelihood estimator and least-squares estimator for \(\sigma^2_{\epsilon}\) are not equal. Additionally, the least-squares estimator \(\hat{\sigma}^2_{\epsilon}\) is unbiased. We want to find a scaling factor \(c\) such that the maximum likelihood estimator is unbiased.

Using the data and regression estimates for this analysis, compute both the maximum likelihood and least-squares estimators for \(\sigma_{\epsilon}^2\), and then find \(c\) by solving the equation \[ \hat\sigma^2_{\epsilon} = c \cdot \tilde\sigma^2_{\epsilon} \]You can do this last step either computationally or algebraically.

Exercise 8

Now we will look into the last property of the maximum likelihood estimator for \(\boldsymbol{\beta}\), and thus of least-squares estimator - asymptotic normality.

In words, this property says that, when the number of samples \(n\) is large compared to the number of predictors \(p\), the maximum likelihood estimator \(\tilde{\boldsymbol\beta}\) follows a (multivariate) normal distribution \(N\big(\boldsymbol\beta, \sigma_\epsilon^2 (\mathbf{X}^T\mathbf{X})^{-1}\big)\). Let’s use this to construct an approximate confidence interval for \(\beta_j\), the coefficient for speciesChinstrap.

  • Use \(\tilde{\sigma}^2_{\epsilon}\), the maximum likelihood estimator, to compute the approximate confidence interval for \(\beta_j\) . The approximate 95% confidence interval may be computed as

\[ \tilde{\beta}_j\pm 2 \times SE(\tilde{\beta}_j) \]

  • Then interpret this interval in the context of the data.

Exercise 9

  • Compute the exact (based on the \(t\)-distribution) confidence interval for \(\beta_j\), the coefficient of speciesChinstrap.

  • Compare the center and width of the this exact interval with the one you computed in Exercise 8. Do they differ? By how much? Which one is wider, indicating more uncertainty?

Exercise 10

To wrap up, we have seen that both the OLS and the maximum likelihood procedures for linear regression produce the same coefficient estimates, but lead to different estimators for the variance \(\sigma_\epsilon^2\) and allow for different types of uncertainty quantification.

Based on the work in this lab, do you think performing inference based on either method would have changed your conclusion about the the relationship between bill depth and Chinstrap species?

Submission

You will submit the PDF documents for labs, homework, and exams in to Gradescope as part of your final submission.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Access Gradescope through the menu on the STA 221 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.

  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component Points
Ex 1 8
Ex 2 4
Ex 3 6
Ex 4 3
Ex 5 5
Ex 6 5
Ex 7 4
Ex 8 5
Ex 9 4
Ex 10 2
Workflow & formatting 4

The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member, a neatly organized document with readable code, and updating the team name and date in the YAML.

References

Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.