HW 01: Simple linear regression

Ice duration and air temperature in Madison, WI

Due date

This assignment is due on Thursday, September 19 at 11:59pm.

Introduction

You will use simple linear regression to analyze the relationship between air temperature and ice duration for two lakes in Madison, Wisconsin. You will also explore the mathematical properties of simple linear regression models.

Learning goals

In this assignment, you will…

use matrix operations to show results about simple linear regression.
conduct exploratory data analysis.
fit and interpret simple linear regression models.
evaluate model fit.
continue developing a workflow for reproducible data analysis.

Getting started

Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

a. Show that the hat matrix $H$ is symmetric $(H^{T} = H)$ and idempotent $(H^{2} = H)$.

b. Show that $(I - H)$ is symmetric and idempotent.
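
Before writing the proof, it can help to see both properties hold numerically. The sketch below is only a sanity check on simulated data (the seed, sample size, and predictor are arbitrary assumptions); it does not replace the mathematical argument the exercise asks for.

```r
# Toy design matrix: a column of 1s for the intercept plus one simulated predictor
set.seed(221)
n <- 10
x <- rnorm(n)
X <- cbind(1, x)

# Hat matrix H = X (X^T X)^{-1} X^T
H <- X %*% solve(t(X) %*% X) %*% t(X)
I_n <- diag(n)

# all.equal() compares up to floating-point tolerance
h_symmetric  <- isTRUE(all.equal(H, t(H)))
h_idempotent <- isTRUE(all.equal(H %*% H, H))
m_symmetric  <- isTRUE(all.equal(I_n - H, t(I_n - H)))
m_idempotent <- isTRUE(all.equal((I_n - H) %*% (I_n - H), I_n - H))

c(h_symmetric, h_idempotent, m_symmetric, m_idempotent)
```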

Exercise 2

Let $x$ be a $k \times 1$ vector and $A$ be a symmetric $k \times k$ matrix, such that $A$ is not a function of $x$.

Show that the gradient of $x^{T} A x$ with respect to $x$ is

$\nabla_{x} x^{T} A x = 2 A x$

(Proposition 2 from class)
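
As a rough numerical illustration (not a proof), the gradient formula can be compared with a finite-difference approximation. The matrix `A`, the vector `x`, and the step size `h` below are arbitrary values chosen for this sketch.

```r
set.seed(221)
k <- 3
B <- matrix(rnorm(k * k), k, k)
A <- (B + t(B)) / 2   # symmetrize so that A matches the exercise's assumption
x <- rnorm(k)

# Quadratic form f(v) = v^T A v
f <- function(v) as.numeric(t(v) %*% A %*% v)

# Central-difference approximation of each partial derivative
h <- 1e-6
grad_numeric <- sapply(seq_len(k), function(j) {
  e <- rep(0, k)
  e[j] <- h
  (f(x + e) - f(x - e)) / (2 * h)
})

# Gradient given by the proposition
grad_formula <- as.numeric(2 * A %*% x)

max(abs(grad_numeric - grad_formula))  # should be close to zero
```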

Exercise 3

In class we used the sum of squared residuals (SSR) to estimate the regression coefficients, $\hat{β} = (X^{T} X)^{-1} X^{T} Y$. To show this is the least squares estimate, we now need to show that we have, in fact, found the estimate of $β$ that minimizes the SSR (rather than maximizes it).

If the Hessian matrix $\nabla_{β}^{2} \text{SSR}$ is positive definite, then we know we have found the $\hat{β}$ that minimizes SSR, i.e., the least squares estimator. Additionally, we have the following proposition:

Proposition

A matrix $A$ is positive definite if $z^{T} A z > 0$ for every non-zero vector $z$.

Show that $\nabla_{β}^{2} \text{SSR}$ is positive definite.

Exercise 4

Prove that the maximum value of $R^{2}$ must be less than 1 if the data set contains observations such that there are different observed values of the response for the same value of the predictor (e.g., the dataset contains observations $(x_{i}, y_{i})$ and $(x_{j}, y_{j})$ such that $x_{i} = x_{j}$ and $y_{i} \neq y_{j}$ ).
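
To build intuition before the proof, here is a small sketch with made-up numbers in which two observations share the same predictor value but have different responses. The data frame `toy` is hypothetical and exists only for this illustration.

```r
# Two observations share x = 1 but have different y values
toy <- data.frame(
  x = c(1, 1, 2, 3),
  y = c(2, 4, 5, 7)
)

fit <- lm(y ~ x, data = toy)

# Both x = 1 observations get the same fitted value, so the line
# cannot pass through both points and some residual is non-zero
fitted(fit)[1:2]
residuals(fit)[1:2]

r_squared <- summary(fit)$r.squared
r_squared
```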

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data

The datasets wi-icecover.csv and wi-air-temperature.csv contain information about ice cover and air temperature, respectively, at Lake Monona and Lake Mendota (both in Madison, Wisconsin) for days in 1886 through 2019. The data were obtained from the ntl_icecover and ntl_airtemp data frames in the lterdatasampler R package. They were originally collected by the US Long Term Ecological Research program (LTER) Network.

icecover <- read_csv("data/wi-icecover.csv")
airtemp <- read_csv("data/wi-air-temperature.csv")

The analysis will focus on the following variables:

year: year of observation
lakeid: lake name
ice_duration: number of days between the freeze and ice breakup dates of each lake
air_temp_avg: yearly average air temperature in Madison, WI (degrees Celsius)

Analysis goal

The goal of this analysis is to use linear regression to explain variability in ice duration for lakes in Madison, WI based on air temperature. Because ice cover is impacted by various environmental factors, researchers are interested in examining the association between these two factors to better understand the changing climate.

Exercise 5

Let’s start by looking at the response variable ice_duration.

Create side-by-side boxplots to visualize the distribution of ice_duration for each lake.
Visualize the distribution of ice duration over time for each lake.
There are separate measurements for each lake in the icecover data frame. In this analysis, we will combine the data from both lakes and use the average ice duration each year.

Evaluate the analysis choice to use the average per year rather than the individual lake measurements. Some things to consider in your evaluation: Does the average accurately reflect the ice duration for lakes in Madison, WI for that year? Will there be information loss? How might that impact (or not) the analysis conclusions? Etc.

Tip

See the ggplot2 reference for example code and plots.

Exercise 6

Next, let’s combine the ice duration and air temperature data into a single analysis data frame.

Fill in the code below to create a new data frame, icecover_avg, of the average ice duration by year.

Then join icecover_avg and airtemp to create a new data frame. The new data frame should have 134 observations.
```
icecover_avg <- icecover |>
  group_by(_____) |>
  summarise(_____) |>
  ungroup()
```
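
For the join step, the general pattern might look like the sketch below. It uses two small hypothetical data frames (`toy_duration` and `toy_temp`, with made-up values) rather than the assignment data, and `left_join()` is only one of several dplyr join functions that could be appropriate; choose the one that fits your data.

```r
library(dplyr)

# Hypothetical toy data frames with made-up values (not the assignment data)
toy_duration <- tibble(
  year = c(2001, 2002, 2003),
  avg_duration = c(95, 88, 102)
)
toy_temp <- tibble(
  year = c(2001, 2002, 2003),
  avg_temp = c(7.1, 8.0, 6.5)
)

# Match rows from the two data frames on the shared `year` column
toy_joined <- toy_duration |>
  left_join(toy_temp, by = "year")

# Check the number of rows after joining
nrow(toy_joined)
```

Checking `nrow()` after a join is a quick way to confirm that no rows were dropped or duplicated.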

Important

You will use the new data frame with average ice duration and average air temperature for the remainder of the assignment.

Visualize the relationship between the air temperature and average ice duration. Do you think a linear model would be a good fit to capture the relationship between the two variables?

Now is a good time to render your document again if you haven’t done so recently and commit (with a meaningful commit message) and push all updates.

Exercise 7

We will fit a model using the average air temperature to explain variability in ice duration that takes the form

$y = X β + ϵ$

State the dimensions of $y$, $X$, $β$, and $ϵ$ for this analysis. Your answer should have exact values given this data set.
Find the estimated regression coefficients $\hat{β}$ using the matrix representation of the model. Show the code used to get the answer.
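
As a sketch of the matrix calculation, the code below uses R's built-in `cars` data as a stand-in for the assignment data (that substitution is an assumption made purely for illustration). It builds the design matrix, solves the normal equations, and compares the result with `lm()`.

```r
# Built-in `cars` data (speed, dist) stands in for the assignment data
y <- cars$dist
X <- cbind(1, cars$speed)  # first column of 1s corresponds to the intercept

# Dimensions of the design matrix: n rows, 2 columns
dim(X)

# Solve the normal equations (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# The estimates should match those from lm()
coef(lm(dist ~ speed, data = cars))
```

Calling `solve(A, b)` solves the linear system directly, which avoids explicitly forming the inverse; `solve(t(X) %*% X) %*% t(X) %*% y` mirrors the formula more literally and gives the same estimates here.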

Exercise 8

Fit the model from the previous exercise using the lm function. Neatly display the results using 3 digits.
Interpret the slope in the context of the data.
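
One possible way to display model output neatly is sketched below, again using the built-in `cars` data as a stand-in. `tidy()` comes from the broom package (attached when tidymodels is loaded) and `kable()` from knitr; the object names are arbitrary.

```r
library(broom)   # tidy(); also attached by library(tidymodels)
library(knitr)   # kable()

# Fit a simple linear regression on the stand-in data
toy_fit <- lm(dist ~ speed, data = cars)

# Convert the coefficient summary to a data frame, then print as a table
toy_table <- tidy(toy_fit)
kable(toy_table, digits = 3)
```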

Now is a good time to render your document again if you haven’t done so recently and commit (with a meaningful commit message) and push all updates.

Exercise 9

Calculate $R^{2}$ for the model in the previous exercise and interpret it in the context of the data.
Briefly comment on the model fit based on $R^{2}$.
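
As a sketch of where $R^{2}$ comes from, the code below computes it from the residuals and compares it with the value reported by `summary()`. It uses the built-in `cars` data as a stand-in, and the object names are arbitrary.

```r
toy_fit <- lm(dist ~ speed, data = cars)

# Sum of squared residuals and total sum of squares
ssr <- sum(residuals(toy_fit)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)

# Proportion of variability in the response explained by the model
r_squared <- 1 - ssr / sst
r_squared

# Should agree with the value stored in the model summary
summary(toy_fit)$r.squared
```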

Exercise 10

You are asked to use a reproducible workflow for all of your work in the class, and the goal of this question is to better understand the potential real-world implications of doing so (or not). Below are some real-life examples in which having a non-reproducible workflow resulted in errors that impacted research or public records.

Source: Ostblom and Timbers (2022)
Reproducibility error	Consequence	Source(s)
Limitations in Excel data formats	Loss of 16,000 COVID case records in the UK	(Kelion 2020)
Automatic formatting in Excel	Important genes disregarded in scientific studies	(Ziemann, Eren, and El-Osta 2016)
Deletion of a cell caused rows to shift	Mix-up of which patient group received the treatment	(Wallensteen et al. 2018)
Using binary instead of explanatory labels	Mix-up of the intervention with the control group	(Aboumatar and Wise 2019)
Using the same notation for missing data and zero values	Paper retraction	(Whitehouse et al. 2021)
Incorrectly copying data in a spreadsheet	Delay in the opening of a hospital	(Picken 2020)

Choose one of the scenarios from the table and read the linked article discussing what went wrong. Then,

Briefly describe what went wrong, i.e., what part of the process was not reproducible and what error or impact that had.
Describe one way the researchers could have made the process reproducible.

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component	Points
Ex 1	4
Ex 2	4
Ex 3	4
Ex 4	4
Ex 5	6
Ex 6	6
Ex 7	5
Ex 8	5
Ex 9	4
Ex 10	5
Workflow & formatting	3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

References

Ostblom, Joel, and Tiffany Timbers. 2022. “Opinionated Practices for Teaching Reproducibility: Motivation, Guided Instruction and Practice.” Journal of Statistics and Data Science Education 30 (3): 241–50. https://doi.org/10.1080/26939169.2022.2074922.