HW 03: Multiple linear regression cont’d

Due date

This assignment is due on Thursday, October 31 at 11:59pm.

Introduction

In this assignment you will use linear regression to explore the relationship between multiple variables. You will also examine model diagnostics, variable transformations, and maximum likelihood estimation.

Learning goals

In this assignment, you will…

  • find maximum likelihood estimators under different scenarios.
  • use model diagnostics to identify influential points.
  • examine multicollinearity and consider strategies to handle it.
  • fit and interpret models with transformed variables.

Getting started

  • Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.

  • Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)
library(rms)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

Let \(x_1, \ldots, x_n\) be independently and identically distributed (i.i.d.) sample observations from a distribution where

\[ f(x | \theta) = \theta x^{\theta - 1} \hspace{8mm} 0\leq x\leq 1, \hspace{3mm} 0 < \theta < \infty \]

Find the maximum likelihood estimator (MLE) of \(\theta\).
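
Once you have a closed-form candidate, one way to sanity-check it is to maximize the log-likelihood numerically on simulated data and compare. The sketch below is only an illustration: the true parameter value, the sample size, and the search interval are arbitrary assumptions, and it relies on the fact that this density is the Beta(θ, 1) density so that `rbeta()` can generate the sample.

```r
# Numerical check of an MLE on simulated data (all values are illustrative)
set.seed(221)
theta_true <- 2.5   # hypothetical true parameter
n <- 500            # hypothetical sample size

# f(x | theta) = theta * x^(theta - 1) on [0, 1] is the Beta(theta, 1) density
x <- rbeta(n, shape1 = theta_true, shape2 = 1)

# Log-likelihood: sum over observations of log f(x_i | theta)
log_lik <- function(theta, x) {
  sum(log(theta) + (theta - 1) * log(x))
}

# Maximize over a bounded interval (the bounds are an assumption)
fit <- optimize(log_lik, interval = c(0.01, 20), x = x, maximum = TRUE)
theta_hat <- fit$maximum
theta_hat
```

Plugging the same simulated `x` into your derived formula should give a value very close to `theta_hat`. A numerical check like this does not replace the derivation.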

Exercise 2

Suppose you have the linear regression model

\[ y_i = \beta x_i + \epsilon_i \hspace{8mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]

such that \(\epsilon_i\) are i.i.d. and there is no intercept.

  1. Find \(\tilde{\beta}\) , the MLE of \(\beta\).
  2. Show that the MLE is unbiased. (Note: You must show this directly and may not use the result from part 3.)
  3. Show mathematically how \(\tilde{\beta}\) relates to the least-squares estimator.

Exercise 3

Suppose there are \(n\) observations, such that each \(y_i\) is generated from \(x_i\) based on the linear model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \hspace{8mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]

such that \(\epsilon_i\) are i.i.d.

The model is reparameterized (redefined) as

\[ y_i = \beta^{\prime}_0 + \beta^{\prime}_1(x_i - \bar{x}) + \epsilon_i \]

such that \(\epsilon_i\) follows the same distribution as the original model.

  1. Show that the MLE of \(\beta^{\prime}_1\) is equal to the MLE of \(\beta_1\).
  2. Show that the MLE of \(\beta^{\prime}_0\) is not equal to the MLE of \(\beta_0\).
  3. The predictor in the reparameterized model is mean-centered, i.e., the predictor is shifted by its mean value. Based on your responses to parts 1 and 2, what is one potential benefit of using mean-centered predictors in the model?

Tip

You do not need to derive the MLEs for \(\beta_0\) and \(\beta_1\). You may use the results from the notes.

You do need to show your work / explain your reasoning to get the MLEs for \(\beta^{\prime}_0\) and \(\beta^{\prime}_1\) .
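
The claims in this exercise can be illustrated (not proven) on simulated data. The sketch below uses hypothetical coefficients and a hypothetical predictor range. `lm()` returns least-squares estimates, and the sketch assumes that these coincide with the MLEs of the coefficients in the normal-error setting used here.

```r
# Compare a raw-predictor fit with a mean-centered fit (simulated data)
set.seed(221)
n <- 100
x <- runif(n, 10, 20)               # hypothetical predictor values
y <- 3 + 2 * x + rnorm(n, sd = 1)   # hypothetical true coefficients

fit_raw <- lm(y ~ x)

x_centered <- x - mean(x)           # shift the predictor by its mean
fit_centered <- lm(y ~ x_centered)

coef(fit_raw)
coef(fit_centered)

# The slope estimates agree; the intercept estimates do not
slopes_match <- isTRUE(all.equal(unname(coef(fit_raw)[2]),
                                 unname(coef(fit_centered)[2])))
slopes_match
```

Comparing the two intercepts with the range of `x` in the simulated data may help when thinking about the benefit of mean-centering.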

Exercise 4

Suppose we have a model of the form

\[ \log(y_i) =\beta_0 + \beta_1\log(x_i) + \epsilon_i \hspace{10mm} \epsilon_i \sim N(0, \sigma^2_{\epsilon}) \]

Describe the expected change in \(y_i\) when \(x_i\) is multiplied by a constant \(C\). Show the work used to obtain the expected change.

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: SAT

This data set contains the average SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 U.S. states. The data are based on test-takers for the 1982 exam. We will use the following variables:

  • SAT: average total SAT score

  • Takers: percentage of high school seniors who took exam

  • Income: median income of families of test-takers ($ hundreds)

  • Years: average number of years test-takers had formal education in social sciences, natural sciences, and humanities

  • Public: percentage of test-takers who attended public high schools

  • Expend: total state expenditure on high schools ($ hundreds per student)

  • Rank: median percentile rank of test-takers within their high school classes

The data are in the file sat-1982.csv in the data folder.

Exercise 5

  1. Fit a main effects model using SAT as the response and all other variables (except State) as the predictors. Neatly display the model using 3 digits.
  2. Are there any influential observations in the data set? Briefly explain, showing any work or output used to make the determination.
  3. Consider the observation with the highest value for Cook’s distance. What is the value of leverage for this observation? Would this be considered large leverage? Briefly explain, showing any work or output used to make the determination.
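
As a rough illustration of the diagnostics involved, the sketch below computes Cook's distance and leverage with the base R functions `cooks.distance()` and `hatvalues()`. It uses simulated data with one planted unusual point, not the SAT data. The leverage cutoff 2(p + 1)/n is a common rule of thumb and is an assumption here; use the thresholds from the course notes if they differ.

```r
# Cook's distance and leverage on simulated data (not the SAT data)
set.seed(221)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Plant one unusual observation: far from the other x values and off the trend
dat[n, ] <- list(x1 = 6, x2 = -6, y = -20)

fit <- lm(y ~ x1 + x2, data = dat)

diagnostics <- data.frame(
  obs      = seq_len(n),
  cooks_d  = cooks.distance(fit),
  leverage = hatvalues(fit)
)

p <- length(coef(fit)) - 1           # number of predictors
leverage_cutoff <- 2 * (p + 1) / n   # rule-of-thumb cutoff (an assumption)

# Observation with the largest Cook's distance, and whether its leverage is large
worst <- diagnostics[which.max(diagnostics$cooks_d), ]
worst
worst$leverage > leverage_cutoff
```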

Exercise 6

  1. Compute the Variance Inflation Factors (VIF) for the model from the previous exercise.
  2. Does this model have any issues with multicollinearity? If so, what predictors appear to be collinear?
  3. If multicollinearity is present, select a strategy to fit a model that does not have an issue with multicollinearity.
    • Briefly describe your strategy.
    • Select a final model.
    • Briefly explain your selection, showing any work and output used to choose a model.
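
To make the VIF concrete, the sketch below computes it from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data are simulated, the variable names are hypothetical, and two of the predictors are built to be nearly identical so that the inflation is visible.

```r
# Variance inflation factors computed from the definition (simulated data)
set.seed(221)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # unrelated to the other predictors
y  <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# For each predictor, regress it on the others and convert R^2 to a VIF
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

In this simulation the VIFs for `x1` and `x2` are far above the commonly used cutoff of 10, while the VIF for `x3` stays near 1. The cutoff of 10 is a convention, not a strict rule.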

Data: 2000 U.S. Presidential Election

We will examine data about the 2000 U.S. presidential election between George W. Bush and Al Gore. It was one of the closest elections in history, and it ultimately came down to the state of Florida. One county in particular, Palm Beach County, was at the center of the controversy because of the design of its ballots, the infamous butterfly ballots. It is believed that many people who intended to vote for Al Gore accidentally voted for Pat Buchanan because of how the spots to mark the candidate were arranged next to the names.

The variables in the data are

  • County: County name

  • Bush2000: Number of votes for George W. Bush

  • Buchanan2000: Number of votes for Pat Buchanan

The data are available in the file florida-votes-2000.csv in the data folder of your repo.

Exercise 7

The goal is to fit a model that uses the number of votes for Bush to predict the number of votes for Buchanan. Using this model, we’ll investigate whether the data support the claim that votes for Gore may have accidentally gone to Buchanan.

  1. Visualize the relationship between the number of votes for Buchanan versus the number of votes for Bush. Describe what you observe in the visualization, including a description of the relationship between the votes for Buchanan and votes for Bush.
  2. What is the county with the extreme outlier number of votes for Buchanan? Create a new data frame that doesn’t include the outlying county. You will use this updated data frame for the remainder of this exercise and Exercise 8.
  3. Fit a model to predict the number of votes for Buchanan based on the number of votes for Bush in the county.
    • Make a plot of the standardized residuals versus the fitted values.
    • Is the constant variance condition satisfied? Briefly explain.
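
For reference, the sketch below shows one way to build a plot of standardized residuals versus fitted values, using base R's `rstandard()` and `fitted()`. The data are simulated so that the error spread grows with the predictor. That is an assumption made for illustration and says nothing about the election data.

```r
# Standardized residuals vs. fitted values (simulated data)
set.seed(221)
n <- 80
x <- runif(n, 1, 100)
# Error spread grows with x, so constant variance fails by construction
y <- 5 + 0.5 * x + rnorm(n, sd = 0.1 * x)

fit <- lm(y ~ x)

resid_df <- data.frame(
  fitted    = fitted(fit),
  std_resid = rstandard(fit)
)

plot(resid_df$fitted, resid_df$std_resid,
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Standardized residuals vs. fitted values")
abline(h = 0, lty = 2)   # reference line at zero
```

A fan shape, where the vertical spread of the points changes across the fitted values, is the usual sign that the constant variance condition is violated.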

Exercise 8

Now let’s consider potential models with transformations on the response and/or predictor variables. The four candidate models are the following:

| Model | Response variable | Predictor variable |
|---|---|---|
| 1 (from previous exercise) | Buchanan2000 | Bush2000 |
| 2 | log(Buchanan2000) | Bush2000 |
| 3 | Buchanan2000 | log(Bush2000) |
| 4 | log(Buchanan2000) | log(Bush2000) |

Which model best fits the data? Briefly explain, showing any work and output used to support your answer. (Note: Use the data set without the outlying county to fit the candidate models.)

Exercise 9

Now we will use the model to predict the expected number of Buchanan votes for the outlier county.

Suppose the observed value of the predictor for this county (a new observation) is \(x_0\). We define \(\mathbf{x}_0^T = [1, x_0]\).

Then the predicted response is

\[ \hat{y}_0 = \mathbf{x}_0^T\hat{\boldsymbol{\beta}} \]

where \(\hat{\boldsymbol{\beta}}\) is the vector of estimated model coefficients.

Just as there is uncertainty in our model coefficients, there is uncertainty in our predictions as well. We use a confidence interval to quantify the uncertainty for a model coefficient, and we can use a prediction interval to quantify the uncertainty in the prediction for a new observation.

The \(C\%\) prediction interval for the new observation is

\[ \hat{y}_0 \pm t^*_{n - p - 1}\sqrt{\hat{\sigma}^2_\epsilon(1 + \mathbf{x}_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_0)} \]

where \(t^*_{n-p-1}\) is the critical value obtained from the \(t\) distribution with \(n - p - 1\) degrees of freedom, \(n\) is the number of observations, \(p\) is the number of predictors, \(\mathbf{X}\) is the design matrix for the model, and \(\hat{\sigma}^2_\epsilon\) is the estimated variance about the regression line.
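
The formula above can be sketched in R and compared with the output of `predict()` using `interval = "prediction"`. The data, coefficients, and new predictor value below are simulated and hypothetical. The point is only to show how each symbol in the formula maps to code.

```r
# Prediction interval from the matrix formula, checked against predict()
set.seed(221)
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 2)    # hypothetical true model
fit <- lm(y ~ x)

x0_value <- 12                         # hypothetical new predictor value
x0 <- c(1, x0_value)                   # x_0^T = [1, x_0]

X <- model.matrix(fit)                 # design matrix
beta_hat <- coef(fit)
p <- ncol(X) - 1                       # number of predictors
sigma2_hat <- sum(resid(fit)^2) / (n - p - 1)

y0_hat  <- sum(x0 * beta_hat)          # x_0^T beta_hat
se_pred <- sqrt(sigma2_hat * (1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0)))
t_star  <- qt(0.975, df = n - p - 1)   # critical value for a 95% interval

manual <- c(fit = y0_hat,
            lwr = y0_hat - t_star * se_pred,
            upr = y0_hat + t_star * se_pred)
builtin <- predict(fit, newdata = data.frame(x = x0_value),
                   interval = "prediction", level = 0.95)
manual
builtin
```

The two results should agree up to floating-point error, so either approach can be used as a check on the other.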

  1. Use the model you chose in the previous exercise to compute the predicted number of votes for Buchanan in the outlying county identified in Exercise 7. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).

  2. Compute the 95% prediction interval for this county. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).

  3. It is assumed that some of the votes for Buchanan in that county were actually intended to be for Gore. Based on your results in the previous question, does your model support this claim?

    • If no, briefly explain.

    • If yes, about how many votes were possibly intended for Gore? Show any calculations and output used to determine your answer. If you selected a model with a transformation, be sure to report your answer in terms of votes, not log(votes).

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

To submit your assignment:

  • Access Gradescope through the menu on the STA 221 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

| Component | Points |
|---|---|
| Ex 1 | 3 |
| Ex 2 | 9 |
| Ex 3 | 9 |
| Ex 4 | 3 |
| Ex 5 | 5 |
| Ex 6 | 5 |
| Ex 7 | 5 |
| Ex 8 | 3 |
| Ex 9 | 5 |
| Workflow & formatting | 3 |

The “Workflow & formatting” grade assesses the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code, and your name and the date updated in the YAML.