library(tidyverse)
library(tidymodels)
library(knitr)
Lab 05: Expanding Multiple Linear Regression
This lab is due on Thursday, November 7 at 11:59pm. To be considered on time, the following must be done by the due date:
Final .qmd and .pdf files pushed to your team’s GitHub repo
Final .pdf file submitted on Gradescope
Introduction
In this lab, you will apply what you’ve learned about linear regression to analyze data on expenditures and revenues for collegiate athletic programs. The analysis will focus on working with and interpreting categorical predictors and interaction terms. You will also explore model diagnostics and assess the appropriateness of the model for the data.
Learning goals
By the end of the lab you will be able to…
use multiple linear regression to analyze complex real-world data.
use variance-stabilizing transformations.
identify and address multicollinearity.
evaluate whether a model is appropriate for the data.
Getting started
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix lab-05. It contains the starter documents you need to complete the lab.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Workflow: Using Git and GitHub as a team
There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 02. Only one person should type in the group’s .qmd file at a time to avoid merge conflicts. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.
Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.
Packages
You will use the following packages in today’s lab. Add other packages as needed.
Data
Today’s data contain expenditures, revenues, student body composition, and other information about NCAA Division 1 athletic programs at colleges and universities in the United States. The data were featured as part of the TidyTuesday data visualization challenge. They were originally obtained from Equity in Athletics Data Analysis, a reporting tool by the Office of Postsecondary Education of the U.S. Department of Education.
We will investigate whether institution features affect the revenues of athletics programs in different sports. Here is a table describing the variables in the data.
This analysis will focus on the following variables:
institution_name: Institution name
sector_name: Institution type
ef_total_count: Total student enrollment for binary male/female gender
sum_partic_men: Total participation in athletics for men
sum_partic_women: Total participation in athletics for women
sports: Sports name
total_exp_menwomen: Total expenditure for both (in US dollars)
total_rev_menwomen: Total revenue for all sports (in US dollars)
Click here for the full data dictionary.
sports <- read_csv("data/ncaa-revenues.csv")
Exercises
Goal: The goal of the analysis is to use various features of NCAA Division 1 colleges and universities to explain variability in the total revenue from their athletic programs.
Exercise 1
We’ll begin by rescaling the response and some of the predictor variables.
- Rescale the response variable total_rev_menwomen and the primary predictor variable total_exp_menwomen, so that they are in terms of $100K (100 thousand dollars). Name the new variables rev100k and exp100k, respectively.
- Rescale the predictor variable ef_total_count, so it is in terms of thousands of students. Name the new variable students_1k.
- Briefly explain why we might rescale these variables instead of using them in the original units.
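The rescaling steps can be sketched with `mutate()`. This is a minimal illustration only: the `toy` data frame and its values are made up, standing in for the real `sports` data, and only the column names come from the lab.

```r
library(dplyr)

# Hypothetical toy rows (made-up values), not the lab data
toy <- tibble(
  total_rev_menwomen = c(250000, 1200000),
  total_exp_menwomen = c(300000, 1000000),
  ef_total_count     = c(5000, 32000)
)

toy_rescaled <- toy |>
  mutate(
    rev100k     = total_rev_menwomen / 100000,  # revenue in units of $100K
    exp100k     = total_exp_menwomen / 100000,  # expenditure in units of $100K
    students_1k = ef_total_count / 1000         # enrollment in thousands of students
  )

print(toy_rescaled)
```

Dividing by a constant changes only the units of the corresponding coefficients, not the fit of the model.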
Exercise 2
Before modeling, let’s do some exploratory data analysis.
- Make a visualization of the relationship between revenue and expenditures. Use the plot to describe the relationship between the two variables.
- Higher values of both expenditures and revenues are associated with greater variability. Transform both variables using a log transformation to deal with the potential violation of an assumption of linear regression. Name the variables log_exp and log_rev, respectively.
- Larger values of expenditures and revenues seem to follow a slightly different trend and are associated with two sports: football and basketball. Create an indicator variable that takes value 1 if the sport is basketball or football and 0 otherwise. Name the variable bball_football.
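One way to build the transformed variables and the indicator is sketched below on a made-up toy data frame. The sport labels `"Basketball"` and `"Football"` are assumptions; check how the sports are actually spelled in the data (for example with `count(sports, sports)`) before reusing the idea.

```r
library(dplyr)

# Hypothetical toy rows (made-up values); sport labels are assumed spellings
toy <- tibble(
  sports  = c("Basketball", "Football", "Soccer"),
  exp100k = c(40, 150, 6),
  rev100k = c(45, 200, 5)
)

toy_eda <- toy |>
  mutate(
    log_exp = log(exp100k),  # natural log of expenditures
    log_rev = log(rev100k),  # natural log of revenues
    # 1 for basketball or football, 0 for every other sport
    bball_football = if_else(sports %in% c("Basketball", "Football"), 1, 0)
  )

print(toy_eda)
```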
Exercise 3
- Visualize the relationship between log_rev and log_exp.
- From the visualization, you may notice a lot of observations form a straight diagonal line, indicating a perfect one-to-one relationship between expenses and revenues. Provide a possible interpretation of this phenomenon.
- Do you think it is reasonable to include observations displaying this exact relationship? Briefly explain.
- Create a new data frame filtering out the observations for which expenditures and revenues are exactly equal. Call the new data frame sports_nolinear.
You will use sports_nolinear for the remainder of the assignment.
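Dropping rows where two columns are exactly equal is a single `filter()` call. The sketch below uses a made-up toy data frame; the name `toy_nolinear` is hypothetical and the values are for illustration only.

```r
library(dplyr)

# Hypothetical toy rows (made-up values): rows 1 and 3 have equal values
toy <- tibble(
  exp100k = c(10, 25, 7),
  rev100k = c(10, 30, 7)
)

# Keep only rows where expenditures and revenues differ
toy_nolinear <- toy |>
  filter(exp100k != rev100k)

print(toy_nolinear)
```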
Exercise 4
Use sports_nolinear to fit a regression model with the log-transformed revenue as the response variable and the following predictors: log-transformed expenditures, student enrollment, institution type, sports type, participation in athletics for men, participation in athletics for women, the basketball/football indicator you created in a previous exercise, and the interaction between the log-transformed expenditures and the basketball/football indicator.
Neatly display the model using 3 digits.
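A common pattern for a neat coefficient table is `tidy()` followed by `kable(digits = 3)`. The sketch below uses the built-in `mtcars` data as a stand-in, so the model itself is unrelated to the lab. `broom` is loaded directly here; it is also attached by `tidymodels`.

```r
library(broom)
library(knitr)

# Stand-in model on built-in data; not the lab model
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value
fit_table <- tidy(fit)

# Render as a table rounded to 3 digits
kable(fit_table, digits = 3)
```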
Exercise 5
Consider the regression model from the previous exercise.
What type of institution and sport correspond to the intercept?
You’ll notice that one coefficient has a missing value. Why is the coefficient missing? What is the technical name of this phenomenon?
Exercise 6
For the sake of interpretability, it is useful to have a regression model in which no coefficients are missing and the coefficient for each sport indicator represents the baseline level for that sport. To achieve this, use the code below to fit another regression model that uses the same predictors as before but drops the unnecessary variables and the intercept (the latter via the -1 in the formula).
sports_fit_2 <- lm(log_rev ~ -1 + sports + students_1k + sector_name +
                     sum_partic_men + sum_partic_women + log_exp +
                     log_exp * bball_football - bball_football,
                   data = sports_nolinear)
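To see which columns a formula of this shape produces, it can help to inspect the design matrix directly. The sketch below mirrors the formula's structure on the built-in `mtcars` data, where `cyl`, `wt`, and `am` are only placeholders for a categorical predictor, a numeric predictor, and a 0/1 indicator.

```r
# Stand-in formula with the same structure: no intercept, one factor,
# a numeric main effect, and an interaction whose indicator main effect is removed
X <- model.matrix(~ -1 + factor(cyl) + wt + wt * am - am, data = mtcars)

# Column names show one indicator per factor level, the numeric main effect,
# and the interaction, with no separate column for the 0/1 indicator
design_columns <- colnames(X)
print(design_columns)
```

In R formula syntax, `a * b` expands to `a + b + a:b`, so subtracting `b` leaves `a + a:b`.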
Why is there only one coefficient for institution type, even after the intercept was removed?
Which type of institution was chosen to be the baseline in this model?
Exercise 7
Now that we have an interpretable model, let us assess the model fit and perform diagnostics to verify whether our linear regression assumptions are reasonable for this data.
As a first step, provide some overall measures of model fit and comment on whether it seems to have an acceptable predictive power on the response of interest.
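Overall fit measures can be pulled with `glance()`. The sketch below again uses a stand-in model on `mtcars`; which measures to report is a judgment call, and the selection here is only one possibility.

```r
library(broom)
library(dplyr)

# Stand-in model on built-in data; not the lab model
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# One-row summary of the fit; keep a few commonly reported measures
fit_stats <- glance(fit) |>
  select(r.squared, adj.r.squared, sigma, AIC, BIC)

print(fit_stats)
```

One caveat worth keeping in mind: for a model fit without an intercept, `lm()` computes R-squared relative to zero rather than to the mean of the response, so it is not directly comparable to R-squared from a model with an intercept.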
Exercise 8
Next, let’s take a look at the residuals for the model.
Because there are multiple sports at every institution, we may be concerned the residuals within an institution are correlated with each other (thus violating the independence assumption).
Due to the large number of institutions, we will look at a randomly selected subset of 20 institutions to evaluate this.
- Take a random sample of 20 institutions. Use set.seed(221) to make your results reproducible.
- Plot the residuals versus fitted values, faceted by institution.
- Based on the faceted plot, do the errors appear to be correlated within institutions? Briefly explain your response.
Click here for more detail, code, and examples of faceting in ggplot2.
You may need to change the size of the figure so that the faceted plot is fully visible. You can do so using the options #| fig-width and #| fig-height in the code chunk.
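The sample-then-facet workflow can be sketched as follows. This is an illustration on the built-in `mtcars` data, with `carb` as a hypothetical stand-in for the institution identifier and a sample of 3 groups rather than 20.

```r
library(dplyr)
library(ggplot2)
library(broom)

set.seed(221)  # make the random sample reproducible

# Stand-in model; augment() adds .fitted and .resid to the data
fit <- lm(mpg ~ wt, data = mtcars)
aug <- augment(fit, data = mtcars)

# Sample group labels (not rows), then keep every row in the sampled groups
sampled_groups <- sample(unique(aug$carb), size = 3)

p <- aug |>
  filter(carb %in% sampled_groups) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ carb) +
  labs(x = "Fitted values", y = "Residuals")

print(p)
```

Sampling the unique labels first ensures that all observations from a chosen group appear in its facet.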
Exercise 9
After an examination of all the residuals, we notice a few things:
- There seem to be two groups of observations.
- Institutions with larger fitted values correspond to lower variance in the residuals compared to the others.
- There are institutions with lower fitted values that have large negative residuals, potentially indicating outliers and/or influential points.
We are concerned that these observations may indicate some model misspecification (i.e., the model does not accurately reflect the trends in the data). Therefore, we look at the residuals in a different way. We plot the standardized residuals versus the fitted values, color the points based on Cook’s distance, and use shape to indicate whether the sport is basketball or football.
Describe what you observe from the plot and how your observations compare to the list above.
Do you think this model is an appropriate fit for the data or is the model misspecified? Briefly explain.
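A plot of the kind described in this exercise can be sketched with `augment()`, which returns standardized residuals (`.std.resid`) and Cook's distance (`.cooksd`) for `lm` fits. The example uses `mtcars` as a stand-in, with `am` playing the role of the 0/1 indicator; it is not the lab's actual figure.

```r
library(dplyr)
library(ggplot2)
library(broom)

# Stand-in model on built-in data; not the lab model
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shape requires a discrete variable, so convert the 0/1 indicator to a factor
aug <- augment(fit, data = mtcars) |>
  mutate(am = factor(am))

p <- ggplot(aug, aes(x = .fitted, y = .std.resid, color = .cooksd, shape = am)) +
  geom_point() +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  labs(x = "Fitted values", y = "Standardized residuals",
       color = "Cook's distance", shape = "Indicator")

print(p)
```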
Exercise 10
Based on this model (regardless of your answer to the previous exercise), which variables seem to be useful in explaining the variability in the revenue from collegiate sports?
Interpret the coefficient for one useful quantitative predictor in terms of the revenue (not log(revenue)).
Interpret the coefficient for one useful categorical predictor in terms of the revenue (not log(revenue)).
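When the response is log-transformed, coefficients translate to multiplicative changes on the original scale. The arithmetic below uses made-up coefficient values (both are hypothetical, not estimates from the lab model) to show the two cases: a log-transformed predictor and an indicator for a category.

```r
# Hypothetical coefficient values, for illustration only
b_log_x <- 0.8   # coefficient on a log-transformed predictor
b_group <- 0.25  # coefficient on an indicator for one category

# Log-log: multiplying x by 1.10 (a 10% increase) multiplies the
# predicted response by 1.10^b, holding other predictors constant
mult_10pct <- 1.10^b_log_x

# Indicator: the predicted response for that category is exp(b) times
# that of the baseline category, holding other predictors constant
mult_group <- exp(b_group)

print(c(ten_percent_increase = mult_10pct, category_vs_baseline = mult_group))
```

For a quantitative predictor that is not log-transformed, the same `exp(b)` factor applies to a one-unit increase in that predictor.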
Submission
You will submit the PDF documents for labs, homework, and exams to Gradescope as part of your final submission.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select all team members’ names, so they receive credit on the assignment. Click here for a video on adding team members to an assignment on Gradescope.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 5 |
Ex 2 | 6 |
Ex 3 | 6 |
Ex 4 | 3 |
Ex 5 | 4 |
Ex 6 | 4 |
Ex 7 | 4 |
Ex 8 | 5 |
Ex 9 | 4 |
Ex 10 | 6 |
Workflow & formatting | 3 |
The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member, a neatly organized document with readable code, and updating the team name and date in the YAML.