Oct 22, 2024
Exam corrections (optional) due Thursday at 11:59pm on Canvas
Lab 04 due Thursday at 11:59pm
Team Feedback (from TEAMMATES) due Thursday at 11:59pm
Mid-semester survey (strongly encouraged!) by Thursday at 11:59pm
Looking ahead
Project: Exploratory data analysis due October 31
Statistics experience due Tuesday, November 26
STA 230, STA 231 or STA 240: Probability
STA 310: Generalized Linear Models
STA 323: Statistical Computing
STA 360: Bayesian Inference and Modern Statistical Methods
STA 432: Theory and Methods of Statistical Learning and Inference
Multicollinearity
Definition
How it impacts the model
How to detect it
What to do about it
# A tibble: 5 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>
1    501       83    66.5 Summer       7.60  0     Weekday
2    419       73    61   Summer       6.30  0.290 Weekday
3    397       74    63   Spring       7.5   0.320 Weekday
4    385       95    78   Summer       2.60  0     Weekend
5    200       44    48   Spring      10     0.140 Weekday
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:
volume
estimated number of trail users that day (number of breaks recorded)
Predictors:
hightemp
daily high temperature (in degrees Fahrenheit)
avgtemp
average of daily low and daily high temperature (in degrees Fahrenheit)
season
one of “Fall”, “Spring”, or “Summer”
precip
measure of precipitation (in inches)
We can create a pairwise plot matrix using the ggpairs function from the GGally R package.
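As a minimal sketch, this is how such a plot matrix can be built. The data below are simulated stand-ins (the real RailTrail data ships with the mosaicData package); the ggpairs call is shown alongside a dependency-free base R pairs() fallback:

```r
# Simulated stand-in for the RailTrail variables (hypothetical values;
# avgtemp is constructed to track hightemp closely, as in the real data)
set.seed(1)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)
volume   <- 10 + 4 * hightemp - 60 * precip + rnorm(n, 0, 40)
trail    <- data.frame(volume, hightemp, avgtemp, precip)

# With GGally, as on the slide:
# library(GGally)
# ggpairs(trail)

# Dependency-free sketch using base R:
pairs(trail)

# The correlations behind the plot: hightemp and avgtemp are highly correlated
round(cor(trail), 2)
```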
What might be a potential concern with a model that uses high temperature, average temperature, season, and precipitation to predict volume?
Ideally there is no linear relationship (dependence) between the predictors
Multicollinearity: there are near-linear dependencies between predictors
Dependencies that generally occur in the population
How the model is defined and the variables that are included
Sample comes from only a subspace of the region of predictors
There are more predictor variables than observations
\[ VIF_j = \frac{1}{1 - R^2_j} \]
where \(R^2_j\) is the proportion of variation in \(x_j\) that is explained by a linear combination of all the other predictors
Common practice uses the threshold \(VIF > 10\) as an indication of concerning multicollinearity
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF.
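To connect the formula to code, here is a hand computation of \(VIF_j\) from \(R^2_j\) on hypothetical simulated data (variable names follow the RailTrail example; in practice you would call vif() on the fitted model instead):

```r
# Hypothetical data where avgtemp is built to track hightemp closely
set.seed(2)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)

# R^2_j: regress the predictor x_j on all the other predictors
r2_hightemp  <- summary(lm(hightemp ~ avgtemp + precip))$r.squared
vif_hightemp <- 1 / (1 - r2_hightemp)
vif_hightemp  # far above the common threshold of 10
```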
Large variance in the estimated model coefficients: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}\), and near-linear dependence among the predictors makes \(\mathbf{X}^T\mathbf{X}\) nearly singular
Unreliable statistical inference results
The interpretation of a coefficient as the effect “holding all other variables constant” breaks down, since holding one of a set of correlated predictors constant while varying another is not possible in practice
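A small simulation (hypothetical data, names following the RailTrail example) makes the variance inflation concrete: the standard error of the hightemp coefficient grows sharply once the nearly collinear avgtemp enters the model:

```r
# Hypothetical data; avgtemp is nearly collinear with hightemp
set.seed(3)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)
volume   <- 10 + 4 * hightemp - 60 * precip + rnorm(n, 0, 40)

# Standard error of the hightemp coefficient in a fitted model
se_hightemp <- function(fit) summary(fit)$coefficients["hightemp", "Std. Error"]

se_without <- se_hightemp(lm(volume ~ hightemp + precip))
se_with    <- se_hightemp(lm(volume ~ hightemp + avgtemp + precip))
c(without_avgtemp = se_without, with_avgtemp = se_with)
# adding avgtemp inflates the standard error by roughly sqrt(VIF)
```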
Selected groups - put responses on your Google slide.
Collect more data (often not feasible given practical constraints)
Redefine the correlated predictors to keep the information from predictors but eliminate collinearity
For categorical predictors, avoid using levels with very few observations as the baseline
Remove one of the correlated variables
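As an illustration of redefining correlated predictors (hypothetical data again): since avgtemp is the mean of the daily low and high, replacing it with the spread hightemp - avgtemp keeps the same information jointly with hightemp while removing the collinearity:

```r
# Hypothetical data; avgtemp is nearly collinear with hightemp
set.seed(4)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
spread   <- hightemp - avgtemp  # same information, recoded

cor(hightemp, avgtemp)  # near 1: strong collinearity
cor(hightemp, spread)   # near 0: collinearity eliminated
```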
Selected groups - put responses on your Google slide.
Introduced multicollinearity
Definition
How it impacts the model
How to detect it
What to do about it