Oct 22, 2024
Exam corrections (optional) due Thursday at 11:59pm on Canvas
Lab 04 due Thursday at 11:59pm
Team Feedback (from TEAMMATES) due Thursday at 11:59pm
Mid-semester survey (strongly encouraged!) by Thursday at 11:59pm
Looking ahead
Project: Exploratory data analysis due October 31
Statistics experience due Tuesday, November 26
STA 230, STA 231 or STA 240: Probability
STA 310: Generalized Linear Models
STA 323: Statistical Computing
STA 360: Bayesian Inference and Modern Statistical Methods
STA 432: Theory and Methods of Statistical Learning and Inference
Multicollinearity
Definition
How it impacts the model
How to detect it
What to do about it
# A tibble: 5 × 7
  volume hightemp avgtemp season cloudcover precip day_type
   <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>
1    501       83    66.5 Summer       7.60  0     Weekday
2    419       73    61   Summer       6.30  0.290 Weekday
3    397       74    63   Spring       7.5   0.320 Weekday
4    385       95    78   Summer       2.60  0     Weekend
5    200       44    48   Spring      10     0.140 Weekday
Source: Pioneer Valley Planning Commission via the mosaicData package.
Outcome:
volume
estimated number of trail users that day (number of breaks recorded)
Predictors:
hightemp
daily high temperature (in degrees Fahrenheit)
avgtemp
average of daily low and daily high temperature (in degrees Fahrenheit)
season
one of “Fall”, “Spring”, or “Summer”
precip
measure of precipitation (in inches)
We can create a pairwise plot matrix using the ggpairs function from the GGally R package.
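As a minimal sketch, this is how such a plot matrix can be built. The data below are simulated stand-ins (the real RailTrail data ships with the mosaicData package); the ggpairs call is shown alongside a dependency-free base R pairs() fallback:

```r
# Simulated stand-in for the RailTrail variables (hypothetical values;
# avgtemp is constructed to track hightemp closely, as in the real data)
set.seed(1)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)
volume   <- 10 + 4 * hightemp - 60 * precip + rnorm(n, 0, 40)
trail    <- data.frame(volume, hightemp, avgtemp, precip)

# With GGally, as on the slide:
# library(GGally)
# ggpairs(trail)

# Dependency-free sketch using base R:
pairs(trail)

# The correlations behind the plot: hightemp and avgtemp are highly correlated
round(cor(trail), 2)
```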
What might be a potential concern with a model that uses high temperature, average temperature, season, and precipitation to predict volume?
Ideally there is no linear relationship (dependence) between the predictors
Multicollinearity: there are near-linear dependencies between predictors
Dependencies that generally occur in the population
How the model is defined and the variables that are included
Sample comes from only a subspace of the region of predictors
There are more predictor variables than observations
\[ VIF_j = \frac{1}{1 - R^2_j} \]
where \(R^2_j\) is the proportion of variation in \(x_j\) that is explained by a linear combination of all the other predictors
Common practice uses the threshold \(VIF > 10\) as an indication of concerning multicollinearity
Variables with similar values of VIF are typically the ones correlated with each other
Use the vif() function in the rms R package to calculate VIF.
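To connect the formula to code, here is a hand computation of \(VIF_j\) from \(R^2_j\) on hypothetical simulated data (variable names follow the RailTrail example; in practice you would call vif() on the fitted model instead):

```r
# Hypothetical data where avgtemp is built to track hightemp closely
set.seed(2)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)

# R^2_j: regress the predictor x_j on all the other predictors
r2_hightemp  <- summary(lm(hightemp ~ avgtemp + precip))$r.squared
vif_hightemp <- 1 / (1 - r2_hightemp)
vif_hightemp  # far above the common threshold of 10
```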
Large variance in the estimated model coefficients: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}\), and near-linear dependence among the predictors makes \(\mathbf{X}^T\mathbf{X}\) nearly singular
Unreliable statistical inference results
The interpretation of a coefficient as the effect “holding all other variables constant” breaks down, since holding one of a set of correlated predictors constant while varying another is not possible in practice
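A small simulation (hypothetical data, names following the RailTrail example) makes the variance inflation concrete: the standard error of the hightemp coefficient grows sharply once the nearly collinear avgtemp enters the model:

```r
# Hypothetical data; avgtemp is nearly collinear with hightemp
set.seed(3)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
precip   <- rexp(n, 10)
volume   <- 10 + 4 * hightemp - 60 * precip + rnorm(n, 0, 40)

# Standard error of the hightemp coefficient in a fitted model
se_hightemp <- function(fit) summary(fit)$coefficients["hightemp", "Std. Error"]

se_without <- se_hightemp(lm(volume ~ hightemp + precip))
se_with    <- se_hightemp(lm(volume ~ hightemp + avgtemp + precip))
c(without_avgtemp = se_without, with_avgtemp = se_with)
# adding avgtemp inflates the standard error by roughly sqrt(VIF)
```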
Selected groups - put responses on your Google slide.
Collect more data (often not feasible given practical constraints)
Redefine the correlated predictors to keep the information from predictors but eliminate collinearity
For categorical predictors, avoid using levels with very few observations as the baseline
Remove one of the correlated variables
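As an illustration of redefining correlated predictors (hypothetical data again): since avgtemp is the mean of the daily low and high, replacing it with the spread hightemp - avgtemp keeps the same information jointly with hightemp while removing the collinearity:

```r
# Hypothetical data; avgtemp is nearly collinear with hightemp
set.seed(4)
n <- 90
hightemp <- runif(n, 40, 95)
avgtemp  <- hightemp - runif(n, 5, 15)
spread   <- hightemp - avgtemp  # same information, recoded

cor(hightemp, avgtemp)  # near 1: strong collinearity
cor(hightemp, spread)   # near 0: collinearity eliminated
```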
Selected groups - put responses on your Google slide.
Introduced multicollinearity
Definition
How it impacts the model
How to detect it
What to do about it