Sep 26, 2024
Project
Research questions due TODAY
Proposal due Thursday, October 3 at 11:59pm
Lab 03 due Thursday, October 3 at 11:59pm
HW 02 due Thursday, October 3 at 11:59pm (released after class)
Statistics experience due Tue, Nov 26 at 11:59pm
Compute and interpret confidence interval for a single coefficient
Properties of $\hat{\boldsymbol{\beta}}$
Define “linear” model
Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.
We will focus on the 2019-2020 season expenditures on football for institutions in the NCAA Division 1 FBS. The variables are:
total_exp_m: Total expenditures on football in the 2019-2020 academic year (in millions USD)
enrollment_th: Total student enrollment in the 2019-2020 academic year (in thousands)
type: Institution type (Public or Private)
We often want to conduct inference on individual model coefficients
Hypothesis test: Is there a linear relationship between the response and $x_j$?
Confidence interval: What is a plausible range of values $\beta_j$ can take?
A plausible range of values for a population parameter is called a confidence interval
Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net
We can throw a spear where we saw a fish, but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish
Similarly, if we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter
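We will construct $C\%$ confidence intervals for $\beta_j$
“Confidence” means if we were to take repeated samples of the same size as our data, fit regression lines using the same predictors, and calculate $C\%$ confidence intervals for the coefficient of $x_j$, then $C\%$ of those intervals would contain the true value of the coefficient $\beta_j$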
Need to balance precision and accuracy when selecting a confidence level
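$\hat{\beta}_j \pm t^* \times SE(\hat{\beta}_j)$, where $t^*$ is calculated from a $t$ distribution with $n - p - 1$ degrees of freedom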
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 19.332 | 2.984 | 6.478 | 0 |
enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
typePublic | -13.226 | 3.153 | -4.195 | 0 |
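As a rough illustration, an interval of the form estimate ± critical value × standard error can be computed for enrollment_th from the estimate and std.error columns in the table. The sample size is not stated in these notes, so this sketch swaps in a standard normal critical value for the $t$ critical value. That is a large-sample approximation, not the exact interval, and the function name `approx_ci` is hypothetical.

```python
from statistics import NormalDist


def approx_ci(estimate, std_error, level=0.95):
    """Approximate confidence interval: estimate +/- z* x SE.

    Assumption: z* (standard normal) stands in for t*, which is only
    reasonable when the residual degrees of freedom are large.
    """
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * std_error
    return estimate - margin, estimate + margin


# Estimate and standard error for enrollment_th, taken from the table
lower, upper = approx_ci(estimate=0.780, std_error=0.110)
print(f"Approximate 95% CI for enrollment_th: ({lower:.3f}, {upper:.3f})")
```

Under this approximation the interval runs from roughly 0.56 to 1.00. With the exact $t$ critical value the interval would be slightly wider.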
We have discussed how to use least squares to find an estimator of $\boldsymbol{\beta}$
How do we know whether our least squares estimator is a “good” estimator?
When we consider what makes an estimator “good”, we’ll look at three criteria:
We’ll take a look at these over the course of a few lectures and motivate why we might prefer using least squares to compute $\hat{\boldsymbol{\beta}}$
Suppose you are throwing darts at a target
Unbiased: Darts distributed around the target
Biased: Darts systematically away from the target
Variance: Darts could be widely spread (high variance) or generally clustered together (low variance)
Ideal scenario: Darts are clustered around the target (unbiased and low variance)
Worst case scenario: Darts are widely spread out and systematically far from the target (high bias and high variance)
Acceptable scenario: There’s some trade-off between the bias and variance. For example, it may be acceptable for the darts to be clustered around a point that is close to the target (low bias and low variance)
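Each time we take a sample of size $n$, we can compute the least squares estimator $\hat{\boldsymbol{\beta}}$ (one dart thrown at the target)
Suppose we take many independent samples of size $n$ and compute the least squares estimator for each sample (many darts thrown at the target). Ideally: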
The estimators are centered at the true parameter (unbiased)
The estimators are clustered around the true parameter (unbiased with low variance)
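The repeated-sampling idea can be sketched with a small simulation. Everything here is made up for illustration: the true intercept and slope, the error standard deviation, the sample size, and the function names. It uses a single predictor so the least squares slope has a simple closed form.

```python
import random
import statistics


def least_squares_slope(x, y):
    """Least squares slope for a simple linear regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n_samples, n, beta0, beta1, sigma, seed=221):
    """Draw many samples of size n from a known model; return each slope estimate."""
    rng = random.Random(seed)
    x = [10 * i / (n - 1) for i in range(n)]  # predictor values held fixed
    slopes = []
    for _ in range(n_samples):
        y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
        slopes.append(least_squares_slope(x, y))
    return slopes


# Hypothetical true parameters: intercept 1, slope 2, error SD 1
slopes = simulate_slopes(n_samples=2000, n=50, beta0=1.0, beta1=2.0, sigma=1.0)
print("mean of slope estimates:", statistics.fmean(slopes))
print("variance of slope estimates:", statistics.variance(slopes))
```

If the estimator is unbiased, the mean of the simulated slopes should land close to the true slope of 2, and the variance describes how tightly the "darts" cluster.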
Let’s take a look at the mean and variance of the least squares estimator
The bias of an estimator is the difference between the estimator’s expected value and the true value of the parameter
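Let $\hat{\theta}$ be an estimator of the parameter $\theta$. Then $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
An estimator is unbiased if the bias is 0 and thus $E(\hat{\theta}) = \theta$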
Let
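Let’s take a look at the expected value of the least squares estimator. Given $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, we can find $E(\hat{\boldsymbol{\beta}})$
The least squares estimator $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$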
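One way to see the unbiasedness is the sketch below. It assumes the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with $E(\boldsymbol{\epsilon}) = \mathbf{0}$, treats $\mathbf{X}$ as fixed with $\mathbf{X}^T\mathbf{X}$ invertible, and uses the fact that $E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y})$ for a matrix of constants $\mathbf{A}$. These assumptions and the matrix notation are not spelled out in this section.

$$
\begin{aligned}
E(\hat{\boldsymbol{\beta}}) &= E\big[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\big] \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{y}) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\
&= \boldsymbol{\beta}
\end{aligned}
$$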
Now let’s take a look at the variance
We will show that $Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}$
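A sketch of the standard variance result for the least squares estimator, $Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}$, is given below. It assumes $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, $\mathbf{X}$ fixed, and $Var(\mathbf{y}) = Var(\boldsymbol{\epsilon}) = \sigma^2_{\epsilon}\mathbf{I}$, and it uses $Var(\mathbf{A}\mathbf{y}) = \mathbf{A}\,Var(\mathbf{y})\,\mathbf{A}^T$ for a matrix of constants $\mathbf{A}$. None of these are stated explicitly in this section.

$$
\begin{aligned}
Var(\hat{\boldsymbol{\beta}}) &= Var\big[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\big] \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \, Var(\mathbf{y}) \, \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}
\end{aligned}
$$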
What does it mean for a model to be a “linear” regression model?
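Linear regression models are linear in the parameters, i.e. given an observation $y_i$, $y_i = \beta_0 + \beta_1 f_1(x_{i1}) + \dots + \beta_p f_p(x_{ip}) + \epsilon_i$
The functions $f_1, \ldots, f_p$ can be non-linear as long as the model is linear in the parameters $\beta_0, \ldots, \beta_p$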
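To make "linear in the parameters" concrete, here is a minimal sketch with one predictor and a non-linear transformation $f(x) = x^2$. The data are made up and noise-free, and `fit_line` is a hypothetical helper. The point is that ordinary least squares applies unchanged once the predictor is transformed.

```python
import statistics


def fit_line(f_x, y):
    """Least squares intercept and slope for y regressed on one transformed predictor."""
    f_bar = statistics.fmean(f_x)
    y_bar = statistics.fmean(y)
    slope = (
        sum((fi - f_bar) * (yi - y_bar) for fi, yi in zip(f_x, y))
        / sum((fi - f_bar) ** 2 for fi in f_x)
    )
    intercept = y_bar - slope * f_bar
    return intercept, slope


# Made-up data generated from y = 3 + 0.5 * x^2 (curved in x, linear in the betas)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 + 0.5 * xi**2 for xi in x]

# Transform the predictor, then fit by ordinary least squares
f_x = [xi**2 for xi in x]
b0, b1 = fit_line(f_x, y)
print("intercept:", b0, "slope on x^2:", b1)
```

The fitted relationship is a curve in $x$, yet it is still a linear regression model because the response is a linear combination of the coefficients.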
Computed and interpreted confidence interval for a single coefficient
Showed some properties of $\hat{\boldsymbol{\beta}}$
Defined “linear” model