Sep 19, 2024
Lab 02 due TODAY at 11:59pm
HW 01 due TODAY at 11:59pm
Statistics experience due Tue, Nov 26 at 11:59pm
Goal: Engage with statistics / data science outside the classroom and connect your experience with what you’re learning in the course.
What: Have a statistics experience + create a slide reflecting on the experience. Counts as a homework grade.
When: Must do the activity this semester. Reflection due Tuesday, November 26 at 11:59pm
For more info: sta221-fa24.netlify.app/hw/stats-experience
Lowest HW and lowest lab grade dropped at the end of the semester.
Understand statistical inference in the context of regression
Describe the assumptions for regression
Understand connection between distribution of residuals and inferential procedures
Conduct inference on a single coefficient
Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.
We will focus on the 2019-2020 season expenditures on football for institutions in the NCAA - Division 1 FBS. The variables are:
total_exp_m: Total expenditures on football in the 2019-2020 academic year (in millions USD)
enrollment_th: Total student enrollment in the 2019-2020 academic year (in thousands)
type: Institution type (Public or Private)
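library(broom)  # provides tidy()
library(knitr)  # provides kable()
# Fit the regression of football expenditures on enrollment and institution type
exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)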
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 19.332 | 2.984 | 6.478 | 0 |
enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
typePublic | -13.226 | 3.153 | -4.195 | 0 |
For every additional 1,000 students, we expect the institution’s total expenditures on football to increase by $780,000, on average, holding institution type constant.
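The interpretation of the enrollment coefficient can be checked with a little arithmetic. The sketch below copies the rounded estimates from the output table; the helper `predict_exp()` and the two enrollment values (20,000 and 21,000 students) are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the regression output (millions USD)
b_intercept <- 19.332
b_enrollment <- 0.780
b_public <- -13.226

# Predicted football expenditures (millions USD) for a given enrollment
# (in thousands) and institution type. Hypothetical helper for illustration.
predict_exp <- function(enrollment_th, is_public) {
  b_intercept + b_enrollment * enrollment_th + b_public * as.numeric(is_public)
}

# Two hypothetical public institutions that differ by 1,000 students
exp_20 <- predict_exp(20, TRUE)
exp_21 <- predict_exp(21, TRUE)

# The difference in predictions equals the enrollment coefficient, in USD
diff_usd <- (exp_21 - exp_20) * 1e6
diff_usd
```

Because institution type is the same for both predictions, its term cancels in the difference, which is what "holding institution type constant" means.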
Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from
For our inferences to be valid, the sample should be representative (ideally random) of the population we’re interested in
Inference based on ANOVA
Hypothesis test for the statistical significance of the overall regression model
Hypothesis test for a subset of coefficients
Inference for a single coefficient
Hypothesis test for a coefficient
Confidence interval for a coefficient
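We have discussed multiple ways to find the least squares estimates of the coefficient vector $\boldsymbol{\beta}$
Now we will use statistical inference to draw conclusions about $\boldsymbol{\beta}$ based on the estimates $\hat{\boldsymbol{\beta}}$
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})$$
such that the errors are independent and normally distributed.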
What else do we know about the distribution of the residuals based on this equation?
There is some uncertainty in the residuals (and the predicted responses), so we use mathematical models to describe that uncertainty.
Some terminology:
Sample space: Set of all possible outcomes
Random variable: Function (mapping) from the sample space onto real numbers
Event: Subset of the sample space, i.e., a set of possible outcomes (possible values the random variable can take)
Probability distribution function: Mathematical function that produces probability of occurrences for events in the sample space
Suppose we are tossing 2 fair coins with sides heads (H) and tails (T)
Sample space: {HH, HT, TH, TT}
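Random variable: $X$ = number of heads in the two tosses
Event: We flip two coins and get 1 head, i.e., $X = 1$
Probability distribution function: $P(X = 0) = 0.25$, $P(X = 1) = 0.5$, $P(X = 2) = 0.25$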
Now we can find
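The two-coin example is small enough to check by enumeration. This is a minimal sketch in base R, assuming the random variable of interest is the number of heads; the object names are illustrative.

```r
# Enumerate the sample space for two fair coin tosses: HH, TH, HT, TT
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)

# Random variable: number of heads in each outcome
tosses$n_heads <- (tosses$first == "H") + (tosses$second == "H")

# The coins are fair, so each of the 4 outcomes has probability 1/4.
# Summing outcome probabilities by number of heads gives the distribution.
pmf <- table(tosses$n_heads) / nrow(tosses)
pmf

# Probability of the event "exactly 1 head"
p_one_head <- as.numeric(pmf["1"])
p_one_head
```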
Image source: Introduction to the Practice of Statistics (5th ed)
Let
Then
Use this to find
Let
Then
Use this to find
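Once we fit the model, we can use the residuals to estimate the error variance $\sigma^2_{\epsilon}$: $\hat{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$, where $e_i$ is the $i$-th residual and $p$ is the number of predictors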
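As a rough illustration of estimating the error standard deviation from residuals, the sketch below uses simulated data (not the football data), with a true error standard deviation of 3 chosen arbitrarily. It compares the hand calculation, which divides the residual sum of squares by $n - p - 1$, with the value R reports through `sigma()`.

```r
set.seed(221)

# Simulated data: the true values below are assumptions for illustration only
n <- 200
x <- runif(n, 0, 50)
sigma_true <- 3
y <- 2 + 0.5 * x + rnorm(n, sd = sigma_true)

fit <- lm(y ~ x)

# Estimate sigma from the residuals: sqrt(SSR / (n - p - 1)), p = 1 predictor
p <- 1
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - p - 1))

# Compare with R's built-in residual standard error
sigma_hat
sigma(fit)
```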
We often want to conduct inference on individual model coefficients
Hypothesis test: Is there a linear relationship between the response and the predictor $x_j$?
Confidence interval: What is a plausible range of values for the coefficient $\beta_j$?
But first we need to understand the distribution of the estimated coefficient $\hat{\beta}_j$
A sampling distribution is the probability distribution of a statistic based on a large number of random samples of size $n$ from a population
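One way to build intuition for a sampling distribution is to simulate it. The sketch below repeatedly generates samples from a known simple linear model (all true values are assumptions chosen for illustration), refits the regression each time, and collects the estimated slope. The formula used for comparison, $\sigma / \sqrt{\sum (x_i - \bar{x})^2}$, is the standard result for the slope in simple linear regression.

```r
set.seed(221)

n <- 50          # size of each sample
n_sims <- 2000   # number of simulated samples
x <- runif(n, 0, 10)  # predictor values, held fixed across samples

# For each simulated sample, draw new errors, refit, and keep the slope
slopes <- replicate(n_sims, {
  y <- 1 + 2 * x + rnorm(n, sd = 3)  # true slope = 2, true sigma = 3
  coef(lm(y ~ x))[["x"]]
})

# Center and spread of the simulated sampling distribution
mean(slopes)
sd(slopes)

# Theoretical standard deviation of the slope estimate
theory_sd <- 3 / sqrt(sum((x - mean(x))^2))
theory_sd
```

A histogram of `slopes` (for example, `hist(slopes)`) should look roughly bell-shaped and centered near the true slope.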
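The sampling distribution of $\hat{\boldsymbol{\beta}}$ is a multivariate normal distribution when the errors are independent and normally distributed
The estimated coefficients $\hat{\boldsymbol{\beta}}$ are normally distributed with $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\mathbf{X}^T\mathbf{X})^{-1}$, where $\mathbf{X}$ is the design matrix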
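Let $\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}$, where $\mathbf{X}$ is the design matrix. Then each coefficient estimate has $E(\hat{\beta}_j) = \beta_j$ and $Var(\hat{\beta}_j) = \sigma^2_{\epsilon} C_{jj}$, where $C_{jj}$ is the diagonal element of $\mathbf{C}$ corresponding to $\hat{\beta}_j$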
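The standard errors that `lm()` reports can be reproduced from the design matrix. This sketch uses simulated data (the true coefficients are arbitrary assumptions) and computes $\sqrt{\hat{\sigma}^2_{\epsilon} \, [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}$ by hand for each coefficient.

```r
set.seed(221)

# Simulated data for illustration only
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Design matrix (includes the intercept column) and (X'X)^{-1}
X <- model.matrix(fit)
C <- solve(t(X) %*% X)

# Estimated error variance: residual sum of squares / residual df
sigma2_hat <- sum(resid(fit)^2) / (n - ncol(X))

# Standard error of each coefficient from the diagonal of C
se_manual <- sqrt(sigma2_hat * diag(C))

# Standard errors reported by lm()
se_lm <- summary(fit)$coefficients[, "Std. Error"]

se_manual
se_lm
```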
We will generally test the hypotheses: $H_0: \beta_j = 0$ vs. $H_a: \beta_j \neq 0$
State these hypotheses in words.
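Test statistic: Number of standard errors the estimate is away from the null value
If $\sigma_{\epsilon}$ were known, the test statistic $\frac{\hat{\beta}_j - 0}{SD(\hat{\beta}_j)}$ would follow a standard normal distribution
In general, $\sigma_{\epsilon}$ is not known, so we estimate it with $\hat{\sigma}_{\epsilon}$ and use $t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}$
The test statistic $t$ follows a $t$ distribution with $n - p - 1$ degrees of freedom, where $p$ is the number of predictors
We need to account for the additional variability introduced by calculating $SE(\hat{\beta}_j)$ from the estimate $\hat{\sigma}_{\epsilon}$ rather than the true $\sigma_{\epsilon}$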
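The test statistics in the regression output can be recovered, up to rounding, by dividing each estimate by its standard error. The values below are copied from the output table for `exp_fit`; since they are rounded to three digits, the result for `enrollment_th` differs slightly from the 7.074 shown in the table.

```r
# Estimates and standard errors copied from the (rounded) output table
estimate  <- c(intercept = 19.332, enrollment_th = 0.780, typePublic = -13.226)
std_error <- c(intercept = 2.984,  enrollment_th = 0.110, typePublic = 3.153)

# Null hypothesized value for each coefficient
null_value <- 0

# Test statistic: number of standard errors the estimate is from the null value
t_stat <- (estimate - null_value) / std_error
round(t_stat, 3)
```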
Figure 1: Standard normal vs. t distributions
The p-value is the probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) from the null value as the one observed, calculated from a $t$ distribution with $n - p - 1$ degrees of freedom, where $p$ is the number of predictors
Why do we take into account “extreme” on both the high and low ends?
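A two-sided p-value adds the tail areas on both ends of the $t$ distribution. The sketch below uses simulated data (the sample size is not restated here for the football data, so the true values are assumptions for illustration) and checks the hand calculation with `pt()` against the p-value in the `lm()` summary.

```r
set.seed(221)

# Simulated data for illustration only
n <- 60
x <- rnorm(n)
y <- 1 + 0.3 * x + rnorm(n)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients
t_stat <- coefs["x", "t value"]

# Residual degrees of freedom: n - p - 1
df <- df.residual(fit)

# Two-sided p-value: area beyond |t| in the upper tail, doubled by symmetry
p_manual <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# p-value reported by lm()
p_lm <- coefs["x", "Pr(>|t|)"]

p_manual
p_lm
```

Doubling the upper-tail area works because the $t$ distribution is symmetric around 0, so a large negative statistic is as extreme as a large positive one.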
Magnitude of p-value | Interpretation |
---|---|
p-value < 0.01 | strong evidence against $H_0$ |
0.01 < p-value < 0.05 | moderate evidence against $H_0$ |
0.05 < p-value < 0.1 | weak evidence against $H_0$ |
p-value > 0.1 | effectively no evidence against $H_0$ |
These are general guidelines. The strength of evidence depends on the context of the problem.
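For quick reference, the guideline table can be written as a small lookup function. This is only a sketch: the function name `evidence_strength()` is hypothetical, and the handling of values that fall exactly on a cutoff (assigned to the weaker category here) is an assumption, since the table does not specify it.

```r
# Map p-values to the rough evidence categories in the guideline table.
# Intervals are [0, 0.01), [0.01, 0.05), [0.05, 0.1), [0.1, 1].
evidence_strength <- function(p_value) {
  labels <- c("strong", "moderate", "weak", "effectively none")
  as.character(cut(p_value,
                   breaks = c(0, 0.01, 0.05, 0.1, 1),
                   labels = labels,
                   right = FALSE,
                   include.lowest = TRUE))
}

evidence_strength(c(0.001, 0.03, 0.07, 0.4))
```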
There are two parts to the conclusion
Make a conclusion by comparing the p-value to a predetermined decision-making threshold called the significance level ($\alpha$)
If p-value $< \alpha$: Reject $H_0$
If p-value $\geq \alpha$: Fail to reject $H_0$
State the conclusion in the context of the data
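The first part of the conclusion is a simple comparison. The helper below is a hypothetical sketch; the default significance level of 0.05 is a common convention used here as an assumption, not a value fixed by the course.

```r
# Compare a p-value to the significance level and return the decision.
# alpha = 0.05 is an assumed default; set it before looking at the data.
test_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "Reject H0"
  } else {
    "Fail to reject H0"
  }
}

test_decision(0.001)
test_decision(0.20)
```

The second part, stating the conclusion in the context of the data, still has to be written in words, for example in terms of enrollment and football expenditures.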
Introduced statistical inference in the context of regression
Described the assumptions for regression
Connected the distribution of residuals and inferential procedures
Conducted inference on a single coefficient