Sep 17, 2024
Lab 02 due on Thursday at 11:59pm
Push work to GitHub repo
Submit final PDF on Gradescope + select all team members + mark pages for each question
HW 01 due Thursday at 11:59pm
If you write your responses to Exercises 1 - 4 by hand, you will need to combine your written work with the completed PDF for Exercises 5 - 10 before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac):
Adobe (Mac or PC):
For this class you will need to be able to…
Properly write mathematical symbols, e.g., \(\beta_1\) not B1, \(R^2\) not R2
Write basic regression equations, e.g., \(\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2\)
Write matrix equations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
Write hypotheses (we’ll start this next week), e.g., \(H_0: \beta = 0\)
You are welcome, but not required, to write math proofs using LaTeX.
Compare models using Adjusted \(R^2\)
Introduce the ANOVA table
Use a geometric interpretation to find the least squares estimates
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
# A tibble: 50 × 4
annual_income_th debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59 0.558 Not Verified 10.9
2 60 1.31 Not Verified 9.92
3 75 1.06 Verified 26.3
4 75 0.574 Not Verified 9.92
5 254 0.238 Not Verified 9.43
6 67 1.08 Source Verified 9.92
7 28.8 0.0997 Source Verified 17.1
8 80 0.351 Not Verified 6.08
9 34 0.698 Not Verified 7.97
10 80 0.167 Source Verified 12.6
# ℹ 40 more rows
Predictors:
annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
Response:
interest_rate: Interest rate for the loan
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model
Though we use \(R^2\) to assess the model fit, it is generally unreliable for comparing models with different numbers of predictors. Why?
\(R^2\) will stay the same or increase as we add more variables to the model. Let’s show why this is true.
If we only use \(R^2\) to choose a best fit model, we will be prone to choose the model with the most predictor variables.
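One way to see why \(R^2\) cannot decrease, sketched under the assumption that both models are fit by least squares to the same \(n\) observations. Here \(SSR\) is the residual sum of squares and \(SST\) is the total sum of squares; the subscripts "small" and "big" are labels introduced for this illustration. The bigger model can always reproduce the smaller model's fit by setting the new coefficient to 0, so its minimized \(SSR\) cannot be larger, while \(SST\) depends only on \(\mathbf{y}\) and does not change.

\[
SSR_{\text{big}} \le SSR_{\text{small}} \quad \Rightarrow \quad R^2_{\text{big}} = 1 - \frac{SSR_{\text{big}}}{SST} \ge 1 - \frac{SSR_{\text{small}}}{SST} = R^2_{\text{small}}
\]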
\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]
\[R^2_{adj} = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]
\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model
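The penalty for extra terms can be seen in a small numeric sketch. It is written in plain Python only to keep it self-contained (the course itself works in R), and the sums of squares are made-up values chosen for illustration, not results from the loan50 data. The function names are hypothetical helpers.

```python
def r_squared(ssr, sst):
    """R^2 = 1 - SSR / SST."""
    return 1 - ssr / sst


def adjusted_r_squared(ssr, sst, n, p):
    """Adjusted R^2 = 1 - (SSR / (n - p - 1)) / (SST / (n - 1))."""
    return 1 - (ssr / (n - p - 1)) / (sst / (n - 1))


# Hypothetical sums of squares (made up for illustration, not from loan50).
n = 50
sst = 1000.0
ssr_small, p_small = 700.0, 2   # smaller model with 2 terms
ssr_big, p_big = 695.0, 4       # 2 extra terms, only slightly smaller SSR

r2_small = r_squared(ssr_small, sst)
r2_big = r_squared(ssr_big, sst)
adj_small = adjusted_r_squared(ssr_small, sst, n, p_small)
adj_big = adjusted_r_squared(ssr_big, sst, n, p_big)

print(f"R^2:          small = {r2_small:.3f}, big = {r2_big:.3f}")
print(f"Adjusted R^2: small = {adj_small:.3f}, big = {adj_big:.3f}")
```

With these made-up numbers, \(R^2\) rises slightly for the bigger model while Adjusted \(R^2\) falls, because the small drop in \(SSR\) does not offset the loss of two residual degrees of freedom.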
Which model would you select, int_fit (main effects only) or int_fit2 (main effects + interaction), based on…
Source | Sum of squares | DF | Mean square | F |
Model | \(\sum_{i=1}^n(\hat{y}_i - \bar{y})^2\) | \(p\) | \(SSM / p\) | \(MSM / MSR\) |
Residual | \(\sum_{i=1}^n(y_i - \hat{y}_i)^2\) | \(n - p - 1\) | \(SSR / (n - p - 1)\) | |
Total | \(\sum_{i=1}^n(y_i - \bar{y})^2\) | \(n - 1\) | | |
The degrees of freedom (df) are the number of independent pieces of information used to calculate a statistic.
Mean square (MS) is the sum of squares divided by the associated degrees of freedom.
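The entries of the ANOVA table can be computed by hand for a tiny example. The sketch below is plain Python (the course itself works in R), uses made-up data rather than loan50, and assumes a simple linear regression with one predictor, so \(p = 1\). The function name `anova_simple_regression` is a hypothetical helper, not part of any package.

```python
def anova_simple_regression(x, y):
    """ANOVA quantities for a least squares fit of y on one predictor x (p = 1)."""
    n, p = len(y), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least squares slope and intercept for simple linear regression.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]

    # Sums of squares, matching the rows of the ANOVA table.
    ssm = sum((fi - y_bar) ** 2 for fi in y_hat)           # Model
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # Residual
    sst = sum((yi - y_bar) ** 2 for yi in y)               # Total

    # Mean squares: sum of squares divided by its degrees of freedom.
    msm = ssm / p
    msr = ssr / (n - p - 1)
    return {
        "SSM": ssm, "SSR": ssr, "SST": sst,
        "df_model": p, "df_resid": n - p - 1, "df_total": n - 1,
        "MSM": msm, "MSR": msr, "F": msm / msr,
    }


# Made-up data for illustration (not the loan50 data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
table = anova_simple_regression(x, y)
for name, value in table.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

For this toy data the model and residual sums of squares add up to the total (3.6 + 2.4 = 6), and the degrees of freedom add up the same way (1 + 3 = 4).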
Adjusted \(R^2\) can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!
Use \(R^2\) when describing the relationship between the response and predictor variables
Let \(\text{Col}(\mathbf{X})\) be the column space of \(\mathbf{X}\): the set of all possible linear combinations (span) of the columns of \(\mathbf{X}\)
In general, the vector of responses \(\mathbf{y}\) is not in \(\text{Col}(\mathbf{X})\).
Goal: Find another vector \(\mathbf{z} = \mathbf{Xb}\) that is in \(\text{Col}(\mathbf{X})\) and is as close as possible to \(\mathbf{y}\).
For any \(\mathbf{z} = \mathbf{Xb}\) in \(\text{Col}(\mathbf{X})\), the vector \(\mathbf{e} = \mathbf{y} - \mathbf{Xb}\) is the difference between \(\mathbf{y}\) and \(\mathbf{Xb}\).
The length of \(\mathbf{e}\) is minimized by the \(\mathbf{b}\) (we’ll call it \(\hat{\boldsymbol{\beta}}\)) that makes \(\mathbf{e}\) orthogonal to \(\text{Col}(\mathbf{X})\)
Recall: If \(\mathbf{e}\) is orthogonal to \(\text{Col}(\mathbf{X})\), then the inner product of any vector in \(\text{Col}(\mathbf{X})\) and \(\mathbf{e}\) is 0 \(\Rightarrow \mathbf{X}^T\mathbf{e} = \mathbf{0}\)
\[ \mathbf{X}^T(\mathbf{y} - \mathbf{Xb}) = \mathbf{0} \]
Let’s solve for \(\mathbf{b}\) to get the least squares estimate.
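A sketch of the algebra, assuming the columns of \(\mathbf{X}\) are linearly independent so that \(\mathbf{X}^T\mathbf{X}\) is invertible:

\[
\begin{aligned}
\mathbf{X}^T(\mathbf{y} - \mathbf{Xb}) &= \mathbf{0} \\
\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\mathbf{b} &= \mathbf{0} \\
\mathbf{X}^T\mathbf{X}\mathbf{b} &= \mathbf{X}^T\mathbf{y} \\
\hat{\boldsymbol{\beta}} = \mathbf{b} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned}
\]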
Compared models using Adjusted \(R^2\)
Introduced the ANOVA table
Used a geometric interpretation to find the least squares estimates
Inference for regression
See Sep 19 prepare