Sep 17, 2024
Lab 02 due on Thursday at 11:59pm
Push work to GitHub repo
Submit final PDF on Gradescope + select all team members + mark pages for each question
HW 01 due Thursday at 11:59pm
If you write your responses to Exercises 1 - 4 by hand, you will need to combine your written work with the completed PDF for Exercises 5 - 10 before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
For this class you will need to be able to…
Properly write mathematical symbols, e.g., \(\beta_1\) not B1, \(R^2\) not R2
Write basic regression equations, e.g., \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2\)
Write matrix equations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
Write hypotheses (we’ll start this next week), e.g., \(H_0: \beta = 0\)
You are welcome, but not required, to write math proofs using LaTeX.
Compare models using Adjusted \(R^2\)
Introduce the ANOVA table
Use a geometric interpretation to find the least squares estimates
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
# A tibble: 50 × 4
   annual_income_th debt_to_income verified_income interest_rate
              <dbl>          <dbl> <fct>                   <dbl>
 1             59           0.558  Not Verified            10.9
 2             60           1.31   Not Verified             9.92
 3             75           1.06   Verified                26.3
 4             75           0.574  Not Verified             9.92
 5            254           0.238  Not Verified             9.43
 6             67           1.08   Source Verified          9.92
 7             28.8         0.0997 Source Verified         17.1
 8             80           0.351  Not Verified             6.08
 9             34           0.698  Not Verified             7.97
10             80           0.167  Source Verified         12.6
# ℹ 40 more rows
Predictors:
annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)
Response:
interest_rate: Interest rate for the loan
Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model
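The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names and the four data points are made up, RMSE is assumed to divide by \(n\), and \(R^2\) is computed as \(1 - SSR/SST\).

```python
import math

def rmse(observed, predicted):
    """Root mean square error, assuming the mean is taken over all n observations."""
    n = len(observed)
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / n)

def r_squared(observed, predicted):
    """R^2 computed as 1 - SSR / SST."""
    y_bar = sum(observed) / len(observed)
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    sst = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ssr / sst

# Made-up values for illustration only (not the loan50 data)
observed = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 10.0, 14.0]

print("RMSE:", rmse(observed, predicted))
print("R^2: ", r_squared(observed, predicted))
```

RMSE is in the units of the response (here, percentage points of interest rate would be the analogue), while \(R^2\) is unitless.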
Though we use \(R^2\) to assess the model fit, it is generally unreliable for comparing models with different numbers of predictors. Why?
\(R^2\) will stay the same or increase as we add more variables to the model. Let’s show why this is true.
If we only use \(R^2\) to choose a best fit model, we will be prone to choose the model with the most predictor variables.
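One way to sketch the argument, assuming both models are fit by least squares to the same \(n\) observations and the larger model contains every predictor in the smaller model plus one more, \(x_{p+1}\): least squares picks the coefficients that minimize the residual sum of squares, and the larger model can always reproduce the smaller model's fit by setting \(b_{p+1} = 0\). So

\[
SSR_{\text{larger}} = \min_{b_0, \ldots, b_{p+1}} \sum_{i=1}^n \big(y_i - b_0 - b_1x_{i1} - \cdots - b_{p+1}x_{i,p+1}\big)^2 \leq SSR_{\text{smaller}}
\]

Since \(SST = \sum_{i=1}^n (y_i - \bar{y})^2\) does not depend on the model,

\[
R^2_{\text{larger}} = 1 - \frac{SSR_{\text{larger}}}{SST} \geq 1 - \frac{SSR_{\text{smaller}}}{SST} = R^2_{\text{smaller}}
\]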
\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]
\[R^2_{adj} = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]
where
\(n\) is the number of observations used to fit the model
\(p\) is the number of terms (not including the intercept) in the model
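A small sketch of how the penalty in \(R^2_{adj}\) can reverse the ranking given by \(R^2\). The sums of squares and term counts below are hypothetical numbers chosen for illustration; they are not taken from any model fit to loan50.

```python
def r_squared(ssr, sst):
    """R^2 = 1 - SSR / SST."""
    return 1 - ssr / sst

def adjusted_r_squared(ssr, sst, n, p):
    """Adjusted R^2 = 1 - [SSR / (n - p - 1)] / [SST / (n - 1)]."""
    return 1 - (ssr / (n - p - 1)) / (sst / (n - 1))

# Hypothetical values: the larger model has two extra terms
# and only a slightly smaller residual sum of squares.
n = 50
sst = 1000.0
smaller = {"ssr": 720.0, "p": 3}
larger = {"ssr": 715.0, "p": 5}

for name, model in [("smaller", smaller), ("larger", larger)]:
    r2 = r_squared(model["ssr"], sst)
    adj = adjusted_r_squared(model["ssr"], sst, n, model["p"])
    print(f"{name}: R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

With these numbers \(R^2\) is slightly higher for the larger model, while adjusted \(R^2\) is lower, because the small drop in SSR does not offset the two degrees of freedom spent on the extra terms.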
Which model would you select, int_fit (main effects only) or int_fit2 (main effects + interaction), based on…
\(R^2\)
Source | Sum of squares | DF | Mean square | F |
---|---|---|---|---|
Model | \(\sum_{i=1}^n(\hat{y}_i - \bar{y})^2\) | \(p\) | \(SSM / p\) | \(MSM / MSR\) |
Residual | \(\sum_{i=1}^n(y_i- \hat{y}_i)^2\) | \(n - p - 1\) | \(SSR / (n - p - 1)\) | |
Total | \(\sum_{i = 1}^n(y_i - \bar{y})^2\) | \(n - 1\) | | |
The degrees of freedom (df) are the number of independent pieces of information used to calculate a statistic.
Mean square (MS) is the sum of squares divided by the associated degrees of freedom.
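The table entries can be computed directly from the observed and fitted values. The sketch below is illustrative only: the function name anova_table and the five data points are made up, and the fitted values come from a simple linear regression (\(p = 1\)) so that \(SSM + SSR = SST\) holds.

```python
def anova_table(y, y_hat, p):
    """Build the ANOVA table entries from observed y, fitted y_hat, and p model terms."""
    n = len(y)
    y_bar = sum(y) / n
    ssm = sum((fit - y_bar) ** 2 for fit in y_hat)               # model sum of squares
    ssr = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))    # residual sum of squares
    sst = sum((obs - y_bar) ** 2 for obs in y)                   # total sum of squares
    msm = ssm / p
    msr = ssr / (n - p - 1)
    return {
        "Model": {"ss": ssm, "df": p, "ms": msm, "f": msm / msr},
        "Residual": {"ss": ssr, "df": n - p - 1, "ms": msr},
        "Total": {"ss": sst, "df": n - 1},
    }

# Made-up data for illustration only (not the loan50 data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least squares fit for one predictor: slope = Sxy / Sxx
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

table = anova_table(y, y_hat, p=1)
for source, row in table.items():
    print(source, row)
```

The degrees of freedom add up the same way the sums of squares do: \(p + (n - p - 1) = n - 1\).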
Adjusted \(R^2\) can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!
Use \(R^2\) when describing the relationship between the response and predictor variables
Let \(\text{Col}(\mathbf{X})\) be the column space of \(\mathbf{X}\): the set of all possible linear combinations (span) of the columns of \(\mathbf{X}\)
In general, the vector of responses \(\mathbf{y}\) is not in \(\text{Col}(\mathbf{X})\).
Goal: Find another vector \(\mathbf{z} = \mathbf{Xb}\) that is in \(\text{Col}(\mathbf{X})\) and is as close as possible to \(\mathbf{y}\).
For any \(\mathbf{z} = \mathbf{Xb}\) in \(\text{Col}(\mathbf{X})\), the vector \(\mathbf{e} = \mathbf{y} - \mathbf{Xb}\) is the difference between \(\mathbf{y}\) and \(\mathbf{Xb}\).
The length of \(\mathbf{e}\) is minimized by the \(\mathbf{b}\) (we’ll call it \(\hat{\boldsymbol{\beta}}\)) that makes \(\mathbf{e}\) orthogonal to \(\text{Col}(\mathbf{X})\)
Recall: If \(\mathbf{e}\) is orthogonal to \(\text{Col}(\mathbf{X})\), then the inner product of any vector in \(\text{Col}(\mathbf{X})\) and \(\mathbf{e}\) is 0 \(\Rightarrow \mathbf{X}^T\mathbf{e} = \mathbf{0}\)
\[ \mathbf{X}^T(\mathbf{y} - \mathbf{Xb}) = \mathbf{0} \]
Let’s solve for \(\mathbf{b}\) to get the least squares estimate.
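A sketch of the algebra, assuming \(\mathbf{X}^T\mathbf{X}\) is invertible (i.e., the columns of \(\mathbf{X}\) are linearly independent):

\[
\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{0}
\quad \Rightarrow \quad
\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}
\quad \Rightarrow \quad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\]

The same steps can be checked numerically. The Python below is a minimal illustration on made-up data (not loan50); the helper solve is a hypothetical small Gaussian elimination routine written for this example, used to solve \(\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}\) without forming the inverse.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Made-up design matrix (intercept column plus one predictor) and response
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

columns = [list(col) for col in zip(*X)]  # rows of X^T
xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]  # X^T X
xty = [sum(a * yi for a, yi in zip(ci, y)) for ci in columns]                     # X^T y

beta_hat = solve(xtx, xty)  # solves the normal equations X^T X b = X^T y

# Residual vector e = y - X beta_hat, then check that X^T e is (numerically) zero
e = [yi - sum(xij * bj for xij, bj in zip(row, beta_hat)) for row, yi in zip(X, y)]
xte = [sum(a * ei for a, ei in zip(ci, e)) for ci in columns]

print("beta_hat:", beta_hat)
print("X^T e:   ", xte)
```

The entries of \(\mathbf{X}^T\mathbf{e}\) come out as zero up to floating-point rounding, which is the orthogonality condition that defines the least squares estimate.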
Compared models using Adjusted \(R^2\)
Introduced the ANOVA table
Used a geometric interpretation to find the least squares estimates
Inference for regression
See Sep 19 prepare