ANOVA + Geometric interpretation

Prof. Maria Tackett

Sep 17, 2024

Announcements

Lab 02 due on Thursday at 11:59pm
- Push work to GitHub repo
- Submit final PDF on Gradescope + select all team members + mark pages for each question
HW 01 due Thursday at 11:59pm
- Note submission instructions

Homework submission

If you write your responses to Exercises 1 - 4 by hand, you will need to combine your written work to the completed PDF for Exercises 5 - 10 before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

Latex in this class

For this class you will need to be able to…

Properly write mathematical symbols, e.g., $\beta_1$ not B1, $R^2$ not R2
Write basic regression equations, e.g., $\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2$
Write matrix equations: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
Write hypotheses (we’ll start this next week), e.g., $H_0: \beta = 0$

You are welcome to but not required to write math proofs using LaTex.

Topics

Compare models using Adjusted $R^2$
Introduce the ANOVA table
Use a geometric interpretation to find the least squares estimates

Computing setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(viridis) #adjust color palette

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.

# A tibble: 50 × 4
   annual_income_th debt_to_income verified_income interest_rate
              <dbl>          <dbl> <fct>                   <dbl>
 1             59           0.558  Not Verified            10.9 
 2             60           1.31   Not Verified             9.92
 3             75           1.06   Verified                26.3 
 4             75           0.574  Not Verified             9.92
 5            254           0.238  Not Verified             9.43
 6             67           1.08   Source Verified          9.92
 7             28.8         0.0997 Source Verified         17.1 
 8             80           0.351  Not Verified             6.08
 9             34           0.698  Not Verified             7.97
10             80           0.167  Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

annual_income_th: Annual income (in $1000s)
debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response: interest_rate: Interest rate for the loan

Model fit in R

int_fit <- lm(interest_rate ~ debt_to_income + verified_income  + annual_income_th, data = loan50)

int_fit2 <- lm(interest_rate ~ debt_to_income + verified_income  + annual_income_th + verified_income * annual_income_th, data = loan50)

Model assessment and comparison

RMSE & $R^2$

Root mean square error, RMSE: A measure of the average error (average difference between observed and predicted values of the outcome)
R-squared, $R^2$ : Percentage of variability in the outcome explained by the regression model

Comparing models

Though we use $R^2$ to assess the model fit, it is generally unreliable for comparing models with different number of predictors. Why?
- $R^2$ will stay the same or increase as we add more variables to the model . Let’s show why this is true.
- If we only use $R^2$ to choose a best fit model, we will be prone to choose the model with the most predictor variables.

Adjusted $R^2$

Adjusted $R^2$: measure that includes a penalty for unnecessary predictor variables
Similar to $R^2$, it is a measure of the amount of variation in the response that is explained by the regression model

$R^2$ and Adjusted $R^2$

\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]

\[R^2_{adj} = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]

where

$n$ is the number of observations used to fit the model
$p$ is the number of terms (not including the intercept) in the model

Compare models

Which model would you select int_fit (main effects only) or int_fit2 (main effects + interaction) based on…

$R^2$

glance(int_fit)$r.squared

[1] 0.279854

glance(int_fit2)$r.squared

[1] 0.2963437

$Adj. R^2$

glance(int_fit)$adj.r.squared

[1] 0.215841

glance(int_fit2)$adj.r.squared

[1] 0.1981591

ANOVA table

Source	Sum of squares	DF	Mean square	F
Model	$\sum_{i=1}^n(\hat{y}_i - \bar{y})^2$	$p$	$SSM / p$	$MSM / MSR$
Residual	$\sum_{i=1}^n(y_i- \hat{y}_i)^2$	$n - p - 1$	$SSR / (n - p - 1)$
Total	$\sum_{i = 1}^n(y_i - \bar{y})^2$	$n - 1$

The degrees of freedom (df) are the number of independent pieces of information used to calculate a statistic.
Mean square (MS) is the sum of squares divided by the associated degrees of freedom.

Using $R^2$ and Adjusted $R^2$

Adjusted $R^2$ can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!
Use $R^2$ when describing the relationship between the response and predictor variables

Geometric interpretation

Geometry of least squares regression

Let $\text{Col}(\mathbf{X})$ be the column space of $\mathbf{X}$: the set all possible linear combinations (span) of the columns of $\mathbf{X}$
The vector of responses $\mathbf{y}$ is not in $\text{Col}(\mathbf{X})$.
Goal: Find another vector $\mathbf{z} = \mathbf{Xb}$ that is in $\text{Col}(\mathbf{X})$ and is as close as possible to $\mathbf{y}$.
- $\mathbf{z}$ is called a projection of $\mathbf{y}$ onto $\text{Col}(\mathbf{X})$ .

Geometry of least squares regression

For any $\mathbf{z} = \mathbf{Xb}$ in $\text{Col}(\mathbf{X})$, the vector $\mathbf{e} = \mathbf{y} - \mathbf{Xb}$ is the difference between $\mathbf{y}$ and $\mathbf{Xb}$.
- In other words, we want to minimize $||\mathbf{e}||^2 = ||\mathbf{y} - \mathbf{Xb}||^2$
This is minimized for the $\mathbf{b}$ ( we’ll call it $\hat{\boldsymbol{\beta}}$ ) that makes $\mathbf{e}$ orthogonal to $\text{Col}(\mathbf{X})$
Recall: If $\mathbf{e}$ is orthogonal to $\text{Col}(\mathbf{X})$, then the inner product of any vector in $\text{Col}(\mathbf{X})$ and $\mathbf{e}$ is 0 $\Rightarrow \mathbf{X}^T\mathbf{e} = \mathbf{0}$

Geometry of least squares regression

Therefore, we have

\[ \mathbf{X}^T(\mathbf{y} - \mathbf{Xb}) = \mathbf{0} \]

Let’s solve for $\mathbf{b}$ to get the least squares estimate.

Recap

Compared models using Adjusted $R^2$
Introduced the ANOVA table
Used a geometric interpretation to find the least squares estimates

Next class

Inference for regression
See Sep 19 prepare

Source	Sum of squares	DF	Mean square	F
Model	\(\sum_{i=1}^n(\hat{y}_i - \bar{y})^2\)	\(p\)	\(SSM / p\)	\(MSM / MSR\)
Residual	\(\sum_{i=1}^n(y_i- \hat{y}_i)^2\)	\(n - p - 1\)	\(SSR / (n - p - 1)\)
Total	\(\sum_{i = 1}^n(y_i - \bar{y})^2\)	\(n - 1\)

ANOVA + Geometric interpretation

Announcements

Homework submission

Latex in this class

Topics

Computing setup

Data: Peer-to-peer lender

Variables

Model fit in R

Model assessment and comparison

RMSE & \(R^2\)

Comparing models

Adjusted \(R^2\)

\(R^2\) and Adjusted \(R^2\)

Compare models

ANOVA table

Using \(R^2\) and Adjusted \(R^2\)

Geometric interpretation

Geometry of least squares regression

Geometry of least squares regression

Geometry of least squares regression

Recap

Next class