ANOVA + Geometric interpretation

Prof. Maria Tackett

Sep 17, 2024

Announcements

  • Lab 02 due on Thursday at 11:59pm

    • Push work to GitHub repo

    • Submit final PDF on Gradescope + select all team members + mark pages for each question

  • HW 01 due Thursday at 11:59pm

    • Note submission instructions

Homework submission

If you write your responses to Exercises 1 - 4 by hand, you will need to combine your written work with the completed PDF for Exercises 5 - 10 into a single PDF before submitting on Gradescope.

Instructions to combine PDFs:

LaTeX in this class

For this class you will need to be able to…

  • Properly write mathematical symbols, e.g., \(\beta_1\) not B1, \(R^2\) not R2

  • Write basic regression equations, e.g., \(\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2\)

  • Write matrix equations: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)

  • Write hypotheses (we’ll start this next week), e.g., \(H_0: \beta = 0\)

You are welcome to, but not required to, write math proofs using LaTeX.
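For reference, here is LaTeX source matching the examples above (a snippet; whether you paste it between dollar signs in a Quarto, R Markdown, or plain LaTeX document is up to your own setup, which the slides don't specify):

% LaTeX source for the sample equations above
% (wrap in $...$ for inline math or $$...$$ for display math)
$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2$
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
$H_0: \beta = 0$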

Topics

  • Compare models using Adjusted \(R^2\)

  • Introduce the ANOVA table

  • Use a geometric interpretation to find the least squares estimates

Computing setup

# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(viridis)     # adjust color palette

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
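The slides don't show the preparation step, but annual_income_th appears to be annual_income rescaled to $1000s (see the variable list below); a minimal sketch of code that would reproduce the view printed here:

# create annual income in $1000s and keep today's four variables
loan50 <- loan50 |>
  mutate(annual_income_th = annual_income / 1000)

loan50 |>
  select(annual_income_th, debt_to_income, verified_income, interest_rate)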

# A tibble: 50 × 4
   annual_income_th debt_to_income verified_income interest_rate
              <dbl>          <dbl> <fct>                   <dbl>
 1             59           0.558  Not Verified            10.9 
 2             60           1.31   Not Verified             9.92
 3             75           1.06   Verified                26.3 
 4             75           0.574  Not Verified             9.92
 5            254           0.238  Not Verified             9.43
 6             67           1.08   Source Verified          9.92
 7             28.8         0.0997 Source Verified         17.1 
 8             80           0.351  Not Verified             6.08
 9             34           0.698  Not Verified             7.97
10             80           0.167  Source Verified         12.6 
# ℹ 40 more rows

Variables

Predictors:

  • annual_income_th: Annual income (in $1000s)
  • debt_to_income: Debt-to-income ratio, i.e., a borrower’s total debt divided by their total income
  • verified_income: Whether borrower’s income source and amount have been verified (Not Verified, Source Verified, Verified)

Response: interest_rate: Interest rate for the loan

Model fit in R

# main effects only
int_fit <- lm(interest_rate ~ debt_to_income + verified_income + annual_income_th,
              data = loan50)

# main effects + interaction (verified_income * annual_income_th expands to both)
int_fit2 <- lm(interest_rate ~ debt_to_income + verified_income * annual_income_th,
               data = loan50)

Model assessment and comparison

RMSE & \(R^2\)

  • Root mean square error, RMSE: A measure of typical prediction error, the square root of the average squared difference between observed and predicted values of the outcome

  • R-squared, \(R^2\) : Percentage of variability in the outcome explained by the regression model
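Both are easy to obtain for a fitted model in R (a minimal sketch using int_fit, the main-effects model defined earlier; glance() is from broom, loaded via tidymodels):

# R-squared: proportion of variability in interest_rate explained by the model
glance(int_fit)$r.squared

# RMSE: square root of the mean squared residual
sqrt(mean(residuals(int_fit)^2))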

Comparing models

  • Though we use \(R^2\) to assess the model fit, it is generally unreliable for comparing models with different numbers of predictors. Why?

    • \(R^2\) will stay the same or increase as we add more variables to the model. Let’s show why this is true (see the sketch after this list).

    • If we only use \(R^2\) to choose a best fit model, we will be prone to choose the model with the most predictor variables.
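One way to see this empirically (a sketch, not from the slides; the noise variable and the seed are invented for illustration): add a predictor of pure random noise, and \(R^2\) cannot decrease, while adjusted \(R^2\) (introduced next) typically drops.

# add a pure-noise predictor to the main-effects model
set.seed(1234)
loan50_noise <- loan50 |> mutate(noise = rnorm(n()))
int_fit_noise <- lm(interest_rate ~ debt_to_income + verified_income +
                      annual_income_th + noise, data = loan50_noise)

glance(int_fit_noise)$r.squared      # >= glance(int_fit)$r.squared
glance(int_fit_noise)$adj.r.squared  # typically < glance(int_fit)$adj.r.squared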

Adjusted \(R^2\)

  • Adjusted \(R^2\): a measure that includes a penalty for unnecessary predictor variables
  • Similar to \(R^2\), it is a measure of the amount of variation in the response that is explained by the regression model

\(R^2\) and Adjusted \(R^2\)

\[R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}\]


\[R^2_{adj} = 1 - \frac{SSR/(n-p-1)}{SST/(n-1)}\]

where

  • \(n\) is the number of observations used to fit the model

  • \(p\) is the number of terms (not including the intercept) in the model
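To connect the formula to software output, adjusted \(R^2\) can be computed by hand (a minimal sketch for int_fit, where \(n = 50\) and \(p = 4\): debt_to_income, annual_income_th, and two indicator terms for verified_income):

# adjusted R^2 from the formula above, for the main-effects model
n <- nobs(int_fit)                # number of observations (50)
p <- length(coef(int_fit)) - 1    # number of terms, excluding the intercept (4)
ssr <- sum(residuals(int_fit)^2)  # SSR: residual sum of squares
sst <- sum((loan50$interest_rate - mean(loan50$interest_rate))^2)  # SST

1 - (ssr / (n - p - 1)) / (sst / (n - 1))  # matches glance(int_fit)$adj.r.squared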

Compare models

Which model would you select, int_fit (main effects only) or int_fit2 (main effects + interaction), based on…

\(R^2\)

glance(int_fit)$r.squared
[1] 0.279854
glance(int_fit2)$r.squared
[1] 0.2963437

Adjusted \(R^2\)

glance(int_fit)$adj.r.squared
[1] 0.215841
glance(int_fit2)$adj.r.squared
[1] 0.1981591
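Note the tension: int_fit2 has the higher \(R^2\) (0.296 vs. 0.280), but int_fit has the higher adjusted \(R^2\) (0.216 vs. 0.198). The penalty for the two extra interaction terms outweighs the small gain in explained variation, so adjusted \(R^2\) favors the main-effects model.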

ANOVA table

| Source   | Sum of squares                                | DF            | Mean square                 | F             |
|----------|-----------------------------------------------|---------------|-----------------------------|---------------|
| Model    | \(SSM = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2\) | \(p\)         | \(MSM = SSM / p\)           | \(MSM / MSR\) |
| Residual | \(SSR = \sum_{i=1}^n(y_i - \hat{y}_i)^2\)     | \(n - p - 1\) | \(MSR = SSR / (n - p - 1)\) |               |
| Total    | \(SST = \sum_{i=1}^n(y_i - \bar{y})^2\)       | \(n - 1\)     |                             |               |


  • The degrees of freedom (df) are the number of independent pieces of information used to calculate a statistic.

  • Mean square (MS) is the sum of squares divided by the associated degrees of freedom.
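The decomposition in this table can be reproduced in R by comparing an intercept-only model to the fitted model (a sketch; int_null is a name introduced here, not from the slides):

# overall ANOVA: intercept-only model vs. main-effects model
int_null <- lm(interest_rate ~ 1, data = loan50)  # its RSS is SST, with n - 1 df
anova(int_null, int_fit)  # the comparison row gives SSM, p df, and F = MSM / MSR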

Using \(R^2\) and Adjusted \(R^2\)

  • Adjusted \(R^2\) can be used as a quick assessment to compare the fit of multiple models; however, it should not be the only assessment!

  • Use \(R^2\) when describing the relationship between the response and predictor variables

Geometric interpretation

Geometry of least squares regression

  • Let \(\text{Col}(\mathbf{X})\) be the column space of \(\mathbf{X}\): the set of all possible linear combinations (span) of the columns of \(\mathbf{X}\)

  • In general, the vector of responses \(\mathbf{y}\) is not in \(\text{Col}(\mathbf{X})\).

  • Goal: Find another vector \(\mathbf{z} = \mathbf{Xb}\) that is in \(\text{Col}(\mathbf{X})\) and is as close as possible to \(\mathbf{y}\).

    • \(\mathbf{z}\) is called the projection of \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\).

Geometry of least squares regression

  • For any \(\mathbf{z} = \mathbf{Xb}\) in \(\text{Col}(\mathbf{X})\), the vector \(\mathbf{e} = \mathbf{y} - \mathbf{Xb}\) is the difference between \(\mathbf{y}\) and \(\mathbf{Xb}\), so making \(\mathbf{z}\) as close as possible to \(\mathbf{y}\) means making \(\mathbf{e}\) as short as possible.

    • In other words, we want to minimize \(||\mathbf{e}||^2 = ||\mathbf{y} - \mathbf{Xb}||^2\)

  • This is minimized by the \(\mathbf{b}\) (we’ll call it \(\hat{\boldsymbol{\beta}}\)) that makes \(\mathbf{e}\) orthogonal to \(\text{Col}(\mathbf{X})\)

  • Recall: If \(\mathbf{e}\) is orthogonal to \(\text{Col}(\mathbf{X})\), then the inner product of any vector in \(\text{Col}(\mathbf{X})\) and \(\mathbf{e}\) is 0 \(\Rightarrow \mathbf{X}^T\mathbf{e} = \mathbf{0}\)

Geometry of least squares regression

  • Therefore, we have

\[ \mathbf{X}^T(\mathbf{y} - \mathbf{Xb}) = \mathbf{0} \]

Let’s solve for \(\mathbf{b}\) to get the least squares estimate.
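Filling in that step (a standard derivation, not spelled out on the slide): distributing \(\mathbf{X}^T\) gives the normal equations \(\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}\), and when \(\mathbf{X}^T\mathbf{X}\) is invertible,

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

A quick numerical check against lm() (a sketch using the model and data from earlier):

# least squares estimates from the normal equations vs. lm()
X <- model.matrix(int_fit)   # design matrix, intercept column included
y <- loan50$interest_rate
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

cbind(beta_hat, coef(int_fit))  # the two columns should match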

Recap

  • Compared models using Adjusted \(R^2\)

  • Introduced the ANOVA table

  • Used a geometric interpretation to find the least squares estimates

Next class

  • Inference for regression

  • See Sep 19 prepare