SLR: Matrix representation

cont’d

Prof. Maria Tackett

Sep 10, 2024

Announcements

  • Lab 01 due on Thursday, September 12 at 11:59pm

    • Push work to GitHub repo

    • Submit final PDF on Gradescope + mark pages for each question

  • HW 01 will be assigned on Thursday

Topics

  • Matrix representation of simple linear regression
    • Model form
    • Least squares estimates
    • Predicted (fitted) values
    • Residuals

Matrix representation of simple linear regression

SLR in matrix form

Suppose we have \(n\) observations, a quantitative response variable, and a single predictor. In matrix form, the model is \[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_ {\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 &x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_{\boldsymbol{\epsilon}} \]


  • \(\mathbf{y}\): \(n\times 1\) vector of responses
  • \(\mathbf{X}\): \(n \times 2\) design matrix
  • \(\boldsymbol{\beta}\): \(2 \times 1\) vector of coefficients
  • \(\boldsymbol{\epsilon}\): \(n \times 1\) vector of error terms (see the numerical sketch below)
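
Aside (not part of the original slides): a minimal numpy sketch of this setup, using simulated data and made-up coefficient values, just to show the objects and their dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5                                  # number of observations
x = rng.uniform(0, 10, size=n)         # single quantitative predictor
beta = np.array([2.0, 0.5])            # "true" beta_0 and beta_1 (made up for illustration)
eps = rng.normal(0, 1, size=n)         # error terms

X = np.column_stack([np.ones(n), x])   # n x 2 design matrix: column of 1s, then x
y = X @ beta + eps                     # response vector of length n

print(X.shape, beta.shape, y.shape)    # (5, 2) (2,) (5,)
```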

Minimize sum of squared residuals

Goal: Find \(\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}\) that minimizes the sum of squared residuals \[ \begin{aligned} SSR = \sum_{i=1}^n e_i^2 = \mathbf{e}^T\mathbf{e} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] \end{aligned} \]
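
Aside (not part of the original slides): a quick numerical check, on simulated data, that the three ways of writing SSR above agree for an arbitrary candidate \(\boldsymbol{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

b = np.array([1.5, 0.7])                 # any candidate value of beta
e = y - X @ b                            # residual vector for this candidate

ssr_sum    = np.sum(e**2)                # sum_i e_i^2
ssr_inner  = e @ e                       # e^T e
ssr_matrix = (y - X @ b) @ (y - X @ b)   # (y - X beta)^T (y - X beta)

print(np.isclose(ssr_sum, ssr_inner), np.isclose(ssr_inner, ssr_matrix))   # True True
```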

Minimize sum of squared residuals

\[ \begin{aligned} \mathbf{e}^T\mathbf{e} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] &= (\mathbf{y}^T - \boldsymbol{\beta}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\\[10pt] &=\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\\[10pt] &=\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \end{aligned} \]

Note: \(\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}\) is a scalar (\(1 \times 1\)), so \(\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} = (\mathbf{y}^T\mathbf{X}\boldsymbol{\beta})^T = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}\). That is why the two middle terms combine into \(-2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}\) in the last line.

Minimize sum of squared residuals

The estimate of \(\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}\) that minimizes SSR is the one such that

\[ \nabla_{\boldsymbol{\beta}} SSR = \nabla_{\boldsymbol{\beta}}( \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) = 0 \]

Side note: Vector operations

Let \(\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_k\end{bmatrix}\) be a \(k \times 1\) vector and \(f(\mathbf{x})\) be a function of \(\mathbf{x}\).

Then \(\nabla_\mathbf{x}f\), the gradient of \(f\) with respect to \(\mathbf{x}\), is

\[ \nabla_\mathbf{x}f = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k}\end{bmatrix} \]
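
A small illustrative example (not from the original slides): if \(k = 2\) and \(f(\mathbf{x}) = x_1^2 + 3x_1x_2\), then

\[ \nabla_\mathbf{x}f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 \end{bmatrix} \]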

Side note: Vector operations

The Hessian matrix, \(\nabla_\mathbf{x}^2f\), is the \(k \times k\) matrix of second-order partial derivatives

\[ \nabla_{\mathbf{x}}^2f = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1\partial x_k} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_k\partial x_1} & \frac{\partial^2f}{\partial x_k\partial x_2} & \dots & \frac{\partial^2f}{\partial x_k^2} \end{bmatrix} \]
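
Continuing the illustrative example above (not from the original slides), for \(f(\mathbf{x}) = x_1^2 + 3x_1x_2\) the Hessian is

\[ \nabla^2_\mathbf{x}f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 3 & 0 \end{bmatrix} \]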

Side note: Vector operations

Proposition 1

Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{z}\) be a \(k \times 1\) vector such that \(\mathbf{z}\) is not a function of \(\mathbf{x}\).

The gradient of \(\mathbf{x}^T\mathbf{z}\) with respect to \(\mathbf{x}\) is

\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^T\mathbf{z} = \mathbf{z} \]

Side note: Proposition 1

\[ \begin{aligned} \mathbf{x}^T\mathbf{z} &= \begin{bmatrix}x_1 & x_2 & \dots &x_k\end{bmatrix} \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\z_k\end{bmatrix} \\[10pt] &= x_1z_1 + x_2z_2 + \dots + x_kz_k \\ &= \sum_{i=1}^k x_iz_i \end{aligned} \]

Side note: Proposition 1

\[ \nabla_\mathbf{x}\hspace{1mm}\mathbf{x}^T\mathbf{z} = \begin{bmatrix}\frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_1} \\ \frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_k}\end{bmatrix} = \begin{bmatrix}\frac{\partial}{\partial x_1} (x_1z_1 + x_2z_2 + \dots + x_kz_k) \\ \frac{\partial}{\partial x_2} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\\ \vdots \\ \frac{\partial}{\partial x_k} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k\end{bmatrix} = \mathbf{z} \]

Side note: Vector + matrix operations

Proposition 2

Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{A}\) be a \(k \times k\) matrix such that \(\mathbf{A}\) is not a function of \(\mathbf{x}\).

Then the gradient of \(\mathbf{x}^T\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{x}\) is

\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^T\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^T \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \]

If \(\mathbf{A}\) is symmetric, then

\[ (\mathbf{A} + \mathbf{A}^T)\mathbf{x} = 2\mathbf{A}\mathbf{x} \]

Proof in HW 01
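
Aside (not part of the original slides, and not a substitute for the HW 01 proof): a finite-difference sanity check of Propositions 1 and 2 on randomly generated \(\mathbf{x}\), \(\mathbf{z}\), and \(\mathbf{A}\).

```python
import numpy as np

rng = np.random.default_rng(2)
k = 4
x = rng.normal(size=k)
z = rng.normal(size=k)
A = rng.normal(size=(k, k))

def num_grad(f, x, h=1e-5):
    """Central finite-difference approximation to the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

# Proposition 1: gradient of x^T z with respect to x is z
print(np.allclose(num_grad(lambda v: v @ z, x), z))                  # True

# Proposition 2: gradient of x^T A x with respect to x is (A + A^T) x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))  # True
```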

Find the least squares estimators

\[ \begin{aligned} \nabla_{\boldsymbol{\beta}} SSR &= \nabla_{\boldsymbol{\beta}}( \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) \\[10pt] &= \nabla_\boldsymbol{\beta} \hspace{1mm} \mathbf{y}^T\mathbf{y} - 2\nabla_\boldsymbol{\beta} \hspace{1mm} \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \nabla_\boldsymbol{\beta} \hspace{1mm} \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\[10pt] &= 0 - 2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = 0 \\[10pt] &\Rightarrow \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y} \\[10pt] &\Rightarrow (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\[10pt] &\color{#993399}{\Rightarrow \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}} \end{aligned} \]
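
Aside (not part of the original slides): a numpy sketch on simulated data that computes \(\hat{\boldsymbol{\beta}}\) from the closed form above and compares it with np.linalg.lstsq, which solves the same least squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

# Closed-form least squares estimate: (X^T X)^{-1} X^T y
# (in practice, np.linalg.solve(X.T @ X, X.T @ y) is preferred to explicitly inverting)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Library solution of the same least squares problem
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))   # True
```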

Did we find a minimum?

\[ \begin{aligned} \nabla^2_{\boldsymbol{\beta}} SSR &= \nabla_{\boldsymbol{\beta}} (-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) \\[10pt] &= -2\nabla_{\boldsymbol{\beta}}\mathbf{X}^T\mathbf{y} + 2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) \\[10pt] &\propto \mathbf{X}^T\mathbf{X} > 0 \end{aligned} \]

Show the details in HW 01
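
Aside (not part of the original slides; the formal argument is in HW 01): a quick numerical check that \(\mathbf{X}^T\mathbf{X}\) is positive definite, i.e., all of its eigenvalues are positive, when the columns of \(\mathbf{X}\) are not collinear.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])

eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of the symmetric matrix X^T X
print(eigvals, np.all(eigvals > 0))     # all positive => positive definite
```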

Predicted values and residuals

Predicted (fitted) values

Now that we have \(\hat{\boldsymbol{\beta}}\), let’s predict values of \(\mathbf{y}\) using the model

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{\mathbf{H}}\mathbf{y} = \mathbf{H}\mathbf{y} \]

Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

  • \(\mathbf{H}\) is an \(n\times n\) matrix
  • Maps the vector of observed values \(\mathbf{y}\) to the vector of fitted values \(\hat{\mathbf{y}}\)
  • It is a function of \(\mathbf{X}\) only, not of \(\mathbf{y}\) (see the sketch below)
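
Aside (not part of the original slides): a numpy sketch, on simulated data, of these properties of \(\mathbf{H}\). It also checks two standard facts about the hat matrix: \(\mathbf{H}\) is symmetric and idempotent (\(\mathbf{H}\mathbf{H} = \mathbf{H}\)).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # n x n hat matrix, built from X alone
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

print(H.shape)                              # (100, 100)
print(np.allclose(H @ y, X @ beta_hat))     # H y equals the fitted values X beta_hat
print(np.allclose(H, H.T))                  # H is symmetric
print(np.allclose(H @ H, H))                # H is idempotent
```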

Residuals

Recall that the residuals are the difference between the observed and predicted values

\[ \begin{aligned} \mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}}\\[10pt] &= \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} \\[10pt] &= \mathbf{y} - \mathbf{H}\mathbf{y} \\[20pt] \color{#993399}{\mathbf{e}} &\color{#993399}{= (\mathbf{I} - \mathbf{H})\mathbf{y}} \\[10pt] \end{aligned} \]
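
Aside (not part of the original slides): on simulated data, the residuals computed as \((\mathbf{I} - \mathbf{H})\mathbf{y}\) match \(\mathbf{y} - \hat{\mathbf{y}}\), and they are orthogonal to the columns of \(\mathbf{X}\) (so, with an intercept in the model, they sum to zero).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

e = (np.eye(n) - H) @ y                 # residuals via (I - H) y
print(np.allclose(e, y - y_hat))        # same as y - y_hat
print(np.allclose(X.T @ e, 0))          # residuals are orthogonal to the columns of X
```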

Recap

  • Introduced matrix representation for simple linear regression

    • Model form
    • Least squares estimates
    • Predicted (fitted) values
    • Residuals