cont’d
Sep 10, 2024
Lab 01 due on Thursday, September 12 at 11:59pm
Push work to GitHub repo
Submit final PDF on Gradescope + mark pages for each question
HW 01 will be assigned on Thursday
Suppose we have \(n\) observations, a quantitative response variable, and a single predictor: \[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_{\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_{\boldsymbol{\epsilon}} \]
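As a concrete illustration, here is a minimal numpy sketch of this setup using simulated data; the sample size, seed, and true coefficients are assumptions chosen for the example, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(221)        # assumed seed, for reproducibility
n = 50                                  # assumed sample size
x = rng.uniform(0, 10, size=n)          # single quantitative predictor
epsilon = rng.normal(0, 1, size=n)      # error term
y = 2 + 3 * x + epsilon                 # assumed true beta_0 = 2, beta_1 = 3

# Design matrix X: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones(n), x])    # shape (n, 2)
print(X.shape, y.shape)                 # (50, 2) (50,)
```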
Goal: Find \(\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}\) that minimizes the sum of squared residuals \[ SSR = \sum_{i=1}^n e_i^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
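A short sketch of the quantity being minimized, written directly from the matrix form above; the simulated data mirror the earlier block (seed, sample size, and true coefficients are assumed), and the trial coefficient vectors are arbitrary.

```python
import numpy as np

# Same simulated data as in the earlier sketch (assumed example values)
rng = np.random.default_rng(221)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones(50), x])

def ssr(beta, X, y):
    """Sum of squared residuals e^T e with e = y - X beta."""
    e = y - X @ beta
    return e @ e

print(ssr(np.array([2.0, 3.0]), X, y))   # near the assumed true coefficients: small SSR
print(ssr(np.array([0.0, 0.0]), X, y))   # far from them: much larger SSR
```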
\[ \begin{aligned} \mathbf{e}^T\mathbf{e} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] &= (\mathbf{y}^T - \boldsymbol{\beta}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\\[10pt] &=\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\\[10pt] &=\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \end{aligned} \]
The last step uses the fact that \(\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}\) is a \(1 \times 1\) scalar, so it equals its transpose \(\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}\), and the two middle terms combine.
The estimate of \(\boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \end{bmatrix}\) that minimizes SSR is the one such that
\[ \nabla_{\boldsymbol{\beta}} SSR = \nabla_{\boldsymbol{\beta}}( \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \]
Let \(\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_k\end{bmatrix}\) be a \(k \times 1\) vector and \(f(\mathbf{x})\) be a scalar-valued function of \(\mathbf{x}\).
Then \(\nabla_\mathbf{x}f\), the gradient of \(f\) with respect to \(\mathbf{x}\), is
\[ \nabla_\mathbf{x}f = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k}\end{bmatrix} \]
The Hessian matrix, \(\nabla_\mathbf{x}^2 f\), is the \(k \times k\) matrix of second-order partial derivatives
\[ \nabla_{\mathbf{x}}^2f = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1\partial x_k} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_k\partial x_1} & \frac{\partial^2f}{\partial x_k\partial x_2} & \dots & \frac{\partial^2f}{\partial x_k^2} \end{bmatrix} \]
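If it helps to see these definitions numerically, here is a small sketch that approximates the gradient and Hessian of a toy quadratic function by central finite differences; the function, evaluation point, and step sizes are arbitrary choices for illustration.

```python
import numpy as np

def f(x):
    # toy scalar-valued function of a vector x (arbitrary choice)
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

def grad_fd(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    k = len(x)
    g = np.zeros(k)
    for i in range(k):
        e = np.zeros(k)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hess_fd(f, x, h=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        e = np.zeros(k)
        e[i] = h
        H[:, i] = (grad_fd(f, x + e) - grad_fd(f, x - e)) / (2 * h)
    return H

x0 = np.array([1.0, 2.0])
print(grad_fd(f, x0))   # approximately [2*1 + 3*2, 3*1 + 4*2] = [8, 11]
print(hess_fd(f, x0))   # approximately [[2, 3], [3, 4]]
```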
Proposition 1
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{z}\) be a \(k \times 1\) vector, such that \(\mathbf{z}\) is not a function of \(\mathbf{x}\).
The gradient of \(\mathbf{x}^T\mathbf{z}\) with respect to \(\mathbf{x}\) is
\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^T\mathbf{z} = \mathbf{z} \]
\[ \begin{aligned} \mathbf{x}^T\mathbf{z} &= \class{fragment}{\begin{bmatrix}x_1 & x_2 & \dots &x_k\end{bmatrix} \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\z_k\end{bmatrix}} \\[10pt] &\class{fragment}{= x_1z_1 + x_2z_2 + \dots + x_kz_k} \\ &\class{fragment}{= \sum_{i=1}^k x_iz_i} \end{aligned} \]
\[ \nabla_\mathbf{x}\hspace{1mm}\mathbf{x}^T\mathbf{z} = \class{fragment}{\begin{bmatrix}\frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_1} \\ \frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{x}^T\mathbf{z}}{\partial x_k}\end{bmatrix}} = \class{fragment}{\begin{bmatrix}\frac{\partial}{\partial x_1} (x_1z_1 + x_2z_2 + \dots + x_kz_k) \\ \frac{\partial}{\partial x_2} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\\ \vdots \\ \frac{\partial}{\partial x_k} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\end{bmatrix}} = \class{fragment}{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k\end{bmatrix} = \mathbf{z}} \]
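A quick numerical check of Proposition 1, comparing a finite-difference gradient of \(\mathbf{x}^T\mathbf{z}\) to \(\mathbf{z}\); the dimension, seed, and specific vectors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
k = 4
x = rng.normal(size=k)
z = rng.normal(size=k)                  # z does not depend on x

f = lambda x: x @ z                     # f(x) = x^T z

# Central-difference gradient of f at x
h = 1e-6
grad = np.array([(f(x + h * np.eye(k)[i]) - f(x - h * np.eye(k)[i])) / (2 * h)
                 for i in range(k)])

print(np.allclose(grad, z))             # True: the gradient is z, as in Proposition 1
```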
Proposition 2
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{A}\) be a \(k \times k\) matrix, such that \(\mathbf{A}\) is not a function of \(\mathbf{x}\).
Then the gradient of \(\mathbf{x}^T\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{x}\) is
\[ \nabla_\mathbf{x} \hspace{1mm} \mathbf{x}^T\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^T \mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \]
If \(\mathbf{A}\) is symmetric, then
\[ (\mathbf{A} + \mathbf{A}^T)\mathbf{x} = 2\mathbf{A}\mathbf{x} \]
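A matching numerical check of Proposition 2, for a general (not necessarily symmetric) matrix \(\mathbf{A}\) and for a symmetric one; again, the particular matrices, vector, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
k = 4
x = rng.normal(size=k)
A = rng.normal(size=(k, k))             # general square matrix, not a function of x

f = lambda x: x @ A @ x                 # f(x) = x^T A x
h = 1e-5
grad = np.array([(f(x + h * np.eye(k)[i]) - f(x - h * np.eye(k)[i])) / (2 * h)
                 for i in range(k)])
print(np.allclose(grad, (A + A.T) @ x))          # True: general case of Proposition 2

S = (A + A.T) / 2                                # a symmetric matrix
g = lambda x: x @ S @ x
grad_s = np.array([(g(x + h * np.eye(k)[i]) - g(x - h * np.eye(k)[i])) / (2 * h)
                   for i in range(k)])
print(np.allclose(grad_s, 2 * S @ x))            # True: symmetric case, 2 A x
```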
Applying Proposition 1 with \(\mathbf{z} = \mathbf{X}^T\mathbf{y}\) and Proposition 2 with the symmetric matrix \(\mathbf{A} = \mathbf{X}^T\mathbf{X}\),
\[ \begin{aligned} \nabla_{\boldsymbol{\beta}} SSR &= \nabla_{\boldsymbol{\beta}}( \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) \\[10pt] & \class{fragment}{= \nabla_\boldsymbol{\beta} \hspace{1mm} \mathbf{y}^T\mathbf{y} - 2\nabla_\boldsymbol{\beta} \hspace{1mm} \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \nabla_\boldsymbol{\beta} \hspace{1mm} \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}} \\[10pt] &\class{fragment}{= \mathbf{0} - 2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}}\class{fragment}{= \mathbf{0}} \\[10pt] &\class{fragment}{\Rightarrow \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}} \\[10pt] &\class{fragment}{\Rightarrow (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}} \\[10pt] &\class{fragment}{\color{#993399}{\Rightarrow \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}} \end{aligned} \]
The last two steps assume \(\mathbf{X}^T\mathbf{X}\) is invertible, which holds whenever the columns of \(\mathbf{X}\) are linearly independent (here, whenever the \(x_i\) are not all equal).
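Here is a hedged sketch of computing \(\hat{\boldsymbol{\beta}}\) from the closed form just derived, using the same simulated data as before (seed, sample size, and true coefficients are assumed example values); the normal equations are solved directly rather than forming the inverse, and the result is compared against numpy's least-squares routine.

```python
import numpy as np

# Same simulated data as before (assumed example values)
rng = np.random.default_rng(221)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones(50), x])

# Closed form: beta_hat = (X^T X)^{-1} X^T y, computed by solving the
# normal equations (X^T X) beta = X^T y rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # close to the assumed true values (2, 3)

# A numerically preferred least-squares routine gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))     # True
```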
\[ \begin{aligned} \nabla^2_{\boldsymbol{\beta}} SSR &= \nabla_{\boldsymbol{\beta}} (-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}) \\[10pt] &\class{fragment}{=-2\nabla_{\boldsymbol{\beta}}\mathbf{X}^T\mathbf{y} + 2\nabla_{\boldsymbol{\beta}}(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta})} \\[10pt] &\class{fragment}{= 2\mathbf{X}^T\mathbf{X}} \end{aligned} \]
Since \(\mathbf{X}^T\mathbf{y}\) does not depend on \(\boldsymbol{\beta}\), its gradient is \(\mathbf{0}\). The Hessian \(2\mathbf{X}^T\mathbf{X}\) is positive definite whenever the columns of \(\mathbf{X}\) are linearly independent, so \(\hat{\boldsymbol{\beta}}\) is indeed a minimizer of SSR.
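To see the second-order condition concretely, a small sketch checking that \(\mathbf{X}^T\mathbf{X}\) is positive definite for the simulated design matrix (all eigenvalues strictly positive); the data-generating values are the same assumptions as above.

```python
import numpy as np

rng = np.random.default_rng(221)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones(50), x])

# X^T X is symmetric; eigvalsh returns its (real) eigenvalues
eigenvalues = np.linalg.eigvalsh(X.T @ X)
print(eigenvalues)                 # both strictly positive => X^T X is positive definite
print(np.all(eigenvalues > 0))     # True
```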
Now that we have \(\hat{\boldsymbol{\beta}}\), let’s predict values of \(\mathbf{y}\) using the model
\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{\mathbf{H}}\mathbf{y} = \mathbf{H}\mathbf{y} \]
Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
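A brief sketch verifying, on the simulated data (same assumed values as before), two properties of the hat matrix, namely that it is symmetric and idempotent (\(\mathbf{H}^T = \mathbf{H}\), \(\mathbf{H}\mathbf{H} = \mathbf{H}\)), and that \(\mathbf{H}\mathbf{y}\) reproduces the fitted values \(\mathbf{X}\hat{\boldsymbol{\beta}}\).

```python
import numpy as np

rng = np.random.default_rng(221)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones(50), x])

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X^T X)^{-1} X^T
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(H, H.T))                   # True: H is symmetric
print(np.allclose(H @ H, H))                 # True: H is idempotent
print(np.allclose(H @ y, X @ beta_hat))      # True: H y gives the fitted values
```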
Recall that the residuals are the difference between the observed and predicted values
\[ \begin{aligned} \mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}}\\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}} \\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{H}\mathbf{y}} \\[20pt] \class{fragment}{\color{#993399}{\mathbf{e}}} &\class{fragment}{\color{#993399}{=(\mathbf{I} - \mathbf{H})\mathbf{y}}} \\[10pt] \end{aligned} \]
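Finally, a sketch checking the residual identity \(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) against the direct computation \(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\) on the simulated data (same assumed values as before); the last line also checks that the residuals are orthogonal to the columns of \(\mathbf{X}\), which follows from the normal equations.

```python
import numpy as np

rng = np.random.default_rng(221)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones(50), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
H = X @ np.linalg.solve(X.T @ X, X.T)

e_direct = y - X @ beta_hat                   # residuals as observed minus fitted
e_hat = (np.eye(50) - H) @ y                  # residuals via (I - H) y

print(np.allclose(e_direct, e_hat))           # True: the two expressions agree
print(np.allclose(X.T @ e_direct, 0))         # True: X^T e = 0 (normal equations)
```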
Introduced the matrix representation of simple linear regression and derived the least-squares estimator \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)