AE 04: Exam 01 Review

Published

October 3, 2024

Important

Go to the course GitHub organization and locate your ae-04 repo to get started.

Render, commit, and push your responses to GitHub by the end of class to submit your AE.

Packages

library(tidyverse)
library(tidymodels)
library(knitr)
library(patchwork)

Restaurant tips

What factors are associated with the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf College who worked at a local restaurant.[1]

The variables we’ll focus on for this analysis are

  • Tip: amount of the tip
  • Party: number of people in the party
  • Age: age of the payer

View the data set to see the remaining variables.

tips <- read_csv("data/tip-data.csv")

Exploratory data analysis

p1 <- ggplot(data = tips, aes(x = Tip)) + 
  geom_histogram(color = "white", binwidth = 2) +
  labs(x = "Tips ($)",
       title = "Tips at local restaurant")

p2 <- ggplot(data = tips, aes(x = Party)) + 
  geom_histogram(color = "white") +
  labs(x = "Party",
       title = "Number of diners in party") +
  xlim(c(0, 7))

p3 <- ggplot(data = tips, aes(x = Age)) + 
  geom_bar(color = "white") +
  labs(x = "",
       title = "Age of Payer") 

p1 / (p2 + p3)

p4 <- ggplot(data = tips, aes(x = Party, y = Tip)) + 
  geom_jitter() + 
  labs(x = "Number of diners in party", 
       y = "Tips ($)",
       title = "Tips vs. Party")

p5 <- ggplot(data = tips, aes(x = Age, y = Tip)) + 
  geom_boxplot() + 
  labs(x = "Age of payer", 
       y = "Tips ($)",
       title = "Tips vs. Age")

p4 + p5

We will use the number of diners in the party and age of the payer to understand variability in the tips.

Exercise 1

We will start with the main effects model.

  1. How many indicator variables for Age can we create from the data?
  2. How many indicator variables for Age will be in the regression model?
  3. Are the responses to parts 1 and 2 equal? If not, explain why not.
  4. Which of the following is true for this model? Select all that apply.
    1. The intercepts are the same for every level of Age.
    2. The intercepts differ by Age.
    3. The effect of Party is the same for every level of Age.
    4. The effect of Party differs by Age.
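
As a way to check your reasoning about indicator variables, the sketch below builds design matrices for a small made-up data set. The data frame `toy` and its four-level factor `group` are hypothetical and are not part of the tips data; the point is only to show how `model.matrix()` treats a categorical predictor with and without an intercept.

```r
# Hypothetical toy data (not the tips data): a factor with four levels
toy <- data.frame(
  group = factor(c("a", "b", "c", "d", "a", "b", "c", "d")),
  x = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Without an intercept, every level gets its own indicator column
all_indicators <- model.matrix(~ group - 1, data = toy)

# With an intercept, R's default treatment coding drops the baseline level
design <- model.matrix(~ x + group, data = toy)

ncol(all_indicators)  # number of levels of the factor
colnames(design)      # intercept, x, and the non-baseline indicators
```

Comparing the two column counts shows why the number of indicators you can create and the number that enter a model with an intercept need not match.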

Exercise 2

  1. What is the dimension of the design matrix \(\mathbf{X}\) for the main effects model?

  2. Calculate the coefficient estimates \(\hat{\boldsymbol{\beta}}\) directly from the data.

  3. Write the equation of the estimated regression model.

# add code here 
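
If the matrix calculation is unfamiliar, here is a minimal sketch of the approach on simulated data. Everything in it (`toy`, `x`, `group`, `y`, and the coefficients used to simulate) is made up for illustration; you would replace the design matrix and response with the ones from the tips data.

```r
# Hypothetical simulated data (not the tips data)
set.seed(1)
n <- 50
toy <- data.frame(
  x = runif(n, 1, 6),
  group = factor(rep(c("a", "b", "c"), length.out = n))
)
toy$y <- 1 + 2 * toy$x + ifelse(toy$group == "b", 1, 0) + rnorm(n)

# Design matrix: a column of 1s, x, and indicators for the non-baseline levels
X <- model.matrix(~ x + group, data = toy)
y <- toy$y

dim(X)  # rows = observations, columns = coefficients

# Least-squares estimate: (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Check against lm(), which fits the same model
fit <- lm(y ~ x + group, data = toy)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
```

The explicit inverse mirrors the formula from class. In practice, `solve(crossprod(X), crossprod(X, y))` solves the same system without forming the inverse.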

Exercise 3

Compute the following directly from the data:

  1. The regression standard error \(\hat{\sigma}_{\epsilon}\). Interpret this value in the context of the data.
  2. \(R^2\). Interpret this value in the context of the data.
  3. \(RMSE\). Interpret this value in the context of the data.

# add code here
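
The sketch below shows one way to compute these three quantities from residuals, using simulated data rather than the tips data. The variable names are hypothetical. It assumes \(RMSE\) is defined with \(n\) in the denominator, as in `yardstick::rmse()`, while the regression standard error divides by \(n - p\), where \(p\) is the number of estimated coefficients including the intercept.

```r
# Hypothetical simulated data (not the tips data)
set.seed(2)
n <- 40
x <- runif(n, 1, 6)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

e <- y - fitted(fit)            # residuals
p <- length(coef(fit))          # number of coefficients, including the intercept
ss_res <- sum(e^2)              # sum of squared residuals
ss_tot <- sum((y - mean(y))^2)  # total sum of squares

sigma_hat <- sqrt(ss_res / (n - p))  # regression standard error
r_squared <- 1 - ss_res / ss_tot     # proportion of variability explained
rmse <- sqrt(ss_res / n)             # assumes the n-denominator definition

# Compare with the values R reports
all.equal(sigma_hat, sigma(fit))
all.equal(r_squared, summary(fit)$r.squared)
```

Because \(n - p < n\), the regression standard error is always slightly larger than the \(RMSE\) computed this way.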

Exercise 4

You decide to add an interaction effect between Age and Party to the model and fit a model of the following form:

\[ \hat{Tip}_i = \beta_0 + \beta_1Party_i + \beta_2SenCit_i + \beta_3Yadult_i + \beta_4Party_i \times SenCit_i + \beta_5 Party_i \times Yadult_i \]

  1. Which of the following is true for this model? Select all that apply.
    1. The intercepts are the same for every level of Age.
    2. The intercepts differ by Age.
    3. The effect of Party is the same for every level of Age.
    4. The effect of Party differs by Age.
  2. By how much does the intercept for tables with young adult payers differ from tables with middle-aged payers?
  3. Write the equation of the model for tables in which the payer is a senior citizen.
  4. Suppose you wish to test the hypotheses: \(H_0: \beta_5 = 0 \text{ vs. }H_a: \beta_5 \neq 0\). State what is being tested in terms of the effect of Party.
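
To see how the coefficients of an interaction model combine into a group-specific line, the sketch below fits an interaction model to simulated data. The data frame `toy`, the factor `group`, and its levels are hypothetical. It recovers the intercept and slope for one non-baseline group, then checks them against a separate regression fit to that group alone.

```r
# Hypothetical simulated data (not the tips data)
set.seed(3)
toy <- data.frame(
  x = rep(1:5, times = 6),
  group = factor(rep(c("a", "b", "c"), each = 10))
)
toy$y <- 1 + 2 * toy$x + rnorm(nrow(toy))

# x * group expands to x + group + x:group
fit_int <- lm(y ~ x * group, data = toy)
b <- coef(fit_int)

# Line for group "b": baseline terms plus the group "b" adjustments
int_b <- unname(b["(Intercept)"] + b["groupb"])
slope_b <- unname(b["x"] + b["x:groupb"])

# A separate simple regression on group "b" gives the same line
fit_b <- lm(y ~ x, data = subset(toy, group == "b"))
all.equal(c(int_b, slope_b), unname(coef(fit_b)))
```

The indicator coefficient shifts the intercept and the interaction coefficient shifts the slope, each relative to the baseline group.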

Exercise 5

The output for the model with the interaction term and 90% confidence intervals for the coefficients is shown below.

tip_int_fit <- lm(Tip ~ Party + Age + Party * Age, data = tips)
tidy(tip_int_fit, conf.int = TRUE, conf.level = 0.9) |>
  kable(digits = 3)
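
| term            | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:----------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept)     |    0.604 |     0.504 |     1.199 |   0.232 |   -0.229 |     1.438 |
| Party           |    1.924 |     0.169 |    11.359 |   0.000 |    1.644 |     2.204 |
| AgeSenCit       |    1.033 |     0.784 |     1.317 |   0.190 |   -0.265 |     2.330 |
| AgeYadult       |   -1.203 |     0.928 |    -1.297 |   0.197 |   -2.739 |     0.332 |
| Party:AgeSenCit |   -0.259 |     0.262 |    -0.986 |   0.325 |   -0.692 |     0.175 |
| Party:AgeYadult |    0.199 |     0.504 |     0.395 |   0.693 |   -0.635 |     1.034 |
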
  1. What does 0.784, the standard error of AgeSenCit, mean in the context of the data?
  2. What does 1.317, the test statistic for AgeSenCit, mean in the context of the data?
  3. What does the p-value of 0.190 mean in the context of the data?
  4. The 90% confidence interval corresponds to what \(\alpha\)-level?
  5. What is your conclusion about the effect of AgeSenCit?
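
The sketch below shows how an estimate, its standard error, the test statistic, the p-value, and the confidence interval fit together, using simulated data rather than the tips data. All variable names are hypothetical. It rebuilds a confidence interval and a two-sided p-value by hand from the \(t\) distribution and compares them with what R reports.

```r
# Hypothetical simulated data (not the tips data)
set.seed(4)
n <- 60
x <- runif(n, 1, 6)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

level <- 0.90
alpha <- 1 - level  # confidence level and alpha sum to 1

est <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]
df <- df.residual(fit)

# Confidence interval: estimate +/- critical value * standard error
t_star <- qt(1 - alpha / 2, df = df)
ci_manual <- c(est - t_star * se, est + t_star * se)
ci_builtin <- unname(confint(fit, "x", level = level)[1, ])
all.equal(ci_manual, ci_builtin)

# Test statistic and two-sided p-value for H0: coefficient = 0
t_stat <- est / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
```

A two-sided test at level \(\alpha\) rejects \(H_0\) exactly when the corresponding \(100(1 - \alpha)\%\) interval excludes 0.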

Exercise 6

The following are general questions about regression. They are not specific to the tips data set.

  1. What does it mean for an estimator to be the “least-squares” estimator?

  2. Consider the following derivation of \(Var(\hat{\boldsymbol{\beta}})\), the variance of the least-squares estimator:

    \[ \begin{aligned} Var(\hat{\boldsymbol{\beta}}) & = E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T] \\ & = E[((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon})((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon})^T] \\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TE(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ & = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \]

    Explain how to go from Line 1 to Line 2.

  3. What assumptions are used to go from Line 3 to Line 4?
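
As a numerical companion to the derivation, the sketch below checks its final line on simulated data: plugging \(\hat{\sigma}^2_{\epsilon}\) into \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) reproduces the covariance matrix that `vcov()` reports for an `lm` fit. The data and variable names are hypothetical, and this is a check of the result, not a substitute for the algebra.

```r
# Hypothetical simulated data
set.seed(5)
n <- 30
x <- runif(n, 1, 6)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

X <- model.matrix(fit)

# Estimate sigma^2 with the residual sum of squares over n - p
sigma2_hat <- sum(residuals(fit)^2) / (n - ncol(X))

# Estimated variance of the least-squares estimator
var_beta_hat <- sigma2_hat * solve(t(X) %*% X)

# Matches the covariance matrix reported by R
all.equal(var_beta_hat, vcov(fit), check.attributes = FALSE)

# Square roots of the diagonal are the reported standard errors
sqrt(diag(var_beta_hat))
```
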

Submission

To submit the AE:

Render the document to produce the PDF with all of your work from today’s class.

Push all your work to your AE repo on GitHub. You’re done! 🎉

Footnotes

  1. Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.