AE 04: Exam 01 Review
Go to the course GitHub organization and locate your ae-04 repo to get started.
Render, commit, and push your responses to GitHub by the end of class to submit your AE.
Packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(patchwork)
Restaurant tips
What factors are associated with the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.[1]
The variables we’ll focus on for this analysis are:
- Tip: amount of the tip
- Party: number of people in the party
- Age: age of the payer
View the data set to see the remaining variables.
tips <- read_csv("data/tip-data.csv")
Exploratory data analysis
p1 <- ggplot(data = tips, aes(x = Tip)) +
  geom_histogram(color = "white", binwidth = 2) +
  labs(x = "Tips ($)",
       title = "Tips at local restaurant")

p2 <- ggplot(data = tips, aes(x = Party)) +
  geom_histogram(color = "white") +
  labs(x = "Party",
       title = "Number of diners in party") +
  xlim(c(0, 7))

p3 <- ggplot(data = tips, aes(x = Age)) +
  geom_bar(color = "white") +
  labs(x = "",
       title = "Age of Payer")

p1 / (p2 + p3)
p4 <- ggplot(data = tips, aes(x = Party, y = Tip)) +
  geom_jitter() +
  labs(x = "Number of diners in party",
       y = "Tips ($)",
       title = "Tips vs. Party")

p5 <- ggplot(data = tips, aes(x = Age, y = Tip)) +
  geom_boxplot() +
  labs(x = "Age of payer",
       y = "Tips ($)",
       title = "Tips vs. Age")

p4 + p5
We will use the number of diners in the party and age of the payer to understand variability in the tips.
Exercise 1
We will start with the main effects model.
- How many indicator variables for Age can we create from the data?
- How many indicator variables for Age will be in the regression model?
- Are the responses to parts a and b equal? If not, explain why not.
- Which of the following is true for this model? Select all that apply.
  - The intercepts are the same for every level of Age.
  - The intercepts differ by Age.
  - The effect of Party is the same for every level of Age.
  - The effect of Party differs by Age.
Exercise 2
What is the dimension of the design matrix \(\mathbf{X}\) for the main effects model?
Calculate the coefficient estimates \(\hat{\boldsymbol{\beta}}\) directly from the data.
Write the equation of the estimated regression model.
# add code here
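As a hedged illustration of the matrix computation (not the solution for the tips data), the sketch below builds a design matrix for a small simulated data set and solves the normal equations. The data frame `toy` and its columns `y`, `x`, and `g` are hypothetical stand-ins invented for this example; only base R is assumed.

```r
# Hypothetical toy data: one quantitative predictor and one 3-level factor
set.seed(210)
toy <- data.frame(
  x = runif(30, 1, 6),
  g = factor(rep(c("A", "B", "C"), each = 10))
)
toy$y <- 1 + 2 * toy$x + ifelse(toy$g == "B", 0.5, 0) + rnorm(30)

# Design matrix: intercept column, x, and indicators for the
# non-baseline levels of g
X <- model.matrix(y ~ x + g, data = toy)
y <- toy$y

dim(X)  # number of observations by number of coefficients

# Least-squares estimate: solve (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Compare with lm(); the two should agree up to numerical error
all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x + g, data = toy))))
```

Solving the linear system with `solve(A, b)` avoids forming the inverse explicitly; writing `solve(t(X) %*% X) %*% t(X) %*% y` mirrors the formula more literally and gives the same estimates here.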
Exercise 3
Compute the following directly from the data:
- The regression standard error \(\hat{\sigma}_{\epsilon}\). Interpret this value in the context of the data.
- \(R^2\). Interpret this value in the context of the data.
- \(RMSE\). Interpret this value in the context of the data.
# add code here
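The three quantities above can be sketched from the residuals of a fitted model. The example uses simulated data (the names `toy`, `x`, and `y` are hypothetical), and it assumes RMSE is defined as the square root of the mean squared residual, dividing by n; check this against the definition used in class.

```r
# Hypothetical toy data with a single predictor
set.seed(210)
toy <- data.frame(x = runif(30, 1, 6))
toy$y <- 1 + 2 * toy$x + rnorm(30)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)
n <- nrow(toy)
p <- length(coef(fit)) - 1            # number of predictors, excluding the intercept

ssr <- sum(res^2)                     # sum of squared residuals
sst <- sum((toy$y - mean(toy$y))^2)   # total sum of squares

sigma_hat <- sqrt(ssr / (n - p - 1))  # regression standard error
r_sq      <- 1 - ssr / sst            # R^2
rmse      <- sqrt(ssr / n)            # RMSE (assumed definition: divide by n)

# Cross-check against the values reported by lm()
all.equal(sigma_hat, sigma(fit))
all.equal(r_sq, summary(fit)$r.squared)
```

Under these definitions the regression standard error and RMSE differ only in the divisor (n - p - 1 versus n), so RMSE is never larger than the regression standard error.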
Exercise 4
You decide to add an interaction effect between Age and Party to the model and fit a model of the following form:
\[ \hat{Tip}_i = \beta_0 + \beta_1Party_i + \beta_2SenCit_i + \beta_3Yadult_i + \beta_4Party_i \times SenCit_i + \beta_5 Party_i \times Yadult_i \]
- Which of the following is true for this model? Select all that apply.
  - The intercepts are the same for every level of Age.
  - The intercepts differ by Age.
  - The effect of Party is the same for every level of Age.
  - The effect of Party differs by Age.
- By how much does the intercept for tables with young adult payers differ from the intercept for tables with middle-aged payers?
- Write the equation of the model for tables in which the payer is a senior citizen.
- Suppose you wish to test the hypotheses \(H_0: \beta_5 = 0 \text{ vs. }H_a: \beta_5 \neq 0\). State what is being tested in terms of the effect of Party.
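To see how an interaction between a quantitative predictor and a categorical predictor is encoded, here is a minimal sketch on simulated data. The data frame `toy`, the factor `g` with levels A, B, and C, and the true slopes are all hypothetical and are not taken from the tips data.

```r
# Hypothetical toy data: quantitative x and a 3-level factor g,
# simulated so that the slope of x differs by level of g
set.seed(4)
toy <- data.frame(
  x = runif(60, 1, 6),
  g = factor(rep(c("A", "B", "C"), each = 20))
)
slopes <- c(A = 2, B = 1, C = 3)
toy$y <- 1 + unname(slopes[as.character(toy$g)]) * toy$x + rnorm(60)

# x * g expands to x + g + x:g
fit_int <- lm(y ~ x * g, data = toy)
names(coef(fit_int))
# "(Intercept)" "x" "gB" "gC" "x:gB" "x:gC"

# Intercept and slope implied for level B:
# baseline (level A) terms plus the level-B adjustments
b <- coef(fit_int)
int_B   <- unname(b["(Intercept)"] + b["gB"])
slope_B <- unname(b["x"] + b["x:gB"])

# These match a separate regression fit only to the level-B rows
fit_B <- lm(y ~ x, data = subset(toy, g == "B"))
all.equal(c(int_B, slope_B), unname(coef(fit_B)))
```

Because every level gets its own intercept adjustment and its own slope adjustment, the coefficient estimates from this model reproduce the lines from fitting each level separately; the interaction coefficients measure how far each level's slope is from the baseline slope.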
Exercise 5
The output for the model with the interaction term and 90% confidence intervals for the coefficients is shown below.
tip_int_fit <- lm(Tip ~ Party + Age + Party * Age, data = tips)
tidy(tip_int_fit, conf.int = TRUE, conf.level = 0.9) |>
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.604 | 0.504 | 1.199 | 0.232 | -0.229 | 1.438 |
| Party | 1.924 | 0.169 | 11.359 | 0.000 | 1.644 | 2.204 |
| AgeSenCit | 1.033 | 0.784 | 1.317 | 0.190 | -0.265 | 2.330 |
| AgeYadult | -1.203 | 0.928 | -1.297 | 0.197 | -2.739 | 0.332 |
| Party:AgeSenCit | -0.259 | 0.262 | -0.986 | 0.325 | -0.692 | 0.175 |
| Party:AgeYadult | 0.199 | 0.504 | 0.395 | 0.693 | -0.635 | 1.034 |
- What does 0.784, the standard error of AgeSenCit, mean in the context of the data?
- What does 1.317, the test statistic for AgeSenCit, mean in the context of the data?
- What does the p-value 0.190 mean in the context of the data?
- The 90% confidence interval corresponds to what \(\alpha\)-level?
- What is your conclusion about the effect of AgeSenCit?
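As a reminder of how the columns in the output relate to one another, the sketch below redoes the arithmetic for the Party row using the rounded values shown in the table. The variable names are chosen for this illustration, and small discrepancies from the printed output are expected because the inputs are rounded to three digits.

```r
# Rounded values copied from the Party row of the table above
estimate  <- 1.924
std_error <- 0.169
conf_low  <- 1.644
conf_high <- 2.204

# Test statistic for H0: beta = 0 is estimate / standard error
t_stat <- estimate / std_error
t_stat    # about 11.38; the table shows 11.359 because it uses unrounded inputs

# The confidence interval is centered at the estimate
midpoint <- (conf_low + conf_high) / 2
midpoint  # 1.924

# Half-width divided by the standard error recovers the critical value
# used for the 90% interval
t_crit <- (conf_high - conf_low) / (2 * std_error)
t_crit    # about 1.657
```

The same relationships hold for every row: statistic = estimate / std.error, and the interval is estimate plus or minus the critical value times std.error.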
Exercise 6
The following are general questions about regression. They are not specific to the tips data set.
What does it mean for an estimator to be the “least-squares” estimator?
Consider the following derivation of \(Var(\hat{\boldsymbol{\beta}})\), the variance of the least-squares estimator:
\[ \begin{aligned} Var(\hat{\boldsymbol{\beta}}) & = E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T] \\ & = E[((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon})((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon})^T] \\ & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TE(\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ & = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \]
- Explain how to go from Line 1 to Line 2.
- What assumptions are used to go from Line 3 to Line 4?
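The final line of the derivation can be checked empirically with a small simulation. This is only a sketch under assumed settings: the design matrix, the true coefficients, the error standard deviation, and the number of replications are all arbitrary choices made for this illustration, with independent normal errors of constant variance.

```r
set.seed(221)
n     <- 50
sigma <- 2                          # assumed error standard deviation
X     <- cbind(1, runif(n, 0, 10))  # fixed design: intercept and one predictor
beta  <- c(1, 0.5)                  # assumed true coefficients
XtX_inv <- solve(t(X) %*% X)

# Redraw the errors many times, holding X fixed, and re-estimate beta each time
n_sims <- 5000
beta_hats <- t(replicate(n_sims, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.numeric(XtX_inv %*% t(X) %*% y)
}))

emp_var    <- cov(beta_hats)       # empirical covariance of the estimates
theory_var <- sigma^2 * XtX_inv    # sigma^2 (X^T X)^{-1}

emp_var
theory_var
```

With several thousand replications the two matrices should be close entry by entry, though not identical, since the empirical covariance is itself an estimate.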
To submit the AE:
Render the document to produce the PDF with all of your work from today’s class.
Push all your work to your AE repo on GitHub. You’re done! 🎉
Footnotes
[1] Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.