Lab 02: Multiple Linear Regression

Childcare costs in North Carolina

Due date

This lab is due on Thursday, September 19 at 11:59pm. To be considered on time, the following must be done by the due date:

  • Final .qmd and .pdf files pushed to your team’s GitHub repo
  • Final .pdf file submitted on Gradescope

Multiple Linear Regression

In this lab, you will use multiple linear regression to predict a county’s median childcare cost for school-age children based on demographic and labor market factors. You will also start working with your lab teams, write a team agreement, and be introduced to using GitHub for collaboration.

Learning goals

By the end of the lab you will be able to…

  • use multiple linear regression to describe the relationship between a response variable and multiple predictor variables.
  • interpret the coefficients of a multiple linear regression model.
  • use testing data to assess a model’s predictive performance.
  • collaborate with others using a single GitHub repo.

Meet your team!

Click here to see the team assignments for STA 221. This will be your team for labs and the final project. Before you get started on the lab, complete the following:

✅ Come up with a team name. You can’t use the same name as another team, so I encourage you to be creative! Your TA will get your team name by the end of lab.

✅ Fill out the team agreement. This will help you figure out a plan for communication and for working together during labs and outside of class. You can find the team agreement in the GitHub repo team-agreement-[github_team_name].

  • Have one person from the team clone the repo and start a new RStudio project. This person will type the team’s responses as you discuss the questions in the agreement. No one else on the team should type at this point, but everyone should contribute to the discussion.

  • Be sure to push the completed agreement to GitHub. Each team member can refer to the document in this repo or download the PDF of the agreement for future reference. You do not need to submit the agreement on Gradescope.

Getting started

  • A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

  • Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix lab-02. It contains the starter documents you need to complete the lab.

  • Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.

  • Do not make any changes to the .qmd file until the instructions tell you to do so.

Workflow: Using Git and GitHub as a team

Important

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.

The following exercises must be done in order. Only one person should type in the .qmd file, commit, and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

⌨️ Team Member 1: Hands on the keyboard.

🙅🏽 All other team members: Hands off the keyboard until otherwise instructed! [1]

Change the author to your team name and include each team member’s name in the author field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4.

Team Member 1: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub.


Team Members 2, 3, 4: Once Team Member 1 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the updated name in your .qmd file.

Packages

You will use the following packages in today’s lab. Add other packages as needed.

library(tidyverse)
library(tidymodels)
library(knitr)

Data

Today’s data was pulled from the National Database of Childcare Prices and was featured on the data visualization challenge TidyTuesday in May 2023. This lab focuses on a subset of the data that contains information about childcare costs and labor market conditions across the 100 counties in North Carolina. The data set for this lab contains the average values for each variable across the years 2008 - 2018.

You will work with two data sets in this analysis - a training set and a test set. You will primarily work with the training set, observations that are used to explore the data and fit the regression model. At the end of the analysis, you will use the test set of observations to assess the predictive power of the model. Observations are randomly assigned to the training and test sets, so each set can be treated as representative of the original data.

childcare_train <- read_csv("data/childcare-costs-train.csv")
childcare_test <- read_csv("data/childcare-costs-test.csv")

The response variable for this analysis is mcsa, defined as the weekly, full-time median price charged for center-based care for those who are school age (ages 6 - 17 years old), based on the results reported in the market rate survey report for the county.

The full data dictionary is available in the README of the data folder in your GitHub repo. You can also find the full data dictionary on the TidyTuesday page.

Exercises

Goal: The goal of this analysis is to use various demographic and labor-market factors to predict the median price of childcare for school-age children for counties in North Carolina.

Important

Write all code and narrative in your Quarto file. Write all narrative in complete sentences. Make sure the teaching team can read all of your code in your PDF document. This means you will need to break up long lines of code. One way to help avoid long lines of code is to start a new line after every pipe (|>) and plus sign (+).
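As a small illustration of this line-breaking style, the sketch below uses a made-up tibble. The names toy_counties, region_summary, and cost_plot and all values are hypothetical and are not part of the lab data.

```r
library(tidyverse)

# Hypothetical toy data, used only to illustrate code formatting
toy_counties <- tibble(
  county = c("A", "B", "C", "D"),
  region = c("east", "east", "west", "west"),
  cost = c(80, 95, 70, 110)
)

# Start a new line after every pipe (|>)
region_summary <- toy_counties |>
  group_by(region) |>
  summarize(mean_cost = mean(cost))

# Start a new line after every plus sign (+)
cost_plot <- ggplot(toy_counties, aes(x = county, y = cost)) +
  geom_col() +
  labs(
    x = "County",
    y = "Cost"
  )
```

Written this way, each line holds one step, so no single line runs off the edge of the rendered PDF.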


Team Member 1: Type the team’s responses to exercises 1 - 2.

Exercise 1

Let’s start with some exploratory data analysis. Visualize the distribution of the response variable mcsa and calculate summary statistics. Describe the distribution of this variable, including the shape, center, spread, and presence of potential outliers.

Important

You will use childcare_train for all analysis in Exercises 1 - 7.
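One possible way to set up this kind of univariate exploration is sketched below. It runs on a made-up stand-in, toy_train, with invented mcsa values so that it is self-contained. In the lab you would use childcare_train instead, and the bin width and axis labels here are assumptions, not requirements.

```r
library(tidyverse)

# Made-up values standing in for the mcsa column of childcare_train
toy_train <- tibble(mcsa = c(62, 70, 75, 78, 80, 84, 90, 95, 120))

# Histogram of the response variable (binwidth is an arbitrary choice)
mcsa_hist <- ggplot(toy_train, aes(x = mcsa)) +
  geom_histogram(binwidth = 10) +
  labs(
    x = "Median weekly price, school-age care",
    y = "Number of counties"
  )

# Summary statistics describing center and spread
mcsa_summary <- toy_train |>
  summarize(
    min = min(mcsa),
    q1 = quantile(mcsa, 0.25),
    median = median(mcsa),
    q3 = quantile(mcsa, 0.75),
    max = max(mcsa),
    mean = mean(mcsa),
    sd = sd(mcsa)
  )
```

The plot and the summary table together cover shape, center, spread, and potential outliers. The written description is still up to your team.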

Exercise 2

As you can see from the data dictionary in the README of the data folder, there are many interesting potential variables that could be included in the model to predict median childcare cost for school-age children. Therefore, we will do some feature selection and feature design to choose potential predictors and construct new ones.

As a team, select four variables you want to use as predictors for the model. For each variable, state the variable name, definition, and a brief explanation about why your team hypothesizes this will be a relevant predictor of median childcare costs. The explanation may (but is not required to) include some short exploratory analysis.

Team Member 1: Render, commit, and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 1 - 2.

Team Member 2: It’s your turn! Type the team’s response to exercises 3 - 4.

Exercise 3

Once we’ve identified potential predictor variables, we often need to transform some variables (e.g., change raw counts into proportions) or create new ones (e.g., create a categorical variable out of quantitative data) before fitting the regression model. This process is particularly useful when putting a variable in the model “as-is” may result in interpretation issues.

Choose one of the variables selected in the previous exercise. For this variable,

  • Transform the variable or use it to create a new variable. Be sure to save the variable to the childcare_train data frame.

  • Briefly explain your reasoning for the transformation or new variable.

  • Use visualizations and/or summary statistics to display the distribution of the original variable and the transformed / newly created variable. Note: This is to help ensure the transformation / new variable is what you expect.

An example using h_6to17_both_work is below. Note you cannot use this variable for your transformation / new variable.

Example

Say we believe that the number of households with two working parents increases demand for childcare services, and hence their price. We have the column h_6to17_both_work encoding the number of such households per county. The raw population count differs across counties, so a larger value of h_6to17_both_work may reflect population size and not necessarily imply that this type of household is more prevalent in the county.

A reasonable approach is to create a variable encoding the proportion of households with both parents working. We could do this by creating a variable p_6to17_both_work:

p_6to17_both_work = h_6to17_both_work / households

In this case, we would then use p_6to17_both_work not h_6to17_both_work as a predictor in the model.
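A minimal sketch of how this new variable could be saved to a data frame is shown below. It uses a made-up stand-in, toy_train, with invented counts so that it runs on its own. In the lab, the same mutate() step would be applied to childcare_train.

```r
library(tidyverse)

# Made-up counts standing in for two columns of childcare_train
toy_train <- tibble(
  h_6to17_both_work = c(1200, 450, 3000),
  households = c(20000, 9000, 40000)
)

# Create the proportion and save it back to the data frame
toy_train <- toy_train |>
  mutate(p_6to17_both_work = h_6to17_both_work / households)
```

Assigning the result back to the same name is what saves the new column, so it is available when the model is fit later.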

You may decide to transform and/or create multiple new variables; however, you will only be graded on one of them.

Important

You will use the transformed / new variable (not the original variable) in the model!

Exercise 4

Now let’s conduct some bivariate exploratory data analysis. Visualize the relationship between the response variable and one of your predictor variables.

Write two distinct observations from the visualization.

Team Member 2: Render, commit, and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 3 - 4.

Team Member 3: It’s your turn! Type the team’s response to exercises 5 - 6.

Exercise 5

Use the matrix form of the model to represent the regression model with the variables you selected and transformed/created in exercises 2 and 3 as the predictors. For each symbol in the model

  • describe what it represents, and
  • state the dimensions.

The description and dimensions should be in the context of these data, not in general.
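As a reminder, the general matrix form of a linear regression model is sketched below. Here n (number of observations) and p (number of predictors) are generic placeholders, and the notation is an assumption that may differ slightly from the course notes. Your answer should replace n and p with the values for these data and describe each symbol in context.

```latex
\underbrace{\mathbf{y}}_{n \times 1}
  = \underbrace{\mathbf{X}}_{n \times (p+1)}
    \underbrace{\boldsymbol{\beta}}_{(p+1) \times 1}
  + \underbrace{\boldsymbol{\epsilon}}_{n \times 1}
```

The extra column in X and the extra entry in the coefficient vector correspond to the intercept.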

Exercise 6

Use lm to fit the regression model you described in the previous exercise.

  • Neatly display the model using a reasonable number of digits.
  • Interpret the coefficient for one predictor in the model.
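One way to fit and display a model of this kind is sketched below. It uses simulated data so that it runs on its own. The data frame toy_train and the predictors x1 and x2 are hypothetical stand-ins, and the name childcare_fit is chosen to match the object used in the prediction code for Exercise 8.

```r
library(tidymodels) # attaches broom, which provides tidy()
library(knitr)

# Simulated stand-in for childcare_train; x1 and x2 are hypothetical predictors
set.seed(221)
toy_train <- data.frame(
  x1 = runif(50),
  x2 = rnorm(50)
)
toy_train$mcsa <- 80 + 20 * toy_train$x1 - 5 * toy_train$x2 +
  rnorm(50, sd = 4)

# Fit the multiple linear regression model
childcare_fit <- lm(mcsa ~ x1 + x2, data = toy_train)

# Display the coefficient table, rounded to three digits
tidy(childcare_fit) |>
  kable(digits = 3)
```

In a model like this, the coefficient of x1 is read as the expected change in mcsa for a one-unit increase in x1, holding x2 constant.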

Team Member 3: Render, commit, and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the responses to exercises 5 - 6.

Team Member 4: It’s your turn! Type the team’s response to exercises 7 - 9.

Exercise 7

Now let’s assess the fit of the model.

  • How much of the variability in the childcare costs is explained by your chosen predictor variables?
  • Based on this, do you think the model explains a significant portion of the variability in childcare costs for school-age children in North Carolina? Briefly explain.
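The proportion of variability explained is typically summarized by R-squared. The sketch below shows two equivalent ways to obtain it, using simulated data and hypothetical predictors x1 and x2. In the lab, the fitted model from Exercise 6 would take the place of childcare_fit here.

```r
# Simulated stand-in for childcare_train; x1 and x2 are hypothetical predictors
set.seed(221)
toy_train <- data.frame(
  x1 = runif(50),
  x2 = rnorm(50)
)
toy_train$mcsa <- 80 + 20 * toy_train$x1 - 5 * toy_train$x2 +
  rnorm(50, sd = 4)

childcare_fit <- lm(mcsa ~ x1 + x2, data = toy_train)

# R-squared as reported by the model summary
r_sq <- summary(childcare_fit)$r.squared

# The same quantity computed by hand: 1 - SS_residual / SS_total
ss_res <- sum(residuals(childcare_fit)^2)
ss_tot <- sum((toy_train$mcsa - mean(toy_train$mcsa))^2)
r_sq_manual <- 1 - ss_res / ss_tot
```

The by-hand version makes explicit that R-squared compares the model's residual variation with the total variation in the response.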

Exercise 8

Now let’s use the testing data to explore the predictive power of the model.

  • Add the variable you created in Exercise 3 to the testing data.

  • Then, use the code below to compute the predicted childcare costs for the observations in the testing data using the predict function.

# compute predictions
pred <- predict(childcare_fit, childcare_test)

# add predictions to testing data set
childcare_test <- childcare_test |>
  mutate(pred = pred)

Exercise 9

  • Compute the RMSE for the test set, and compare it to the standard deviation of the response variable mcsa.
  • How do these values compare?

  • Based on this, how would you assess the predictive power of the model?
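A minimal sketch of the RMSE calculation is given below, using simulated data split into hypothetical toy_train and toy_test sets. Computing the standard deviation on the test set is an assumption made for this illustration. In the lab, the pred column added to childcare_test in Exercise 8 plays the role of toy_test$pred.

```r
# Simulated data with one hypothetical predictor x1
set.seed(221)
toy <- data.frame(x1 = runif(80))
toy$mcsa <- 80 + 20 * toy$x1 + rnorm(80, sd = 4)

# Hypothetical split: first 60 rows for training, last 20 for testing
toy_train <- toy[1:60, ]
toy_test <- toy[61:80, ]

toy_fit <- lm(mcsa ~ x1, data = toy_train)
toy_test$pred <- predict(toy_fit, toy_test)

# RMSE: square root of the mean squared prediction error
test_rmse <- sqrt(mean((toy_test$mcsa - toy_test$pred)^2))

# Standard deviation of the response, for comparison
mcsa_sd <- sd(toy_test$mcsa)
```

Both values are in the units of the response, which is what makes the comparison meaningful. The yardstick package loaded with tidymodels also provides an rmse() function if you prefer not to compute it by hand.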

Team Member 4: Render, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and the rest of the team can see the completed lab.

All other team members: Pull to get the updated documents from GitHub. Click on the .qmd file, and you should see the team’s completed lab!

Exercise 10

If you haven’t already, make sure you have completed the team agreement (see the instructions in the “Meet your team!” section).

Wrapping up

Team Member 2: Render the document and confirm that the changes are visible in the PDF. Then, commit (with an informative commit message) both the .qmd and PDF documents, and finally push the changes to GitHub. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Once Team Member 2 is done rendering, committing, and pushing, confirm that the changes are visible on GitHub in your team’s lab repo. Then, in RStudio, click the Pull button in the Git pane to get the updated document. You should see the final version of your .qmd file.

Submission

You will submit the PDF documents for labs, homework, and exams to Gradescope as part of your final submission.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

  • Access Gradescope through the menu on the STA 221 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Select all team members’ names, so they receive credit on the assignment. Click here for a video on adding team members to an assignment on Gradescope.

  • Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component                Points
Ex 1                     5
Ex 2                     5
Ex 3                     6
Ex 4                     4
Ex 5                     5
Ex 6                     4
Ex 7                     4
Ex 8                     2
Ex 9                     5
Ex 10                    5
Workflow & formatting    5

The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.

Footnotes

  1. Don’t trust yourself to keep your hands off the keyboard? Put them in your pocket or cross your arms. No matter how silly it might feel, resist the urge to touch your keyboard until otherwise instructed!