AE 01: Simple linear regression

Houses in Duke Forest

Published

September 3, 2024

Important

Go to the course GitHub organization and locate your ae-01 repo to get started.

Render, commit, and push your responses to GitHub by the end of class to submit your AE.

This AE will not count towards your participation grade.

library(tidyverse)    # data wrangling and visualization
library(tidymodels)   # modeling; loads the broom and yardstick packages
library(openintro)    # duke_forest dataset
library(knitr)        # format output
library(scales)       # format plot axes
library(skimr)        # quickly calculate summary statistics

Data

The data describe houses sold in the Duke Forest neighborhood of Durham, NC, around November 2020. They were originally scraped from Zillow and are available as the duke_forest data set in the openintro R package.

We will focus on two variables:

  • area: Total area of the home in square feet (sqft)

  • price: Sale price in US Dollars (USD)

The goal of this analysis is to use the area to understand variability in the price of homes in Duke Forest.

glimpse(duke_forest)
Rows: 98
Columns: 13
$ address    <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha…
$ price      <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, …
$ bed        <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,…
$ bath       <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,…
$ area       <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,…
$ type       <chr> "Single Family", "Single Family", "Single Family", "Single …
$ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,…
$ heating    <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p…
$ cooling    <fct> central, central, central, central, central, central, centr…
$ parking    <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered…
$ lot        <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,…
$ hoa        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url        <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-…

Exploratory data analysis

Let’s begin by examining the univariate distributions of price and area. The code to visualize price and calculate its summary statistics is below.

ggplot(data = duke_forest, aes(x = price)) + 
  geom_histogram() +
  labs(x = "Price in US dollars", 
       title = "Price of houses in Duke Forest") + 
  scale_x_continuous(labels = label_dollar(scale_cut = cut_long_scale()))

duke_forest |>
  summarise(min = min(price), q1 = quantile(price, 0.25), 
            median = median(price), q3 = quantile(price, 0.75), 
            max = max(price), mean = mean(price), sd = sd(price)) |>
  kable(digits = 3)
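|   min |     q1 | median |     q3 |     max |     mean |       sd |
|------:|-------:|-------:|-------:|--------:|---------:|---------:|
| 95000 | 450625 | 540000 | 643750 | 1520000 | 559898.7 | 225448.1 |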
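
The skimr package loaded at the top offers a shorter route to a similar summary. The block below is a minimal sketch, assuming the openintro and skimr packages are installed; the object name price_summary is only illustrative, and the columns skim() reports may vary slightly between skimr versions.

```r
library(openintro)    # duke_forest dataset
library(skimr)        # skim() summary function

# Summarize a single column in one call. For a numeric variable, skim()
# reports the number of missing values, the mean, the standard deviation,
# and the quantiles p0, p25, p50, p75, and p100.
price_summary <- duke_forest |>
  skim(price)

price_summary
```

The summarise() approach above gives full control over which statistics appear and how they are named, while skim() trades that control for brevity.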

Exercise 1

What are one or two observations about the distribution of price?

Exercise 2

Visualize the distribution of area and calculate summary statistics.

# add code here
# add code here

Exercise 3

What are one or two observations about the distribution of area?

Exercise 4

Fill in the code to visualize the relationship between price and area. What are one or two observations about the relationship between these two variables?

Important

Remove #|eval: false after you have filled in the code!

ggplot(duke_forest, aes(x = ____, y = ____)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "_______",
    y = "_________",
    title = "Price and area of houses in Duke Forest"
  ) +
  scale_y_continuous(labels = label_dollar()) 

Regression model

Exercise 5

You want to fit a model of the form

\[ price = \beta_0 + \beta_1 ~ area + \epsilon, \hspace{5mm} \epsilon \sim N(0, \sigma^2_\epsilon) \]

Would a model of this form be a reasonable fit for the data? Why or why not?

Exercise 6

Fit the linear model described in the previous exercise and neatly display the output.

See notes for example code.
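
As a rough guide to the general pattern, here is a minimal sketch of fitting and displaying a simple linear regression with tidymodels. It deliberately uses the built-in mtcars data and the illustrative name mpg_fit, neither of which is part of this AE, and the course notes may present the steps differently.

```r
library(tidymodels)   # linear_reg(), set_engine(), fit(), tidy()
library(knitr)        # kable()

# Illustration only: mtcars ships with R and stands in for the AE data.
# Specify a linear regression, use lm as the engine, and fit mpg on wt.
mpg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient; kable() formats the table.
tidy(mpg_fit) |>
  kable(digits = 3)
```

To adapt the sketch, replace the formula and the data argument with the response, predictor, and data set for this analysis.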

# add code here

Exercise 7

  • Interpret the slope in the context of the data.

  • Interpret the slope in terms of area increasing by 100 sqft.

  • Which interpretation do you think is more meaningful in practice?

Exercise 8

Does it make sense to interpret the intercept? If so, interpret it in the context of the data. Otherwise, explain why not.

Submission

Important

To submit the AE:

  • Render the document to produce the PDF with all of your work from today’s class.
  • Push all your work to your AE repo on GitHub. You’re done! 🎉