HW 04: Logistic regression

Due date

This assignment is due on Thursday, November 21 at 11:59pm.

Introduction

In this assignment you will work with logistic regression models and use them to understand multivariable relationships in a variety of data contexts.

Learning goals

In this assignment, you will…

  • Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables

  • Conduct exploratory data analysis for logistic regression

  • Interpret the coefficients of a logistic regression model

  • Use statistics to help choose the best fit model

  • Assess the fit of a logistic regression model

Getting started

  • Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the assignment.

  • Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

In the 2014 article “The Biggest Predictor of How Long You’ll Be Unemployed Is When You Lose Your Job”, author Ben Casselman analyzes the relationship between numerous factors such as age, race, and education and the odds an adult is long-term unemployed (unemployed for over a year).

Casselman fits a logistic regression model using the unemployment rate at the time the person lost their job to predict whether an adult is long-term unemployed. He states the following from the model:

“A one-point increase in the unemployment rate raises an individual’s odds of becoming long-term unemployed by 35 percent.”

What is the coefficient for unemployment rate in this model? Show how you calculated the answer.
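
As a general illustration of how a "percent change in the odds" statement maps to a logistic regression coefficient, here is a minimal sketch. The 20% figure is a made-up value chosen for illustration; it is not the number quoted in the article.

```r
# Hypothetical example: suppose a one-unit increase in x raises the odds by 20%.
pct_increase <- 0.20              # assumed value, for illustration only
odds_ratio   <- 1 + pct_increase  # multiplicative change in the odds
beta         <- log(odds_ratio)   # the coefficient lives on the log-odds scale

beta
exp(beta)  # exponentiating the coefficient recovers the odds ratio
```

The key idea is that coefficients are additive on the log-odds scale, so they become multiplicative on the odds scale after exponentiating.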

Exercise 2 (see footnote 1)

You would like to study how well the combined earnings of a child’s parents predict high school graduation. Suppose the probability a child graduates from high school is 27% for children whose parents earn no income and is 88% for children whose parents earn $60,000 combined income.

Write the equation for the logistic regression model with the single predictor variable income that is consistent with the information above. You can assume income is measured in tens of thousands of dollars.
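
The sketch below shows one way to recover an intercept and slope from two known probabilities, using the base R functions qlogis() (logit) and plogis() (inverse logit). The probabilities and the x value are hypothetical and differ from those in the exercise.

```r
# Hypothetical inputs, NOT the values given in the exercise.
p0 <- 0.10  # assumed P(y = 1) when x = 0
p1 <- 0.50  # assumed P(y = 1) when x = 4
x1 <- 4

beta0 <- qlogis(p0)                      # log-odds at x = 0
beta1 <- (qlogis(p1) - qlogis(p0)) / x1  # change in log-odds per unit of x

# Check: the fitted curve passes through both points.
plogis(beta0 + beta1 * c(0, x1))
```

Two points determine the line on the log-odds scale, just as two points determine a line in simple linear regression.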

Exercise 3 (see footnote 2)

Berry (2001) examined the effect of a player’s draft position among the pool of potential players in a given year on the probability of eventually being named an all star.

Let \(d\) be the draft position \((d = 1, 2, 3, \ldots)\) and \(\pi\) be the probability of eventually being named an all star. The researcher modeled the relationship between \(d\) and \(\pi\) using the following model:

\[ \log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 \log d_i \]

  1. Using this model, show that the odds of being named an all star are \(e^{\beta_0}d^{\beta_1}\) . Then, show how to calculate \(\pi_i\) based on this model.

  2. Show that the odds of being named an all star for a first draft pick are \(e^{\beta_0}\) .

  3. In the study, Berry reported that for professional basketball \(\hat{\beta}_0 = 2.3\) and \(\hat{\beta}_1 = -1.1\), and for professional baseball \(\hat{\beta}_0 = 0.7\) and \(\hat{\beta}_1 = -0.6\) . Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.
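
A numerical check can be a useful companion to the algebra, although it does not replace the derivation. The sketch below evaluates the model at a few draft positions using hypothetical coefficients (not the estimates reported by Berry) and confirms that the two expressions for the odds agree.

```r
# Hypothetical coefficients, chosen only for illustration.
beta0 <- 1.0
beta1 <- -0.5
d     <- 1:5  # draft positions

log_odds <- beta0 + beta1 * log(d)
odds     <- exp(log_odds)
prob     <- odds / (1 + odds)

# The two expressions for the odds agree numerically.
all.equal(odds, exp(beta0) * d^beta1)
# The probability matches the inverse logit of the log-odds.
all.equal(prob, plogis(log_odds))
```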

Exercise 4

In the paper “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association With Criminal Justice Outcomes and Racial Equity”, Marlowe et al. (2020) analyze the risk predictions produced by a black-box algorithm used to determine whether a defendant is considered “high risk” of being rearrested if they are released while awaiting trial. Such algorithms are used by judges in some states to help determine whether or not defendants are released while awaiting trial.

The authors examine the algorithm’s risk predictions and whether a person was rearrested for over 500 defendants released pretrial in a southern state. For each person, the algorithm produced one of the following predictions: “High Risk” or “Low Risk”. The observed outcome was “Rearrested” (coded as 1) or “Not Rearrested” (coded as 0). Below are some results from the analysis:

  • Sensitivity: 86%
  • Specificity: 24%
  • Positive predictive power: 57%
  • Negative predictive power: 60%
Tip
  • Positive Predictive Power: P(Y = 1 | Y classified as 1 from the model)

  • Negative Predictive Power: P(Y = 0 | Y classified as 0 from the model)

  1. Explain what each of the following mean in the context of the analysis:

    • Sensitivity

    • Specificity

    • Positive predictive power

    • Negative predictive power

  2. What is the false positive rate? What does this value mean in the context of the analysis?
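
For reference, the sketch below computes these classification metrics from the four cells of a confusion matrix. The counts are hypothetical and are not the data from Marlowe et al. (2020).

```r
# Hypothetical counts, for illustration only.
tp <- 40  # classified 1, observed 1
fn <- 10  # classified 0, observed 1
fp <- 30  # classified 1, observed 0
tn <- 20  # classified 0, observed 0

sensitivity <- tp / (tp + fn)  # P(classified 1 | Y = 1)
specificity <- tn / (tn + fp)  # P(classified 0 | Y = 0)
ppv         <- tp / (tp + fp)  # P(Y = 1 | classified 1)
npv         <- tn / (tn + fn)  # P(Y = 0 | classified 0)
fpr         <- fp / (fp + tn)  # P(classified 1 | Y = 0), equal to 1 - specificity

c(sensitivity = sensitivity, specificity = specificity,
  ppv = ppv, npv = npv, fpr = fpr)
```

Note that sensitivity and specificity condition on the observed outcome, while the predictive powers condition on the classification.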

Exercise 5 (see footnote 3)

In 2009, reporter Gina Kolata reported the story “Picture Emerging on Genetic Risks of IVF” in the New York Times. The calculated results from the study are shown below.

In November, the Centers for Disease Control and Prevention published a paper reporting that babies conceived with IVF, or with a technique in which sperm are injected directly into eggs, have a slightly increased risk of several birth defects, including a hole between the two chambers of the heart, a cleft lip or palate, an improperly developed esophagus and a malformed rectum. The study involved 9,584 babies with birth defects and 4,792 babies without. Among the mothers of babies without birth defects, 1.1% had used IVF or related methods, compared with 2.4% of mothers of babies with birth defects.

  1. Use the calculated results to construct a table displaying the relationship between whether or not the mother used IVF and whether or not the baby had a birth defect.

  2. Calculate the odds that a baby in this study was born with a birth defect.

  3. The conclusion from the report was the following:

The findings are considered preliminary, and researchers say they believe IVF does not carry excessive risks. There is a 3% chance that any given baby will have a birth defect.

Does the data support this conclusion? Include any relevant calculations to support your answer.
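
As a reminder of the mechanics, the sketch below builds a two-way table and computes odds and an odds ratio from it. The counts and the labels exposure and outcome are hypothetical and unrelated to the IVF study.

```r
# Hypothetical 2 x 2 table, for illustration only.
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome  = c("yes", "no")))

# Odds of the outcome within each exposure group.
odds_exposed   <- tab["exposed", "yes"] / tab["exposed", "no"]
odds_unexposed <- tab["unexposed", "yes"] / tab["unexposed", "no"]

# Odds ratio comparing exposed to unexposed.
odds_ratio <- odds_exposed / odds_unexposed

# Overall odds of the outcome in the table, ignoring exposure.
overall_odds <- sum(tab[, "yes"]) / sum(tab[, "no"])

c(odds_ratio = odds_ratio, overall_odds = overall_odds)
```

Whether quantities like the overall odds are meaningful depends on how the sample was collected, which is worth keeping in mind when interpreting any table.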

Applied exercises

Instructions

The applied exercises are focused on applying the concepts to analyze data.

All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: Understanding pro-environmental behavior

Ibanez and Roussel (2022) conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch a video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants played a game in which they had an opportunity to donate to an environmental organization.

The data set is available in nature-experiment.csv in the data folder. We will use the following variables:

  • donation_binary:

    • 1 - participant donated to environmental organization
    • 0 - participant did not donate
  • age: Age in years

  • gender: Participant’s reported gender

    • 1 - male

    • 0 - non-male

  • treatment:

    • “URBAN (T1)” - the control group
    • “NATURE (T2)” - the treatment group
  • nep_high:

    • 1 - score of 4 or higher on the New Ecological Paradigm (NEP)
    • 0 - score less than 4
Tip

See the Introduction and Methods sections of Ibanez and Roussel (2022) for more detail about the variables.


Exercise 6

  1. Create a visualization of the relationship between donating and treatment. Use the visualization to describe the relationship between the two variables.

  2. Create a visualization of the relationship between donating and age. Use the visualization to describe the relationship between the two variables.

  3. We would like to use the mean-centered value of age in the model. Create a new variable age_cent that contains the mean-centered ages.
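
Mean-centering subtracts the sample mean from each value, so the centered variable has mean zero. A minimal sketch on a toy data frame follows; the names toy, x, and x_cent are hypothetical. The same calculation can be written inside dplyr's mutate() if you prefer the tidyverse style.

```r
# Toy data, for illustration only.
toy <- data.frame(x = c(21, 34, 45, 52, 28))

# Subtract the sample mean from every observation.
toy$x_cent <- toy$x - mean(toy$x)

toy
mean(toy$x_cent)  # zero, up to floating-point error
```

After centering, the model intercept refers to an observation at the average value of the predictor rather than at zero.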

Exercise 7

  1. Fit a logistic regression model using age_cent, gender, treatment, and nep_high to predict the likelihood of donating. Neatly display the model using 3 digits.

  2. The researchers are most interested in the effect of watching the nature documentary. Describe the effect of treatment in terms of the odds of donating.

  3. What group of participants is described by the intercept? What is the predicted probability a randomly selected individual in this group donates?

  4. Based on the model, what is the predicted probability of donating for a 25-year-old male in the treatment group with a NEP score of 5?
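
The general workflow (fit, display coefficients, convert to odds ratios, predict a probability) can be sketched with base R's glm() on simulated data. The data frame sim and its columns x_cent, group, and y are hypothetical stand-ins, not the assignment data. For a neatly formatted table in a rendered document, the coefficient table could be passed to knitr's kable() with digits = 3.

```r
# Simulated data, for illustration only.
set.seed(221)
n <- 200
sim <- data.frame(
  x_cent = rnorm(n),
  group  = factor(sample(c("A", "B"), n, replace = TRUE))
)
sim$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * sim$x_cent + 0.6 * (sim$group == "B")))

# Logistic regression: binomial family with the default logit link.
fit <- glm(y ~ x_cent + group, data = sim, family = binomial)

round(coef(summary(fit)), 3)  # coefficient table on the log-odds scale
exp(coef(fit))                # multiplicative effects on the odds

# Predicted probability for one new observation.
new_obs <- data.frame(x_cent = 0.5,
                      group  = factor("B", levels = levels(sim$group)))
p_new <- predict(fit, newdata = new_obs, type = "response")
p_new
```

Setting type = "response" returns a probability; the default returns the predicted log-odds.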

Exercise 8

Produce the ROC curve for the model from the previous exercise and calculate the area under the curve (AUC). Write 1 - 2 sentences describing how well the model fits the data.
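
A minimal sketch of an ROC curve and AUC using the pROC package on simulated data is shown below. The variables x, y, and p_hat are hypothetical; in practice the predictor would be the fitted probabilities from your own model.

```r
library(pROC)

# Simulated data, for illustration only.
set.seed(221)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

fit   <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")  # fitted probabilities

# levels = c(control, case); direction "<" means cases tend to have higher scores.
roc_obj <- roc(response = y, predictor = p_hat,
               levels = c(0, 1), direction = "<", quiet = TRUE)

auc_val <- as.numeric(auc(roc_obj))
auc_val

# legacy.axes = TRUE puts 1 - specificity on the x-axis.
plot(roc_obj, legacy.axes = TRUE)
```

An AUC near 0.5 indicates the model distinguishes the two outcomes no better than chance, while values closer to 1 indicate better discrimination.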

Exercise 9

The authors include an interaction effect between nep_high and treatment in one of their models.

  1. Explain what an interaction between nep_high and treatment means in the context of the data.
  2. Create a visualization to explore the potential of an interaction effect between these two variables. Based on the visualization, do you think there is an interaction effect? Briefly explain.
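
One way to think about an interaction numerically is to compare how the effect of one variable changes across levels of the other. The sketch below does this with group proportions on simulated data; the binary variables a and b are hypothetical and the simulation deliberately includes an interaction.

```r
# Simulated data with a built-in interaction, for illustration only.
set.seed(221)
n <- 400
sim <- data.frame(a = rbinom(n, 1, 0.5), b = rbinom(n, 1, 0.5))
sim$y <- rbinom(n, 1, plogis(-1 + 0.2 * sim$a + 0.3 * sim$b + 1.0 * sim$a * sim$b))

# Proportion with y = 1 in each of the four a-by-b groups.
props <- with(sim, tapply(y, list(a = a, b = b), mean))
props

# Empirical log-odds in each group.
log_odds <- qlogis(props)

# Difference in the effect of a between the two levels of b.
interaction_contrast <-
  (log_odds["1", "1"] - log_odds["0", "1"]) -
  (log_odds["1", "0"] - log_odds["0", "0"])
interaction_contrast
```

If there were no interaction, the effect of a on the log-odds would be roughly the same at both levels of b and this contrast would be near zero. A plot of the four proportions conveys the same comparison visually.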

Exercise 10

Conduct a drop-in-deviance test to determine if the interaction between nep_high and treatment should be added to the model fit in Exercise 7. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.
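
The mechanics of a drop-in-deviance test can be sketched by comparing nested models with anova() on simulated data. The variables a, b, and y are hypothetical. The hypotheses being tested are that the interaction coefficient equals zero versus that it does not.

```r
# Simulated data, for illustration only.
set.seed(221)
n <- 400
sim <- data.frame(a = rbinom(n, 1, 0.5), b = rbinom(n, 1, 0.5))
sim$y <- rbinom(n, 1, plogis(-1 + 0.2 * sim$a + 0.3 * sim$b + 1.0 * sim$a * sim$b))

reduced <- glm(y ~ a + b, data = sim, family = binomial)  # no interaction
full    <- glm(y ~ a * b, data = sim, family = binomial)  # adds a:b

# Built-in comparison of the nested models.
test_out <- anova(reduced, full, test = "Chisq")
test_out

# The same test computed by hand.
G       <- deviance(reduced) - deviance(full)        # drop in deviance
df_diff <- df.residual(reduced) - df.residual(full)  # number of added parameters
p_value <- pchisq(G, df = df_diff, lower.tail = FALSE)
c(G = G, df = df_diff, p_value = p_value)
```

The test statistic is compared to a chi-squared distribution with degrees of freedom equal to the number of parameters added in the larger model.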

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

To submit your assignment:

  • Access Gradescope through the menu on the STA 221 Canvas site.

  • Click on the assignment, and you’ll be prompted to submit it.

  • Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).

  • Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component               Points
Ex 1                    3
Ex 2                    4
Ex 3                    6
Ex 4                    4
Ex 5                    5
Ex 6                    5
Ex 7                    7
Ex 8                    4
Ex 9                    5
Ex 10                   4
Workflow & formatting   3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.

References

Berry, Scott M. 2001. “A Statistician Reads the Sports Pages: Luck in Sports.” Chance 14 (1): 52–57.
Ibanez, Lisette, and Sébastien Roussel. 2022. “The Impact of Nature Video Exposure on Pro-Environmental Behavior: An Experimental Investigation.” PLOS ONE 17 (11): e0275806.
Marlowe, Douglas B, Timothy Ho, Shannon M Carey, and Carly D Chadick. 2020. “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association with Criminal Justice Outcomes and Racial Equity.” Law and Human Behavior 44 (5): 361.

Footnotes

  1. Exercise adapted from an exercise in Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill.

  2. Exercise adapted from an exercise in Categorical Data Analysis by Agresti.

  3. Exercise adapted from an exercise in Beyond Multiple Linear Regression by Roback and Legler.