library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)
# load other packages as needed
HW 04: Logistic regression
This assignment is due on Thursday, November 21 at 11:59pm.
Introduction
In this assignment you will work with logistic regression models and use them to understand multivariable relationships in a variety of data contexts.
Learning goals
In this assignment, you will…
Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables
Conduct exploratory data analysis for logistic regression
Interpret the coefficients of a logistic regression model
Use statistics to help choose the best fit model
Assess the fit of a logistic regression model
Getting started
Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 01 instructions for details on cloning a repo and starting a new project in R.
Packages
The following packages are used in this assignment:
Conceptual exercises
Instructions
The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.
You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.
Exercise 1
In the 2014 article “The Biggest Predictor of How Long You’ll Be Unemployed Is When You Lose Your Job”, author Ben Casselman analyzes the relationship between numerous factors such as age, race, and education and the odds an adult is long-term unemployed (unemployed for over a year).
Casselman fits a logistic regression model using the unemployment rate at the time the person lost their job to predict whether an adult is long-term unemployed. He states the following from the model:
“A one-point increase in the unemployment rate raises an individual’s odds of becoming long-term unemployed by 35 percent.”
What is the coefficient for unemployment rate in this model? Show how you calculated the answer.
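The link between a percent change in odds and a logistic regression coefficient can be sketched in R. The 20 percent figure below is a hypothetical value chosen for illustration, not the number quoted in the article.

```r
# Hypothetical example: a one-unit increase in x raises the odds by 20 percent.
pct_increase <- 20

# In logistic regression the odds are multiplied by exp(beta) for each
# one-unit increase in x, so exp(beta) = 1 + pct_increase / 100.
odds_multiplier <- 1 + pct_increase / 100
beta <- log(odds_multiplier)

# Going back from the coefficient to the percent change in odds.
pct_from_beta <- (exp(beta) - 1) * 100
```

The same two lines of arithmetic work in either direction, which is a useful check when translating between a reported percent change and a model coefficient.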
Exercise 2
You would like to study how well the combined earnings of a child’s parents predict high school graduation. Suppose the probability a child graduates from high school is 27% for children whose parents earn no income and is 88% for children whose parents earn $60,000 combined income.
Write the equation for the logistic regression model with the single predictor variable income that is consistent with the information above. You can assume income is measured in tens of thousands of dollars.
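One way to recover a single-predictor logistic model from two known probabilities is to work on the log-odds scale, where the model is a straight line. The sketch below uses made-up probabilities and predictor values (not those in the exercise) and the base R functions qlogis() and plogis().

```r
# Hypothetical probabilities at two values of a predictor x
# (illustrative values only, not the ones from the exercise).
x <- c(0, 5)
p <- c(0.20, 0.70)

# qlogis() is the logit function: log(p / (1 - p)).
log_odds <- qlogis(p)

# Two points determine the line log-odds = b0 + b1 * x.
b1 <- (log_odds[2] - log_odds[1]) / (x[2] - x[1])
b0 <- log_odds[1] - b1 * x[1]

# plogis() is the inverse logit; it should recover the original probabilities.
p_check <- plogis(b0 + b1 * x)
```

Checking that p_check matches p is a quick way to confirm the intercept and slope are consistent with the stated probabilities.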
Exercise 3
Berry (2001) examined the effect of a player’s draft position among the pool of potential players in a given year on the probability of eventually being named an all star.
Let \(d\) be the draft position \((d = 1, 2, 3, \ldots)\) and \(\pi\) be the probability of eventually being named an all star. The researcher modeled the relationship between \(d\) and \(\pi\) using the following model:
\[ \log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 \log d_i \]
Using this model, show that the odds of being named an all star are \(e^{\beta_0}d^{\beta_1}\) . Then, show how to calculate \(\pi_i\) based on this model.
Show that the odds of being named an all star for a first draft pick are \(e^{\beta_0}\) .
In the study, Berry reported that for professional basketball \(\hat{\beta}_0 = 2.3\) and \(\hat{\beta}_1 = -1.1\), and for professional baseball \(\hat{\beta}_0 = 0.7\) and \(\hat{\beta}_1 = -0.6\) . Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.
Exercise 4
In the paper “Employing Standardized Risk Assessment in Pretrial Release Decisions: Association With Criminal Justice Outcomes and Racial Equity” Marlowe et al. (2020) analyze the risk predictions produced by a black-box algorithm used to determine whether a defendant is considered “high risk” of being rearrested if they are released while awaiting trial. Such algorithms are used by judges in some states to help determine whether or not defendants are released while awaiting trial.
The authors examine the algorithm’s risk predictions and whether a person was rearrested for over 500 defendants released pretrial in a southern state. For each person, the algorithm produced one of the following predictions: “High Risk” or “Low Risk”. The observed outcome was “Rearrested” (coded as 1) or “Not Rearrested” (coded as 0). Below are some results from the analysis:
- Sensitivity: 86%
- Specificity: 24%
- Positive predictive power: 57%
- Negative predictive power: 60%
Positive Predictive Power: P(Y = 1 | Y classified as 1 from the model)
Negative Predictive Power: P(Y = 0 | Y classified as 0 from the model)
Explain what each of the following mean in the context of the analysis:
Sensitivity
Specificity
Positive predictive power
Negative predictive power
What is the false positive rate? What does this value mean in the context of the analysis?
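The classification metrics defined above can all be computed from the four cells of a confusion matrix. The counts in this sketch are hypothetical and are not taken from the paper.

```r
# Hypothetical counts (not the data from the paper).
tp <- 40  # classified as 1, observed 1
fp <- 30  # classified as 1, observed 0
fn <- 10  # classified as 0, observed 1
tn <- 20  # classified as 0, observed 0

sensitivity <- tp / (tp + fn)  # P(classified as 1 | Y = 1)
specificity <- tn / (tn + fp)  # P(classified as 0 | Y = 0)
ppv <- tp / (tp + fp)          # P(Y = 1 | classified as 1)
npv <- tn / (tn + fn)          # P(Y = 0 | classified as 0)

# The false positive rate is the complement of specificity.
false_positive_rate <- fp / (fp + tn)
```

Note that sensitivity and specificity condition on the observed outcome, while the predictive powers condition on the model's classification.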
Exercise 5
In 2009, reporter Gina Kolata reported the story “Picture Emerging on Genetic Risks of IVF” in the New York Times. The calculated results from the study are shown below.
In November, the Centers for Disease Control and Prevention published a paper reporting that babies conceived with IVF, or with a technique in which sperm are injected directly into eggs, have a slightly increased risk of several birth defects, including a hole between the two chambers of the heart, a cleft lip or palate, an improperly developed esophagus and a malformed rectum. The study involved 9,584 babies with birth defects and 4,792 babies without. Among the mothers of babies without birth defects, 1.1% had used IVF or related methods, compared with 2.4% of mothers of babies with birth defects.
Use the calculated results to construct a table displaying the relationship between whether or not the mother used IVF and whether or not the baby had a birth defect.
Calculate the odds that a baby in this study was born with a birth defect.
The conclusion from the report was the following:
The findings are considered preliminary, and researchers say they believe IVF does not carry excessive risks. There is a 3% chance that any given baby will have a birth defect.
Does the data support this conclusion? Include any relevant calculations to support your answer.
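As a general illustration of moving from a two-way table of counts to odds, the sketch below uses a hypothetical table. The counts, row names, and column names are invented for the example and are not the study's values.

```r
# Hypothetical 2x2 table of counts (rows: exposure, columns: outcome).
tab <- matrix(c(30, 70,
                20, 180),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              outcome = c("yes", "no")))

# Proportion and odds of the outcome among everyone in the table.
p_outcome <- sum(tab[, "yes"]) / sum(tab)
odds_outcome <- p_outcome / (1 - p_outcome)

# Odds of the outcome within each exposure group, and their ratio.
odds_by_group <- tab[, "yes"] / tab[, "no"]
odds_ratio <- odds_by_group["yes"] / odds_by_group["no"]
```

Whether a proportion computed this way estimates a population probability depends on how the observations were sampled, so the study design should be considered before interpreting it.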
Applied exercises
Instructions
The applied exercises are focused on applying the concepts to analyze data.
All work for the applied exercises must be typed in your Quarto document following a reproducible workflow.
Write all narrative using complete sentences and include informative axis labels / titles on visualizations.
Data: Understanding pro-environmental behavior
Ibanez and Roussel (2022) conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch a video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants played a game in which they had an opportunity to donate to an environmental organization.
The data set is available in nature-experiment.csv in the data folder. We will use the following variables:
donation_binary:
- 1 - participant donated to environmental organization
- 0 - participant did not donate
age: Age in years
gender: Participant’s reported gender
- 1 - male
- 0 - non-male
treatment:
- “URBAN (T1)” - the control group
- “NATURE (T2)” - the treatment group
nep_high:
- 1 - score of 4 or higher on the New Ecological Paradigm (NEP)
- 0 - score less than 4
See the Introduction and Methods sections of Ibanez and Roussel (2022) for more detail about the variables.
Click here to access the paper online.
Exercise 6
Create a visualization of the relationship between donating and treatment. Use the visualization to describe the relationship between the two variables.
Create a visualization of the relationship between donating and age. Use the visualization to describe the relationship between the two variables.
We would like to use the mean-centered value of age in the model. Create a new variable age_cent that contains the mean-centered ages.
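Mean-centering subtracts the sample mean from each value, so the centered variable has mean zero. A minimal base R sketch, using a hypothetical data frame named participants with made-up ages in place of the real data:

```r
# Hypothetical ages standing in for the age column of the data set.
participants <- data.frame(age = c(19, 22, 25, 31, 43))

# Subtract the sample mean from every value to create the centered variable.
participants$age_cent <- participants$age - mean(participants$age)
```

With the tidyverse loaded, the same step is commonly written with mutate(). After centering, the model intercept refers to a participant of average age rather than age zero.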
Exercise 7
Fit a logistic regression model using age_cent, gender, treatment, and nep_high to predict the likelihood of donating. Neatly display the model using 3 digits.
The researchers are most interested in the effect of watching the nature documentary. Describe the effect of treatment in terms of the odds of donating.
What group of participants is described by the intercept? What is the predicted probability a randomly selected individual in this group donates?
Based on the model, what is the predicted probability of donating for a 25-year old male in the treatment group with a NEP score of 5?
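The general workflow of fitting a logistic regression, reading coefficients as odds ratios, and predicting a probability can be sketched with base R's glm(). The data here are simulated: the variable names mirror the assignment, but every value is made up, only two of the four predictors are included, and the course materials may use a different modeling interface.

```r
set.seed(221)
n <- 200

# Simulated stand-in data (made-up values, not the experiment's data).
sim <- data.frame(
  age_cent = rnorm(n, mean = 0, sd = 10),
  treatment = factor(sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE),
                     levels = c("URBAN (T1)", "NATURE (T2)"))
)
true_log_odds <- -0.5 + 0.03 * sim$age_cent +
  0.8 * (sim$treatment == "NATURE (T2)")
sim$donation_binary <- rbinom(n, size = 1, prob = plogis(true_log_odds))

# family = binomial uses the logit link by default.
fit <- glm(donation_binary ~ age_cent + treatment,
           data = sim, family = binomial)

# Coefficients are on the log-odds scale; display them with 3 digits.
round(coef(summary(fit)), 3)

# Exponentiate to obtain multiplicative effects on the odds.
odds_ratios <- exp(coef(fit))

# Predicted probability for a new observation;
# type = "response" returns a probability rather than log-odds.
new_obs <- data.frame(
  age_cent = 0,
  treatment = factor("NATURE (T2)", levels = levels(sim$treatment))
)
pred_prob <- predict(fit, newdata = new_obs, type = "response")
```

Because the first factor level is the baseline, the intercept describes the control group at the mean age, and the treatment coefficient compares the treatment group to that baseline.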
Exercise 8
Produce the ROC curve for the model from the previous exercise and calculate the area under the curve (AUC). Write 1 - 2 sentences describing how well the model fits the data.
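To show what the ROC curve and AUC measure, the sketch below computes them by hand in base R on simulated data. The data and the single predictor x are made up for illustration. In practice, the roc() and auc() functions in the pROC package loaded for this assignment perform these calculations.

```r
set.seed(221)
n <- 200

# Simulated outcome and predictor (made-up values for illustration).
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
pred_prob <- fitted(fit)  # predicted probabilities for the observed data

# ROC points: sensitivity (tpr) and 1 - specificity (fpr) at each threshold.
thresholds <- sort(unique(c(0, pred_prob, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(pred_prob[y == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(pred_prob[y == 0] >= t))

# AUC via the rank (Mann-Whitney) formula: the proportion of (1, 0) pairs in
# which the observed 1 has the higher predicted probability.
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(pred_prob)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```

Plotting tpr against fpr traces the ROC curve. An AUC near 0.5 indicates the model separates the two outcomes no better than chance, while values closer to 1 indicate better discrimination.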
Exercise 9
The authors include an interaction effect between nep_high and treatment in one of their models.
- Explain what an interaction between nep_high and treatment means in the context of the data.
- Create a visualization to explore the potential of an interaction effect between these two variables. Based on the visualization, do you think there is an interaction effect? Briefly explain.
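An interaction between two binary predictors means the effect of one differs across levels of the other. One way to see this numerically, before plotting, is to compare the outcome proportion across the four cells. The data below are simulated with an interaction deliberately built in, and the variable treated is a hypothetical 0/1 stand-in for treatment.

```r
set.seed(221)
n <- 400

# Simulated stand-in data with an interaction built in (made-up values).
nep_high <- rbinom(n, 1, 0.5)
treated <- rbinom(n, 1, 0.5)  # hypothetical 0/1 coding of treatment
log_odds <- -1 + 0.2 * nep_high + 0.3 * treated + 1.0 * nep_high * treated
donated <- rbinom(n, 1, plogis(log_odds))

# Proportion donating in each of the four nep_high x treated cells.
cell_props <- tapply(donated, list(nep_high = nep_high, treated = treated), mean)

# Treatment effect on the log-odds scale within each nep_high group.
# Clearly different values across the two groups suggest an interaction.
cell_log_odds <- qlogis(cell_props)
effect_by_nep <- cell_log_odds[, "1"] - cell_log_odds[, "0"]
```

The four cell proportions are also the natural quantities to display in a grouped bar chart or a plot with one line per group.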
Exercise 10
Conduct a drop-in-deviance test to determine if the interaction between nep_high and treatment should be added to the model fit in Exercise 7. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.
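A drop-in-deviance test compares the deviance of a reduced model to that of a full model containing the additional term. The sketch below runs the test on simulated data with base R's glm() and anova(), then reproduces the statistic by hand. The data and the 0/1 variable treated are made up for illustration.

```r
set.seed(221)
n <- 400

# Simulated stand-in data (made-up values).
nep_high <- rbinom(n, 1, 0.5)
treated <- rbinom(n, 1, 0.5)  # hypothetical 0/1 coding of treatment
donated <- rbinom(n, 1, plogis(-1 + 0.2 * nep_high + 0.3 * treated +
                                 1.0 * nep_high * treated))
sim <- data.frame(donated, nep_high, treated)

# Reduced model: main effects only. Full model: adds the interaction.
reduced <- glm(donated ~ nep_high + treated, data = sim, family = binomial)
full <- glm(donated ~ nep_high * treated, data = sim, family = binomial)

# anova() with test = "Chisq" reports the drop in deviance and its p-value.
dev_test <- anova(reduced, full, test = "Chisq")

# The same test by hand: the drop in deviance is compared to a chi-square
# distribution with degrees of freedom equal to the number of added terms.
g_stat <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(g_stat, df = df_diff, lower.tail = FALSE)
```

A small p-value is evidence that the added term improves the fit. Here the null hypothesis is that the interaction coefficient equals zero.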
Submission
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.
Instructions to combine PDFs:
Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): help.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 3 |
Ex 2 | 4 |
Ex 3 | 6 |
Ex 4 | 4 |
Ex 5 | 5 |
Ex 6 | 5 |
Ex 7 | 7 |
Ex 8 | 4 |
Ex 9 | 5 |
Ex 10 | 4 |
Workflow & formatting | 3 |
The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code and your name and the date updated in the YAML.