STA 221 - Fall 2024 – Logistic Regression: Assumptions + Estimation

Announcements

Lab 06 due Thursday, November 21 at 11:59pm
HW 04 due Thursday, November 21 at 11:59pm
Project meetings: November 25 and 26
- Sign up for a meeting slot (1 slot per team) by November 22
Statistics experience due Tuesday, November 26
Project: Draft report due + peer review in December 2 lab

Topics

Conditions for logistic regression
Estimating coefficients for logistic regression model

Computational setup

library(tidyverse)
library(tidymodels)
library(knitr)
library(kableExtra)
library(Stat2Data)  #empirical logit plots
library(patchwork)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

COVID-19 infection prevention practices at food establishments

Researchers at Wollo University in Ethiopia conducted a study in July and August 2020 to understand factors associated with good COVID-19 infection prevention practices at food establishments. Their study is published in Andualem et al. (2022).

They were particularly interested in understanding the implementation of prevention practices at food establishments, given the workers’ increased risk due to daily contact with customers.

Access to personal protective equipment

We will use the data from Andualem et al. (2022) to explore the association between age, sex, years of service, and whether someone works at a food establishment with access to personal protective equipment (PPE) as of August 2020. We will use access to PPE as a proxy for wearing PPE.

The study participants were selected using simple random sampling at the selected establishments.

age	sex	years	ppe_access
34	Male	2	1
32	Female	3	1
32	Female	1	1
40	Male	4	1
32	Male	10	1
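
To make the variable types concrete, here is a minimal base R sketch that rebuilds the five rows shown above. The name `covid_head` is hypothetical, and storing `sex` and `ppe_access` as factors is an assumption based on the `<fct>` columns in the output later in these notes. The `0` level is also assumed, since only 1s appear in the rows shown.

```r
# First five rows as shown in the table above.
# `covid_head` is a hypothetical name, not an object from the original slides.
covid_head <- data.frame(
  age   = c(34, 32, 32, 40, 32),
  sex   = factor(c("Male", "Female", "Female", "Male", "Male")),
  years = c(2, 3, 1, 4, 10),
  # Assumed coding: a two-level factor with levels "0" and "1"
  ppe_access = factor(c(1, 1, 1, 1, 1), levels = c(0, 1))
)

# Inspect the column types
str(covid_head)

# A factor is compared by its labels, so the comparison is against the string "1"
has_access <- covid_head$ppe_access == "1"
has_access
```

Because the response is stored as a factor, code that filters on it compares against the label `"1"` rather than the number 1.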

Full model results

Bivariate EDA: categorical predictor

Bivariate EDA: quantitative predictor

Empirical logit

The empirical logit is the log of the observed odds:

$\text{logit}(\hat{p}) = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \log\left(\frac{\#\text{Yes}}{\#\text{No}}\right)$
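
As a quick check of this identity, the sketch below computes the empirical logit three ways in base R. The counts (30 "Yes" and 10 "No") are made up for illustration and do not come from the PPE data. `qlogis()` is base R's logit function.

```r
# Hypothetical counts for illustration only (not from the PPE data)
n_yes <- 30
n_no  <- 10

# Observed proportion of "Yes"
p_hat <- n_yes / (n_yes + n_no)

# Three equivalent ways to get the empirical logit
logit_from_prop   <- log(p_hat / (1 - p_hat))  # log of the observed odds
logit_from_counts <- log(n_yes / n_no)         # log(# Yes / # No)
logit_from_qlogis <- qlogis(p_hat)             # built-in logit function

c(logit_from_prop, logit_from_counts, logit_from_qlogis)
```

All three values agree because the group size cancels in the odds: $\hat{p}/(1 - \hat{p})$ equals the ratio of the two counts.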

Calculating empirical logit (categorical predictor)

If the predictor is categorical, we can calculate the empirical logit for each level of the predictor.

covid_df |>
  count(sex, ppe_access) |>
  group_by(sex) |>
  mutate(prop = n/sum(n)) |>
  filter(ppe_access == "1") |>
  mutate(emp_logit = log(prop/(1-prop)))

# A tibble: 2 × 5
# Groups:   sex [2]
  sex    ppe_access     n  prop emp_logit
  <fct>  <fct>      <int> <dbl>     <dbl>
1 Female 1            103 0.475    -0.101
2 Male   1            119 0.647     0.605

Calculating empirical logit (quantitative predictor)

Divide the range of the predictor into intervals with approximately equal numbers of cases. (If you have enough observations, use 5 to 10 intervals.)
Compute the empirical logit for each interval.

You can then calculate the mean value of the predictor in each interval and create a plot of the empirical logit versus the mean value of the predictor in each interval.

Empirical logit plot in R (quantitative predictor)

Created using dplyr and ggplot functions.

covid_df |> 
  mutate(age_bin = cut_number(age, n = 10)) |>
  group_by(age_bin) |>
  mutate(mean_age = mean(age)) |>
  count(mean_age, ppe_access) |>
  mutate(prop = n/sum(n)) |>
  filter(ppe_access == "1") |>
  mutate(emp_logit = log(prop/(1-prop))) |>
  ggplot(aes(x = mean_age, y = emp_logit)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Mean Age", 
       y = "Empirical logit", 
       title = "Empirical logit of PPE Access vs. Age")

Empirical logit plot in R (quantitative predictor)

Using the emplogitplot1 function from the Stat2Data R package

emplogitplot1(ppe_access ~ age,  data = covid_df, ngroups = 10)

Empirical logit plot in R (interactions)

Using the emplogitplot2 function from the Stat2Data R package

emplogitplot2(ppe_access ~ age + sex, data = covid_df, 
              ngroups = 10, 
              putlegend = "bottomright")

Logistic regression model

ppe_model <- glm(ppe_access ~ age + sex + years, 
                 data = covid_df, family = binomial)
tidy(ppe_model, conf.int = TRUE) |>
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2.127	0.458	-4.641	0.000	-3.058	-1.257
age	0.056	0.017	3.210	0.001	0.023	0.091
sexMale	0.341	0.224	1.524	0.128	-0.098	0.780
years	0.264	0.066	4.010	0.000	0.143	0.401
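
The estimates in this table are on the log-odds scale. As a rough sketch of how to read them, the base R code below exponentiates the slopes to get adjusted odds ratios and computes a predicted probability for a 34-year-old male with 2 years of service (the first row of the data shown earlier). It uses the rounded values from the table, so the results are approximate. The names `est`, `odds_ratios`, `log_odds`, and `p_hat` are illustrative.

```r
# Coefficient estimates copied from the table above (log-odds scale, rounded)
est <- c(intercept = -2.127, age = 0.056, sexMale = 0.341, years = 0.264)

# Adjusted odds ratios: exponentiate the slope estimates
odds_ratios <- exp(est[c("age", "sexMale", "years")])
round(odds_ratios, 3)

# Predicted log-odds for age = 34, sex = Male, years = 2
log_odds <- est[["intercept"]] +
  est[["age"]] * 34 +
  est[["sexMale"]] * 1 +
  est[["years"]] * 2

# Inverse logit converts log-odds to a probability
p_hat <- plogis(log_odds)
round(p_hat, 3)
```

With the fitted model object, `predict(ppe_model, type = "response")` would give the same kind of probability without the rounding error introduced by copying three-decimal estimates.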

Visualizing coefficient estimates

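# exponentiate the estimates and interval endpoints to get odds ratios
model_odds_ratios <- tidy(ppe_model, exponentiate = TRUE, conf.int = TRUE)

ggplot(data = model_odds_ratios, aes(x = term, y = estimate)) +
  # dashed reference line at an odds ratio of 1 (no association)
  geom_hline(yintercept = 1, lty = 2) + 
  # point estimate with its confidence interval (geom_pointrange draws the point too)
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  labs(title = "Adjusted odds ratios",
       x = "",
       y = "Estimated AOR") +
  # put the terms on the vertical axis so the labels are readable
  coord_flip()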
