STA 221 - Fall 2024 – Probabilities, odds, odds ratios

Method	Outcome	Model
Linear regression	Quantitative	$y_{i} = β_{0} + β_{1} x_{i}$
Linear regression (transform Y)	Quantitative	$\log (y_{i}) = β_{0} + β_{1} x_{i}$
Logistic regression	Binary	$\log (\frac{π_{i}}{1 - π_{i}}) = β_{0} + β_{1} x_{i}$
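As a minimal sketch of the left-hand side of the logistic regression model in the table, the log-odds transformation and its inverse can be written in base R. The helper names `logit` and `inv_logit` are illustrative and do not come from the course materials; base R's `qlogis()` and `plogis()` compute the same quantities.

```r
# Log-odds (logit): the transformation on the left-hand side of the logistic model.
logit <- function(p) log(p / (1 - p))

# Inverse logit: maps a value of beta_0 + beta_1 * x back to a probability.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)

eta             # negative below 0.5, zero at 0.5, positive above 0.5
inv_logit(eta)  # recovers the original probabilities
```

Because the log-odds can take any real value, a linear function of x on the right-hand side never produces a probability outside 0 to 1.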

Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease. These notes focus on the following variables:

high_risk:
- 1: High risk of having heart disease in next 10 years
- 0: Not high risk of having heart disease in next 10 years
education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College

High risk vs. education

Education	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403

Compare the odds for two groups

We want to compare the risk of heart disease for those with a high school diploma/GED and those with a college degree.
We’ll use the odds to compare the two groups.

$\text{odds} = \frac{P(\text{success})}{P(\text{failure})} = \frac{\text{\# of successes}}{\text{\# of failures}}$
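The two forms of the odds formula agree because the group total cancels in the ratio. The sketch below checks this using the "Some high school" row of the table; the variable names are illustrative only.

```r
# Counts from the "Some high school" row of the table.
n_high_risk <- 323       # "successes"
n_not_high_risk <- 1397  # "failures"
n_total <- n_high_risk + n_not_high_risk

# Odds computed from probabilities.
p_high_risk <- n_high_risk / n_total
odds_from_prob <- p_high_risk / (1 - p_high_risk)

# Odds computed directly from counts.
odds_from_counts <- n_high_risk / n_not_high_risk

# The two versions agree up to floating-point error.
all.equal(odds_from_prob, odds_from_counts)
```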

Odds of being high risk for the High school or GED group: $\frac{147}{1106} = 0.133$
Odds of being high risk for the College group: $\frac{70}{403} = 0.174$
Based on this, we see those with a college degree had higher odds of being high risk for heart disease than those with a high school diploma or GED.
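The same calculation can be done for every education group at once. This is a sketch in base R that types the counts in by hand from the table rather than reading them from the `heart_disease` data; the data frame name `counts` is made up for illustration.

```r
# Counts copied from the "High risk vs. education" table.
counts <- data.frame(
  education = c("Some high school", "High school or GED",
                "Some college or vocational school", "College"),
  high_risk = c(323, 147, 88, 70),
  not_high_risk = c(1397, 1106, 601, 403)
)

# Odds of being high risk within each education group.
counts$odds <- counts$high_risk / counts$not_high_risk

round(counts$odds, 3)  # 0.231 0.133 0.146 0.174
```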

Odds ratio (OR)

Let’s summarize the relationship between the two groups. To do so, we’ll use the odds ratio (OR).

$OR = \frac{\text{odds}_{1}}{\text{odds}_{2}} = \frac{\omega_{1}}{\omega_{2}}$

OR: College vs. High school or GED

$OR = \frac{\text{odds}_{\text{College}}}{\text{odds}_{\text{HS}}} = \frac{0.174}{0.133} = 1.308$

The odds of being high risk for heart disease for those with a college degree are 1.31 times the odds for those with a high school diploma or GED.
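A small helper makes the odds ratio calculation reusable. This is a sketch: `odds_ratio()` is a hypothetical function name, not something defined in the course code. It also shows that dividing the rounded odds gives a slightly different third decimal than working from the raw counts.

```r
# Odds ratio of group 1 relative to group 2, from counts of
# successes (high risk) and failures (not high risk).
odds_ratio <- function(success_1, failure_1, success_2, failure_2) {
  (success_1 / failure_1) / (success_2 / failure_2)
}

# College (70, 403) vs. High school or GED (147, 1106).
or_college_hs <- odds_ratio(70, 403, 147, 1106)

round(or_college_hs, 3)  # 1.307, from unrounded odds
round(0.174 / 0.133, 3)  # 1.308, from odds rounded to three decimals
```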

OR: College vs. Some high school

$OR = \frac{\text{odds}_{\text{College}}}{\text{odds}_{\text{Some HS}}} = \frac{70 / 403}{323 / 1397} = 0.751$

The odds of being high risk for having heart disease for those with a college degree are 0.751 times the odds of being high risk for heart disease for those with some high school.

More natural interpretation

An odds ratio is usually easier to interpret when it is stated so that it is greater than 1. Swapping the two groups gives the reciprocal, 1 / 0.751 = 1.33.
The odds of being high risk for heart disease for those with some high school are 1.33 times the odds for those with a college degree.
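Swapping which group goes in the numerator inverts the odds ratio, which is why the two statements describe the same relationship. A quick check in base R, with the counts typed in from the table and illustrative variable names:

```r
# College relative to Some high school.
or_college_vs_some_hs <- (70 / 403) / (323 / 1397)

# Some high school relative to College: numerator and denominator swapped.
or_some_hs_vs_college <- (323 / 1397) / (70 / 403)

round(or_college_vs_some_hs, 3)  # 0.751
round(or_some_hs_vs_college, 3)  # 1.331

# The two odds ratios are reciprocals of each other.
all.equal(or_some_hs_vs_college, 1 / or_college_vs_some_hs)
```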

Making the table 1

First, rename the levels of the categorical variables:

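# Assumes dplyr and forcats are loaded, e.g., via library(tidyverse)
heart_disease <- heart_disease |>
  mutate(
    # Readable labels for the binary outcome
    high_risk_names = if_else(high_risk == "1", "High risk", "Not high risk"),
    # Readable labels for the education codes
    education_names = case_when(
      education == "1" ~ "Some high school",
      education == "2" ~ "High school or GED",
      education == "3" ~ "Some college or vocational school",
      education == "4" ~ "College"
    ),
    # Order the levels from least to most education so table rows follow that order
    education_names = fct_relevel(
      education_names,
      "Some high school", "High school or GED",
      "Some college or vocational school", "College"
    )
  )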

Making the table 2

Then, make the table:

heart_disease |>
  count(education_names, high_risk_names) |>
  pivot_wider(names_from = high_risk_names, values_from = n) |>
  kable(col.names = c("Education", "High risk", "Not high risk"))

Deeper look into the code

heart_disease |>
  count(education_names, high_risk_names)

# A tibble: 8 × 3
  education_names                   high_risk_names     n
  <fct>                             <chr>           <int>
1 Some high school                  High risk         323
2 Some high school                  Not high risk    1397
3 High school or GED                High risk         147
4 High school or GED                Not high risk    1106
5 Some college or vocational school High risk          88
6 Some college or vocational school Not high risk     601
7 College                           High risk          70
8 College                           Not high risk     403

Deeper look into the code

heart_disease |>
  count(education_names, high_risk_names) |>
  pivot_wider(names_from = high_risk_names, values_from = n)

# A tibble: 4 × 3
  education_names                   `High risk` `Not high risk`
  <fct>                                   <int>           <int>
1 Some high school                          323            1397
2 High school or GED                        147            1106
3 Some college or vocational school          88             601
4 College                                    70             403

Deeper look into the code

heart_disease |>
  count(education_names, high_risk_names) |>
  pivot_wider(names_from = high_risk_names, values_from = n) |>
  kable()

education_names	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403

Deeper look into the code

heart_disease |>
  count(education_names, high_risk_names) |>
  pivot_wider(names_from = high_risk_names, values_from = n) |>
  kable(col.names = c("Education", "High risk", "Not high risk"))

Education	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403
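To experiment with the table without the course data, one option is to rebuild a person-level data frame from the published counts and cross-tabulate it with base R's `table()`. This is only a sketch: `toy` is a made-up stand-in for `heart_disease` that contains just the two labelled columns, and `table()` is used here in place of the `count()` and `pivot_wider()` pipeline.

```r
education_levels <- c("Some high school", "High school or GED",
                      "Some college or vocational school", "College")
high_risk_n <- c(323, 147, 88, 70)
not_high_risk_n <- c(1397, 1106, 601, 403)

# One row per person, rebuilt from the counts in the table.
toy <- data.frame(
  education_names = factor(
    c(rep(education_levels, times = high_risk_n),
      rep(education_levels, times = not_high_risk_n)),
    levels = education_levels  # level order controls the row order of the table
  ),
  high_risk_names = c(rep("High risk", sum(high_risk_n)),
                      rep("Not high risk", sum(not_high_risk_n)))
)

# Cross-tabulate education by risk group.
tab <- table(toy$education_names, toy$high_risk_names)
tab
```

Setting the factor levels explicitly plays the same role as `fct_relevel()`: without it, the rows would be sorted alphabetically rather than by amount of education.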
