Logistic Regression: Inference

Prof. Maria Tackett

Nov 14, 2024

Announcements

  • Project: Draft report due + peer review in December 2 lab

  • Statistics experience due Tuesday, November 26

  • HW 04 released later today. Due Thursday, November 21

Topics

  • Test of significance for overall logistic regression model

  • Test of significance for a subset of model coefficients

  • Test of significance for a single coefficient

Computational setup

library(tidyverse)
library(tidymodels)
library(pROC)      
library(knitr)
library(kableExtra)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

  • high_risk:

    • 1: High risk of having heart disease in next 10 years
    • 0: Not high risk of having heart disease in next 10 years
  • age: Age at exam time (in years)

  • totChol: Total cholesterol (in mg/dL)

  • currentSmoker: 0 = nonsmoker, 1 = smoker

  • education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College
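The examples that follow assume the data are available in a data frame called heart_disease. A minimal loading/prep sketch, assuming (hypothetically) a CSV file at data/framingham.csv that already uses the variable names above:

# Hypothetical path; adjust to wherever the data live
heart_disease <- read_csv("data/framingham.csv") |>
  mutate(
    high_risk     = as_factor(high_risk),      # 0/1 outcome as a factor
    currentSmoker = as_factor(currentSmoker),  # 0 = nonsmoker, 1 = smoker
    education     = as_factor(education)       # 4 education levels
  )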

Modeling risk of coronary heart disease

Using age, totChol, currentSmoker

heart_disease_fit <- glm(high_risk ~ age + totChol + currentSmoker,
                         data = heart_disease, family = "binomial")

tidy(heart_disease_fit, conf.int = TRUE) |> 
  kable(digits = 3)
term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

Test for overall significance

Likelihood ratio test

Similar to linear regression, we can test the overall significance for a logistic regression model, i.e., whether there is at least one non-zero coefficient in the model

\[ \begin{aligned} &H_0: \beta_1 = \dots = \beta_p = 0 \\ &H_a: \beta_j \neq 0 \text{ for at least one } j \end{aligned} \]

The likelihood ratio test compares the fit of a model with no predictors to the current model.

Likelihood ratio test statistic

Let \(L_0\) and \(L_a\) be the likelihood functions of the model under \(H_0\) and \(H_a\), respectively. The likelihood ratio test statistic is

\[ G = -2[\log L_0 - \log L_a] = -2\sum_{i=1}^n \Big[ y_i \log \Big(\frac{\hat{\pi}^0}{\hat{\pi}^a_i}\Big) + (1 - y_i)\log \Big(\frac{1-\hat{\pi}^0}{1-\hat{\pi}^a_i}\Big)\Big] \]

where \(\hat{\pi}^0\) is the predicted probability under \(H_0\) and \(\hat{\pi}_i^a = \frac{\exp \{x_i^T\boldsymbol{\beta}\}}{1 + \exp \{x_i^T\boldsymbol{\beta}\}}\) is the predicted probability under \(H_a\) (Wilks 1935)
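A sketch (not part of the original slides) that computes \(G\) directly from this formula using the model fit earlier; it assumes high_risk is a factor with levels 0 and 1:

# Compute G directly from the likelihood ratio formula
y    <- as.numeric(heart_disease$high_risk) - 1  # factor -> 0/1
pi_0 <- mean(y)                                  # predicted probability under H0
pi_a <- fitted(heart_disease_fit)                # predicted probabilities under Ha
(G <- -2 * sum(y * log(pi_0 / pi_a) + (1 - y) * log((1 - pi_0) / (1 - pi_a))))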

Likelihood ratio test statistic

\[ G = -2\sum_{i=1}^n \Big[ y_i \log \Big(\frac{\hat{\pi}^0}{\hat{\pi}^a_i}\Big) + (1 - y_i)\log \Big(\frac{1-\hat{\pi}^0}{1-\hat{\pi}^a_i}\Big)\Big] \]

  • When \(n\) is large, \(G \sim \chi^2_p\) (\(G\) follows a chi-square distribution with \(p\) degrees of freedom)

  • The p-value is calculated as \(\text{p-value} = P(\chi^2 > G)\)

  • Large values of \(G\) (small p-values) indicate at least one \(\beta_j\) is non-zero

\(\chi^2\) distribution
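A sketch of \(\chi^2\) density curves for a few degrees of freedom; the specific df values shown are an illustrative choice:

# Plot chi-square densities for several degrees of freedom
expand_grid(x = seq(0, 20, length.out = 500), df = c(1, 3, 5, 10)) |>
  mutate(density = dchisq(x, df)) |>
  ggplot(aes(x = x, y = density, color = factor(df))) +
  geom_line() +
  labs(color = "Degrees of freedom")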

Heart disease model: likelihood ratio test

\[ \begin{aligned} &H_0: \beta_{age} = \beta_{totChol} = \beta_{currentSmoker} = 0 \\ &H_a: \beta_j \neq 0 \text{ for at least one }j \end{aligned}\]

Fit the null model

(we’ve already fit the alternative model)

null_model <- glm(high_risk ~ 1, data = heart_disease, family = "binomial")

tidy(null_model) |>
  kable()
term         estimate  std.error  statistic  p.value
(Intercept)  -1.72294  0.0436342    -39.486        0

Heart disease model: likelihood ratio test

Calculate the log-likelihood for the null and alternative models

(L_0 <- glance(null_model)$logLik)
[1] -1737.735
(L_a <- glance(heart_disease_fit)$logLik)
[1] -1612.406

Calculate the likelihood ratio test statistic

(G <- -2 * (L_0 - L_a))
[1] 250.6572

Heart disease model: likelihood ratio test

Calculate the p-value

(p_value <- pchisq(G, df = 3, lower.tail = FALSE))
[1] 4.717158e-54

Conclusion

The p-value is small, so we reject \(H_0\). The data provide evidence of at least one non-zero coefficient in the model.
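Equivalently, the overall likelihood ratio test can be run in a single call with anova(), the same function used later for the drop-in-deviance test:

# Overall likelihood ratio test in one call
anova(null_model, heart_disease_fit, test = "Chisq")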

Test a subset of coefficients

Testing a subset of coefficients

  • Suppose there are two models:

    • Reduced Model: includes predictors \(x_1, \ldots, x_q\)

    • Full Model: includes predictors \(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)

  • We can use the likelihood ratios to see if any of the new predictors are useful

\[ \begin{aligned} &H_0: \beta_{q+1} = \dots = \beta_p = 0\\ &H_a: \beta_j \neq 0 \text{ for at least one }j \end{aligned} \]

This is called a drop-in-deviance test (also known as a nested likelihood ratio test).

Deviance

The deviance is a measure of the degree to which the predicted values are different from the observed values (compares the current model to a “saturated” model)


In logistic regression,

\[ D = -2 \log L \]


\(D \sim \chi^2_{n - p - 1}\) (\(D\) follows a chi-square distribution with \(n - p - 1\) degrees of freedom)


Note: \(n - p - 1\) is the degrees of freedom associated with the error in the model (analogous to the residual degrees of freedom in linear regression)
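For ungrouped binary data the saturated model fits every observation exactly and its log-likelihood is 0, which is why \(D\) reduces to \(-2 \log L\) here. A quick check with the model fit earlier:

# Residual deviance equals -2 log L for ungrouped binary data
glance(heart_disease_fit)$deviance
-2 * glance(heart_disease_fit)$logLik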

Drop-in-deviance test

\[ \begin{aligned} &H_0: \beta_{q+1} = \dots = \beta_p = 0\\ &H_a: \beta_j \neq 0 \text{ for at least one }j \end{aligned} \]

The test statistic is

\[ \begin{aligned} G = D_{reduced} - D_{full} &= -2\log L_{reduced} - (-2 \log L_{full}) \\ &= -2(\log L_{reduced} - \log L_{full}) \end{aligned} \]

The p-value is calculated using a \(\chi_{\Delta df}^2\) distribution, where \(\Delta df\) is the number of parameters being tested (the difference in number of parameters between the full and reduced model).

Heart disease model: drop-in-deviance test

Should we add education to the model?

  • Reduced model: age, totChol, currentSmoker

  • Full model: age, totChol, currentSmoker, education

\[ \begin{aligned} &H_0: \beta_{ed1} = \beta_{ed2} = \beta_{ed3} = 0 \\ &H_a: \beta_j \neq 0 \text{ for at least one }j \end{aligned} \]

Heart disease model: drop-in-deviance test

reduced_model <- glm(high_risk ~ age + totChol + currentSmoker,
                     data = heart_disease, family = "binomial")

full_model <- glm(high_risk ~ age + totChol + currentSmoker + education,
                  data = heart_disease, family = "binomial")

Calculate deviances

(deviance_reduced <- -2 * glance(reduced_model)$logLik)
[1] 3224.812
(deviance_full <- -2 * glance(full_model)$logLik)
[1] 3217.6

Calculate test statistic

(G <- deviance_reduced - deviance_full)
[1] 7.212113

Heart disease model: drop-in-deviance test

Calculate p-value

pchisq(G, df = 3, lower.tail = FALSE)
[1] 0.06543567


What is your conclusion? Would you include education in the model that already has age, totChol, currentSmoker?

Drop-in-deviance test in R

Conduct the drop-in-deviance test using the anova() function in R with the option test = "Chisq"

anova(reduced_model, full_model, test = "Chisq") |> 
  tidy() |> 
  kable(digits = 3)
term                                                   df.residual  residual.deviance  df  deviance  p.value
high_risk ~ age + totChol + currentSmoker                     4082           3224.812  NA        NA       NA
high_risk ~ age + totChol + currentSmoker + education         4079           3217.600   3     7.212    0.065

AIC and BIC

Similar to linear regression, we can use AIC and BIC to compare models.

\[ \begin{aligned} &AIC = -2 \log L + 2(p+1) \\ &BIC = -2 \log L + \log(n)(p + 1) \end{aligned} \]

You want to select the model that minimizes AIC / BIC
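As a check (not in the original slides), AIC and BIC for the reduced model can be computed from its log-likelihood; glance() reports the number of observations used in the fit:

# AIC and BIC "by hand" for the reduced model (p = 3 predictors)
log_L <- glance(reduced_model)$logLik
n     <- glance(reduced_model)$nobs
p     <- 3
-2 * log_L + 2 * (p + 1)       # AIC
-2 * log_L + log(n) * (p + 1)  # BIC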

Compare models using AIC

AIC for reduced model (age, totChol, currentSmoker)

glance(reduced_model)$AIC
[1] 3232.812


AIC for full model (age, totChol, currentSmoker, education)

glance(full_model)$AIC
[1] 3231.6

Compare models using BIC

BIC for reduced model (age, totChol, currentSmoker)

glance(reduced_model)$BIC
[1] 3258.074


BIC for full model (age, totChol, currentSmoker, education)

glance(full_model)$BIC
[1] 3275.807
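Here AIC slightly favors the full model while BIC favors the reduced model: BIC's penalty of \(\log(n)\) per parameter (about 8.3, since \(n = 4086\)) is much larger than AIC's penalty of 2, so the three extra education coefficients cost more under BIC. A convenience sketch for viewing both criteria side by side:

# Side-by-side AIC/BIC comparison
bind_rows(
  glance(reduced_model) |> mutate(model = "reduced"),
  glance(full_model) |> mutate(model = "full")
) |>
  select(model, AIC, BIC) |>
  kable(digits = 1)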

Test for a single coefficient

Distribution of \(\hat{\boldsymbol{\beta}}\)

When \(n\) is large, \(\hat{\boldsymbol{\beta}}\), the vector of estimated coefficients of the logistic regression model, is approximately normally distributed.


How do we know the distribution of \(\hat{\boldsymbol{\beta}}\) is normal for large \(n\)?

Distribution of \(\hat{\boldsymbol{\beta}}\)

The expected value of \(\hat{\boldsymbol{\beta}}\) is the true parameter, \(\boldsymbol{\beta}\), i.e., \(E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

\(Var(\hat{\boldsymbol{\beta}})\), the matrix of variances and covariances of the estimators, is found from the second partial derivatives of the log-likelihood function (the Hessian matrix)

\[ Var(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1} \]

where \(\mathbf{V}\) is an \(n\times n\) diagonal matrix such that \(V_{ii}\) is the estimated variance for the \(i^{th}\) observation; for logistic regression, \(V_{ii} = \hat{\pi}_i(1 - \hat{\pi}_i)\)
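A sketch (not from the original slides) that builds \(Var(\hat{\boldsymbol{\beta}})\) from this formula and compares it to what R reports:

# Compute (X^T V X)^{-1} and compare with vcov()
X      <- model.matrix(heart_disease_fit)  # design matrix
pi_hat <- fitted(heart_disease_fit)        # estimated probabilities
w      <- pi_hat * (1 - pi_hat)            # diagonal entries V_ii
solve(t(X) %*% (w * X))                    # scales rows of X by w instead of forming V
vcov(heart_disease_fit)                    # matches up to numerical error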

Test for a single coefficient

Hypotheses: \(H_0: \beta_j = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_j \neq 0\), given the other variables in the model

(Wald) Test Statistic: \[z = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}\]

where \(SE(\hat{\beta}_j)\) is the square root of the \(j^{th}\) diagonal element of \(Var(\hat{\boldsymbol{\beta}})\)

P-value: \(P(|Z| > |z|)\), where \(Z \sim N(0, 1)\), the Standard Normal distribution
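A sketch of the same calculation in R for a single coefficient, shown here for age, which the next slides work through by hand:

# Wald test for the age coefficient
beta_hat <- coef(heart_disease_fit)["age"]
se_hat   <- sqrt(diag(vcov(heart_disease_fit)))["age"]
(z <- beta_hat / se_hat)               # test statistic
2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value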

Confidence interval for \(\beta_j\)

We can calculate the \(C\%\) confidence interval for \(\beta_j\) as follows:

\[ \Large{\hat{\beta}_j \pm z^* SE(\hat{\beta}_j)} \]

where \(z^*\) is calculated from the \(N(0,1)\) distribution

Note

This is an interval for the change in the log-odds for every one unit increase in \(x_j\)

Interpretation in terms of the odds

A confidence interval for the multiplicative change in the odds for every one unit increase in \(x_j\):

\[ \Large{\exp\{\hat{\beta}_j \pm z^* SE(\hat{\beta}_j)\}} \]

Interpretation: We are \(C\%\) confident that for every one unit increase in \(x_j\), the odds multiply by a factor of \(\exp\{\hat{\beta}_j - z^* SE(\hat{\beta}_j)\}\) to \(\exp\{\hat{\beta}_j + z^* SE(\hat{\beta}_j)\}\), holding all else constant.
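broom's tidy() can exponentiate the estimates and interval endpoints directly; a sketch using the model fit earlier:

# Estimates and confidence intervals on the odds scale
tidy(heart_disease_fit, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3)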

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

Hypotheses:

\[ H_0: \beta_{age} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{age} \neq 0 \], given total cholesterol and smoking status are in the model.

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol             0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1      0.443      0.094      4.733    0.000     0.260      0.627

Test statistic:

\[z = \frac{ 0.0825 - 0}{0.00575} = 14.34 \]

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

P-value:

\[P(|Z| > |14.34|) \approx 0 \]

2 * pnorm(14.34, lower.tail = FALSE)
[1] 1.230554e-46

Coefficient for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

Conclusion:

The p-value is very small, so we reject \(H_0\). The data provide sufficient evidence that age is a statistically significant predictor of whether someone is at high risk of heart disease, after accounting for total cholesterol and smoking status.

CI for age

term            estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6.673      0.378    -17.647    0.000    -7.423     -5.940
age                0.082      0.006     14.344    0.000     0.071      0.094
totChol            0.002      0.001      1.940    0.052     0.000      0.004
currentSmoker1     0.443      0.094      4.733    0.000     0.260      0.627

Interpret the 95% confidence interval for age in terms of the odds of being high risk for heart disease.
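A sketch of the computation, with the endpoint values taken from the table above:

# Exponentiate the CI endpoints for age
exp(c(0.071, 0.094))  # approximately 1.074 to 1.099

Following the earlier template: we are 95% confident that for each additional year of age, the odds of being high risk for heart disease multiply by a factor of about 1.074 to 1.099, holding total cholesterol and smoking status constant.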

Overview of testing coefficients

Test a single coefficient

  • Likelihood ratio test

  • Drop-in-deviance test

  • Wald hypothesis test and confidence interval

Test a subset of coefficients

  • Likelihood ratio test

  • Drop-in-deviance test

Can use AIC and BIC to compare models in both scenarios

References

Wilks, S. S. 1935. “The Likelihood Test of Independence in Contingency Tables.” The Annals of Mathematical Statistics 6 (4): 190–96.