library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)
Lab 06: Logistic Regression
This lab is due on Thursday, November 21 at 11:59pm. To be considered on time, the following must be done by the due date:
Final
.qmd
and.pdf
files pushed to your team’s GitHub repoFinal
.pdf
file submitted on Gradescope
Logistic Regression
In this lab, you will use logistic regression analysis to classify rice varieties based on features derived from image data. You will fit and interpret the logistic regression models using training data, then use testing data to assess how well the model classifies observations into the two varieties.
Learning goals
By the end of the lab you will be able to…
conduct exploratory data analysis for data with a binary response variable.
fit and interpret logistic regression model.
use relevant metrics and subject-matter context to identify a threshold for classification.
use training and testing data to fit and assess model performance.
Getting started
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the sta221-fa24 organization on GitHub. Click on the repo with the prefix lab-06. It contains the starter documents you need to complete the lab.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.
Workflow: Using Git and GitHub as a team
There are no Team Member markers in this lab; however, you should use a similar workflow as in Lab 02. Only one person should type in the group’s .qmd file at a time to avoid merge conflicts. Once that person has finished typing the group’s responses, they should render, commit, and push the changes to GitHub. All other teammates can pull to see the updates in RStudio.
Every teammate must have at least one commit in the lab. Everyone is expected to contribute to discussion even when they are not typing.
Packages
You will use the following packages in today’s lab. Add other packages as needed.
Data
The data in this lab contains measures describing the shape and structure of two rice varieties - Cammeo and Osmancik. To curate the data set, researchers used images from over 3,000 grains a rice in these two varieties. They then used automated methods to process the image data and extract 7 morphological features (features related to the structure) from each image. The data were originally presented and analyzed in Cinar and Koklu (2019) and was obtained from the UCI Machine Learning Repository.
This analysis will focus on the following variables:
Class
: Cammeo or OsmancikArea
: Size of the rice grain measured in pixelsEccentricity
: A measure of how round the ellipse is, i.e., how close the shape of the grain is to a circle.
Click here for the full data dictionary.
<- read_csv("data/rice.csv") rice
Exercises
Goal: The goal of the analysis is to use Area and Eccentricity to identify grains from the Cammeo variety versus those from the Osmancik variety.
Exercise 1
We’ll begin by exploring the data. Create a scatterplot of Area
versus Eccentricity
with the color and shape of the points by Class
.
Exercise 2
Based on the plot from the previous exercise, do you think the two types of rice can be distinguished based on Area
and Eccentricity
. Briefly explain.
Exercise 3
With this type of classification problem, it is common to test the performance of a logistic regression model by splitting the data into a training and a test set. (Recall training and test sets from Lab 02). This means that we will choose a random subset of the data to estimate the regression coefficient, and we will then test the predictions on the remaining observations.
Randomly select 75% of the rows of the data to create the training set
rice_train
. Useset.seed(221)
.Put the remaining 25% of the observations in the test set called
rice_test
.
Exercise 4
In a logistic regression model, the log-odds of the response being “1” (or a “success”) is given by \[\log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \mathbf{x}_i^\top \boldsymbol{\beta}\]
In this analysis a “success” is the Class “Osmancik”.
What does each \(\pi_i\) represent in the context of this analysis?
How is the probability of the response variable being “1” calculated from the log-odds? Show the mathematical steps to go from the log-odds to the probability in your response.
Exercise 5
Use rice_train
to fit the logistic regression model for the response variable Class
on the predictors Area
and Eccentricity
.
- Neatly display the output using 3 digits.
- Does the intercept have any reasonable interpretation? If so, interpret the intercept. Otherwise explain why not.
Exercise 6
- Interpret the coefficients on
Area
andEccentricity
in the context of the data in terms of the odds and comment on whether they are useful predictors ofClass
in the model.
Exercise 7
How would you expect the log-odds of the rice grain being of the Osmancik variety to change if the measures eccentricity changes from 0.85 to 0.9?
How would you expect the odds of the rice grain being of the Osmancik variety to change if the measures eccentricity changes from 0.85 to 0.9?
Exercise 8
Now let’s test our model on the test set. Use the predict()
function to obtain the predicted probabilities of the rice being Osmancik for the observations in rice_test.
Exercise 9
With these estimated probabilities, we can now try to classify the rice in the test set. Choose a threshold for assigning a class to each observation based on the estimated probability. Briefly explain your reasoning for selecting the threshold, including any analysis used to make your decision.
Exercise 10
Compare the estimated class assignments you constructed with the actual classes. Comment on the result and thus the model performance.
Submission
You will submit the PDF documents for labs, homework, and exams in to Gradescope as part of your final submission.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select all team members’ names, so they receive credit on the assignment. Click here for video on adding team members to assignment on Gradescope.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.
Grading
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 4 |
Ex 3 | 4 |
Ex 4 | 5 |
Ex 5 | 5 |
Ex 6 | 6 |
Ex 7 | 6 |
Ex 8 | 3 |
Ex 9 | 4 |
Ex 10 | 5 |
Workflow & formatting | 4 |
The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member, a neatly organized document with readable code, and updating the team name and date in the YAML.