Final project

Project milestones

Research questions due Thursday, September 26

Project proposal due Thursday, October 3

Exploratory data analysis due Thursday, October 31

Presentation + Presentation comments Monday, November 11 (in lab)

Analysis draft + peer review Monday, December 2 (peer review in lab)

Round 1 submission (optional) due Sunday, December 8 at 11:59pm

Written report due Thursday, December 12 at 9pm

Reproducibility + organization due Thursday, December 12 at 9pm

Introduction

TL;DR: Pick a data set and do a regression analysis. That is your final project.

The goal of the final project is for you to use regression analysis to analyze a data set of your own choosing. The data set may already exist or you may collect your own data by scraping the web.

Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.

All analyses must be done in RStudio using Quarto and GitHub, and your analysis and written report must be reproducible.

Logistics

You will work on the project with your lab groups. The primary deliverables for the project are

  1. an in-person presentation about the exploratory data analysis and initial modeling

  2. a written, reproducible final report detailing your analysis

  3. a GitHub repository containing all work from the project

There are intermediate milestones and peer review assignments throughout the semester to help you work towards the final deliverables.

Research questions

The goal of this milestone is to discuss topics and develop potential research questions your team is interested in investigating for the project. You are only developing potential research questions; you do not need to have a data set identified at this point.

Develop three potential research questions. Include the following for each question:

  1. A statement of the research question.
  2. The target population of interest for this question.
  3. A statement about your motivation for investigating this research question and why this question is important.
  4. Ideas about the type of data you might use to answer this question. Note: These are your ideas about the type of data you could use. You do not need to have a data set at this point.

Submission

Write your responses in research-questions.qmd in your team’s project GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Thursday, September 26 at 11:59pm.

Project proposal

The purpose of the project proposal is for your team to identify the data set you’re interested in analyzing to investigate one of your potential research questions. You will also do some preliminary exploration of the response variable and begin thinking about the modeling strategy. If you’re unsure where to find data, you can use the list of potential data sources on the Tips + resources page as a starting point.

Important

You must use the data set(s) in the proposal for the final project, unless instructed otherwise when given feedback.

The data set must meet the following criteria:

  • At least 500 observations

  • At least 10 columns, such that at least 6 of the columns are useful and unique predictor variables.

    • e.g., identifier variables such as “name”, “ID number”, etc. are not useful predictor variables.

    • e.g., if you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.

  • At least one variable that can be identified as a reasonable response variable.

    • The response variable can be quantitative or categorical.
  • A mix of quantitative and categorical variables that can be used as predictors.

  • May not be data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
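
As a quick self-check before proposing a data set, the numeric criteria above can be sketched as a small function. This is only an illustrative Python sketch (the project itself is done in R); the function name and arguments are hypothetical, and the count of "useful and unique" predictors still has to come from your own judgment about the variables.

```python
def unmet_criteria(n_rows, n_cols, n_useful_predictors,
                   has_quantitative, has_categorical):
    """Return the size and variable-mix criteria that a candidate data set fails.

    n_useful_predictors is a count you determine by hand, after excluding
    identifiers and duplicated columns (hypothetical input, not computed here).
    """
    problems = []
    if n_rows < 500:
        problems.append("fewer than 500 observations")
    if n_cols < 10:
        problems.append("fewer than 10 columns")
    if n_useful_predictors < 6:
        problems.append("fewer than 6 useful and unique predictors")
    if not (has_quantitative and has_categorical):
        problems.append("predictors are not a mix of quantitative and categorical")
    return problems


# A data set with 650 rows, 12 columns, and 7 usable predictors of both types
print(unmet_criteria(650, 12, 7, True, True))   # → []
# A smaller data set with only quantitative predictors fails three checks
print(unmet_criteria(480, 9, 7, True, False))
```

An empty list only means the size and variable-mix criteria are met; the remaining criteria (a reasonable response variable, data not used in course materials) still need to be checked separately.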

Types of data sets to avoid
  • Data that are likely to violate the independence condition. Therefore, avoid data with repeated measures, data collected over time, etc.

  • Data sets in which there is no information about how the data were originally collected

  • Data sets in which there are missing or unclear definitions about the observations and/or variables

Ask a member of the teaching team if you’re unsure whether your data set meets the criteria.

The proposal will include the following sections:

Section 1: Introduction

Tip

Reuse and iterate on the work from the Research Questions milestone.

  • An introduction to the subject matter you’re investigating (citing any relevant literature)

  • Statement of a well-developed research question.

  • The motivation for your research question and why it is important

  • Your team’s hypotheses regarding the research question

    • This is a narrative about what you think regarding the research question, not formal statistical hypotheses.

Section 2: Data description

  • The source of the data set

  • A description of when and how the data were originally collected (by the original data curator, not necessarily how you found the data)

  • A description of the observations and general characteristics being measured

Section 3: Initial exploratory data analysis

  • Description of data cleaning you need to do to prepare for analysis (can focus on the response variable for now), such as joining data sets, imputing missing values, variable transformation, creating a new variable, etc.

  • Visualizations, summary statistics, and narrative to describe the distribution of the response variable.

Section 4: Analysis approach

  • A description of the potential predictor variables of interest

  • The regression model technique (multiple linear regression for a quantitative response variable or logistic regression for a categorical response variable)

Data dictionary (aka code book)

Submit a data dictionary for all the variables in your data set in the README of the data folder. You do not need to include the data dictionary in the PDF document.

Submission

Write your narrative and analysis for Sections 1 - 4 in the proposal.qmd file in your team’s GitHub repo. Put the data set and the data dictionary in the data folder in the repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Thursday, October 3 at 11:59pm.

Grading

The anticipated length, including all graphs, tables, narrative, etc., is 2 - 4 pages.

The proposal is worth 10 points and will be graded based on accurately and comprehensively addressing the criteria stated above. Points will be assigned based on a holistic review of the project proposal.

  • Excellent (9 - 10 points): All required elements are completed and are accurate. The data set meets the requirements (or the team has otherwise discussed the data with Professor Tackett) and the data do not pose obvious violations to the modeling assumptions. There is a thoughtful and comprehensive description of the data and exploration of the response variable as described above. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.

  • Strong (7 - 8 points): Requirements are mostly met, but there are some elements that are incomplete or inaccurate. Some minor revision of the work required before team is ready for modeling.

  • Satisfactory (5 - 6 points): Requirements partially met, but there are some elements that are incomplete and/or inaccurate. Major revision of the work required before team is ready for modeling.

  • Needs Improvement (4 or fewer points): Requirements are largely unmet, and there are large elements that are incomplete and/or inaccurate. Substantial revisions of the work required before team is ready for modeling.

Exploratory data analysis

Tip

Reuse and iterate on the work from the previous milestones.

The purpose of this milestone is to begin exploring the data and get early feedback on your data and analysis. You will submit a draft of the beginning of your report that includes the introduction and exploratory data analysis, with an emphasis on the EDA. It will also help you prepare for the presentation of the exploratory data analysis results.

Below is a brief description of the sections to include in this step:

Introduction

This section includes an introduction to the project motivation, background, data, and research question.

Exploratory data analysis

This section includes the following:

  • Description of the data set and key variables.

  • Exploratory data analysis of the response variable and key predictor variables.

    • Univariate EDA of the response and key predictor variables.
    • Bivariate EDA of the response and key predictor variables.
    • Potential interaction effects.

Submission

Write your draft introduction and exploratory data analysis in the written-report.qmd file in your team’s GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Thursday, October 31 at 11:59pm.

Grading

The anticipated length, including all graphs, tables, narrative, etc., with code, warnings, and messages suppressed, is about 4 - 6 pages.

Tip

You can suppress code, warnings, and messages by including the following in the YAML:

execute: 
  echo: false
  message: false
  warning: false

The exploratory data analysis is worth 15 points and will be graded based on accurately and comprehensively addressing the criteria stated above, along with incorporating the feedback from the proposal. Points will be assigned based on a holistic review of the exploratory data analysis.

  • Excellent (14 - 15 points): All required elements are completed and are accurate. There is a thorough exploration of the data as described above, and the team has demonstrated a careful and thoughtful approach exploring the data and preparing it for analysis. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.

  • Strong (11 - 13 points): Requirements are mostly met, but there are some elements that are incomplete or inaccurate. Some revision of the work required before team is ready for modeling.

  • Satisfactory (8 - 10 points): Requirements partially met, but there are some elements that are incomplete and/or inaccurate. Major revision of the work required before team is ready for modeling.

  • Needs Improvement (7 or fewer points): Requirements are largely unmet, and there are large elements that are incomplete and/or inaccurate. Substantial revisions of the work required before team is ready for modeling.

Presentation

Important

Presentations will take place in class during labs on November 11. Presentation order will be announced in advance.

Your team will do an in-person presentation that summarizes and showcases the work you’ve done on the project thus far. Because the presentations will take place while you’re still working on the project, it will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, showcase key results from the exploratory data analysis, and discuss primary modeling strategies and/or results. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.

You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!

The presentation is expected to be between 5 and 8 minutes. It may not exceed 8 minutes, due to the limited time during lab.

Every team member is expected to speak in the presentation. Part of the grade will be whether every team member had a meaningful speaking role in the presentation.

Slides

The slide deck should have no more than 6 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

  • Title Slide

  • Slide 1: Introduce the subject, motivation, and research question

  • Slide 2: Introduce the data set

  • Slide 3 - 4: Highlights from the EDA (be sure to include EDA for the response variable!)

  • Slide 5: Initial modeling strategies / results

  • Slide 6: Next steps and any questions you’d like to get feedback on

Submission

You can submit the presentation slides in two ways:

  • Put a PDF of the slides or Quarto slides in the presentation folder in your team’s GitHub repo.

  • Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so Prof. Tackett can view the slides.

Important

Slides must be submitted by the start of your lab on November 11. We will use a classroom computer for the presentations.

Grading

The presentation is worth 10 points. It will be graded based on the following:

  • Content: The team told a unified story that clearly introduced the subject matter, research question, and exploration of the data.

  • Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.

  • Presentation: The team’s communication style was clear and professional. The team divided the time well and stayed within the 8-minute time limit, with each team member making a meaningful contribution to the presentation.

80% of the presentation grade will be the average of the teaching team scores and 20% will be the average of the peer scores.
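
The weighting above amounts to a simple weighted average. A minimal Python sketch of the arithmetic follows; the function name and the example scores are made up for illustration, and it assumes teaching team and peer scores are recorded on the same point scale.

```python
def presentation_score(teaching_scores, peer_scores):
    """Combine scores as 80% teaching-team average + 20% peer average.

    Assumes both sets of scores use the same point scale (an assumption,
    not something the grading description spells out).
    """
    teaching_avg = sum(teaching_scores) / len(teaching_scores)
    peer_avg = sum(peer_scores) / len(peer_scores)
    return 0.8 * teaching_avg + 0.2 * peer_avg


# Hypothetical scores: teaching team average 9.5, peer average 9.0
print(round(presentation_score([9, 10], [8, 9, 10]), 2))  # → 9.4
```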

Presentation comments

Important

Click here to see the teams you’re scoring and a link to the feedback form.

This portion of the project is worth 2 points and will be assessed individually.

You will provide feedback on two teams’ presentations. The assigned teams and link to the feedback form will be available in advance of the presentations. Please provide all scores and comments by the end of the lab session. There will be a few minutes between each presentation to submit scores and comments.

The grade will be based on submitting the scores and comments for both of your assigned teams by the end of the presentation day, November 11.

Analysis + peer review

The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis, modeling, and initial interpretations.

Draft report

Important

The draft report is due in your GitHub repo by December 2 at 10am.

Write the draft in the written-report.qmd file in your project repo.

Below is a brief description of the sections to focus on in the draft:

Introduction and data

This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the body of the report, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Methodology

This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, any variable transformations (if needed), and any other relevant considerations that were part of the model fitting process.

Results

In this section, you will output the final model and include a brief discussion of the model assumptions, diagnostics, and any relevant model fit statistics.

This section also includes initial interpretations and conclusions drawn from the model.

Grading

The draft will be graded based on whether there is demonstration of a reasonable attempt at each of the sections described above in the written report file in the GitHub repo by the deadline.

Peer review

Important

Peer review comments are due in GitHub on December 3 at 11:59pm.

Critically reviewing others’ work is a crucial part of the scientific process, and STA 221 is no exception. Each lab team will be assigned two other teams’ projects to review. Each team should push their draft to their GitHub repo by 10 am on the day their lab’s draft is due. The lab that week will be dedicated to the peer review, so your team will have time to review and provide quality feedback to two other teams.

During the peer review process, you will be provided read-only access to your partner teams’ GitHub repos. Provide your review in the form of GitHub issues to your partner team’s GitHub repo using the issue template provided in the repo.

Steps for peer review

Peer review assignments

Click here to see the teams you’re peer reviewing. You’ll spend about 30 minutes reviewing each project.

  • When you get to lab, you should have access to the GitHub repos for the teams you’re reviewing. In GitHub, search the repositories for project, and you should see the repos for the projects you’re reviewing. You will be able to read the files in the repo and post issues, but you cannot push changes to the repo. You will have access to the repo until the deadline for the peer review.

  • You may choose to all work on both peer reviews or have some team members focus on a single peer review. Either way there will be one peer review grade assigned per team.

  • For each team you’re reviewing:

    • Open that team’s repo, read the project draft, and browse the rest of the repo.

    • Go to the Issues tab in that repo, click on New issue, and click on Get started for the Peer Review issue. Write your responses to the prompts in the issue. You will answer the following questions:

      • Describe the goal of the project.

      • Describe the data set used in the project. What are the observations in the data? What is the source of the data? How were the data originally collected?

      • Consider the exploratory data analysis (EDA). Describe one aspect of the EDA that is effective in helping you understand the data. Provide constructive feedback on how the team might improve the EDA.

      • Describe the statistical methods, analysis approach, and discussion of model assumptions, diagnostics, model fit.

      • Provide constructive feedback on how the team might improve their analysis. Make sure your feedback includes at least one comment on the statistical modeling aspect of the project, but also feel free to comment on aspects beyond the modeling.

      • Provide constructive feedback on the interpretations and initial conclusion. What is most effective in the presentation of the results? What additional detail can the team provide to make the results and conclusions easier for the reader to understand?

      • What aspect of this project are you most interested in and think would be interesting to highlight in the written report?

      • Provide constructive feedback on any issues with file and/or code organization.

      • (Optional) Any further comments or feedback?

Grading

The peer review will be graded on the extent to which each review comprehensively and constructively addresses the components on the peer review form. There will be one peer review grade per team.

Round 1 submission (optional)

Important

The Round 1 submission is due on Sunday, December 8 at 11:59pm. Reports submitted after this date will not receive preliminary feedback.

The Round 1 submission is an opportunity to receive detailed feedback on your analysis and written report before the final submission. To make the feedback most useful, you must submit a complete written report. You will also be notified of the grade you would receive at that point. You will have the option to keep the grade (and thus you don’t need to turn in an updated report) or resubmit the written report by the final submission deadline to receive a new grade.

To submit the Round 1 submission:

  1. Push the updated written-report.qmd and written-report.pdf to your GitHub repo.

  2. Open an issue with the title “Round 1 Submission”. You can use the template issue in the GitHub repo. Make sure I am tagged in the issue (@matackett), so I receive an email notification of your Round 1 submission. See Creating an issue from a repository for instructions on opening an issue. Please ask a member of the teaching team for assistance if you need help opening the issue.

Note that this is optional, so there is no grade penalty for not turning in a Round 1 submission. Due to time constraints at the end of the semester, only high-level feedback will be given for the reports submitted at the final written report deadline on December 12.

Written report

Important

The written report is due on Thursday, December 12 at 9pm.

Your written report must be completed in the written-report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.

Before you finalize your write up, make sure the code chunks are not visible and all messages and warnings are suppressed.

  • You will submit the PDF of your final report on GitHub.

  • The PDF you submit must match the .qmd in your GitHub repository exactly. The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including tables and visualizations, must be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis and report.

  • Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.

  • You are welcome to include an appendix with additional work at the end of the written report document; however, grading will overwhelmingly be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.

Introduction and data

This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.

Methodology

This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.

Results

Describe the key results from the model. The goal is not to interpret every single variable in the model but rather to show that you are proficient in using the model output to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Discussion + Conclusion

In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Organization + formatting

This is an assessment of the overall presentation and formatting of the written report.

Reproducibility + organization

Peer teamwork evaluation

You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly.
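
The "less than half the expected contribution" cutoff depends on team size. A minimal Python sketch of that calculation, assuming the expected contribution is an equal share of 100% (the function name is hypothetical):

```python
def explanation_threshold(team_size):
    """Contribution percentage below which an explanation is requested.

    Assumes the expected contribution is an equal split of 100% across
    the team, so the threshold is half of that equal share.
    """
    expected_share = 100 / team_size
    return expected_share / 2


print(explanation_threshold(4))  # → 12.5
print(explanation_threshold(5))  # → 10.0
```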

Overall grading

The grade breakdown is as follows:

Total 100 pts
Research questions 3 pts
Project proposal 10 pts
Exploratory data analysis 15 pts
Presentation 10 pts
Presentation comments 2 pts
Draft report + peer review 15 pts
Written report 40 pts
Reproducibility + organization 5 pts

Grading summary

Grading of the project will take into account the following:

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?

  • Correctness - Are statistical procedures carried out and explained correctly?

  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?

  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.

  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.

  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.

  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.

  • Below 60%: Student is not making a sufficient effort.

Late work policy

There is no late work accepted on the draft report or presentation. Other components of the project may be accepted up to 48 hours late. A 10% late deduction will apply for each 24-hour period late.
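
To make the deduction concrete, here is a minimal Python sketch of the arithmetic. It assumes that a partial 24-hour period counts as a full period, which the policy does not state explicitly; the function name and return convention are hypothetical, so confirm edge cases with the teaching team.

```python
import math


def late_deduction_percent(hours_late, late_work_allowed=True):
    """Return the late deduction in percent, or None if the work is not accepted.

    Assumption (not stated in the policy): any started 24-hour period
    counts as a full period. Set late_work_allowed=False for the draft
    report and presentation, which accept no late work.
    """
    if hours_late <= 0:
        return 0  # submitted on time
    if not late_work_allowed or hours_late > 48:
        return None  # outside the 48-hour window or no late work accepted
    return 10 * math.ceil(hours_late / 24)


print(late_deduction_percent(5))    # → 10
print(late_deduction_percent(30))   # → 20
print(late_deduction_percent(50))   # → None
```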

Be sure to turn in your work early to avoid any technological mishaps.