For your final assignment in 140.712.01 Advanced Data Science II, you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results.
You may work individually or in a team. If working alone, you are still expected to complete all parts of the project (described below). If part of a team, you will work closely with other classmates (a maximum of 3 students per team) on this project. You can come up with your own teams and use Slack to find prospective team members. In general, we do not anticipate that the grades for each group member will be different. However, we reserve the right to assign different grades to each group member based on peer assessments (see below).
There are a few milestones for your final project. It is critical to note that no extensions will be given for any of the project due dates for any reason. Late days may not be used. Projects submitted after the final due date will not be graded. If you anticipate any issues (e.g., due to travel) you need to send an email to the teaching staff at least one week in advance.
Date | Description |
---|---|
November 15 by 11:59pm | Form a team and submit a project proposal |
November 18 (in class) | Final project team proposal presentations |
December 13 by 11:59pm | RMarkdown and compiled HTML due |
December 13 by 11:59pm | Project webpage and screencast due |
December 15 by 11:59pm | Peer assessment due |
December 16, 18 | Final project presentations, project screencasts shown |
There are several deliverables for your project that will be graded individually to make up your final project score.
Start by filling out this Google Form to define your team and project proposal. The form should be submitted by Friday, November 15, 2019 at 11:59pm. The title can be changed at a later date. Each team (or individual, if working alone) needs to submit only one form. Based on your proposal, you will give a presentation in class the following Monday, November 18, 2019 to get feedback on the idea.
An important part of your project is the RMarkdown and HTML files. These will detail the steps you took in developing your solution, including how you collected the data, the alternative solutions you tried, the statistical methods you used, and the insights you gained. Equally important to your final results is how you got there! Your RMarkdown and HTML files are the place to describe and document the space of possibilities you explored at each step of your project. We strongly advise you to include many visualizations.
Your RMarkdown should include the following topics. Depending on your project type, the amount of discussion you devote to each of them will vary:
Motivation and Overview: Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal.
Related Work: Anything that inspired you, such as a paper, a web site, or something we discussed in class.
Initial Questions: What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Data: What is the data source? Document the data import, wrangling, etc.
Exploratory Data Analysis: What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions? You should use this section to motivate the statistical analyses that you decided to use in the next section.
Data Analysis: What statistical or computational method did you apply and why? What others did you consider?
Narrative and Summary: What did you learn about the data? How did you answer the questions? How can you justify your answers? What are the limitations of the analyses?
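One possible way to lay out the file is sketched below. It is an illustration rather than a required template: the section headings come from the list above, while the title, author, and `html_document` options are placeholders you would replace or adjust.

```markdown
---
title: "Project title (placeholder)"
author: "Team member names (placeholder)"
output:
  html_document:
    toc: true
    toc_float: true
---

# Motivation and Overview

# Related Work

# Initial Questions

# Data

# Exploratory Data Analysis

# Data Analysis

# Narrative and Summary
```

The `toc` and `toc_float` options add a table of contents to the compiled HTML, which can help readers navigate a long standalone document.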
As this will be your only chance to describe your project in detail, make sure that your RMarkdown file and compiled HTML file are standalone documents that fully describe your process and results. The RMarkdown and HTML files are due Friday, December 13 by 11:59pm. For instructions on how to submit, please see Submission Instructions below.
We expect you to write high-quality, readable R code in your RMarkdown file. Strive to do things the right way, and think about aspects such as reproducibility and efficiency. We also expect you to document your code.
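As a rough illustration of what documented, reproducible code might look like, here is a minimal sketch in base R. The function name `boot_mean_ci`, the seed value, and the use of the built-in `mtcars` data are assumptions made for this example only and are not part of the assignment.

```r
# Fix the random seed so that resampling gives the same result on every run.
# (The seed value itself is arbitrary.)
set.seed(712)

# Percentile bootstrap confidence interval for the mean of a numeric vector.
#   x      numeric vector of observations
#   n_boot number of bootstrap resamples
#   level  confidence level of the interval
# Returns a numeric vector of length 2: lower and upper bounds.
boot_mean_ci <- function(x, n_boot = 1000, level = 0.95) {
  stopifnot(is.numeric(x), length(x) > 1)
  # Resample with replacement and record the mean of each resample.
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Example data: mtcars ships with R, so no download is needed.
ci <- boot_mean_ci(mtcars$mpg)
print(ci)

# Record the R version and loaded packages so others can reproduce the run.
sessionInfo()
```

Setting the seed, commenting each function's inputs and outputs, and ending the document with `sessionInfo()` are common habits that make it easier for others to rerun an analysis and get the same results.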
You will create a public website for your project (e.g., GitHub Pages or any other web hosting service of your choice). The website should effectively summarize the main results of your project and tell a story. Consider your audience (the site is public) and keep the discussion at an appropriate level. The website should also link to the RMarkdown file, HTML file, and data in your GitHub repository (see below). Also embed your main visualizations and your screencast in your website.
The final project website is due Friday, December 13 by 11:59pm. For instructions on how to submit, please see Submission Instructions below.
Each team will create a two-minute screencast with narration showing a demo of your project and/or some slides. Information about how to prepare these screencasts can be found here. Please make sure that the sound quality of your video is good. Upload the video to an online video platform such as YouTube or Vimeo and embed it into your project web page. You will show your team’s video in class.
We will strictly enforce the two-minute time limit for the video, so please make sure you are not running longer. Use principles of good storytelling and presentations to get your key points across. Focus the majority of your screencast on your main contributions rather than on technical details. What do you feel is the best part of your project? What insights did you gain? What is the single most important thing you would like your audience to take away? Make sure it is front and center rather than at the end.
The final project screencast is due Friday, December 13 by 11:59pm. For instructions on how to submit, please see Submission Instructions below.
It is important to provide positive feedback to people who truly worked hard for the good of the team and to also make suggestions to those you perceived not to be working as effectively on team tasks. We ask you to provide an honest assessment of the contributions of the members of your team, including yourself. The feedback you provide should reflect your judgment of each team member:
Preparation: Were they prepared during team meetings?
Contribution: Did they contribute productively to the team discussion and work?
Respect for others’ ideas: Did they encourage others to contribute their ideas?
Flexibility: Were they flexible when disagreements occurred?
Your teammates’ assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall project score. The peer assessment is due Sunday, December 15 by 11:59pm. For instructions on how to submit, please see Submission Instructions below.
Each team member needs to fill out this Google Form for the peer evaluation. Your individual project score will take into account your self-assessment and your teammates' assessments.
The final project will be graded in three main parts:
Your individual project score will also be modified by your peer evaluations.
The course instructors will be grading the final projects based on the submitted materials described above. As part of the grading, the following factors will be taken into consideration.
Question. Your project should address a clearly specified question that is presented early on. The crafting of your question should reflect careful consideration of how the results of your analysis (and hence the answers to your question) may result in further actions or decision-making. A highly specific question may lead to logical next steps but may also be difficult to generalize. A more vague question may be quite general but may simply lead to more questions. A balance must be struck in building your data analysis.
Audience. You should describe the audience at which your analysis/presentation is aimed or targeted. Who do you imagine would (or should) be most interested in your analysis? You should also describe how this audience might make use of the information and analyses you are presenting.
Technical Level. Given the audience that you’ve selected, the technical level of detail in your presentation should be appropriately matched to the audience’s background.
Narrative. Your project presentation (in the video, RMarkdown document, and oral presentation) should build a coherent narrative that is supported by the data analysis. You should not be presenting R output that is stitched together with a few sentences.
Code Quality. The code in the RMarkdown document should be well written and clear in terms of what it is intended to do. The code should also be reproducible by others.
Statistical Methodology. Throughout the analysis, proper statistical methodology should be used. In addition, some rationale should be given that describes how or why certain methods were chosen over others. If, for example, a project involves feature selection or engineering, then you should describe the rationale for choosing some features over others. In a cluster analysis, you should describe why a certain number of clusters was chosen, etc.
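To make the cluster-analysis example concrete, here is a hedged sketch of one common way to support a choice of cluster count: an "elbow" plot of the within-cluster sum of squares. The built-in `iris` data, the k-means method, and the range of candidate values are assumptions chosen for illustration; your project may call for a different method or criterion entirely.

```r
# Assumption: iris and k-means stand in for your own data and clustering method.
set.seed(1)  # k-means uses random starting centers, so fix the seed

# Standardize the four numeric measurements so each contributes equally.
features <- scale(iris[, 1:4])

# Fit k-means for each candidate number of clusters and record the
# total within-cluster sum of squares (lower means tighter clusters).
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# Plot the criterion against k; a visible "elbow" suggests a point past
# which additional clusters yield diminishing improvement.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

A plot like this does not choose the number of clusters for you, but including it (or a comparable diagnostic) alongside a sentence of interpretation shows readers why you settled on a particular value.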
Lessons Learned. You should try to communicate any lessons learned through your analysis. These lessons should be as general as possible and should be something that may be useful in future analyses. Not every analysis will produce profound insights, but there should still be something that we learn from the data. The communication of any lessons learned should be distinct from the description of what was done.
Exploratory Analysis. Throughout the analysis, there should be exploration of the data and of the possible methods that can be applied to the data. In addition, there should be a selection process among the many possibilities generated during exploration that allows you to focus on one or a few paths forward. You should describe the rationale for choosing what to focus on.
Here are some examples of successful final projects. Note: These projects came from previous data science courses we taught that were similar to this one, except that the courses sometimes used Python rather than R.