An important part of data science is being able to evaluate data analyses and data analytic products. In this homework assignment you will practice this skill by comparing and contrasting data analyses and coming up with your own rubric for evaluating the analyses.
There are two parts to this homework assignment each with different due dates:
Here is a prompt that includes a brief introduction to the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
Assume two data analysts have been given the same prompt above and you are the audience who will evaluate the two analyses. Assume you do not have access to the database itself until you see the data analyses. Create a rubric that you would use to evaluate the data analyses.
Helpful hints:
The rubric needs to include:
For each data analysis principle (\(i\)), include:
Note: You could exclude reproducibility entirely from your rubric (e.g. this is a principle that you as an audience are ambivalent about, so it does not factor into your evaluation of the data analyses). Alternatively, you could include it in your rubric but intentionally downweight its importance by giving it a small weight (e.g. maybe analyzing the data as fast as possible is your priority and you do not want the analyst spending time on making the analysis reproducible). These are two different approaches that you should think about carefully when building your rubric.
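The two approaches in the note above can be made concrete with a small sketch. The principle names other than reproducibility, the weights, and the scores are all hypothetical, and the weighted sum used to combine them is an assumption made for illustration, not a scoring rule prescribed by this assignment.

```r
# Hypothetical scores (0-10) for one analysis; illustration only
scores <- c(reproducibility = 2, exhaustive = 8, skeptical = 6)

# Approach A: reproducibility is left out of the rubric entirely
weights_excluded <- c(exhaustive = 0.6, skeptical = 0.4)

# Approach B: reproducibility is kept, but with a small weight
weights_downweighted <- c(reproducibility = 0.05, exhaustive = 0.55, skeptical = 0.40)

# Assumed scoring rule: weighted sum over the principles present in the rubric
total_excluded <- sum(weights_excluded * scores[names(weights_excluded)])
total_downweighted <- sum(weights_downweighted * scores[names(weights_downweighted)])

c(excluded = total_excluded, downweighted = total_downweighted)
```

Under approach A, a poorly reproducible analysis is not penalized at all; under approach B, it loses a small amount. Which behavior you want depends on how you, as the audience, value that principle.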
Note: Part 1 is due Friday, September 13, 2019 at 11:59 PM. Write your answer in this R Markdown document, knit it, commit your changes, and push to GitHub. You must also submit your answer to this Google Form by the due date. We will release Part 2 on Saturday, September 14, 2019 at 12:01 AM.
On Saturday September 14, 2019 at 12:01 AM we will release the links to the two data analyses by sending the links on Slack.
In this part, apply the rubric you built in Part 1 to the two data analyses.
For a given data analysis and for each data analysis principle (\(i\)) in your rubric:
Next, create a data frame containing four columns. For a given analysis, each row should contain the data_analysis (Column 1) that you are evaluating, the data analysis principle (Column 2), the weight (Column 3) associated with that principle, and the score (Column 4) that you assigned to the analysis for that principle. For example, if there were only two principles in the rubric, the data frame could look like this:
# One row per (analysis, principle) pair
rubric <- data.frame(data_analysis = rep(paste("analysis", 1:2, sep = "_"), each = 2),
                     principle = rep(c("reproducibility", "exhaustive"), 2),
                     weight = rep(c(0.1, 0.9), 2),
                     score = c(10, 4, 2, 6))
rubric
##   data_analysis       principle weight score
## 1    analysis_1 reproducibility    0.1    10
## 2    analysis_1      exhaustive    0.9     4
## 3    analysis_2 reproducibility    0.1     2
## 4    analysis_2      exhaustive    0.9     6
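Once the scores are in this long format, one possible way to compare the two analyses is a weighted total per analysis. The sketch below assumes a weighted sum (weight times score, added over principles); that rule and the column name weighted_score are assumptions for illustration, and your rubric may combine scores differently.

```r
# Rebuild the two-principle example so the snippet runs on its own
rubric <- data.frame(
  data_analysis = rep(paste("analysis", 1:2, sep = "_"), each = 2),
  principle = rep(c("reproducibility", "exhaustive"), 2),
  weight = rep(c(0.1, 0.9), 2),
  score = c(10, 4, 2, 6)
)

# Assumed scoring rule: weight each score, then sum within each analysis
rubric$weighted_score <- rubric$weight * rubric$score
totals <- aggregate(weighted_score ~ data_analysis, data = rubric, FUN = sum)
totals
```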
Once you have completed this, think about and make specific suggestions for how the two analyses could be improved (e.g. what changes would make each analysis better from your perspective as the audience, given your rubric?).
In this part, answer the following questions:
In this part, we will now introduce two different audiences. Instead of assuming you are the audience, let’s assume there are two new (but different) audiences:
The head of the United States Environmental Protection Agency (EPA). This is an independent agency of the United States federal government with a goal of protecting the environment.
A potential homebuyer who is interested in purchasing a house along the coast in the state of Florida.
Reflecting on these new audiences:
Update the data frame you created above to include this new information. Include a new column called audience corresponding to which audience is evaluating the data analyses. For example, if there were only two principles included in the rubric that all three audiences cared about, the data frame might look like this:
# One row per (analysis, audience, principle) combination
final_rubric <- data.frame(data_analysis = rep(paste("analysis", 1:2, sep = "_"), each = 6),
                           audience = rep(rep(c("me", "EPA", "homebuyer"), each = 2), 2),
                           principle = rep(c("reproducibility", "exhaustive"), 6),
                           weight = rep(c(0.1, 0.9), 6),
                           score = c(10, 4, 2, 6, 1, 9, 4, 3, 4, 5, 8, 4))
final_rubric
##    data_analysis  audience       principle weight score
## 1     analysis_1        me reproducibility    0.1    10
## 2     analysis_1        me      exhaustive    0.9     4
## 3     analysis_1       EPA reproducibility    0.1     2
## 4     analysis_1       EPA      exhaustive    0.9     6
## 5     analysis_1 homebuyer reproducibility    0.1     1
## 6     analysis_1 homebuyer      exhaustive    0.9     9
## 7     analysis_2        me reproducibility    0.1     4
## 8     analysis_2        me      exhaustive    0.9     3
## 9     analysis_2       EPA reproducibility    0.1     4
## 10    analysis_2       EPA      exhaustive    0.9     5
## 11    analysis_2 homebuyer reproducibility    0.1     8
## 12    analysis_2 homebuyer      exhaustive    0.9     4
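To see how the audiences differ, the long data frame can be collapsed into an audience-by-analysis table. This is only a sketch: it again assumes a weighted sum as the scoring rule, and the helper column weighted_score is a name introduced here for illustration.

```r
# Rebuild the example so the snippet runs on its own
final_rubric <- data.frame(
  data_analysis = rep(paste("analysis", 1:2, sep = "_"), each = 6),
  audience = rep(rep(c("me", "EPA", "homebuyer"), each = 2), 2),
  principle = rep(c("reproducibility", "exhaustive"), 6),
  weight = rep(c(0.1, 0.9), 6),
  score = c(10, 4, 2, 6, 1, 9, 4, 3, 4, 5, 8, 4)
)

# Assumed scoring rule: weight each score before summing
final_rubric$weighted_score <- final_rubric$weight * final_rubric$score

# xtabs() sums weighted_score within each audience/analysis cell
by_audience <- xtabs(weighted_score ~ audience + data_analysis, data = final_rubric)
by_audience
```

Each cell is the weighted total that one audience gives one analysis, which makes it easy to spot where audiences disagree about which analysis is stronger.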
Save this data frame to a .csv file in your GitHub repo with the title 711-HW1-scores-<yourlastname>.csv where you replace <yourlastname> with your last name.
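# File name is passed by position: the argument is named `path` in older readr releases and `file` from readr 1.4.0 onward
readr::write_csv(final_rubric, "711-HW1-scores-<yourlastname>.csv")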
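Before committing, it can be worth reading the file back to confirm it has the expected columns. The sketch below uses base R and a temporary file so that it runs without readr; the small data frame and the temporary path are stand-ins for your real final_rubric and file name.

```r
# Small stand-in for final_rubric (same columns, hypothetical values)
final_rubric <- data.frame(
  data_analysis = c("analysis_1", "analysis_2"),
  audience = c("me", "EPA"),
  principle = c("reproducibility", "exhaustive"),
  weight = c(0.1, 0.9),
  score = c(10, 6)
)

# A temporary file stands in for 711-HW1-scores-<yourlastname>.csv
csv_path <- tempfile(fileext = ".csv")
write.csv(final_rubric, csv_path, row.names = FALSE)

# Read the file back and confirm the structure survived the round trip
saved <- read.csv(csv_path, stringsAsFactors = FALSE)
expected_cols <- c("data_analysis", "audience", "principle", "weight", "score")
stopifnot(identical(names(saved), expected_cols), nrow(saved) == nrow(final_rubric))
```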