One important skill in data science is being able to evaluate data analyses, whether you built them yourself, received them from peers or colleagues, or came across them in the wild. In HW1 and HW3, you practiced the skills of critiquing and evaluating data analyses.
A second important skill is being able to summarize the key ideas in a data analysis, whether you built it or it was presented to you.
In this homework assignment, you will practice both of these skills. Your goal is to evaluate and summarize a data analysis. But instead of working with a dataset and data analysis that you created, you will work with a set that someone else created. You will be randomly assigned someone else’s dataset and analysis that they created in HW3.
We have randomly assigned individuals using the sample() function in the code below. We only include first names here. Identify the person that you have been randomly assigned to, connect with the individual (either in class, by email, or on Slack), and ask them to share the twitter_data_hw2.csv file that they started with in HW3 and the 711-hw3-assignment.Rmd that they built in HW3. Save both of these into your HW4 GitHub repo. These are the data and data analysis you will use for your HW4.
student_names <- c("Brian", "Erjia", "Eric", "Elizabeth",
                   "Grant", "Haley", "Jennifer (Shiyao)",
                   "Jiyang", "Joe", "Jingning", "Linda",
                   "Kate (Yueyi)", "Runzhe", "Yifan", "Zebin")
set.seed(123)
student_names <- student_names[sample(seq_along(student_names))]
set.seed(111)
data.frame("for_HW4" = student_names,
           "use_this_data_from_HW3" = student_names[sample(seq_along(student_names))])
## for_HW4 use_this_data_from_HW3
## 1 Zebin Jiyang
## 2 Eric Brian
## 3 Yifan Jingning
## 4 Jingning Yifan
## 5 Erjia Kate (Yueyi)
## 6 Haley Erjia
## 7 Grant Linda
## 8 Elizabeth Runzhe
## 9 Kate (Yueyi) Eric
## 10 Jennifer (Shiyao) Zebin
## 11 Brian Jennifer (Shiyao)
## 12 Linda Haley
## 13 Runzhe Grant
## 14 Jiyang Joe
## 15 Joe Elizabeth
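Because two independent shuffles do not by themselves guarantee that nobody is paired with their own work, it can be worth checking the table. In the table printed above, no row pairs a student with themselves. The sketch below rebuilds the table with the same code and runs that check programmatically; the names `assignment`, `self_assigned`, and `whose_data()` are illustrative and not part of the original assignment. Note that the result of sample() for a given seed can differ between R versions, so your table may not match the printed one exactly.

```r
# Rebuild the assignment table with the same code as above
student_names <- c("Brian", "Erjia", "Eric", "Elizabeth",
                   "Grant", "Haley", "Jennifer (Shiyao)",
                   "Jiyang", "Joe", "Jingning", "Linda",
                   "Kate (Yueyi)", "Runzhe", "Yifan", "Zebin")
set.seed(123)
student_names <- student_names[sample(seq_along(student_names))]
set.seed(111)
assignment <- data.frame(
  "for_HW4" = student_names,
  "use_this_data_from_HW3" = student_names[sample(seq_along(student_names))],
  stringsAsFactors = FALSE  # keep names as character in older R versions too
)

# TRUE for any row where a student would be evaluating their own analysis
self_assigned <- assignment$for_HW4 == assignment$use_this_data_from_HW3
any(self_assigned)

# Hypothetical helper: whose HW3 materials should a given student request?
whose_data <- function(name) {
  assignment$use_this_data_from_HW3[assignment$for_HW4 == name]
}
whose_data("Zebin")
```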
Identify the person that you have been randomly assigned to and get access to the twitter_data_hw2.csv file that they used in HW3 and the data analysis (the .Rmd file) that they built in HW3. Read the data analysis (the .Rmd) closely and make sure you can reproduce their analysis results by knitting the RMarkdown document. Then, load the data into R using the readr package.
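A minimal sketch of the knit-then-load step, assuming both files were saved at the top level of your HW4 repo under the file names given earlier (the paths and the object name `twitter_data` are assumptions; adjust them to your setup). The file checks simply skip a step when a file is not there yet.

```r
# Assumed locations: both files saved at the top level of your HW4 repo
rmd_path <- "711-hw3-assignment.Rmd"
csv_path <- "twitter_data_hw2.csv"

# Knit the analysis to confirm it runs end to end on your machine
if (file.exists(rmd_path)) {
  rmarkdown::render(rmd_path)
}

# Load the raw data with readr and take a first look at its structure
twitter_data <- NULL
if (file.exists(csv_path)) {
  twitter_data <- readr::read_csv(csv_path)
  print(dim(twitter_data))               # number of rows and columns
  print(readr::problems(twitter_data))   # rows readr could not parse cleanly
}
```

If the knit fails, the error message is itself useful evidence for the reproducibility part of your evaluation.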
Reflect on the analysis and answer the following questions:
Identify the audience that this analysis was built for (this should be clearly stated in the analysis in HW3; if not, then make an educated guess).
Describe the audience here:
Create a rubric (as you did in HW1 and in your self-evaluation in HW3) to evaluate the data analyst’s analysis while keeping in mind who the audience was defined to be. As part of HW3, the analyst has provided a set of principles (HW3, Problem 2) that they prioritized when building the analysis. They also provided a self-evaluation (HW3, Problem 4).
Now, you have the opportunity to also evaluate this analysis. Using the data analysis principles and corresponding weights \(w_i\) defined in HW3 by the data analyst, assign integer scores \(S_i \in [0, 10]\) that represent how much the given data analysis exhibits each principle, where 0 is low and 10 is high. Your scores can be similar to or different from those provided in HW3 by the data analyst.
For example, if the principle is reproducibility, give a score between 0 and 10 for how reproducible individual parts of the analysis or the overall analysis (up to you) are, where 0 is not reproducible and 10 is fully reproducible.
As a reminder, if there were only two principles, an example rubric might look like:
rubric <- data.frame(principle = rep(c("reproducibility", "exhaustive"), 2),
                     weight = rep(c(0.1, 0.9), 2),
                     score_from_analyst_in_hw3 = c(10, 4, 2, 6),
                     score_from_you = c(8, 9, 1, 4))
rubric
## principle weight score_from_analyst_in_hw3 score_from_you
## 1 reproducibility 0.1 10 8
## 2 exhaustive 0.9 4 9
## 3 reproducibility 0.1 2 1
## 4 exhaustive 0.9 6 4
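One way to compare the two sets of scores is a weighted sum \(\sum_i w_i S_i\). The assignment does not prescribe this summary, so treat it as an assumption; the sketch also assumes the weights sum to 1, as they do in the two-principle example, and uses only the first two rows of the example rubric. The names `example_rubric`, `overall_analyst`, and `overall_you` are illustrative.

```r
# First two rows of the example rubric (one score per principle)
example_rubric <- data.frame(
  principle = c("reproducibility", "exhaustive"),
  weight = c(0.1, 0.9),
  score_from_analyst_in_hw3 = c(10, 4),
  score_from_you = c(8, 9)
)

# Sanity checks: weights sum to 1 (assumed) and scores are integers in [0, 10]
scores <- unlist(example_rubric[c("score_from_analyst_in_hw3", "score_from_you")])
stopifnot(isTRUE(all.equal(sum(example_rubric$weight), 1)))
stopifnot(all(scores == round(scores)), all(scores >= 0), all(scores <= 10))

# Weighted overall score for each evaluator: sum over principles of w_i * S_i
overall_analyst <- sum(example_rubric$weight * example_rubric$score_from_analyst_in_hw3)
overall_you <- sum(example_rubric$weight * example_rubric$score_from_you)

# Per-principle gap between your score and the analyst's self-evaluation
example_rubric$difference <-
  example_rubric$score_from_you - example_rubric$score_from_analyst_in_hw3
example_rubric
```

With these example numbers the weighted totals are 4.6 for the analyst and 8.9 for you; the heavily weighted "exhaustive" principle drives most of that gap, which is the kind of observation worth explaining in words.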
Describe in words why you gave the scores that you did. Do you think the analyst evaluated their own work fairly?
Are there other principles that you believe should have been included and/or removed?
In this problem, you will practice creating data visualizations.
Take one plot from HW3, reproduce it here:
Reflect on it and list four ways to improve it:
In this problem, you will practice the data analysis principle of transparency. Create one plot (or set of plots) that demonstrates transparency. You cannot use a plot that was created by the data analyst in HW3 already.
Helpful hint: as a reminder, transparent analyses are defined as follows:
Transparent analyses present an element or subset of elements summarizing or visualizing data that are influential in explaining how the underlying data phenomena or data-generation process connects to any key output, results, or conclusions. While the totality of an analysis may be complex and involve a long sequence of steps, transparent analyses extract one or a few elements from the analysis that summarize or visualize key pieces of evidence in the data that explain the most “variation” or are most influential to understanding the key results or conclusion. One aspect of being transparent is showing the approximate mechanism by which the data inform the results or conclusion.
Explain why you think this is a transparent plot.
Another way to summarize the key ideas in an analysis is with words. Write the following:
For this problem, assume the audience is the general public.