First, we load a few R packages

library(tidyverse)
library(broom) # installed, but not loaded with library(tidyverse)
library(dslabs) # needs to be installed 
library(HistData) # needs to be installed

Most of the material in this lecture was taken from here and this wonderful dsbook from Rafael Irizarry.

Motivation

In this lecture, we will discuss one of the most important aspects of analyzing data: being skeptical of results. We discuss some reasons why and give you examples.

“Correlation is not causation” is perhaps the most important lesson one learns in a statistics class. In this lecture, we will consider tools useful for quantifying associations between variables. However, we must be careful not to overinterpret these associations.

There are many reasons that a variable \(X\) can be correlated with a variable \(Y\) without either being a cause for the other. Here we examine three common ways that can lead to misinterpreting data.

Spurious correlation
Outliers
Reversing cause and effect
Confounders

Next, we will discuss in detail what each of these are and give an example.

Spurious correlation

The following comical example underscores that correlation is not causation. It shows a very strong correlation between divorce rates and margarine consumption.

the_title <- paste("Correlation =", 
      round(with(divorce_margarine, 
      cor(margarine_consumption_per_capita, divorce_rate_maine)),2))
data(divorce_margarine)
divorce_margarine %>% 
  ggplot(aes(margarine_consumption_per_capita, 
             divorce_rate_maine)) + 
  geom_point(cex=3) + 
  geom_smooth(method = "lm") + 
  ggtitle(the_title) +
  xlab("Margarine Consumption per Capita (lbs)") + 
  ylab("Divorce rate in Maine (per 1000)")

Does this mean that margarine causes divorces? Or do divorces cause people to eat more margarine? Of course the answer to both these questions is no. This is just an example of what we call a spurious correlation.

You can see many more absurd examples of this wesbsite completely dedicated to spurious correlations.

The cases presented in the spurious correlation site are all examples of what is generally called data dredging, data fishing, or data snooping. It’s basically a form of what in the US they call cherry picking. An example of data dredging would be if you look through many results produced by a random process and pick the one that shows a relationship that supports a theory you want to defend.

A Monte Carlo simulation can be used to show how data dredging can result in finding high correlations among uncorrelated variables. We will save the results of our simulation into a tibble:

N <- 25
G <- 1000000
set.seed(1000)
sim_data <- tibble(group = rep(1:G, each = N),
                   X = rnorm(N*G), Y = rnorm(N*G))
sim_data

## # A tibble: 25,000,000 x 3
##    group       X       Y
##    <int>   <dbl>   <dbl>
##  1     1 -0.446  -0.481 
##  2     1 -1.21    1.47  
##  3     1  0.0411 -0.183 
##  4     1  0.639  -0.733 
##  5     1 -0.787   1.61  
##  6     1 -0.385   0.0306
##  7     1 -0.476  -0.0607
##  8     1  0.720  -1.32  
##  9     1 -0.0185 -2.38  
## 10     1 -1.37    1.03  
## # … with 24,999,990 more rows

The first columns denotes group and we simulated of groups each with 25 observations. For each group we generate 25 observations, which are stored in the second and third columns. These are just random independent normally distributed data. So we know, because we constructed the simulation, that \(X\) and \(Y\) are not correlated.

Next, we compute the correlation between X and Y for each group and look at the max:

res <- sim_data %>% 
  group_by(group) %>% 
  summarize(r = cor(X, Y)) %>% 
  arrange(desc(r))
res

## # A tibble: 1,000,000 x 2
##     group     r
##     <int> <dbl>
##  1 625883 0.857
##  2 892542 0.767
##  3 545703 0.766
##  4 908306 0.765
##  5 412189 0.762
##  6 114308 0.760
##  7  12623 0.760
##  8 273433 0.754
##  9 274449 0.753
## 10 625177 0.750
## # … with 999,990 more rows

We see a correlation of 0.857 and if you just plot the data from that group it shows a convincing plot that \(X\) and \(Y\) are in fact correlated:

sim_data %>% 
  filter(group == res$group[which.max(res$r)]) %>%
  ggplot(aes(X, Y)) +
  geom_point() + 
  geom_smooth(method = "lm")

Remember that the correlation summary is a random variable. Here is the distribution generated by the Monte Carlo simulation:

res %>% 
  ggplot(aes(x=r)) + 
  geom_histogram(binwidth = 0.1, color = "black")

It is just a mathematical fact that if we observe 10^{6} random correlations that are expected to be 0 but have a standard error of 0.204, the largest one will be close 1.

Note: If we performed regression on this group and interpreted the p-value, we would incorrectly claim this was a statistically significant relation:

sim_data %>% 
  filter(group == res$group[which.max(res$r)]) %>%
  do(tidy(lm(Y ~ X, data = .)))

## # A tibble: 2 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   -0.226     0.122     -1.84 0.0781      
## 2 X              1.11      0.139      7.97 0.0000000453

This particular form of data dredging is referred to as p-hacking. P-hacking is a topic of much discussion because it is a problem in scientific publications.

Because publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results. In epidemiology and the social sciences, for example, researchers may look for associations between an adverse outcome and several exposures and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to adjust for confounding and pick the one that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, yet only the results of the one experiment with a small p-value reported.

This does not necessarily happen due to unethical behavior, but rather to statistical ignorance or wishful thinking. In advanced statistics courses you learn methods to adjust for these multiple comparisons.

Outliers

Suppose we take measurements from two independent outcomes, \(X\) and \(Y\), and we standardize the measurements. However, imagine we make a mistake and forget to standardize entry 23. We can simulate such data using:

set.seed(1)
x <- rnorm(100,100,1)
y <- rnorm(100,84,1)
x[-23] <- scale(x[-23])
y[-23] <- scale(y[-23])

The data look like this:

tibble(x,y) %>%
  ggplot(aes(x,y)) + 
  geom_point(alpha = 0.5)

Not surprisingly, the correlation is very high:

cor(x,y)

## [1] 0.9881391

But this is driven by the one outlier. If we remove this outlier, the correlation is greatly reduced to almost 0, what it should be:

cor(x[-23], y[-23])

## [1] -0.001066464

There is an alternative to the sample correlation for estimating the population correlation that is robust to outliers. It is called Spearman correlation.

The idea is simple: compute the correlation on the ranks of the values. Here is a plot of the ranks plotted against each other:

tibble(x,y) %>% 
  ggplot(aes(rank(x),rank(y))) + 
  geom_point(alpha = 0.5)

The outlier is no longer associated with a very large value and the correlation comes way down:

cor(rank(x), rank(y))

## [1] 0.06583858

Spearman correlation can also be calculated like this:

cor(x, y, method = "spearman")

## [1] 0.06583858

There are also methods for robust fitting of linear models which you can learn about in, for instance, this book:

Robust Statistics: Edition 2 Peter J. Huber Elvezio M. Ronchetti

Reversing cause and effect

Another way association is confused with causation is when the cause and effect are reversed. An example of this is claiming that tutoring makes students perform worse because they test lower than peers that are not tutored. In this case, the tutoring is not causing the low test scores, but the other way around.

A form of this claim was actually made it into an op-ed in the New York Times titled Parental Involvement Is Overrated.

Consider this quote from the article:

When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades… Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.

A very likely possibility is that the children needing regular parental help, receive this help because they don’t perform well in school.

We can easily construct an example of cause and effect reversal using the father and son height data. If we fit the model:

\[X_i = \beta_0 + \beta_1 y_i + \varepsilon_i, i=1, \dots, N\]

to the father and son height data, with \(X_i\) the father height and \(y_i\) the son height, we do get a statistically significant result:

library(HistData)
data("GaltonFamilies")
GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight) %>% 
  do(tidy(lm(father ~ son, data = .)))

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   34.0      4.57        7.44 4.31e-12
## 2 son            0.499    0.0648      7.70 9.47e-13

The model fits the data very well. If we look at the mathematical formulation of the model above, it could easily be incorrectly interpreted as to suggest that the son being tall caused the father to be tall. But given what we know about genetics and biology, we know it’s the other way around. The model is technically correct. The estimates and p-values were obtained correctly as well. What is wrong here is the interpretation.

Confounders

Confounders are perhaps the most common reason that leads to associations being misinterpreted.

If \(X\) and \(Y\) are correlated, we call \(Z\) a confounder if changes in \(Z\) cause changes in both \(X\) and \(Y\).

Incorrect interpretation due to confounders is ubiquitous in the lay press. It is sometimes hard to detect. Here we present two examples, both related to gender discrimination.

Case Study: UC Berkeley admissions

Admission data from six U.C. Berkeley majors, from 1973, showed that more men were being admitted than women: 44% men were admitted compared to 30% women. PJ Bickel, EA Hammel, and JW O’Connell. Science (1975).

Here are the data:

data(admissions)
admissions

##    major gender admitted applicants
## 1      A    men       62        825
## 2      B    men       63        560
## 3      C    men       37        325
## 4      D    men       33        417
## 5      E    men       28        191
## 6      F    men        6        373
## 7      A  women       82        108
## 8      B  women       68         25
## 9      C  women       34        593
## 10     D  women       35        375
## 11     E  women       24        393
## 12     F  women        7        341

We see the percent of men and women that were accepted was:

admissions %>% 
  group_by(gender) %>% 
  summarize(percentage =
            round(sum(admitted*applicants)/sum(applicants),1))

## # A tibble: 2 x 2
##   gender percentage
##   <chr>       <dbl>
## 1 men          44.5
## 2 women        30.3

A statistical test clearly rejects the hypothesis that gender and admission are independent:

dat <- admissions %>% 
  group_by(gender) %>% 
  summarize(total_admitted = round(sum((admitted/100)*applicants)), 
            not_admitted = sum(applicants) - sum(total_admitted)) %>%
  select(-gender) 

dat %>% 
  do(tidy(chisq.test(.)))

## # A tibble: 1 x 4
##   statistic  p.value parameter method                                      
##       <dbl>    <dbl>     <int> <chr>                                       
## 1      91.6 1.06e-21         1 Pearson's Chi-squared test with Yates' cont…

But closer inspection shows a paradoxical result. Here are the percent admissions by major:

admissions %>% 
  select(major, gender, admitted) %>%
  spread(gender, admitted) %>%
  mutate(women_minus_men = women - men)

##   major men women women_minus_men
## 1     A  62    82              20
## 2     B  63    68               5
## 3     C  37    34              -3
## 4     D  33    35               2
## 5     E  28    24              -4
## 6     F   6     7               1

Four out of the six majors favor women. More importantly all the differences are much smaller than the 14.2 difference that we see when examining the totals.

The paradox is that analyzing the totals suggests a dependence between admission and gender, but when the data is grouped by major, this dependence seems to disappear.

What’s going on? This actually can happen if an uncounted confounder is driving most of the variability.

So let’s define three variables:

\(X=1\) for men and \(X=0\) for women
\(Y=1\) for those admitted and \(Y=0\) otherwise
\(Z\) quantifies how selective is the major

A gender bias claim would be based on the fact that \(\mbox{Pr}(Y=1 | X = x)\) is higher for \(X=1\) then \(X=0\). But \(Z\) is an important confounder to consider. Clearly \(Z\) is associated with \(Y\), as the more selective a major, the lower \(\mbox{Pr}(Y=1 | Z = z)\).

But is major selectivity \(Z\) associated with gender \(X\)?

One way to see this is to plot the total percent admitted to a major versus the percent of women that made up the applicants:

admissions %>% 
  group_by(major) %>% 
  summarize(major_selectivity = sum(admitted*applicants)/sum(applicants),
            percent_women_applicants = sum(applicants*(gender=="women")/sum(applicants))*100) %>%
  ggplot(aes(major_selectivity, percent_women_applicants, label = major)) +
  geom_text()

There seems to be an association. The plot suggests that women were much more likely to apply to the two “hard” majors: gender and major’s selectivity are confounded. Compare, for example, major B and major E. Major E is much harder to enter than major B and over 60% of applicants to major E were women while less than 10% of the applicants of major B were women.

Confounding explained graphically

The following plot shows the percent of applicants that were accepted by gender.

admissions %>% 
  mutate(percent_admitted = admitted*applicants/sum(applicants)) %>%
  ggplot(aes(gender, y = percent_admitted, fill = major)) +
  geom_bar(stat = "identity", position = "stack")

It also breaks down the acceptance rates by major: the size of the colored bars represents the percent of each major students were admitted to. This breakdown allows us to see that the majority of accepted men come from two majors: A and B. It also lets us see that few women were accepted to these majors.

What the plot does not show us is the number of applicants for each major.

Average after stratifying

In this plot, we can see that if we condition or stratify by major, and then look at differences, we control for the confounder and this effect goes away.

admissions %>% 
  ggplot(aes(major, admitted, col = gender, 
             size = applicants)) +
  geom_point() + xlab("Major") + 
  ylab("Admission percentage by major for each gender")

Now we see that major by major, there is not much difference. The size of the dot represents the number of applicants, and explains the paradox: we see large red dots and small blue dots for the easiest majors, A and B.

If we average the difference by major we find that the percent is actually 3.5% higher for women.

admissions %>% 
  group_by(gender) %>% 
  summarize(average = mean(admitted))

## # A tibble: 2 x 2
##   gender average
##   <chr>    <dbl>
## 1 men       38.2
## 2 women     41.7

Simpson’s Paradox

The case we have just covered is an example of Simpson’s Paradox. It is called a paradox because we see the sign of the correlation of flip when comparing the entire dataset and specific strata. The following is an illustrative example. Suppose you have three variables \(X\), \(Y\) and \(Z\). Here is a scatterplot of \(Y\) versus \(X\):

N <- 100
Sigma <- matrix(c(1,0.75,0.75, 1), 2, 2)*1.5
means <- list(c(11,3), c(9,5), c(7,7), c(5,9), c(3,11))
dat <- lapply(means, function(mu) 
  MASS::mvrnorm(N, mu, Sigma))
dat <- Reduce(rbind, dat)
colnames(dat) <- c("X", "Y")
dat <- tbl_df(dat) %>% 
  mutate(Z = as.character(rep(seq_along(means), each = N)))
dat %>% ggplot(aes(X,Y)) + geom_point(alpha = .5) +
  ggtitle(paste("correlation = ", round(cor(dat$X, dat$Y), 2)))

You can see that \(X\) and \(Y\) are negatively correlated. However, once we stratify by \(Z\), shown in different colors below, another pattern emerges.

means <- as_tibble(Reduce(rbind, means)) %>% 
  rename(x = V1, y = V2) %>% 
  mutate(z = as.character(seq_along(means)))

## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.

corrs <- dat %>% 
  group_by(Z) %>% 
  summarize(cor = cor(X,Y)) %>%
  pull(cor) 

dat %>% ggplot(aes(X, Y, color = Z)) + 
  geom_point(show.legend = FALSE, alpha = 0.5) +
  ggtitle(paste("correlations =", 
                paste(signif(corrs,2), collapse=" "))) +
  annotate("text", x = means$x, y = means$y, 
           label = paste("Z=", means$z), cex = 5)

It is really \(Z\) that is negatively correlated with \(X\). If we stratify by \(Z\), the \(X\) and \(Y\) are actually positively correlated.

Case Study: Research funding success

Here we examine a similar case to the UC Berkeley admissions example, but much more subtle.

A 2014 PNAS paper analyzed success rates from funding agencies in the Netherlands and concluded that their:

results reveal gender bias favoring male applicants over female applicants in the prioritization of their “quality of researcher” (but not “quality of proposal”) evaluations and success rates, as well as in the language used in instructional and evaluation materials.

or that gender contributes to personal research funding success in The Netherlands.

The main evidence for this conclusion comes down to a comparison of the percentages. Table S1 in the paper includes the information we need:

data("research_funding_rates")
research_funding_rates

##            discipline applications_total applications_men
## 1   Chemical sciences                122               83
## 2   Physical sciences                174              135
## 3             Physics                 76               67
## 4          Humanities                396              230
## 5  Technical sciences                251              189
## 6   Interdisciplinary                183              105
## 7 Earth/life sciences                282              156
## 8     Social sciences                834              425
## 9    Medical sciences                505              245
##   applications_women awards_total awards_men awards_women
## 1                 39           32         22           10
## 2                 39           35         26            9
## 3                  9           20         18            2
## 4                166           65         33           32
## 5                 62           43         30           13
## 6                 78           29         12           17
## 7                126           56         38           18
## 8                409          112         65           47
## 9                260           75         46           29
##   success_rates_total success_rates_men success_rates_women
## 1                26.2              26.5                25.6
## 2                20.1              19.3                23.1
## 3                26.3              26.9                22.2
## 4                16.4              14.3                19.3
## 5                17.1              15.9                21.0
## 6                15.8              11.4                21.8
## 7                19.9              24.4                14.3
## 8                13.4              15.3                11.5
## 9                14.9              18.8                11.2

We can construct the two-by-two table used for the conclusion above:

two_by_two <- research_funding_rates %>% 
  select(-discipline) %>% 
  summarize_all(sum) %>%
  summarize(yes_men = awards_men, 
            no_men = applications_men - awards_men, 
            yes_women = awards_women, 
            no_women = applications_women - awards_women) %>%
  gather %>%
  separate(key, c("awarded", "gender")) %>%
  spread(gender, value)
two_by_two

##   awarded  men women
## 1      no 1345  1011
## 2     yes  290   177

Compute the difference in percentage:

two_by_two %>% 
  mutate(men = round(men/sum(men)*100, 1),
         women = round(women/sum(women)*100, 1)) %>% 
  filter(awarded == "yes")

##   awarded  men women
## 1     yes 17.7  14.9

Note: It’s lower for women, and find that it is almost statistically significant at the \(\alpha=0.05\) level:

two_by_two %>% 
  select(-awarded) %>% 
  chisq.test() %>% 
  tidy

## # A tibble: 1 x 4
##   statistic p.value parameter method                                       
##       <dbl>   <dbl>     <int> <chr>                                        
## 1      3.81  0.0509         1 Pearson's Chi-squared test with Yates' conti…

So there appears to be some evidence of an association. But can we infer causation here? Is gender bias causing this observed difference?

A response was published a few months later titled No evidence that gender contributes to personal research funding success in The Netherlands: A reaction to Van der Lee and Ellemers which concluded:

However, the overall gender effect borders on statistical significance, despite the large sample. Moreover, their conclusion could be a prime example of Simpson’s paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines (i.e., with low application success rates for both men and women), then an analysis across all disciplines could incorrectly show “evidence” of gender inequality.

In the UC Berkeley admissions example, the overall differences were explained by difference across disciplines. We use the same approach on the research funding data and look at comparisons by discipline:

dat <- research_funding_rates %>% 
  rename(success_total = success_rates_total,
         success_men = success_rates_men,
         success_women = success_rates_women) %>%
  gather(key, value, -discipline) %>%
  separate(key, c("type", "gender")) %>%
  spread(type, value) %>%
  filter(gender != "total") %>%
  mutate(discipline = reorder(discipline, applications, sum))

dat %>% 
  ggplot(aes(discipline, success, size = applications, color = gender)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point()

Here we see that some fields favor men and other women. We see that the two fields with the largest difference favoring men, are also the fields with the most applications. However, are any of these differences statistically significant?

Keep in mind that even when there is no bias, we will see differences due to random variability in the review process as well as random variability across candidates. If we perform a Fisher test in each discipline, we see that most differences result in p-values larger than \(\alpha = 0.05\).

do_fisher_test <- function(m, x, n, y){
  tab <- tibble(men = c(x, m-x), women = c(y, n-y))
  tidy(fisher.test(tab)) %>% 
    rename(odds = estimate) %>%
    mutate(difference = y/n - x/m)
}

res <- research_funding_rates %>% 
  group_by(discipline) %>%
  do(do_fisher_test(.$applications_men, .$awards_men, 
                    .$applications_women, .$awards_women)) %>%
  ungroup() %>%
  select(discipline, difference, p.value) %>%
  arrange(difference)
res

## # A tibble: 9 x 3
##   discipline          difference p.value
##   <chr>                    <dbl>   <dbl>
## 1 Earth/life sciences   -0.101    0.0367
## 2 Medical sciences      -0.0762   0.0175
## 3 Physics               -0.0464   1     
## 4 Social sciences       -0.0380   0.127 
## 5 Chemical sciences     -0.00865  1     
## 6 Physical sciences      0.0382   0.651 
## 7 Humanities             0.0493   0.217 
## 8 Technical sciences     0.0509   0.437 
## 9 Interdisciplinary      0.104    0.0671

We see that for Earth/Life Sciences, there is a difference of 10% favoring men and this has a p-value of 0.04. But is this a spurious correlation? We performed 9 tests. Reporting only the one case with a p-value less than 0.05 would be cherry picking.

The overall average of the difference is only -0.3%, which is much smaller than the standard error:

res %>% summarize(overall_avg = mean(difference), 
                  se = sd(difference)/sqrt(n()))

## # A tibble: 1 x 2
##   overall_avg     se
##         <dbl>  <dbl>
## 1    -0.00311 0.0226

Furthermore, note that the differences appear to follow a normal distribution:

res %>% 
  ggplot(aes(sample = scale(difference))) + 
  stat_qq() + geom_abline()

which suggests the possibility that the observed differences are just due to chance.

Importance of Skepticism in Data Science

Sept 25, 2019