First, we install a few packages:
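
The exact packages are not listed in the rendered notes; the set below is an assumption based on the functions whose output appears later (readr for import, dplyr for wrangling, tidyr for reshaping, ggplot2 for plots).

# Install the packages used in this lecture (only needs to be done once)
install.packages(c("readr", "dplyr", "tidyr", "ggplot2"))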

Next, we load a few R packages:
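
A minimal sketch, assuming the same packages as above:

# Load the packages for data import, wrangling and plotting
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)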

Motivation

You are a data scientist who has been hired by a law firm to assess if there is a statistically significant difference in repair times between two sets of Verizon customers. You immediately think of a standard way to test if there is a difference between the means of the two groups, namely the \(t\)-test. We will explore this approach and then discuss alternative ways of assessing if there is a difference in the repair times between two sets of customers.

In this lecture, we will learn about alternative approaches based on modern resampling techniques, specifically permutation tests and bootstrap methods, and use them to understand concepts such as sampling distributions, sampling variability, \(p\)-values, hypothesis tests and confidence intervals. The material for this lecture is drawn from the book Mathematical Statistics with Resampling and R by Laura Chihara and Tim Hesterberg (first edition, 2011).


Data

A description of the data is provided on page 3 of the book above:

Verizon is the primary local telephone company (incumbent local exchange carrier, ILEC) for a large area of the eastern United States. As such it is responsible for providing repair service for customers of other telephone companies, known as competing local exchange carriers (CLECs), in this region. Verizon is subject to fines if the repair times (the time it takes to fix a problem) for CLEC customers are substantially worse than those for Verizon customers.

The data set verizon contains a random sample of repair times for 1664 ILEC and 23 CLEC customers. The mean repair time for ILEC customers is 8.4 hours, while that for CLEC customers is 16.5 hrs. Could a difference this large be easily explained by chance?

Let’s dig into the data to answer this question.

Data import
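
The import chunk is not echoed in the rendered notes. Here is a sketch that would produce output like the one below, assuming the data are stored in a local file called verizon.csv and that the column names are then converted to lower case:

# Read in the repair times (file name/path is an assumption)
verizon <- read_csv("verizon.csv")
verizon

# Convert the column names to lower case (Time, Group -> time, group)
verizon <- verizon %>%
  rename_all(tolower)
verizon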

## Parsed with column specification:
## cols(
##   Time = col_double(),
##   Group = col_character()
## )
## # A tibble: 1,687 x 2
##     Time Group
##    <dbl> <chr>
##  1 17.5  ILEC 
##  2  2.4  ILEC 
##  3  0    ILEC 
##  4  0.65 ILEC 
##  5 22.2  ILEC 
##  6  1.2  ILEC 
##  7  2.08 ILEC 
##  8  4.97 ILEC 
##  9  0    ILEC 
## 10 27.6  ILEC 
## # … with 1,677 more rows
## # A tibble: 1,687 x 2
##     time group
##    <dbl> <chr>
##  1 17.5  ILEC 
##  2  2.4  ILEC 
##  3  0    ILEC 
##  4  0.65 ILEC 
##  5 22.2  ILEC 
##  6  1.2  ILEC 
##  7  2.08 ILEC 
##  8  4.97 ILEC 
##  9  0    ILEC 
## 10 27.6  ILEC 
## # … with 1,677 more rows

Exploratory data analysis

Let’s start by having a look at the data.
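
The corresponding code is not echoed; the output below is what dim() and table() would produce:

# Dimensions of the data: rows are customers, columns are time and group
dim(verizon)

# Number of customers in each group
table(verizon$group)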

## [1] 1687    2
## 
## CLEC ILEC 
##   23 1664

Next, we calculate the difference between the mean repair times of ILEC and CLEC customers.
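
A sketch of the dplyr code behind the output below; the object name observed_repair matches the one referred to later in the text:

# Mean repair time within each group
verizon %>%
  group_by(group) %>%
  summarize(mean_repair_time = mean(time))

# Observed difference in means: ILEC minus CLEC
observed_repair <- mean(verizon$time[verizon$group == "ILEC"]) -
  mean(verizon$time[verizon$group == "CLEC"])
observed_repair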

## # A tibble: 2 x 2
##   group mean_repair_time
##   <chr>            <dbl>
## 1 CLEC             16.5 
## 2 ILEC              8.41
## [1] -8.09752

Reminder: dplyr

Quick reminder on how dplyr functions work.
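
A sketch of what this reminder likely covers, assuming pivot_wider() is used for the reshaping step (spread() would work similarly):

# Summarize the mean repair time by group...
verizon %>%
  group_by(group) %>%
  summarize(mean_repair_time = mean(time))

# ...then reshape to wide format, one column per group
verizon %>%
  group_by(group) %>%
  summarize(mean_repair_time = mean(time)) %>%
  pivot_wider(names_from = group, values_from = mean_repair_time)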

## # A tibble: 2 x 2
##   group mean_repair_time
##   <chr>            <dbl>
## 1 CLEC             16.5 
## 2 ILEC              8.41
## # A tibble: 1 x 2
##    CLEC  ILEC
##   <dbl> <dbl>
## 1  16.5  8.41

Continued

Continuing from above.

## [1] -8.09752
## # A tibble: 2 x 2
##   group mean_repair_time
##   <chr>            <dbl>
## 1 CLEC             16.5 
## 2 ILEC              8.41
## 
## CLEC ILEC 
##   23 1664

We see the repair time is longer for the CLEC group, but this could be random variability rather than a real difference in repair times.

We cannot tell for sure whether it’s a real effect, but what we can do is estimate how easily pure random chance would produce a difference this large.

  • If that probability is small, we can conclude something else besides random variability is at work and the data provide convincing enough evidence of a true difference.
  • If the probability is not small, all we can say is that the data available does not provide convincing enough evidence that there is a true difference.

Hypothesis testing

This is the core idea of statistical significance, or classical hypothesis testing: calculate how often pure random chance would give an effect as large as the one observed in the data, in the absence of any real effect.

In our Verizon example, let’s denote \(\mu_1\) as the mean repair time for ILEC customers and \(\mu_2\) as the mean repair time for CLEC customers. We can test

\[ H_0: \mu_1 = \mu_2\] versus \[ H_1: \mu_1 < \mu_2\]

i.e. the mean repair time for ILEC customers is less than for CLEC customers under the alternative hypothesis, which would mean that Verizon is providing worse service to CLEC customers. (Note that in general, we prefer two-sided tests unless there is a strong reason to prefer one-sided. Here, we have chosen a one-sided test because the specific question we are interested in is whether ILEC repair times are less than CLEC repair times.)

In hypothesis testing, the idea is that we want to compare the test statistic (the observed difference in sample mean repair times, \(\bar{x}_1 - \bar{x}_2\), stored in observed_repair) to a reference distribution, or null distribution (the distribution of the test statistic if the null hypothesis is true).

There are different ways to calculate exact or approximate null distributions and \(p\)-values.

For example, we could try a \(t\)-test for comparing two means, using a pooled-variance \(t\)-test.
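
The call itself is not echoed in the notes, but it can be reconstructed from the output below:

# Pooled-variance two-sample t-test, one-sided alternative (ILEC < CLEC)
t.test(verizon$time[verizon$group == "ILEC"],
       verizon$time[verizon$group == "CLEC"],
       alternative = "less",
       var.equal = TRUE)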

## 
##  Two Sample t-test
## 
## data:  verizon$time[verizon$group == "ILEC"] and verizon$time[verizon$group == "CLEC"]
## t = -2.6125, df = 1685, p-value = 0.004534
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -2.996491
## sample estimates:
## mean of x mean of y 
##  8.411611 16.509130

We see a \(p\)-value of about 0.0045, suggesting that there is a difference at the \(\alpha = 0.01\) level.

However, let’s think… Why might the \(t\)-test not be appropriate here?

I can think of two reasons:

  1. The \(t\)-test assumes that each population follows a normal distribution (we can assess this using histograms or normal quantile (QQ) plots).
  2. The \(t\)-test is not accurate for skewed populations, particularly when the sample sizes are imbalanced.

It seems like we have both skewed populations and imbalanced sample sizes in our case.
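
The plotting code is not shown; a ggplot2 sketch along these lines would produce the histograms and QQ-plots described below:

# Histograms of repair times by group (both distributions are strongly skewed)
ggplot(verizon, aes(x = time)) +
  geom_histogram() +
  facet_wrap(~ group, scales = "free")

# Normal quantile (QQ) plots by group
ggplot(verizon, aes(sample = time)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ group, scales = "free")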

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms and QQ-plots show that the samples are strongly skewed. For more details on QQ-plots, see chapter 2 of the textbook by Chihara and Hesterberg.

## 
## CLEC ILEC 
##   23 1664

So, let’s try to explore a different way to calculate a \(p\)-value that doesn’t depend on a theoretical distribution.

Permutation testing

Permutation testing compares the test statistic to a reference distribution using permutations of the data.

What does that mean?

If there is no difference between the groups of customers (CLEC, ILEC), then we could split the customers into two new groups (or permute the group labels) at random and re-calculate the difference in repair times. We can repeat this process many, many times and plot the distribution of the differences in repair times. Finally, we calculate a \(p\)-value as the fraction of times the permuted statistics are as extreme as, or more extreme than, the original statistic.

This is called a permutation test.

A few issues to think about when implementing a permutation test:

  • When computing the \(p\)-value, it’s useful to add 1 to both numerator and denominator. This corresponds to including the original data as an extra resample. This avoids reporting an impossible \(p\)-value of 0, since there is always at least one resample that is as extreme as the original data, namely, the original data itself.
  • Sample permutations with replacement. Ideally, we would draw resamples (permutations) that are all unique, so sampling without replacement is more accurate. However, this is sometimes not feasible, as it may require too much memory or time to check that a new permutation doesn't match any previous one. In those cases, we draw with replacement.
  • More samples for better accuracy. In general, the more resamples, the better the accuracy. If the true \(p\)-value is \(p\), then the estimated \(p\)-value has variance approximately equal to \(p(1-p)/N\) where \(N\) is the number of resamples.

Let’s try implementing a permutation test.
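
A sketch of one way to implement the permutation test. The object name time_diff_perm and the seed are assumptions, and the values printed below come from the original (unshown) code, so they will not be reproduced exactly:

set.seed(123)  # hypothetical seed; the original is not shown

N <- 10^4 - 1                            # number of permutation resamples
n_total <- nrow(verizon)                 # 1687 customers in total
n_ilec <- sum(verizon$group == "ILEC")   # 1664 ILEC customers

time_diff_perm <- numeric(N)
for (i in 1:N) {
  # randomly assign 1664 of the 1687 repair times to the "ILEC" group
  index <- sample(n_total, n_ilec)
  time_diff_perm[i] <- mean(verizon$time[index]) - mean(verizon$time[-index])
}

head(time_diff_perm)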

## [1] -1.7633356 -7.4125287 -3.9963890 -1.7571645  2.7415609  0.8324898

We can see that the observed difference in repair times (blue dotted line) is quite extreme, although there are some permutation values that are even lower. How strong is this evidence?

We can calculate the proportion of times that we see simulated differences as extreme or smaller than the observed difference (observed_repair).
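
Assuming the time_diff_perm vector from the sketch above, and following the add-one rule mentioned earlier:

# add 1 to numerator and denominator so the p-value can never be exactly 0
(sum(time_diff_perm <= observed_repair) + 1) / (N + 1)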

## [1] 0.0178

Interesting. The \(p\)-value calculated from permutation testing (0.0178) indicates that the observed difference in means is not significant at the 1% level (though it is at the 5% level).

In the above simulation, we used \(10^4 - 1\) resamples for speed, but for more accuracy, we should use more resamples (e.g. half a million resamples). The goal is to have only a very small chance of a test wrongly being declared significant or not, due to random sampling.

A few final thoughts about permutation testing:

  • Permutation testing works with unbalanced sample sizes (e.g. in our case, 1664 and 23) and for very skewed data because the observed test statistic and permutation resamples are affected by the imbalance and skewness in the same way.
  • There are no distributional assumptions (e.g. normality) for the two populations of data.
  • You should explore the use of different, possibly more robust statistics such as the median or trimmed mean, which are not sensitive to outliers (like the mean is)
  • You can also use permutation testing to investigate other questions, such as comparing the difference in variances

The bootstrap

In the leadup to elections, pollsters frequently conduct polls to see how voters feel about particular issues.

For example, what proportion \(p\) of registered voters plan to support the Republicans in races for Congress? This is also known as the Generic Congressional Vote. Here, we will use data from the 2018 Generic Congressional Vote. If we pick a particular pollster, e.g. NPR/PBS NewsHour/Marist, we see they released the results of a poll on Oct 1, 2018 for the 2018 Generic Congressional Vote reporting \(\hat{p} = 0.42\).

This was one random sample of \(n=996\) registered voters. However a different random sample might have yielded \(\hat{p}=0.47\), or \(\hat{p}=0.39\). It would be nice to get a sense of the accuracy of the original estimate. To do that, we need to understand how the proportions \(\hat{p}\) vary sample to sample.

In the last section, we learned about the idea of comparing an observed test statistic to a null distribution. Here we are interested in doing something similar: we want to know how a statistic varies due to random sampling. But first, we need to define a sampling distribution.

Sampling distribution

In the above scenario, we know there are other pollsters (e.g. CNN) who are also conducting surveys in the same time frame. In that poll, \(\hat{p}=0.41\) using a random sample of size \(n=739\) from the same population of registered voters. If we asked 85 different pollsters to all conduct surveys asking about the 2018 generic congressional ballot, we would see a distribution of \(\hat{p}\) values from the various pollsters.

Case study 1

For example, let's assume the true proportion of people who vote for the Republican candidate is \(p=0.47\). The following code simulates a distribution of \(\hat{p}\) after 85 pollsters all conduct surveys of sample size \(n=1000\).
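
The simulation code is not included in the notes; a sketch under the stated assumptions (the seed is arbitrary):

set.seed(2018)  # hypothetical seed

p <- 0.47        # assumed true proportion of Republican voters
n <- 1000        # sample size of each poll
n_polls <- 85    # number of pollsters

# each poll records the proportion of n sampled voters supporting the Republicans
phat <- rbinom(n_polls, size = n, prob = p) / n

# histogram of the simulated sampling distribution of p-hat
ggplot(data.frame(phat = phat), aes(x = phat)) +
  geom_histogram()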

The distribution above is (an approximation to) the sampling distribution of \(\hat{p}\). The simulated values are centered around 0.469, with a standard deviation of 0.014. We call the standard deviation of a statistic the standard error.

or more exactly:
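
Assuming the phat vector from the sketch above:

mean(phat)
sd(phat)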

## [1] 0.4686235
## [1] 0.01391554

The permutation distribution from the previous section is a sampling distribution, as is the example above. The most important thing to understand about a sampling distribution is that it is the distribution of a statistic that summarizes a dataset, and it describes how that statistic varies across many random datasets.

A histogram of one set of observations drawn from a population does not represent a sampling distribution.

A histogram of permutation means, each from one sample, does represent a sampling distribution.

We also know from the Central Limit Theorem that if \(X \sim Binom(n,p)\) and \(\hat{p} = X/n\), the proportion of successes, then for sufficiently large \(n\), the sampling distribution of \(\hat{p}\) is approximately normal with mean \(p\) and standard deviation \(\sqrt{p(1-p)/n}\). Similarly the sampling distribution of \(X\) is approximately normal with mean \(np\) and standard deviation \(\sqrt{np(1-p)}\).

Therefore, in our case study, based on the CLT, the sampling distribution is approximately normal with mean \(p=0.47\) and (theoretical) standard error \(\sqrt{0.47(1-0.47)/1000} \approx 0.0158\), which is in the same ballpark as the standard error of 0.0139 observed across the 85 simulated polls.

Case study 2

Let’s look at the sampling distribution for a different statistic for a bit of variety.

Here we draw random samples of size 12 from the uniform distribution on the interval [0,1] and take the maximum of each sample. We simulate the sampling distribution of the maximum by taking 1000 samples of size 12 from the uniform distribution.
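
A sketch of this simulation (the seed is arbitrary):

set.seed(2019)  # hypothetical seed

n_samples <- 1000   # number of simulated samples
n <- 12             # size of each sample

# maximum of each sample of 12 draws from Uniform(0, 1)
max_unif <- replicate(n_samples, max(runif(n)))

ggplot(data.frame(max_unif = max_unif), aes(x = max_unif)) +
  geom_histogram()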

We see that the maximum is usually larger than 0.8 and rarely less than 0.6. If we want to calculate the mean and standard error, we can again use the mean() and sd() functions.
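
Assuming the max_unif vector from the sketch above:

mean(max_unif)
sd(max_unif)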

## [1] 0.9307502
## [1] 0.06256068

Case study 3

Australia held a federal election on 18 May 2019. In the Australian political system, there are two main political groupings – the Australian Labor Party (center-left) and the Liberal-National coalition (center-right). Most elections are won by one of these two blocs.

In the 2019 election, most opinion polls expected a Labor victory, however in the end there was an upset victory by the Liberal-National coalition.

What can we learn from the opinion polls?

The Wikipedia page for the opinion polls shows that the final major opinion poll was by Newspoll on 15-16 May 2019. This opinion poll reported a voting intention proportion of 48.5% for the Liberal-National coalition and 51.5% for the Australian Labor Party (in two-party preferred terms under Australia’s preferential or ranked choice voting system).

Let’s simulate some data as previously, assuming the true proportion is \(p=0.485\) (more on this later).
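
A sketch of the simulation, assuming a hypothetical poll sample size of \(n=1000\) (the actual Newspoll sample size is not given above) and 1000 simulated polls:

set.seed(2019)  # hypothetical seed

p <- 0.485        # assumed true two-party-preferred proportion for the coalition
n <- 1000         # hypothetical poll sample size
n_polls <- 1000   # number of simulated polls

phat_aus <- rbinom(n_polls, size = n, prob = p) / n

ggplot(data.frame(phat_aus = phat_aus), aes(x = phat_aus)) +
  geom_histogram()

# proportion of simulated polls in which the coalition is above 50%
mean(phat_aus > 0.5)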

What does this tell us?

This is the (approximate) sampling distribution of our statistic \(\hat{p}\), assuming a true \(p\) of 0.485. We can then investigate the spread of the sampling distribution, and the proportion of simulated values above 0.5, given our assumptions (values of \(p\), sample sizes, etc).

But what if we don’t know the true value \(p\)? This is the more common situation in practice.

Finally, for an election, we are more interested in the final voting result on the day of the election, instead of the voting intention on the day of the opinion poll. But the voting intention on the day of the opinion poll is all that we have access to. If people change their minds over the following days, this is not captured by our model or our sample of data. When we interpret our results, we need to keep in mind the population that our sample has been drawn from.

In statistical inference, it is crucial to think clearly about the population that we are sampling from.

Back to the bootstrap

In the previous sections, we described what a sampling distribution is and some ways to compute or estimate one when the population is known.

Now we will move to the setting where the population is unknown. In this scenario, all we have are the data and a statistic estimated from the data. We need to estimate the sampling distribution of the statistic to understand how much variability or uncertainty there is.

For the Verizon repair example, we know the observed difference in sample mean repair times is -8.098, based on a sample of size 1687.

If you recall, we are interested in the difference in mean repair times (\(\mu_1 - \mu_2\)), where \(\mu_1\) denotes the mean repair time for ILEC customers and \(\mu_2\) the mean repair time for CLEC customers. The difference \(\mu_1 - \mu_2\) is the true difference in mean repair times over all Verizon customers. This is probably not the same as the observed difference in sample means.

Our goal is to test:

\[ H_0: \mu_1 = \mu_2\] versus \[ H_1: \mu_1 < \mu_2\]

We already mentioned that different samples of the same size will lead to different sample means. The question is: how can we gauge the accuracy of -8.098 as an estimate of \(\mu_1 - \mu_2\)?

If we knew the sampling distribution of the difference in means for samples of this size from the population of all Verizon customers, we would be able to assess how the estimated difference in means varies from sample to sample.

Of course we do not have all the repair times, so we cannot generate the sampling distribution directly.

Instead, we will use what’s called the bootstrap to create a new distribution called the bootstrap distribution, which approximates the sampling distribution for test statistics.

How does the bootstrap work?

To find the bootstrap distribution of the difference in repair times, we draw samples (called resamples or bootstrap samples) of size \(n\), with replacement, from the original sample and then compute the mean of each resample.

A few things to note about bootstrap distributions:

  1. We treat the original sample as the population. If the original sample is representative of the population, then the bootstrap distribution of the test statistic will look approximately like the sampling distribution of the test statistic (same spread and shape).

  2. The bootstrap standard error is the standard deviation of the bootstrap distribution of that statistic.

  3. However, the bootstrap distribution is centered at the value of the statistic in the original sample (e.g. the sample mean), not necessarily at the value of the parameter in the original population.

Steps to construct the bootstrap distribution

  1. Start with a sample of size \(n\) from a population
  2. Draw a resample of size \(n\) with replacement from the sample
  3. Compute a statistic that describes the sample, such as the sample mean
  4. Repeat the resampling process many times
  5. Construct the bootstrap distribution of the statistic. Inspect the spread, bias and shape.

Let’s try to construct a bootstrap distribution for our case study of Verizon repair times. Here, we simulate \(B=10,000\) bootstrap samples, calculate the mean for each ilec and clec sample, then calculate the difference in the means and record it in time_diff_boot.

Note: We actually have two populations (ilec and clec), so we will need to resample from the two populations and compute the mean of the repair times separately.
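
The bootstrap loop is not echoed in the notes. A sketch is below; the name time_diff_boot matches the text, while ilec, clec and the seed are assumptions:

set.seed(321)  # hypothetical seed

B <- 10^4  # number of bootstrap resamples

# split the original sample into the two groups
ilec <- verizon$time[verizon$group == "ILEC"]
clec <- verizon$time[verizon$group == "CLEC"]

ilec_boot_mean <- numeric(B)
clec_boot_mean <- numeric(B)
time_diff_boot <- numeric(B)

for (i in 1:B) {
  # resample each group with replacement, keeping the original sample sizes
  ilec_boot_mean[i] <- mean(sample(ilec, length(ilec), replace = TRUE))
  clec_boot_mean[i] <- mean(sample(clec, length(clec), replace = TRUE))
  time_diff_boot[i] <- ilec_boot_mean[i] - clec_boot_mean[i]
}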

Next, we plot the bootstrap distributions.

Looks like the ilec bootstrap distribution is symmetric and has a narrow spread (mostly due to large sample size). The clec bootstrap distribution has a larger spread (due to small sample size) and is very skewed. The difference in repair times bootstrap distribution is also strongly skewed with a wide spread.

Bootstrap percentile confidence intervals

The interval between the 2.5 and 97.5 percentiles of the bootstrap distribution of a statistic is a 95% bootstrap percentile confidence interval for the corresponding parameter.

We can then say that we are 95% confident that the true parameter lies within this interval.

In our example, we can extract the 2.5 and 97.5 percentiles using the quantile() function:
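
Assuming the time_diff_boot vector from the bootstrap sketch above:

# centre of the bootstrap distribution (close to the observed difference)
mean(time_diff_boot)

# 95% bootstrap percentile confidence interval for the difference in means
quantile(time_diff_boot, c(0.025, 0.975))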

## [1] -8.031375
##       2.5%      97.5% 
## -16.925963  -1.652614

Thus, we can say that with 95% confidence, the repair times for ILEC customers are, on average, between 1.65 and 16.93 hours shorter than the repair times for CLEC customers.

Other statistics

We can also use the bootstrap to investigate uncertainty for other statistics, such as the median, trimmed mean, ratios, proportion of extreme values, etc.

This is a major advantage of the bootstrap – we can estimate uncertainty for statistics that would be difficult to calculate using other methods. This allows us to focus on the most meaningful estimators for the situation at hand, instead of choosing estimators based on mathematical convenience (e.g. the mean).

Below, we calculate a bootstrap confidence interval for the difference in trimmed means.
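
A sketch of the observed statistic, assuming the ilec and clec vectors from the bootstrap sketch above; the trimming fraction (25% from each tail) is also an assumption:

# observed difference in trimmed mean repair times (ILEC minus CLEC)
mean(ilec, trim = 0.25) - mean(clec, trim = 0.25)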

## [1] -10.336

Calculate the bootstrap confidence interval.
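
Following the same resampling pattern as before, with the trimmed mean in place of the mean:

time_diff_trim_boot <- numeric(B)
for (i in 1:B) {
  time_diff_trim_boot[i] <-
    mean(sample(ilec, length(ilec), replace = TRUE), trim = 0.25) -
    mean(sample(clec, length(clec), replace = TRUE), trim = 0.25)
}

mean(time_diff_trim_boot)
quantile(time_diff_trim_boot, c(0.025, 0.975))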

## [1] -10.2225
##       2.5%      97.5% 
## -15.393227  -4.728773

Similarly, we can calculate a bootstrap confidence interval for the ratio of means.
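
Again assuming the ilec and clec vectors from above, the observed ratio of means is:

# observed ratio of mean repair times (ILEC / CLEC)
mean(ilec) / mean(clec)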

## [1] 0.5095126

Calculate the bootstrap confidence interval.
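
And the same bootstrap pattern, applied to the ratio of means:

ratio_boot <- numeric(B)
for (i in 1:B) {
  ratio_boot[i] <- mean(sample(ilec, length(ilec), replace = TRUE)) /
    mean(sample(clec, length(clec), replace = TRUE))
}

mean(ratio_boot)
quantile(ratio_boot, c(0.025, 0.975))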

## [1] 0.5418096
##      2.5%     97.5% 
## 0.3283292 0.8364061

Relationship between the bootstrap and Monte Carlo simulations

The bootstrap is an example of a more general idea called Monte Carlo methods or simulations. The idea is that the bootstrap uses the data we have to estimate the uncertainty of a statistic or an estimator (i.e. we don't need to know the true distribution, because we treat the original data as the population and sample with replacement from it).

Monte Carlo simulations, more broadly, are approaches that generate data from an assumed distribution or population in order to estimate the uncertainty of a statistic in a similar way.

For example, our example of simulating poll data from pollsters (Case Study 1 under Sampling distribution) was an example of a Monte Carlo simulation. We assumed the true proportion of Republican voters (\(p\)) was \(p=0.47\) and then wanted to estimate the uncertainty of \(\hat{p}\) after 85 pollsters all conducted surveys of sample sizes \(n=1000\).

Sources of variation in a bootstrap distribution

How accurate is the bootstrap? We conclude with a note on sources of variation in a bootstrap distribution.

Bootstrap distributions and conclusions based on them include two sources of random variation:

  • Variation due to sampling the original sample from the population
  • Variation due to bootstrap resamples chosen from the original sample

If we increase the number of bootstrap samples, we can reduce the second source of variation, but not the first.

To reduce the first source of variation, we would need a larger sample size in the original experiment. Since we can easily reduce the second source of variation, most of the variation in bootstrap distributions usually comes from the first source. Bootstrapping does not overcome the weakness of small sample sizes in the original sample.

In general, around 10,000 bootstrap resamples will give good estimates. More resamples are recommended if accuracy is very important.

Additional references

  • Mathematical Statistics with Resampling and R textbook by Laura Chihara and Tim Hesterberg. This is the main reference we have used for this lecture, in particular Chapters 1 to 5. This textbook provides an excellent overview of mathematical statistics based around the idea of modern resampling techniques, along with numerous examples in R.

  • Statistical Thinking for the 21st Century online textbook by Russell A. Poldrack. See Chapter 8 on “Resampling and simulation”. This textbook is another great introductory reference on modern statistics (available free online).

  • Computer Age Statistical Inference textbook by Bradley Efron and Trevor Hastie. See Chapters 10 and 11. This is a more advanced textbook (PDF available free online).