The goal of this homework assignment is to use US opinion polling data (and any other sources of data) for the current 2018 Senate Midterm Elections to predict the results in each state.
We will have three competitions with the terms for scoring entries described below (see Problem 3). The winner of each competition will win an Amazon gift card!
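The three competitions are the following (details are given under Problem 3):
Predict the number of Republican senators.
Predict the R-D difference in each state.
Report a confidence interval for the R-D difference in each state.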
We will post a link on Slack to a Google Form where you can submit your predictions.
Some data you will find useful are:
Create a master data frame called candidates containing information about each race and show the head of the data frame. Specifically, each row should represent one race and the data frame should have the following columns:
state = the state abbreviation where the race is being held
class = the class of the Senate race (1, 2, 3)
special = TRUE/FALSE status representing whether this is a special election or not
R = name of the Republican candidate
D = name of the non-Republican candidate (Democrat or independent)
race_id = the lowercase state abbreviation, an underscore, and the Senate class number (e.g. az_1)
safe = a TRUE/FALSE logical value indicating whether RealClear Politics has marked the incumbent in this race as safe
race_url = a URL to where you got your poll data from (e.g. Texas), or NA if no poll data exists
Hints:
[...] safe, but it also does not have any poll data (as of 2018-10-17). So you can consider this race a safe race for purposes of this homework.
[...] R and Dianne Feinstein as D to calculate the difference R-D.
[...] I to D for purposes of this homework assignment.
## add your code here
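As a starting point, here is a minimal sketch of the expected layout using only two races. The candidate names are the major-party nominees in Arizona and Texas; the safe and race_url entries are left as NA placeholders and are assumptions to be replaced with what you find on RealClear Politics.

```r
# Two illustrative rows only; the remaining races follow the same pattern.
# `safe` and `race_url` are placeholders, not values taken from RealClear Politics.
candidates <- data.frame(
  state    = c("AZ", "TX"),
  class    = c(1L, 1L),
  special  = c(FALSE, FALSE),
  R        = c("Martha McSally", "Ted Cruz"),
  D        = c("Kyrsten Sinema", "Beto O'Rourke"),
  safe     = c(NA, NA),                        # fill in from RealClear Politics
  race_url = c(NA_character_, NA_character_),  # poll page URL, or NA if none
  stringsAsFactors = FALSE
)

# race_id = lowercase state abbreviation, underscore, Senate class (e.g. az_1)
candidates$race_id <- paste0(tolower(candidates$state), "_", candidates$class)

# Put the columns in the order listed in the problem statement
candidates <- candidates[, c("state", "class", "special", "R", "D",
                             "race_id", "safe", "race_url")]

head(candidates)
```

Building race_id from state and class means a state with both a regular and a special election still gets two distinct identifiers.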
Create a list object of length 35 and name the object polls. Within the polls object, name each item in your list the same as your race_id in the candidates data frame.
polls <- vector(mode="list", length=35)
names(polls) <- candidates$race_id
Then, scrape in opinion poll data (if available) from the RealClear Politics website for each of the senate midterm races and store the poll data for that race in the corresponding slot in the list.
Show the head of the data frame containing poll data from the Arizona class 1 race.
Hint:
## add your code here
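One possible way to organize the scraping loop is sketched below. The helper names fetch_tables and fill_polls are hypothetical, the sketch assumes the rvest package is installed, and it makes no assumption about which table on a page holds the polls, so you will need to inspect the result and pick the right one yourself.

```r
# Hypothetical helper: returns every HTML table on a page as a list of data frames.
# Assumes the rvest package is installed; check by hand which table holds the polls.
fetch_tables <- function(url) {
  rvest::html_table(rvest::read_html(url))
}

# Fill a named list with one slot per race; races without a URL stay NULL.
fill_polls <- function(candidates, fetch = fetch_tables) {
  polls <- vector(mode = "list", length = nrow(candidates))
  names(polls) <- candidates$race_id
  for (i in seq_len(nrow(candidates))) {
    url <- candidates$race_url[i]
    if (is.na(url)) next                         # no poll data for this race
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) polls[[i]] <- result   # skip failed downloads
  }
  polls
}
```

Passing the fetch function as an argument lets you test the loop with a stub before hitting the website, and the tryCatch keeps one failed page from stopping the whole run.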
Compute a 99% confidence interval for each state
Assume you have \(M\) polls with sample sizes \(n_1, \ldots, n_M\). If the polls are independent, what is the average of the poll variances if the true proportion is \(p\)?
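A sketch of the calculation, under the added assumption that each poll is a simple random sample, so that the estimate \(\hat{p}_i\) from poll \(i\) has binomial sampling variance (the symbol \(\hat{p}_i\) is introduced here for illustration):

\[
\mathrm{Var}(\hat{p}_i) = \frac{p(1-p)}{n_i},
\qquad
\frac{1}{M}\sum_{i=1}^{M}\mathrm{Var}(\hat{p}_i) = \frac{p(1-p)}{M}\sum_{i=1}^{M}\frac{1}{n_i}.
\]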
First, compute the following for the Republican candidates in each race:
Second, create a scatter plot of the observed versus theoretical values (average of theoretical standard deviations), with the size of each point proportional to the number of polls. How do these compare?
## add your code here
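The comparison for a single race might be sketched as follows. The function name sd_compare and the example numbers are made up for illustration, and the sketch assumes the Republican share is expressed on a 0 to 1 scale and that the average share is a reasonable plug-in value for \(p\).

```r
# `prop` = Republican share in each poll (0-1 scale), `n` = poll sample sizes.
sd_compare <- function(prop, n) {
  p_hat <- mean(prop)                                  # plug-in estimate of p
  data.frame(
    n_polls     = length(prop),
    observed    = sd(prop),                            # spread seen across polls
    theoretical = mean(sqrt(p_hat * (1 - p_hat) / n))  # average of binomial SDs
  )
}

# Hypothetical polls for one race
sd_compare(prop = c(0.46, 0.48, 0.44, 0.47), n = c(600, 800, 500, 1000))
```

Applying this to every race and binding the rows gives one point per race for the scatter plot, with n_polls available to control the point size.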
Repeat Problem 2.2, but include only the most recent polls, those taken since September 1, 2018. Do they match better, worse, or about the same? Can we trust the theoretical values? Why might they be different?
## add your code here
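Restricting to recent polls can be done with a simple date filter. The sketch below assumes each poll data frame has a Date column called end_date (a hypothetical name) and treats the cutoff as inclusive.

```r
# Assumes a Date column named `end_date` (hypothetical) giving when each poll ended.
recent_polls <- function(df, cutoff = as.Date("2018-09-01")) {
  df[!is.na(df$end_date) & df$end_date >= cutoff, , drop = FALSE]
}

# Hypothetical example with three polls
example <- data.frame(
  end_date = as.Date(c("2018-08-20", "2018-09-01", "2018-10-10")),
  R = c(45, 47, 48)
)
recent_polls(example)   # keeps the polls ending on or after the cutoff
```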
Create a scatter plot with each point representing one state. Are there one or more races that are outliers, in that they have much larger variability than expected? Explore the original poll data and explain the discrepancy.
## add your code here
Construct 99% confidence intervals for the difference in each race. Use either theoretical or data-driven estimates of the standard error, and use the results in Problem 2.4 to justify your choice.
Plot the differences with their 99% confidence intervals, placing the races along the x-axis (one for each race) and the difference along the y-axis. Order the x-axis from the most negative difference to the most positive difference.
How does your answer here compare to the other poll aggregators?
## add your code here
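If you go with the data-driven standard error, the interval for one race could be computed along these lines. The function name diff_ci and the example values are hypothetical, and the sketch uses a normal quantile, which is only one of several reasonable choices.

```r
# `d` = R minus D in each poll for one race (percentage points).
diff_ci <- function(d, level = 0.99) {
  m  <- length(d)
  se <- sd(d) / sqrt(m)                  # data-driven standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)       # about 2.576 for a 99% interval
  c(estimate = mean(d),
    lower    = mean(d) - z * se,
    upper    = mean(d) + z * se)
}

# Hypothetical differences from five polls
diff_ci(c(-3, -1, -4, 0, -2))
```

With only a handful of polls, replacing qnorm with qt(1 - (1 - level) / 2, df = m - 1) gives a wider and more cautious interval. Sorting the races with order() on the estimate gives the x-axis ordering requested for the plot.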
Predict the results for the 2018 Senate Midterm Elections. We will have three competitions with the terms for scoring entries described below. For the questions below, explain or provide commentary on how you arrived at your predictions including code.
Some possible suggestions on analyses to explore:
Good luck!!
To enter the competition, submit your predictions through the Google Form that we will link on Slack.
Predict the number of Republican senators. You may provide an interval. The smallest interval that includes the election day result wins.
Note: We want the total, so add the number of Republican senators who are not up for election.
## add your code here
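One way to turn per-race estimates into an interval for the seat total is a small simulation, sketched below. It assumes the race outcomes are independent and normally distributed around your estimated R-D differences, which is a strong simplification since polling errors tend to be correlated across states. The function name and all inputs are hypothetical, and r_not_up (Republican senators not up for election) is something you need to look up yourself.

```r
# Hypothetical inputs: per-race mean R-D difference and its standard error,
# plus the number of Republican senators not up for election.
simulate_r_seats <- function(est, se, r_not_up, n_sim = 10000, level = 0.99) {
  # One row per simulation, one column per race (rnorm recycles est and se)
  draws <- matrix(rnorm(n_sim * length(est), mean = est, sd = se),
                  nrow = n_sim, byrow = TRUE)
  seats <- r_not_up + rowSums(draws > 0)   # R wins a race when R-D > 0
  alpha <- 1 - level
  quantile(seats, probs = c(alpha / 2, 0.5, 1 - alpha / 2))
}

set.seed(1)
# Three made-up races and a made-up holdover count, for illustration only
simulate_r_seats(est = c(5, -2, 0.5), se = c(2, 3, 4), r_not_up = 40)
```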
Predict the R-D difference in each state. The predictions that minimize the residual sum of squares between predicted and observed differences win.
## add your code here
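For reference, the scoring criterion can be written as follows, where \(\hat{d}_s\) is your predicted R-D difference for race \(s\), \(d_s\) is the election day result, and \(S\) is the number of races (notation introduced here for illustration):

\[
\mathrm{RSS} = \sum_{s=1}^{S} \left(\hat{d}_s - d_s\right)^2 .
\]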
Report a confidence interval for the R-D difference in each state. If the election day result falls outside your confidence interval in more than two states, you are eliminated. For those surviving this cutoff, we will add up the lengths of all confidence intervals. The smallest total length wins.
Note: You can use Bayesian credible intervals or whatever else you want.
## add your code here
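To see how the width of your intervals trades off against the risk of elimination, you could score a candidate set of intervals against hypothetical outcomes. The sketch below follows the rules as stated above; the function name and the example numbers are made up.

```r
# Score a set of intervals against observed R-D differences.
score_intervals <- function(lower, upper, observed, max_misses = 2) {
  misses <- sum(observed < lower | observed > upper)  # results outside the interval
  list(
    misses       = misses,
    eliminated   = misses > max_misses,   # more than two misses means elimination
    total_length = sum(upper - lower)     # smaller is better among survivors
  )
}

# Three made-up races: the second result falls outside its interval
score_intervals(lower = c(-5, 1, -10), upper = c(5, 3, -2), observed = c(0, 4, -3))
```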