Homework 2: Health Effects of Major Storms

Preface

This homework is due Friday October 12, 2018 at 11:59 PM.
When you have completed the assignment, knit the R Markdown, commit your changes and push to GitHub.
If you do not include axis labels and plot titles, then points will be deducted.
If you do not include prose/text after the sections titled “Add a summary of your findings here”, then points will be deducted.
As a reminder, you can use up to two late days on this assignment (if you have them available) without any penalty (see the Syllabus on the course website for more details on the Late Day Policy).
You are welcome and encouraged to discuss homework problems with others in order to better understand them, but the work you turn in must be your own. You must write your own code and data analyses, and communicate and explain the results in your own words and with your own visualizations. All students turning in plagiarized solutions will be reported to the Office of Academic Integrity and will fail the assignment.

Motivation

Extreme weather, such as major storms, hurricanes, tornadoes, and floods, can cause major damage and health effects in areas that come in contact with it. However, assessing the public health impacts of major events can be difficult because attributing health problems to storms is largely inferential in nature. Official counts of deaths from events like hurricanes tend to undercount the overall impact of a hurricane on the population because of what constitutes a death “directly caused” by the event. The definition used by the National Weather Service for a direct fatality is “A direct fatality or injury is defined as a fatality or injury directly attributable to the hydro-meteorological event itself, or impact by airborne/falling/moving debris, i.e., missiles generated by wind, water, ice, lightning, tornado, etc.”

In a recently publicized example, the official death count for Hurricane Maria, which hit the island of Puerto Rico, was 64. One systematic investigation of mortality after the hurricane estimated 4,645 (95% CI, 793 to 8,498) deaths from September 20 through December 31, 2017. A simpler analysis using just time series data estimated about 1,750 deaths in the same time period.

What about all of the other storms and floods that hit cities in the U.S. on an annual basis? What do we know about the mortality effects of those events? The goal of this homework is to just scratch the surface of answering that question and to look at some data that might be useful for addressing these kinds of questions.

Overall Objective

The goal of this assignment is to develop an estimate for the health impact of major storm events in the United States. For this problem we will focus on mortality impacts. Using the NOAA Storm Event Database and the NMMAPS mortality data, you must link the two together, fit whatever models are needed to develop your estimate, and then report on your estimate while noting the limitations and uncertainties.

During the course of your analysis there will be many options for you to explore and various approaches that you may take. Part of your job will be to choose amongst these many options and focus on a specific approach that you find most interesting or has the greatest potential for success.

Data

The datasets that we will focus on here are:

NOAA Storm Events Database: The Storm Events Database documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce; rare, unusual weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event. Compressed CSV files of annual data can be downloaded here. You can also read detailed information on the variables.
NMMAPS Mortality Data: These data come from the National Mortality, Morbidity, and Air Pollution Study conducted at Johns Hopkins University. This file contains daily mortality data and temperature data from 108 cities spanning the years 1987–2005. Causes of death include accidents, COPD, cardiovascular disease (cvd), all non-accidental causes (death), and respiratory disease (resp). There is a separate file (nmmaps_cities.csv) containing metadata on the cities included in the NMMAPS data.
You are welcome to incorporate other data into your analysis, if necessary, as long as those other data sources are documented and their origin is clear.

Problem 1: Exploring the Mortality Data

Problem 1.1

The mortality data are available in a single mortality.zip file. Inside the zip file is a single CSV file named mortality_1987-2005.csv.bz2. Download and unzip the file. Read the mortality_1987-2005.csv.bz2 data into R using the readr package. Then, read the nmmaps_cities.csv file into R to get the metadata on the cities in the NMMAPS study.

## Add your code

Note that many of the “cities” in the NMMAPS data are actually combinations of different counties. For example, “New York City” is a combination of 6 separate counties. The nmmaps_cities.csv file provides the mapping of counties to “cities”. In addition, the 5-digit FIPS code identifying each county is provided here.

Problem 1.2

Make a plot of the daily mortality from all non-accidental deaths (death) versus date for the city of New Orleans, LA in the year 2004.

## Add your code

Are there interesting features such as when mortality tends to be high or low?

Add a summary of your findings here

Problem 1.3

Try making this same plot for different years, different cities, and different causes of death.

## Add your code

If you were to focus your analysis on a single city, or a few cities, which ones might be the most practical or interesting ones to choose?

Add a summary of your findings here

Problem 1.4

Take all of the data on non-accidental deaths for New York City, NY and divide them by season of the year. In which season does mortality tend to be the highest?
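One possible way to assign seasons is sketched below in base R. It assumes meteorological seasons (December through February is winter, and so on) and a data frame with a `date` column and a `death` column; the `date` column name, the helper `season_of()`, and the toy numbers are hypothetical stand-ins, not the real NMMAPS data.

```r
# Map each date to a meteorological season (an assumed definition;
# astronomical seasons would need different cut points).
season_of <- function(dates) {
  m <- as.integer(format(dates, "%m"))
  labels <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
              "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  factor(labels[m], levels = c("Winter", "Spring", "Summer", "Fall"))
}

# Made-up daily counts, used only to show the mechanics.
toy <- data.frame(
  date  = as.Date(c("2004-01-15", "2004-04-15", "2004-07-15",
                    "2004-10-15", "2004-12-25")),
  death = c(30, 22, 18, 21, 29)
)

toy$season <- season_of(toy$date)

# Mean daily deaths within each season.
seasonal_mean <- tapply(toy$death, toy$season, mean)
print(seasonal_mean)
```

The same `season_of()` helper could be reused for the storm event dates later on, so that both datasets share one season definition.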

## Add your code

Add a summary of your findings here

Problem 1.5

Is the seasonal pattern of mortality the same in every city in the NMMAPS data? Summarize your results below.

## Add your code

Add a summary of your findings here

Problem 1.6

Take a look at any other temporal patterns in the data. Are there day-of-week effects? Weekly or monthly trends? Yearly trends?

## Add your code

Add a summary of your findings here

Problem 2: Exploring the Storm Event Data

The storm event database goes back to 1950, with one file per year. You will likely not need all of it.

Problem 2.1

Just to start, download the data for 2004 (these files are labeled StormEvents_details-*), read the data into R and take a look at it. You will see that the storm event data have a column for the year and date/time of each event. There are also separate columns for when the event began (i.e. BEGIN_DATE_TIME) and for when the event ended (END_DATE_TIME). Convert these columns into R date/time objects and add a new column that contains the length of each event in the dataset.
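As a rough sketch of the conversion step, the base R code below parses two timestamp columns and computes the event length in hours. The toy rows and the "DD-MON-YY HH:MM:SS" timestamp format are assumptions made for illustration; check the format in the file you actually download. The new column names (`begin`, `end`, `duration_hours`) are also hypothetical.

```r
# Toy stand-in for a few rows of a StormEvents_details file.
# The timestamp format shown here is an assumption; verify it against your file.
storms <- data.frame(
  EVENT_TYPE      = c("Flash Flood", "Hail"),
  BEGIN_DATE_TIME = c("10-OCT-04 16:00:00", "03-MAR-04 08:30:00"),
  END_DATE_TIME   = c("11-OCT-04 04:00:00", "03-MAR-04 08:45:00"),
  stringsAsFactors = FALSE
)

# %b matches abbreviated month names; this relies on an English locale.
fmt <- "%d-%b-%y %H:%M:%S"

# A single fixed time zone is used only so that differences are well defined;
# the real data record local times, which matters when comparing across zones.
storms$begin <- as.POSIXct(storms$BEGIN_DATE_TIME, format = fmt, tz = "UTC")
storms$end   <- as.POSIXct(storms$END_DATE_TIME,   format = fmt, tz = "UTC")

# Length of each event, in hours.
storms$duration_hours <- as.numeric(
  difftime(storms$end, storms$begin, units = "hours")
)
print(storms[, c("EVENT_TYPE", "begin", "end", "duration_hours")])
```

Note that a two-digit year such as `%y` is ambiguous for older files (R maps 00 to 68 onto 2000 to 2068), so files from the 1950s and 1960s would need extra care.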

## Add your code

Problem 2.2

Given your new begin/end date columns, we can look at temporal patterns of specific storm events. How many flash floods occur in each of the four seasons of 2004 for the state of Texas?

## Add your code

Add a summary of your findings here

Problem 2.3

What is the seasonal pattern for other major storm events in the database? Adapt your code from above to explore these patterns for other event types and other locations.

## Add your code

Add a summary of your findings here

Problem 3: Linking the Datasets

Having thoroughly explored the mortality and storm events databases, you will eventually need to link the two together in order to determine what, if any, connection there is between major storms and mortality.

Problem 3.1

The mortality data are presented as a time series with the number of deaths for each day. However, the storm events data are presented as events, with one record for each event. One transformation that is likely to be useful is to convert the storm events data into a time series. A dataset of this nature will have a column for the date and another column indicating whether there was an event (e.g., flood, hurricane) occurring on that date.

Create a time series for flash floods for the state of Louisiana in the year 2004. Then, make a time series plot of the flash floods.
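The event-to-time-series transformation can be sketched as follows in base R. The two toy events and the column names (`begin`, `end`, `flash_flood`) are hypothetical; the idea is simply to expand each event into the days it covers and then flag those days on a full daily calendar.

```r
# Two made-up events: one spanning three days, one lasting a single day.
events <- data.frame(
  begin = as.Date(c("2004-05-10", "2004-09-15")),
  end   = as.Date(c("2004-05-12", "2004-09-15"))
)

# One row per calendar day in 2004.
days <- data.frame(
  date = seq(as.Date("2004-01-01"), as.Date("2004-12-31"), by = "day")
)

# Expand every event into the individual days it covers.
event_days <- unlist(lapply(seq_len(nrow(events)), function(i) {
  format(seq(events$begin[i], events$end[i], by = "day"))
}))

# 1 if any event was in progress on that day, 0 otherwise.
days$flash_flood <- as.integer(format(days$date) %in% event_days)

print(days[days$flash_flood == 1, ])
```

A count of concurrent events per day (rather than a 0/1 indicator) could be obtained by tabulating `event_days` instead of using `%in%`; which representation is more useful depends on the question you settle on.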

## Add your code

Problem 3.2

Link the time series flood data with mortality data for New Orleans, LA from the NMMAPS dataset. Then, make a scatterplot of storm event and deaths in New Orleans for 2004.
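One simple way to link the two daily series is a join on the date column, sketched here with base R's `merge()`. Both data frames are made-up toy inputs with hypothetical column names, and the numbers carry no meaning beyond showing the mechanics.

```r
# Made-up daily mortality for five consecutive days.
mortality <- data.frame(
  date  = as.Date("2004-05-09") + 0:4,
  death = c(20, 24, 27, 22, 19)
)

# Made-up flood indicator that only lists the days with an event.
floods <- data.frame(
  date        = as.Date("2004-05-10") + 0:2,
  flash_flood = c(1, 1, 1)
)

# Keep every mortality day; days without a flood record come back as NA.
linked <- merge(mortality, floods, by = "date", all.x = TRUE)
linked$flash_flood[is.na(linked$flash_flood)] <- 0

# Mean daily deaths on flood days versus non-flood days.
mean_by_flood <- tapply(linked$death, linked$flash_flood, mean)
print(linked)
print(mean_by_flood)
```

Joining on date alone assumes both series refer to the same place. With state-level storm data and city-level mortality data, that assumption deserves a comment in your write-up.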

## Add your code

Problem 3.3

Make the same scatterplot as above, but now stratified by season so that there are four plots.

## Add your code

At this point you may want to modify and adapt the code you’ve written above to explore the following additional questions:

What does the relationship between death and hurricanes or other major event times look like?
How does the scatterplot look in different years or states?
Does it make sense to look at a smaller geographical unit than the state? What about county/city?
Which major storm events are most frequent in which states? It might make sense to focus on events that are somewhat regularly occurring rather than very rare events.

Problem 4: Narrowing Down the Question

Once you’ve linked the data together and had the chance to work through some plots and visuals of the data, you will need to narrow things down in order to focus on a question for the purpose of the homework.

Consider the following questions as you narrow things down:

What is the best way to link the two datasets together?
What are the temporal and spatial units for linking the datasets together?
Which method of linkage requires the fewest assumptions about the data? Which method requires the most assumptions?
What should the temporal and spatial scope of the final analysis be? Are we looking at all cities and all years? Or will we focus on one city in a single year? Or something in between?
What will the final product look like? A series of plots? Tables? (Note: It may not be entirely possible to answer this question yet.)

At this point you should have a good idea of what specific aspect of the data you would like to use to develop an estimate of the mortality effect of major storms.

Briefly state what your approach will be here

Note: You do not have to cover anything and everything. Rather, you should produce a reasonable answer given the data and time available.

Problem 5: Modeling

For this part, you will need to develop a model for relating your chosen storm factor (or combination thereof) with a mortality outcome.

Decide on your principal outcome and key predictor. While you may be interested in multiple factors, it’s good to focus on one first.
Before setting off, you may want to draw a quick sketch of what you want your final “result” to look like. Is it a plot? Is it a table? Having some idea beforehand can limit the work (but obviously, you can change if you want).
Start with a simple, naive model to get a sense of the relationship.
Consider potential confounding factors and systematically incorporate them into your modeling.
Conduct sensitivity analysis of any results based on model form and structure.
If you want, consider other outcomes or storm factors for comparison.
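To illustrate the "naive model first, then add confounders" progression, here is a minimal sketch using Poisson regression via `glm()`. Poisson regression is only one reasonable choice for daily count outcomes, not a required method for this homework. The data are simulated, and the column names (`flash_flood`, `tmean`, `death`) and the simulated effect size are assumptions for illustration only.

```r
set.seed(1)
n <- 200

# Simulated daily data: a flood indicator and a smooth temperature curve.
toy <- data.frame(
  flash_flood = rep(c(0, 0, 0, 0, 1), length.out = n),
  tmean       = 20 + 5 * sin(2 * pi * seq_len(n) / n)
)

# Simulated death counts with an arbitrary built-in flood effect.
toy$death <- rpois(n, lambda = exp(3 + 0.1 * toy$flash_flood))

# Step 1: naive model with the key predictor only.
naive <- glm(death ~ flash_flood, family = poisson(), data = toy)

# Step 2: add a potential confounder (temperature here).
adjusted <- glm(death ~ flash_flood + tmean, family = poisson(), data = toy)

# Exponentiated coefficients are rate ratios for flood vs. non-flood days.
rate_ratio_naive    <- exp(coef(naive)[["flash_flood"]])
rate_ratio_adjusted <- exp(coef(adjusted)[["flash_flood"]])
print(c(naive = rate_ratio_naive, adjusted = rate_ratio_adjusted))
```

Comparing the flood coefficient across the two fits is a simple form of sensitivity analysis: if it shifts a lot when a confounder is added, the naive estimate should not be trusted on its own.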

## Add your code

Problem 6: Narrative

Given the model results you produced in Problem 5, it’s time to narrow them down to a presentable format.

To get full credit for the problem, you will need the following:

Write a one paragraph abstract summarizing your findings from Problems 4-5 above. The paragraph should draw a conclusion from the evidence or explain why no conclusion can be drawn.
Produce one summary figure/table/graphic/etc. that provides evidence supporting the statements made in the abstract above.
Produce a figure/table/graphic/etc. that indicates the robustness (or lack thereof) of the findings to assumptions about the data and model form.
Write one paragraph outlining the limitations of your analysis and what could be done in the future to address those limitations.