R has developed a special representation of dates and times
Dates are represented by the Date class
Times are represented by the POSIXct or the POSIXlt class
Dates are stored internally as the number of days since 1970-01-01
Times are stored internally as the number of seconds since 1970-01-01
lubridate package
The lubridate package is a very useful package for dealing with all the little annoying aspects of dates/times
Largely replaces the default date/time functions in base R
Methods for date/time arithmetic
Handles time zones, leap years, leap seconds, etc.
install.packages("lubridate")
## Installed with tidyverse; attached by library(tidyverse) only from tidyverse 2.0.0 on
Dates are represented by the Date class and can be coerced from a character string using the ymd() function.
library(lubridate)
x <- ymd("1970-01-01")
x
[1] "1970-01-01"
class(x)
[1] "Date"
unclass(x)
[1] 0
x <- ymd("2019-10-03")
unclass(x)
[1] 18172
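Because a Date is just a day count, the conversion also works in reverse. The following is a minimal sketch using base R's as.Date() with an explicit origin; the variable names are illustrative only.

```r
# Turn a day count back into a Date (base R; origin given explicitly)
days_since_epoch <- 18172
d <- as.Date(days_since_epoch, origin = "1970-01-01")
d

# The stored number is recovered with as.numeric()
as.numeric(d)

# Negative counts are dates before 1970-01-01
before_epoch <- as.Date(-1, origin = "1970-01-01")
before_epoch
```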
Date objects have their own special print methods that will always format as “YYYY-MM-DD”.
Different locales have different ways of formatting dates
ymd("2016-09-13") ## International standard
[1] "2016-09-13"
ymd("2016/09/13") ## Just figure it out
[1] "2016-09-13"
mdy("09-13-2016") ## Mostly U.S.
[1] "2016-09-13"
dmy("13-09-2016") ## Europe
[1] "2016-09-13"
All of the above are valid and lead to the exact same object.
Even if the individual dates are formatted differently, ymd() can usually figure it out.
x <- c("2016-04-05",
       "2016/05/06",
       "2016,10,4")
ymd(x)
[1] "2016-04-05" "2016-05-06" "2016-10-04"
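"Usually" has limits: a string that matches no year-month-day layout comes back as NA. The sketch below assumes the quiet argument of ymd() to silence the parsing warning; the input vector is made up for illustration.

```r
library(lubridate)

# The second element is not a date; ymd() returns NA for it.
# quiet = TRUE suppresses the parsing warning.
raw <- c("2016-04-05", "not a date")
parsed <- ymd(raw, quiet = TRUE)
parsed

# Check which inputs failed before going further
failed <- raw[is.na(parsed)]
failed
```

Checking for NA right after parsing is a cheap way to catch malformed rows before they reach a plot or a model.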
Times are represented using the POSIXct or the POSIXlt class
POSIXct is just a large number under the hood (a count of seconds, stored as a double); it is a useful class when you want to store times in something like a data frame
POSIXlt is a list underneath and it stores a bunch of other useful information like the day of the week, day of the year, month, day of the month
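The difference between the two classes is easiest to see by converting one value into the other. This is a small sketch in base R; the object names ct and lt are illustrative.

```r
# A POSIXct value: one number plus a time zone attribute
ct <- as.POSIXct("2019-10-03 13:30:00", tz = "UTC")
as.numeric(ct)

# The same instant as POSIXlt: a list of broken-down fields
lt <- as.POSIXlt(ct)
lt$mday   # day of the month
lt$mon    # month, counted from 0 (January = 0)
lt$yday   # day of the year, counted from 0
lt$wday   # day of the week, counted from 0 (Sunday = 0)
lt$year   # years since 1900
```

Note that several POSIXlt fields are zero-based, which is an easy source of off-by-one mistakes.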
Times are represented as the number of seconds since 1970-01-01 00:00:00 UTC.
x <- ymd_hms("2019-10-03 13:30:00")
class(x)
[1] "POSIXct" "POSIXt"
unclass(x)
[1] 1570109400
attr(,"tzone")
[1] "UTC"
If you want to know more about the international date/time standard, you can read about ISO Standard 8601.
Times can be coerced from a character string with ymd_hms()
ymd_hms("2016-09-13 14:00:00")
[1] "2016-09-13 14:00:00 UTC"
ymd_hms("2016-09-13 14:00:00", tz = "America/New_York")
[1] "2016-09-13 14:00:00 EDT"
ymd_hms("2016-09-13 14:00:00", tz = "")
[1] "2016-09-13 14:00:00 EDT"
Time zones were created to make your data analyses more difficult.
The ymd_hms() function will by default use UTC as the time zone
Specifying tz = "" will use the local time zone
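Better to specify the time zone when possible to avoid ambiguity
You can go to Wikipedia to find the list of time zones
Daylight saving time changes the clock offset for part of the year
Some states are in two time zones
In the Southern Hemisphere, daylight saving time runs during the opposite part of the year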
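Two lubridate helpers are useful once a time zone is attached: with_tz() and force_tz(). The sketch below shows how they differ, using the same example time as above; OlsonNames() is the base R function that lists the zone names known to the system.

```r
library(lubridate)

x <- ymd_hms("2016-09-13 14:00:00", tz = "America/New_York")

# with_tz(): same instant, displayed on a different clock
same_instant <- with_tz(x, "UTC")
same_instant

# force_tz(): same clock reading, reinterpreted as a different instant
same_clock <- force_tz(x, "UTC")
same_clock

# Time zone names recognised on this system
head(OlsonNames())
```

Use with_tz() to display a time for another location, and force_tz() only to repair a time that was read in with the wrong zone.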
Finally, there is the strptime() function in case your dates are written in a different format
datestring <- c("January 10, 2012 10:40",
                "December 9, 2011 9:10")
x <- strptime(datestring, "%B %d, %Y %H:%M",
              tz = "America/Los_Angeles")
x
[1] "2012-01-10 10:40:00 PST" "2011-12-09 09:10:00 PST"
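One detail worth knowing is that strptime() returns the list-based POSIXlt class. The sketch below converts the result to POSIXct and shows what happens when the string does not match the format. It uses a numeric day/month/year layout (an assumption made here so the example does not depend on locale-specific month names).

```r
# strptime() returns POSIXlt, the list-based class
x <- strptime("10/01/2012 10:40", "%d/%m/%Y %H:%M",
              tz = "America/Los_Angeles")
class(x)

# Convert to POSIXct before storing it in a data frame column
x_ct <- as.POSIXct(x)
class(x_ct)

# A string that does not match the format gives NA rather than an error
bad <- strptime("2012-01-10", "%d/%m/%Y", tz = "America/Los_Angeles")
bad
```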
Check ?strptime for details of formatting strings
When reading in data with read_csv(), you may need to read in as character first and then convert to date/time
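The read-as-character-then-convert workflow might look like the sketch below. The file, its column names (when, value), and its contents are hypothetical and written to a temporary location so the example is self-contained.

```r
library(readr)
library(dplyr)
library(lubridate)

# Hypothetical two-row file written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("when,value",
             "01/10/2012 10:40,3",
             "12/09/2011 09:10,5"), tmp)

# Step 1: read the date/time column as plain character
dat <- read_csv(tmp,
                col_types = cols(when = col_character(),
                                 value = col_double()))

# Step 2: convert with the parser that matches the layout (month/day/year)
dat <- mutate(dat, when = mdy_hm(when, tz = "America/Los_Angeles"))
dat
```

Stating the column type explicitly keeps read_csv() from guessing, so an ambiguous layout such as 01/10/2012 is interpreted by the parser you chose.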
You can add and subtract dates and times. You can do comparisons too (e.g., == and <=)
x <- ymd("2012-01-01", tz = "") ## Midnight
y <- dmy_hms("9 Jan 2011 11:34:21", tz = "")
x - y
Time difference of 356.5178 days
x + y ## Nope!
Error in `+.POSIXt`(x, y): binary '+' is not defined for "POSIXt" objects
Add a second to a time
y + 1
[1] "2011-01-09 11:34:22 EST"
Just keep the date portion
y <- date(y)
y
[1] "2011-01-09"
Add a number to the date (in this case 1 day)
y + 1
[1] "2011-01-10"
Arithmetic on dates and times keeps track of leap years, daylight saving time, and time zones.
Leap years
x <- ymd("2012-03-01")
y <- ymd("2012-02-28")
x - y
Time difference of 2 days
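lubridate also offers period helpers such as days() and months() for calendar-aware arithmetic. The sketch below shows them around the same leap day, together with the %m+% operator for month arithmetic that would otherwise land on a non-existent date.

```r
library(lubridate)

# 2012 has a February 29
leap_year(2012)

# Adding a one-day period lands on the leap day
next_day <- ymd("2012-02-28") + days(1)
next_day

# January 31 plus one month has no valid answer in February, so the result is NA
plain_add <- ymd("2012-01-31") + months(1)
plain_add

# %m+% rolls back to the last valid day of the month instead
rolled <- ymd("2012-01-31") %m+% months(1)
rolled
```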
Beware of time zones!
x <- ymd_hms("2012-10-25 01:00:00", tz = "") ## Local time zone (US Eastern here)
y <- ymd_hms("2012-10-25 06:00:00", tz = "GMT")
y - x
Time difference of 1 hours
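The result above depends on the local time zone of the machine that ran it. A reproducible variant, sketched below, names both zones explicitly; it assumes the original was run in the US Eastern zone, as the EDT/EST labels in the other outputs suggest.

```r
library(lubridate)

# Both zones are spelled out, so the result does not depend on the machine
x <- ymd_hms("2012-10-25 01:00:00", tz = "America/New_York")
y <- ymd_hms("2012-10-25 06:00:00", tz = "GMT")

# difftime() with explicit units avoids the automatic choice of unit
gap <- difftime(y, x, units = "mins")
gap

# Viewing x on the GMT clock makes the one-hour gap visible
x_gmt <- with_tz(x, "GMT")
x_gmt
```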
There are also leap seconds.
.leap.seconds
[1] "1972-06-30 20:00:00 EDT" "1972-12-31 19:00:00 EST"
[3] "1973-12-31 19:00:00 EST" "1974-12-31 19:00:00 EST"
[5] "1975-12-31 19:00:00 EST" "1976-12-31 19:00:00 EST"
[7] "1977-12-31 19:00:00 EST" "1978-12-31 19:00:00 EST"
[9] "1979-12-31 19:00:00 EST" "1981-06-30 20:00:00 EDT"
[11] "1982-06-30 20:00:00 EDT" "1983-06-30 20:00:00 EDT"
[13] "1985-06-30 20:00:00 EDT" "1987-12-31 19:00:00 EST"
[15] "1989-12-31 19:00:00 EST" "1990-12-31 19:00:00 EST"
[17] "1992-06-30 20:00:00 EDT" "1993-06-30 20:00:00 EDT"
[19] "1994-06-30 20:00:00 EDT" "1995-12-31 19:00:00 EST"
[21] "1997-06-30 20:00:00 EDT" "1998-12-31 19:00:00 EST"
[23] "2005-12-31 19:00:00 EST" "2008-12-31 19:00:00 EST"
[25] "2012-06-30 20:00:00 EDT" "2015-06-30 20:00:00 EDT"
[27] "2016-12-31 19:00:00 EST"
There is a set of helper functions in lubridate that can extract sub-elements of dates/times
x <- ymd_hms(c("2012-10-25 01:13:46",
               "2015-04-23 15:11:23"), tz = "")
year(x)
[1] 2012 2015
month(x)
[1] 10 4
day(x)
[1] 25 23
weekdays(x)
[1] "Thursday" "Thursday"
x <- ymd_hms(c("2012-10-25 01:13:46",
               "2015-04-23 15:11:23"), tz = "")
minute(x)
[1] 13 11
second(x)
[1] 46 23
hour(x)
[1] 1 15
week(x)
[1] 43 17
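A few more lubridate accessors follow the same pattern. The sketch below uses the same two times with the zone fixed to UTC (an assumption made here so the results do not depend on the local machine); the text labels depend on the locale.

```r
library(lubridate)

x <- ymd_hms(c("2012-10-25 01:13:46",
               "2015-04-23 15:11:23"), tz = "UTC")

yday(x)                    # day of the year
wday(x, week_start = 7)    # day of the week as a number, Sunday = 1
wday(x, label = TRUE)      # day of the week as an ordered factor
month(x, label = TRUE)     # month as an ordered factor

# Round down to the start of the month
month_start <- floor_date(x, "month")
month_start
```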
library(readr)
storm <- read_csv("../data/storm_events_2002.csv.gz", progress = FALSE)
names(storm)
[1] "BEGIN_YEARMONTH" "BEGIN_DAY" "BEGIN_TIME"
[4] "END_YEARMONTH" "END_DAY" "END_TIME"
[7] "EPISODE_ID" "EVENT_ID" "STATE"
[10] "STATE_FIPS" "YEAR" "MONTH_NAME"
[13] "EVENT_TYPE" "CZ_TYPE" "CZ_FIPS"
[16] "CZ_NAME" "WFO" "BEGIN_DATE_TIME"
[19] "CZ_TIMEZONE" "END_DATE_TIME" "INJURIES_DIRECT"
[22] "INJURIES_INDIRECT" "DEATHS_DIRECT" "DEATHS_INDIRECT"
[25] "DAMAGE_PROPERTY" "DAMAGE_CROPS" "SOURCE"
[28] "MAGNITUDE" "MAGNITUDE_TYPE" "FLOOD_CAUSE"
[31] "CATEGORY" "TOR_F_SCALE" "TOR_LENGTH"
[34] "TOR_WIDTH" "TOR_OTHER_WFO" "TOR_OTHER_CZ_STATE"
[37] "TOR_OTHER_CZ_FIPS" "TOR_OTHER_CZ_NAME" "BEGIN_RANGE"
[40] "BEGIN_AZIMUTH" "BEGIN_LOCATION" "END_RANGE"
[43] "END_AZIMUTH" "END_LOCATION" "BEGIN_LAT"
[46] "BEGIN_LON" "END_LAT" "END_LON"
[49] "EPISODE_NARRATIVE" "EVENT_NARRATIVE" "DATA_SOURCE"
Let’s take a look at the BEGIN_DATE_TIME, EVENT_TYPE, and DEATHS_DIRECT variables
library(dplyr)
select(storm, BEGIN_DATE_TIME, EVENT_TYPE, DEATHS_DIRECT)
# A tibble: 52,956 x 3
  BEGIN_DATE_TIME    EVENT_TYPE               DEATHS_DIRECT
  <chr>              <chr>                            <int>
1 03-JUL-03 21:30:00 Thunderstorm Wind                    0
2 04-JUL-03 08:35:00 Marine Thunderstorm Wind             0
3 04-JUL-03 08:35:00 Marine Thunderstorm Wind             0
4 11-AUG-03 16:33:00 Thunderstorm Wind                    0
5 11-AUG-03 18:00:00 Hail                                 0
# ... with 5.295e+04 more rows
We can first convert the BEGIN_DATE_TIME character column to a date/time (POSIXct) object.
storm_sub <- select(storm, BEGIN_DATE_TIME, EVENT_TYPE, DEATHS_DIRECT) %>%
    mutate(begin = dmy_hms(BEGIN_DATE_TIME)) %>%
    rename(type = EVENT_TYPE,
           deaths = DEATHS_DIRECT) %>%
    select(begin, type, deaths)
storm_sub
# A tibble: 52,956 x 3
  begin               type                     deaths
  <dttm>              <chr>                     <int>
1 2003-07-03 21:30:00 Thunderstorm Wind             0
2 2003-07-04 08:35:00 Marine Thunderstorm Wind      0
3 2003-07-04 08:35:00 Marine Thunderstorm Wind      0
4 2003-08-11 16:33:00 Thunderstorm Wind             0
5 2003-08-11 18:00:00 Hail                          0
# ... with 5.295e+04 more rows
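Once the column is a proper date/time, the lubridate accessors can be used inside dplyr verbs. The sketch below counts events per calendar month; storm_small is a stand-in built from the five rows printed above, not the full data set.

```r
library(dplyr)
library(lubridate)

# Stand-in for storm_sub: only the five rows printed above
storm_small <- tibble(
    begin = ymd_hms(c("2003-07-03 21:30:00", "2003-07-04 08:35:00",
                      "2003-07-04 08:35:00", "2003-08-11 16:33:00",
                      "2003-08-11 18:00:00")),
    type = c("Thunderstorm Wind", "Marine Thunderstorm Wind",
             "Marine Thunderstorm Wind", "Thunderstorm Wind", "Hail"),
    deaths = c(0L, 0L, 0L, 0L, 0L)
)

# Count events per calendar month
by_month <- storm_small %>%
    mutate(month = month(begin)) %>%
    count(month)
by_month
```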
We can make a histogram of the dates/times to get a sense of when storm events occur.
library(ggplot2)
storm_sub %>%
    ggplot(aes(x = begin)) +
    geom_histogram(bins = 20) +
    theme_bw()
We can group by event type too.
library(ggplot2)
storm_sub %>%
    ggplot(aes(x = begin)) +
    facet_wrap(~ type) +
    geom_histogram(bins = 20) +
    theme_bw() +
    theme(axis.text.x.bottom = element_text(angle = 90))
We can also plot the number of direct deaths against the time of each event.
storm_sub %>%
    ggplot(aes(begin, deaths)) +
    geom_point()
If we focus on a single month, the x-axis adapts.
storm_sub %>%
    filter(month(begin) == 6) %>%
    ggplot(aes(begin, deaths)) +
    geom_point()
Similarly, we can focus on a single day.
storm_sub %>%
    filter(month(begin) == 6, day(begin) == 16) %>%
    ggplot(aes(begin, deaths)) +
    geom_point()
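When the automatic axis is not what you want, ggplot2's scale_x_datetime() lets you set the breaks and labels yourself. The sketch below uses a small made-up data frame (the times and death counts are invented for illustration, not taken from the storm data).

```r
library(ggplot2)
library(lubridate)

# Made-up event times on a single day, for illustration only
events <- data.frame(
    begin = ymd_hms(c("2003-06-16 02:15:00", "2003-06-16 09:40:00",
                      "2003-06-16 17:05:00", "2003-06-16 22:30:00")),
    deaths = c(0, 1, 0, 2)
)

# scale_x_datetime() overrides the automatic breaks and labels
p <- ggplot(events, aes(begin, deaths)) +
    geom_point() +
    scale_x_datetime(date_breaks = "6 hours", date_labels = "%H:%M")
# print(p) draws the plot
```

The date_labels string uses the same format codes as strptime().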
Dates and times have special classes in R that allow for numerical and statistical calculations
Dates use the Date class
Times use the POSIXct and POSIXlt classes
Character strings can be coerced to Date/Time classes using the ymd() and ymd_hms() functions. For unusual formats, you can use the strptime() or the as.Date() functions.
The lubridate package is essential for manipulating date/time data
Both plot() and ggplot() “know” about dates and times and will handle axis labels appropriately
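For completeness, here is a minimal sketch of the as.Date() route mentioned in the summary, using base R only. The format string follows the strptime() conventions; the input string is illustrative.

```r
# as.Date() needs a format when the string is not already YYYY-MM-DD
d <- as.Date("13/09/2016", format = "%d/%m/%Y")
d

# format() goes the other way, from Date back to a character string
back <- format(d, "%d/%m/%Y")
back
```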