forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
datetimes.Rmd
676 lines (490 loc) · 28.1 KB
/
datetimes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
863
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
# Dates and times
This chapter will show you how to work with dates and times in R. Dates and times follow their own rules, which can make working with them difficult. For example dates and times are ordered, like numbers; but the timeline is not as orderly as the numberline. The timeline repeats itself, and has noticeable gaps due to Daylight Savings Time, leap years, and leap seconds. Datetimes also rely on ambiguous units: How long is a month? How long is a year? Time zones give you another headache when you work with dates and times. The same instant of time will have different "names" in different time zones.
This chapter will focus on R's __lubridate__ package, which makes it much easier to work with dates and times in R. You'll learn the basic date time structures in R and the lubridate functions that make working with them easy. We will also rely on some of the packages that you already know how to use, so load this entire set of packages to begin:
```{r message = FALSE, warning = FALSE}
library(nycflights13)
library(dplyr)
library(stringr)
library(ggplot2)
library(lubridate)
```
## Parsing times
Time data normally comes as character strings, or numbers spread across columns, as in the `flights` dataset from [Relational data].
```{r}
flights %>%
select(year, month, day, hour, minute)
```
Getting R to agree that your dataset contains the dates and times that you think it does can be tricky. Lubridate simplifies that. To combine separate numbers into datetimes, use `make_datetime()`.
```{r}
datetimes <- flights %>%
mutate(departure = make_datetime(year = year, month = month, day = day,
hour = hour, min = minute))
```
With a little work, we can also create arrival times for each flight in flights. I'll then clean up the data a little.
```{r}
(datetimes <- datetimes %>%
mutate(arrival = make_datetime(year = year, month = month, day = day,
hour = str_sub(arr_time, end = -3),
min = str_sub(arr_time, start = -2))) %>%
filter(!is.na(departure), !is.na(arrival)) %>%
select(departure, arrival, dep_delay, arr_delay, carrier, tailnum,
flight, origin, dest, air_time, distance))
```
To parse character strings as dates, identify the order in which the year, month, and day appears in your dates. Now arrange "y", "m", and "d" in the same order. This is the name of the function in lubridate that will parse your dates. For example,
```{r}
ymd("20170131")
mdy("January 31st, 2017")
dmy("31-1-2017")
```
If your date contains hours, minutes, or seconds, add an underscore and then one or more of "h", "m", and "s" to the name of the parsing function.
```{r}
ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")
```
Lubridate's parsing functions handle a wide variety of formats and separators, which simplifies the parsing process.
For both `make_difftime()` and the y,m,d,h,m,s parsing functions, you can set the time zone of a date when you create it with a tz argument. As a general rule, I recommend that you do not use time zones unless you have to. I'll cover time zones and the idiosyncracies that come with them later in the chapter. If you do not set a time zone, lubridate will supply the Coordinated Universal Time zone, a very easy time zone to work in.
```{r}
ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")
```
#### The structure of dates and times
What have we accomplished by parsing our datetimes? R now recognizes that our departure and arrival variables contain datetime information, and it saves the variables in the POSIXct format, a common way of representing dates and times.
```{r}
class(datetimes$departure[1])
```
In POSIXct form, each datetime is saved as the number of seconds that passed between the datetime and midnight January 1st, 1970 in the Coordinated Universal Time zone. Under this system, the very first moment of January 1st, 1970 gets the number zero. Earlier moments get a negative number.
```{r}
unclass(datetimes$departure[1])
unclass(ymd_hms("1970-01-01 00:00:00"))
```
The POSIXct format has many advantages. You can display the same date time in any time zone by changing its tzone attribute (more on that later), and R can recognize when two times displayed in two different time zones refer to the same moment.
```{r warning = FALSE}
(zero_hour <- ymd_hms("1970-01-01 00:00:00"))
attr(zero_hour, "tzone") <- "America/Chicago"
zero_hour
ymd_hms("1970-01-01 00:00:00") == ymd_hms("1970-01-01 00:00:00", tz = "America/Denver")
```
Best of all, you can change a datetime by adding or subtracting seconds from it.
```{r}
ymd_hms("1970-01-01 00:00:00") 1
```
This gives us a way to calculate the scheduled departure and arrival times of each flight in flights.
```{r}
datetimes %>%
mutate(scheduled_departure = departure - dep_delay * 60,
scheduled_arrival = arrival - arr_delay * 60) %>%
select(scheduled_departure, dep_delay, departure,
scheduled_arrival, arr_delay, arrival)
```
If you work only with dates, and not times, you can also use R's Date class. R saves Dates as the number of days since January 1st, 1970. The easiest way to create a Date is to parse with lubridate's y, m, d functions. These will return a Date class object whenever you do not supply an hour, minutes, or seconds component.
```{r}
(zero_day <- mdy("January 1st, 1970"))
class(zero_day)
zero_day - 1
```
R can also save datetimes in the POSIXlt form, a list based date structure. Working with POSIXlt dates can be much slower than working with POSIXct dates, and I don't recommend it. Lubridate's parse functions will always return a POSIXct date when you supply an hour, minutes, or seconds component.
## Arithmetic with dates
Did you see how I calculated the scheduled departure and arrival times for our flights? I added the appropriate number of seconds to the actual departure and arrival times. You can take this approach even farther by adding hours, days, weeks, and more.
```{r eval = FALSE}
datetimes %>%
transmute(second_lag = departure 1,
minute_lag = departure 1 * 60,
hour_lag = departure 1 * 60 * 60,
day_lag = departure 1 * 60 * 60 * 24,
week_lag = departure 1 * 60 * 60 * 24 * 7)
```
However, the conversion to seconds becomes tedious and introduces a chance for error. To simplify the process, use difftimes or durations. Each represents a span of time in R.
### Difftimes
A difftime class object records a span of time in one of seconds, minutes, hours, days, or weeks. R creates a difftime whenever you subtract two dates or two datetimes.
```{r}
(day1 <- ymd("2000-01-01") - ymd("1999-12-31"))
```
You can also create a difftime with `as.difftime()`. Pass it the length of the difftime as well as the units to use.
```{r}
(day2 <- as.difftime(24, units = "hours"))
```
Difftimes come with base R, but they have some rough edges. For example, the value of a difftime depends on the difftime's units attribute. If this attribute is dropped, as it is when you combine difftimes with `c()`, the value becomes uninterpretable. Consider what happens when I combine these two difftimes that have the same length.
```{r}
c(day1, day2)
```
You can avoid these rough edges by using lubridate's version of difftimes, known as durations.
### Durations
Durations behave like difftimes, but are a little more user friendly. To make a duration, choose a units of time, make it plural, and then place a "d" in front of it. This is the name of the funtion in lubridate that will make your duration, i.e.
```{r}
dseconds(1)
dminutes(1)
dhours(1)
ddays(1)
dweeks(1)
dyears(1)
```
To make a duration that lasts multiple units, pass the number of units as the argument of the duration function. So for example, you can make a duration that lasts three minutes with
```{r}
dminutes(3)
```
This syntax provides a very clean way to do arithmetic with datetimes. For example, we can recreate our scheduled departure and arrival times with
```{r}
(datetimes <- datetimes %>%
mutate(scheduled_departure = departure - dminutes(dep_delay),
scheduled_arrival = arrival - dminutes(arr_delay)) %>%
select(scheduled_departure, dep_delay, departure,
scheduled_arrival, arr_delay, arrival,
carrier, tailnum, flight, origin, dest, air_time, distance))
```
Durations always contain a time span measured in seconds. Larger units are estimated by converting minutes, hours, days, weeks, and years to seconds at the standard rate. This makes durations very precise, but it can lead to unexpected results when the timeline progresses at a non-standard rate.
For example, Daylight Savings Time can result in this sort of surprise.
```{r}
ymd_hms("2016-03-13 00:00:00", tz = "America/New_York") ddays(1)
```
Luckily, the UTC time zone does not use Daylight Savings Time, so if you keep your datetimes in UTC you can avoid this type of complexity. But what if you do need to work with Daylight Savings Time (or leap years or months, two other places where the time line can misbehave [^1])?
[^1]: Technically, the timeline also misbehaves during __leap seconds__, extra seconds that are added to the timeline to account for changes in the Earth's movement. In practice, most operating systems ignore leap seconds, and R follows the behavior of the operating system. If you are curious about when leap seconds occur, R lists them under `.leap.seconds`.
### Periods
You can use lubridate's period class to handle irregularities in the timeline. Periods are time spans that are generalized to work with clock times, the "name" of a datetime that you would see on a clock, like "2016-03-13 00:00:00." Periods have no fixed length, which lets them work in an intuitive, human friendly way. When you add a one day period to "2000-03-13 00:00:00" the result will be "2000-03-14 00:00:00" whether there were 86400 seconds in March 13, 2000 or 82800 seconds (due to Daylight Savings Time).
To make a period object, call the name of the unit you wish to use, make it plural, and pass it the number of units to use as an argument.
```{r}
seconds(1)
minutes(1)
hours(1)
days(1)
weeks(1)
months(1)
years(1)
```
You can add periods together to make larger periods.
```{r}
days(50) hours(25) minutes(2)
```
To see how periods work, compare the performance of durations and periods during Daylight Savings Time and a leap year.
```{r}
# Daylight Savings Time
ymd_hms("2016-03-13 00:00:00", tz = "America/New_York") days(1)
ymd_hms("2016-03-13 00:00:00", tz = "America/New_York") ddays(1)
# A leap year
ymd_hms("2016-01-01 00:00:00") years(1)
ymd_hms("2016-01-01 00:00:00") dyears(1)
```
The period always returns the "expected" clock time, as if the irregularity had not happened. The duration always returns the time that is exactly 86,400 seconds (in the case of a day) or 31,536,000 seconds later (in the case of a year).
When the timeline behaves normally, the results of a period and a duration will agree.
```{r}
# Not Daylight Savings Time
ymd_hms("2016-03-14 00:00:00") days(1)
ymd_hms("2016-03-14 00:00:00") ddays(1)
```
When should you use a period and when should you use a duration?
* Use durations whenever you need to calculate physical properties or compare exact timespans, such as the life of two different batteries.
* Use periods whenever you need to model human events, such as the opening of the stock market, or the close of the business day.
Periods also let you model datetimes that reoccur on a monthly basis in a way that would be impossible with durations. Consider that some of the months below are 31 days, some have 30, and one has 29.
```{r}
mdy("January 1st, 2016") months(0:11)
```
Let's use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination _before_ they departed from New York City.
```{r}
datetimes %>%
filter(arrival < departure)
```
These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding `days(1)` to the arrival time of each overnight flight. Then we will recalculate each scheduled arrival time.
```{r}
overnight <- datetimes$arrival < datetimes$departure
datetimes$arrival[overnight] <- datetimes$arrival[overnight] days(1)
(datetimes <- datetimes %>%
mutate(scheduled_arrival = arrival - dminutes(arr_delay)))
```
Now all of our flights obey the laws of physics.
```{r}
datetimes %>%
filter(arrival < departure)
```
### Rolling back and rounding dates
The length of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation, `January 31st one month`. Should the answer be
1. `February 31st` (which doesn't exist)
2. `March 4th` (31 days after January 31), or
3. `February 28th` (assuming its not a leap year)
A basic property of arithmetic is that `a b - b = a`. Only solution 1 obeys this property, but it is an invalid date. Lubridate tries to make arithmetic as consistent as possible by invoking the following rule *if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA*.
If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special `%m %` and `%m-%` operators. `%m %` and `%m-%` automatically roll dates back to the last day of the month, should that be necessary.
```{r}
ymd("2016-01-31") months(0:11)
ymd("2016-01-31") %m % months(0:11)
```
Notice that this will only affect arithmetic with months (and arithmetic with years if your start date is Feb 29).
You can use lubridate's functions `floor_date()`, `round_date()`, and `ceiling_date()` to round (or move) a date to a nearby unit of time. Each function takes a vector of dates to adjust and then the name of the time unit to floor, ceiling, or round them to.
```{r}
floor_date(ymd_hms("2016-01-01 12:34:56"), unit = "hour")
ceiling_date(ymd_hms("2016-01-01 12:34:56"), unit = "hour")
round_date(ymd_hms("2016-01-01 12:34:56"), unit = "day")
```
`floor_date()` would help you calculate the days that occur exactly 31 days after the start of each month (Solution 2 above).
```{r}
floor_date(ymd("2016-01-31"), unit = "month") months(0:11) days(31)
```
## Extracting and setting date components
Now that we have the scheduled arrival and departure times for each flight in flights, let's examine when flights are scheduled to depart. We could plot a histogram of flights throughout the year, but that's not very informative.
```{r}
datetimes %>%
ggplot(aes(scheduled_departure))
geom_histogram(binwidth = 86400) # 86400 seconds = 1 day
```
Let's instead group flights by day of the week, to see which week days are the busiest, and by hour to see which times of the day are busiest. To do this we will need to extract the day of the week and hour that each flight was scheduled to depart.
You can extract the year, month, day of the year (yday), day of the month (mday), day of the week (wday), hour, minute, second, and time zone (tz) of any date or datetime with lubridate's accessor functions. Use the function that has the name of the unit you wish to extract. Accessor function names are singular, period function names are plural.
```{r}
(datetime <- ymd_hms("2007-08-09 12:34:56", tz = "America/Los_Angeles"))
year(datetime)
month(datetime)
yday(datetime)
mday(datetime)
wday(datetime)
hour(datetime)
minute(datetime)
second(datetime)
tz(datetime)
```
For both `month()` and `wday()` you can set `label = TRUE` to return the name of the month or day of the week. Set `abbr = TRUE` to return an abbreviated version of the name, which can be helpful in plots.
```{r}
month(datetime, label = TRUE)
wday(datetime, label = TRUE, abbr = TRUE)
```
We can use the `wday()` accessor to see that more flights depart on weekdays than weekend days.
```{r}
datetimes %>%
transmute(weekday = wday(scheduled_departure, label = TRUE)) %>%
filter(!is.na(weekday)) %>%
ggplot(aes(x = weekday))
geom_bar()
```
The `hour()` accessor reveals that scheduled departures follow a bimodal distribution throughout the day. There is a morning and evening peak in departures.
```{r}
datetimes %>%
transmute(hour = hour(scheduled_departure)) %>%
filter(!is.na(hour)) %>%
ggplot(aes(x = hour))
geom_bar()
```
When should you depart if you want to minimize your chance of delay? The results are striking. On average, flights that left on a Saturday arrived ahead of schedule.
```{r}
datetimes %>%
mutate(weekday = wday(scheduled_departure, label = TRUE)) %>%
filter(!is.na(weekday)) %>%
group_by(weekday) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = weekday, y = avg_delay))
geom_bar(stat = "identity")
```
On average, flights that departed before 10:00 arrived early. Average arrival delays increased throughout the day.
```{r}
datetimes %>%
mutate(hour = hour(scheduled_departure)) %>%
filter(!is.na(hour)) %>%
group_by(hour) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = hour, y = avg_delay))
geom_bar(stat = "identity")
```
You can also use the `yday()` accessor to see that average delays fluctuate throughout the year.
```{r fig.height=3, warning = FALSE}
datetimes %>%
mutate(yearday = yday(scheduled_departure)) %>%
filter(!is.na(yearday), year(scheduled_departure) == 2013) %>%
group_by(yearday) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = yearday, y = avg_delay))
geom_bar(stat = "identity")
```
### Setting dates
You can also use each accessor funtion to set the components of a date or datetime.
```{r}
datetime
year(datetime) <- 2001
datetime
month(datetime) <- 01
datetime
yday(datetime) <- 01
datetime
mday(datetime) <- 02
datetime
wday(datetime) <- 02
datetime
hour(datetime) <- 01
datetime
minute(datetime) <- 01
datetime
second(datetime) <- 01
datetime
tz(datetime) <- "UTC"
datetime
```
You can set more than one component at once with `update()`.
```{r}
update(datetime, year = 2002, month = 2, mday = 2, hour = 2,
minute = 2, second = 2, tz = "America/Anchorage")
```
## Time zones
R records the time zone of each datetime as an attribute of the datetime object. This makes time zones tricky to work with. For example, a vector of datetimes can only contain one time zone attribute, so every datetime in the vector must share the same time zone.
```{r}
(firsts <- ymd_hms("2000-01-01 12:00:00") months(0:11))
unclass(firsts)
attr(firsts, "tzone") <- "Pacific/Honolulu"
unclass(firsts)
firsts
```
Operations that drop attributes, such as `c()` will drop the time zone attribute from your datetimes. In that case, the datetimes will display in your local time zone (mine is "America/New_York", i.e. Eastern Time).
```{r}
(jan_day <- ymd_hms("2000-01-01 12:00:00"))
(july_day <- ymd_hms("2000-07-01 12:00:00"))
c(jan_day, july_day)
unclass(c(jan_day, july_day))
```
Moreover, R relies on your operating system to interpret time zones. As a result, R will be able to recognize some time names on some computers but not on others. Throughout this chapter we use time zone names in the Olson Time Zone Database, as these time zones are recognized by most operating systems. You can find a list of Olson time zone names at <http://en.wikipedia.org/wiki/List_of_tz_database_time_zones>.
You can set the time zone of a date with the tz argument when you parse the date.
```{r}
ymd_hms("2016-01-01 00:00:01", tz = "Pacific/Auckland")
```
If you do not set the time zone, lubridate will automatically assign the datetime to Coordinated Universal Time (UTC). Coordinated Universal Time is the standard time zone used by the scientific community and roughly equates to its predecessor, Greenwich Meridian Time. Since Coordinated Universal time does not follow Daylight Savings Time, it is straightforward to work with times saved in this time zone.
You can change the time zone of a date time in two ways. First, you can display the same instant of time in a different time zone with lubridate's `with_tz()` function.
```{r}
jan_day
with_tz(jan_day, tz = "Australia/Sydney")
```
`with_tz()` changes the time zone attribute of an instant, which changes the clock time displayed for the instant. But `with_tz()` _does not_ change the underlying instant of time represented by the clock time. You can verify this by checking the POSIXct form of the instant. The updated time occurs the same number of seconds after January 1st, 1970 as the original time.
```{r warning = FALSE}
unclass(jan_day)
unclass(with_tz(jan_day, tz = "Australia/Sydney"))
jan_day == with_tz(jan_day, tz = "Australia/Sydney")
```
Contrast this with the second way to change a time zone. You can display the same clock time with a new time zone with lubridate's `force_tz()` function.
```{r}
jan_day
force_tz(jan_day, tz = "Australia/Sydney")
```
Unlike `with_tz()`, `force_tz()` creates a new instant of time. Twelve o'clock in Greenwich, UK is not the same time as twelve o'clock in Sydney, AU. you can verify this by looking at the POSIXct structure of the new date. It occurs at a different number of seconds after January 1st, 1970 than the original date.
```{r warning = FALSE}
unclass(jan_day)
unclass(force_tz(jan_day, tz = "Australia/Sydney"))
jan_day == force_tz(jan_day, tz = "Australia/Sydney")
```
When should you use `with_tz()` and when should you use `force_tz()`? Use `with_tz()` when you wish to discover what the current time is in a different time zone. Use `force_tz()` when you want to make a new time in a new time zone.
### Daylight Savings Time
In computing, time zones do double duty. They record where on the planet a time occurs as well as whether or not that location follows Daylight Savings Time. Different areas within the same "time zone" make different decisions about whether or not ot follow Daylight Savings Time. As a result, places like Phoenix, AZ and Denver, CO have the same times for part of the year, but different times for the rest of the year.
```{r}
with_tz(c(jan_day, july_day), tz = "America/Denver")
with_tz(c(jan_day, july_day), tz = "America/Phoenix")
```
This is because Denver follows Daylight Savings Time, but Phoenix does not. R encodes this by giving each location its own time zone that follows its own rules.
You can check whether or not a time has been adjusted locally for Daylight Savings Time with lubridate's `dst()` function.
```{r}
dst(with_tz(c(jan_day, july_day), tz = "America/Denver"))
dst(with_tz(c(jan_day, july_day), tz = "America/Phoenix"))
```
R will display times that are adjusted for Daylight Savings Time with a "D" in the time zone. Hence, MDT stands for Mountain Daylight Savings Time. MST stands for Mountain Standard Time. Notice that R displays an abbreviation for each time zone that does not directly map to the full name of the time zone. Many time zones share the same abbreviations. For example, America/Phoeniz and America/Denver both appear as MST.
```{r include = FALSE}
# TIME ZONES and DAYLIGHT SAVINGS
# How long was each flight scheduled to be?
# First convert scheduled times to NYC time zone
datetimes2 <- airports %>%
select(faa, name, tz, dst) %>%
right_join(datetimes, by = c("faa" = "dest")) %>%
mutate(NYC_scheduled_arrival = scheduled_arrival - hours(5 tz),
NYC_arrival = arrival - hours(5 tz))
datetimes2 <- datetimes2 %>%
mutate(scheduled_departure = force_tz(scheduled_departure, tz = "America/New_York"),
departure = force_tz(departure, tz = "America/New_York"),
NYC_scheduled_arrival = force_tz(NYC_scheduled_arrival, tz = "America/New_York"),
NYC_arrival = force_tz(NYC_arrival, tz = "America/New_York"))
# Then adjust for places that do not use DST
datetimes2 %>%
filter(dst != "A") %>%
select(faa, name, dst) %>%
unique()
adjust_for_dst <- datetimes2$faa %in% c("PHX", "HNL") &
dst(datetimes2$NYC_scheduled_arrival) &
!is.na(dst(datetimes2$NYC_scheduled_arrival))
datetimes2$NYC_scheduled_arrival[adjust_for_dst] <- datetimes2$NYC_scheduled_arrival[adjust_for_dst] hours(1)
datetimes2$NYC_arrival[adjust_for_dst] <- datetimes2$NYC_arrival[adjust_for_dst] hours(1)
datetimes2 %>%
select(scheduled_arrival, NYC_scheduled_arrival, tz)
# Let's check that we did some correctly
datetimes2 %>%
filter(faa == "HNL") %>%
transmute(HNL_scheduled_arrival = with_tz(NYC_scheduled_arrival, tz = "Pacific/Honolulu"),
scheduled_arrival = force_tz(scheduled_arrival, tz = "Pacific/Honolulu")) %>%
filter(HNL_scheduled_arrival != scheduled_arrival)
datetimes2 %>%
filter(faa == "PHX") %>%
transmute(PHX_scheduled_arrival = with_tz(NYC_scheduled_arrival, tz = "America/Phoenix"),
scheduled_arrival = force_tz(scheduled_arrival, tz = "America/Phoenix")) %>%
filter(PHX_scheduled_arrival != scheduled_arrival)
# Do some carriers schedule different times relative to distance?
datetimes2 %>%
select(-name) %>%
left_join(airlines, by = "carrier") %>%
transmute(estimate = as.numeric(NYC_scheduled_arrival - scheduled_departure),
distance = distance,
name = name) %>%
lm(estimate ~ distance name, data = .) %>%
broom::tidy() %>%
arrange(estimate)
```
## Intervals of time
An interval of time is a specific period of time, such as midnight April 13, 2013 to midnight April 23, 2013. You can make an interval of time with lubridate's `interval()` function. Pass it the start and end datetimes of the interval. Use the tzone argument to select a time zone to display the interval in (if you wish to display the interval in a different time zone than that of the start date).
```{r}
apr13 <- mdy("4/13/2013", tz = "America/New_york")
apr23 <- mdy("4/23/2013", tz = "America/New_york")
interval(apr13, apr23)
```
You can also make an interval with the `%--%` operator.
```{r}
(spring_break <- apr13 %--% apr23)
```
These dates align exactly with New York City Public school's 2013 Spring Recess. Do you think flight delays increased during this interval? Let's check.
You can test whether or not a date falls within an interval with lubridate's `%within% operator, e.g.
```{r}
mdy(c("4/20/2013", "5/1/2013")) %within% spring_break
```
Using this operator, we see that 7853 flights departed during spring break.
```{r}
# What flights occurred during spring break?
datetimes %>%
filter(scheduled_departure %within% spring_break)
```
A further query shows that flights during spring break arrived 6.65 minutes later on average than flights during the rest of the year.
```{r}
datetimes %>%
mutate(sbreak = scheduled_departure %within% spring_break) %>%
group_by(sbreak) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = sbreak, y = avg_delay)) geom_bar(stat = "identity")
```
Lubridate lets you do quite a bit with intervals. You can access the start or end dates of an interval with `int_start()` and `int_shift()`.
```{r}
int_start(spring_break)
int_end(spring_break)
```
You can chnge the direction of an interval with `int_flip()`. Use `int_shift()` to shift an interval forwards or backwards along the timeline. Give `int_shift()` a period or duration object to shift the interval by.
```{r}
int_flip(spring_break)
int_shift(spring_break, days(1))
int_shift(spring_break, months(-1))
```
You can use `int_overlaps()` to test whether an interval overlaps with another interval. So for example, we can represent each week in April 2013 with its own interval and then see which weeks overlap with spring break.
```{r}
(april_sundays <- mdy("3/31/2013", tz = "America/New_york") weeks(0:4))
(april_saturdays <- mdy("4/6/2013", tz = "America/New_york") weeks(0:4))
(april_weeks <- april_sundays %--% april_saturdays) # a vector of intervals
int_overlaps(april_weeks, spring_break)
```
You can perform set operations on intervals with `intersect()`, `union()` and `setdiff()` to create new intervals.
Finally, you can get a sense of how long an interval is in several ways.
1. Turn the interval into a period
```{r}
as.period(spring_break)
```
2. Divide the interval by a duration
```{r}
spring_break / dweeks(1)
```
3. Integer divide the interval by a period. Then modulo the interval by a period for the remainder.
```{r}
spring_break %/% weeks(1)
spring_break %% weeks(1)
```
4. Retrieve the interval length in seconds with `int_length()`
```{r}
int_length(spring_break)
```