forked from hadley/r4ds
-
Notifications
You must be signed in to change notification settings - Fork 0
/
tibble.qmd
266 lines (194 loc) · 7.57 KB
/
tibble.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
# Tibbles {#sec-tibbles}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
Tibbles *are* data frames, but they tweak some older behaviors to make your life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them `data.frame`s.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
### Prerequisites
In this chapter we'll explore the **tibble** package, part of the core tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Creating tibbles
If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
`tibble()` works by assembling individual vectors:
```{r}
x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
```
You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
```{r}
tibble(
x1 = x,
x2 = c(10, 15, 25),
y = sqrt(x1^2 x2^2)
)
```
Every column in a data frame or tibble must be same length, so you'll get an error if the lengths are different:
```{r}
#| error: true
tibble(
x = c(1, 5),
y = c("a", "b", "c")
)
```
As the error suggests, individual values will be recycled to the same length as everything else:
```{r}
tibble(
x = 1:5,
y = "a",
z = TRUE
)
```
Another way to create a tibble is with `tribble()`, which short for **tr**ansposed tibble.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy to read form:
```{r}
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
```
Finally, if you have a regular `data.frame` you can turn it into to a tibble with `as_tibble()`:
```{r}
as_tibble(mtcars)
```
The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
## Non-syntactic names
It's possible for a tibble to have column names that are not valid R variable names, names that are **non-syntactic**.
For example, the variables might not start with a letter or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
```
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
If these difference cause problems when working with older packages, you can turn a tibble back to a regular data frame with `as.data.frame()`.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature inspired by `str()`:
```{r}
tibble(
a = lubridate::now() runif(1e3) * 86400,
b = lubridate::today() runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
```
Where possible, tibbles also use color to draw your eye to important differences.
One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
```{r}
tibble(x = c("NA", NA))
```
Tibbles are designed to avoid overwhelming your console when you print large data frames.
But sometimes you need more output than the default display.
There are a few options that can help.
First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display.
`width = Inf` will display all columns:
```{r}
library(nycflights13)
flights |>
print(n = 10, width = Inf)
```
You can also control the default print behavior by setting options:
- `options(tibble.print_max = n, tibble.print_min = m)`: if more than `n` rows, print only `m` rows.
Use `options(tibble.print_min = Inf)` to always show all rows.
- Use `options(tibble.width = Inf)` to always print all columns, regardless of the width of the screen.
You can see a complete list of options by looking at the package help with `package?tibble`.
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
```{r}
#| eval: false
flights |> View()
```
### Extracting variables
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you can use `dplyr::pull()`:
```{r}
tb <- tibble(
id = LETTERS[1:5],
x1 = 1:5,
y1 = 6:10
)
tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in [Chapter -@sec-vectors].
```{r}
tb |> pull(x1, name = id)
```
You can also use the base R tools `$` and `[[`.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
# Extract by name
tb$x1
tb[["x1"]]
# Extract by position
tb[[1]]
```
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
```{r}
# Tibbles complain a lot:
tb$x
tb$z
# Data frame use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
df$x
df$z
```
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
### Subsetting
Lastly, there are some important differences when using `[`.
With `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector, which is a common source of bugs.
With tibbles, `[` always returns another tibble.
This can sometimes cause problems when working with older code.
If you hit one of those functions, just use `as.data.frame()` to turn your tibble back to a `data.frame`.
### Exercises
1. How can you tell if an object is a tibble?
(Hint: try printing `mtcars`, which is a regular `data.frame`).
2. Compare and contrast the following operations on a `data.frame` and equivalent tibble.
What is different?
Why might the default `data.frame` behaviors cause you frustration?
```{r}
#| eval: false
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
```
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by:
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 rnorm(length(`1`))
)
```
5. What does `tibble::enframe()` do?
When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble?