# Programming with strings
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r}
library(stringr)
library(tidyr)
library(tibble)
```
### Encoding
You will not generally find the base R `Encoding()` function useful, because it supports only three encodings (and interpreting what they mean is non-trivial) and it tells you only the encoding that R thinks a string has, not what it really is.
Typically, the problem is that the declared encoding is wrong.
The tidyverse follows best practices[^prog-strings-1] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
It's still possible to have problems, but they'll typically arise during data import.
Once you've diagnosed an encoding problem, you should fix it at data import (i.e. by using the `encoding` argument to `readr::locale()`).
[^prog-strings-1]: <http://utf8everywhere.org>
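As a minimal sketch of fixing an encoding problem at import, the chunk below writes a few Latin-1 bytes to a temporary file and reads them back with the encoding declared.
The file contents and the names `path` and `fixed_text` are made up for illustration, and the chunk assumes the readr package is installed.

```{r}
library(readr)

# Hypothetical file: the bytes spell "El Niño" in Latin-1 (0xf1 is "ñ").
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x45, 0x6c, 0x20, 0x4e, 0x69, 0xf1, 0x6f, 0x0a)), path)

# Declaring the encoding at import lets readr convert the text to UTF-8.
fixed_text <- read_lines(path, locale = locale(encoding = "latin1"))
fixed_text
```

Reading the same file without the `locale` argument would treat the bytes as UTF-8, which is the kind of mismatch described above.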
### Length and subsetting
Computing the length of a string seems straightforward if you're only familiar with English, but things get complex quickly when working with other languages.
The four most common scripts are Latin, Chinese, Arabic, and Devanagari, which represent four different kinds of writing system:
- Latin uses an alphabet, where each consonant and vowel gets its own letter.
- Chinese uses logograms, where each character represents a word or part of a word.
Chinese characters are roughly square, while English letters are roughly twice as high as they are wide, which gives rise to the distinction between full-width and half-width characters.
- Arabic uses an abjad: only consonants are written, and vowels are optionally indicated with diacritics.
Additionally, it's written from right to left, so the first letter is the letter on the far right.
- Devanagari is an abugida, where each symbol represents a consonant-vowel pair and the vowel notation is secondary.
> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
> --- <http://utf8everywhere.org>
```{r}
# But stringr's character boundaries may not agree, even with a Czech locale:
str_split("check", boundary("character", locale = "cs_CZ"))
```
This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
Diacritics are tricky because Unicode provides two ways of representing accented characters: many common characters have a dedicated code point, but others must be built up from individual components.
```{r}
x <- c("á", "x́")
str_length(x)
# str_width(x)
str_sub(x, 1, 1)
# stringi::stri_width(c("全形", "ab"))
# Each character has a width of 0, 1, or 2,
# but this assumes no font substitution.
```
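The two representations can be compared directly.
The sketch below goes one small step beyond the text above by using Unicode normalisation (NFC, via `stringi::stri_trans_nfc()`) to compose the combining sequence; the variable names are made up for illustration.

```{r}
precomposed <- "\u00e1" # "á" stored as a single code point
combined <- "a\u0301"   # "a" followed by a combining acute accent

# The two strings look the same but are stored differently.
precomposed == combined
stringi::stri_length(c(precomposed, combined))

# NFC normalisation composes the pair into the single code point.
stringi::stri_trans_nfc(combined) == precomposed
```

Normalisation only helps when a precomposed code point exists; a sequence with no precomposed form stays as multiple code points.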
```{r}
cyrillic_a <- "А"
latin_a <- "A"
cyrillic_a == latin_a
stringi::stri_escape_unicode(cyrillic_a)
stringi::stri_escape_unicode(latin_a)
```
### `str_c()`
`NULL`s are silently dropped.
This is particularly useful in conjunction with `if`:
```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```
### `str_dup()`
Closely related to `str_c()` is `str_dup()`.
`str_c(a, a, a)` is like `a + a + a`, so what's the equivalent of `3 * a`?
That's `str_dup()`:
```{r}
str_dup(letters[1:3], 3)
str_dup("a", 1:3)
```
## Performance
`fixed()`: matches exactly the specified sequence of bytes.
It ignores all special regular expression characters and operates at a very low level.
This allows you to avoid complex escaping and can be much faster than regular expressions.
The following microbenchmark shows that it's about 3x faster for a simple example.
```{r}
microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)
```
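Speed aside, the difference in escaping is easy to see with a pattern that contains a regular expression metacharacter.
This is a small sketch with a made-up input vector:

```{r}
library(stringr)

x <- c("a.b", "ab")

# As a regular expression, "." matches any character.
any_char <- str_detect(x, ".")
any_char

# With fixed(), "." matches only a literal full stop, with no escaping needed.
literal_fixed <- str_detect(x, fixed("."))
literal_fixed

# The regular expression equivalent needs an escape.
literal_regex <- str_detect(x, "\\.")
literal_regex
```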
As you saw with `str_split()` you can use `boundary()` to match boundaries.
You can also use it with the other functions:
```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
### Extract
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
more <- sentences[str_count(sentences, colour_match) > 1]
str_extract_all(more, colour_match)
```
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
We don't talk about matrices here, but they are useful elsewhere.
### Exercises
1. From the Harvard sentences data, extract:
    1. The first word from each sentence.
    2. All words ending in `ing`.
    3. All plurals.
## Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
You can also use parentheses to extract parts of a complex match.
For example, imagine we want to extract nouns from the sentences.
As a heuristic, we'll look for any word that comes after "a" or "the".
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences |>
  str_subset(noun) |>
  head(10)
has_noun |>
  str_extract(noun)
```
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
has_noun |>
  str_match(noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
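Because `str_match()` returns a matrix, individual groups can be pulled out by column.
The sketch below uses three made-up phrases (the last one has no match) rather than the Harvard sentences, so the result is easy to check by eye; `phrases`, `noun_pattern`, and `parts` are illustrative names.

```{r}
library(stringr)

# Made-up phrases for illustration; the last one has no match.
phrases <- c("the cat sat", "a dog ran", "no match")
noun_pattern <- "(a|the) ([^ ]+)"

parts <- str_match(phrases, noun_pattern)
parts[, 1] # complete match
parts[, 2] # first group: the article
parts[, 3] # second group: the word after the article
```

Strings without a match give a row of `NA`s, so the output always has one row per input string.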
## Splitting
Use `str_split()` to split a string up into pieces.
For example, we could split sentences into words:
```{r}
sentences |>
  head(5) |>
  str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list.
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
str_split("a|b|c|d", "\\|")[[1]]
```
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
```{r}
sentences |>
  head(5) |>
  str_split(" ", simplify = TRUE)
```
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields |> str_split(": ", n = 2, simplify = TRUE)
```
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
Show how `separate_rows()` is a special case of `str_split()` + `summarise()`.
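One possible way to see the relationship between `separate_rows()` and `str_split()` is sketched below.
Note that it swaps in `tidyr::unnest()` for the second step rather than `summarise()`, and that the data frame `df` is made up for illustration.

```{r}
library(stringr)
library(tidyr)
library(tibble)

# Made-up data for illustration.
df <- tibble(id = 1:2, x = c("a,b", "c"))

# One step with tidyr.
via_separate <- separate_rows(df, x, sep = ",")

# Two steps: split into a list-column, then give each piece its own row.
df_split <- df
df_split$x <- str_split(df$x, ",")
via_split <- unnest(df_split, x)

via_separate
via_split
```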
## Replace with function
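A minimal sketch of the idea, assuming a version of stringr in which the `replacement` argument of `str_replace_all()` accepts a function; the input vector and the choice of `toupper()` are made up for illustration.

```{r}
library(stringr)

x <- c("one two", "three")

# Each match of "\\w+" is passed to toupper(), and the result replaces it.
shouted <- str_replace_all(x, "\\w+", toupper)
shouted
```

This is handy when the replacement depends on the matched text in a way that a backreference can't express.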
## Locations
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
These are particularly useful when none of the other functions does exactly what you want.
You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify the matches.
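The combination might look like the following sketch, which uses a made-up input vector and pattern:

```{r}
library(stringr)

x <- c("apple pie", "banana split")

# One row per string, with "start" and "end" columns.
loc <- str_locate(x, "an|pp")
loc

# Extract the matched pieces by position.
pieces <- str_sub(x, loc[, "start"], loc[, "end"])
pieces

# Modify the same positions in a copy of the original.
modified <- x
str_sub(modified, loc[, "start"], loc[, "end"]) <- "--"
modified
```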
## stringi
stringr is built on top of the **stringi** package.
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
stringi, on the other hand, is designed to be comprehensive.
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
The main difference is the prefix: `str_` vs. `stri_`.
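As a small illustration of how closely the two packages mirror each other, the sketch below calls a couple of stringr functions alongside their stringi counterparts on a made-up vector:

```{r}
library(stringr)

x <- c("apple", "banana")

# stringr and its stringi counterpart give the same answers here.
str_length(x)
stringi::stri_length(x)

str_detect(x, "an")
stringi::stri_detect_regex(x, "an")
```

stringi names often also spell out the matching engine (here `_regex`), which stringr selects through modifier functions such as `fixed()` instead.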
### Exercises
1. Find the stringi functions that:
    a. Count the number of words.
    b. Find duplicated strings.
    c. Generate random text.
2. How do you control the language that `stri_sort()` uses for sorting?
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) |>
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) |>
  separate(x, c("one", "two", "three"))
```
2. Both `unite()` and `separate()` have a `remove` argument.
What does it do?
Why would you set it to `FALSE`?
3. Compare and contrast `separate()` and `extract()`.
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of `unite()`?
```{r, eval = FALSE}
events <- tribble(
  ~month, ~day,
  1     , 20,
  1     , 21,
  1     , 22
)

events |>
  unite("date", month:day, sep = "-", remove = FALSE)
```
5. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
Think carefully about what it should do if given a vector of length 0, 1, or 2.