diff --git a/data-structures.Rmd b/data-structures.Rmd index 7b8c550ce..cfd4fc9fe 100644 --- a/data-structures.Rmd +++ b/data-structures.Rmd @@ -7,53 +7,82 @@ library(dplyr) As you start to write more functions, and as you want your functions to work with more types of inputs, it's useful to have some grounding in the underlying data structures that R is built on. This chapter will dive deeper into the objects that you've already used, helping you better understand how things work. -The most important class of objects in R is the __vector__. Every vector has two key properties: +The most important family of objects in R are __vectors__. Vectors are broken down into __atomic__ vectors, and __lists__. There are six types of atomic vector, but only four are in common use: logical, integer, double, and character. The chief difference between atomic vectors and lists is that atomic vectors are homogeneous (every element is the same type) and lists are heterogeneous (each element can be a different type). -1. Its type, whether it's logical, numeric, character, and so on. You - can determine the type of any R object with `typeof()`. +```{r, echo = FALSE} +knitr::include_graphics("diagrams/data-structures-overview.png") +``` + +The two key properties of a vector are its type, which you can determine with `typeof()`, and its length, which you can retrieve with `length()`. + +```{r} +typeof(letters) +typeof(1:10) +x <- list("a", "b", 1:10) +length(x) +``` -2. Its length, which you can retrieve with `length()`. +There are four common data types built on top of these foundations: -Vectors are broken down into __atomic__ vectors, and __lists__. I call factors, dates, and date times __augmented vectors__ because they're built on top of atomic vectors. Data frames are also augmented vectors as they built on top of lists. +* Factors and dates are built on top of integers. +* Date times (POSIXct) are built on top of doubles. +* Data frames and tibbles are built on top of lists. 
-Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That's why, for example, you can write `1:10 + 10:1`. +I call these __augmented vectors__ because each is a vector augmented with some special behaviour through R's S3 object oriented system. ## Atomic vectors -There are four important types of atomic vector: +There are four important types of atomic vector: logical, integer, double, and character. Collectively, integer and double vectors are known as __numeric vectors__, and most of the time the distinction is not important, so we'll discuss them together. There are two rarer types of atomic vectors: raw and complex. They're beyond the scope of this book because they are rarely needed to do data analysis. The following sections describe each type in turn. + +Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works: -* logical -* integer -* double -* character +```{r} +1:10 + 2:11 +``` +In R, basic mathematical operations work with vectors, not scalars like in most programming languages. -Collectively, integer and double vectors are known as numeric vectors. Most of the time the distinction between integers and doubles is not important in R, so we'll discuss them together. +There are four types of missing value, one for each type of atomic vector: -(There are also two rarer atomic vectors: raw and complex. 
They're beyond the scope of this book because they are rarely needed to do data analysis) +```{r} +NA # logical +NA_integer_ # integer +NA_real_ # double +NA_character_ # character +``` + +It is not usually necessary to know about these different types because in most cases `NA` is automatically converted to the type that you need. However, there are some functions that are strict about their inputs, and you'll need to give them a missing value of the correct type. ### Logical -Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons]. +Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons]. You can also create them by hand using `c()`: + +```{r} +c(TRUE, TRUE, FALSE, NA) +``` + +You can convert another type of atomic vector to logical using `as.logical()`. However, before doing so, you should carefully consider whether you can make the fix upstream, so that the vector never had the wrong type in the first place. -In numeric contexts, `TRUE` is converted to `1`, `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues. +One of the most useful properties of logical vectors is how they behave in numeric contexts: `TRUE` is converted to `1`, and `FALSE` is converted to `0`. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues. ```{r} x <- sample(20, 100, replace = TRUE) y <- x > 10 -sum(y) -mean(y) +sum(y) # how many are greater than 10? +mean(y) # what proportion are greater than 10? ``` ### Numeric -Numeric vectors encompasses both integers and doubles (real numbers). 
For large data, there is some small advantage to using the integer data type if you really have integers, but in most cases the differences are immaterial. In R, numbers are doubles by default. To make an integer, use a `L` after the number: +Numeric vectors include both integer vectors and double vectors (real numbers). In R, numbers are doubles by default. To make an integer, use an `L` after the number: ```{r} typeof(1) typeof(1L) ``` -There are two cases where you need to be aware of the differences between doubles and integers. Firstly, never test for equality on a double. There can be very small differences that don't print out by default. These differences arise because a double is represented using a fixed number of (binary) digits. For example, what should you get if you square the square-root of two? +There are two important differences between integers and doubles: doubles are approximations, and they have three extra special values. + +Never test for equality on a double. There can be very small differences that don't print out by default. These differences arise because a double is represented using a fixed number of (binary) digits. For example, what should you get if you square the square-root of two? ```{r} x <- sqrt(2) ^ 2 @@ -67,19 +96,19 @@ x == 2 x - 2 ``` -The number we've computed is actually slightly different to 2. To avoid this sort of comparison difficulty, you can use the `near()` function from dplyr (available in 0.5). +The number we've computed is actually slightly different to 2 because computers only store a finite number of digits after the decimal point. This means that most calculations include some approximation error: never compare a double to a fixed value using `==`. Instead, use the `near()` function from dplyr (available in 0.5), which includes some numerical tolerance. 
```{r, eval = packageVersion("dplyr") >= "0.4.3.9000"} dplyr::near(x, 2) ``` -The other important thing to know about doubles is that they have three special values in addition to `NA`: +Doubles also have three special values in addition to `NA`: ```{r} c(-1, 0, 1) / 0 ``` -Like with missing values, you should avoid using `==` to check for these other special values. Instead use `is.finite()`, `is.infinite()`, and `is.nan()`: +Avoid using `==` to check for these other special values. Instead use the helper functions `is.finite()`, `is.infinite()`, and `is.nan()`: | | 0 | Inf | NA | NaN | |------------------|-----|-----|-----|-----| @@ -92,16 +121,11 @@ Note that `is.finite(x)` is not the same as `!is.infinite(x)`. ### Character -Each element of a character vector is a string. +Character vectors are the most complex of atomic vectors, because each element of a character vector is a string, and a string can contain an arbitrary amount of data. Strings are such an important data type, they have their own chapter: [strings]. -```{r} -x <- c("abc", "def", "ghijklmnopqrs") -typeof(x) -``` - -You learned how to manipulate these vectors in [strings]. +Here I wanted to mention one important feature of the underlying string implementation: it uses a global string pool. This means that each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings. -R uses a global string pool. This reduces the amount of memory strings take up because +You can see this behaviour in practice by using `pryr::object_size()`: ```{r} x <- "This is a reasonably long string." @@ -113,117 +137,10 @@ pryr::object_size(y) `y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string. A pointer is 8 bytes, so 1000 pointers to a 136 B string is about 8.13 kB. 
-### Missing values - -There are four types of missing value, one for each type of atomic vector: - -```{r} -NA # logical -NA_integer_ # integer -NA_real_ # double -NA_character_ # character -``` - -It is not usually necessary to know about these different types because in most cases `NA` is automatically converted to the type that you need. However, there are some functions that are strict about their inputs, and you'll need to give them an missing value of the correct type. - -## Subsetting - - - -## Augmented vectors - -There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`. - -```{r} -x <- 1:10 -attr(x, "greeting") -attr(x, "greeting") <- "Hi!" -attr(x, "farewell") <- "Bye!" -attributes(x) -``` - -There are three very important attributes that are used to implement fundamental parts of R: - -* "names" are used to name the elements of a vector. -* "dims" make a vector behave like a matrix or array. -* "class" is used to implemenet the S3 object oriented system. - -Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like: - -```{r} -as.Date -``` - -The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. 
You can list all the methods for a generic with `methods()`: - -```{r} -methods("as.Date") -``` - -And you can see the specific implementation of a method with `getS3method()`: - -```{r} -getS3method("as.Date", "default") -getS3method("as.Date", "numeric") -``` - -The most important S3 generic is `print()`: it controls how the object is printed when you type its name on the console. Other important generics are the subsetting functions `[`, `[[`, and `$`. - -A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at . - -### Factors - -Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute: - -```{r} -x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef")) -typeof(x) -attributes(x) -``` - -Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow" - -The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. 
Otherwise, you can apply `as.character()` to the column to explicitly turn back into a factor. - -```{r} -x <- factor(letters[1:5]) -is.factor(x) -as.factor(letters[1:5]) -``` - -### Dates and date times - -Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970. - -```{r} -x <- as.Date("1971-01-01") -unclass(x) - -typeof(x) -attributes(x) -``` - -Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970: - -```{r} -x <- lubridate::ymd_hm("1970-01-01 01:00") -unclass(x) - -typeof(x) -attributes(x) -``` - -The `tzone` is optional, and only controls the way the date is printed not what it means. - -There is another type of datetimes called POSIXlt. These are built on top of named lists. +### Exercises -```{r} -y <- as.POSIXlt(x) -typeof(y) -attributes(y) -``` +1. Read the source code for `dplyr::near()`. How does it work? -If you use the packages outlined in this book, you should never encounter a POSIXlt. They do crop up in base R, because they are used extract specific components of a date (like the year or month). However, lubridate provides helpers for you to do this instead. Otherwise POSIXct's are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a POSIXct with `as.POSIXct()`. ## Recursive vectors (lists) @@ -357,71 +274,123 @@ knitr::include_graphics("images/pepper-3.jpg") 1. What happens if you subset a data frame as if you're subsetting a list? What are the key differences between a list and a data frame? -## Data frames -Data frames are augmented lists: they have class "data.frame", and `names` (column) and `row.names` attributes: +## Augmented vectors + +There are four important types of vector that are built on top of atomic vectors and lists: factors, dates, date times, and data frames. I call these augmented vectors, because they are vectors with additional __attributes__. 
Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`. ```{r} -df1 <- data.frame(x = 1:5, y = 5:1) -typeof(df1) -attributes(df1) +x <- 1:10 +attr(x, "greeting") +attr(x, "greeting") <- "Hi!" +attr(x, "farewell") <- "Bye!" +attributes(x) ``` -The difference between a data frame and a list is that all the elements of a data frame must be the same length. All functions that work with data frames enforce this constraint. +There are three very important attributes that are used to implement fundamental parts of R: -Generally, I recommend using `dplyr::data_frame()` instead of `data.frame`. It creates an object that "extends" the data frame. That means it has all the existing behaviour of a data frame: +* "names" are used to name the elements of a vector. +* "dim" makes a vector behave like a matrix or array. +* "class" is used to implement the S3 object oriented system. + +Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like: ```{r} -df2 <- dplyr::data_frame(x = 1:5, y = 5:1) -typeof(df2) -attributes(df2) +as.Date ``` -The additional `tbl_df` class makes the print method more informative (and only prints the first 10 rows, not the first 10,000), and makes the subsetting methods more strict: +The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. 
You can list all the methods for a generic with `methods()`: + +```{r} +methods("as.Date") +``` -```{r, error = TRUE} -df1 -df2 +And you can see the specific implementation of a method with `getS3method()`: -df1$z -df2$z +```{r} +getS3method("as.Date", "default") +getS3method("as.Date", "numeric") ``` -There are a few other ways in `data_frame()` behaves differently to `data.frame()` +The most important S3 generic is `print()`: it controls how the object is printed when you type its name on the console. Other important generics are the subsetting functions `[`, `[[`, and `$`. - * `data.frame()` does a number of transformations to its inputs. For example, - unless you `stringsAsFactors = FALSE` it always converts character vectors - to factors. `data_frame()` does not conversion: - - ```{r} - data.frame(x = letters) %>% sapply(class) - data_frame(x = letters) %>% sapply(class) - ``` - - * `data.frame()` automatically transforms names, `data_frame()` does not. - - ```{r} - data.frame(`crazy name` = 1) %>% names() - data_frame(`crazy name` = 1) %>% names() - ``` +A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at . - * In `data_frame()` you can refer to variables that you just created: - - ```{r} - data_frame(x = 1:5, y = x ^ 2) - ``` +### Factors - * It never uses row names. The whole point of tidy data is to store variables - in a consistent way. Row names are a variable stored in a unique way, - so I don't recommend using them. +Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute: - * It only recycles vectors of length 1. Recycling vectors of greater lengths - is a frequent source of silent mistakes. 
- - ```{r, error = TRUE} - data.frame(x = 1:2, y = 1:4) - data_frame(x = 1:2, y = 1:4) - ``` +```{r} +x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef")) +typeof(x) +attributes(x) +``` + +Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow". + +The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn it back into a character vector. + +```{r} +x <- factor(letters[1:5]) +is.factor(x) +as.factor(letters[1:5]) +``` + +### Dates and date times + +Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970. 
+ +```{r} +x <- as.Date("1971-01-01") +unclass(x) + +typeof(x) +attributes(x) +``` + +Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970: + +```{r} +x <- lubridate::ymd_hm("1970-01-01 01:00") +unclass(x) + +typeof(x) +attributes(x) +``` + +The `tzone` attribute is optional, and only controls the way the date time is printed, not what it means. + +There is another type of date time called POSIXlt. These are built on top of named lists. + +```{r} +y <- as.POSIXlt(x) +typeof(y) +attributes(y) +``` + +If you use the packages outlined in this book, you should never encounter a POSIXlt. They do crop up in base R, because they are used to extract specific components of a date (like the year or month). However, lubridate provides helpers for you to do this instead. Otherwise, POSIXct's are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a POSIXct with `as.POSIXct()`. + +### Data frames and tibbles + +Data frames are augmented lists: they have class "data.frame", and `names` (column) and `row.names` attributes: + +```{r} +df1 <- data.frame(x = 1:5, y = 5:1) +typeof(df1) +attributes(df1) +``` + +The difference between a data frame and a list is that all the elements of a data frame must be the same length. All functions that work with data frames enforce this constraint. + +In this book, we use tibbles, rather than data frames. Tibbles are identical to data frames, except that they have two additional components in the class: + +```{r} +df2 <- dplyr::data_frame(x = 1:5, y = 5:1) +typeof(df2) +attributes(df2) +``` + +These extra components give tibbles the helpful behaviours defined in [tibbles]. 
## Predicates diff --git a/diagrams/data-structures-overview.png b/diagrams/data-structures-overview.png new file mode 100644 index 000000000..77de9c498 Binary files /dev/null and b/diagrams/data-structures-overview.png differ diff --git a/diagrams/data-structures.graffle b/diagrams/data-structures.graffle new file mode 100644 index 000000000..6a2f14386 Binary files /dev/null and b/diagrams/data-structures.graffle differ diff --git a/functions.Rmd b/functions.Rmd index 686685d31..a07bb8ced 100644 --- a/functions.Rmd +++ b/functions.Rmd @@ -1,4 +1,4 @@ -```{r, include = FALSE} +```{r setup, include = FALSE} library(stringr) ``` diff --git a/import.Rmd b/import.Rmd index b8d839872..f29f90d7e 100644 --- a/import.Rmd +++ b/import.Rmd @@ -7,12 +7,11 @@ library(readr) ## Overview -You can't apply any of the tools you've applied so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to import: +You can't apply any of the tools you've applied so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to: -* Flat files (like csv) with readr. -* Database queries with DBI. -* Data from web APIs with httr. -* Binary file formats (like excel or sas), with haven and readxl. +* Import flat files (like csv) with readr. +* +* Cache intermediate results in a fast file format like feather or RDS. The common link between all these packages is they all aim to take your data and turn it into a data frame in R, so you can tidy it and then analyse it. @@ -245,10 +244,28 @@ The settings you are most like to need to change are: * Parse these example files. * Parse this fixed width file. -## Databases +## Other file formats -## Web APIs +* Excel: readxl +* SPSS: haven +* Stata: haven +* SAS: haven -## Binary files +Databases. All powered by the DBI package which provides a common interface. -Needs to discuss how data types in different languages are converted to R. Similarly for missing values. 
+* RPostgres +* RMySQL +* RSQLite +* Avoid JDBC un + +Hierarchical: + +* XML: xml2 +* JSON: jsonlite + + + +## Binary file formats + +Feather. +RDS. diff --git a/index.rmd b/index.rmd index 2b85b13f0..ff49a48c4 100644 --- a/index.rmd +++ b/index.rmd @@ -1,8 +1,12 @@ --- knit: "bookdown::render_book" title: "R for Data Science" -output: - - bookdown::gitbook +author: ["Garrett Grolemund", "Hadley Wickham"] +description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualizing, and exploring data." +url: 'http\://r4ds.had.co.nz/' +github-repo: hadley/r4ds +twitter-handle: hadley +cover-image: cover.png --- # Welcome diff --git a/variation.Rmd b/variation.Rmd index b8ae842fb..6e5d4e30b 100644 --- a/variation.Rmd +++ b/variation.Rmd @@ -26,7 +26,7 @@ Rectangular data provides a clear record of variation, but that doesn't mean it mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10)) -knitr::kable(mat, caption = "*The speed of light is* the *universal constant, but variation obscures its value, here demonstrated by Albert Michelson in 1879. 
Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = c("\\s", "\\s", "\\s", "\\s", "\\s", "\\s", "\\s", "\\s", "\\s", "\\s")) +knitr::kable(mat, caption = "*The speed of light is* the *universal constant, but variation obscures its value, here demonstrated by Albert Michelson in 1879. Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat))) ``` diff --git a/work.Rmd b/work.Rmd index 9e5796f75..9a539e30c 100644 --- a/work.Rmd +++ b/work.Rmd @@ -8,7 +8,7 @@ Throughout this book we work with "tibbles" instead of the traditional data fram library(tibble) ``` -## Creating tibbles +## Creating tibbles {#tibbles} The majority of the functions that you'll use in this book already produce tibbles. But if you're working with functions from other packages, you might need to coerce a regular data frame a tibble. You can do that with `as_data_frame()`: