Seevasant Indran
09 October, 2018
Table of contents
Packages required
Install by running
install.packages("packageName", dependencies = TRUE)
The fundamental processes to follow to understand the knowledge and insight a data provides are:
- Data manipulation
- Data visualization
- Statistical analysis/modeling
- Organization of results
80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Makes data compatible for processing such as mathematical functions, visualization, hence reveals information and insights.
According to Hadley:-
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. Different strategies to extract different variables are needed for untidy data. This slows analysis and invites errors. If you consider how many data analysis operations involve all of the values in a variable (every aggregation function), you can see how important it is to extract these values in a simple, standard way. Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired. *Source
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Notice the differences:-
- dimensions
- observations
- variables
gapminder2 dataset is in ending credits section
tbl_df
filter
select
mutate
group_by
summarise
arrange
rename
%>%
tbl_df
works similar to data.table
in that it prints sensibly. Depreceated, use as_tibble()
instead.
List of dplyr functions and the {base}R
functions they"re related to:
Base Function | dplyr Function(s) | Special Powers |
---|---|---|
subset |
filter & select |
filter rows & select columns |
transform |
mutate |
operate with columns not yet created |
split |
group_by |
splits without cutting |
lapply + do.call |
summarise |
apply and bind in a single bound |
order + with |
arrange |
"I only have to specify dataframe once?" |
%>%
works similiarly to the unix
pipe |
and the +
in ggplot2
.
> conclusion <- import(obeservation) %>%
results %>%
group_by(headache) %>%
discssion() %>%
plot() + common_plot_someting(aes(x = STAT545, y = hours_not_sleeping))
print(conclusion)
Basically previous input in chain supplied as argument 1 to function on right side.
- Return Rows With Matching Conditions
- Useful Filter Functions
==
,>
,>=
&
,|
,!
,xor()
is.na()
,!is.na()
between()
,%in%
- Useful Filter Functions
Ussage - filter(.data, ...)
- Use
filter()
find rows/cases where conditions are true. Unlike base subsetting with[
, rows where the condition evaluates toNA
are dropped.
- Add New Variables.
Ussage - mutate_(.data, ...)
- Mutate adds new variables and preserves existing; transmute drops existing variables.
- Reduces Multiple Values Down To A Single Value
- Useful Filter Functions
- Center:
mean()
,median()
- Spread:
sd()
,IQR()
,mad()
- Range:
min()
,max()
,quantile()
- Position:
first()
,last()
,nth()
- Count:
n()
,n_distinct()
- Logical:
any()
- Center:
- Useful Filter Functions
Ussage - summarise(.data, ...)
summarise()
is typically used on grouped data created bygroup_by()
. The output will have one row for each group.
- Modify Names By Name, Not Position.
Ussage - rename(x, replace, warn_missing = TRUE, warn_duplicated = TRUE)
- warn_missing = TRUE, print a message if any of the old names are not actually present in x. warn_duplicated - TRUE print a message if any name appears more than once in x after the operation.
options(readr.num_columns = 0)
# Import using the read_csv(), assign to `gapminder_school` variable. Lets call this school dataset
gapminder_school <- read.csv("https://query.data.world/s/bpbbjyj7t6k2u6owizb7tr4fm4h4fq", header = TRUE, check.names = FALSE)
#gapminder_mortality <- read_csv(file.path(getwd(), "Infant mortality rate per 1 000 births.csv"), col_names = TRUE) # Aditional data, not used
dim(gapminder_school) # dimension of the data, 175 countries and 41 years
## [1] 175 41
gapminder_school_df <- as.data.frame(gapminder_school) # change to dataframe for example
head(gapminder_school_df, n = 3) # displays (n = 3) the top dataset
## Row Labels 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## 1 Afghanistan 1.0 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5 1.6 1.6 1.7
## 2 Albania 6.5 6.7 6.9 7.0 7.2 7.3 7.5 7.7 7.8 8.0 8.1 8.3
## 3 Algeria 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.7 2.8 2.9 3.1 3.3
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 1 1.8 1.8 1.9 2.0 2.0 2.1 2.1 2.2 2.2 2.3 2.3 2.4 2.4 2.5
## 2 8.4 8.5 8.7 8.8 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## 3 3.4 3.6 3.8 3.9 4.1 4.3 4.5 4.7 4.8 5.0 5.2 5.3 5.5 5.6
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 1 2.6 2.6 2.7 2.7 2.8 2.9 2.9 3.0 3.0 3.1 3.1 3.2 3.3 3.3
## 2 10.0 10.1 10.2 10.3 10.4 10.4 10.5 10.6 10.7 10.7 10.8 10.9 10.9 11.0
## 3 5.8 5.9 6.1 6.2 6.3 6.5 6.6 6.7 6.8 7.0 7.1 7.2 7.3 7.3
This prints okay but the next one looks better, as explained above rownames are dropped, to preserve, convert to an explicit variable with rownames_to_column()
gapminder_school_tbl <- tbl_df(gapminder_school_df) # convert data into tibble diff
head(rownames_to_column(gapminder_school_tbl), n = 10)
## # A tibble: 10 x 42
## rowname `Row Labels` `1970` `1971` `1972` `1973` `1974` `1975` `1976`
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Afghanistan 1 1.1 1.1 1.2 1.3 1.3 1.4
## 2 2 Albania 6.5 6.7 6.9 7 7.2 7.3 7.5
## 3 3 Algeria 1.9 2 2.1 2.2 2.3 2.4 2.5
## 4 4 Angola 2.3 2.4 2.5 2.7 2.8 2.9 3
## 5 5 Antigua and… 8 8.2 8.4 8.6 8.7 8.9 9.1
## 6 6 Argentina 7 7.1 7.3 7.4 7.6 7.7 7.8
## 7 7 Armenia 7.8 7.9 8.1 8.2 8.4 8.5 8.7
## 8 8 Australia 9.9 10 10.2 10.3 10.4 10.5 10.6
## 9 9 Austria 9.3 9.4 9.5 9.6 9.8 9.9 10
## 10 10 Azerbaijan 7.6 7.8 8 8.2 8.4 8.6 8.8
## # ... with 33 more variables: `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## # `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## # `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## # `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>
Use the dplyr::rename
to rename the "Row Labels"" column to "country""
## The dplyr way, rename "Row Labels" to "country" in school dataset
gapminder_school <- gapminder_school %>%
rename(country = "Row Labels")
## The base R way is more complicated, rename column `Row Lables` to `country` AND rename the remaining column minus the first column as it was previously.
# names(gapminder_school) <- c("country", names(gapminder_school)[-1])
# or
# colnames(gapminder_school)[colnames(gapminder_school)=="Row Labels"] <- "country"
head(gapminder_school, n= 2)
## country 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## 1 Afghanistan 1.0 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5 1.6 1.6 1.7
## 2 Albania 6.5 6.7 6.9 7.0 7.2 7.3 7.5 7.7 7.8 8.0 8.1 8.3
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 1 1.8 1.8 1.9 2.0 2 2.1 2.1 2.2 2.2 2.3 2.3 2.4 2.4 2.5
## 2 8.4 8.5 8.7 8.8 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## 1 2.6 2.6 2.7 2.7 2.8 2.9 2.9 3.0 3.0 3.1 3.1 3.2 3.3 3.3
## 2 10.0 10.1 10.2 10.3 10.4 10.4 10.5 10.6 10.7 10.7 10.8 10.9 10.9 11.0
## School dataset has more countries (175) that the gapminder dataset, filter country from "gapminder_school" dataset with gapminder, to removed countries not in both dataset
gapminder_school_filtered <- gapminder_school %>%
filter(country %in% gapminder$country)
## How many country are filtered ?
(nrow(gapminder_school) - nrow(gapminder_school_filtered))
## [1] 42
## The gapminder dataset has year till 2007, however the school dataset has year till 2009. Use select to filter the dates till 2007
gapminder_school_filtered <- gapminder_school_filtered %>%
select("country",as.character(unique(gapminder$year)[-c(1:4)])) # selects the first column and the years present from the gapminder dataset.
head(gapminder_school_filtered)
## country 1972 1977 1982 1987 1992 1997 2002 2007
## 1 Afghanistan 1.1 1.4 1.8 2.1 2.3 2.6 2.9 3.2
## 2 Albania 6.9 7.7 8.4 9.1 9.6 10.1 10.5 10.9
## 3 Algeria 2.1 2.7 3.4 4.3 5.2 5.9 6.6 7.2
## 4 Angola 2.5 3.1 3.8 4.4 4.8 5.3 5.7 6.0
## 5 Argentina 7.3 8.0 8.6 9.1 9.6 10.1 10.6 11.0
## 6 Australia 10.2 10.7 11.1 11.4 11.7 11.9 12.1 12.3
Now we have a matching year but we have a problem, our dataset is not in tidy format. We have to fix that later its difficult to work with a messydataset. For example.
gapminder_school_filtered %>%
arrange (`2007`) %>% # arrange mean by lowest to highest
head (n = 5)
## country 1972 1977 1982 1987 1992 1997 2002 2007
## 1 Niger 0.5 0.6 0.8 1.1 1.4 1.8 2.2 2.5
## 2 Mali 0.9 1.1 1.4 1.7 1.9 2.2 2.4 2.6
## 3 Burkina Faso 0.7 0.9 1.2 1.5 1.8 2.1 2.5 2.8
## 4 Afghanistan 1.1 1.4 1.8 2.1 2.3 2.6 2.9 3.2
## 5 Somalia 1.2 1.5 1.8 2.2 2.5 2.8 3.0 3.2
Some live commetary.. Look at Niger, mean years in school for people aged between 25 - 34 is just 2.5 years!! in 2007.
## {base} R way to filter countries that are in both gapminder and gapminder school dataset and store into country
cntry <- unique(gapminder$country)[unique(gapminder$country) %in% # %in% same as match()
gapminder_school_filtered$country] # This returns a logical vector. It is then used to subsets the gapminder country dataset
## This subsets the gapminder dataset
gpmd_cont <- gapminder %>%
filter(country %in% cntry) %>%
subset(!duplicated(country)) %>%
select("country","continent")
gapminder_school_filtered <- gapminder_school_filtered %>%
mutate(continent = gpmd_cont$continent)
head(gapminder_school_filtered)
## country 1972 1977 1982 1987 1992 1997 2002 2007 continent
## 1 Afghanistan 1.1 1.4 1.8 2.1 2.3 2.6 2.9 3.2 Asia
## 2 Albania 6.9 7.7 8.4 9.1 9.6 10.1 10.5 10.9 Europe
## 3 Algeria 2.1 2.7 3.4 4.3 5.2 5.9 6.6 7.2 Africa
## 4 Angola 2.5 3.1 3.8 4.4 4.8 5.3 5.7 6.0 Africa
## 5 Argentina 7.3 8.0 8.6 9.1 9.6 10.1 10.6 11.0 Americas
## 6 Australia 10.2 10.7 11.1 11.4 11.7 11.9 12.1 12.3 Oceania
gapminder_school_filtered %>%
summarise("1972" = mean(gapminder_school_filtered$`1972`),
"1977" = mean(gapminder_school_filtered$`1977`),
"1982" = mean(gapminder_school_filtered$`1982`),
"1987" = mean(gapminder_school_filtered$`1987`),
"1992" = mean(gapminder_school_filtered$`1992`),
"1997" = mean(gapminder_school_filtered$`1997`),
"2002" = mean(gapminder_school_filtered$`2002`),
"2007" = mean(gapminder_school_filtered$`2007`),
)
## 1972 1977 1982 1987 1992 1997 2002 2007
## 1 5.02782 5.745113 6.435338 7.076692 7.630075 8.153383 8.638346 9.06015
gapminder_school_filtered %>%
group_by(continent) %>%
summarise("1972" = mean(gapminder_school_filtered$`1972`),
"1977" = mean(gapminder_school_filtered$`1977`),
"1982" = mean(gapminder_school_filtered$`1982`),
"1987" = mean(gapminder_school_filtered$`1987`),
"1992" = mean(gapminder_school_filtered$`1992`),
"1997" = mean(gapminder_school_filtered$`1997`),
"2002" = mean(gapminder_school_filtered$`2002`),
"2007" = mean(gapminder_school_filtered$`2007`),
)
## # A tibble: 5 x 9
## continent `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 2 Americas 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 3 Asia 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 4 Europe 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 5 Oceania 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
read.csv("https://query.data.world/s/bpbbjyj7t6k2u6owizb7tr4fm4h4fq",
header = TRUE, check.names = FALSE) %>%
rename(country = "Row Labels") %>%
filter(country %in% gapminder$country) %>%
select("country",as.character(unique(gapminder$year)[-c(1:4)])) %>%
mutate(continent = gpmd_cont$continent) %>%
group_by(continent) %>%
summarise("1972" = mean(gapminder_school_filtered$`1972`),
"1977" = mean(gapminder_school_filtered$`1977`),
"1982" = mean(gapminder_school_filtered$`1982`),
"1987" = mean(gapminder_school_filtered$`1987`),
"1992" = mean(gapminder_school_filtered$`1992`),
"1997" = mean(gapminder_school_filtered$`1997`),
"2002" = mean(gapminder_school_filtered$`2002`),
"2007" = mean(gapminder_school_filtered$`2007`),
)
## # A tibble: 5 x 9
## continent `1972` `1977` `1982` `1987` `1992` `1997` `2002` `2007`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 2 Americas 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 3 Asia 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 4 Europe 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
## 5 Oceania 5.03 5.75 6.44 7.08 7.63 8.15 8.64 9.06
#
# gapminder_school_filtered %>% ggplot(aes (x = continent , y = gapminder_school_filtered$`1972`) + geom_bar(position = "dodge", stat = "identity")
List of tidyr functions and the relationship to the reshape2 functions:
reshape2 Function | tidyr Function | Special Powers |
---|---|---|
melt |
gather |
long format* |
dcast |
spread |
wide format* |
- Gather Columns Into Key-Value Pairs
- Useful Filter Functions
- data expression like
x
or an expression likex:y
orc(x, y)
. In a data expression, you can only refer to columns from the data frame. Everything else is a context expression in which you can only refer to objects that you have defined with<-
. -col1:col3
is a data expression that refers to data columns, whileseq(start, end)
is a context expression that refers to objects from the contexts. c(x, !! x)
selects the x column within the data frame and the column referred to by the object x defined in the context (which can contain either a column name as string or a column position)
- data expression like
- Useful Filter Functions
Ussage - gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
key, value
, names of new key and value of "key" columns, as strings or symbols....
, A selection of columns. If empty, all variables are selected, ect.. between x and z with x:z, exclude y with -y.na.rm
, If TRUE, remove rows from output where the value column isNA
-convert
, If TRUE, automatically runs type.convert() on the key column. This is useful if the column names are actually numeric, integer, or logical.factor_key
, If FALSE, default, the key values will be stored as a character vector. If TRUE, will stored as a factor, which preserves the original ordering of the columns.
- Spread A Key-Value Pair Across Multiple Columns
- Useful Filter Functions
- same with gather (see above) Ussage -
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
- same with gather (see above) Ussage -
- Useful Filter Functions
fill
, If set, missing values will be replaced with that value. Note 2 types of missingness in the input: explicit missing values (NA
), and implicit missings, rows that simply aren"t present. Both types will be replaced by fill.convert
, If TRUE, type.convert() with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.drop
, If FALSE, will keep factor levels that don"t appear in the data, filling in missing combinations with fill.sep
, If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by "<key_name><key_value>".
# Look at the messydata
head(gapminder_school_filtered)
## country 1972 1977 1982 1987 1992 1997 2002 2007 continent
## 1 Afghanistan 1.1 1.4 1.8 2.1 2.3 2.6 2.9 3.2 Asia
## 2 Albania 6.9 7.7 8.4 9.1 9.6 10.1 10.5 10.9 Europe
## 3 Algeria 2.1 2.7 3.4 4.3 5.2 5.9 6.6 7.2 Africa
## 4 Angola 2.5 3.1 3.8 4.4 4.8 5.3 5.7 6.0 Africa
## 5 Argentina 7.3 8.0 8.6 9.1 9.6 10.1 10.6 11.0 Americas
## 6 Australia 10.2 10.7 11.1 11.4 11.7 11.9 12.1 12.3 Oceania
gapminder_tidyschool <- gapminder_school_filtered %>%
gather(year, meanSchool, -c("country", "continent"))
# Some sanity check
if (nrow(gapminder_school_filtered) * 8 == nrow(gapminder_tidyschool)) { # there are 8 years
head(gapminder_tidyschool)
} else {
stop("n() rows dont match table")
}
## country continent year meanSchool
## 1 Afghanistan Asia 1972 1.1
## 2 Albania Europe 1972 6.9
## 3 Algeria Africa 1972 2.1
## 4 Angola Africa 1972 2.5
## 5 Argentina Americas 1972 7.3
## 6 Australia Oceania 1972 10.2
# Convert year into interger
gapminder_tidyschool$year <- as.integer(gapminder_tidyschool$year)
We have a tidy dataset of gapminder mean years in school datase.
gapminder_school_filtered %>%
gather(year, meanSchool, -c("country", "continent")) %>%
spread(year, meanSchool) %>%
head()
## country continent 1972 1977 1982 1987 1992 1997 2002 2007
## 1 Afghanistan Asia 1.1 1.4 1.8 2.1 2.3 2.6 2.9 3.2
## 2 Albania Europe 6.9 7.7 8.4 9.1 9.6 10.1 10.5 10.9
## 3 Algeria Africa 2.1 2.7 3.4 4.3 5.2 5.9 6.6 7.2
## 4 Angola Africa 2.5 3.1 3.8 4.4 4.8 5.3 5.7 6.0
## 5 Argentina Americas 7.3 8.0 8.6 9.1 9.6 10.1 10.6 11.0
## 6 Australia Oceania 10.2 10.7 11.1 11.4 11.7 11.9 12.1 12.3
inner_join()
semi_join()
left_join()
anti_join()
right_join()
full_join()
union
intersect
setdiff
inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.
gapminder2 <- inner_join(gapminder, gapminder_tidyschool)
head(gapminder2)
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap meanSchool
## <chr> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1972 36.1 13079460 740. 1.1
## 2 Afghanistan Asia 1977 38.4 14880372 786. 1.4
## 3 Afghanistan Asia 1982 39.9 12881816 978. 1.8
## 4 Afghanistan Asia 1987 40.8 13867957 852. 2.1
## 5 Afghanistan Asia 1992 41.7 16317921 649. 2.3
## 6 Afghanistan Asia 1997 41.8 22227415 635. 2.6
Match and join gapminder
dataset which has 142 country into gapminder_tidyschool
which has 133 country which should only match 1064, which looks correct.
semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.
semi_join(gapminder, gapminder_tidyschool)
## # A tibble: 1,064 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1972 36.1 13079460 740.
## 2 Afghanistan Asia 1977 38.4 14880372 786.
## 3 Afghanistan Asia 1982 39.9 12881816 978.
## 4 Afghanistan Asia 1987 40.8 13867957 852.
## 5 Afghanistan Asia 1992 41.7 16317921 649.
## 6 Afghanistan Asia 1997 41.8 22227415 635.
## 7 Afghanistan Asia 2002 42.1 25268405 727.
## 8 Afghanistan Asia 2007 43.8 31889923 975.
## 9 Albania Europe 1972 67.7 2263554 3313.
## 10 Albania Europe 1977 68.9 2509048 3533.
## # ... with 1,054 more rows
Notice there is no meanSchool column, this returns all matches of x and y whist retaining the same column of original x. Like filter().
left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.
left_join(gapminder, gapminder_tidyschool)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap meanSchool
## <chr> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. NA
## 2 Afghanistan Asia 1957 30.3 9240934 821. NA
## 3 Afghanistan Asia 1962 32.0 10267083 853. NA
## 4 Afghanistan Asia 1967 34.0 11537966 836. NA
## 5 Afghanistan Asia 1972 36.1 13079460 740. 1.1
## 6 Afghanistan Asia 1977 38.4 14880372 786. 1.4
## 7 Afghanistan Asia 1982 39.9 12881816 978. 1.8
## 8 Afghanistan Asia 1987 40.8 13867957 852. 2.1
## 9 Afghanistan Asia 1992 41.7 16317921 649. 2.3
## 10 Afghanistan Asia 1997 41.8 22227415 635. 2.6
## # ... with 1,694 more rows
anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.
anti_join(gapminder, gapminder_tidyschool)
## # A tibble: 640 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Albania Europe 1952 55.2 1282697 1601.
## 6 Albania Europe 1957 59.3 1476505 1942.
## 7 Albania Europe 1962 64.8 1728137 2313.
## 8 Albania Europe 1967 66.2 1984060 2760.
## 9 Algeria Africa 1952 43.1 9279525 2449.
## 10 Algeria Africa 1957 45.7 10270856 3014.
## # ... with 630 more rows
right_join(x, y): Return all rows from y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. This is a mutating join.
right_join(gapminder, gapminder_tidyschool)
## # A tibble: 1,064 x 7
## country continent year lifeExp pop gdpPercap meanSchool
## <chr> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1972 36.1 13079460 740. 1.1
## 2 Albania Europe 1972 67.7 2263554 3313. 6.9
## 3 Algeria Africa 1972 54.5 14760787 4183. 2.1
## 4 Angola Africa 1972 37.9 5894858 5473. 2.5
## 5 Argentina Americas 1972 67.1 24779799 9443. 7.3
## 6 Australia Oceania 1972 71.9 13177000 16789. 10.2
## 7 Austria Europe 1972 70.6 7544201 16662. 9.5
## 8 Bahrain Asia 1972 63.3 230800 18269. 4.5
## 9 Bangladesh Asia 1972 45.3 70759295 630. 2.6
## 10 Belgium Europe 1972 71.4 9709100 16672. 8.9
## # ... with 1,054 more rows
full_join(x, y): Return all rows from x and y, and all columns from x and y.
full_join(gapminder, gapminder_tidyschool)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap meanSchool
## <chr> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. NA
## 2 Afghanistan Asia 1957 30.3 9240934 821. NA
## 3 Afghanistan Asia 1962 32.0 10267083 853. NA
## 4 Afghanistan Asia 1967 34.0 11537966 836. NA
## 5 Afghanistan Asia 1972 36.1 13079460 740. 1.1
## 6 Afghanistan Asia 1977 38.4 14880372 786. 1.4
## 7 Afghanistan Asia 1982 39.9 12881816 978. 1.8
## 8 Afghanistan Asia 1987 40.8 13867957 852. 2.1
## 9 Afghanistan Asia 1992 41.7 16317921 649. 2.3
## 10 Afghanistan Asia 1997 41.8 22227415 635. 2.6
## # ... with 1,694 more rows
rows that appear in both
x
andy
# print all rows gapminder$country vs gapminder_school$country
union(gapminder$country, gapminder_school$country) %>%
tbl_df() # not required, when used, prints nice
## # A tibble: 184 x 1
## value
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Angola
## 5 Argentina
## 6 Australia
## 7 Austria
## 8 Bahrain
## 9 Bangladesh
## 10 Belgium
## # ... with 174 more rows
rows that appear in both
x
andy
# rows gapminder$country vs gapminder_school$country, prints difference
intersect(gapminder$country, gapminder_school$country) %>%
tbl_df() # not required, when used, prints nice
## # A tibble: 133 x 1
## value
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Angola
## 5 Argentina
## 6 Australia
## 7 Austria
## 8 Bahrain
## 9 Bangladesh
## 10 Belgium
## # ... with 123 more rows
intersect(gapminder$country, gapminder_school$country) %>%
tbl_df() # not required, when used, prints nice
## # A tibble: 133 x 1
## value
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Angola
## 5 Argentina
## 6 Australia
## 7 Austria
## 8 Bahrain
## 9 Bangladesh
## 10 Belgium
## # ... with 123 more rows
rows that appear in
x
but noty
# rows gapminder$country vs gapminder_school$country, prints differences
setdiff(gapminder$country, gapminder_school$country) %>%
tbl_df() # not required, when used, prints nice
## # A tibble: 9 x 1
## value
## <chr>
## 1 Central African Republic
## 2 Czech Republic
## 3 Dominican Republic
## 4 Hong Kong, China
## 5 Iceland
## 6 Korea, Dem. Rep.
## 7 Libya
## 8 Puerto Rico
## 9 Reunion
# gapminder_school$country vs gapminder$country diffences
setdiff(gapminder_school$country, gapminder$country) %>%
tbl_df()
## # A tibble: 42 x 1
## value
## <chr>
## 1 Antigua and Barbuda
## 2 Armenia
## 3 Azerbaijan
## 4 Bahamas
## 5 Belarus
## 6 Belize
## 7 Cape Verde
## 8 Central African Rep.
## 9 Cyprus
## 10 Czech Rep.
## # ... with 32 more rows
# Use gather and create a column year and meanschool and use columb 2 to 9 as key
gapminder_school_filtered %>%
gather(key = year, value = meanSchool, 2:9) %>%
arrange(country) %>%
head()
## country continent year meanSchool
## 1 Afghanistan Asia 1972 1.1
## 2 Afghanistan Asia 1977 1.4
## 3 Afghanistan Asia 1982 1.8
## 4 Afghanistan Asia 1987 2.1
## 5 Afghanistan Asia 1992 2.3
## 6 Afghanistan Asia 1997 2.6
gapminder_school_filtered %>%
gather(key = year, value = meanSchool,
names(gapminder_school_filtered)[2:9]) %>%
head()
## country continent year meanSchool
## 1 Afghanistan Asia 1972 1.1
## 2 Albania Europe 1972 6.9
## 3 Algeria Africa 1972 2.1
## 4 Angola Africa 1972 2.5
## 5 Argentina Americas 1972 7.3
## 6 Australia Oceania 1972 10.2
gapminder_school_filtered %>%
gather(year, meanSchool, -c("country", "continent")) %>%
spread(continent, meanSchool) %>%
head()
## country year Africa Americas Asia Europe Oceania
## 1 Afghanistan 1972 NA NA 1.1 NA NA
## 2 Afghanistan 1977 NA NA 1.4 NA NA
## 3 Afghanistan 1982 NA NA 1.8 NA NA
## 4 Afghanistan 1987 NA NA 2.1 NA NA
## 5 Afghanistan 1992 NA NA 2.3 NA NA
## 6 Afghanistan 1997 NA NA 2.6 NA NA
inner_join(gapminder_tidyschool, gapminder) %>%
head()
## country continent year meanSchool lifeExp pop gdpPercap
## 1 Afghanistan Asia 1972 1.1 36.088 13079460 739.9811
## 2 Albania Europe 1972 6.9 67.690 2263554 3313.4222
## 3 Algeria Africa 1972 2.1 54.518 14760787 4182.6638
## 4 Angola Africa 1972 2.5 37.928 5894858 5473.2880
## 5 Argentina Americas 1972 7.3 67.065 24779799 9443.0385
## 6 Australia Oceania 1972 10.2 71.930 13177000 16788.6295
Although, it looks similiar to the inner_join()
from above but this one does not have the meanSchool
column as it is not present in the gapminder dataset.
semi_join(gapminder_tidyschool, gapminder) %>%
head()
## country continent year meanSchool
## 1 Afghanistan Asia 1972 1.1
## 2 Albania Europe 1972 6.9
## 3 Algeria Africa 1972 2.1
## 4 Angola Africa 1972 2.5
## 5 Argentina Americas 1972 7.3
## 6 Australia Oceania 1972 10.2
Only retains column from the gapminder_tidyschool
and all matching row in gapminder
.
left_join(gapminder_tidyschool, gapminder) %>%
head()
## country continent year meanSchool lifeExp pop gdpPercap
## 1 Afghanistan Asia 1972 1.1 36.088 13079460 739.9811
## 2 Albania Europe 1972 6.9 67.690 2263554 3313.4222
## 3 Algeria Africa 1972 2.1 54.518 14760787 4182.6638
## 4 Angola Africa 1972 2.5 37.928 5894858 5473.2880
## 5 Argentina Americas 1972 7.3 67.065 24779799 9443.0385
## 6 Australia Oceania 1972 10.2 71.930 13177000 16788.6295
In contrast to the left_join()
doest have the meanScool
column and it does not contain all the rows from the gapminder
dataset.
anti_join(gapminder_tidyschool, gapminder) %>%
head()
## [1] country continent year meanSchool
## <0 rows> (or 0-length row.names)
No rows indicate all of the rows ingapminder_tidyschool
matches all the rows in the gapminder dataset.
gapminder %>% tbl_df()
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
gapminder_school %>% tbl_df()
## # A tibble: 175 x 41
## country `1970` `1971` `1972` `1973` `1974` `1975` `1976` `1977` `1978`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… 1 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5
## 2 Albania 6.5 6.7 6.9 7 7.2 7.3 7.5 7.7 7.8
## 3 Algeria 1.9 2 2.1 2.2 2.3 2.4 2.5 2.7 2.8
## 4 Angola 2.3 2.4 2.5 2.7 2.8 2.9 3 3.1 3.2
## 5 Antigu… 8 8.2 8.4 8.6 8.7 8.9 9.1 9.3 9.5
## 6 Argent… 7 7.1 7.3 7.4 7.6 7.7 7.8 8 8.1
## 7 Armenia 7.8 7.9 8.1 8.2 8.4 8.5 8.7 8.8 9
## 8 Austra… 9.9 10 10.2 10.3 10.4 10.5 10.6 10.7 10.8
## 9 Austria 9.3 9.4 9.5 9.6 9.8 9.9 10 10.1 10.2
## 10 Azerba… 7.6 7.8 8 8.2 8.4 8.6 8.8 8.9 9.1
## # ... with 165 more rows, and 31 more variables: `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## # `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## # `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## # `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>
gapminder2 %>% tbl_df()
## # A tibble: 1,064 x 7
## country continent year lifeExp pop gdpPercap meanSchool
## <chr> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1972 36.1 13079460 740. 1.1
## 2 Afghanistan Asia 1977 38.4 14880372 786. 1.4
## 3 Afghanistan Asia 1982 39.9 12881816 978. 1.8
## 4 Afghanistan Asia 1987 40.8 13867957 852. 2.1
## 5 Afghanistan Asia 1992 41.7 16317921 649. 2.3
## 6 Afghanistan Asia 1997 41.8 22227415 635. 2.6
## 7 Afghanistan Asia 2002 42.1 25268405 727. 2.9
## 8 Afghanistan Asia 2007 43.8 31889923 975. 3.2
## 9 Albania Europe 1972 67.7 2263554 3313. 6.9
## 10 Albania Europe 1977 68.9 2509048 3533. 7.7
## # ... with 1,054 more rows