Equivalent of SQL CASE #631

edwindj · 2014-09-26T04:49:26Z

In data cleaning data often contains values that one would like to impute/replace.

For example:

In a survey for smokers, one would not ask a person his or her cigarette consumption when he/she has indicated not to be a smoker. So technically the number of cigarretes is not measured, but it can be deduced to be zero.

smoke <- read.csv(textConnection(
"name,is_smoker,cigarrettes,cigares
 John,TRUE,15, 0
 Adam,FALSE,NA,NA
 Phil,TRUE,0,1
"
))

#standard R way
smoke$cigarrettes[!smoke$is_smoker & is.na(smoke$cigarrettes)] <- 0

How does one do this elegantly in dplyr?

My 2 cents:

#dplyr option 1: split the data set, fix one part and combine the two
smoker_true <- smoke %>% filter(!is_smoker, is.na(smoke$cigarettes))
# complement
smoker_false <- smoke %>% filter(is_smoker | !is.na(smoke$cigarettes))

smoke2 <- rbind( smoker_true %>% mutate(cigarrettes=0)
               , smoker_false
)

which I don't find elegant dplyr code.

So try a little harder

# dplyr option 2: replace the whole variable and do the filtering in the assignment
smoke2 <- 
  smoke %>%
  mutate(cigarrettes=ifelse(!is_smoker & is.na(cigarrettes), 0, cigarrettes))

It works, but this is not efficient/workable for the database backends and I find it unelegant to
do the filtering in the assignment.

I suggest a new method:
mutate_where that allows for changing values of existing variables of a subset of a tbl.
So something like

smoke2 <-
   smoke %>% mutate_where(.where=!is_smoker&is.na(cigarrettes), cigarettes=0)

Or another option (in analogy with group and grouped_<tbl>) to change filter or introduce filtered_<tbl>/add_filter: setting a filter lets one work on a subset of the data, but one can remove_filter.

smoke2 <-
   smoke %>% 
   add_filter(is_smoker, is.na(cigarrettes) %>%
   mutate(cigarrettes=0) %>%
   remove_filter()

Any better ideas or suggestions?

The text was updated successfully, but these errors were encountered:

hadley · 2014-09-26T11:20:17Z

I really prefer the vector based solution, but it would be much simpler if ifelse had a branch for NA:

smoke2 <- 
  smoke %>%
  mutate(cigarrettes = ifelse(!is_smoker, 0, cigarrettes, 0)
)

The problem is that I don't see how any other approach could work with databases - you can not assume that you have write access to the underlying table.

edwindj · 2014-09-26T13:41:43Z

Agreed :-)

While I think it is possible to implement it in DBs using temporary tables and multiple SQL statements, I think it is wise not to do so: it will complicate the translate_sql code and furthermore it is rather DB specific how TEMP tables work.

hadley · 2014-09-26T13:43:50Z

I'd like to approach this problem by figuring out what the 90% of most common in-place mutates do, and then figuring out how to make elegant vectorised replacement functions.

hadley · 2014-10-01T12:17:33Z

I'm thinking of adapting something like derivedFactor() from mosaic: https://github.com/rpruim/mosaic/blob/master/R/derivedFactor.R - not quite sure how it should look for this case yet.

@rpruim any thoughts?

rpruim · 2014-10-02T00:45:34Z

I've been thinking about this today and have an idea to float. We could use formulas of the form

expression ~ condition     # use expression when condition is TRUE
expression ~ .             # use expression when no other conditions are TRUE (i.e., default)

to specify the desired variable/vector. This is modeled on mathematical definitions by cases, so below I'll call the function cases(). The smoking example would turn into

smoke %>% mutate( 
  cigs2 = cases( 
    0 ~ ! is_smoker & is.na(cigarrettes), 
    cigarettes ~ .)
  )

And fairly complicated definitions by cases could be constructed similarly.

As with derivedFactor(), we could have options to take the first true condition, the last true condition, or to require that exactly one condition be true. This could subsume derivedFactor if there were an argument that declared the result to be a factor and used expression as labels.

Using formulas makes it easy to have arguments with natural names (rather than prepending with . or .. or _ to avoid colliding with names in a data frame). The formulas could all be part of ... and named arguments could control the overall behavior.

cases <- function( ..., match=c("unique", "first", "last"), factor=FALSE, ordered=FALSE) {
}

What do you think?

kevinykuo · 2014-10-02T01:28:13Z

Haven't given any thought on the viability for implementation, but here's an idea, inspired by @rpruim and pattern matching:

smoke %>% mutate( 
  cigs2 = cigarettes %>% match_with(
     NA ~ 0,
     . ~ cigarettes)
  )

rpruim · 2014-10-02T01:52:18Z

Hmm. Note sure you really want this. Rather, I think its use case would be too restricted.

In your case, you only work with one variable, but in general both the conditions and the expressions might refer to multiple variables. Remember that %>% is just an abbreviation, so you are really proposing

cigs 2 = match_with(cigarettes, NA ~ 0, . ~ cigarettes)

What I'm proposing would work on vectors (which could be evaluated in a data frame or elsewhere) and would make it simple to create either expressions or conditions from multiple vectors. mutate() would take care of evaluating them in the object that is its first argument.

I also think I prefer my original order for the formulas. The expression is more like a response and the condition more like an explanatory variable, so it seems to fit the model of other uses of formulas better.

hadley · 2014-10-02T12:55:18Z

I think I'd prefer having the result on the RHS:

cases( 
    !is_smoker & is.na(cigarettes) ~ 0,
    TRUE ~ cigarettes)
  )

Using TRUE as a catch all for the default seems reasonable. Or you could use a one-sided formula: ~ cigarettes). I like the use of the formula to do explicit quoting. It also means that you'll be able to program withcases()` because it wouldn't need to use NSE.

Taking an example from `derivedFactor():

cases(
  i1 == 0 ~ "abstinent",
  (i1>0 & i1<=1 & i2<=3 & sex=='female') | (i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate"
  ((i1>1 | i2>3) & sex=='female') | ((i1>2 | i2>4) & sex=='male') = "highrisk"
)

# Or by moving our the "or"s
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  (i1>1 | i2>3) & sex=='female') ~ "highrisk"
  (i1>2 | i2>4) & sex=='male') ~ "highrisk"
)

# You could also nest cases to make the male/female distinction clearer
cases(
  i1 == 0 ~ "abstinent",
  sex == 'female' ~ cases(
    i1>0 & i1<=1 & i2<=3 ~ "moderate",
    i1>1 | i2>3  ~ "highrisk"
  ),
  sex == 'male' ~ cases(
    i1>0 & i1<=2 & i2<=4 ~ "moderate",
    i1>2 | i2>4 ~ "highrisk"
  )
)

# Or with a default
cases(
  i1 == 0 ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
  TRUE ~ "highrisk"
)

matthieugomez · 2014-10-02T14:42:33Z

Creating a new variable (as in hadley's last post) is conceptually quite different from modifying some value of a variable (as in the OP's post) - for instance, default behaviors for what happens with non matched rows should differ (resp NA vs original value). There could two different verbs instead of cases depending on the situation, similarly to Stata recode vs replace

hadley · 2014-10-02T14:44:57Z

@matthieugomez yes, agreed. In SQL there are two forms of CASE: one works on the values of a single variable, the other takes a list of conditionals (analogous to the case function proposed here).

But the challenge is NA: the usual semantics for equality won't work here. (And there's a general problem for floating point values).

rpruim · 2014-10-02T14:57:10Z

One advantage of the expression ~ condition syntax (besides mirroring the way cases are presented notationally in books) is that often the expressions are simpler and more similar to each other, so the code looks more elegant. Also, knowing what the expression is often makes it clearer why the condition is defined the way it is. But the expressions get a bit buried in the condition ~ expression order.

cases(
  "abstitnent" ~ i1 == 0,
  "moderate"   ~ i1>0 & i1<=1 & i2<=3 & sex=='female'),
  "moderate    ~ i1>0 & i1<=2 & i2<=4 & sex=='male'),
  "highrisk"   ~ (i1>1 | i2>3) & sex=='female'),
  "highrisk"   ~ (i1>2 | i2>4) & sex=='male')
)

Whichever way the formulas are used, I like the nesting idea.

hadley · 2014-10-02T15:10:30Z

@rpruim That is true, but I think it makes nesting less natural, and you can achieve the nice alignment with lots of whitespace

cases(
  i1 == 0                              ~ "abstinent",
  i1>0 & i1<=1 & i2<=3 & sex=='female' ~ "moderate",
  i1>0 & i1<=2 & i2<=4 & sex=='male'   ~ "moderate",
  (i1>1 | i2>3) & sex=='female'        ~ "highrisk"
  (i1>2 | i2>4) & sex=='male'          ~ "highrisk"
)

@matthieugomez I don't see the advantage of that over the filter based approach. Sure it save a little memory, but at the expense of making the API non-orthogonal. The goal of dplyr isn't to squeeze the last bit of performance out of R, it's to have provide a elegant and efficient API for data manipulation.

rpruim · 2014-10-02T15:17:00Z

Hmm... The nesting argument is probably reason enough to prefer condition ~ expression because in that case, the condition becomes simpler and the expression becomes more involved.

hadley · 2014-10-02T15:21:05Z

To make the semantics clear, I think each vector referenced in cases() would need to be either a scalar, or of the same length. Doing recycling in this scenario is going to be quite complex.

I'm also reasonably certain this will be straightforward to translate to SQL.

hadley · 2014-10-22T21:21:19Z

Here's a start on an implementation: https://gist.github.com/hadley/2751ba61d1c7f4eaacab

Currently it requires you pass the data frame as the first argument, but I think I can work around that. It's not very efficient, but it's a starting point.

dholstius · 2015-08-05T17:24:23Z

For what it's worth: here's a patch verb that I'm starting to use in my own work:

https://gist.github.com/holstius/cbc4ec40057fbc2d9f4b

In my workflows, I like to separate "patching" from "recoding" or "labeling". Patching is the real-world consequence of deadlines busy upstream providers — it's not supposed to be there, but it is.

hadley · 2016-03-14T16:24:32Z

With the new if_else and coalesce functions, you could now write the original motivating example as

mutate(smoke, cigarettes = if_else(coalesce(is_smoker, FALSE), cigarettes, 0L))

And note that this discussions is really about a sql CASE equivalent for R, not a vectorised switch. The discussion for vectorised switch is now taking place in #1710.

I've also deleted a number of comments in this thread that are no longer pertinent.

hadley · 2016-03-14T19:31:21Z

And now you can do:

mutate(smoke, cigarettes = if_else(is_smoker, cigarettes, 0L, 0L))

hadley · 2016-03-14T19:53:31Z

I'm pretty happy with this implementation, but I'd love to get everyone's feedback on the name and how it works.

lionel- · 2016-03-14T20:54:12Z

These new functions are awesome!

case_when()

It could be named match_cases() or match_patterns().

I find the tilde a bit hard to parse visually with a long list of cases. Maybe allow the := operator? The only thing to do would be to export :=` <- `~.

match_cases(
  x %% 35 == 0 := "fizz buzz",
  x %% 5 == 0 := "fizz",
  x %% 7 == 0 := "buzz",
  TRUE := as.character(x)
)

recode()

Would it be too unsafe to coerce the default value automatically?

recode(x, 5, 4, 3, default = x)
#> Error: `default` has type 'integer' not 'double'

recode(x, "a", "b", "c", default = x)
#> Error: `default` has type 'integer' not 'character'

recode(x, "a", "b", "c", default = NA)
#> Error: `default` has type 'logical' not 'character'

against the more verbose:

recode(x, 5, 4, 3, default = as.double(x))
recode(x, "a", "b", "c", default = as.character(x))
recode(x, "a", "b", "c", default = NA_character_)

hadley · 2016-03-14T22:26:16Z

I think I prefer ~ for now - you can always add more spaces around it to make it more obvious.

I'm also in an explicit mood at the moment, so I'd prefer to require the default to be correctly typed. Also the logic starts to get a bit hairy if there are no arguments (just the default).

lionel- · 2016-03-14T23:04:26Z

I think recode() will be very hard to use with numeric data. It would be great to be able to do:

recode(x, `1` = 5, `2` = 4)

This would ideally work with both integers and integer-like doubles (as often integer data is stored in a double).

gergness · 2016-03-14T23:37:39Z

Agreed, looks great!

What about using the formula notation from case_when for recode? This would help with lionel's point about numeric data, and also could prevent name conflicts for x, default and missing.

x <- c("x", "y", "z")
recode(x, x = "a", y = "b", z = "c")
#>Error: All replacements must be named

@lionel-

@lionel- if this is going to work for doubles, it also needs a switch from == to near(). But that seems a bit perilous to me

#631

kevinushey · 2016-03-15T17:11:18Z

I also don't like using named arguments for recode() as the names are forced to be / become symbols, whereas ~ or := preserve the LHS type. And recode(x, 1 = 2) is pretty ugly.

hadley · 2016-03-15T18:53:15Z

@kevinushey yeah, but I don't see that being a particularly common use case. And I think using ~ means that we'll need to do some work to make case_when() work with the hybrid evaluator.

krlmlr · 2016-03-16T09:23:37Z

In 4920c8b, I need to use .$ to access columns when using case_when() in mutate() -- is this intended?

> mtcars %>% tbl_df %>%  mutate(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(cyl > 6 ~ "big", TRUE ~ "small"))
Error: object 'cyl' not found
> mtcars %>% tbl_df %>%  transmute(size = case_when(.$cyl > 6 ~ "big", TRUE ~ "small"))
Source: local data frame [32 x 1]

    size
   (chr)
1  small
2  small
3  small
4  small
5    big
6  small
7    big
8  small
9  small
10 small
..   ...

(Is this meant by "some work with the hybrid evaluator"?)

krlmlr · 2016-03-16T10:32:44Z

Re formula notation for recode(): It's also a question of API consistency.

hadley · 2016-03-16T12:57:26Z

@krlmlr see #1719

I think the point of if_else, recode and case_when is not to have exactly equivalent interfaces but to solve a series of increasingly more complicated/general situations that each require progressively more syntax.

krlmlr · 2016-03-16T13:49:56Z

I agree to "more syntax", but to me this doesn't necessary imply that each syntax should be different.

A different implementation of case_when(), that is also consistent with if_else() and perhaps doesn't suffer from #1719, would be

case_when <- function(condition, result, ...) {
  if (missing(condition)) NULL
  else
    if_else(condition, result, case_when(...))
}

(replace the if_else() call with its body).

sfirke · 2016-04-20T01:02:04Z

re: @krlmlr at #631 (comment), is .$ indeed the intended syntax? I ask because I was going to suggest adding an example of case_when with mutate to the documentation. Use with mutate will be common and some users might need a pointer toward .$. since in their current nested ifelse calls they don't need it.

Example:

mtcars %>% mutate(
  efficiency = case_when(
    .$mpg < 22 ~ "low",
    .$mpg <= 26  ~ "moderate",
    TRUE ~ "high"
  )
)

hadley · 2016-04-20T12:39:15Z

@sfirke no - the goals is for case_when() to work as is inside mutate().

bergsmat · 2016-05-09T22:34:35Z

Loving case_when, but encountering results I don't fully understand (dplyr 3074c).

x <- expand.grid(y=5,z = c(T,F))
x <-  group_by(x,z)
mutate(          x, y = case_when(   z ~ n()         )) # (no error)
mutate(          x, y = case_when(   z ~ n(),!z ~ 0L )) # (no error)
mutate( ungroup(x), y = case_when(   z ~ n(),!z ~ 0L )) # object 'z' not found
mutate(          x, y = case_when(   z ~ 1L          )) # object 'z' not found
mutate(          x, y = case_when( .$z ~ 1L          )) # object 'z' not found

hadley · 2016-05-10T03:49:19Z

Yes, it doesn't yet work inside mutate.

mvabl-steve · 2017-09-12T14:33:46Z

I've been teaching myself R from scratch, basically by undertaking something, then reading posts like these, and trial-and-error based on that. Sometimes I hit a wall and reach out.

I've hit a wall. I have dplyr 0.7 installed. I have a tibble with a column - call it contract_key - I added by applying mutate(coalesce()) to three other columns in the tibble. Here's sample data, which I'll of course refer to later as df:

product,supplier,vendor,quantity,gross,contract_key 168145850127,person1,org1,7,8.14E-04,person1 655393265191,person5,org2,7,0.041861099,person5 655393265191,person3,org3,5,0.019268984,person3 168145850127,person4,org2,4,0.023245343,org2 928893912852,person5,org2,2,0.002285355,person5 NA,person6,org1,3,NA,person6 168145850127,person7,org1,1,NA,person7 350468621217,person8,org2,2,0.004212372,person8 977939797847,person9,org1,1,0.001832018,person9 168145850127,person10,org3,6,0.043907375,168145850127 168145850127,person5,org2,3,0.020960582,168145850127 168145850127,person5,org2,1,0.006923272,person5 350468621217,person11,org1,2,0.002730578,350468621217 350468621217,person5,org2,1,0.005505893,person5 977939797847,person11,org1,18,0.023260808,org1 350468621217,person12,org2,1,0.003525162,org2 168145850127,person13,org4,1,0.001932924,person13 655393265191,person5,org2,7,0.016847417,person5 NA,person14,org5,2,NA,NA 928893912852,person5,org2,2,4.22E-04,person5 168145850127,person5,org2,1,5.82E-04,person5 168145850127,person5,org2,1,0.007305813,168145850127 866586597461,person5,org2,8,0.051104768,866586597461 977939797847,person5,org2,6,0.007783279,person5 866586597461,person5,org2,4,0.0225,866586597461

Here's how I generated contract_key:

df <- df %>% mutate(contract_key = coalesce(product,supplier,vendor))

I now want to add another column that categorizes the contents of contract_key based on which of the three columns provided the content (through coalesce()). So if contract_key ="person5", for example, the new column, contract_level, would be "supplier". And contract_key="org2" would map to contract_level = "vendor", etc.

Essentially, I'll be using contract_level as a join variable to another tibble.

I'm stumped. I've tried if_else, and I see that I shouldn't bother trying case_when (because it's inside mutate()).

It's probably basic R syntax that I just don't know. But if someone supplies the answer, I will backtrace until I figure out what you did. (And I'll have learned a new lesson in R!)

Thanks!

mvabl-steve · 2017-09-12T15:10:52Z

thanks @friep. i didn't realize the unwritten rules for SO v git. (Seriously, I didn't. Sincere thanks.)

I'll try SO, and all its attendant downvoting and judging! :)

edwindj changed the title ~~Conditional mutate: mutate_where?~~ Conditional mutate: mutate_where? Or add_filter/remove_filter? Sep 26, 2014

hadley added this to the 0.4 milestone Sep 30, 2014

hadley added the feature a feature request or enhancement label Sep 30, 2014

hadley self-assigned this Sep 30, 2014

hadley mentioned this issue Dec 3, 2014

idea: trigger / elicit / when tidyverse/purrr#13

Closed

hadley mentioned this issue May 19, 2015

enable replacement of values #1078

Closed

jennybc mentioned this issue Oct 4, 2015

Can I filter within mutate? Or how can I select specific rows within mutate STAT545-UBC/Discussion#134

Open

hadley changed the title ~~Conditional mutate: mutate_where? Or add_filter/remove_filter?~~ Vectorised switch() Oct 22, 2015

hadley added the data frame label Oct 22, 2015

romainfrancois mentioned this issue Nov 7, 2015

Better syntax for mutate() than nested "ifelse" functions #1518

Closed

jennybc mentioned this issue Dec 8, 2015

non overwriting update set where #1573

Closed

vergilcw mentioned this issue Dec 9, 2015

dplyr mutate to replace a subset of rows #1577

Closed

hadley mentioned this issue Dec 9, 2015

choices: new function to be used from within mutate and transmute functions #1557

Closed

hadley mentioned this issue Mar 1, 2016

Conditional mutate #1665

Closed

datalove mentioned this issue Mar 13, 2016

Vectorised switch (aka SQL DECODE) #1710

Closed

hadley changed the title ~~Vectorised switch()~~ Equivalent of SQL CASE Mar 14, 2016

hadley closed this as completed in b6b794f Mar 14, 2016

hadley added a commit that referenced this issue Mar 15, 2016

Also use names for numeric vars. #631

3afd493

@lionel- if this is going to work for doubles, it also needs a switch from == to near(). But that seems a bit perilous to me

hadley added a commit that referenced this issue Mar 15, 2016

Reduce name clashes in recode.

4920c8b

#631

krlmlr mentioned this issue Mar 16, 2016

More verbose error messages in case_when() #1718

Merged

jennybc mentioned this issue May 16, 2016

Dataset patching tools tidyverse/tidyr#183

Closed

kanaugust mentioned this issue Jun 25, 2016

Support case_when function inside mutate without .$ #1965

Closed

TanyaMurphy mentioned this issue Jan 19, 2017

R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries ropensci/auunconf#46

Open

lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018

Equivalent of SQL CASE #631

Equivalent of SQL CASE #631

Comments

edwindj commented Sep 26, 2014

hadley commented Sep 26, 2014

edwindj commented Sep 26, 2014

hadley commented Sep 26, 2014

hadley commented Oct 1, 2014

rpruim commented Oct 2, 2014

kevinykuo commented Oct 2, 2014

rpruim commented Oct 2, 2014

hadley commented Oct 2, 2014

matthieugomez commented Oct 2, 2014

hadley commented Oct 2, 2014

rpruim commented Oct 2, 2014

hadley commented Oct 2, 2014

rpruim commented Oct 2, 2014

hadley commented Oct 2, 2014

hadley commented Oct 22, 2014

dholstius commented Aug 5, 2015

hadley commented Mar 14, 2016

hadley commented Mar 14, 2016

hadley commented Mar 14, 2016

lionel- commented Mar 14, 2016

case_when()

recode()

hadley commented Mar 14, 2016

lionel- commented Mar 14, 2016

gergness commented Mar 14, 2016

kevinushey commented Mar 15, 2016

hadley commented Mar 15, 2016

krlmlr commented Mar 16, 2016

krlmlr commented Mar 16, 2016

hadley commented Mar 16, 2016

krlmlr commented Mar 16, 2016

sfirke commented Apr 20, 2016

hadley commented Apr 20, 2016

bergsmat commented May 9, 2016

hadley commented May 10, 2016

mvabl-steve commented Sep 12, 2017 • edited Loading

mvabl-steve commented Sep 12, 2017

mvabl-steve commented Sep 12, 2017 •

edited

Loading