-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Equivalent of SQL CASE #631
Comments
I really prefer the vector based solution, but it would be much simpler if smoke2 <-
smoke %>%
mutate(cigarrettes = ifelse(!is_smoker, 0, cigarrettes, 0)
) The problem is that I don't see how any other approach could work with databases - you can not assume that you have write access to the underlying table. |
Agreed :-) While I think it is possible to implement it in DBs using temporary tables and multiple SQL statements, I think it is wise not to do so: it will complicate the translate_sql code and furthermore it is rather DB specific how TEMP tables work. |
I'd like to approach this problem by figuring out what the 90% of most common in-place mutates do, and then figuring out how to make elegant vectorised replacement functions. |
I'm thinking of adapting something like @rpruim any thoughts? |
I've been thinking about this today and have an idea to float. We could use formulas of the form
to specify the desired variable/vector. This is modeled on mathematical definitions by cases, so below I'll call the function
And fairly complicated definitions by cases could be constructed similarly. As with Using formulas makes it easy to have arguments with natural names (rather than prepending with
What do you think? |
Haven't given any thought on the viability for implementation, but here's an idea, inspired by @rpruim and pattern matching:
|
Hmm. Note sure you really want this. Rather, I think its use case would be too restricted. In your case, you only work with one variable, but in general both the conditions and the expressions might refer to multiple variables. Remember that
What I'm proposing would work on vectors (which could be evaluated in a data frame or elsewhere) and would make it simple to create either expressions or conditions from multiple vectors. I also think I prefer my original order for the formulas. The expression is more like a response and the condition more like an explanatory variable, so it seems to fit the model of other uses of formulas better. |
I think I'd prefer having the result on the RHS: cases(
!is_smoker & is.na(cigarettes) ~ 0,
TRUE ~ cigarettes)
) Using Taking an example from `derivedFactor(): cases(
i1 == 0 ~ "abstinent",
(i1>0 & i1<=1 & i2<=3 & sex=='female') | (i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate"
((i1>1 | i2>3) & sex=='female') | ((i1>2 | i2>4) & sex=='male') = "highrisk"
)
# Or by moving our the "or"s
cases(
i1 == 0 ~ "abstinent",
i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
(i1>1 | i2>3) & sex=='female') ~ "highrisk"
(i1>2 | i2>4) & sex=='male') ~ "highrisk"
)
# You could also nest cases to make the male/female distinction clearer
cases(
i1 == 0 ~ "abstinent",
sex == 'female' ~ cases(
i1>0 & i1<=1 & i2<=3 ~ "moderate",
i1>1 | i2>3 ~ "highrisk"
),
sex == 'male' ~ cases(
i1>0 & i1<=2 & i2<=4 ~ "moderate",
i1>2 | i2>4 ~ "highrisk"
)
)
# Or with a default
cases(
i1 == 0 ~ "abstinent",
i1>0 & i1<=1 & i2<=3 & sex=='female') ~ "moderate",
i1>0 & i1<=2 & i2<=4 & sex=='male') ~ "moderate",
TRUE ~ "highrisk"
) |
Creating a new variable (as in hadley's last post) is conceptually quite different from modifying some value of a variable (as in the OP's post) - for instance, default behaviors for what happens with non matched rows should differ (resp NA vs original value). There could two different verbs instead of |
@matthieugomez yes, agreed. In SQL there are two forms of But the challenge is NA: the usual semantics for equality won't work here. (And there's a general problem for floating point values). |
One advantage of the
Whichever way the formulas are used, I like the nesting idea. |
@rpruim That is true, but I think it makes nesting less natural, and you can achieve the nice alignment with lots of whitespace cases(
i1 == 0 ~ "abstinent",
i1>0 & i1<=1 & i2<=3 & sex=='female' ~ "moderate",
i1>0 & i1<=2 & i2<=4 & sex=='male' ~ "moderate",
(i1>1 | i2>3) & sex=='female' ~ "highrisk"
(i1>2 | i2>4) & sex=='male' ~ "highrisk"
) @matthieugomez I don't see the advantage of that over the filter based approach. Sure it save a little memory, but at the expense of making the API non-orthogonal. The goal of dplyr isn't to squeeze the last bit of performance out of R, it's to have provide a elegant and efficient API for data manipulation. |
Hmm... The nesting argument is probably reason enough to prefer |
To make the semantics clear, I think each vector referenced in I'm also reasonably certain this will be straightforward to translate to SQL. |
Here's a start on an implementation: https://gist.github.com/hadley/2751ba61d1c7f4eaacab Currently it requires you pass the data frame as the first argument, but I think I can work around that. It's not very efficient, but it's a starting point. |
For what it's worth: here's a https://gist.github.com/holstius/cbc4ec40057fbc2d9f4b In my workflows, I like to separate "patching" from "recoding" or "labeling". Patching is the real-world consequence of deadlines busy upstream providers — it's not supposed to be there, but it is. |
With the new mutate(smoke, cigarettes = if_else(coalesce(is_smoker, FALSE), cigarettes, 0L)) And note that this discussions is really about a sql CASE equivalent for R, not a vectorised switch. The discussion for vectorised switch is now taking place in #1710. I've also deleted a number of comments in this thread that are no longer pertinent. |
And now you can do: mutate(smoke, cigarettes = if_else(is_smoker, cigarettes, 0L, 0L)) |
I'm pretty happy with this implementation, but I'd love to get everyone's feedback on the name and how it works. |
These new functions are awesome! case_when()It could be named I find the tilde a bit hard to parse visually with a long list of cases. Maybe allow the match_cases(
x %% 35 == 0 := "fizz buzz",
x %% 5 == 0 := "fizz",
x %% 7 == 0 := "buzz",
TRUE := as.character(x)
) recode()Would it be too unsafe to coerce the default value automatically? recode(x, 5, 4, 3, default = x)
#> Error: `default` has type 'integer' not 'double'
recode(x, "a", "b", "c", default = x)
#> Error: `default` has type 'integer' not 'character'
recode(x, "a", "b", "c", default = NA)
#> Error: `default` has type 'logical' not 'character' against the more verbose: recode(x, 5, 4, 3, default = as.double(x))
recode(x, "a", "b", "c", default = as.character(x))
recode(x, "a", "b", "c", default = NA_character_) |
I think I prefer I'm also in an explicit mood at the moment, so I'd prefer to require the default to be correctly typed. Also the logic starts to get a bit hairy if there are no arguments (just the default). |
I think recode(x, `1` = 5, `2` = 4) This would ideally work with both integers and integer-like doubles (as often integer data is stored in a double). |
Agreed, looks great! What about using the formula notation from x <- c("x", "y", "z")
recode(x, x = "a", y = "b", z = "c")
#>Error: All replacements must be named |
@lionel- if this is going to work for doubles, it also needs a switch from == to near(). But that seems a bit perilous to me
I also don't like using named arguments for |
@kevinushey yeah, but I don't see that being a particularly common use case. And I think using |
In 4920c8b, I need to use
(Is this meant by "some work with the hybrid evaluator"?) |
Re formula notation for recode(): It's also a question of API consistency. |
I agree to "more syntax", but to me this doesn't necessary imply that each syntax should be different. A different implementation of case_when(), that is also consistent with
(replace the if_else() call with its body). |
re: @krlmlr at #631 (comment), is Example:
|
@sfirke no - the goals is for |
Loving case_when, but encountering results I don't fully understand (dplyr 3074c).
|
Yes, it doesn't yet work inside mutate. |
I've been teaching myself R from scratch, basically by undertaking something, then reading posts like these, and trial-and-error based on that. Sometimes I hit a wall and reach out. I've hit a wall. I have dplyr 0.7 installed. I have a tibble with a column - call it
Here's how I generated
I now want to add another column that categorizes the contents of Essentially, I'll be using I'm stumped. I've tried It's probably basic R syntax that I just don't know. But if someone supplies the answer, I will backtrace until I figure out what you did. (And I'll have learned a new lesson in R!) Thanks! |
thanks @friep. i didn't realize the unwritten rules for SO v git. (Seriously, I didn't. Sincere thanks.) I'll try SO, and all its attendant downvoting and judging! :) |
In data cleaning data often contains values that one would like to impute/replace.
For example:
In a survey for smokers, one would not ask a person his or her cigarette consumption when he/she has indicated not to be a smoker. So technically the number of cigarretes is not measured, but it can be deduced to be zero.
How does one do this elegantly in dplyr?
My 2 cents:
which I don't find elegant dplyr code.
So try a little harder
It works, but this is not efficient/workable for the database backends and I find it unelegant to
do the filtering in the assignment.
I suggest a new method:
mutate_where
that allows for changing values of existing variables of a subset of atbl
.So something like
Or another option (in analogy with
group
andgrouped_<tbl>
) to changefilter
or introducefiltered_<tbl>
/add_filter: setting a filter lets one work on a subset of the data, but one canremove_filter
.Any better ideas or suggestions?
The text was updated successfully, but these errors were encountered: