
Inconsistent data.table assignment by reference behaviour #759

Closed
mchen402 opened this issue Aug 6, 2014 · 18 comments

@mchen402 commented Aug 6, 2014

Original SO post: http://stackoverflow.com/questions/25145112/inconsistent-data-table-assignment-by-reference-behaviour


When assigning by reference to a data.table using a column from a second data.table, the results are inconsistent. When there are no matches on the key columns of the two data.tables, the assignment expression y := y appears to be ignored entirely: not even NAs are returned.

library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
print(dt1[dt2, y := y])
##    id x     # Would have also expected column:   y
## 1:  1 3     #                                   NA
## 2:  2 4     #                                   NA

However, when there is a partial match, the non-matching rows get a placeholder NA.

dt2[, id := 2:3]
print(dt1[dt2, y := y])
##    id x  y
## 1:  1 3 NA    # <-- placeholder NA here
## 2:  2 4  5

This wreaks havoc on later code that assumes a y column exists in all cases; I keep having to write cumbersome additional checks to handle both situations.

Is there an elegant way around this inconsistency?
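
A minimal workaround sketch (my assumption about the goal, not from the thread): pre-create the y column before the join, so it exists whether or not anything matches.

library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[, y := NA_integer_]  # y now always exists, typed to match dt2$y
dt1[dt2, y := i.y]       # fills values only where the join matches
print(dt1)
##    id x  y
## 1:  1 3 NA
## 2:  2 4 NA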

@arunsrinivasan (Member)

Thanks @tunaaa once again for the report.

Side note: it's okay to just link to the SO post. In fact, it'd be nice if you edited your SO question to include a link to this issue, so that we can update the post once a decision is made and the issue is closed.

@arunsrinivasan (Member)

Updated SO post.

@jangorecki (Member)

I wonder if it makes sense to extend this feature to grouping operations?

library(data.table)
dt = data.table(a=1)
dt[a > 2, x1 := "col will appear"]
#   a x1
#1: 1 NA
dt[a > 2, x2 := "col wont appear", by=a]
#   a x1
#1: 1 NA

It would probably be quite costly to determine the data type for all-NA columns.
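
For context, a minimal illustration of why the type matters (my sketch, not from the thread): a bare NA is logical, so an eagerly added placeholder column needs the type of j's RHS even when no rows match.

typeof(NA)            # "logical"
typeof(NA_integer_)   # "integer"
typeof(NA_character_) # "character"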

mattdowle added a commit that referenced this issue Sep 29, 2016
…Col==3] where DT is one row and someCol is NA, returns no rows for consistency with nrow>1 cases.  with=FALSE with := now warns.  nomatch with := now warns.  logical i no longer recycles unless length 1 or nrow. #1001 #1002 #759 #354 #166 and closes #1252.
@UweBlock (Contributor) commented Oct 4, 2016

Now, the inconsistency is the other way around, unfortunately.

As a matter of fact, I have to fix a bunch of code that relies on the old behaviour of versions 1.9.6 and 1.9.7 (until recently).

library(data.table)
dt <- data.table(a = LETTERS[1:5])
dt[a == "Z", b := 1]

used to yield

dt
   a
1: A
2: B
3: C
4: D
5: E

Now, with the latest commit, it returns

> dt
   a  b
1: A NA
2: B NA
3: C NA
4: D NA
5: E NA

Please note that my j expressions aren't as simple as in the example. They convert character to date, which now breaks because the assignment gets executed even if the i expression doesn't select any rows.

If this feature is here to stay in data.table, I'm not sure how to get around it. Perhaps:

  • surrounding all affected statements with if-clauses that check whether dt[i, ] has a non-empty result (which would require executing the same i expression twice),
  • adding an intermediate column to check whether there is at least one non-NA value, or
  • adding an additional ifelse call to j.

I'm happy to hear about any other suggestions which will save me from the additional work to fix this.

Potential solution

Perhaps we have to distinguish here between a simple select statement in i and joining another table as requested by @mchen402. If the select in i returns an empty table, it could keep the old behaviour: don't execute j and don't add an all-NA column.
Or do we need another parameter or option here?

At least, this needs to be added to the list of POTENTIALLY BREAKING CHANGES in NEWS.md.

@MichaelChirico (Member)

I'm sorry, what is your expected output? I may have missed it.

@jangorecki (Member) commented Oct 4, 2016

@UweBlock the current approach is more consistent: it always adds the column, and fills it with values depending on the i match. IMO it is more correct to have this extra check when you want to conditionally add a column. This shouldn't cost much if you are using an index; otherwise you can use if (length(ii <- dt[..., which = TRUE])) dt[ii, b := 1] to avoid doing the vector scan twice.
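
Spelled out as a runnable sketch (toy table assumed; the condition matches nothing, so no column is added):

library(data.table)
dt <- data.table(a = LETTERS[1:5])
ii <- dt[a == "Z", which = TRUE]  # integer(0): no rows match
if (length(ii)) dt[ii, b := 1]    # skipped entirely, so no b column
print(dt)                         # still just column a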

@mattdowle (Member) commented Oct 4, 2016

@jangorecki Agree. That was the thinking: to make the shape and type of the output not depend on the data (i.e. on whether the i expression returns all-FALSE). Assuming by 'current approach' you mean 'new approach'.

@UweBlock Two questions. On:

It uses conversion from character to date which now breaks because the assignment operation gets executed even if the i expression doesn't select any row.

What is this expression please (roughly), and why does it break on empty input? I'm thinking we should focus on your RHS of := so that it returns numeric(), integer() or character() on empty input.
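
To illustrate what a typed empty return looks like (a minimal sketch of the idea, not UweBlock's actual code):

as.Date(character(0))  # a length-0 Date: safe on empty input
# whereas an RHS that assumes at least one element breaks, e.g.
# as.Date(character(0))[[1]]  # Error: subscript out of bounds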

And you must have switches in your current code after this statement then? To cope with the new b column existing or not depending on whether i returns all-FALSE or not?

@mattdowle
Copy link
Member

mattdowle commented Oct 4, 2016

Here's an example of the old behaviour we think is undesirable:

> DT = data.table(a=1:3, b=c("2012-03-01","2012-03-02","2012-03-03"))
> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>8,newCol:=as.Date(b)]   # no column added
> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>2,newCol:=as.Date(b)]   # column added
> DT
   a          b     newCol
1: 1 2012-03-01       <NA>
2: 2 2012-03-02       <NA>
3: 3 2012-03-03 2012-03-03

New behaviour:

> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>8,newCol:=as.Date(b)]   # column added
> DT
   a          b newCol
1: 1 2012-03-01   <NA>
2: 2 2012-03-02   <NA>
3: 3 2012-03-03   <NA>
> DT[, newCol:=NULL]
> DT[a>2,newCol:=as.Date(b)]   # column added
> DT
   a          b     newCol
1: 1 2012-03-01       <NA>
2: 2 2012-03-02       <NA>
3: 3 2012-03-03 2012-03-03
> 

as.Date() works on empty character() here, which is why I'm asking for more info about your RHS please, @UweBlock.

mattdowle reopened this Oct 4, 2016
@UweBlock (Contributor) commented Oct 4, 2016

Thank you for your immediate reactions and suggestions, @MichaelChirico, @jangorecki, @mattdowle.

Two updates:

  1. I've tried to create a minimal working example but wasn't able to reproduce the strange behaviour of the production code.
  2. In the production code, I replaced the complex date conversion stuff on the RHS with calls to lubridate::dmy, which apparently is more robust to missing or NA input. Fortunately, the fixes required less effort than expected and I was able to run all reports in time.

Now, with the fixes in place, I feel comfortable with the new approach. However, I will spend some more time on the MWE.

@mattdowle (Member)

Thanks for the update @UweBlock. We'll keep the change then, unless anything else comes to light, and move the news item up to the 'potentially breaking changes' section as you suggested.

@brianbolt commented Apr 27, 2017

I did something very similar to this in data.table 1.9.6:

> data.table(test=1)[test != 1, 'newCol' := stop("some error"), by = test]

Now (since 1.9.8), the same code gives:

> data.table(test=1)[test != 1, 'newCol' := stop("some error"), by = test]
Error in `[.data.table`(data.table(test = 1), test != 1, `:=`("newCol", :
  some error

In hundreds of different pieces of code I used the 1.9.6 inconsistency as a no-op case, expecting that the stop would never be evaluated if i selected zero rows.

I understand the purpose of the change in data.table, and I am looking for the best way to remediate my code, and possibly an option to restore the old functionality to give me time to do so. Would it be possible to add an option to restore this "inconsistency", like what was done for the other potentially breaking changes introduced in 1.9.8?

@jangorecki (Member) commented Apr 28, 2017

@brianbolt all breaking changes were listed in the news file, so be sure to check there. Also be aware that if your application works fine, you don't have to update the package or your code. Multiple versions of the same package can be managed using R libraries. This is a pretty old change already; I don't think there is an easy way to suppress evaluation here.

@brianbolt commented Apr 28, 2017

@jangorecki I don't feel Nov 2016 (when 1.9.8 was released) is that old, and other potentially breaking changes were given options (to be removed in 2 years) for using the old behaviour in a deprecated way. That being said, I am aware that installing the old package in a local library is possible, and that is what I am doing as a workaround for now until I can remediate my code.

For anyone else who might need to do this, here is one way to install 1.9.6:

# if you want to install the package somewhere other than the default library, run the next line first
.libPaths('/path/to/local/r_libs')

# install data.table 1.9.6 from the last day it was available on cran
install.packages("data.table", repos = "http://mran.revolutionanalytics.com/snapshot/2016-11-25")

@brianbolt

I find this part of the documentation very confusing now, as it seems to indicate that I am using data.table correctly: essentially as a for loop.

The general form of data.table syntax is:

    DT[ i,  j,  by ] # extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

In my case, "on which rows" selects zero rows, so I would expect (as in a normal for loop) that j would not be evaluated. I really do think this should be an option, if not the default.
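
The for-loop analogy, spelled out (my sketch of the expectation, not data.table's actual behaviour since 1.9.8):

for (r in integer(0)) stop("never evaluated")  # zero iterations, so no error
# whereas since 1.9.8, j is evaluated once even when i selects zero rows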

@franknarf1 (Contributor)

@brianbolt I see your point, but another way of thinking about it is that evaluation of j is appropriate even when i leads to an empty table. For example, I often use .N even when there are zero rows. One fix that may help in your use case (which isn't really clear to me) is adding if (.N > 0L) ahead of these stop statements. I sometimes use...

stopifnot(DT[ !(expected), .N ] == 0L)
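
A runnable version of that check-first pattern, with toy data (table and condition hypothetical):

library(data.table)
DT <- data.table(test = 1)
stopifnot(DT[test != 1, .N] == 0L)  # fail loudly if unexpected rows exist
DT[test == 1, newCol := "ok"]       # then proceed with the assignment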

@MichaelChirico (Member) commented Aug 8, 2017

just got bit by this... maybe worth adding jan's suggestion to the FAQ?

e.g., I just tried:

DT[(some_condition), new_col := if (.N) value else NULL]

And was (somewhat) surprised to see the error:

Error in `[.data.table`(X, start_date <= test_start, `:=`(sprintf("train_d", :
  When deleting columns, i should not be provided

@mattdowle (Member) commented Mar 16, 2018

@MichaelChirico Late reply by 7 months (!) but are you still surprised by that error? Returning an error in that case looks like good, safe behaviour to me. new_col suggests the intention is to add a new column, or not, based on the condition. Assigning NULL to a new column is a warning anyway (Adding new column 'new_col' then assigning NULL (deleting it)). And if some_condition was accidentally not a single TRUE or FALSE but a longer logical vector, then new_col could consist of a mixture of value and NA.
How about if (DT[, any(some_condition)]) DT[, new_col := value] instead?
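
Using the example table from earlier in the thread, that guard looks like this (a sketch; the condition is false here, so nothing is assigned):

library(data.table)
DT <- data.table(a = 1:3, b = c("2012-03-01", "2012-03-02", "2012-03-03"))
if (DT[, any(a > 8)]) DT[, newCol := as.Date(b)]  # FALSE: no column added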

@MichaelChirico (Member)

Hmm. Can't say I remember exactly what I was trying to do. Certainly my syntax looks weird. := NULL should always be expected to try to delete a column... so yes, I agree with you.
