
Inconsistent data.table assignment by reference behaviour #759

Closed
mchen402 opened this issue Aug 6, 2014 · 18 comments

@mchen402 commented Aug 6, 2014

Original SO post: http://stackoverflow.com/questions/25145112/inconsistent-data-table-assignment-by-reference-behaviour


When assigning by reference to a data.table using a column from a second data.table, the results are inconsistent. When there are no matches on the key columns of the two data.tables, the assignment expression y := y appears to be ignored entirely: not even NAs are returned.

library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
print(dt1[dt2, y := y])
##    id x     # Would have also expected column:   y
## 1:  1 3     #                                   NA
## 2:  2 4     #                                   NA

However, when there is a partial match, the non-matching rows get a placeholder NA.

dt2[, id := 2:3]
print(dt1[dt2, y := y])
##    id x  y
## 1:  1 3 NA    # <-- placeholder NA here
## 2:  2 4  5

This wreaks havoc on later code that assumes a y column exists in all cases; I keep having to write cumbersome additional checks to handle both situations.

Is there an elegant way around this inconsistency?
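
A minimal workaround sketch (my assumption about the goal, not from the thread): pre-create the y column before the join, so it exists whether or not anything matches.

library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[, y := NA_integer_]  # y now always exists, typed to match dt2$y
dt1[dt2, y := i.y]       # fills values only where the join matches
print(dt1)
##    id x  y
## 1:  1 3 NA
## 2:  2 4 NA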

@arunsrinivasan (Member)

Thanks @tunaaa once again for the report.

Side note: it's okay to just link to the SO post. In fact, it'd be nice if you edited your SO question to include a link to this issue, so that we can update the post once a decision is made and the issue is closed.

@arunsrinivasan (Member)

Updated SO post.

@jangorecki (Member)

I wonder if it makes sense to extend this feature to grouping operations?

library(data.table)
dt = data.table(a=1)
dt[a > 2, x1 := "col will appear"]
#   a x1
#1: 1 NA
dt[a > 2, x2 := "col wont appear", by=a]
#   a x1
#1: 1 NA

It would probably be quite costly to determine the data type for all-NA columns.
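
For context, a minimal illustration of why the type matters (my sketch, not from the thread): a bare NA is logical, so an eagerly added placeholder column needs the type of j's RHS even when no rows match.

typeof(NA)            # "logical"
typeof(NA_integer_)   # "integer"
typeof(NA_character_) # "character"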

mattdowle added a commit that referenced this issue Sep 29, 2016
…Col==3] where DT is one row and someCol is NA, returns no rows for consistency with nrow>1 cases.  with=FALSE with := now warns.  nomatch with := now warns.  logical i no longer recycles unless length 1 or nrow. #1001 #1002 #759 #354 #166 and closes #1252.
@UweBlock (Contributor) commented Oct 4, 2016

Now, the inconsistency is the other way around, unfortunately.

As a matter of fact, I have to fix a bunch of code that relies on the old behaviour of versions 1.9.6 and 1.9.7 (until recently).

library(data.table)
dt <- data.table(a = LETTERS[1:5])
dt[a == "Z", b := 1]

used to yield

dt
   a
1: A
2: B
3: C
4: D
5: E

Now, with the latest commit, it returns

> dt
   a  b
1: A NA
2: B NA
3: C NA
4: D NA
5: E NA

Please note that my j expressions aren't as simple as in the example. They convert character to date, which now breaks because the assignment gets executed even if the i expression doesn't select any rows.

If this feature is here to stay in data.table, I'm not sure how to get around it. Perhaps:

  • surrounding all affected statements with if-clauses that check whether dt[i, ] has a non-empty result (which would require executing the same i expression twice),
  • adding an intermediate column to check whether there is at least one non-NA value, or
  • adding an additional ifelse call to j.

I'm happy to hear about any other suggestions which will save me from the additional work to fix this.

Potential solution

Perhaps we have to distinguish here between a simple select statement in i and joining another table as requested by @mchen402. If the select in i returns an empty table, it could keep the old behaviour: don't execute j and don't add an all-NA column.
Or do we need another parameter or option here?

At least, this needs to be added to the list of POTENTIALLY BREAKING CHANGES in NEWS.md.

@MichaelChirico (Member)

I'm sorry, what is your expected output? I may have missed it.

@jangorecki (Member) commented Oct 4, 2016

@UweBlock the current approach is more consistent: it always adds the column, and fills it with values depending on the i match. IMO it is more correct to have this extra check when you want to conditionally add a column. This shouldn't cost much if you are using an index; otherwise you can use if (length(ii <- dt[..., which = TRUE])) dt[ii, b := 1] to avoid doing the vector scan twice.
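
Spelled out as a runnable sketch (toy table assumed; the condition matches nothing, so no column is added):

library(data.table)
dt <- data.table(a = LETTERS[1:5])
ii <- dt[a == "Z", which = TRUE]  # integer(0): no rows match
if (length(ii)) dt[ii, b := 1]    # skipped entirely, so no b column
print(dt)                         # still just column a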

@mattdowle (Member) commented Oct 4, 2016

@jangorecki Agree. That was the thinking: to make the shape and type of the output not depend on the data (i.e. on whether the i expression returns all-FALSE). Assuming by 'current approach' you mean 'new approach'.

@UweBlock Two questions. On:

It uses conversion from character to date which now breaks because the assignment operation gets executed even if the i expression doesn't select any row.

What is this expression please (roughly), and why does it break on empty input? I'm thinking we should focus on your RHS of := so that it returns numeric(), integer() or character() on empty input.
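
To illustrate what a typed empty return looks like (a minimal sketch of the idea, not UweBlock's actual code):

as.Date(character(0))  # a length-0 Date: safe on empty input
# whereas an RHS that assumes at least one element breaks, e.g.
# as.Date(character(0))[[1]]  # Error: subscript out of bounds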

And you must have switches in your current code after this statement then? To cope with the new b column existing or not depending on whether i returns all-FALSE or not?

@mattdowle
Copy link
Member

mattdowle commented Oct 4, 2016

Here's an example of the old behaviour we think is undesirable:

> DT = data.table(a=1:3, b=c("2012-03-01","2012-03-02","2012-03-03"))
> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>8,newCol:=as.Date(b)]   # no column added
> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>2,newCol:=as.Date(b)]   # column added
> DT
   a          b     newCol
1: 1 2012-03-01       <NA>
2: 2 2012-03-02       <NA>
3: 3 2012-03-03 2012-03-03

New behaviour:

> DT
   a          b
1: 1 2012-03-01
2: 2 2012-03-02
3: 3 2012-03-03
> DT[a>8,newCol:=as.Date(b)]   # column added
> DT
   a          b newCol
1: 1 2012-03-01   <NA>
2: 2 2012-03-02   <NA>
3: 3 2012-03-03   <NA>
> DT[, newCol:=NULL]
> DT[a>2,newCol:=as.Date(b)]   # column added
> DT
   a          b     newCol
1: 1 2012-03-01       <NA>
2: 2 2012-03-02       <NA>
3: 3 2012-03-03 2012-03-03
> 

as.Date() works on empty character() here, which is why I'm asking for more info about your RHS please, @UweBlock.

mattdowle reopened this Oct 4, 2016
@UweBlock (Contributor) commented Oct 4, 2016

Thank you for your immediate reactions and suggestions, @MichaelChirico, @jangorecki, @mattdowle.

Two updates:

  1. I've tried to create a minimal working example but wasn't able to reproduce the strange behaviour of the production code.
  2. In the production code, I replaced the complex date conversion stuff on the RHS with calls to lubridate::dmy, which apparently is more robust to missing or NA input. Fortunately, the fixes required less effort than expected and I was able to run all reports in time.

Now, with the fixes in place, I feel comfortable with the new approach. However, I will spend some more time on the MWE.

@mattdowle (Member)

Thanks for the update @UweBlock. We'll keep the change then, unless anything else comes to light, and move the news item up to the 'potentially breaking changes' section as you suggested.

@brianbolt commented Apr 27, 2017

I did something very similar to this in data.table 1.9.6:

> data.table(test=1)[test != 1, 'newCol' := stop("some error"), by = test]

Now (since 1.9.8), the same code gives:

> data.table(test=1)[test != 1, 'newCol' := stop("some error"), by = test]
Error in `[.data.table`(data.table(test = 1), test != 1, `:=`("newCol", :
  some error

In hundreds of different pieces of code I used the 1.9.6 inconsistency as a no-op case, expecting that the stop would never be evaluated if i selected zero rows.

I understand the purpose of the change in data.table, and I am looking for the best way to remediate my code, and possibly an option to restore the old functionality to give me time to do so. Would it be possible to add an option to restore this "inconsistency", like what was done for the other potentially breaking changes introduced in 1.9.8?

@jangorecki (Member) commented Apr 28, 2017

@brianbolt all breaking changes were listed in the news file, so be sure to check there. Also be aware that if your application works fine, you don't have to update the package or your code. Multiple versions of the same package can be managed using R libraries. This is a pretty old change already; I don't think there is an easy way to suppress evaluation here.

@brianbolt commented Apr 28, 2017

@jangorecki I don't feel Nov 2016 (when 1.9.8 was released) is that old, and other potentially breaking changes were given options (to be removed in 2 years) for using the old behaviour in a deprecated way. That being said, I am aware that installing the old package in a local library is possible, and that is what I am doing as a workaround for now until I can remediate my code.

For anyone else who might need to do this, here is one way to install 1.9.6:

# if you want to install the package somewhere other than the default library, run the next line first
.libPaths('/path/to/local/r_libs')

# install data.table 1.9.6 from the last day it was available on cran
install.packages("data.table", repos = "http://mran.revolutionanalytics.com/snapshot/2016-11-25")

@brianbolt

I find this part of the documentation very confusing now, as it seems to indicate that I am using data.table correctly: essentially as a for loop.

The general form of data.table syntax is:

    DT[ i,  j,  by ] # extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

In my case, "on which rows" selects zero rows, so I would expect (as in a normal for loop) that j would not be evaluated. I really do think this should be an option, if not the default.
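
The for-loop analogy, spelled out (my sketch of the expectation, not data.table's actual behaviour since 1.9.8):

for (r in integer(0)) stop("never evaluated")  # zero iterations, so no error
# whereas since 1.9.8, j is evaluated once even when i selects zero rows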

@franknarf1 (Contributor)

@brianbolt I see your point, but another way of thinking about it is that evaluation of j is appropriate even when i leads to an empty table. For example, I often use .N even when there are zero rows. One fix that may help in your use case (which isn't really clear to me) is adding if (.N > 0L) ahead of these stop statements. I sometimes use...

stopifnot(DT[ !(expected), .N ] == 0L)
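
A runnable version of that check-first pattern, with toy data (table and condition hypothetical):

library(data.table)
DT <- data.table(test = 1)
stopifnot(DT[test != 1, .N] == 0L)  # fail loudly if unexpected rows exist
DT[test == 1, newCol := "ok"]       # then proceed with the assignment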

@MichaelChirico (Member) commented Aug 8, 2017

just got bit by this... maybe worth adding jan's suggestion to the FAQ?

e.g., I just tried:

DT[(some_condition), new_col := if (.N) value else NULL]

And was (somewhat) surprised to see the error:

Error in `[.data.table`(X, start_date <= test_start, `:=`(sprintf("train_d", :
  When deleting columns, i should not be provided

@mattdowle (Member) commented Mar 16, 2018

@MichaelChirico Late reply by 7 months (!) but are you still surprised by that error? Returning an error in that case looks like good, safe behaviour to me. new_col suggests the intention is to add a new column, or not, based on the condition. Assigning NULL to a new column is a warning anyway (Adding new column 'new_col' then assigning NULL (deleting it)). And if some_condition was accidentally not a single TRUE or FALSE but a longer logical vector, then new_col could consist of a mixture of value and NA.
How about if (DT[, any(some_condition)]) DT[, new_col := value] instead?
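
Using the example table from earlier in the thread, that guard looks like this (a sketch; the condition is false here, so nothing is assigned):

library(data.table)
DT <- data.table(a = 1:3, b = c("2012-03-01", "2012-03-02", "2012-03-03"))
if (DT[, any(a > 8)]) DT[, newCol := as.Date(b)]  # FALSE: no column added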

@MichaelChirico (Member)

Hmm. Can't say I remember exactly what I was trying to do. Certainly my syntax looks weird. := NULL should always be expected to try to delete a column... so yes, I agree with you.
