fread() memory access violation crash during type bumping on 64-bit R (Win32 Mac) #796

Closed
adamkennedy opened this issue Sep 3, 2014 · 2 comments

@adamkennedy

This is a bug we have been hitting in production. It is caused either by data.table itself or by a bug in the R code that data.table relies on. I've included as much material as I can for replication and will update this report as I get more.

Update: Also confirmed in 1.9.3 with LGL -> INT -> REAL -> STR bumping (the output below is still from 1.9.2).

When fread() loads large CSV files containing many sparse character columns (columns with nothing other than NA inside the sample range), it crashes R during INT -> REAL -> STR type bumping.
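For readers unfamiliar with the term, type bumping can be illustrated with a toy sketch in plain R. This is not fread()'s C implementation: the names `type_ladder`, `fits` and `bump_column` are invented for this illustration, and the parsing rules are deliberately simplified. It only shows the idea of starting a column at the narrowest type and promoting it whenever a later field no longer fits.

```r
# Toy illustration only; fread() does this in C with its own parsers.
# Types ordered from narrowest to widest, as in the chain named above.
type_ladder <- c("INT", "REAL", "STR")

# Does a raw text field parse under the given type? "NA" fits every type.
fits <- function(field, type) {
  if (field == "NA") return(TRUE)
  switch(type,
    INT  = grepl("^[+-]?[0-9]+$", field),
    REAL = !is.na(suppressWarnings(as.numeric(field))),
    STR  = TRUE
  )
}

# Walk the fields once, bumping the column type whenever a field does not fit.
bump_column <- function(fields, start = "INT") {
  level <- match(start, type_ladder)
  for (i in seq_along(fields)) {
    while (!fits(fields[i], type_ladder[level])) {
      message(sprintf("Bumping from %s to %s on row %d",
                      type_ladder[level], type_ladder[level + 1], i))
      level <- level + 1
    }
  }
  type_ladder[level]
}

# A column that is all NA until a late string goes INT -> REAL -> STR.
bump_column(c(rep("NA", 5), "foo"))
```

A column that looks like integers (or all NA) in the sampled rows therefore has to be converted in place when a string shows up later, which is the step where the crash is observed.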

I cannot produce the crash with only INT -> REAL type bumping, and the location of the crash (while seemingly deterministic) does not shift consistently with changes to the number of columns or the number of rows.

I have been able to rule out integer64 as a contributing factor (although the bug does still occur with INT -> INT64 -> REAL -> STR). I have not been able to trigger it with REAL -> STR alone, but that may be a failure on my part rather than proof that it cannot happen.

The crash location is pseudo-random, and instinct says it is potentially a memory allocation problem of some kind, although I can't establish this. Nor can I determine whether the problem is in data.table or in R itself: the access violation occurs in R.dll, but only 1 or 2 stack frames below data.table's DLL.
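Since the crash appears tied to the bumping step, one possible way to sidestep it is to declare the column types up front with fread()'s `colClasses` argument, so that no bump is needed. The sketch below is an untested assumption as far as the crash is concerned: it has not been verified against the failing versions, and it only applies when the column types are known in advance. The row and column counts are kept small on purpose.

```r
library(data.table)

rows    <- 200   # small on purpose; the failing cases use 10000
columns <- 5

# Same shape as the problem files: all NA except one late string per column.
cols <- replicate(columns, c(rep(NA, rows), "foo", rep(NA, 10)), simplify = FALSE)
names(cols) <- paste0("c", seq_len(columns))
sparse <- as.data.frame(cols, stringsAsFactors = FALSE)

csv <- tempfile(fileext = ".csv")
write.table(sparse, file = csv, sep = ",", row.names = FALSE, qmethod = "double")

# Declare every column as character so type detection never has to bump.
dt <- fread(csv, header = TRUE, colClasses = rep("character", columns))

# Every column should come back as character.
sapply(dt, class)

file.remove(csv)
```

With `verbose = TRUE` added to the call, the absence of "Bumping column" lines would indicate that the declared types were used.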

The primary platform observed is Win32 (see below); however, we also have reports of crashes on Mac. I do not have a specific build at this time, but it is confirmed as 64-bit on Mac too.

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Platform: x86_64-w64-mingw32/x64 (64-bit)

I've managed to come up with a rough replication test script, but the crash is pseudo-random and does not always fail in the same way on different machines. Some tweaking of the row and column counts may be needed.

# test-64BitCrash.R
# This test script demonstrates a fatal crash under 64-bit R that we originally
# observed in production code.
#
# It seems to occur in sparse CSV files where a moderately large number of NA values
# are converted from INT -> REAL -> STRING.

library(testthat)
library(data.table)

# Causing the failure can be somewhat random.
# Sometimes tweaking these values helps stimulate it.
# Once a case is failing on a machine, it MOSTLY seems to fail at the same place.
rows    <- 10000
columns <- 100





#####################################################################
# Generate the synthetic file which stimulates the crash

file <- file.path("test-64BitCrash.csv")

if (file.exists(file)) {
  file.remove(file)
}

test_that("The test file is generated", {
  frame <- data.frame(
    c1 = c( rep(NA,rows), "foo", rep(NA,10) )
  )

  for (i in 2:columns) {
    frame[,paste0("c",i)] <- c( rep(NA,rows), "foo", rep(NA,10) )
  }

  write.table(
    frame,
    file      = file,
    sep       = ",",
    row.names = FALSE,
    qmethod   = "double"
  )

  expect_true(file.exists(file))
})





#####################################################################
# Trigger the crash using the synthetic file

# This does not always crash consistently on the first attempt.
test_that("Run the fread() command a number of times until it fails", {
  for (i in seq(10)) {
    fread(file, header = TRUE, verbose = TRUE, integer64 = "double")
  }
})

# Clean up afterwards if we don't crash
if (file.exists(file)) {
  file.remove(file)
}

The failure mode looks like the following:

[1] TRUE
Input contains no \n. Taking this to be a filename to open
File opened, filesize is  0.00281B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 100 columns
First row with 100 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 10012
Subtracted 1 for last eol and any trailing empty lines, leaving 10011 data rows
Type codes: 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 (first 5 rows)
Type codes: 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 ( middle 5 rows)
Type codes: 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 ( last 5 rows)
Type codes: 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 (after applying colClasses and integer64)
Type codes: 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 (after applying drop or select (if supplied)
Allocating 100 column slots (100 - 0 NULL)
Bumping column 1 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 1 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 2 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 2 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 3 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 3 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 4 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 4 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 5 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 5 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 6 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 6 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 7 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 7 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 8 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 8 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 9 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 9 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 10 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 10 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 11 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 11 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 12 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 12 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 13 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 13 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 14 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 14 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 15 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 15 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 16 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 16 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 17 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 17 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 18 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 18 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 19 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 19 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 20 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 20 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 21 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 21 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 22 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 22 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 23 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 23 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 24 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 24 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 25 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 25 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 26 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 26 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 27 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 27 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 28 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 28 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 29 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 29 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 30 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 30 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 31 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 31 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 32 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 32 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 33 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 33 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 34 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 34 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 35 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 35 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 36 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 36 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 37 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 37 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 38 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 38 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 39 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 39 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 40 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 40 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 41 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 41 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 42 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 42 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 43 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 43 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 44 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 44 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 45 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 45 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 46 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 46 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 47 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 47 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 48 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 48 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 49 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 49 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 50 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 50 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 51 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 51 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 52 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 52 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 53 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 53 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 54 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 54 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 55 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 55 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 56 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 56 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 57 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 57 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 58 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 58 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 59 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 59 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 60 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 60 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 61 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 61 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 62 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 62 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 63 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 63 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 64 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 64 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 65 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 65 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 66 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 66 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 67 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 67 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 68 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 68 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 69 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 69 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 70 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 70 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 71 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 71 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 72 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 72 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 73 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 73 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 74 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 74 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 75 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 75 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 76 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 76 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 77 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 77 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 78 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 78 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 79 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 79 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 80 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 80 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 81 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 81 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 82 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 82 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 83 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 83 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 84 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 84 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 85 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 85 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 86 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 86 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 87 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 87 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 88 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 88 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 89 from INT to REAL on data row 10001, field contains '"foo"'
Bumping column 89 from REAL to STR on data row 10001, field contains '"foo"'
Bumping column 90 from INT to REAL on data row 10001, field contains '"foo"'

Again, the column it fails on is not consistent and does not move linearly with the number of rows or columns in the file, although it does seem to follow a rough trend.

The smallest case I can make fail consistently is 10,000 rows and 10 columns; the test above uses 100 columns to be more certain of triggering the failure.

In many cases the script survives the first complete fread() call and fails on the second, third or fourth call. If it survives that far, it generally seems to continue successfully out to at least 10 or even 100 fread() calls.

Most of my testing is with Rscript. Under RStudio I do get somewhat different behavior at times, but I believe RStudio reuses R sessions, so I don't trust it for this.
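To keep each attempt isolated from any reused session, every run can be launched in a brand-new Rscript process and judged by its exit status. The following is a minimal sketch using base R's `system2()`; the helper name `run_fresh` is hypothetical, and it assumes that a crashed process reports a non-zero exit status, which is the usual behavior but is not guaranteed to look the same on every platform.

```r
# Run a snippet of R code in a fresh Rscript process and return its exit status.
# Hypothetical helper; a crash is assumed to show up as a non-zero status.
run_fresh <- function(code) {
  rscript <- file.path(R.home("bin"), "Rscript")
  status <- system2(
    rscript,
    c("--vanilla", "-e", shQuote(code)),
    stdout = FALSE,  # discard the child's output
    stderr = FALSE
  )
  as.integer(status)
}

# Sanity check with a harmless snippet, repeated a few times.
statuses <- vapply(1:3, function(i) run_fresh("invisible(1 + 1)"), integer(1))
print(statuses)

# To exercise the crash, the snippet would be the fread() call from the test
# script, with the CSV file already generated in the working directory, e.g.:
# run_fresh("library(data.table); fread('test-64BitCrash.csv', header = TRUE, integer64 = 'double')")
```

Counting the non-zero entries over many runs gives a rough failure rate for a given combination of row and column counts.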

@arunsrinivasan
Member

Reproducible on my Mac: OS X 10.8.5, R v3.1.1, data.table v1.9.3 (commit 1401).

@mattdowle
Member

Fixed by 83a5b18 in v1.9.3.
Reproduced seg fault in v1.9.2 using your example and confirmed now ok in v1.9.3.
Thanks for the great report!
(Need to revisit this and do bumping better. Just a quick fix for now.)
