fread memory leak when small number bumped up to character #918

Closed
alexdeng opened this issue Oct 25, 2014 · 16 comments

@alexdeng

I have a 300 MB file with some small numbers in scientific notation (E-300), and after using fread it used more than 30 GB of my memory. After manually using colClasses to specify the column as character, the problem is gone and memory usage is normal. I suspect there is a memory leak somewhere when handling small numbers.

@mattdowle
Member

Which version of data.table please. In particular is it v1.9.4 or higher?

Fixed seg fault in sparse data files when bumping to character, #796 and #722. Thanks to Adam Kennedy and Richard Cotton for the detailed reproducible reports.

Although that was a crash rather than a memory leak, it could manifest itself as a memory leak as well. I'm guessing that the scientific notation is a red herring since reading of scientific notation is pretty stable.

Please provide sessionInfo() whenever you file issues. It saves time. Also please provide the output: fread(...,verbose=TRUE).

@mattdowle mattdowle added this to the v1.9.6 milestone Oct 25, 2014
@alexdeng
Author

Here is the sessioninfo

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.4

loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.3 reshape2_1.4 stringr_0.6.2 tools_3.1.1

Here is the output for a 130 MB file. It took 5 GB of my memory when fread finished. When I tried fread on another 300 MB file without using colClasses to preset the column to character, it ended up taking 30 GB of my memory.


Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.127929 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 'autostart') ... found ok
Found 13 columns
First row with 13 fields occurs on line 1 (either column names or first row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 742833
Subtracted 2 for last eol and any trailing empty lines, leaving 742831 data rows
Type codes ( first 5 rows): 4444441444333
Type codes ( middle 5 rows): 4444441444333
Type codes ( last 5 rows): 4444441444333
Type codes: 4444441444333 (after applying colClasses and integer64)
Type codes: 4444441444333 (after applying drop or select (if supplied)
Allocating 13 column slots (13 - 0 dropped)
Read 742831 rows and 13 (of 13) columns from 0.128 GB file in 00:00:21
0.016s ( 0%) Memory map (rerun may be quicker)
0.001s ( 0%) sep and header detection
0.757s ( 4%) Count rows (wc -l)
0.003s ( 0%) Column type detection (first, middle and last 5 rows)
0.892s ( 4%) Allocation of 742831x13 result (xMB) in RAM
18.681s ( 92%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.012s ( 0%) Changing na.strings to NA
20.362s Total
Warning message:
In fread("d:/largeData/FDOC/allmetrics_NoSeg_NoSRM_Subset.csv", :
C function strtod() returned ERANGE for one or more fields. The first was string input '1.97626258336499E-323'. It was read using (double)strtold() as numeric value 1.9762625833649862E-323 (displayed here using %.16E); loss of accuracy likely occurred. This message is designed to tell you exactly what has been done by fread's C code, so you can search yourself online for many references about double precision accuracy and these specific C functions. You may wish to use colClasses to read the column as character instead and then coerce that column using the Rmpfr package for greater accuracy.

@mattdowle
Member

Many thanks. No type bumps are happening then. That warning is telling you about loss of accuracy - I can't see why that would result in larger memory usage. Can you also provide the commands you run and the output of str(DT) (where DT is the data.table it creates)? How are you measuring its memory use? If you can provide str(DT) for both the large-allocation case and the OK one, then hopefully that will help (include the different fread commands you type so I can see which column you pass to colClasses). If you get the same problem with the file cut down to 1MB, then can you email it to me? Basically, provide as much information as you can. Thanks.
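
One way to gather the figures requested above is sketched here. It is only an illustration: `DT` is a hypothetical stand-in built in memory rather than the result of reading the reporter's file, and the numbers it prints say nothing about the reported problem. `object.size()` and `gc()` are base R; `tables()` is from data.table.

```r
library(data.table)

# Hypothetical stand-in for the table that fread() would return.
DT <- data.table(V1 = runif(1e5), V2 = runif(1e5))

# Column types and a preview of the values.
str(DT)

# Size of this one object as R accounts for it.
print(object.size(DT), units = "MB")

# Summary of the data.tables visible in the calling environment.
tables()

# Memory currently in use by the whole R session, after a collection.
gc()
```

Comparing the object size with the process size shown by the operating system indicates whether the extra memory belongs to the returned table or to something else in the session.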

@alexdeng
Author

I shared my data
https://dl.dropboxusercontent.com/u/7004672/debug.csv

Just run
dat = fread('debug.csv', verbose=TRUE)

I monitor the memory usage in Windows Task Manager, where I can see how much memory is held by the R session. On my machine it took 13 GB of memory after reading 8.9% of the 219 MB file, and it was very slow. I had to stop the R session; otherwise it could have taken all of my 36 GB of memory and crashed my machine.

If I specify both columns as character, then

dat = fread('debug.csv', verbose=TRUE, colClasses = list(character=c('V1','V2')))
It finished in about 90 seconds, taking 900 MB of my memory (the size of the dat returned by fread is 806.4 MB). But after as.numeric the size went down to 96 MB, as expected.
dat[, `:=`(c('V1','V2'), list(as.numeric(V1), as.numeric(V2)))]
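
The read-as-character workaround can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the temporary file `tmp` is a hypothetical two-row stand-in for debug.csv, and the `lapply(.SD, ...)` form is just one common way to write the coercion, not necessarily what was run above.

```r
library(data.table)

# Hypothetical stand-in for debug.csv: two numeric columns, with one value
# below the smallest normalised double (about 2.2e-308).
tmp <- tempfile(fileext = ".csv")
writeLines(c("V1,V2",
             "0.0400728476107927,0.801041935693612",
             "2.32741124362878E-309,0.763877007211593"), tmp)

# Read both columns as character so fread does no string-to-double conversion.
dat <- fread(tmp, colClasses = list(character = c("V1", "V2")))

# Coerce the columns to numeric afterwards, by reference.
cols <- c("V1", "V2")
dat[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

# Check the resulting column classes, then clean up the temporary file.
sapply(dat, class)
unlink(tmp)
```

The character columns are much larger in memory than the numeric ones, which is consistent with the drop from roughly 800 MB to 96 MB reported here.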

@alexdeng
Author

Matt could you repro this issue?

@mattdowle
Member

Thanks. Have downloaded and run but it works fine here on Linux using the same CRAN version (v1.9.4).

How strange! It's a really simple file (2 numeric columns). Maybe it is to do with the ERANGE warning then and since it works fine here on Linux maybe it's a Windows-only problem. Could you try removing some digits from the numbers in the file and see if it then works?

$ R
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(data.table)
Loading required package: data.table
data.table 1.9.4  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.

> DT = fread("debug.csv", verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.219189 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 6313724
Subtracted 2 for last eol and any trailing empty lines, leaving 6313722 data rows
Type codes (   first 5 rows): 33
Type codes (  middle 5 rows): 33
Type codes (    last 5 rows): 33
Type codes: 33 (after applying colClasses and integer64)
Type codes: 33 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 6313722 rows and 2 (of 2) columns from 0.219 GB file in 00:00:12
   0.051s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   1.166s ( 10%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.201s (  2%) Allocation of 6313722x2 result (xMB) in RAM
   9.927s ( 88%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
  11.344s        Total
Warning message:
In fread("debug.csv", verbose = TRUE) :
  C function strtod() returned ERANGE for one or more fields. The first was string input '2.32741124362878E-309'. It was read using (double)strtold() as numeric value 2.3274112436287792E-309 (displayed here using %.16E); loss of accuracy likely occurred. This message is designed to tell you exactly what has been done by fread's C code, so you can search yourself online for many references about double precision accuracy and these specific C functions. You may wish to use colClasses to read the column as character instead and then coerce that column using the Rmpfr package for greater accuracy.
> print(DT)
                   V1        V2
      1:   0.04007285 0.8010419
      2:   0.04898210 0.7638770
      3:  -0.07365259 0.8425065
      4: -59.74810854 0.1918155
      5: -39.97517367 0.3965352
     ---                       
6313718:   1.94000883 0.5497541
6313719:   0.11001585 0.5206822
6313720:   0.11050033 0.4940505
6313721:   0.06749008 0.3739320
6313722:   0.04381130 0.2168884
> sapply(DT,class)
       V1        V2 
"numeric" "numeric" 
> system("ls -lh debug.csv")
-rw-r----- 1 mdowle mdowle 225M Nov  1 09:23 debug.csv
> system("head debug.csv")
V1,V2
0.0400728476107927,0.801041935693612
0.0489820969563939,0.763877007211593
-0.0736525895727923,0.842506514813372
-59.7481085409406,0.191815544728967
-39.9751736658287,0.396535177786151
-0.0464704021731438,0.405312124283196
0.313994450640944,0.163031903374044
0.402107788498037,0.0678644932003186
0.21551724137931,0.177449963851469

> system("tail debug.csv")
-0.0350671334109038,0.84996384428055
0.148874493837568,0.851469131093495
-0.132688742284806,0.858676881345481
-0.0635038868127799,0.988171874680027
1.94000883097708,0.549754083953011
0.110015845109051,0.520682247269614
0.110500329886343,0.494050546140977
0.0674900801548572,0.373931998470873
0.043811304035503,0.216888407882552

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices datasets  stats     utils     methods   base     

other attached packages:
[1] data.table_1.9.4 bit64_0.9-4      bit_1.1-12      

loaded via a namespace (and not attached):
[1] chron_2.3-45  plyr_1.8.1    Rcpp_0.11.3   reshape2_1.4  stringr_0.6.2
> tables()
     NAME      NROW NCOL MB COLS  KEY
[1,] DT   6,313,722    2 97 V1,V2    
Total: 97MB
> 

@alexdeng
Author

Did you have a chance to try it on a windows machine?

@mattdowle mattdowle changed the title from "fread memory leak when small number bumpped up to character" to "fread memory leak when small number bumped up to character" Nov 15, 2014
@mattdowle
Member

I hunted online for similar issues but didn't get lucky.
Would have to debug it on Windows it seems, which I don't have. Postponing to next release for now.

@mattdowle mattdowle modified the milestones: v1.9.8, v1.9.6 Nov 15, 2014
@arunsrinivasan
Member

Yes, I can confirm the memory leak on Windows.

@braidm

braidm commented Jul 9, 2017

I'd love to see this fixed. Reading my 30MB file which contains some super small numbers leaves me with 5 GB of leaked memory on Windows, causing me to have to restart between executions. I could treat as character then coerce, but what a pain.

@st-pasha
Contributor

st-pasha commented Jul 10, 2017

@braidm Can you provide some more context please? How small are "super small numbers"? Say, can you post a few lines of your file as an example? Also, do you see the leak in the CRAN version, or in the latest dev (or both)?
The OP's test file is no longer accessible...

@braidm

braidm commented Jul 10, 2017

I cannot share my original file which has 160,000 rows, 28 fields, tab delimited, and leaks over 7GB whenever I read it (on disk it is 30MB). The leak does not happen if all numerics are larger than the machine limit of 2.225074e-308.
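
The limit mentioned above can be inspected from base R. The sketch below is illustrative only: it parses three strings quoted in the warnings in this thread with `as.numeric()`, which is not the code path fread uses, so it shows how the values relate to the double range rather than what fread does with them. On a typical x86-64 build the first two are expected to parse as subnormal values and the third to underflow to zero, but results for subnormals may differ by platform.

```r
# Smallest positive normalised double on IEEE-754 platforms.
.Machine$double.xmin

# Strings taken from the warning messages quoted in this thread.
x <- as.numeric(c("1.97626258336499E-323",
                  "2.32741124362878E-309",
                  "0.58E-2141"))

# TRUE where the parsed value is nonzero but below the normalised range.
is_subnormal <- x != 0 & abs(x) < .Machine$double.xmin

# Summarise how each string was parsed.
data.frame(value = x, subnormal = is_subnormal, zero = x == 0)
```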

However I created a test file which leaks about 300MB each time it is read. With repeated executions it will leak multiple GB. Here is the message from data.table

> leak <- fread("data/leak.txt",header=TRUE,sep="\t",quote="",stringsAsFactors = FALSE,select=c("pvalue"))
Warning message:
In fread("data/leak.txt", header = TRUE, sep = "\t", quote = "", :
C function strtod() returned ERANGE for one or more fields. The first was string input '0.58E-2141'. It was read using (double)strtold() as numeric value 0.0000000000000000E+00 (displayed here using %.16E); loss of accuracy likely occurred. This message is designed to tell you exactly what has been done by fread's C code, so you can search yourself online for many references about double precision accuracy and these specific C functions. You may wish to use colClasses to read the column as character instead and then coerce that column using the Rmpfr package for greater accuracy.

leak.txt

and here is the sessionInfo
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] tools_3.3.2

@st-pasha
Contributor

@braidm Thanks for providing the test data. I was able to verify that the "leak" occurs on a Windows machine with data.table 1.10.4. Well, actually I found that when I ran fread("leak.txt") in a loop, the process eats the memory at an enormous speed, quickly reaching 100% RAM utilization, at which point everything comes to a crawling speed. However, after the loop ends, the memory will slowly be reclaimed, eventually reaching the normal level. I suspect what happens is that on Windows the warning messages are somehow not being processed at the right time, which results in this observed behavior.

Notably, the same problem does not occur on a Linux machine or a macOS machine. The good news is that the problem doesn't occur with the latest development version of data.table on Windows either (simply because these ERANGE warnings are not emitted). You can install the latest version via

remove.packages("data.table")
install.packages("https://ci.appveyor.com/api/buildjobs/e27ehri93y0tkj6f/artifacts/data.table_1.10.5.zip", repos=NULL)

Please let me know if this solves your problem

@braidm

braidm commented Jul 10, 2017

I'm grateful to you for taking a look. I cannot try the dev version of data.table at this time. You seem to be suggesting the memory is not truly leaked, because it can and does get garbage collected. What is strange, then, is that in RStudio, when the memory has ballooned to 8GB after reading a 30MB file, a subsequent run pushes the memory demand yet higher, into the teens of GB, when a gc alone would have kept us OK. And if it's not a gc issue, how else would the memory be reclaimed? Thanks again!

@mattdowle
Member

Talking to Pasha we've realized where the 'leak' arises. Each and every warning is slightly different because it includes the string value in this part of the warning message: "The first was string input '0.58E-2141'." R adds each unique string to its global string cache regardless of what each string is needed for. So R's global string cache is growing with the huge number of warning messages being constructed here.
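
The effect of many near-identical messages can be illustrated from R itself. This is a rough sketch, not a reproduction of fread's C code: the message template and the count `n` are assumptions chosen for the example, and the strings are built with `sprintf()` rather than raised as warnings.

```r
# Build many warning-like strings that differ only in the embedded value,
# imitating (not reproducing) the per-field messages described above.
n <- 1e5
msgs <- sprintf("The first was string input '0.58E-%d'.", seq_len(n))

# Every element is distinct, so each needs its own entry in R's string cache.
all_unique <- length(unique(msgs)) == n
all_unique

# Approximate memory held by the strings while they are referenced.
print(object.size(msgs), units = "MB")

# Once nothing references them, the garbage collector can reclaim the space.
rm(msgs)
invisible(gc())
```

Because the strings are ordinary R objects, they become collectible once unreferenced, which fits the observation that memory was eventually reclaimed after the loop ended.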

dev is much better and the problem has gone away.

All that remains is to add a test to the test suite, and then this can be closed.

@mattdowle mattdowle modified the milestones: v1.10.6, Candidate Jul 13, 2017
@MichaelChirico
Member

Closed by #2451
