
Iterating over zip_disk.frame to beat 'Possible truncation of >= 4GB file' #345

Open

lime-n opened this issue Jul 1, 2021 · 6 comments

@lime-n commented Jul 1, 2021

I keep getting this warning when unzipping a 174GB .zip file:

possible truncation of >= 4GB file

I know that it's an R problem, but is there a way around it? Otherwise, could the .zip decompression be looped over until the file is complete, similar to the in_chunk_size part of the function?

That is, where 4GB is decompressed and stored, then the next 4GB is decompressed and stored, and so on, continuing from the previous decompression until the whole file is complete?

@xiaodaigh (Collaborator)

I am not quite sure what the issue is based on your description. The code for zip_to_disk.frame is quite short. If you figure out what's wrong, can you make a PR?

@lime-n (Author) commented Jul 1, 2021

Here is the code I have used:

library(disk.frame)
library(tidyverse)
library(data.table)

data_dir <- "data"
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

setup_disk.frame(workers = 10)
options(future.globals.maxSize = Inf)  # note: the option name is future.globals.maxSize

zi.fl <- zip_to_disk.frame(zipfile = "species.zip", outdir = data_dir) %>%
  rbindlist.disk.frame() %>%
  filter(year > 2010)

dat <- as.data.frame(zi.fl)

My .zip file is 174GB and the file inside is about 800GB. My assumption is that it does not retrieve all the data and stops decompressing the zip at 4GB; am I right here? The dataset I get back is far too small for an 800GB file: the full file has about 1.5bn occurrences, yet I retrieve only about 5 million. Bird data alone would give me around 5 million for this range, so it must be cutting off at 4GB, right?
If so, I was wondering if there is a way to overcome this.

I downloaded the simple dataset from here: https://www.gbif.org/occurrence/download?occurrence_status=present
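One rough way to check this (an untested sketch; the paths are placeholders) is to compare the uncompressed size the archive reports with what unzip() actually writes to disk:

info <- unzip("species.zip", list = TRUE)    # Name, Length (uncompressed bytes), Date
info$Length                                  # what the archive says the file should be

unzip("species.zip", exdir = "data_raw")     # this is where the >= 4GB warning appears
file.size(file.path("data_raw", info$Name))  # if this stops near 4GB, the extract truncated

Note that info$Length itself may be misreported for very large (zip64) entries, so treat it only as a rough check.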

@xiaodaigh (Collaborator)

This is the code for the function. See if you can run it line by line and figure out what's wrong. You might want to do

setup_disk.frame(workers = 1)

for debugging.

zip_to_disk.frame = function(zipfile, outdir, ..., validation.check = FALSE, overwrite = TRUE) {
  files = unzip(zipfile, list=TRUE)
  
  fs::dir_create(outdir)
  
  tmpdir = tempfile(pattern = "tmp_zip2csv")
  
  dotdotdots = list(...)
  
  dfs = future.apply::future_lapply(files$Name, function(fn) {
  #dfs = lapply(files$Name, function(fn) {
    outdfpath = file.path(outdir, fn)
    overwrite_check(outdfpath, TRUE)
    unzip(zipfile, files = fn, exdir = tmpdir)  # base unzip(); this is where the >= 4GB truncation can occur
    
    # lift the domain of csv_to_disk.frame so it accepts a list
    cl = purrr::lift(csv_to_disk.frame)
    
    ok = c(
      list(infile = file.path(tmpdir, fn), outdir = outdfpath, overwrite = overwrite),
      dotdotdots)
    
    #csv_to_disk.frame(, outdfpath, overwrite = overwrite, ...)
    cl(ok)
  }, future.seed=TRUE)

  dfs  
}
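For reference, the dots are forwarded to csv_to_disk.frame, so once the extraction itself works you should be able to control the chunked CSV reading with something like this (an untested sketch; the chunk size is a guess):

library(disk.frame)

setup_disk.frame(workers = 1)  # single worker makes debugging easier

# in_chunk_size is passed through ... to csv_to_disk.frame; it controls how the
# extracted CSV is read in chunks, it does not change how unzip() extracts the archive
dfs = zip_to_disk.frame(
  zipfile = "species.zip",
  outdir  = "data",
  in_chunk_size = 1e7
)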

@lime-n (Author) commented Jul 1, 2021

I think it is an inherent problem with R, where decompression of a file larger than 4GB is truncated, so I was thinking about how to overcome that. My thought was to iterate over
files = unzip(zipfile, list = TRUE), so that after each 4GB truncation it decompresses the next portion, until completion.

Something like n + 1 iterations, though I wouldn't be sure how to tell the program 'this portion of data has already been decompressed, skip it and do the next 4GB'.
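Roughly, what I have in mind is something like this (untested), although base unzip() cannot resume partway through a single entry, so it only helps when the archive contains several files rather than one huge CSV:

files = unzip("species.zip", list = TRUE)

for (i in seq_len(nrow(files))) {
  target = file.path("data_raw", files$Name[i])
  # skip entries that already extracted to the size the archive reports
  if (file.exists(target) && isTRUE(file.size(target) == files$Length[i])) next
  unzip("species.zip", files = files$Name[i], exdir = "data_raw")
}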

@lime-n (Author) commented Jul 4, 2021


I know that the following can bypass the unzip limit:

vroom::vroom(archive::archive_read("my_file.zip"))

So using archive instead of base unzip may be a better alternative for anyone whose data inside a .zip is larger than 4GB. Perhaps something to think about?
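For example, something along these lines (an untested sketch; the chunk size, the intermediate CSV name, and the assumption of a year column and comma delimiter are mine) would stream the file straight out of the zip in chunks, filter each chunk, and only then build the disk.frame, so the full 800GB is never extracted in one go:

library(archive)
library(readr)
library(dplyr)
library(data.table)
library(disk.frame)

out_csv <- "species_filtered.csv"

# archive_read() handles zip64 archives, so the 4GB limit of base unzip() does not apply;
# use read_tsv_chunked() instead if the GBIF export is tab-delimited
read_csv_chunked(
  archive_read("species.zip"),
  callback = function(chunk, pos) {
    keep <- filter(chunk, year > 2010)
    fwrite(keep, out_csv, append = pos != 1)  # write the header only for the first chunk
  },
  chunk_size = 1e6
)

# the filtered CSV is far smaller, so csv_to_disk.frame can ingest it normally
df <- csv_to_disk.frame(out_csv, outdir = "species_filtered.df", overwrite = TRUE)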

@xiaodaigh (Collaborator)

Nice. I will keep that in mind.
