Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): Refactor decompression checks and add support for decompressing JSON #18536

Merged
merged 3 commits into from
Sep 11, 2024

Conversation

ohanf
Copy link
Contributor

@ohanf ohanf commented Sep 3, 2024

It looks like the bulk of the work for supporting on the fly decompression (#8323) was already implemented. However there appeared to be a couple of inconsistencies, some of which this PR attempts to address.

The first commit is refactoring the compression detection logic to only be performed in a single function, instead of checking prefix bytes in multiple places. It also moves maybe_decompress_bytes to live with other compression related code.
The second commit is a small change to add automatic decompression for JSON files. This essentially mirrors what is done for the Lazy JSON reader the CSV reader, adapted for use with simd_json

There does seem to be some extra unification work that could be done around CSV and Lazy reader decompression, but I wanted to submit the feature without making too many wide-sweeping changes. Another point of divergence is that decompression is only done for JSON on the rust side, as when reading ND-JSON from Python, the bindings call into the Lazy API which already supported decompression.

As a final note, I wasn't sure if the docs should be updated, especially since they were not when this was originally implemented. The docs probably should mention that any compressed file, read via eager or lazy methods, will read the entire file into memory. The underlying reason being that the polars parsing functions want their readers to be Seek which flate2 does not support (see rust-lang/flate2-rs#310)

@github-actions github-actions bot added fix Bug fix rust Related to Rust Polars labels Sep 3, 2024
Copy link

codspeed-hq bot commented Sep 3, 2024

CodSpeed Performance Report

Merging #18536 will not alter performance

Comparing vectra-ai-research:json-csv-decompression (5dbb175) with main (76a340b)

Summary

✅ 39 untouched benchmarks

Copy link

codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.92%. Comparing base (76a340b) to head (5dbb175).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18536       /-   ##
==========================================
- Coverage   79.92%   79.92%   -0.01%     
==========================================
  Files        1506     1506              
  Lines      203031   203030       -1     
  Branches     2891     2891              
==========================================
- Hits       162280   162266      -14     
- Misses      40201    40214       13     
  Partials      550      550              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46
Copy link
Member

Thank you @ohanf. This looks like an improvement to me.

Can you add a test on the python side, where we compress some json on the fly and decompress it. Please don't add any (compressed) data files.

@ohanf
Copy link
Contributor Author

ohanf commented Sep 9, 2024

I added some basic tests roughly based on the CSV ones, please let me know what else might be needed, thanks

@ritchie46 ritchie46 merged commit 79b91c3 into pola-rs:main Sep 11, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants