fix(rust): Refactor decompression checks and add support for decompressing JSON #18536
It looks like the bulk of the work for supporting on-the-fly decompression (#8323) was already implemented. However, there appeared to be a couple of inconsistencies, some of which this PR attempts to address.
The first commit refactors the compression detection logic so that it is performed in a single function, instead of checking prefix bytes in multiple places. It also moves `maybe_decompress_bytes` to live with other compression-related code.

The second commit is a small change to add automatic decompression for JSON files. This essentially mirrors what is done for the Lazy JSON reader and the CSV reader, adapted for use with `simd_json`.
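To illustrate the idea of centralizing prefix-byte checks, here is a minimal sketch using only the standard library. The names `Compression` and `detect_compression` are hypothetical and are not taken from the polars code base; the magic bytes are the commonly documented prefixes for gzip, zlib, and zstd.

```rust
/// Compression formats recognised by their leading magic bytes.
/// (Hypothetical type for illustration, not the polars definition.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Zlib,
    Zstd,
}

/// Inspect the first bytes of a buffer and report which format, if any,
/// they indicate. Keeping this in one function means every reader asks
/// the same question in the same way.
pub fn detect_compression(bytes: &[u8]) -> Option<Compression> {
    // gzip streams start with 0x1f 0x8b.
    const GZIP: [u8; 2] = [0x1f, 0x8b];
    // Common zlib headers: 0x78 followed by a byte encoding the level.
    const ZLIB_NONE: [u8; 2] = [0x78, 0x01];
    const ZLIB_DEFAULT: [u8; 2] = [0x78, 0x9c];
    const ZLIB_BEST: [u8; 2] = [0x78, 0xda];
    // zstd frames start with the little-endian magic number 0xFD2FB528.
    const ZSTD: [u8; 4] = [0x28, 0xb5, 0x2f, 0xfd];

    if bytes.starts_with(&GZIP) {
        Some(Compression::Gzip)
    } else if bytes.starts_with(&ZLIB_NONE)
        || bytes.starts_with(&ZLIB_DEFAULT)
        || bytes.starts_with(&ZLIB_BEST)
    {
        Some(Compression::Zlib)
    } else if bytes.starts_with(&ZSTD) {
        Some(Compression::Zstd)
    } else {
        // Treat anything else as uncompressed input.
        None
    }
}
```

A caller could then branch on the returned value once, rather than repeating byte comparisons in each reader.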
There does seem to be some extra unification work that could be done around CSV and Lazy reader decompression, but I wanted to submit the feature without making too many wide-sweeping changes. Another point of divergence is that decompression is only done for JSON on the rust side, as when reading ND-JSON from Python, the bindings call into the Lazy API which already supported decompression.
As a final note, I wasn't sure if the docs should be updated, especially since they were not when this was originally implemented. The docs probably should mention that any compressed file, read via eager or lazy methods, will be read entirely into memory. The underlying reason is that the polars parsing functions want their readers to be `Seek`, which `flate2` does not support (see rust-lang/flate2-rs#310).
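The memory cost comes from how a `Read`-only stream is typically turned into something seekable. The following is a rough sketch using only the standard library; `buffer_for_seek` is a hypothetical helper, not a function from polars or `flate2`, and a decompressing reader is assumed to stand in for the generic `R`.

```rust
use std::io::{self, Cursor, Read};

/// Drain a reader that only implements `Read` into memory and wrap the
/// bytes in a `Cursor`, which implements `Read` and `Seek`.
/// The whole stream ends up in one `Vec<u8>`, hence the memory cost.
pub fn buffer_for_seek<R: Read>(mut reader: R) -> io::Result<Cursor<Vec<u8>>> {
    let mut buf = Vec::new();
    // Reads until EOF; for a decompressor this is the full decompressed size.
    reader.read_to_end(&mut buf)?;
    Ok(Cursor::new(buf))
}
```

The trade-off is simplicity against peak memory: the parser gets random access, but nothing can be streamed.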