Why is Dask `read_parquet` much slower than vanilla `fsspec.open` + Pandas `read_parquet` when loading Parquet from HDFS? #10735
Unanswered · daviddwlee84 asked this question in Q&A
Replies: 1 comment
-
It's hard to tell what's actually slow without looking at profiles or similar. For very large files we have a heuristic that splits the reading up into multiple chunks (based on Parquet row groups), so the fsspec path is one contiguous read while the Dask read is likely fragmented (but parallelized). This behavior is controlled by …
-
As the title says. I'm reading a roughly 6.8 GB Parquet file from an HDFS cluster, with the Python script running on a local machine. Using Dask's `read_parquet` and then
`compute()`
to get a Pandas DataFrame is much slower than `fsspec.open` + Pandas `read_parquet`.