Why is Dask `read_parquet` much slower than vanilla `fsspec.open` + Pandas `read_parquet` when loading Parquet from HDFS? #10735
Unanswered · daviddwlee84 asked this question in Q&A
Replies: 1 comment
-
It's hard to tell what's actually slow without looking at profiles or similar. For very large files we have a heuristic that splits the reading up into multiple chunks (based on Parquet row groups), so the fsspec path is one contiguous read while the Dask read is likely fragmented (but parallelized). This behavior is controlled by …
-
As the title says. I'm reading a roughly 6.8 GB Parquet file from an HDFS cluster, with the Python script running on a local machine. Using Dask's `read_parquet` and then
`compute()`
to get a Pandas DataFrame is much slower than `fsspec.open` + Pandas `read_parquet`.