PERF: Difference in using zipped pickle files #59279

buhtz · 2024-07-19T13:13:31Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The following code has four functions. One of them creates a sample data frame and writes it as a zipped pickle to the current working directory. The other three are different ways to unpickle that file again.

#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np

FN = 'df.pickle.zip'


def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10

    print('Create data frame')

    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]

    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()

    for _ in range(20):
        df = pd.concat([df, df_one])

    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')

    print(df.head())

    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)


def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')

    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')


def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')

    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))

    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')

    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # create_and_zip_pickle_data()
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

I am assuming this is not a bug, because pandas is "old" in the best sense of the word and well developed. There must be a good reason for this behavior.

Unpickling a data frame from a zip file is very slow (1 min 51 s in my example) compared to unzipping the pickle file into memory as an io.BytesIO() object and passing that to pandas.read_pickle() (6 s in my example).

In the example code above, the function unpickle_from_memory() demonstrates the fast way.
The slower ones are unpickle_via_pandas() and unpickle_zip_filehandle(). The latter might be an example of how pandas works internally with that zip file.

Here is the output from the script:

pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN

My question is: why is it that way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in unpickle_from_memory()?
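
For reference, the fast path from unpickle_from_memory() can be wrapped in a small reusable helper. This is only a sketch of the workaround, not a description of what pandas does internally. The name load_zipped_pickle and its loader and member parameters are made up for illustration; the default loader is the standard library's pickle.load so that the snippet runs without pandas.

```python
import io
import pickle
import zipfile


def load_zipped_pickle(path, loader=pickle.load, member=None):
    """Decompress one zip member fully into RAM, then unpickle it.

    `loader` (hypothetical parameter) is any callable that accepts a
    binary file object, for example pickle.load or pd.read_pickle.
    """
    with zipfile.ZipFile(path) as zf:
        # Default to the first archive member, as unpickle_from_memory() does.
        name = member if member is not None else zf.namelist()[0]
        buffer = io.BytesIO(zf.read(name))
    return loader(buffer)
```

With pandas this would be called as load_zipped_pickle('df.pickle.zip', loader=pd.read_pickle). The trade-off is memory: the decompressed pickle bytes are held in RAM alongside the object being built from them.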


phofl · 2024-07-29T11:03:44Z

Investigations are welcome, but I think you will have to dig into this yourself. These days we focus on other file formats such as Parquet and would generally recommend those to users.
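
One possible starting point for such an investigation is to count how often the unpickler touches the stream in each variant. The sketch below uses only the standard library and makes no claim about the cause of the slowdown. CountingReader and count_reads are hypothetical names, and the sketch assumes that pandas.read_pickle() ends up handing a file-like object to the pickle module, which has not been verified here.

```python
import io
import pickle
import zipfile
from collections import Counter


class CountingReader:
    """Proxy around a binary stream that tallies calls to its read methods."""

    _TRACKED = ('read', 'readline', 'readinto', 'peek')

    def __init__(self, raw):
        self._raw = raw
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        # Raises AttributeError if the wrapped stream lacks the attribute,
        # so optional methods such as peek() stay optional.
        attr = getattr(self._raw, name)
        if name not in self._TRACKED:
            return attr

        def counted(*args, **kwargs):
            self.calls[name] += 1
            return attr(*args, **kwargs)

        return counted


def count_reads(path, in_memory):
    """Unpickle the first zip member and return (object, call counts)."""
    with zipfile.ZipFile(path) as zf:
        name = zf.namelist()[0]
        if in_memory:
            # Fast variant: decompress everything into RAM first.
            raw = io.BytesIO(zf.read(name))
        else:
            # Slow variant: unpickle straight from the zip file handle.
            raw = zf.open(name)
        with raw:
            proxy = CountingReader(raw)
            obj = pickle.load(proxy)
    return obj, dict(proxy.calls)


# Example (assumes the file written by create_and_zip_pickle_data() exists):
# _, direct = count_reads('df.pickle.zip', in_memory=False)
# _, buffered = count_reads('df.pickle.zip', in_memory=True)
# print(direct, buffered)
```

If the two variants show very different call patterns, that would point at the interaction between the unpickler and the zip file handle rather than at pandas itself. If they look alike, the difference has to come from somewhere else.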

buhtz · 2024-07-29T11:17:55Z

I wouldn't ask that question if I hadn't tried to dig into the pandas code myself before.

phofl · 2024-07-29T11:52:36Z

I just wanted to warn you that this is probably not a priority for us.

buhtz · 2024-07-29T12:33:22Z

> I just wanted to warn you that this is probably not a priority for us

Thank you for clarifying that, Patrick. As a maintainer, I have always had good experiences with communicating such things (the priorities of the project and its maintainers) unsolicited and transparently. It eases communication and increases empathy on both sides.

buhtz added the labels Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) on Jul 19, 2024

MarcoGorelli added the label IO Pickle (read_pickle, to_pickle) and removed the label Needs Triage (Issue that has not been reviewed by a pandas team member) on Jul 25, 2024

buhtz changed the title ~~PERF:~~ PERF: Difference in using zipped pickle files Jul 25, 2024
