PERF: Difference in using zipped pickle files #59279

Open
2 of 3 tasks
buhtz opened this issue Jul 19, 2024 · 4 comments
Labels: IO Pickle (read_pickle, to_pickle); Performance (Memory or execution speed performance)


buhtz commented Jul 19, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

The following script defines four functions. One creates a sample data frame and pickles it, zip-compressed, to the current working directory. The other three are different ways to unpickle that file again.

#!/usr/bin/env python3
import io
import zipfile
from datetime import datetime
import pandas as pd
import numpy as np

FN = 'df.pickle.zip'


def create_and_zip_pickle_data():
    num_rows = 1_000_000
    num_cols = 10

    print('Create data frame')

    int_data = np.random.randint(0, 100, size=(num_rows, num_cols // 2))
    str_choices = np.array(['Troi', 'Crusher', 'Yar', 'Guinan'])
    str_data = np.random.choice(str_choices, size=(num_rows, num_cols // 2))
    columns = [f'col_{i}' for i in range(num_cols)]

    df = pd.DataFrame(np.hstack((int_data, str_data)), columns=columns)
    df_one = df.copy()

    for _ in range(20):
        df = pd.concat([df, df_one])

    df = df.reset_index()
    df['col_2'] = df['col_2'].astype('Int16')
    df['col_4'] = df['col_4'].astype('Int16')
    df['col_5'] = df['col_5'].astype('category')
    df['col_7'] = df['col_7'].astype('category')
    df['col_9'] = df['col_9'].astype('category')

    print(df.head())

    print(f'Pickle {len(df):n} rows')
    df.to_pickle(FN)


def unpickle_via_pandas():
    timestamp = datetime.now()
    print('Unpickle with pandas')

    df = pd.read_pickle(FN)
    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')


def unpickle_from_memory():
    timestamp = datetime.now()
    print('Unpickle after unzipped into RAM')

    # Unzip into RAM
    print('Unzip into RAM')
    with zipfile.ZipFile(FN) as zf:
        stream = io.BytesIO(zf.read(zf.namelist()[0]))

    # Unpickle from RAM
    print('Unpickle from RAM')
    df = pd.read_pickle(stream)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

def unpickle_zip_filehandle():
    timestamp = datetime.now()
    print('Unpickle with zip filehandle')

    with zipfile.ZipFile(FN) as zf:
        with zf.open('df.pickle') as handle:
            print('Unpickle from filehandle')
            df = pd.read_pickle(handle)

    duration = datetime.now() - timestamp

    print(f'{len(df):n} rows. Duration {duration}.')

if __name__ == '__main__':
    print(f'{pd.__version__=}')
    # Uncomment on the first run to create the zipped pickle file.
    # create_and_zip_pickle_data()
    print('-'*20)
    unpickle_from_memory()
    print('-'*20)
    unpickle_via_pandas()
    print('-'*20)
    unpickle_zip_filehandle()
    print('-'*20)
    print('FIN')

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

I am assuming this is not a bug, because pandas is "old" in the best sense of the word and well developed. There must be a good reason for this behavior.

Unpickling a data frame directly from a zip file is very slow (1 min 51 s in my example) compared to first unzipping the pickle file into memory as an io.BytesIO() object and passing that to pandas.read_pickle() (6 s in my example).

In the example code above, the function unpickle_from_memory() demonstrates the fast way.
The slower ones are unpickle_via_pandas() and unpickle_zip_filehandle(). The latter might illustrate how pandas works internally with that zip file.
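The fast path can be wrapped in a small reusable helper. This is only a sketch of the workaround from unpickle_from_memory(), not something pandas provides: the function name read_pickle_from_zip, its member parameter, and the tiny sample frame and file name are all made up for illustration.

```python
import io
import os
import tempfile
import zipfile

import numpy as np
import pandas as pd


def read_pickle_from_zip(path, member=None):
    """Decompress one archive member fully into RAM, then unpickle it.

    `member` is a hypothetical convenience parameter. When it is omitted,
    the first entry of the archive is used, as in unpickle_from_memory().
    """
    with zipfile.ZipFile(path) as zf:
        name = member if member is not None else zf.namelist()[0]
        buffer = io.BytesIO(zf.read(name))
    # A BytesIO object is not a path, so pandas applies no decompression here.
    return pd.read_pickle(buffer)


# Round trip on a deliberately tiny frame (not the 21 million rows above).
small = pd.DataFrame({
    'number': np.arange(1_000, dtype='int64'),
    'name': pd.Categorical(['Troi', 'Crusher', 'Yar', 'Guinan'] * 250),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'small_df.pickle.zip')
    small.to_pickle(path)  # compression is inferred from the ".zip" suffix
    restored = read_pickle_from_zip(path)

print(restored.equals(small))
```

The trade-off is memory: the complete decompressed pickle is held in RAM in addition to the data frame built from it.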

Here is the output from the script:

pd.__version__='2.2.2'
--------------------
Unpickle after unzipped into RAM
Unzip into RAM
Unpickle from RAM
21000000 rows. Duration 0:00:06.289123.
--------------------
Unpickle with pandas
21000000 rows. Duration 0:01:51.749488.
--------------------
Unpickle with zip filehandle
Unpickle from filehandle
21000000 rows. Duration 0:01:50.909909.
--------------------
FIN

My question is: why is that? Wouldn't pandas be faster and more efficient if it used the method demonstrated in unpickle_from_memory()?
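One way to start looking for the reason is to profile the slow call and see which functions accumulate the time. The following is a minimal sketch using the standard library's cProfile; the helper name profile_read_pickle, the probe file name, and the small test frame are assumptions for illustration, and the sketch makes no claim about what the profile will show for the large file.

```python
import cProfile
import io
import os
import pstats
import tempfile

import numpy as np
import pandas as pd


def profile_read_pickle(path, top=10):
    """Profile pd.read_pickle(path) and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    pd.read_pickle(path)
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats('cumulative').print_stats(top)
    return report.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'probe.pickle.zip')
    frame = pd.DataFrame({'value': np.arange(100_000)})
    frame.to_pickle(path)  # zip compression inferred from the suffix
    report_text = profile_read_pickle(path)

print(report_text)
```

Pointing profile_read_pickle() at the large df.pickle.zip from the script above, and comparing the report with one for the in-memory variant, should indicate whether the time is spent in zipfile, in pickle, or in pandas itself.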

buhtz added the Needs Triage and Performance labels on Jul 19, 2024
MarcoGorelli added the IO Pickle label and removed the Needs Triage label on Jul 25, 2024
buhtz changed the title from "PERF:" to "PERF: Difference in using zipped pickle files" on Jul 25, 2024
phofl (Member) commented Jul 29, 2024

Investigations are welcome, but I think you will have to dig into this yourself. We focus on other file formats such as Parquet these days and would generally recommend those to users.

buhtz (Author) commented Jul 29, 2024

I wouldn't ask that question if I hadn't tried to dig into the pandas code myself before.

phofl (Member) commented Jul 29, 2024

I just wanted to warn you that this is probably not a priority for us

buhtz (Author) commented Jul 29, 2024

> I just wanted to warn you that this is probably not a priority for us

Thank you for clarifying that, Patrick. As a maintainer, I have always had good experiences with communicating such things (the priorities of the project and its maintainers) unsolicited and transparently. This eases communication and increases empathy on both sides.
