PERF: Difference in using zipped pickle files #59279
Comments
Investigations are welcome, but I think you will have to dig into this yourself. We focus on other file formats like Parquet these days and would generally recommend those to users.
I wouldn't have asked that question if I hadn't already tried to dig into the pandas code myself.
I just wanted to warn you that this is probably not a priority for us.
Thank you for clarifying that, Patrick. As a maintainer, I have always had good experiences with communicating such things (the priorities of the project and the maintainers) unsolicited and transparently. This eases communication and increases empathy on both sides.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
The following code has four functions. One of them creates a sample data frame and zip-pickles it to the current working directory. The other three are different ways to unpickle that file again.
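The original script is not preserved in this copy of the issue. As a rough sketch only, the four functions described above might look something like the following. The names unpickle_via_pandas(), unpickle_zip_filehandle() and unpickle_from_memory() come from the issue text; the helper name create_zipped_pickle(), the file name, the column layout and the row count are assumptions made for illustration.

```python
import io
import pathlib
import zipfile

import pandas

# Hypothetical file name; the name used in the original script is not shown.
ZIP_PATH = pathlib.Path("sample.pickle.zip")


def create_zipped_pickle(n_rows=1_000_000, path=ZIP_PATH):
    """Create a sample data frame and zip-pickle it (hypothetical helper)."""
    df = pandas.DataFrame(
        {"a": range(n_rows), "b": [str(i) for i in range(n_rows)]}
    )
    df.to_pickle(path, compression="zip")
    return df


def unpickle_via_pandas(path=ZIP_PATH):
    """Let pandas open the zip archive and unpickle its content."""
    return pandas.read_pickle(path, compression="zip")


def unpickle_zip_filehandle(path=ZIP_PATH):
    """Hand pandas a file handle that decompresses while it is being read."""
    with zipfile.ZipFile(path) as zf:
        # The archive is assumed to contain exactly one member.
        with zf.open(zf.namelist()[0]) as handle:
            return pandas.read_pickle(handle)


def unpickle_from_memory(path=ZIP_PATH):
    """Decompress the whole member into memory first, then unpickle."""
    with zipfile.ZipFile(path) as zf:
        raw = zf.read(zf.namelist()[0])
    return pandas.read_pickle(io.BytesIO(raw))
```

To compare the variants, one could call create_zipped_pickle() once and then wrap each of the three readers in time.perf_counter() calls. The timings will depend on the data and the machine, so they are not expected to match the numbers reported in this issue.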
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.11.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
I am assuming this is not a bug, because pandas is "old" in the best sense of the word and well developed. There must be a good reason for this behavior.
Unpickling a data frame from a zip file is very slow (1 min 51 s in my example) compared to unzipping the pickle file into memory as an io.BytesIO() object and passing that object to pandas.read_pickle() (6 s in my example).
In the example code, the function unpickle_from_memory() demonstrates the fast way.
The slower ones are unpickle_via_pandas() and unpickle_zip_filehandle(). The latter might be an example of how pandas works internally with that zip file.
Here is the output from the script:
My question is: why is it that way? Wouldn't pandas be faster and more efficient if it used the method demonstrated in unpickle_from_memory()?