PERF: Series.fillna (that is part of DataFrame) with inplace=True high memory usage / memory leaking #46149

Closed
2 of 3 tasks
jukiewiczm opened this issue Feb 25, 2022 · 3 comments · Fixed by #46204
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version

jukiewiczm commented Feb 25, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I have recently migrated from pandas 1.1.3 to pandas 1.4.1 and I'm experiencing some memory-related issues. The code that used to work just fine is now crashing due to memory limitations.

Reproducible example:

import pandas as pd
import numpy as np

# Same shape as the second snippet below (500000 rows x 400 columns), all NaN
arr = np.full((500000, 400), np.nan)
df = pd.DataFrame(arr)

for col in df.columns:
    df[col].fillna(-1, inplace=True)

Interestingly enough, this snippet, which I would expect to have bigger memory requirements (as it's not inplace), works just fine:

import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan] * 400]*500000)

for col in df.columns:
    df[col] = df[col].fillna(-1)

The real code is obviously more complicated than the example, so I'd like to keep filling NAs column by column and doing this in place. Is that an unintended way of using fillna on a DataFrame's Series, or is it a bug?
I don't have any memory profiling that I could share, but I can try producing some if necessary. I've just noticed that the code works on the old pandas version and doesn't work on the new one, and I can see the memory usage growing in the system's resource monitor.
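For anyone who wants numbers rather than a resource monitor, here is a minimal profiling sketch using the standard-library `tracemalloc` module. The helper name `peak_bytes` and the reduced frame size are my own choices, not part of the report, and it assumes NumPy's array allocations are visible to `tracemalloc`. Results will differ between pandas versions, so no expected output is given.

```python
import tracemalloc
import warnings

import numpy as np
import pandas as pd


def peak_bytes(fill, n_rows=2_000, n_cols=50):
    """Peak traced allocation (bytes) while `fill` runs on a fresh all-NaN frame."""
    df = pd.DataFrame(np.full((n_rows, n_cols), np.nan))
    tracemalloc.start()
    try:
        with warnings.catch_warnings():
            # Newer pandas versions warn about chained in-place calls; ignore here.
            warnings.simplefilter("ignore")
            fill(df)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def fill_inplace(df):
    # Pattern from the first snippet: in-place fill on a column taken from the frame.
    for col in df.columns:
        df[col].fillna(-1, inplace=True)


def fill_reassign(df):
    # Pattern from the second snippet: fill a new Series and assign it back.
    for col in df.columns:
        df[col] = df[col].fillna(-1)


peak_inplace = peak_bytes(fill_inplace)
peak_reassign = peak_bytes(fill_reassign)
print(f"pandas {pd.__version__}")
print(f"inplace:  {peak_inplace / 1e6:.2f} MB peak")
print(f"reassign: {peak_reassign / 1e6:.2f} MB peak")
```

Running this under both pandas versions and comparing the two peaks should show whether the in-place loop is the one allocating extra memory.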

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.8.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.4.0
Version          : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.4.1
numpy            : 1.21.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 22.0.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6
jinja2           : 3.0.3
IPython          : 8.0.1
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.3.2
numba            : 0.53.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 2.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.3
sqlalchemy       : 1.4.11
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Prior Performance

In pandas 1.1.3 the first snippet works just fine.
Both snippets also seem to run far faster on the older pandas version.

@jukiewiczm jukiewiczm added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Feb 25, 2022
jbrockmendel (Member) commented

A potential solution is to change Series._update_inplace to check whether self._values is result._values; in that case it could short-circuit the cache clearing, in particular _maybe_update_cacher, which triggers a copy in the DataFrame.
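To illustrate the idea only, here is a toy model of that identity check in plain Python. `ToyFrame`, `ToyColumn`, `refresh_from` and the `short_circuit` flag are all hypothetical names invented for this sketch; none of this is pandas code, and the real `_update_inplace` may look quite different.

```python
class ToyFrame:
    """Hypothetical stand-in for a parent frame (not pandas internals)."""

    def __init__(self, data):
        self.data = data
        self.copies = 0  # how many times a column was copied back

    def refresh_from(self, name, values):
        # Stand-in for the expensive cache update: copies the values into the parent.
        self.data[name] = list(values)
        self.copies += 1


class ToyColumn:
    """Hypothetical column that shares its buffer with the parent frame."""

    def __init__(self, parent, name):
        self.parent = parent
        self.name = name
        self.values = parent.data[name]  # same list object as the parent holds

    def fill_none(self, fill, short_circuit):
        # Mutate the shared buffer in place, then report the result.
        for i, v in enumerate(self.values):
            if v is None:
                self.values[i] = fill
        self.update_inplace(self.values, short_circuit)

    def update_inplace(self, result_values, short_circuit):
        if short_circuit and result_values is self.values:
            return  # same buffer: the parent already sees the change
        self.values = result_values
        self.parent.refresh_from(self.name, result_values)


def run(short_circuit):
    frame = ToyFrame({c: [None, 1.0, None] for c in "abc"})
    for name in list(frame.data):
        ToyColumn(frame, name).fill_none(-1, short_circuit)
    return frame


without_check = run(short_circuit=False)
with_check = run(short_circuit=True)
print(without_check.copies, with_check.copies)  # 3 0
```

Both runs end with identical data; the only difference is that the identity check skips one copy per column.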


jukiewiczm commented Feb 28, 2022

@jbrockmendel Thanks for taking a look at this. Perhaps you know: what is the latest pandas version where that issue doesn't occur? I could switch to an older one instead of the newest one until that's fixed (assuming this will be considered a bug).

EDIT: Found it myself. It works just fine on pandas 1.3.5
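Besides pinning an older version, per-column fills can also be expressed at the DataFrame level by passing a dict to `DataFrame.fillna`. A small sketch follows; the column names and fill values are made up for illustration, and whether this form avoids the memory growth on 1.4.1 has not been verified in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different fill values per column.
df = pd.DataFrame(
    {"a": [np.nan, 1.0], "b": [2.0, np.nan], "c": [np.nan, np.nan]}
)

# Keys are column labels; columns not listed (here "c") are left untouched.
fill_values = {"a": -1, "b": 0}
filled = df.fillna(value=fill_values)
print(filled)
```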


pkuca commented Mar 2, 2022

Having the same problem. The logic I'm working on (which includes fillna()) tops out at 8GB of memory usage on 1.3.5, but an update to 1.4.1 caused OOMKills on my laptop with 16GB of memory.

@jreback jreback added this to the 1.4.2 milestone Mar 3, 2022
@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 8, 2022