PERF: Series.fillna (that is part of DataFrame) with inplace=True high memory usage / memory leaking #46149

jukiewiczm · 2022-02-25T17:58:09Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I have recently migrated from pandas 1.1.3 to pandas 1.4.1 and I'm experiencing some memory-related issues. The code that used to work just fine is now crashing due to memory limitations.

Reproducible example:

import pandas as pd
import numpy as np

arr = np.full((500, 400), np.nan)
df = pd.DataFrame(arr)

for col in df.columns:
    df[col].fillna(-1, inplace=True)

Interestingly enough, this snippet, which I would expect to have bigger memory requirements (as it's not inplace), works just fine:

import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan] * 400]*500000)

for col in df.columns:
    df[col] = df[col].fillna(-1)

The real code is obviously more complicated than the example, so I'd like to keep filling NAs column by column and doing it in place. Is this an unintended way of using fillna on a DataFrame's Series, or is it a bug?
I don't have any memory profiling results to share, but I can try to produce some if necessary. I've only noticed that the code works on the old pandas version and doesn't work on the new one, and I can see the memory usage in the system's resource monitor.
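Since no profile is attached, here is a minimal sketch of how the two loops could be compared with the standard-library tracemalloc module. The helper names (peak_mib, fill_inplace, fill_assign) and the frame size are assumptions chosen for illustration, not part of the original report. tracemalloc only sees allocations made through Python's allocator hooks, so the numbers are an approximation of what a system resource monitor shows.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_mib(fill, n_rows=2_000, n_cols=50):
    """Return the peak traced allocation (MiB) while `fill` runs on an all-NaN frame."""
    # The frame is built before tracing starts, so only the fill loop is measured.
    df = pd.DataFrame(np.full((n_rows, n_cols), np.nan))
    tracemalloc.start()
    try:
        fill(df)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20


def fill_inplace(df):
    # Pattern from the first snippet: in-place fill on a column of the frame.
    for col in df.columns:
        df[col].fillna(-1, inplace=True)


def fill_assign(df):
    # Pattern from the second snippet: fill a copy and assign it back.
    for col in df.columns:
        df[col] = df[col].fillna(-1)


if __name__ == "__main__":
    print(f"pandas {pd.__version__}")
    print(f"inplace: {peak_mib(fill_inplace):.2f} MiB")
    print(f"assign:  {peak_mib(fill_assign):.2f} MiB")
```

Running the script under each pandas version of interest and comparing the two printed peaks should show whether the in-place loop is the one that grows. Newer pandas releases may emit a warning for the chained in-place call.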

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 06d230151e6f18fdb8139d09abf539867a8cd481
python           : 3.8.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.4.0
Version          : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.4.1
numpy            : 1.21.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 22.0.3
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6
jinja2           : 3.0.3
IPython          : 8.0.1
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.3.2
numba            : 0.53.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 2.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.3
sqlalchemy       : 1.4.11
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Prior Performance

In pandas 1.1.3 the first snippet works just fine.
Both snippets also seem to run far faster on the older pandas version.


jbrockmendel · 2022-02-26T03:35:51Z

A potential solution is to change Series._update_inplace to check whether self._values is result._values. If so, it could short-circuit the cache clearing, in particular _maybe_update_cacher, which triggers a copy in the DataFrame.
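The idea can be illustrated with a toy model. This is only a sketch of the identity check, not pandas internals: ToyFrame, ToyColumn, refresh_from, and the short_circuit flag are hypothetical names invented for this example, and the copy in refresh_from merely stands in for the expensive cache-update path.

```python
import numpy as np


class ToyFrame:
    """Hypothetical stand-in for a DataFrame; counts how often it copies."""

    def __init__(self, data):
        self.data = data
        self.copies = 0

    def refresh_from(self, col, values):
        # Stand-in for the expensive cache update: copies the whole block.
        self.data = self.data.copy()
        self.data[:, col] = values
        self.copies += 1


class ToyColumn:
    """Hypothetical stand-in for a Series that is a view into its parent."""

    def __init__(self, parent, col):
        self.parent = parent
        self.col = col
        self.values = parent.data[:, col]  # view, shares memory with the parent

    def fillna_inplace(self, value, short_circuit):
        result = self.values
        result[np.isnan(result)] = value  # writes through the view
        self.update_inplace(result, short_circuit)

    def update_inplace(self, result, short_circuit):
        # If the result is the very same array object, the parent already
        # holds the new values, so the cache update can be skipped.
        if short_circuit and result is self.values:
            return
        self.parent.refresh_from(self.col, result)
        self.values = self.parent.data[:, self.col]


frame = ToyFrame(np.full((4, 3), np.nan))
ToyColumn(frame, 0).fillna_inplace(-1, short_circuit=True)   # no copy
ToyColumn(frame, 1).fillna_inplace(-1, short_circuit=False)  # one full copy
print(frame.copies)
```

In this model, one full-block copy per column is what makes the loop over all columns expensive when the check is absent.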

jukiewiczm · 2022-02-28T00:16:02Z

@jbrockmendel Thanks for taking a look at this. Perhaps you know: what is the latest pandas version where that issue doesn't occur? I could switch to an older one instead of the newest one until that's fixed (assuming this will be considered a bug).

EDIT: Found it myself. It works just fine on pandas 1.3.5

pkuca · 2022-03-02T18:06:09Z

Having the same problem. The logic I'm working on (which includes fillna()) tops out at 8GB of memory usage on 1.3.5, but an update to 1.4.1 caused OOMKills on my laptop with 16GB of memory.

jukiewiczm added the Needs Triage and Performance labels Feb 25, 2022

jbrockmendel mentioned this issue Mar 2, 2022

PERF: avoid copies in frame[col].fillna(val, inplace=True) GH#46149 #46204 (merged)

jreback added this to the 1.4.2 milestone Mar 3, 2022

jreback closed this as completed in #46204 Mar 5, 2022

simonjayhawkins added the Missing-data and Regression labels and removed the Needs Triage label Mar 8, 2022

TendouArisu mentioned this issue Mar 1, 2024

Potential performance issue: .fillna memory issue in pandas below 1.4.2 version xingyaoww/code-act#3 (closed)
