BUG: pd.testing.assert_frame_equal(..., check_exact=True) raises assertion error if any columns are not numeric #35446

Closed
2 of 3 tasks
ivirshup opened this issue Jul 29, 2020 · 8 comments · Fixed by #35522
Labels: Bug; Regression (functionality that used to work in a prior pandas version); Testing (pandas testing functions or related to the test suite)

Comments

@ivirshup
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd
a = pd.DataFrame({"a": list("bc"), "b": [1, 2]})
pd.testing.assert_frame_equal(a, a.copy(), check_exact=True)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-2-bac78c400991> in <module>
      1 import pandas as pd
      2 a = pd.DataFrame({"a": list("bc"), "b": [1, 2]})
----> 3 pd.testing.assert_frame_equal(a, a.copy(), check_exact=True)

    [... skipping hidden 2 frame]

AssertionError: check_exact may only be used with numeric Series

Problem description

Previously, the example above did not raise an error. The function becomes less useful if I have to start subdividing dataframes by dtype just to check that they are equal. Presumably either check_exact should have no effect when called on a non-numeric Series, or assert_frame_equal should not pass check_exact=True along when the Series being compared isn't numeric.
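
For illustration, the "subdividing by types" workaround mentioned above might look like the sketch below. The helper name `assert_frame_equal_split_by_dtype` is hypothetical and not part of pandas; it only assumes `DataFrame.select_dtypes` and the public `pd.testing` functions.

```python
import pandas as pd


def assert_frame_equal_split_by_dtype(left, right):
    """Hypothetical workaround: exact comparison for numeric columns only."""
    # Both frames must have the same columns in the same order.
    pd.testing.assert_index_equal(left.columns, right.columns)

    # Numeric columns are compared exactly.
    pd.testing.assert_frame_equal(
        left.select_dtypes(include="number"),
        right.select_dtypes(include="number"),
        check_exact=True,
    )
    # Everything else uses the default comparison (check_exact is not passed).
    pd.testing.assert_frame_equal(
        left.select_dtypes(exclude="number"),
        right.select_dtypes(exclude="number"),
    )


a = pd.DataFrame({"a": list("bc"), "b": [1, 2]})
assert_frame_equal_split_by_dtype(a, a.copy())
```

This is the kind of extra bookkeeping the report argues callers should not need.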

It looks like this was introduced recently, in 08c6597

Expected Output

I expected it to work like it did previously, and not throw an error. Here's an example of test failures this is causing.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.5.0
Version          : Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0
numpy            : 1.19.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 49.2.0
Cython           : None
pytest           : 5.4.3
hypothesis       : None
sphinx           : 3.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.16.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.7.4
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.0
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.17.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.1
sqlalchemy       : 1.3.18
tables           : 3.6.1
tabulate         : None
xarray           : 0.16.0
xlrd             : 1.2.0
xlwt             : None
numba            : 0.50.1
ivirshup added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Jul 29, 2020
ivirshup added a commit to ivirshup/anndata that referenced this issue Jul 29, 2020
See pandas-dev/pandas#35446 for details.
Basically broke how we compare dataframes in tests.
@simonjayhawkins
Member

@ivirshup Thanks for the report.

> It looks like this was introduced recently, in 08c6597

#32513 cc @jbrockmendel

simonjayhawkins added the Regression (functionality that used to work in a prior pandas version) and Testing (pandas testing functions or related to the test suite) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on Jul 29, 2020
simonjayhawkins added this to the 1.1.1 milestone on Jul 29, 2020
ivirshup added a commit to scverse/anndata that referenced this issue Jul 29, 2020
See pandas-dev/pandas#35446 for details.
Basically broke how we compare dataframes in tests.
@TomAugspurger
Contributor

TomAugspurger commented Jul 29, 2020

Agreed that check_exact should just have no effect for non-numeric dtypes.

@ivirshup are you interested in submitting a PR to update it?

@ivirshup
Contributor Author

Sure! I'm having some trouble finding where to put tests though. Turns out assert_frame_equal occurs a lot in the test suite!

Could you point me to that?

@simonjayhawkins
Member

\pandas\tests\util\test_assert_frame_equal.py
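
A regression test in that file could be sketched roughly as follows. The test names are hypothetical, and this is not necessarily what the eventual fix added. The first test is expected to fail on pandas 1.1.0 and pass on versions where check_exact is ignored for non-numeric columns.

```python
import pandas as pd


def test_check_exact_with_non_numeric_columns():
    # Mixed object/int frame from the original report.
    df = pd.DataFrame({"a": list("bc"), "b": [1, 2]})
    pd.testing.assert_frame_equal(df, df.copy(), check_exact=True)


def test_check_exact_still_catches_numeric_differences():
    # check_exact=True should keep flagging tiny float differences.
    left = pd.DataFrame({"a": list("bc"), "b": [1.0, 2.0]})
    right = pd.DataFrame({"a": list("bc"), "b": [1.0, 2.0 + 1e-9]})
    try:
        pd.testing.assert_frame_equal(left, right, check_exact=True)
    except AssertionError:
        return
    raise AssertionError("expected check_exact=True to flag the float difference")
```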

@cloud-rocket

I am comparing 2 data frames indexed by timestamp and getting the following error (works perfectly well in pandas 1.0.5):

E       AssertionError: (<Second>, None)

Can it be related?

@ivirshup
Contributor Author

ivirshup commented Aug 5, 2020

It looked to me like the assert_equal methods got a refactor in 1.1. I don't think it's the same bug; it probably just got introduced around the same time.

HyukjinKwon pushed a commit to apache/spark that referenced this issue Jun 7, 2021
…ting

### What changes were proposed in this pull request?

Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions.

### Why are the changes needed?

`pd.testing` utils are utilized in pandas-on-Spark tests.
Due to pandas-dev/pandas#35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`.  We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #32772 from xinrong-databricks/test_util.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
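
The adjustment described in that commit message can be approximated with a small wrapper. This is a sketch under assumptions, not the actual Spark code: the helper name `assert_series_equal_compat` is hypothetical, and it simply turns check_exact off whenever either Series is non-numeric, so the call behaves the same on pandas versions with and without the regression.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def assert_series_equal_compat(left, right, check_exact=True):
    """Hypothetical wrapper: only request exact checks for numeric data."""
    both_numeric = is_numeric_dtype(left.dtype) and is_numeric_dtype(right.dtype)
    if check_exact and not both_numeric:
        # Affected pandas versions raise for non-numeric Series when
        # check_exact=True, so fall back to the default comparison.
        check_exact = False
    pd.testing.assert_series_equal(left, right, check_exact=check_exact)


# Non-numeric data: check_exact is silently dropped.
assert_series_equal_compat(pd.Series(list("bc")), pd.Series(list("bc")))
# Numeric data: compared exactly.
assert_series_equal_compat(pd.Series([1, 2]), pd.Series([1, 2]))
```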
@devProdigy

btw, if assert_frame_equal gives you something like AssertionError: (<10 * Minutes>, None), you can:

  1. set the frequency of the index (if you use time-series aggregated datetimes as your index). I lost it after reading the expected dataframe from the .csv. For example:
df = pd.read_csv(file_path, header=0, index_col=0)
df.index = pd.to_datetime(df.index)  # set up a datetime index to match dtypes
df.index.freq = "10T"  # set the same frequency so the dataframes match
  2. or pass check_like=True to the assert_frame_equal function. But I'm not sure that this is the best idea when you're strictly comparing dataframes :)
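
A self-contained sketch of the first option, using an in-memory CSV round trip instead of a real file (the data and variable names are made up for illustration). It uses the alias "10min" rather than "10T", since newer pandas releases deprecate the "T" spelling.

```python
import io

import pandas as pd

# Expected frame with a regular 10-minute DatetimeIndex.
index = pd.date_range("2021-01-01", periods=4, freq="10min")
expected = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# Round trip through CSV: the index comes back as plain strings.
buffer = io.StringIO(expected.to_csv())
loaded = pd.read_csv(buffer, header=0, index_col=0)

# Rebuild a DatetimeIndex with the same dtype as the expected index.
loaded.index = pd.to_datetime(loaded.index).astype(expected.index.dtype)
freq_after_loading = loaded.index.freq  # the frequency is not restored

# Restore the frequency so both indexes carry the same metadata.
loaded.index.freq = "10min"
pd.testing.assert_frame_equal(loaded, expected)
```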

@kevinsqi

kevinsqi commented Jul 5, 2022

@devProdigy's comment is also the problem I was running into (though not strictly related to the original issue). Looks like check_freq=False (available in 1.1.0) may also do the trick.
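
A minimal sketch of the check_freq=False route, with made-up data: the two frames differ only in whether the index carries a frequency.

```python
import pandas as pd

with_freq = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.date_range("2021-01-01", periods=3, freq="10min"),
)

# Same timestamps, but built from a raw array so no frequency is attached.
without_freq = with_freq.copy()
without_freq.index = pd.DatetimeIndex(with_freq.index.to_numpy())

# Ignore the freq attribute of the index when comparing (pandas >= 1.1.0).
pd.testing.assert_frame_equal(with_freq, without_freq, check_freq=False)
```

With the default check_freq=True, the same comparison fails on the frequency mismatch even though every value is equal.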
