DEPR: Merging on different number of levels #34862

Closed
2 tasks done
dinya opened this issue Jun 18, 2020 · 7 comments · Fixed by #40993
Labels
Deprecate (functionality to remove in pandas), Reshaping (concat, merge/join, stack/unstack, explode)
Milestone

Comments

@dinya

dinya commented Jun 18, 2020

Two dataframes with MultiIndex in columns (shapes are different).

import pandas as pd

df1_columns = pd.MultiIndex.from_tuples([("A0", "B0", "C0"), ("A1", "B1", "C1")])
df1 = pd.DataFrame([[1, 2], [10, 20]], columns=df1_columns)

df2_columns = pd.MultiIndex.from_tuples([("X0", "Y0"), ("X1", "Y1")])
df2 = pd.DataFrame([[1, 200], [10, 200]], columns=df2_columns)

Case 1

pd.merge(df1, df2, left_on=[("A0", "B0", "C0")], right_on=[("X0", "Y0")])

returns

   (A0, B0, C0)  (A1, B1, C1)  (X0, Y0)  (X1, Y1)
0             1             2         1       200
1            10            20        10       200

This is fine. The columns keep their structure from both DataFrames. They are flattened to tuples, but that can be fixed (for example with something like this).
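A minimal sketch of one way to turn such tuple columns back into a MultiIndex, assuming the shorter tuples should be padded with empty strings. The flat index is built by hand here (with `tupleize_cols=False`) so the snippet does not depend on how a particular pandas version handles the merge itself. `tuples_to_multiindex` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def tuples_to_multiindex(columns, fill=""):
    """Pad tuple labels to a common length and build a MultiIndex.

    Hypothetical helper for illustration; `fill` is an assumed padding value.
    """
    labels = [c if isinstance(c, tuple) else (c,) for c in columns]
    depth = max(len(label) for label in labels)
    padded = [label + (fill,) * (depth - len(label)) for label in labels]
    return pd.MultiIndex.from_tuples(padded)


# A flat object Index of tuples, mimicking the column labels shown in Case 1.
flat = pd.Index(
    [("A0", "B0", "C0"), ("A1", "B1", "C1"), ("X0", "Y0"), ("X1", "Y1")],
    tupleize_cols=False,
)
merged = pd.DataFrame([[1, 2, 1, 200], [10, 20, 10, 200]])
merged.columns = flat

# Replace the tuple labels with a three-level MultiIndex.
merged.columns = tuples_to_multiindex(merged.columns)
```

After this step the two-level labels carry an empty string in the third level, so `merged[("X1", "Y1", "")]` selects the former `("X1", "Y1")` column.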

Case 2

pd.merge(df2, df1, left_on=[("X0", "Y0")], right_on=[("A0", "B0", "C0")])

returns

   X0   X1  A0  A1
   Y0   Y1  B0  B1
0   1  200   1   2
1  10  200  10  20

When the left DataFrame has fewer column levels than the right one, the merge result is questionable: the lowest level of the right DataFrame's columns (C0, C1) is dropped.

Is this a bug or a feature (result levels shape must be the same as for the left DataFrame)?

Notes:

  1. Both cases emit the same warning, UserWarning: merging between different levels can give an unintended result ....
  2. Checked with pandas 0.25.1 and 1.0.4.
  3. See notebook.

dinya added the Needs Triage and Usage Question labels Jun 18, 2020
@TomAugspurger
Contributor

Does the UserWarning make it clear that you should be passing the same number of levels?

TomAugspurger added the Usage Question and Needs Info labels and removed the Needs Triage and Usage Question labels Sep 4, 2020
@dinya
Author

dinya commented Sep 8, 2020

Does the UserWarning make it clear that you should be passing the same number of levels?

Yes, the warning is clear. But why not add auto-leveling for columns with different numbers of levels? If that could break something (I don't know what), the behaviour could be made optional in pd.merge.

Auto-leveling would be useful when pandas is used in an automated data-processing flow where the number of column levels is not known ahead of time.

Or is this feature considered rarely used and therefore unnecessary?
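To make the request concrete, here is a rough sketch of what auto-leveling could look like if done by the caller. Nothing here is an existing pandas option: `pad_column_levels`, `pad_key`, and `merge_autolevel` are hypothetical helpers, and padding with empty strings is an assumption. The sketch also assumes at least one of the frames has MultiIndex columns.

```python
import pandas as pd


def pad_key(key, nlevels, fill=""):
    """Pad a column label to `nlevels` entries (hypothetical helper)."""
    key = key if isinstance(key, tuple) else (key,)
    return key + (fill,) * (nlevels - len(key))


def pad_column_levels(df, nlevels, fill=""):
    """Return a copy of `df` whose columns have `nlevels` levels (hypothetical helper)."""
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        [pad_key(c, nlevels, fill) for c in df.columns]
    )
    return out


def merge_autolevel(left, right, left_on, right_on, fill="", **kwargs):
    """Pad both frames and the merge keys to a common depth, then merge."""
    depth = max(left.columns.nlevels, right.columns.nlevels)
    return pd.merge(
        pad_column_levels(left, depth, fill),
        pad_column_levels(right, depth, fill),
        left_on=[pad_key(k, depth, fill) for k in left_on],
        right_on=[pad_key(k, depth, fill) for k in right_on],
        **kwargs,
    )


df1 = pd.DataFrame(
    [[1, 2], [10, 20]],
    columns=pd.MultiIndex.from_tuples([("A0", "B0", "C0"), ("A1", "B1", "C1")]),
)
df2 = pd.DataFrame(
    [[1, 200], [10, 200]],
    columns=pd.MultiIndex.from_tuples([("X0", "Y0"), ("X1", "Y1")]),
)

# Case 2 from the report: the shallower frame is on the left.
result = merge_autolevel(
    df2, df1, left_on=[("X0", "Y0")], right_on=[("A0", "B0", "C0")]
)
```

Because both frames have the same depth by the time `pd.merge` is called, no level of either frame is dropped, regardless of which one is on the left.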

@TomAugspurger
Contributor

I'm not sure. Perhaps investigate the original issue where the warning was added?

@dinya
Author

dinya commented Sep 9, 2020

OK, the investigation has started...

The warning was added in bb9b9c5 (line 200) by @nbonnotte and @jreback after #9455 and #12219.

See the discussion in #12219 starting with this comment (was the warning added as a result?).

@jreback wrote in this comment

my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be surprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level.

The default is too flexible here. We could simply show a warning, but warnings are often ignored.

I don't have a concrete idea of how to do this ATM.

@jreback, by "option" did you mean an optional parameter of pd.merge that enables auto-leveling?

If so, is an optional parameter for auto-leveling MultiIndex columns still relevant?

@jreback
Contributor

jreback commented Sep 10, 2020

@dinya I think the warning is fine, meaning some people really want to do this, but I am open to deprecating this and simply prohibiting merging on different number of levels as the results are unexpected.
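If merging on different numbers of levels is prohibited, the caller has to align the levels explicitly. One option, sketched below under the assumption that the extra level is not needed in the result, is to drop it with `DataFrame.droplevel` before merging. The variable name `df1_two_levels` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, 2], [10, 20]],
    columns=pd.MultiIndex.from_tuples([("A0", "B0", "C0"), ("A1", "B1", "C1")]),
)
df2 = pd.DataFrame(
    [[1, 200], [10, 200]],
    columns=pd.MultiIndex.from_tuples([("X0", "Y0"), ("X1", "Y1")]),
)

# Drop the third column level (position 2) so both frames have two levels.
df1_two_levels = df1.droplevel(2, axis=1)

# Both sides now have the same number of levels, so the keys are 2-tuples.
result = pd.merge(
    df2, df1_two_levels, left_on=[("X0", "Y0")], right_on=[("A0", "B0")]
)
```

This makes the level dropping from Case 2 an explicit choice by the caller. Note that dropping a level can produce duplicate column labels if the remaining levels are not unique.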

@shawnbissell

@TomAugspurger @jreback was this level dropping deprecated/prohibited in the 1.2.0 release? After upgrading I'm seeing "ValueError: Length of names must match number of levels in MultiIndex.", whereas with 1.1.5 I previously saw the UserWarning.

@jreback
Contributor

jreback commented Jan 6, 2021

@shawnbissell you would have to show a reproducible example

mroeschke changed the title from "QST: Merge drops lower levels in result columns when the left DF has fewer column levels than the right DF" to "DEPR: Merging on different number of levels" Mar 26, 2021
mroeschke added the Deprecate and Reshaping labels and removed the Needs Info and Usage Question labels Mar 26, 2021
jreback added this to the 1.3 milestone Apr 20, 2021
ClaudioSalvatoreArcidiacono pushed a commit to ClaudioSalvatoreArcidiacono/feature_engine that referenced this issue Apr 18, 2023
Merging dfs with different column levels has been disallowed
ref pandas-dev/pandas#34862
solegalli pushed a commit to feature-engine/feature_engine that referenced this issue Apr 22, 2023
* Transform positional argument into keyword argument

From pandas 2.0, `any` only accepts keyword arguments
ref pandas-dev/pandas#44896

* Change how reciprocal is computed

I have not fully understood why this solve the problem, but splitting
the operation in 2 lines does not seem to work

* Catch warnings from pandas.to_datetime

Now pandas.to_datetime raises a warning when the column cannot be converted

* check_dtype=False in tests datetime features

Pandas dataframes created from python integers are created with int
column types `int64` but the operation tested returns `int32` which
caused issues

* Use droplevel before merging

Merging dfs with different column levels has been disallowed
ref pandas-dev/pandas#34862

* Change expected values for months

I am not sure why this caused an issue, maybe due to type casting?

* run black

* run black on tests

* isort _variable_type_checks.py

* Fix datetime_subtraction

---------

Co-authored-by: Claudio Salvatore Arcidiacono <[email protected]>