ENH: consistent types in output of df.groupby #42795
Comments
I haven't yet considered the implementation or all the ramifications of this, but we do similar things in other places, e.g.
My guess is that a tuple may be appropriate in the case of passing a list to groupby, but that the result should still have an Index (not a MultiIndex). cc @jbrockmendel for any thoughts here.
a length-1 tuple seems reasonable to me
Will need to deprecate current behavior for this.
@rhshadrach should I add a warning or just fix it?
I think this needs to be deprecated before changing.
@rhshadrach does that mean raising a deprecation warning for the case where there is only one group name?
Yes, that's correct - something along the lines of "In a future version of pandas, a length 1 tuple will be returned when grouping by a list of length 1. Don't supply a list with a single grouper to avoid this warning."
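For illustration, a minimal sketch of the two spellings on pandas 1.5.x (the version where this deprecation warning was introduced):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2], "y": [10, 20, 30]})

# Length-1 list: emits the FutureWarning on 1.5.x, because the group
# names become length-1 tuples in 2.0.
for name, group in df.groupby(["x"]):
    print(name)

# Single grouper (not wrapped in a list): scalar group names, no warning.
for name, group in df.groupby("x"):
    print(name)
```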
@rhshadrach The deprecation PR: #47761
…to_dataset (#14306) This PR addresses the deprecation warning from pandas 1.5.0 connected to the length-1 tuple used in the `groupby()` operation, see pandas-dev/pandas#42795. Currently we use the `groupby` operation with a length-1 list in:
- `multisourcefs()` fixture in `test_dataset.py`: https://github.com/apache/arrow/blob/466018084861f86a9171803c6d689e9c1c1efb5a/python/pyarrow/tests/test_dataset.py#L197
- `write_to_dataset()` in `pyarrow/parquet/core.py`: https://github.com/apache/arrow/blob/466018084861f86a9171803c6d689e9c1c1efb5a/python/pyarrow/parquet/core.py#L3346-L3348
This PR fixes the test and adds a check in `parquet/core.py` to avoid the warning.
Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
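The check mentioned in that PR description could look roughly like the following; this is only a sketch with a hypothetical helper name, not the actual `parquet/core.py` code:

```python
import pandas as pd

def group_by_partition_keys(df: pd.DataFrame, partition_keys: list):
    """Group df by partition_keys without triggering the pandas 1.5
    FutureWarning for length-1 lists (hypothetical helper)."""
    # Unwrap a single key so pandas receives a scalar grouper rather than
    # a length-1 list; iteration then yields scalar group names on both
    # 1.5.x and 2.0.
    keys = partition_keys[0] if len(partition_keys) == 1 else partition_keys
    return df.groupby(keys)

df = pd.DataFrame({"part": ["a", "a", "b"], "value": [1, 2, 3]})
for name, chunk in group_by_partition_keys(df, ["part"]):
    print(name, len(chunk))
```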
Hi, I got the warning message "In a future version of pandas, a length 1 tuple will be returned when grouping by a list of length 1. Don't supply a list with a single grouper to avoid this warning."
@GilWu - It sounds like you're getting an improper warning. Can you open a new issue with a reproducible example demonstrating the warning message you're seeing?
@GilWu It's still unclear to me if I should go ahead and make my code test the returned group name with
If you are passing a single element, the behavior will be the same in 2.0. If you are passing a length-1 list, the behavior will change (scalar in 1.5.x, tuple in 2.0). In general, if you are not getting a warning then you should experience no change.
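A version-agnostic way to consume the group names, sketched under the assumption that they are either scalars (1.5.x) or tuples (2.0):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b"], "y": [1, 2, 3]})

for name, group in df.groupby(["x"]):
    # pandas 1.5.x: name is the scalar "a" / "b" (with a FutureWarning)
    # pandas 2.0:   name is the tuple ("a",) / ("b",)
    key = name if isinstance(name, tuple) else (name,)
    print(key)  # always a tuple, whichever version is installed
```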
Apologies if I've missed something, but should this change not also be applied to `.groups`?
import pandas as pd
df = pd.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})
# group by atomic - names and keys are atomic
name, _ = next(iter(df.groupby('x')))
name # `10` as expected
df.groupby('x').groups # `{10: [0], 20: [1], 30: [2]}` as expected
# group by list of 2 - names and keys are 2-tuples
name, _ = next(iter(df.groupby(['x', 'y'])))
name # `(10, 'a')` as expected
df.groupby(['x', 'y']).groups # `{(10, 'a'): [0], (20, 'b'): [1], (30, 'c'): [2]}` as expected
# group by list of 1 - names are 1-tuples but keys are atomic!?
name, _ = next(iter(df.groupby(['x'])))
name # `(10,)` as expected
df.groupby(['x']).groups # `{10: [0], 20: [1], 30: [2]}` UNEXPECTED
@Azureuse - can you open a new issue?
Is your feature request related to a problem?
The problem that occurred to me is due to inconsistent types output by `for values, match_df in df.groupby(names)`: if `names` is a list of length > 1, then `values` can be a tuple; otherwise it is the element type, which can be e.g. a string. I had a bug where iterating over `values` resulted in going over the characters of a string, and together with `zip` it just discarded most of the fields.
Describe the solution you'd like
Whenever `df.groupby` is called with a list, output a tuple of the values found. Specifically, when called with a list of length 1, output a tuple of length 1 instead of an unwrapped element.
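In other words, the type of the group name would follow the type of the grouper argument. A sketch of the requested behavior (the expected values in the comments describe the proposal, which matches how iteration behaves in pandas 2.0):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b"], "y": [1, 2]})

next(iter(df.groupby("x")))[0]         # "a"       -- scalar in, scalar out
next(iter(df.groupby(["x"])))[0]       # ("a",)    -- list in, tuple out
next(iter(df.groupby(["x", "y"])))[0]  # ("a", 1)  -- list in, tuple out
```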
API breaking implications
This would change how existing code behaves.
Describe alternatives you've considered
Leave things as they are. Unfortunately, that means people need to write special-case `if` handling around `groupby` to get consistently typed output.
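That special-case handling typically looks something like this hypothetical wrapper, which papers over the scalar-vs-tuple difference for callers:

```python
import pandas as pd

def iter_groups(df: pd.DataFrame, names):
    """Yield (key, group) pairs with key always a tuple, regardless of
    whether names is a scalar, a length-1 list, or a longer list
    (hypothetical helper)."""
    for values, group in df.groupby(names):
        if not isinstance(values, tuple):
            values = (values,)  # single grouper: wrap the scalar
        yield values, group

df = pd.DataFrame({"x": ["a", "b"], "y": [1, 2]})
for key, group in iter_groups(df, ["x"]):
    print(key)  # e.g. ("a",)
```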
Additional context