ENH: Reduce type requirements for the subset parameter in drop_duplicates/duplicated #59237
Comments
Take |
This seems reasonable to me; cc @pandas-dev/pandas-core for any thoughts on the expansion of this parameter to sets. In the past, we've explicitly raised on sets for some arguments, but I believe in those cases it was because order mattered. Here, order does not matter. |
I think this is a documentation and typing issue, not a code issue.
>>> df = pd.DataFrame.from_records([[1, 2, 3, 4], [5, 6, 7, 8], [10, 20, 3, 4], [50, 60, 7, 8]], columns=["a", "b", "c", "d"])
>>> df
    a   b  c  d
0   1   2  3  4
1   5   6  7  8
2  10  20  3  4
3  50  60  7  8
>>> df.drop_duplicates(subset=["c", "d"])
   a  b  c  d
0  1  2  3  4
1  5  6  7  8
>>> df.drop_duplicates(subset=set(["c", "d"]))
   a  b  c  d
0  1  2  3  4
1  5  6  7  8
So you can pass a set today, but typing isn't allowing it if you are using pandas source to type check your code instead of I would suggest making a change here, but also in |
That is generally true except for single element sets.
|
OK, that makes sense. |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I would like to pass a set to drop_duplicates like so:
According to the type hints, subset is a Hashable | Sequence[Hashable] | None. The documentation says "column label or sequence of labels, optional".
The problem is, I would like to pass a set of columns. The name subset suggests that should be ok. And it does work indeed. However, a set is not a Sequence.
Can the requirements be lowered? Maybe to Collection? (or even Iterable, but that might come with problems).
Looking at the code: https://github.com/pandas-dev/pandas/blob/v2.2.2/pandas/core/frame.py#L6935
It checks if the subset is np.iterable, then a set is created and len(subset) is called. This could be done with collections as well.
Am I missing anything or could this Sequence be made a Collection?
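To make the Sequence / Collection / Iterable distinction concrete, here is a small standard-library sketch. The sample values and the helper name abc_membership are made up for illustration and are not pandas code. It checks which abstract base classes a list, a tuple, a set, and a generator satisfy:

```python
from collections.abc import Collection, Iterable, Sequence

# Illustrative inputs a caller might try to pass as `subset`.
candidates = {
    "list": ["c", "d"],
    "tuple": ("c", "d"),
    "set": {"c", "d"},
    "generator": (name for name in ("c", "d")),
}


def abc_membership(value):
    """Report which of the three ABCs `value` satisfies (hypothetical helper)."""
    return {
        "Sequence": isinstance(value, Sequence),
        "Collection": isinstance(value, Collection),
        "Iterable": isinstance(value, Iterable),
    }


results = {label: abc_membership(value) for label, value in candidates.items()}
for label, flags in results.items():
    print(label, flags)
```

A set is a Collection (it supports len(), iteration, and membership tests) but not a Sequence, because it has no positional indexing. A generator is only an Iterable: it has no len() and can be consumed just once, which is presumably the kind of problem that widening all the way to Iterable could bring.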
Feature Description
If I am not mistaken, Sequence could simply be replaced with Collection in duplicated as well as drop_duplicates, and in the docs.
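As a rough sketch of what the relaxed annotation could look like, the function below mirrors the steps described above (wrap a single label, build a set to find unknown columns, rely only on len() and iteration). The name normalize_subset and its signature are hypothetical; this is not the pandas implementation, only an illustration under the assumption that Collection is enough for these operations:

```python
from collections.abc import Collection, Hashable
from typing import Optional, Union

# Hypothetical alias showing the proposed, wider annotation.
Subset = Optional[Union[Hashable, Collection[Hashable]]]


def normalize_subset(subset: Subset, columns: Collection[Hashable]) -> Collection[Hashable]:
    """Hypothetical helper: turn `subset` into a collection of known column labels."""
    if subset is None:
        # No subset given: consider every column.
        return columns
    if isinstance(subset, str) or not isinstance(subset, Collection):
        # A single label (strings are collections of characters, so special-case them).
        subset = (subset,)
    elif isinstance(subset, tuple) and subset in columns:
        # A tuple that is itself a column label counts as one label.
        subset = (subset,)
    # Only set construction, len() and iteration are needed from here on.
    missing = set(subset) - set(columns)
    if missing:
        raise KeyError(sorted(missing, key=str))
    return subset


columns = ["a", "b", "c", "d"]
print(normalize_subset({"c", "d"}, columns))
print(normalize_subset("c", columns))
```

Nothing in this sketch indexes into subset, so a set passes through unchanged. Any code path that does use positional access such as subset[0] would still need a Sequence, or would have to convert the collection first.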
Alternative Solutions
No alternative solution required.
Additional Context
No response