In-code suggestion of when to use split_out in dask.dataframe.groupby #8001
Comments
That's a really interesting idea @raybellwaves - I like the idea of introducing these kinds of guardrails and suggestions into the codebase (as long as we don't go overboard :) ). I'm wondering how this will work, though. After looking at the code for a bit I am actually wondering if it might make sense to default …
In the common case split_out=1 makes sense for groupby-aggregations. There are usually a small number of groups (where "small" means less than ten million), and so we want a single output partition. split_out is fairly expensive computationally: it requires something like a mini-shuffle to achieve.

I agree that informative error messages would be good. If we knew the values of the data ahead of time then this is something that we would definitely do, but as things stand we only know about this situation once we're in the middle of a computation. Maybe we can emit such a message during computation in the intermediate steps, but this is going to be more awkward and less effective than in something like Client.scatter (where we know the answer much closer to client code).

I think in general the answer to this is 👍 full support, but it's not clear how best to accomplish it, so if anyone wants to get their hands dirty and investigate, that would be welcome.
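To make the "mini-shuffle" remark concrete, here is a hypothetical pure-Python sketch (not dask's actual implementation) of the idea behind split_out: per-partition partial aggregates are routed to one of `split_out` output partitions by hashing the group key, so no single output partition has to hold every group. The function name and data shapes are illustrative.

```python
from collections import defaultdict

def split_out_shuffle(partial_results, split_out):
    """Route (group_key, value) partial sums to `split_out` output partitions.

    partial_results: iterable of (key, value) pairs produced by the
    per-input-partition aggregation step.
    Returns a list of `split_out` dicts, each combining its share of groups.
    """
    outputs = [defaultdict(int) for _ in range(split_out)]
    for key, value in partial_results:
        # hashing the key decides which output partition owns this group,
        # so each group ends up whole in exactly one output partition
        outputs[hash(key) % split_out][key] += value
    return [dict(o) for o in outputs]

partials = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = split_out_shuffle(partials, split_out=2)
# the three groups are split across the two output partitions
assert sum(len(p) for p in parts) == 3
```

With split_out=1 the same routing degenerates to a single dict holding every group, which is why a huge number of groups overwhelms one partition.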
Yeah, that makes sense. I think as a first step we could template in docstrings for …
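One way the docstring-templating idea could look, as a hypothetical sketch: write the split_out guidance once and append it to each groupby aggregation's docstring with a small decorator. The names here (SPLIT_OUT_NOTE, with_split_out_note, groupby_sum) are illustrative, not dask API.

```python
# Shared guidance written once, injected into many docstrings.
SPLIT_OUT_NOTE = """
split_out : int, default 1
    Number of output partitions. Increase this when the number of groups
    is large (e.g. millions) and a single output partition would not fit
    comfortably in memory.
"""

def with_split_out_note(func):
    """Append the shared split_out note to a function's docstring."""
    func.__doc__ = (func.__doc__ or "") + SPLIT_OUT_NOTE
    return func

@with_split_out_note
def groupby_sum(self, split_out=1):
    """Compute the sum of each group."""

assert "split_out : int, default 1" in groupby_sum.__doc__
```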
dask/distributed#5220 could make it easier to send a warning back to users at runtime. Yes, it would be nice to know at call time, but getting a warning at compute time would still be far more helpful than no warning at all!
One would still need to put some thought into how to issue the warning with dask/distributed#5220: you wouldn't want every single task to issue a warning, since the client would receive hundreds or even thousands of warnings, which might be a little excessive :) but the infrastructure is there, yes.
With dask/distributed#5217 we could have a …
After spending some time working out why my groupby operation was not working, I came across https://examples.dask.org/dataframes/02-groupby.html#Many-groups. If it weren't for the great docs around dask, I would likely have missed this.
The split_out arg really helped. This made sense as I had ~500,000,000 groups.
I know it may not be easy, but similar to other warning messages (I believe I've seen a suggestion of using Client.scatter at some point but can't find where it comes from in the codebase), it would be nice for the code to suggest this to the user once it works out (if possible) how many groups there are, or even to optimize it so it happens under the hood (see #7933).

Example:
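As a hypothetical sketch of that suggestion: once a computation knows how many groups there are, warn if the count is large relative to split_out. The threshold and helper name below are illustrative, not dask behavior; the ten-million figure comes from the maintainer's comment in this thread.

```python
import warnings

MANY_GROUPS = 10_000_000  # "small" per the discussion above is < ten million

def check_split_out(n_groups, split_out=1):
    """Warn when a groupby produced far more groups than split_out expects."""
    if split_out == 1 and n_groups >= MANY_GROUPS:
        warnings.warn(
            f"groupby produced {n_groups:,} groups; consider passing "
            "split_out > 1 to spread the result over several partitions."
        )

check_split_out(500_000_000)  # the ~500,000,000-group case from this issue
```

The hard part, per the discussion above, is that n_groups is only known mid-computation, so the check would have to live in an intermediate task rather than at call time.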
Performance reports:
Traceback of df.groupby(["x", "y"]).sum().compute():