I can't find much documentation about how the index and divisions behave after a `ddf.groupby` operation, other than trying things out myself. So far this is what I have found:
As for divisions, it seems that the only case where Dask keeps the divisions after a groupby is when you group on the index column and then call `apply`, e.g. `ddf.groupby(ddf.index.name).apply(fn)`. See this: Groupby-agg on index with known divisions results in unknown divisions #2999. All other cases, such as `groupby.transform` and `groupby.agg()`, generate unknown divisions.
As for the index, it seems that the result of `ddf.groupby(any_cols).apply()` keeps whatever index `ddf` had before, although the order of the index values can change. The same goes for `ddf.groupby(any_cols).transform()`. The only case where the index gets reset is when you call a reduction on the groupby, such as `ddf.groupby(any_cols).agg({})`. You can even get a multi-index if you pass a list to `groupby`, although Dask does not generally support multi-indexes.
As for npartitions, based on the observations above, my general assumption is that `groupby.apply()` keeps the original index as well as the original npartitions. So will `ddf.groupby().apply()` have the same number of partitions as `ddf`? And will `ddf.groupby().agg()` produce a result with 1 partition unless you specify otherwise via the `split_out` parameter of `agg()`?
It would be good to have a general document that describes the behaviour of groupby in terms of the result's index, divisions, and npartitions after the groupby call.