You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I understand ddf.groupby().apply on index column with known divisions would avoid shuffling which is good. But I really struggle to understand if dask would do shuffling if I join two dask dataframe along their indexes but their divisions are unknown. lhs.join(rhs) where lhs and rhs both have unknown divisions. Based on the doc, it seems like the answer is No, and falls into the easy cases as long as you are joining on indexes. But how could dask efficiently avoid data shuffling if both data frames' divisions are unknown?
Also it would be good if we have a doc describing how the divisions of a dataframe changes and evolves as we perform operations like join, groupby on dataframe, similarly like how we send meta to a lot of operations to keep track of the schema of the data frames during all dask transformations.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
I was reading this doc on dask to avoid shuffling when doing some common operations such as
ddf.groupby
,ddf.join(other_ddf)
etc:https://docs.dask.org/en/stable/dataframe-groupby.html
I understand
ddf.groupby().apply
on index column with known divisions would avoid shuffling which is good. But I really struggle to understand if dask would do shuffling if I join two dask dataframe along their indexes but their divisions are unknown.lhs.join(rhs)
wherelhs
andrhs
both have unknown divisions. Based on the doc, it seems like the answer is No, and falls into the easy cases as long as you are joining on indexes. But how could dask efficiently avoid data shuffling if both data frames' divisions are unknown?Also it would be good if we have a doc describing how the
divisions
of a dataframe changes and evolves as we perform operations likejoin
,groupby
on dataframe, similarly like how we sendmeta
to a lot of operations to keep track of the schema of the data frames during all dask transformations.Please clarify if possible.
Beta Was this translation helpful? Give feedback.
All reactions