Use uniform distribution in timeseries demo #8020

Status: Open (2 commits, base: main)

jrbourbeau
Member

I was putting together a demo using our timeseries utility and noticed that generating random integers can take a significant amount of time -- especially when generating a DataFrame with many columns. After digging in a bit deeper, I found this is because we're generating integers from a Poisson distribution:

def make_int(n, rstate, lam=1000):
    return rstate.poisson(lam, size=n)

which can be an order of magnitude slower than drawing from, for example, a uniform distribution:

In [1]: import numpy as np

In [2]: rstate = np.random.RandomState(2)

In [3]: %timeit rstate.poisson(1_000, size=86_400)   # 86_400 is the default partition size for `timeseries`
5.39 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit rstate.randint(low=0, high=1e9, size=86_400)
395 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: 5.39e3 / 395
Out[5]: 13.645569620253164

This PR proposes we switch to using a uniform distribution for generating random integers.
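
For illustration only, here is a minimal sketch of what a uniform replacement for `make_int` might look like. The PR's actual diff is not shown in this conversation, so the name `make_int_uniform` and the `low`/`high` defaults are assumptions, not the merged code.

```python
import numpy as np


def make_int_poisson(n, rstate, lam=1000):
    # Current behavior quoted in the PR description: Poisson-distributed integers.
    return rstate.poisson(lam, size=n)


def make_int_uniform(n, rstate, low=0, high=2000):
    # Hypothetical replacement: uniform integers in [low, high).
    # The bounds are illustrative assumptions, chosen so the mean is close
    # to the Poisson default of lam=1000.
    return rstate.randint(low, high, size=n, dtype=np.int64)


rstate = np.random.RandomState(2)
uniform_values = make_int_uniform(86_400, rstate)
poisson_values = make_int_poisson(86_400, rstate)
```

Both helpers return an integer array of length `n`, so a swap like this would leave the column dtype and shape unchanged while altering the shape of the distribution.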

cc @rjzamora who has recently worked with our timeseries utility

@jrbourbeau changed the title from "Use random integers in timeseries demo" to "Use uniform distribution in timeseries demo" on Aug 11, 2021
@mrocklin
Member

Consider me -0 on this

FWIW I like having a Poisson distribution. I think that it makes things feel a bit more natural, which is nice to see in random data. People can use float for uniform data if they want something speedy.

However, if there's some other, faster way to produce "natural looking" data, then I'd be for that.
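
One candidate for "natural looking" integers, not proposed by anyone in this thread, is the normal approximation to a Poisson distribution: for large `lam`, Poisson(lam) is close to a normal with mean `lam` and standard deviation `sqrt(lam)`. The sketch below is an assumption-laden illustration (the name `make_int_approx_poisson` is hypothetical), and whether it is actually faster than `rstate.poisson` would need to be measured.

```python
import numpy as np


def make_int_approx_poisson(n, rstate, lam=1000):
    # Hypothetical helper: approximate Poisson(lam) by rounding a normal
    # sample with matching mean and variance. Only reasonable for large lam.
    values = rstate.normal(loc=lam, scale=np.sqrt(lam), size=n)
    # Round to the nearest integer and clip at zero, since counts are non-negative.
    return np.maximum(np.rint(values), 0).astype(np.int64)


rstate = np.random.RandomState(2)
approx_values = make_int_approx_poisson(86_400, rstate)
```

The result keeps the bell-shaped clustering around `lam` that a Poisson sample has, but it is not an exact Poisson draw, so the tails differ slightly.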

@quasiben
Member

Somewhat related to extending the timeseries demo: @isVoid recently added a feature to generate nulls as part of the distribution:
rapidsai/cudf#8925

@jsignell
Member

It is surprising to me how often I use the timeseries demo. I have never personally been bothered by the time it takes to generate the data, but I am all for making it even more useful.

In case it's useful to others, I recently started adding an annual temperature column:

annual_cycle = np.sin(2 * np.pi * (ddf.index.dayofyear.values / 365.25 - 0.28)).compute_chunk_sizes()
temperature_values = 10 + 15 * annual_cycle + 3 * da.random.normal(size=annual_cycle.size)
ddf["temperature"] = temperature_values
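
To see what this seasonal signal looks like without a Dask cluster, here is a NumPy-only sketch of the same formula over a single non-leap year. It assumes the expression is an offset of 10, plus an amplitude of 15 times the annual cycle, plus normal noise scaled by 3; the variable names and the seed are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)

# Day of year for one non-leap year: 1, 2, ..., 365.
day_of_year = np.arange(1, 366)

# Sinusoid with a one-year period; the 0.28 phase shift moves the peak to mid-July.
annual_cycle = np.sin(2 * np.pi * (day_of_year / 365.25 - 0.28))

# Offset + seasonal amplitude + noise (assumed structure of the original expression).
temperature = 10 + 15 * annual_cycle + 3 * rng.normal(size=annual_cycle.size)

# Day on which the noise-free cycle is highest.
peak_day = int(day_of_year[np.argmax(annual_cycle)])
```

With these constants the noise-free signal swings between roughly -5 and 25, peaking in summer for a northern-hemisphere calendar.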

@github-actions bot added the "needs attention" label (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.) on Sep 23, 2024
Labels: dataframe, io, needs attention