Use uniform distribution in `timeseries` demo #8020

jrbourbeau · 2021-08-11T17:51:19Z

I was putting together a demo using our timeseries utility and noticed that generating random integers can take a significant amount of time, especially when generating a DataFrame with many columns. After digging a bit deeper, I found this is because we're generating integers from a Poisson distribution

dask/dask/dataframe/io/demo.py

Lines 16 to 17 in da24582

    
    def make_int(n, rstate, lam=1000):
        return rstate.poisson(lam, size=n)

which can be an order of magnitude slower than sampling from, for example, a uniform distribution:

In [1]: import numpy as np

In [2]: rstate = np.random.RandomState(2)

In [3]: %timeit rstate.poisson(1_000, size=86_400)   # 86_400 is the default partition size for `timeseries`
5.39 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit rstate.randint(low=0, high=1e9, size=86_400)
395 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: 5.39e3 / 395
Out[5]: 13.645569620253164

This PR proposes we switch to using a uniform distribution for generating random integers.
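As a rough illustration of the proposal (the PR's actual diff is not reproduced in this thread), the swap might look like the sketch below. `make_int_poisson` mirrors the `make_int` quoted above. `make_int_uniform` and its `low`/`high` parameters are hypothetical names, with bounds borrowed from the `randint` timing above rather than from the merged code.

```python
import numpy as np


def make_int_poisson(n, rstate, lam=1000):
    # Behavior quoted above: draw n integers from a Poisson distribution.
    return rstate.poisson(lam, size=n)


def make_int_uniform(n, rstate, low=0, high=1_000_000_000):
    # Hypothetical replacement: draw n integers uniformly from [low, high).
    # The default bounds are an assumption taken from the timing example.
    return rstate.randint(low=low, high=high, size=n)


rstate = np.random.RandomState(2)
values = make_int_uniform(86_400, rstate)  # one default-sized partition
```

Both functions take the same `(n, rstate)` arguments, so either could be timed with `%timeit` in place of the other.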

cc @rjzamora who has recently worked with our timeseries utility

mrocklin · 2021-08-11T18:23:29Z

Consider me -0 on this

FWIW I like having a poisson distribution. I think that it makes things feel a bit more natural, which is nice to see in random data. People can use float for uniform data if they want something speedy.

However, if it's fast to produce "natural looking" data in some other faster way then I'd be for that.
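One candidate for "natural looking" integers, offered only as a sketch: for large `lam`, a Poisson distribution is well approximated by a normal distribution with mean `lam` and standard deviation `sqrt(lam)`, so rounding normal samples gives similar-looking values. The function name `make_int_normal_approx` is hypothetical, and whether this is actually faster than `rstate.poisson` has not been measured here.

```python
import numpy as np


def make_int_normal_approx(n, rstate, lam=1000):
    # Approximate Poisson(lam) by Normal(mean=lam, std=sqrt(lam)).
    # This is only a reasonable approximation when lam is large.
    samples = rstate.normal(loc=lam, scale=np.sqrt(lam), size=n)
    # Round to the nearest integer and clip at zero, since counts
    # cannot be negative.
    return np.clip(np.rint(samples), 0, None).astype(np.int64)


rstate = np.random.RandomState(2)
approx = make_int_normal_approx(86_400, rstate)
```

The result clusters around `lam` like the Poisson output does, rather than spreading evenly over a wide range.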

quasiben · 2021-08-11T19:47:00Z

Somewhat related to extending the timeseries demo: @isVoid recently added a feature to generate nulls as part of the distribution:
rapidsai/cudf#8925

jsignell · 2021-08-12T13:21:58Z

It is surprising to me how often I use the timeseries demo. I have never personally been bothered by the time it takes to generate the data, but I am all for making it even more useful.

In case it's useful to others, I recently started adding an annual temperature column:

annual_cycle = np.sin(2 * np.pi * (ddf.index.dayofyear.values / 365.25 - 0.28)).compute_chunk_sizes()
temperature_values = 10 + 15 * annual_cycle + 3 * da.random.normal(size=annual_cycle.size)
ddf["temperature"] = temperature_values

jrbourbeau added 2 commits August 11, 2021 12:27

Use random integers in timeseries demo

bebd2ae

Update docstring

728f09f

github-actions bot added dataframe io labels Aug 11, 2021

jrbourbeau changed the title ~~Use random integers in timeseries demo~~ Use uniform distribution in timeseries demo Aug 11, 2021

github-actions bot added the needs attention label Sep 23, 2024
