HyperLogLog for Erlang

This is an implementation of the HyperLogLog algorithm in Erlang. Using HyperLogLog you can estimate the cardinality of very large data sets using constant memory. The relative error is 1.04 / sqrt(2^P). When creating a new HyperLogLog filter, you provide the precision P, allowing you to trade memory for accuracy. The union of two filters is lossless.
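As a worked example of this trade-off, the standard HyperLogLog error figure (one standard deviation, with 2^P registers) evaluates as follows for a few precisions. This is only arithmetic on the formula, not a measurement of this library:

```latex
\sigma(P) \approx \frac{1.04}{\sqrt{2^{P}}}, \qquad
\sigma(12) = \frac{1.04}{64} \approx 1.6\%, \quad
\sigma(14) = \frac{1.04}{128} \approx 0.81\%, \quad
\sigma(16) = \frac{1.04}{256} \approx 0.41\%
```

Every two extra bits of precision halve the expected error while quadrupling the number of registers.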

In practice this allows you to build efficient analytics systems. For example, you can create a new filter in each mapper and feed it a portion of your dataset, while the reducers simply union together all the filters they receive. The filter you end up with is exactly the same as if you had inserted all the data sequentially into a single filter.
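The mapper/reducer pattern can be sketched in the shell roughly as follows. This is a minimal sketch, assuming `hyper:union/1` accepts a non-empty list of filters created with the same precision; the `Chunks` data is made up for illustration.

```erlang
%% Sketch only: assumes hyper is on the code path and that hyper:union/1
%% takes a non-empty list of filters. Chunks is illustrative data, with
%% one list of binaries per "mapper".
Chunks = [[<<"a">>, <<"b">>], [<<"b">>, <<"c">>]],
%% Map step: build one filter per chunk.
Filters = [lists:foldl(fun hyper:insert/2, hyper:new(14), Chunk)
           || Chunk <- Chunks],
%% Reduce step: merge the partial filters and estimate the cardinality.
Merged = hyper:union(Filters),
%% For comparison: a single filter fed all the data sequentially.
Sequential = lists:foldl(fun hyper:insert/2, hyper:new(14),
                         lists:append(Chunks)),
%% The two estimates are expected to agree, since the union is lossless.
{hyper:card(Merged), hyper:card(Sequential)}.
```

The estimates are compared rather than the filter terms themselves, because the internal representation may differ depending on how the inserts were buffered.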

In addition to the base algorithm, we have implemented the new estimator based on Mean Limit, as described in this great paper by Otmar Ertl. This new estimator greatly improves the estimates for lower cardinalities while using a single estimator for the whole range of cardinalities.

TODO

Usage

1> hyper:insert(<<"foobar">>, hyper:insert(<<"quux">>, hyper:new(4))).
{hyper,4,
       {hyper_binary,{dense,<<0,0,0,0,0,0,0,0,64,0,0,0>>,
                            [{8,1}],
                            1,16}}}

2> hyper:card(v(-1)).
2.136502281992361

The errors introduced by estimations can be seen in this example:

3> rand:seed(exsss, {1, 2, 3}).
{#{bits => 58,jump => #Fun<rand.3.47293030>,
   next => #Fun<rand.0.47293030>,type => exsss,
   uniform => #Fun<rand.1.47293030>,
   uniform_n => #Fun<rand.2.47293030>},
 [117085240290607817|199386643319833935]}
4> Run = fun (P, Card) ->
       hyper:card(lists:foldl(fun (_, H) ->
                                  Int = rand:uniform(10000000000000),
                                  hyper:insert(<<Int:64/integer>>, H)
                              end, hyper:new(P), lists:seq(1, Card)))
   end.
#Fun<erl_eval.12.80484245>
5> Run(12, 10_000).
10038.192365345985
6> Run(14, 10_000).
9967.916262642864
7> Run(16, 10_000).
9972.832893293473

A filter can be persisted and read later. The serialized struct is formatted for usage with jiffy:

8> Filter = hyper:insert(<<"foo">>, hyper:new(4)).
{hyper,4,
       {hyper_binary,{dense,<<4,0,0,0,0,0,0,0,0,0,0,0>>,[],0,16}}}
9> Filter =:= hyper:from_json(hyper:to_json(Filter)).
true
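Persisting a filter to disk could look like the sketch below. It assumes jiffy is available as a dependency and that the term returned by `hyper:to_json/1` can be passed straight to `jiffy:encode/1`, as the text above suggests; the file name is hypothetical.

```erlang
%% Sketch only: assumes jiffy is installed and that hyper:to_json/1 returns
%% a term jiffy:encode/1 accepts. "filter.json" is a hypothetical path.
Original = hyper:insert(<<"foo">>, hyper:new(4)),
Json = jiffy:encode(hyper:to_json(Original)),
ok = file:write_file("filter.json", Json),

%% Later, possibly in another process or on another node:
{ok, Stored} = file:read_file("filter.json"),
Restored = hyper:from_json(jiffy:decode(Stored)),
%% Both filters are expected to give the same estimate.
{hyper:card(Original), hyper:card(Restored)}.
```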

As of today, only the binary backend is supported; more are to come. When several backends are available, you can select a different one (see Backends below for why you might want to). Backends serialize in exactly the same way, but can't be mixed in memory.

Is it any good?

No idea! I do not know anyone who uses it extensively, but it is relatively well tested. As far as I can tell, it is the only FOSS implementation that does precision reduction properly!

Hacking

Documentation

We use ex_doc for documentation. To generate the docs, first install it:

mix escript.install hex ex_doc
ex_doc --version

Then generate the docs, after targeting the correct version in docs.sh:

docs.sh

Backends

Effort has been spent on implementing different backends in pursuit of the right performance trade-off. Fill rate refers to how many registers have a value other than 0.

hyper_binary: Fixed memory usage (6 bits * 2^P), fastest on insert, union, cardinality and serialization. Best default choice.
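To make the memory figure concrete, the register array size can be computed directly from the formula above. This is plain arithmetic, not a call into the library, and `RegisterBytes` is a name introduced here for illustration. It counts only the register array, not the rest of the filter term.

```erlang
%% Register array size in bytes: 6 bits per register, 2^P registers.
%% RegisterBytes is an illustrative helper, not part of hyper.
RegisterBytes = fun(P) -> (6 * (1 bsl P)) div 8 end,
Sizes = [{P, RegisterBytes(P)} || P <- [4, 12, 14, 16]].
%% Sizes =:= [{4,12},{12,3072},{14,12288},{16,49152}]
```

The 12 bytes for P = 4 match the length of the dense binary shown in the Usage examples.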

You can also implement your own backend. The test directory contains a bunch of tests that run against all backends, including some PropEr tests. The test suite will ensure your backend gives correct estimates and correctly encodes/decodes the serialized filters.

Fork

This is a fork of the original Hyper library by GameAnalytics, which was no longer maintained.

The main differences are a move to the rand module for tests and to rebar3 as a build tool, in order to support OTP 23.

The carray backend was dropped, as it never left experimental status and could not be serialized for distributed use. Some NIF-based backends may come back in the future.

The bisect implementation was dropped too. Its use case was limited and it forced a dependency on a library that was not maintained either.

The gb and array backends were dropped for the time being too.

The estimator was rebuilt following this paper by Otmar Ertl, as it was broken for any precision other than 14. This should also provide better cardinality estimates across the board.

The reduce_precision function has been rebuilt properly, as it was quite simply wrong. This fixed a lot of bugs for unions.
