A single binary written in C to estimate the count of unique strings in a stream of lines on stdin.
Use ccard
where you'd use zcat some_huge_file.txt.gz | sort -u | wc -l
This project exists because I often have a need to quickly know the cardinality of something at the command line. I've always used standard unix tools for this purpose, but they are much slower than they could be if you're willing to sacrifice a small amount of accuracy (typically <5%) to get an estimate of the true cardinality.
ccard
counts unique strings using the Qin, et al's LogLog-Beta algorithm translated mechanically from seiflotfy's
implementation Go seiflotfy/loglogbeta.
Credit is also due to J. Andrew Rogers for his metrohash-c library jedisct1/metrohash-c
Using the data files from @seiflotfy's loglogbeta repo.
The ratio
field below is computed as 100 - (100 * estimate/exact)
$ make test
file: data/words-1
exact: 150
estimate: 149
ratio: 0.667%
file: data/words-2
exact: 1308
estimate: 1315
ratio: -0.535%
file: data/words-3
exact: 46685
estimate: 47169
ratio: -1.036%
file: data/words-4
exact: 235886
estimate: 232426
ratio: 1.467%
file: data/words-5
exact: 349900
estimate: 346596
ratio: 0.945%
file: data/words-6
exact: 479829
estimate: 477269
ratio: 0.534%
# exact
$ cat /usr/share/dict/words | sort -u | wc -l
235886
# approximate with ccard
$ cat /usr/share/dict/words | ./ccard
232426
So in this case ccard
error is 1.47%
# Compute exact count using trivial Python script contained herein
$ time pigz -c -d data/text8.lines.gz | python uniq.py
5.96 real 0.45 user 0.10 sys
253855
# the slow way
$ time pigz -c -d data/text8.lines.gz | sort -u | wc -l
293.16 real 0.51 user 0.09 sys
253855
$ time pigz -c -d data/text8.lines.gz | ./ccard
1.40 real 0.48 user 0.13 sys
253801
Here ccard
is off by -0.021%
ccard is licensed under the MIT license and J. Andrew Rogers' metrohash-c library (vendored directly into this project) is as well.
LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting
@misc{1612.02284,
Author = {Jason Qin and Denys Kim and Yumei Tung},
Title = {LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting},
Year = {2016},
Eprint = {arXiv:1612.02284},
}