Skip to content

Fast Approximate Unique Word Counting (via LogLog-Beta) for the command line

License

Notifications You must be signed in to change notification settings

travisbrady/ccard

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CCard Build Status MIT license

A single binary written in C to estimate the count of unique strings in a stream of lines on stdin. Use ccard where you'd use zcat some_huge_file.txt.gz | sort -u | wc -l

Description

This project exists because I often have a need to quickly know the cardinality of something at the command line. I've always used standard unix tools for this purpose, but they are much slower than they could be if you're willing to sacrifice a small amount of accuracy (typically <5%) to get an estimate of the true cardinality.

ccard counts unique strings using the Qin, et al's LogLog-Beta algorithm translated mechanically from seiflotfy's implementation Go seiflotfy/loglogbeta. Credit is also due to J. Andrew Rogers for his metrohash-c library jedisct1/metrohash-c

Examples

Using the data files from @seiflotfy's loglogbeta repo. The ratio field below is computed as 100 - (100 * estimate/exact)

$ make test
file: data/words-1
exact:      150
estimate: 149
ratio: 0.667%

file: data/words-2
exact:     1308
estimate: 1315
ratio: -0.535%

file: data/words-3
exact:    46685
estimate: 47169
ratio: -1.036%

file: data/words-4
exact:   235886
estimate: 232426
ratio: 1.467%

file: data/words-5
exact:   349900
estimate: 346596
ratio: 0.945%

file: data/words-6
exact:   479829
estimate: 477269
ratio: 0.534%

Additional Examples with text8 and others

# exact
$ cat /usr/share/dict/words | sort -u | wc -l
235886
# approximate with ccard
$ cat /usr/share/dict/words | ./ccard
232426

So in this case ccard error is 1.47%

# Compute exact count using trivial Python script contained herein
$ time pigz -c -d data/text8.lines.gz | python uniq.py
    5.96 real         0.45 user         0.10 sys
253855
# the slow way
$ time pigz -c -d data/text8.lines.gz | sort -u | wc -l
    293.16 real         0.51 user         0.09 sys
253855
$ time pigz -c -d data/text8.lines.gz | ./ccard
    1.40 real         0.48 user         0.13 sys
253801

Here ccard is off by -0.021%

License

ccard is licensed under the MIT license and J. Andrew Rogers' metrohash-c library (vendored directly into this project) is as well.

References

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

@misc{1612.02284,
Author = {Jason Qin and Denys Kim and Yumei Tung},
Title = {LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting},
Year = {2016},
Eprint = {arXiv:1612.02284},
}

About

Fast Approximate Unique Word Counting (via LogLog-Beta) for the command line

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published