CCard

A single binary written in C to estimate the count of unique strings in a stream of lines on stdin. Use ccard where you'd use zcat some_huge_file.txt.gz | sort -u | wc -l

Description

This project exists because I often have a need to quickly know the cardinality of something at the command line. I've always used standard unix tools for this purpose, but they are much slower than they could be if you're willing to sacrifice a small amount of accuracy (typically <5%) to get an estimate of the true cardinality.

ccard counts unique strings using the Qin, et al's LogLog-Beta algorithm translated mechanically from seiflotfy's implementation Go seiflotfy/loglogbeta. Credit is also due to J. Andrew Rogers for his metrohash-c library jedisct1/metrohash-c

Examples

Using the data files from @seiflotfy's loglogbeta repo. The ratio field below is computed as 100 - (100 * estimate/exact)

$ make test
file: data/words-1
exact:      150
estimate: 149
ratio: 0.667%

file: data/words-2
exact:     1308
estimate: 1315
ratio: -0.535%

file: data/words-3
exact:    46685
estimate: 47169
ratio: -1.036%

file: data/words-4
exact:   235886
estimate: 232426
ratio: 1.467%

file: data/words-5
exact:   349900
estimate: 346596
ratio: 0.945%

file: data/words-6
exact:   479829
estimate: 477269
ratio: 0.534%

Additional Examples with text8 and others

# exact
$ cat /usr/share/dict/words | sort -u | wc -l
235886
# approximate with ccard
$ cat /usr/share/dict/words | ./ccard
232426

So in this case ccard error is 1.47%

# Compute exact count using trivial Python script contained herein
$ time pigz -c -d data/text8.lines.gz | python uniq.py
    5.96 real         0.45 user         0.10 sys
253855
# the slow way
$ time pigz -c -d data/text8.lines.gz | sort -u | wc -l
    293.16 real         0.51 user         0.09 sys
253855
$ time pigz -c -d data/text8.lines.gz | ./ccard
    1.40 real         0.48 user         0.13 sys
253801

Here ccard is off by -0.021%

License

ccard is licensed under the MIT license and J. Andrew Rogers' metrohash-c library (vendored directly into this project) is as well.

References

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

@misc{1612.02284,
Author = {Jason Qin and Denys Kim and Yumei Tung},
Title = {LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting},
Year = {2016},
Eprint = {arXiv:1612.02284},
}

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.md		README.md
ccard.c		ccard.c
metrohash.h		metrohash.h
metrohash64.c		metrohash64.c
test.sh		test.sh
uniq.py		uniq.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CCard

Description

Examples

Additional Examples with text8 and others

License

References

About

Releases

Packages

Languages

License

travisbrady/ccard

Folders and files

Latest commit

History

Repository files navigation

CCard

Description

Examples

Additional Examples with text8 and others

License

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages