comparator implements similarity and distance measures for clustering and record linkage applications. It includes measures for comparing strings as well as numeric vectors. Where possible, measures are implemented in C/C++ to ensure fast performance.
Levenshtein()
: Levenshtein distance/similarity
DamerauLevenshtein()
: Damerau-Levenshtein distance/similarity
Hamming()
: Hamming distance/similarity
OSA()
: Optimal String Alignment distance/similarity
LCS()
: Longest Common Subsequence distance/similarity
Jaro()
: Jaro distance/similarity
JaroWinkler()
: Jaro-Winkler distance/similarity
Not yet implemented.
MongeElkan()
: Monge-Elkan measure
FuzzyTokenSet()
: Fuzzy Token Set distance
InVocabulary()
: Compares strings using a reference vocabulary. Useful for comparing names.
Lookup()
: Retrieves distances/similarities from a lookup table.
BinaryComp()
: Compares strings based on whether they agree/disagree exactly.
Euclidean()
: Euclidean (L-2) distance
Manhattan()
: Manhattan (L-1) distance
Chebyshev()
: Chebyshev (L-∞) distance
Minkowski()
: Minkowski (L-p) distance
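The four numeric measures are members of one family: the Minkowski (L-p) distance reduces to Manhattan at p = 1, Euclidean at p = 2, and Chebyshev as p goes to infinity. The base-R sketch below illustrates that relationship only. The helper `minkowski_sketch()` is a hypothetical name introduced here, not part of comparator, and it makes no claim about how the package implements or parameterises these measures.

```r
# Hypothetical helper (not part of comparator): L-p distance between two
# numeric vectors of equal length, using base R only.
minkowski_sketch <- function(x, y, p = 2) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y), p >= 1)
  delta <- abs(x - y)
  if (is.infinite(p)) {
    max(delta)               # Chebyshev (L-infinity): largest coordinate gap
  } else {
    sum(delta^p)^(1 / p)     # general L-p; p = 1 Manhattan, p = 2 Euclidean
  }
}

a <- c(1, 2, 3)
b <- c(4, 6, 3)

minkowski_sketch(a, b, p = 1)    # Manhattan: 3 + 4 + 0 = 7
minkowski_sketch(a, b, p = 2)    # Euclidean: sqrt(9 + 16) = 5
minkowski_sketch(a, b, p = Inf)  # Chebyshev: max(3, 4, 0) = 4
```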
You can install the latest release from CRAN by entering:
install.packages("comparator")
The development version can be installed from GitHub using devtools:
# install.packages("devtools")
devtools::install_github("ngmarchant/comparator")
A measure can be instantiated by calling its constructor function. For instance, we can define a Levenshtein similarity measure that ignores differences in upper/lowercase characters as follows:
measure <- Levenshtein(similarity = TRUE, normalize = TRUE, ignore_case = TRUE)
A measure can be used to compare vectors element-wise as follows:
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
elementwise(measure, x, y)
#> [1] 0.6666667 1.0000000
# shorthand for above
measure(x, y)
#> [1] 0.6666667 1.0000000
Pairwise comparisons are also supported using the following syntax:
# compare each value in x with each value in y and return a similarity matrix
pairwise(measure, x, y, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 0.6666667 0.6842105
#> [2,] 0.5384615 1.0000000
# compare the values in x pairwise and return a similarity matrix
pairwise(measure, x, return_matrix = TRUE)
#>           [,1]      [,2]
#> [1,] 1.0000000 0.6842105
#> [2,] 0.6842105 1.0000000
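To make the scores above easier to interpret, here is a minimal base-R sketch that reproduces them without comparator. It assumes the normalized similarity is 1 - 2d / (|x| + |y| + d), where d is the unit-cost Levenshtein distance and |x|, |y| are the string lengths. That formula is inferred from the outputs shown here rather than taken from the package source, so treat it as an assumption. The helper `norm_lev_sim_sketch()` is a hypothetical name; the edit distance comes from `utils::adist()`.

```r
# Hypothetical helper (not part of comparator): normalized Levenshtein
# similarity for every pair (x[i], y[j]), assuming the normalization
# 1 - 2d / (|x| + |y| + d) inferred from the example outputs.
norm_lev_sim_sketch <- function(x, y, ignore_case = FALSE) {
  d <- utils::adist(x, y, ignore.case = ignore_case)  # edit-distance matrix
  total <- outer(nchar(x), nchar(y), "+")             # |x[i]| + |y[j]|
  sim <- 1 - 2 * d / (total + d)
  sim[total == 0] <- 1   # two empty strings: define similarity as 1
  sim
}

x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")

# All pairs, analogous to the pairwise comparison of x with y
sims <- norm_lev_sim_sketch(x, y, ignore_case = TRUE)
round(sims, 7)
#>           [,1]      [,2]
#> [1,] 0.6666667 0.6842105
#> [2,] 0.5384615 1.0000000

# The element-wise comparison corresponds to the diagonal of that matrix
round(diag(sims), 7)
#> [1] 0.6666667 1.0000000
```

For example, "John Doe" and "jane doe" differ by three substitutions once case is ignored, so the score is 1 - 6/19 ≈ 0.684.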