Simhash2 is a rewrite of the bookmate/simhash gem, a Ruby implementation of Moses Charikar's simhashes. The key differences are that this gem doesn't monkey patch the String and Integer classes, and that configuration is slightly easier. With the default configuration, the simhash values generated by this gem should be identical to those generated by the Bookmate version.
Install the gem from RubyGems:
gem install simhash2
If you use Bundler, add it to your Gemfile and run bundle install:
gem 'simhash2'
I have only tested this gem on Ruby 2.4.1, but I know of no reason why it wouldn't work on earlier Ruby versions as well.
str1 = "I am the king of the world!"
str2 = "I am the queen of the world!"
simhash1 = Simhash.generate(str1) # => 86798109229625320
simhash2 = Simhash.generate(str2) # => 13921220612431195624
Simhash.hamming_distance(simhash1, simhash2) # => 8
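The Hamming distance reported above is the number of bit positions at which the two hashes differ. As an illustration only, here is a minimal pure-Ruby sketch of that calculation. The helper name bit_difference is hypothetical and is not part of this gem; the gem's own implementation may differ.

```ruby
# Hypothetical helper, not part of simhash2: counts the bit positions
# at which two non-negative integers differ.
def bit_difference(a, b)
  # XOR leaves a 1 in every position where the two values disagree;
  # counting the 1s in the binary string gives the distance.
  (a ^ b).to_s(2).count("1")
end

bit_difference(0b1011, 0b1101) # => 2
```

A small distance suggests similar input: the two example strings above differ by one word, and their hashes differ in only 8 bits.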
Thanks to some performance optimizations by JayTeeSF, this gem generally performs better than bookmate/simhash, especially when working with longer strings with lots of tokens.
test_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
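require 'benchmark'
# Hash a string with bookmate/simhash, which monkey patches String.
def test_simhash(x)
  x.simhash # bookmate/simhash
end
# Hash the same string with this gem.
def test_simhash2(x)
  Simhash.generate(x) # this gem
end
n = 5000
# Run each implementation n times and report the timings.
Benchmark.bm do |x|
  x.report("simhash") { n.times { test_simhash(test_str) } }
  x.report("simhash2") { n.times { test_simhash2(test_str) } }
end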
Results:
              user     system      total        real
simhash   5.109375   0.093750   5.203125 (  5.199069)
simhash2  4.109375   0.000000   4.109375 (  4.108586)