-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KnnFloatVectorQuery misses highest-ranking results that FloatVectorSimilarityQuery retrieves #13611
Comments
Here is a link to the index: https://drive.google.com/file/d/1157Sn7KXZhnJb0AWd4vKyH3SWrr_wZB5/view?usp=sharing. |
@msokolov here is a good data set to test graph connectedness improvements. @david-sitsky looking at your data, there are multiple duplicates, which will form highly clustered regions within the HNSW graph and this will lead to disconnected components and difficulty searching. I downloaded your index and debugged the graph. It isn't fully connected from what I can tell, and the searcher seems to get stuck in local minima:
With connecting disconnected components as @msokolov 's new PR work does:
Now, even with this, we get stuck in local minima with So, I experimented with a couple (here is the code I used, note, I manually patched to use @msokolov 's connectedness improvements):
Even with all this, I couldn't get the searcher to break out of the local minima. I wonder if our connectedness check accounts for the directional nature of the graph... I will debug further as I can. |
^ My testing code has a bug. Fixing that to actually test other graph building params. FYI, to get the true nearest with the default params, I had to set |
OK, after correcting my testing function, here are the
It seems given the order of your data, we are hitting local minima pretty often. Increasing the beamWidth for your format seems like the best option as it will increase graph quality, allowing the graph building itself to not get stuck in local minma. Or, you can set |
@benwtrent - thank you so much for your insights and recommendation! I will try using m=16 and b=250 by default when we create our indexes to see if this improves things overall. Many thanks. |
Description
I work on a program which supports using both KnnFloatVectorQuery and FloatVectorSimilarityQuery for querying an index. Each item in an index can have multiple embeddings/vectors associated with it, so I use Lucene's parent/child documents, as you can see in the query code below.
A user has found an unusual situation where using KnnFloatVectorQuery has missed all top-ranking results which FloatVectorSimilarityQuery found. I do realise these queries work differently, and that HNSW is approximate, however I still found the results strange and worth reporting here in case this is seen as unexpected behaviour.
I have a reproduction here in this rough test program. FWIW in my index this runs against, all parent items in this instance only have one image embedding child document.
The output from running this is as follows:
FloatVectorSimilarityQuery finds the best results, and the "top k" queries (with/without using DiversifyingChildrenFloatKnnVectorQuery) return the same results, but all missing the top results.
If it helps, I can provide the index to run against this program (it is 24M compressed), but wanted to check first if what I am reporting is expected or not.
Version and environment details
This is using Lucene 9.11.1 on Linux.
The text was updated successfully, but these errors were encountered: