Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KnnFloatVectorQuery misses highest-ranking results that FloatVectorSimilarityQuery retrieves #13611

Closed
david-sitsky opened this issue Jul 26, 2024 · 5 comments
Labels

Comments

@david-sitsky
Copy link

david-sitsky commented Jul 26, 2024

Description

I work on a program which supports using both KnnFloatVectorQuery and FloatVectorSimilarityQuery for querying an index. Each item in an index can have multiple embeddings/vectors associated with it, so I use Lucene's parent/child documents, as you can see in the query code below.

A user has found an unusual situation where using KnnFloatVectorQuery has missed all top-ranking results which FloatVectorSimilarityQuery found. I do realise these queries work differently, and that HNSW is approximate, however I still found the results strange and worth reporting here in case this is seen as unexpected behaviour.

I have a reproduction here in this rough test program. FWIW in my index this runs against, all parent items in this instance only have one image embedding child document.

package org.testme;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.FieldExistsQuery;
import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.DiversifyingChildrenFloatKnnVectorQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

public class TestLuceneVectorSearch
{
    @Test
    public void testMe() throws Exception
    {
        try (Directory directory = FSDirectory.open(Paths.get("/path/to/TextIndex"));
             DirectoryReader directoryReader = DirectoryReader.open(directory))
        {
            IndexSearcher indexSearcher = new IndexSearcher(directoryReader);

            // Matches all "parent" documents in the index, which have the "store-item-id" field.
            // A parent document can have multiple image embeddings, by creating multiple child documents which use
            // the "image-embeddings" field.
            BitSetProducer parentDocsFilter = new QueryBitSetProducer(new FieldExistsQuery("store-item-id"));

            // Embeddings which represent the image search query "sports car".
            float[] queryVector = new float[] {
                0.0044572074f, -0.023088364f, -0.0036703937f, -0.009964277f,
                -0.00916677f, 0.024540152f, 0.0070927753f, -0.09274965f, -0.005670112f,
                -1.6179009E-4f, 0.015744649f, -0.020366224f, 0.0016601892f, -0.010853463f,
                0.033737496f, 0.0045916773f, 0.058148578f, 9.6350856E-4f, -0.014317f,
                0.016393812f, 0.020966044f, 0.018872999f, -7.3815876E-4f, 0.014115239f,
                -0.031016221f, 0.03167156f, -0.005343557f, 6.839247E-4f, -0.0042909216f,
                0.016312996f, -0.004735699f, -0.012280128f, 0.019533258f, 0.017941408f,
                -0.045601472f, -0.008958914f, -0.002675118f, -9.5679046E-4f, 0.010099755f,
                0.0017631804f, 0.01860111f, 0.037259832f, -0.012140477f, -0.0030949886f,
                -0.0058084503f, 0.0124143725f, 2.2733907E-4f, -0.012101588f,
                0.0108692385f, -0.018079244f, -0.011381395f, -0.017509837f, 0.0012023754f,
                -0.014764546f, -0.022564514f, -0.03368696f, 0.029053459f, -0.001420365f,
                -0.050087184f, -0.018782789f, 0.024201525f, -0.027481187f, -0.0052858875f,
                0.010976288f, 1.2510385E-4f, 0.014984402f, 0.034309097f, 0.017706916f,
                0.018962046f, -0.008943878f, 0.0020767413f, -0.0373546f, -0.005417921f,
                0.025909677f, 0.0024755383f, -0.0110565685f, 0.019262856f, 0.01675921f,
                -0.009826738f, -0.020427415f, -0.01732683f, 0.002310435f, -0.004211249f,
                0.023289248f, 0.018740546f, -0.006772653f, -0.006745499f, -0.024986906f,
                0.008061354f, -0.015194572f, 0.011593046f, -0.05354502f, -0.13884935f,
                0.033540063f, -0.0027567758f, 0.023013994f, -0.014023355f, 0.015025772f,
                0.016264228f, -0.01661682f, 0.0035698174f, -0.016123693f, 0.034136593f,
                0.004460381f, -0.018264858f, -0.006348571f, -0.011179938f, -0.010155596f,
                -5.2316306E-4f, -0.012665312f, 0.0061210217f, 0.016656024f, -0.004163066f,
                -3.4165586E-4f, 0.016313404f, -0.015221417f, 0.008724262f, 0.037221473f,
                0.0038612096f, -0.016207112f, 0.022699108f, -0.019367028f, -0.019470742f,
                -0.0063872794f, -0.0015510321f, -0.04432974f, -0.020032836f,
                -0.011690585f, 0.03094845f, 0.0067724036f, 0.012485819f, -0.009286312f,
                0.008886298f, 0.61861366f, -0.0045251283f, -0.0077076033f, -0.043705218f,
                -0.025324458f, 0.021726586f, -0.026047882f, -0.009551234f, -0.0071868496f,
                -0.0036969658f, 0.020132458f, -0.006516556f, -0.0070351847f, 0.017480128f,
                -0.0035662379f, -5.9353746E-4f, -0.02526074f, 0.022630077f, -0.011171131f,
                -0.005936157f, 0.040870648f, -0.018079637f, 0.026608476f, 0.009430965f,
                -0.027296953f, -0.014650009f, 0.006681159f, -2.6793202E-4f, 0.0054786704f,
                -0.013636381f, 0.016031679f, -0.028956043f, -8.7672856E-4f, 0.013015282f,
                -0.013950296f, -0.012305141f, 0.007395428f, 0.0032757986f, 0.011180955f,
                0.018238183f, 0.012033082f, -0.036541812f, 0.0057558473f, -0.0071096867f,
                -0.008386121f, 0.012468599f, 0.022702914f, -0.0073613483f, 0.028406166f,
                -0.016778922f, -0.017091695f, -0.033710238f, -0.016843721f, 0.015285634f,
                -0.019003538f, -0.00687855f, -7.775667E-4f, -0.024790084f, 0.016236953f,
                -0.006595245f, -0.015513008f, -0.03021261f, 0.0030078986f, -0.026664777f,
                0.008451913f, 0.004026551f, -0.011371533f, -0.015816687f, -0.0026805112f,
                0.017776044f, 0.017499488f, 0.0044229627f, 0.017531231f, -0.033204503f,
                -0.038329072f, -0.011035979f, 0.008958172f, 0.07328921f, 0.0038306648f,
                0.03270265f, 0.015056664f, -0.006860551f, -0.004933787f, 0.016191917f,
                -0.006549873f, -0.015812844f, -0.0099520385f, -0.019040879f,
                -0.037397895f, 0.015847206f, -0.0016991902f, 0.003470394f, -0.0069604022f,
                0.0123413615f, -0.009023129f, -0.007122265f, -0.011230118f, -0.007362384f,
                0.0020543125f, 0.0024772482f, -0.0076109925f, 0.03498191f, -0.011076619f,
                -0.011154479f, -0.01450519f, -0.01843803f, -0.017011909f, 0.0018331372f,
                0.0151024535f, 0.016623776f, -0.027112132f, -0.030555645f, -0.011304468f,
                0.0251135f, 0.006708286f, 0.00846858f, -0.010242636f, -0.00698456f,
                0.019706938f, 0.013477113f, 0.048511542f, -0.005879136f, 0.009369399f,
                0.004999097f, -0.004784924f, 0.016561827f, 0.0036518855f, -0.005227837f,
                0.0037853734f, -0.009837364f, 0.012072863f, 0.03813349f, 0.0040256353f,
                0.0013520177f, -0.01447286f, 0.008837758f, -0.0066623543f, 0.0029706238f,
                0.018294264f, -0.01446418f, -0.0021699388f, 9.294378E-4f, -0.009523726f,
                0.005299897f, -0.012993116f, 0.025575459f, -0.016830947f, 0.011483546f,
                -0.0011682257f, 0.005689315f, -0.01871892f, -0.017454233f, -0.0015068237f,
                0.04453382f, 0.0029374026f, 0.038485717f, 0.0019930135f, -0.004014516f,
                -0.016176851f, 0.0055262805f, 0.008696258f, -0.021886224f, 0.025037047f,
                -0.038151f, 0.006943026f, 0.017139055f, 0.013372888f, 0.023437364f,
                -0.0054156454f, -0.0014752378f, 0.0046605296f, -0.0044771726f,
                -0.011856738f, 0.0010809092f, 0.010216948f, -0.012713817f, -0.0031348357f,
                0.009013894f, 0.0011253358f, 0.61798275f, 0.007944298f, 0.0085330885f,
                0.016979003f, 4.995474E-4f, -0.027207121f, 0.04165457f, 0.0020099438f,
                0.008510639f, 0.019254327f, 0.013971549f, 0.0073373774f, -0.0055961516f,
                4.949079E-4f, 0.02810051f, 0.0060176505f, -0.008400483f, -0.19501963f,
                0.016252134f, 0.012292523f, 0.0018070682f, 0.008999078f, 0.022372805f,
                -0.016504897f, -0.028100906f, 0.007098479f, -0.009990526f, -0.0017882327f,
                0.0050334823f, -0.0068439087f, 0.0026650713f, -0.03168618f,
                -0.0012727041f, 0.008549434f, -0.0067351903f, 4.8637684E-4f,
                -0.007929317f, -0.004617511f, -0.03894391f, 0.013047643f, 0.036115382f,
                -0.0026169834f, -0.02540212f, -4.6752606E-4f, -8.121685E-4f, 0.022683896f,
                0.00134045f, 0.042805973f, -0.0041396986f, -0.008076729f, 4.813038E-4f,
                -0.026571859f, -0.002208052f, -0.030623492f, 0.0071517443f, 0.0060770884f,
                8.646011E-4f, 0.006398815f, -0.007452149f, 0.018887492f, -0.0148247555f,
                0.016297784f, -0.015086365f, 0.015252803f, -0.0042130435f, -0.002824615f,
                0.029199244f, 0.009138435f, -0.015550282f, 0.019079657f, -6.981265E-4f,
                1.9067482E-4f, 0.01982623f, 0.0011727469f, 0.0057251197f, -0.0015611411f,
                0.004203257f, -0.008882021f, -0.050709292f, 0.036732737f, -0.0016383937f,
                -0.0052129203f, 5.78685E-4f, 0.01028424f, 0.0071797483f, -0.020324964f,
                0.003225342f, 0.054530565f, 0.006593899f, 0.005106005f, -0.014254335f,
                0.0025621254f, -0.037771065f, -0.010182639f, 0.004708179f, 9.6000374E-5f,
                0.014761056f, -0.012892494f, -0.0025439663f, 0.009076798f, 0.0032978996f,
                0.00796419f, 0.0025830409f, 0.0055782637f, -0.008025513f, -0.016867429f,
                -0.0023941789f, -0.008508283f, -0.008827625f, -0.012730328f,
                -0.006827924f, -0.03513044f, -0.019266263f, -0.011573588f, -0.0035062141f,
                0.0052483953f, -3.5721017E-4f, -0.0021933548f, -0.015921012f,
                -0.011550315f, -0.008281973f, -0.0033136331f, -0.015491238f,
                -0.007224302f, -0.028960207f, 0.031132156f, -0.005436975f, 0.00838252f,
                -0.013607596f, -0.0048204553f, -0.010242622f, -0.030366635f,
                -0.0072604655f, 7.622423E-4f, 0.0013710709f, -0.035052024f, -0.013582093f,
                0.005741299f, 0.008179583f, 0.02272927f, -0.0040672733f, 0.017910969f,
                -0.006078158f, -0.04835871f, 0.025611773f, 0.02066559f, -0.0017394141f,
                1.7129006E-4f, -0.00600073f, 0.011923645f, 0.02351016f, 0.006471754f,
                0.00868545f, 0.0075797923f, 0.023683062f, -0.015859105f, -0.0062999893f,
                -0.0094027f, -0.018763369f, -0.02838345f, -0.004544819f, -0.03608459f,
                0.016126886f, 0.005982367f, 0.0012092822f, 0.020421034f, 0.027935015f,
                0.011481908f, 0.029014295f, -0.06716036f, -0.011798545f, -0.0021892604f,
                0.0022094583f, -0.007288418f, 0.002441089f, 0.015705219f, 0.0016868426f,
                -0.016558398f, -0.0013452561f, 0.014902193f, -0.023527546f, 0.0833602f,
                -0.010013801f, -0.012113727f, 0.022079771f, 0.0064695985f, -0.020935113f,
                6.643729E-4f, -0.016690062f, -6.999961E-4f, -0.002155845f, 0.0222167f,
                -0.0024071531f, -0.011394607f, -0.0042578937f, -0.015400263f,
                -0.006934272f, 0.025316682f, -0.03549049f, -0.0050169053f };

            // Perform vector similarity query using a threshold of 0.62.
            float resultSimilarity = 0.62f;
            float traversalSimilarity = resultSimilarity - 0.05f;

            System.out.println("Similarity query results: \n");
            Query similarityQuery =
                new ToParentBlockJoinQuery(
                    new FloatVectorSimilarityQuery("image-embeddings",
                                                   queryVector,
                                                   traversalSimilarity,
                                                   resultSimilarity),
                    parentDocsFilter,
                    ScoreMode.Max);
            TopDocs topDocs = indexSearcher.search(similarityQuery, 5);
            printResults(indexSearcher, topDocs);

            // Perform "top k" vector search using DiversifyingChildrenFloatKnnVectorQuery.
            System.out.println();
            System.out.println("Top k query results using DiversifyingChildrenFloatKnnVectorQuery: \n");
            Query diversifyingChildrenFloatKnnVectorQuery = new DiversifyingChildrenFloatKnnVectorQuery(
                "image-embeddings",
                queryVector,
                null,
                5,
                parentDocsFilter);
            Query rewrittenKnnQuery = indexSearcher.rewrite(diversifyingChildrenFloatKnnVectorQuery);
            Query finalQuery = new ToParentBlockJoinQuery(rewrittenKnnQuery, parentDocsFilter, ScoreMode.Max);
            topDocs = indexSearcher.search(finalQuery, 5);
            printResults(indexSearcher, topDocs);

            // Perform regular "top k" vector search.
            System.out.println();
            System.out.println("Top k query results: \n");
            Query knnQuery =
                new ToParentBlockJoinQuery(
                    new KnnFloatVectorQuery("image-embeddings", queryVector, 5),
                    parentDocsFilter,
                    ScoreMode.Max);
            topDocs = indexSearcher.search(knnQuery, 5);
            printResults(indexSearcher, topDocs);
        }
    }

    private static void printResults(IndexSearcher indexSearcher, TopDocs topDocs)
        throws IOException
    {
        for (ScoreDoc scoreDoc : topDocs.scoreDocs)
        {
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println("Name: "   document.get("name")  
                               " store-item-id: "   document.get("store-item-id")  
                               " score: "   scoreDoc.score);
        }
    }
}

The output from running this is as follows:

Similarity query results: 

Name: ferrari.02.jpg store-item-id: 13 score: 0.6431116
Name: ferrari.02.jpg store-item-id: 23 score: 0.6431116
Name: lambo.01.jpg store-item-id: 11 score: 0.63762814
Name: lambo.01.jpg store-item-id: 21 score: 0.63762814
Name: ferrari.01.jpg store-item-id: 5 score: 0.6363953

Top k query results using DiversifyingChildrenFloatKnnVectorQuery: 

Name:  store-item-id: 6093 score: 0.62397873
Name:  store-item-id: 6095 score: 0.62397873
Name:  store-item-id: 1368 score: 0.6209105
Name:  store-item-id: 142 score: 0.6206993
Name:  store-item-id: 5611 score: 0.62044996

Top k query results: 

Name:  store-item-id: 6093 score: 0.62397873
Name:  store-item-id: 6095 score: 0.62397873
Name:  store-item-id: 1368 score: 0.6209105
Name:  store-item-id: 142 score: 0.6206993
Name:  store-item-id: 5611 score: 0.62044996

FloatVectorSimilarityQuery finds the best results, and the "top k" queries (with/without using DiversifyingChildrenFloatKnnVectorQuery) return the same results, but all missing the top results.

If it helps, I can provide the index to run against this program (it is 24M compressed), but wanted to check first if what I am reporting is expected or not.

Version and environment details

This is using Lucene 9.11.1 on Linux.

@david-sitsky
Copy link
Author

@benwtrent
Copy link
Member

benwtrent commented Jul 26, 2024

@msokolov here is a good data set to test graph connectedness improvements.

@david-sitsky looking at your data, there are multiple duplicates, which will form highly clustered regions within the HNSW graph and this will lead to disconnected components and difficulty searching.

I downloaded your index and debugged the graph. It isn't fully connected from what I can tell, and the searcher seems to get stuck in local minima:

> Task :lucene:sandbox:TestLuceneVectorSearch.main()
Leaf 0 has 3 layers
Leaf 0 has 10832 documents
Graph level=2 size=16, Fanout min=1, mean=3.63, max=8
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   2   2   2   3   3   4   5   6   8
Graph level=1 size=338, Fanout min=1, mean=5.81, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   2   3   5   6   8  10  14  16
Graph level=0 size=5179, Fanout min=1, mean=4.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   4   7  11  32
Graph level=2 size=16, connectedness=1.00
Graph level=1 size=338, connectedness=0.99
Graph level=0 size=5179, connectedness=0.96
Similarity query results: 

With connecting disconnected components as @msokolov 's new PR work does:

Graph level=2 size=16, Fanout min=1, mean=3.63, max=8
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   2   2   2   3   3   4   5   6   8
Graph level=1 size=338, Fanout min=1, mean=5.81, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   2   3   5   6   8  10  14  16
Graph level=0 size=5179, Fanout min=1, mean=4.15, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   2   4   7  11  32
Graph level=2 size=16, connectedness=1.00
Graph level=1 size=338, connectedness=0.99
Graph level=0 size=5179, connectedness=1.00

Now, even with this, we get stuck in local minima with k so small & default graph building params.

So, I experimented with a couple (here is the code I used, note, I manually patched to use @msokolov 's connectedness improvements):

for (int m = 16; m <= 64; m  = 16) {
        for (int b = 100; b <= 250; b = 50) {
          int finalM = m;
          int finalB = b;
          // lets rebuild the index no nestedness
          // Delete the new Dir if it exists
          System.out.println("-----------------------------------");
          System.out.println("Rebuilding the index with m="   m   " and b="   b);
          try (Directory newDir =
                 FSDirectory.open(Paths.get("/Users/benjamintrent/Downloads/TextIndexRebuilt"))) {
            try (IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig().setCodec(
              new Lucene99Codec() {
                @Override
                public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                  return new Lucene99HnswVectorsFormat(finalM, finalB);
                }
              }
            ).setOpenMode(IndexWriterConfig.OpenMode.CREATE))) {
              FloatVectorValues vectorValues = leafReader.getFloatVectorValues("image-embeddings");
              while (vectorValues.nextDoc() != NO_MORE_DOCS) {
                Document document = new Document();
                document.add(new KnnFloatVectorField("image-embeddings", vectorValues.vectorValue(), VectorSimilarityFunction.DOT_PRODUCT));
                writer.addDocument(document);
              }
              writer.commit();
              writer.forceMerge(1);
            }
            try (DirectoryReader newDirReader = DirectoryReader.open(newDir)) {
              IndexSearcher newSearcher = new IndexSearcher(newDirReader);
              LeafReaderContext newContext = newDirReader.leaves().get(0);

              LeafReader newLeaf = newContext.reader();
              KnnVectorsReader newVectors =
                ((PerFieldKnnVectorsFormat.FieldsReader) ((CodecReader) newLeaf).getVectorReader())
                  .getFieldReader("image-embeddings");
              HnswGraph newknnValues = ((Lucene99HnswVectorsReader) newVectors).getGraph("image-embeddings");
              System.out.printf("New Leaf %d has %d layers\n", newContext.ord, knnValues.numLevels());
              System.out.printf("New Leaf %d has %d documents\n", newContext.ord, newLeaf.maxDoc());
              printGraphFanout(newknnValues, newLeaf.maxDoc());
              printGraphConnectedNess(newknnValues);
              System.out.println("Rebuilt index results: \n");
              for (int i = 1; i <= 55; i  ) {
                System.out.println("Top 5 results with k="   i * 5   " with graph settings m="   finalM   " and b="   finalB);
                topDocs = newSearcher.search(new KnnFloatVectorQuery("image-embeddings", queryVector, i * 5), 5);
                printResults(newSearcher, topDocs);
              }
            }
          }
        }
      }

Even with all this, I couldn't get the searcher to break out of the local minima. I wonder if our connectedness check accounts for the directional nature of the graph... I will debug further as I can.

@benwtrent
Copy link
Member

^ My testing code has a bug.

Fixing that to actually test other graph building params.

FYI, to get the true nearest with the default params, I had to set k: 155. I will expect that to drop as the more expensive (but better graph quality) params are used.

@benwtrent
Copy link
Member

OK, after correcting my testing function, here are the ks I had to set to get the true NN.

k=140 with graph settings m=16 and b=100
k=145 with graph settings m=16 and b=150
k=145 with graph settings m=16 and b=200
k=5 with graph settings m=16 and b=250 // Seems like the sweet spot
k=165 with graph settings m=32 and b=100
k=165 with graph settings m=32 and b=150
k=165 with graph settings m=32 and b=200
k=170 with graph settings m=32 and b=250
k=170 with graph settings m=48 and b=100
k=170 with graph settings m=48 and b=150
k=170 with graph settings m=48 and b=200
k=170 with graph settings m=48 and b=250
k=170 with graph settings m=64 and b=100
k=170 with graph settings m=64 and b=150
k=170 with graph settings m=64 and b=200
k=170 with graph settings m=64 and b=250

It seems given the order of your data, we are hitting local minima pretty often. Increasing the beamWidth for your format seems like the best option as it will increase graph quality, allowing the graph building itself to not get stuck in local minma.

Or, you can set k higher.

@david-sitsky
Copy link
Author

@benwtrent - thank you so much for your insights and recommendation! I will try using m=16 and b=250 by default when we create our indexes to see if this improves things overall. Many thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants