Create a read-only index that drops index files not needed for searching #13158

msokolov · 2024-03-05T18:08:22Z

Description

Now that we have vector quantization, we face the possibility of writing an index that is 5 times bigger than is needed for searching. If the index is primarily vectors and they are quantized, we still save the full-precision vectors, but they may not be required at all for searching. In an architecture where indexes are written on one set of hosts and replicated to another set of hosts for searching, it is wasteful to copy all of the full-precision vectors to the searcher nodes. But Lucene doesn't have any way of distinguishing data needed only for indexing from data needed for searching. I wonder if we could create a "write read-only index" operation that would effectively clone the existing index, dropping any data required only for indexing, and mark the clone as read-only so it could never be opened for writing. This might be useful in some way for version upgrades as well?
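A back-of-the-envelope sketch of where the 5x figure comes from, assuming float32 raw vectors (4 bytes per dimension) and int8 quantized vectors (1 byte per dimension). The class and method names are hypothetical, and the sketch ignores per-vector metadata, the HNSW graph, and every non-vector part of the index.

```java
// Hypothetical helper: compares what the writer stores with what a searcher needs.
final class VectorFootprint {
    // Raw float32 storage: 4 bytes per dimension.
    static long rawBytes(int dims) {
        return 4L * dims;
    }

    // int8 scalar-quantized storage: 1 byte per dimension (assumption: no extra per-vector data).
    static long quantizedBytes(int dims) {
        return dims;
    }

    // Bytes written per vector (raw + quantized) divided by bytes needed for search (quantized only).
    static double writerToSearcherRatio(int dims) {
        return (rawBytes(dims) + quantizedBytes(dims)) / (double) quantizedBytes(dims);
    }

    public static void main(String[] args) {
        int dims = 768;
        System.out.println("raw bytes/vector:       " + rawBytes(dims));
        System.out.println("quantized bytes/vector: " + quantizedBytes(dims));
        System.out.println("writer/searcher ratio:  " + writerToSearcherRatio(dims));
    }
}
```

Under these assumptions the ratio does not depend on the dimension count: each dimension costs 4 + 1 bytes on the writer and 1 byte on the searcher.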

mikemccand · 2024-03-21T12:46:06Z

This is a cool idea @msokolov. It is wasteful to lug those full-precision float32 vectors out to the searchers in an NRT segment replication architecture. In practice, they would consume disk space on the searchers, and copying them out would waste time, but since the OS would never load them at search time, their bytes would remain cold on disk and not put much pressure on OS free RAM? The OS would only cache the disk pages in the index that are actually needed at search time. Still, it would be nice not to copy all that deadweight around ...

Probably the solution would have to be something like segment-to-segment? That is, for each segment in the index, we would make a corresponding "read-only" segment (stripped of the float32 vectors). This way, as the normal index changes (gets new flushed/merged segments), we could also incrementally/NRT maintain the shadow read-only index.
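A minimal sketch of the incremental bookkeeping this implies, assuming segments are identified by name and that a stripped copy is made once per segment. `ShadowIndexPlan` and its methods are hypothetical, not a Lucene API, and the sketch ignores details such as segments still held open by running searchers.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical planner: decides what to do to keep a shadow read-only index in sync.
final class ShadowIndexPlan {
    // Segments in the primary index with no shadow copy yet: these need a stripped copy.
    static Set<String> toStrip(Set<String> primarySegments, Set<String> shadowSegments) {
        Set<String> out = new TreeSet<>(primarySegments);
        out.removeAll(shadowSegments);
        return out;
    }

    // Shadow segments whose primary segment is gone (e.g. merged away): these can be deleted.
    static Set<String> toDelete(Set<String> primarySegments, Set<String> shadowSegments) {
        Set<String> out = new TreeSet<>(shadowSegments);
        out.removeAll(primarySegments);
        return out;
    }

    public static void main(String[] args) {
        Set<String> primary = Set.of("_0", "_1", "_3"); // after a flush and a merge
        Set<String> shadow = Set.of("_0", "_2");        // state from the previous sync
        System.out.println("strip:  " + toStrip(primary, shadow));
        System.out.println("delete: " + toDelete(primary, shadow));
    }
}
```

Because segments are write-once, a segment that already has a shadow copy never needs to be stripped again, so each sync only touches newly flushed or merged segments.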

I wonder if there are other things in a Lucene index today that are needed only during indexing?

benwtrent · 2024-03-21T13:01:49Z

Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately need to be merged, which is then a "write"?

For a quicker win in scalar quantization, it could be cheaper or easier to have a configured threshold beyond which we throw away the floating-point vectors, since we know the distribution of the values won't change significantly from that point on. Then on merges, if adjustments are required, we accept the cost of de-quantizing and re-quantizing. This can help relevancy more than you would expect, but obviously not as much as having access to the raw values.
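The de-quantize/re-quantize step can be sketched with plain linear scalar quantization, assuming each value is clamped to `[min, max]` and mapped to an integer code in 0..127. This is a generic illustration with hypothetical names, not Lucene's quantizer, which this sketch does not attempt to reproduce.

```java
// Hypothetical sketch of linear scalar quantization and re-quantization without raw vectors.
final class ScalarQuantSketch {
    // Map each float to a code in 0..127 using the given bounds.
    static byte[] quantize(float[] v, float min, float max) {
        float scale = 127f / (max - min);
        byte[] out = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(min, Math.min(max, v[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    // Recover approximate floats from codes; the error is about half a quantization step.
    static float[] dequantize(byte[] q, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            out[i] = min + q[i] * step;
        }
        return out;
    }

    // Re-quantize under new bounds when the raw floats are gone: decode, then encode again.
    static byte[] requantize(byte[] q, float oldMin, float oldMax, float newMin, float newMax) {
        return quantize(dequantize(q, oldMin, oldMax), newMin, newMax);
    }

    public static void main(String[] args) {
        float[] v = {-1f, 0f, 0.5f, 1f};
        byte[] q = quantize(v, -1f, 1f);
        byte[] widened = requantize(q, -1f, 1f, -2f, 2f); // bounds adjusted at merge time
        System.out.println(java.util.Arrays.toString(q));
        System.out.println(java.util.Arrays.toString(widened));
    }
}
```

Each re-quantization compounds the rounding error of the previous one, which is why this cannot match having the raw values available.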

msokolov · 2024-03-21T16:38:38Z

> Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately need to be merged, which is then a "write"?

The goal of segment replication is to completely separate searching from writing, so in that world, no merging is done by searchers; it happens upstream in a writer/indexer, or perhaps in a dedicated merger (we run both setups).

benwtrent · 2024-03-21T17:05:27Z

@msokolov Ah, yes, for segment replication, once the segments are built, I could see certain things being removed. I understand the idea better now.

msokolov added the type:enhancement label Mar 5, 2024

benwtrent mentioned this issue Mar 29, 2024: "Can we add configuration on dropping raw vectors from quantized formats after some period of time?" #13251 (Open)
