Create a read-only index that drops index files not needed for searching #13158

Open
msokolov opened this issue Mar 5, 2024 · 4 comments

Comments

@msokolov
Contributor

msokolov commented Mar 5, 2024

Description

Now that we have vector quantization, we face the possibility of writing an index that is 5 times bigger than is needed for searching. If the index is primarily vectors and they are quantized, we still save the full-precision vectors, but they may not be required at all for searching. In an architecture where indexes are written on one set of hosts and replicated to another set of hosts for searching, it is wasteful to copy all of the full-precision vectors to the searcher nodes. But Lucene has no way of distinguishing the data needed for searching from the data needed only for indexing. I wonder if we could create a "write read-only index" operation that would effectively clone the existing index, dropping any data required only for indexing, and mark the index as read-only so it could never be opened for writing. This might be useful in some way for version upgrades as well?
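
One way to read the "5 times" figure is as simple per-dimension arithmetic. The sketch below assumes the raw vectors are float32 (4 bytes per dimension) and the quantized copy is int8 (1 byte per dimension); it ignores the HNSW graph, metadata, per-vector correction terms, and every non-vector field. The class and method names are hypothetical and not part of Lucene.

```java
/** Back-of-the-envelope vector storage estimate (hypothetical helper, not a Lucene API). */
final class VectorStorageEstimate {
    // ASSUMPTION: raw vectors are float32, quantized vectors are one byte per dimension.
    static final int FLOAT32_BYTES = Float.BYTES; // 4
    static final int INT8_BYTES = Byte.BYTES;     // 1

    /** Bytes written per vector when both the raw and the quantized copy are stored. */
    static long writtenBytesPerVector(int dims) {
        return (long) dims * (FLOAT32_BYTES + INT8_BYTES);
    }

    /** Bytes read per vector if searching touches only the quantized copy. */
    static long searchBytesPerVector(int dims) {
        return (long) dims * INT8_BYTES;
    }

    /** Ratio of what is written to what searching needs. */
    static double writtenToSearchRatio(int dims) {
        return (double) writtenBytesPerVector(dims) / searchBytesPerVector(dims);
    }

    public static void main(String[] args) {
        int dims = 768; // illustrative dimension, not taken from the issue
        System.out.println("written bytes/vector: " + writtenBytesPerVector(dims));
        System.out.println("search bytes/vector:  " + searchBytesPerVector(dims));
        System.out.println("ratio:                " + writtenToSearchRatio(dims));
    }
}
```

Under these assumptions the ratio does not depend on the dimension: (4 + 1) / 1 = 5.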

@mikemccand
Member

This is a cool idea @msokolov. It is wasteful to lug those float32 vectors out to the searchers in an NRT segment replication architecture. In practice, they would consume disk space on the searchers, and copying them out would waste time, but since the OS would never load them at search time, their bytes would remain cold on disk and not put much pressure on OS free RAM? The OS would only cache the disk pages in the index that are actually needed at search time. It would be nice not to copy all that deadweight around ...

Probably the solution would have to be something like segment to segment? I.e. for each segment in the index, we would make a corresponding "read only" segment (stripped of the float32 vectors). This way, as the normal index changes (gets new flushed/merged segments), we could also incrementally/NRT maintain the shadow read-only index.
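
The segment-to-segment idea can be sketched at the file level as "copy everything except the files needed only for indexing". The example below is only an illustration of that filtering step using `java.nio.file`; it is not a Lucene feature. The set of droppable extensions (`vec` is used as a stand-in for raw vector data) is an assumption that depends on the codec, and the class and method names are hypothetical. A stock Lucene reader would likely refuse to open the result, since segment metadata still lists the dropped files and compound files bundle everything together, so a real implementation would need codec-level support.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** File-level sketch of a "search-only" copy (hypothetical, not a Lucene API). */
final class SearchOnlyCopy {
    // ASSUMPTION: extensions of files that are needed only for indexing.
    // Which files qualify depends on the codec in use.
    static final Set<String> INDEXING_ONLY_EXTENSIONS = Set.of("vec");

    /** Returns true unless the file's extension is in the indexing-only set. */
    static boolean neededForSearch(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1);
        return !INDEXING_ONLY_EXTENSIONS.contains(extension);
    }

    /** Copies the search-time files from source to target and returns their names. */
    static List<String> copySearchFiles(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        List<String> copied = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (Files.isRegularFile(file) && neededForSearch(name)) {
                    Files.copy(file, target.resolve(name), StandardCopyOption.REPLACE_EXISTING);
                    copied.add(name);
                }
            }
        }
        return copied;
    }
}
```

Because the filter works per file, it could be run again for each newly flushed or merged segment, which is the incremental behavior described above.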

I wonder if there are other things in a Lucene index today that are needed only during indexing?

@benwtrent
Member

Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately need to be merged, which then is a "write"?

For a quicker win in scalar quantization, it could be cheaper or easier to have a configured threshold past which we throw away the floating-point vectors, since we know the distribution of the values won't change significantly from that point on. Then on merges, if adjustments are required, we accept the cost of de-quantizing and re-quantizing. This can help relevancy more than you would expect, but obviously not as much as having access to the raw values.
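
The de-quantize and re-quantize step can be illustrated with a generic min/max scalar quantizer. This is a minimal sketch, not Lucene's quantizer: the class name, the 0..127 code range, and the plain min/max bounds are all assumptions chosen for illustration.

```java
/** Generic min/max scalar quantizer (hypothetical sketch, not Lucene's implementation). */
final class MinMaxScalarQuantizer {
    // ASSUMPTION: codes are stored in the non-negative byte range 0..127.
    private static final int MAX_CODE = 127;
    private final float min;
    private final float max;

    MinMaxScalarQuantizer(float min, float max) {
        if (!(max > min)) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        this.min = min;
        this.max = max;
    }

    /** Maps each float to a code in 0..MAX_CODE, clamping values outside [min, max]. */
    byte[] quantize(float[] vector) {
        float scale = MAX_CODE / (max - min);
        byte[] codes = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            float clamped = Math.max(min, Math.min(max, vector[i]));
            codes[i] = (byte) Math.round((clamped - min) * scale);
        }
        return codes;
    }

    /** Recovers an approximation of the original floats from the codes. */
    float[] dequantize(byte[] codes) {
        float step = (max - min) / MAX_CODE;
        float[] vector = new float[codes.length];
        for (int i = 0; i < codes.length; i++) {
            vector[i] = min + codes[i] * step;
        }
        return vector;
    }

    /** Re-quantizes codes under new bounds when the raw floats are no longer available. */
    static byte[] requantize(byte[] codes, MinMaxScalarQuantizer from, MinMaxScalarQuantizer to) {
        return to.quantize(from.dequantize(codes));
    }
}
```

When the bounds are unchanged, `requantize` returns the same codes. When a merge picks different bounds, the second rounding step adds error on top of the first, which is the relevancy cost of not having the raw values.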

@msokolov
Contributor Author

> Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately need to be merged, which then is a "write"?

The goal of segment-replication is to completely separate searching from writing, so in that world, no merging is done by searchers -- it happens upstream in a writer/indexer, or perhaps in a dedicated merger (we have both setups going on).

@benwtrent
Member

@msokolov AH, yes, for segment-replication, once the segments are built, I could see certain things being removed. I better understand the idea now.

3 participants