Create a read-only index that drops index files not needed for searching #13158
Comments
This is a cool idea @msokolov. It is wasteful to lug around those Probably the solution would have to be something like segment to segment? I.e. for each segment in the index, we would make a corresponding "read only" segment (stripped of the I wonder if there are other things in a Lucene index today that are needed only during indexing?
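A rough sketch of the segment-to-segment idea, assuming it boils down to copying each file and skipping the ones needed only for indexing. Everything here is hypothetical: `clone_read_only`, the directory layout, and the `.raw` / `.quantized` / `.meta` suffixes are placeholders, not Lucene file names or APIs. Whether a reader could actually open the result is outside the scope of the sketch.

```python
import shutil
import tempfile
from pathlib import Path


def clone_read_only(src: Path, dst: Path, indexing_only_suffixes: set[str]) -> list[str]:
    """Copy every file in src to dst, skipping files whose suffix is in
    indexing_only_suffixes. Returns the names of the skipped files."""
    dst.mkdir(parents=True, exist_ok=True)
    skipped = []
    for f in sorted(src.iterdir()):
        if not f.is_file():
            continue
        if f.suffix in indexing_only_suffixes:
            skipped.append(f.name)  # needed only by the writer, not copied
            continue
        shutil.copy2(f, dst / f.name)
    return skipped


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical one-segment index; the suffixes are made up for illustration.
    src = Path(tmp) / "index"
    src.mkdir()
    for name in ("_0.raw", "_0.quantized", "_0.meta"):
        (src / name).write_bytes(b"x")

    dst = Path(tmp) / "index-read-only"
    skipped = clone_read_only(src, dst, {".raw"})
    copied = sorted(p.name for p in dst.iterdir())
```

The set of suffixes is passed in by the caller because which files count as indexing-only is exactly the open question in this thread.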
Don't "deletes" require "writes"? Meaning, if enough docs get deleted in a segment, it will ultimately need to be merged, which is then a "write"? For a quicker win in scalar quantization, it could be cheaper or easier to have a configured threshold beyond which we throw away the floating-point values, since we know the distribution of the values won't change significantly from that point on. Then on merges, if adjustments are required, we accept the cost of de-quantizing and re-quantizing. This can help relevancy more than you would expect, but obviously not as much as having access to the raw values.
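To make the de-quantize and re-quantize step concrete, here is a minimal sketch assuming plain linear (min/max) scalar quantization to 128 levels. The function names, the value ranges, and the 0..127 code range are assumptions for illustration and not necessarily how Lucene computes its quantization bounds.

```python
LEVELS = 127  # assumed highest code value (codes are 0..127)


def quantize(values, lo, hi):
    """Map floats to integer codes 0..LEVELS, clamping values outside [lo, hi]."""
    scale = LEVELS / (hi - lo)
    return [round((min(max(v, lo), hi) - lo) * scale) for v in values]


def dequantize(codes, lo, hi):
    """Map integer codes back to approximate floats in [lo, hi]."""
    step = (hi - lo) / LEVELS
    return [lo + c * step for c in codes]


def requantize(codes, old_range, new_range):
    """Re-express codes under new bounds without access to the raw floats."""
    return quantize(dequantize(codes, *old_range), *new_range)


original = [-0.8, -0.1, 0.0, 0.35, 0.9]
codes = quantize(original, -1.0, 1.0)
restored = dequantize(codes, -1.0, 1.0)

# Error from one round trip is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(original, restored))

# On a merge where the bounds shift, only the codes are available.
merged_codes = requantize(codes, (-1.0, 1.0), (-1.2, 1.2))
```

Each re-quantization starts from already-rounded values, so rounding error can accumulate across merges. That is the relevancy cost relative to keeping the raw values.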
The goal of segment-replication is to completely separate searching from writing, so in that world, no merging is done by searchers -- it happens upstream in a writer/indexer, or perhaps in a dedicated merger (we have both setups going on).
@msokolov Ah, yes, for segment-replication, once the segments are built, I could see certain things being removed. I understand the idea better now.
Description
Now that we have vector quantization we face the possibility of writing an index that is 5 times bigger than is needed for searching. If the index is primarily vectors and they are quantized, we still save the full-precision vectors, but they may not be required at all for searching. In an architecture where indexes are written on one set of hosts and replicated to another set of hosts for searching, it is wasteful to copy all of the full-precision vectors to the searcher nodes. But Lucene has no way of distinguishing data needed for searching from data needed only for indexing. I wonder if we could create a "write read-only index" operation that would effectively clone the existing index, dropping any data required only for indexing, and mark the index as read-only so it could never be opened for writing. This might be useful in some way for version upgrades as well?
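The "5 times bigger" figure can be checked with back-of-the-envelope arithmetic, assuming 4-byte float vectors, 1-byte quantized values, and ignoring graph and metadata overhead. The function name and the dimensionality below are hypothetical.

```python
FLOAT32_BYTES = 4  # assumed size of one full-precision component
INT8_BYTES = 1     # assumed size of one quantized component


def vector_bytes_per_doc(dims: int, keep_full_precision: bool) -> int:
    """Approximate vector storage per document, excluding graph and metadata."""
    quantized = dims * INT8_BYTES
    raw = dims * FLOAT32_BYTES if keep_full_precision else 0
    return raw + quantized


dims = 768  # hypothetical dimensionality; the ratio does not depend on it
writer_side = vector_bytes_per_doc(dims, keep_full_precision=True)
searcher_side = vector_bytes_per_doc(dims, keep_full_precision=False)
ratio = writer_side / searcher_side  # (4 + 1) / 1
```

Under these assumptions the writer-side index stores 5 bytes per vector component while the searcher needs 1.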