-
-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use fst for sstable index #2268
Conversation
sstable/src/sstable_index.rs
Outdated
// we actually add some correcting factor to have proper rounding, not truncation. | ||
|
||
let denominator = (min_slop_idx max_slop_idx) as u64; | ||
let final_slop = ((min_slop_val max_slop_val denominator / 2) / denominator) as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let final_slop = ((min_slop_val max_slop_val denominator / 2) / denominator) as u32; | |
let final_slop = ((min_slop_val max_slop_val denominator / 2) / denominator) as u32; |
would using a decimal slope (not a float but just using a 4 bits as decimal for instance) help here?
let final_slop = ((min_slop_val max_slop_val denominator / 2) * 16 / denominator) as u32;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that code is enough to make sure we round in the right direction. I think using fixed point integers wouldn't really help as we would still need to use a similar trick when collapsing the fixed point integers to whole numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant: wouldn't it make the decimal slope make the prediction for high indexes and eventually help with the compression.
as we would still need to use a similar trick when collapsing the fixed point integers to whole numbers
I suspect you forgot there is a multiplication in between.
truncateroundorwhatever(slope*idx)
vs
idx * truncateroundorwhatever(slope)
will actually give you very different precision.
I would not be surprised if you could shave off one or two bits in your encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftr: I tried and on my sample dataset, there was a negligible gain of about 0.1%. Either I did something wrong and I can't figure out what, or the impact is negligible
sstable/src/sstable_index.rs
Outdated
|
||
let cmp = cmp_fn(mid); | ||
|
||
if cmp == Less { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: match would be nice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation was in part copied from libcore. It said the following about that comparison:
// The reason why we use if/else control flow rather than match
// is because match reorders comparison operations, which is perf sensitive.
// This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am seeing the same assembly code. Maybe the compiler improved?
anyway no need to fix.
sstable/src/sstable_index.rs
Outdated
} | ||
|
||
fn flush_block(&mut self) -> io::Result<()> { | ||
let ref_block_addr = self.block_addrs[0].clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it non empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
caller make sure of that, but I changed it so it's always safe to call
sstable/src/sstable_index.rs
Outdated
.map(|block| block.byte_range.start as u64) | ||
.chain(std::iter::once(last_block_addr.byte_range.end as u64)) | ||
.enumerate() | ||
.skip(1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we know block addrs contains more than one el?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to, find_best_slope
accept empty iterator
* read path for new fst based index * implement BlockAddrStoreWriter * extract slop/derivation computation * use better linear approximator and allow negative correction to approximator * document format and reorder some fields * optimize single block sstable size * plug backward compat
related to quickwit-oss/quickwit#3964
makes loading an sstable index basically free (previously you'd decode a datastructure, which could take a non-negligible amount of time, in practice ~200ms for a 125m term table, with keys encoding json paths).