Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docid deltas while indexing #2249

Merged
merged 2 commits into from
Nov 13, 2023
Merged

docid deltas while indexing #2249

merged 2 commits into from
Nov 13, 2023

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Nov 12, 2023

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term costed 4 bytes instead of 1
byte now.

HDFS Indexing 1.1GB Total memory consumption:
Before: 760 MB
Now: 590 MB

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term costed 4 bytes instead of 1
byte now.

HDFS Indexing 1.1GB Total memory consumption:
Before:  760 MB
Now:     590 MB
@PSeitz PSeitz force-pushed the index_delta_docids branch from c11ddec to f6512d4 Compare November 12, 2023 15:05
VInt32Reader::new(&buffer[..])
.map(|old_doc_id| doc_id_map.get_new_doc_id(old_doc_id)),
);
doc_ids.extend(VInt32Reader::new(&buffer[..]).map(|delta_doc_id| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage of prev_doc here looks like scan instead of map could make this more obvious.

VInt32Reader::new(&buffer[..])
.map(|old_doc_id| doc_id_map.get_new_doc_id(old_doc_id)),
);
doc_ids.extend(VInt32Reader::new(&buffer[..]).map(|delta_doc_id| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 1 in favor of scan of a loop.

Having prev_doc outside of the if-statement is not a good idea. Having it in both statement makes the code easier to (proof)read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to scan, but we can't apply the same technique in the other cases, because they contain a mix of delta and non delta encoded values

@PSeitz PSeitz merged commit b60d862 into main Nov 13, 2023
4 checks passed
@PSeitz PSeitz deleted the index_delta_docids branch November 13, 2023 04:14
PSeitz added a commit that referenced this pull request Apr 10, 2024
* docid deltas while indexing

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term costed 4 bytes instead of 1
byte now.

HDFS Indexing 1.1GB Total memory consumption:
Before:  760 MB
Now:     590 MB

* use scan for delta decoding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants