RMQV

This repository contains the data structures and algorithms needed to perform (a) RM3 query expansion, (b) rank fusion, and (c) a combination of these approaches.

This framework was used to produce the results in the following work:

Rodger Benham, J. Shane Culpepper, Luke Gallagher, Xiaolu Lu, Joel Mackenzie, Towards Efficient and Effective Query Variant Generation, DESIRES 2018.

Building Collections

Inverted Index

First, we start with an Indri index. We then convert it to a ds2i index using the indri_to_ds2i binary in the format_collection directory.

In addition to the normal ds2i files (described at the bottom of this README), the conversion also outputs a document map and a lexicon.

Next, once you have a ds2i-formatted collection, you can build the PEF index and the WAND data required for top-k search. This is well documented below (in the ds2i section of this README). Note that I always use opt encoding. VBMW indexes need to be built with caution: the parameter fixed_cost_wand_partition in configuration.hpp (the lambda parameter from the VBMW paper) affects the optimization of block sizes, so you may end up with an index whose average block size is strange or unexpected. If you only care about fixed-size BMW indexes, you can use the block_size parameter (also in configuration.hpp) to create a normal BMW index with the provided block size.

Document Vectors

The document vector code is entirely contained within the docvector/ directory. Build the code, and then use create_docvectors to generate the document vectors for the collection. This is similar to the creation of the inverted indexes (it takes a ds2i collection as input). You can also provide a stoplist to ensure your document vectors do not contain certain terms.

Query Format

Queries are of the form ID t1 t2 ... tk, where terms should be appropriately stemmed/stopped before being passed into the engine. A Krovetz stemmer has been provided in the format_queries directory.
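As a rough illustration of the query format, the sketch below splits each line into an ID and its terms. It assumes whitespace-separated fields and one query per line; the function names and the example queries are hypothetical and are not taken from this repository.

```python
from typing import Dict, List, Tuple


def parse_query_line(line: str) -> Tuple[str, List[str]]:
    """Split one 'ID t1 t2 ... tk' line into (ID, [t1, ..., tk])."""
    fields = line.split()  # assumption: fields are whitespace-separated
    if not fields:
        raise ValueError("empty query line")
    return fields[0], fields[1:]


def parse_queries(text: str) -> Dict[str, List[str]]:
    """Parse a block of query lines, skipping blank lines."""
    queries = {}
    for line in text.splitlines():
        if line.strip():
            query_id, terms = parse_query_line(line)
            queries[query_id] = terms
    return queries


# Illustrative input only; terms are assumed to be stemmed/stopped already.
example = "301 international organize crime\n302 poliomyelitis post polio\n"
queries = parse_queries(example)
```

Stemming and stopping are deliberately left out of this sketch, since they are expected to happen before the queries reach the engine.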

Param files

The supplied binaries that enable RM3 expansion take param files as arguments. An example param file is shown below.

raw_collection=/path/to/ds2i/collection/prefix
inverted_index=/path/to/ds2i/index/example-opt-pef.idx
forward_index=/path/to/ds2i/forward/example-forward.idx
wand_file=/path/to/ds2i/index/example.bmw
docs_to_expand=50
terms_to_expand=100
lambda_expand=0.1
final_k=1000
gen_queries=100
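For readers who want to generate or check param files from a script, a file in this key=value layout could be loaded roughly as follows. This is a minimal sketch, not the parser used by the binaries: the function name is hypothetical, and the choice of which keys are integers or floats is an assumption inferred from the example values above.

```python
# Assumed types, inferred from the example values in the param file.
INT_KEYS = {"docs_to_expand", "terms_to_expand", "final_k", "gen_queries"}
FLOAT_KEYS = {"lambda_expand"}


def parse_param_file(text):
    """Parse key=value lines into a dict, converting numeric fields."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        key, value = key.strip(), value.strip()
        if key in INT_KEYS:
            params[key] = int(value)
        elif key in FLOAT_KEYS:
            params[key] = float(value)
        else:
            params[key] = value  # paths are kept as plain strings
    return params


example_text = """\
wand_file=/path/to/ds2i/index/example.bmw
docs_to_expand=50
terms_to_expand=100
lambda_expand=0.1
final_k=1000
"""
params = parse_param_file(example_text)
```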

Note that:

inverted_index is created using create_freq_index,
wand_file is created using create_wand_data,
forward_index is created using create_docvectors,
docs_to_expand is the number of documents to use for RM3 expansion,
terms_to_expand is the number of terms to expand from the RM (when creating the new, expanded query),
lambda_expand is the weight given to the original query (1 - lambda_expand is given to the expanded query),
final_k is the final top-k list size, and
gen_queries is the number of queries to generate if using the sampler (external_corpus_sampler).
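To make the roles of terms_to_expand and lambda_expand concrete, the sketch below shows a textbook RM3-style interpolation. It assumes the original query model is uniform over query term occurrences and that the relevance model is truncated to its top terms and renormalized; the function name is hypothetical, and the code in this repository may normalize or weight terms differently.

```python
from collections import Counter


def rm3_interpolate(query_terms, rm_weights, lambda_expand, terms_to_expand):
    """Mix the original query with the top terms of a relevance model.

    query_terms:     list of original query terms
    rm_weights:      dict mapping term -> relevance model weight
    lambda_expand:   weight given to the original query
    terms_to_expand: number of relevance model terms to keep
    """
    # Original query model: relative frequency of each query term (assumption).
    counts = Counter(query_terms)
    total_terms = sum(counts.values())
    original = {t: c / total_terms for t, c in counts.items()}

    # Keep the highest-weighted terms and renormalize them to sum to 1.
    top = sorted(rm_weights.items(), key=lambda kv: kv[1], reverse=True)
    top = top[:terms_to_expand]
    top_total = sum(w for _, w in top)
    expanded = {t: w / top_total for t, w in top}

    # lambda_expand goes to the original query, 1 - lambda_expand to the expansion.
    mixed = {}
    for term in set(original) | set(expanded):
        mixed[term] = (lambda_expand * original.get(term, 0.0)
                       + (1.0 - lambda_expand) * expanded.get(term, 0.0))
    return mixed


weights = rm3_interpolate(
    query_terms=["a", "b"],
    rm_weights={"a": 0.5, "c": 0.3, "d": 0.2},
    lambda_expand=0.1,
    terms_to_expand=2,
)
```

In this toy example the term d is dropped by the truncation, and b survives only through its share of the original query weight.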

Walk through

We provide a basic end-to-end walkthrough in the example directory.

ds2i

Data Structures for Inverted Indexes (ds2i) is a library of data structures to represent the integer sequences used in inverted indexes.

This code was used in the experiments of the following papers.

Giuseppe Ottaviano, Rossano Venturini, Partitioned Elias-Fano Indexes, ACM SIGIR 2014.
Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, Rossano Venturini, Faster BlockMax WAND with variable-sized blocks, ACM SIGIR 2017.

Building the code

The code is tested on Linux with GCC 5.4.0 and macOS Sierra with Clang.

The following dependencies are needed for the build.

CMake >= 2.8, for the build system
Boost >= 1.51

To install Boost, run ./boost.sh first.

The code depends on several git submodules. If you have cloned the repository without --recursive, you will need to run the following command before building:

$ git submodule update --init --recursive

To build the code:

$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ make

It is also advisable to run make test, which runs the unit tests.

For further information about ds2i and variable block-max wand, please visit the original repos:

ds2i
vbmw

Collection input format

A binary sequence is a sequence of integers prefixed by its length, where both the sequence integers and the length are written as 32-bit little-endian unsigned integers.

A collection consists of 3 files, <basename>.docs, <basename>.freqs, <basename>.sizes.

<basename>.docs starts with a singleton binary sequence where its only integer is the number of documents in the collection. It is then followed by one binary sequence for each posting list, in order of term-ids. Each posting list contains the sequence of document-ids containing the term.
<basename>.freqs is composed of one binary sequence per posting list, where each sequence contains the occurrence counts of the postings, aligned with the previous file (note however that this file does not have an additional singleton list at its beginning).
<basename>.sizes is composed of a single binary sequence whose length is the same as the number of documents in the collection, and the i-th element of the sequence is the size (number of terms) of the i-th document.
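The layout described above can be exercised with a short script. The following is a minimal sketch, based only on the format description in this section, that encodes and decodes binary sequences in memory and checks that the three files agree with each other. The helper names are hypothetical and the tiny collection is made up for illustration.

```python
import struct


def write_sequence(values):
    """Encode one binary sequence: uint32 length, then the uint32 values."""
    return struct.pack(f"<I{len(values)}I", len(values), *values)


def read_sequences(data):
    """Yield every binary sequence stored back to back in a bytes object."""
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        yield list(struct.unpack_from(f"<{length}I", data, offset))
        offset += 4 * length


def read_collection(docs_bytes, freqs_bytes, sizes_bytes):
    """Decode the three files and check that they are aligned."""
    docs = list(read_sequences(docs_bytes))
    header, postings = docs[0], docs[1:]
    assert len(header) == 1, ".docs must start with a singleton sequence"
    num_docs = header[0]
    freqs = list(read_sequences(freqs_bytes))  # no singleton header here
    sizes = next(read_sequences(sizes_bytes))
    assert len(sizes) == num_docs
    assert len(postings) == len(freqs)
    assert all(len(p) == len(f) for p, f in zip(postings, freqs))
    return num_docs, postings, freqs, sizes


# Toy collection: 3 documents, 2 terms.
docs_bytes = write_sequence([3]) + write_sequence([0, 2]) + write_sequence([1])
freqs_bytes = write_sequence([1, 3]) + write_sequence([2])
sizes_bytes = write_sequence([1, 2, 3])

num_docs, postings, freqs, sizes = read_collection(
    docs_bytes, freqs_bytes, sizes_bytes)
```

For a real collection, the three byte strings would come from reading <basename>.docs, <basename>.freqs, and <basename>.sizes in binary mode. Large collections would call for streaming or memory mapping rather than loading whole files, which this sketch does not attempt.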

We thank the original authors for providing their code:

Antonio Mallia
Giuseppe Ottaviano
Elia Porciani
Nicola Tonellotto
Rossano Venturini
