The official repo for our CIKM'21 Full paper, Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance (poster, presentation record).
**************************** Updates ****************************
- 11/13: We released code to evaluate zero-shot retrieval performance of JPQ and used BEIR benchmark as an example. The code can be used for other datasets with the same format.
- 11/9: We provided a script to support "on-the-fly" query tokenization and open-sourced the ranking results for TREC 2020 queries.
- 11/4: We provided several scripts to help download model checkpoints and evaluation. We also open-sourced the ranking results.
- 8/31: We released model checkpoints, retrieval code, and training code.
- 8/8: Our paper has been accepted by CIKM! Please check out the preprint paper.
- Quick Tour
- Model Checkpoints
- Ranking Results
- Requirements
- Preprocess Data
- Evaluate Open-sourced Checkpoints
- Train JPQ
- Citation
- Related Work
JPQ greatly improves the efficiency of Dense Retrieval. It is able to compress the index size by 30x with negligible performance loss. It also provides 10x speedup on CPU and 2x speedup on GPU in query latency.
Here is the effectiveness - index size (log-scale) tradeoff on MSMARCO Passage Ranking. In contrast with trading index size for ranking performance, JPQ achieves high ranking effectiveness with a tiny index.
Results at different trade-off settings are shown below.
[Figures: results at different trade-off settings, one panel for MS MARCO Passage Ranking and one for MS MARCO Document Ranking]
JPQ is still very effective even if the compression ratio is over 100x and outperforms baselines at different compression ratio settings. For more details, please refer to our paper.
You can download trained models and indexes from our dropbox link. After opening this link in your browser, you will see two folders, `doc` and `passage`, which correspond to MS MARCO document ranking and passage ranking, respectively. Each of them contains two folders, `trained_models` and `indexes`: `trained_models` holds the trained query encoders, and `indexes` holds the trained PQ indexes. Note that the `pid` in the index is actually the row number of a passage in the `collection.tsv` file rather than the official pid provided by MS MARCO. Different query encoders and indexes correspond to different compression ratios. For example, the query encoder named `m32.tar.gz` or the index named `OPQ32,IVF1,PQ32x8.index` means 32 bytes per doc, i.e., a `768*4/32=96x` compression ratio.
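The arithmetic behind the index names can be sketched in a few lines of Python. This is only an illustration: the helper `pq_index_stats` is hypothetical and not part of the repo, and it assumes 768-dimensional float32 embeddings, as implied by the `768*4` figure above.

```python
import re


def pq_index_stats(index_name, dim=768, bytes_per_float=4):
    """Return (bytes per doc, compression ratio) for a name like 'OPQ32,IVF1,PQ32x8.index'."""
    # The 'PQ<M>x<bits>' component gives the number of subvectors and bits per code.
    match = re.search(r"PQ(\d+)x(\d+)", index_name)
    if match is None:
        raise ValueError(f"no PQ component found in {index_name!r}")
    num_subvectors, bits_per_code = int(match.group(1)), int(match.group(2))
    bytes_per_doc = num_subvectors * bits_per_code // 8
    uncompressed_bytes = dim * bytes_per_float
    return bytes_per_doc, uncompressed_bytes / bytes_per_doc


for name in ("OPQ32,IVF1,PQ32x8.index", "OPQ96,IVF1,PQ96x8.index"):
    size, ratio = pq_index_stats(name)
    print(f"{name}: {size} bytes per doc, {ratio:.0f}x compression")
```

With these assumptions, `m32` gives 96x compression and `m96` gives 32x.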
You can easily download the files using download_query_encoder.sh and download_index.sh. Just run:
sh ./cmds/download_query_encoder.sh
sh ./cmds/download_index.sh
We open-source the ranking results in our dropbox links: passage rank link, document rank link.
The `msmarco-dev` folder and the `trec19` folder correspond to MS MARCO development queries and TREC 2019 DL queries, respectively.
In either folder, for each `m` value, we provide two ranking files corresponding to different text-id mappings. The one prefixed with 'official' uses the official MS MARCO / TREC 2019 text-id mapping, so you can directly use the official qrel files to evaluate the ranking. The other one uses the mapping generated by our preprocessing, where we use the line offset as the id. Both files give the same metric numbers. The files are generated by run_retrieve.sh. Please see the retrieval section for how to obtain these ranking results.
UPDATE 2021/11/9
We additionally released the ranking results for queries from TREC 2020 Deep Learning Track. They are available via the same dropbox link provided above. When M is set to `96`, i.e., 32x compression ratio, JPQ achieves 0.580 and 0.671 in NDCG@10 for document and passage ranking, respectively.
This repo needs the following libraries (Python 3.x):
torch >= 1.9.0
transformers >= 4.3.3
faiss-gpu == 1.7.1
tensorboard >= 2.5.0
boto3
Here are the commands for preprocessing/tokenization.
If you do not have MS MARCO dataset, run the following command:
sh ./cmds/download_marco.sh
Preprocessing (tokenizing) only requires a simple command:
python -m jpq.preprocess --data_type 0; python -m jpq.preprocess --data_type 1
It will create two directories, i.e., `./data/passage/preprocess` and `./data/doc/preprocess`. We map the original qid/pid to new ids, the row numbers in the file. The mapping is saved to `pid2offset.pickle` and `qid2offset.pickle`, and new qrel files (`train/dev/test-qrel.tsv`) are generated. The passages and queries are tokenized and saved in the numpy memmap file.
Note: JPQ, as well as our SIGIR'21 models, uses the Transformers 2.x version to tokenize text. However, in the 3.x and 4.x versions of the Transformers library, the RobertaTokenizer behaves differently.
To support REPRODUCIBILITY, we copied the RobertaTokenizer source code from the 2.x version to star_tokenizer.py. During preprocessing, we use `from star_tokenizer import RobertaTokenizer` instead of `from transformers import RobertaTokenizer`. You also need to do this if you use our JPQ model on other datasets.
Our paper utilizes datasets from TREC 2019 Deep Learning track. This section shows how to reproduce the reported results using our open-sourced models and indexes. Since we use TREC_EVAL toolkit for evaluation, please download it and compile:
sh ./cmds/download_trec_eval.sh
We show how to retrieve candidates and evaluate results in run_retrieve.sh. Just run the command
sh cmds/run_retrieve.sh
You should then get the results reported in our paper.
run_retrieve.sh calls run_retrieval.py. Arguments for this evaluation script are as follows:
- `--preprocess_dir`: preprocess dir.
  - `./data/passage/preprocess`: default dir for passage preprocessing.
  - `./data/doc/preprocess`: default dir for document preprocessing.
- `--mode`: Evaluation mode.
  - `dev`: run retrieval for msmarco development queries.
  - `test`: run retrieval for TREC 2019 DL Track queries.
- `--index_path`: Index path.
- `--query_encoder_dir`: Query encoder dir, which involves `config.json` and `pytorch_model.bin`.
- `--output_path`: Output ranking file path, formatted following msmarco guideline (qid\tpid\trank) for dev set or TREC guideline for test set.
- `--max_query_length`: Max query length, default: 32.
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve topk passages/documents.
- `--gpu_search`: Whether to use gpu for embedding search.
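To make the dev-set output format concrete, here is a minimal sketch that parses `qid\tpid\trank` lines and scores them with MRR@10, the metric commonly reported for MS MARCO development queries. The function names and the toy data are hypothetical; the repo's own evaluation goes through the scripts above, and this sketch averages only over the queries present in the ranking.

```python
from collections import defaultdict


def read_msmarco_ranking(lines):
    """Parse 'qid<TAB>pid<TAB>rank' lines into {qid: [pid, ...]} ordered by rank."""
    ranked = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, pid, rank = line.split("\t")
        ranked[qid].append((int(rank), pid))
    return {qid: [pid for _, pid in sorted(items)] for qid, items in ranked.items()}


def mrr_at_k(ranking, relevant, k=10):
    """Mean reciprocal rank of the first relevant pid within the top k."""
    total = 0.0
    for qid, pids in ranking.items():
        for position, pid in enumerate(pids[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / position
                break
    return total / len(ranking) if ranking else 0.0


# Toy ranking and relevance judgments, for illustration only.
toy_run = ["q1\tp3\t1", "q1\tp7\t2", "q2\tp9\t1", "q2\tp4\t2"]
toy_qrels = {"q1": {"p7"}, "q2": {"p9"}}

ranking = read_msmarco_ranking(toy_run)
print(mrr_at_k(ranking, toy_qrels))  # 0.75
```

In practice the lines would come from the file passed as `--output_path`, and the relevance sets from the matching qrel file.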
Here we provide instructions on how to retrieve candidates for TREC 2020 queries, which are not included in the paper. We use this retrieval script, which supports on-the-fly query tokenization.
Please download TREC 2020 queries:
sh ./cmds/download_trec20.sh
Run this shell script for retrieval and evaluation:
sh ./cmds/run_tokenize_retrieve.sh
It calls tokenize_retrieve. Arguments for this evaluation script are as follows,
- `--query_file_path`: Query file with TREC format.
- `--index_path`: Index path.
- `--query_encoder_dir`: Query encoder dir, which involves `config.json` and `pytorch_model.bin`.
- `--output_path`: Output ranking file path.
- `--pid2offset_path`: It is used only for converting offset pids to official pids.
- `--dataset`: "doc" or "passage". It is used to convert offset pids to official pids because msmarco doc adds a 'D' as docid prefix.
- `--max_query_length`: Max query length, default: 32.
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve topk passages/documents.
- `--gpu_search`: Whether to use gpu for embedding search.
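The role of `--pid2offset_path` and `--dataset` can be illustrated with a small sketch. It assumes, without having checked the repo's code, that `pid2offset.pickle` stores a dict from the numeric official id to the row offset; the helper names, the `jpq` run tag, and the toy ids are all hypothetical. The output line follows the standard TREC run format (`qid Q0 docid rank score tag`).

```python
def to_official_id(offset, offset2pid, dataset):
    """Convert a row offset back to an official id; msmarco doc ids carry a 'D' prefix."""
    official = offset2pid[offset]
    return f"D{official}" if dataset == "doc" else str(official)


def trec_run_line(qid, docid, rank, score, tag="jpq"):
    """Format one result in the standard TREC run format."""
    return f"{qid} Q0 {docid} {rank} {score} {tag}"


# Hypothetical mapping; the real one would be loaded with pickle from pid2offset.pickle.
pid2offset = {1185869: 0, 59235: 1}
offset2pid = {offset: pid for pid, offset in pid2offset.items()}

docid = to_official_id(0, offset2pid, dataset="doc")
print(trec_run_line("1030303", docid, rank=1, score=101.5))
```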
This section shows how to use JPQ for other datasets in a zero-shot fashion. Please download the JPQ dual-encoders by running
sh ./cmds/download_jpq_encoders.sh
In fact, these query encoders are equivalent to the data in Models and Indexes, and the document encoder is equivalent to STAR model. The difference is that they are objects of class JPQTower, which involves the encoding parameters and PQ index parameters. Thanks to it, we can easily adapt JPQ to other datasets in a zero-shot fashion.
Note, the downloaded dual-encoders are trained on MS MARCO passage ranking task. We do not use ones trained on document ranking task because they are trained with URL, which is often not available in other datasets.
We use BEIR as an example because it involves a wide range of datasets. For your own dataset, you only need to format it in the same way as BEIR and you are good to go. Now, we show how to use JPQ for TREC-Covid dataset. Run
sh ./cmds/run_eval_beir.sh trec-covid
You can also replace trec-covid with other datasets, such as nq. The script calls eval_beir.py. Arguments are as follows,
- `--dataset`: Dataset name in BEIR.
- `--beir_data_root`: Where to save BEIR dataset.
- `--query_encoder`: Path to JPQ query encoder.
- `--doc_encoder`: Path to JPQ document encoder.
- `--split`: test/dev/train.
- `--encode_batch_size`: Batch size, default: 64.
- `--output_index_path`: Optional, where to save the compact index. If the pointed file exists, it will be loaded to save the corpus-encoding time.
- `--output_ranking_path`: Optional, where to save the retrieval results.
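For your own data, the BEIR layout consists of `corpus.jsonl`, `queries.jsonl`, and a `qrels/<split>.tsv` file. The sketch below writes a toy dataset in that layout using only the standard library. The helper `write_beir_dataset`, the folder name `my-dataset`, and the toy records are hypothetical, and where the folder must sit relative to `--beir_data_root` is an assumption you should check against eval_beir.py.

```python
import json
import os
import tempfile


def write_beir_dataset(folder, corpus, queries, qrels, split="test"):
    """Write corpus, queries, and qrels in the file layout used by BEIR datasets."""
    os.makedirs(os.path.join(folder, "qrels"), exist_ok=True)
    with open(os.path.join(folder, "corpus.jsonl"), "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            record = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(record) + "\n")
    with open(os.path.join(folder, "queries.jsonl"), "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")
    with open(os.path.join(folder, "qrels", f"{split}.tsv"), "w", encoding="utf-8") as f:
        f.write("query-id\tcorpus-id\tscore\n")  # header row used by BEIR qrels
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                f.write(f"{query_id}\t{doc_id}\t{score}\n")


# Toy data written to a temporary folder, for illustration only.
folder = os.path.join(tempfile.mkdtemp(), "my-dataset")
write_beir_dataset(
    folder,
    corpus={"d1": {"title": "JPQ", "text": "Joint optimization of query encoder and PQ."}},
    queries={"q1": "what is jpq"},
    qrels={"q1": {"d1": 1}},
)
print(sorted(os.listdir(folder)))
```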
Here are the NDCG@10 on several datasets when M=96, i.e., 32x compression ratio:
Dataset | TREC-COVID | NFCorpus | NQ | HotpotQA | FiQA-2018 | ArguAna | Touche-2020 | Quora | DBPedia | SCIDOCS | FEVER | Climate-FEVER | SciFact |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ANCE (Uncompressed) | 0.654 | 0.237 | 0.446 | 0.456 | 0.295 | 0.415 | 0.284 | n.a. | 0.281 | 0.122 | 0.669 | 0.198 | 0.507 |
JPQ (32x Compression) | 0.636 | 0.272 | 0.449 | 0.450 | 0.286 | 0.429 | 0.200 | 0.853 | 0.304 | 0.120 | 0.636 | 0.194 | 0.531 |
Even though JPQ compresses the index by 32x, it achieves ranking performance on par with or even better than ANCE, a competitive uncompressed Dense Retrieval model.
JPQ is initialized by STAR. STAR trained on passage ranking is available here. STAR trained on document ranking is available here.
First, use STAR to encode the corpus and run OPQ to initialize the index. For example, on document ranking task, please run:
python -m jpq.run_init \
--preprocess_dir ./data/doc/preprocess/ \
--model_dir ./data/doc/star \
--max_doc_length 512 \
--output_dir ./data/doc/init \
--subvector_num 96
On passage ranking task, you can set `max_doc_length` to 256 for faster inference.
Now you can train the query encoder and PQ index. For example, on document ranking task, the command is
python -m jpq.run_train \
--preprocess_dir ./data/doc/preprocess \
--model_save_dir ./data/doc/train/m96/models \
--log_dir ./data/doc/train/m96/log \
--init_index_path ./data/doc/init/OPQ96,IVF1,PQ96x8.index \
--init_model_path ./data/doc/star \
--lambda_cut 10 \
--centroid_lr 1e-4 \
--train_batch_size 32
`--gpu_search` is optional for fast gpu search during training. `lambda_cut` should be set to 200 for passage ranking task. `centroid_lr` is different for different compression ratios.
Let M be the number of subvectors. `centroid_lr` equals 5e-6 for `M = 16/24`, 2e-5 for `M = 32`, and 1e-4 for `M = 48/64/96`. The number of training epochs is set to 6. In fact, the performance is already quite satisfying after 1 or 2 epochs. Each epoch costs less than 2 hours on our machine.
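The hyperparameter choices above can be collected in a small helper that assembles the training command. This is a sketch, not part of the repo: `build_train_command` is a hypothetical name, the passage-ranking paths are assumed to mirror the document-ranking example, and reusing `--train_batch_size 32` for passage ranking is also an assumption.

```python
# Values taken from the text above.
CENTROID_LR = {16: 5e-6, 24: 5e-6, 32: 2e-5, 48: 1e-4, 64: 1e-4, 96: 1e-4}
LAMBDA_CUT = {"doc": 10, "passage": 200}


def build_train_command(dataset, num_subvectors, gpu_search=False):
    """Assemble the jpq.run_train command line for a dataset and subvector count M."""
    if dataset not in LAMBDA_CUT:
        raise ValueError(f"dataset must be one of {sorted(LAMBDA_CUT)}")
    if num_subvectors not in CENTROID_LR:
        raise ValueError(f"no centroid_lr listed for M={num_subvectors}")
    root = f"./data/{dataset}"  # assumed layout, mirroring the document-ranking example
    m = num_subvectors
    command = [
        "python", "-m", "jpq.run_train",
        "--preprocess_dir", f"{root}/preprocess",
        "--model_save_dir", f"{root}/train/m{m}/models",
        "--log_dir", f"{root}/train/m{m}/log",
        "--init_index_path", f"{root}/init/OPQ{m},IVF1,PQ{m}x8.index",
        "--init_model_path", f"{root}/star",
        "--lambda_cut", str(LAMBDA_CUT[dataset]),
        "--centroid_lr", str(CENTROID_LR[m]),
        "--train_batch_size", "32",
    ]
    if gpu_search:
        command.append("--gpu_search")
    return command


print(" ".join(build_train_command("doc", 96)))
```

The resulting list can be passed to `subprocess.run` or joined into a shell command.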
If you find this repo useful, please consider citing our work:
@inproceedings{zhan2021jointly,
author = {Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
title = {Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance},
year = {2021},
isbn = {9781450384469},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3459637.3482358},
doi = {10.1145/3459637.3482358},
pages = {2487–2496},
numpages = {10},
location = {Virtual Event, Queensland, Australia},
series = {CIKM '21}
}
- 🔥WSDM 2022: Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval [code]: It presents RepCONC and achieves state-of-the-art first-stage retrieval effectiveness-efficiency tradeoff. It utilizes constrained clustering to train discrete codes and then incorporates JPQ in the second-stage training.
- SIGIR 2021: Optimizing Dense Retrieval Model Training with Hard Negatives [code]: It provides theoretical analysis on different negative sampling strategies and greatly improves the effectiveness of Dense Retrieval with hard negative sampling. The proposed dynamic hard negative sampling is adopted by JPQ.
- ARXIV 2020: RepBERT: Contextualized Text Embeddings for First-Stage Retrieval [code]: It is one of the pioneer studies about Dense Retrieval.