DOI: 10.1145/3477495.3531740
Research article · Open access

The Istella22 Dataset: Bridging Traditional and Neural Learning to Rank Evaluation

Published: 07 July 2022

Abstract

Neural approaches that use pre-trained language models are effective at a variety of ranking tasks, such as question answering and ad-hoc document ranking. However, their effectiveness compared to feature-based Learning-to-Rank (LtR) methods has not yet been well established. A major reason is that existing LtR benchmarks, which contain query-document feature vectors, do not include the raw query and document text needed by neural models, while the benchmarks typically used to evaluate neural models, such as MS MARCO and TREC Robust, provide text but no query-document feature vectors. In this paper, we present Istella22, a new dataset that enables such comparisons by providing both query/document text and the strong query-document feature vectors used by an industrial search engine. The dataset consists of a comprehensive corpus of 8.4M web documents, a collection of query-document pairs described by 220 hand-crafted features, relevance judgments on a 5-point graded scale, and a set of 2,198 textual queries used for testing. Istella22 enables a fair evaluation of traditional LtR and transfer ranking techniques on the same data: LtR models exploit the feature-based representations of the training samples, while pre-trained transformer-based neural rankers are evaluated on the corresponding textual content of queries and documents. Through preliminary experiments on Istella22, we find that neural re-ranking approaches lag behind LtR models in effectiveness; however, LtR models can exploit the scores of neural models as strong ranking signals.
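
A small sketch may help make the feature-based side of this setup concrete: training a LambdaMART ranker with LightGBM on LETOR-style query-document feature vectors. This is an illustrative reconstruction, not the authors' pipeline; the file names (train.svm, valid.svm), the hyperparameters, and the assumption that Istella22 distributes its feature vectors in the SVMlight/LETOR format of earlier Istella datasets are all ours.

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import load_svmlight_file

    def group_sizes(qids):
        """Collapse a per-row query-id array into per-query group sizes,
        preserving file order as LightGBM's `group` argument requires."""
        _, first_idx, counts = np.unique(qids, return_index=True, return_counts=True)
        return counts[np.argsort(first_idx)]

    # Hypothetical file names; the actual Istella22 distribution may differ.
    X_train, y_train, qid_train = load_svmlight_file("train.svm", query_id=True)
    X_valid, y_valid, qid_valid = load_svmlight_file("valid.svm", query_id=True)

    # LambdaMART: gradient-boosted decision trees trained with LambdaRank
    # gradients, optimizing a ranking metric such as NDCG.
    ranker = lgb.LGBMRanker(
        objective="lambdarank",
        metric="ndcg",
        n_estimators=1000,
        learning_rate=0.05,
    )
    ranker.fit(
        X_train, y_train,
        group=group_sizes(qid_train),
        eval_set=[(X_valid, y_valid)],
        eval_group=[group_sizes(qid_valid)],
    )

    # The paper's finding that neural scores are strong LtR signals could be
    # probed by appending a re-ranker's score as one extra feature column:
    #   X_train = scipy.sparse.hstack([X_train, neural_scores.reshape(-1, 1)])

The textual side of the dataset (the 8.4M documents and the test queries) lends itself to standard IR tooling; for example, if an istella22 identifier is available in your installed version of ir_datasets, the corpus and queries can be iterated with dataset.docs_iter() and dataset.queries_iter().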



      Published In

      SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2022
      3569 pages
      ISBN: 9781450387323
      DOI: 10.1145/3477495

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 07 July 2022

      Author Tags

      1. evaluation
      2. learning to rank
      3. neural information retrieval

      Qualifiers

      • Research-article

      Conference

      SIGIR '22

      Acceptance Rates

      Overall acceptance rate: 792 of 3,983 submissions (20%)


      Cited By

      • (2024) Learning to Rank for Non Independent and Identically Distributed Datasets. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 71-79. DOI: 10.1145/3664190.3672513. Online publication date: 2-Aug-2024.
      • (2024) Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2307-2317. DOI: 10.1145/3637528.3671883. Online publication date: 25-Aug-2024.
      • (2024) Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2390-2394. DOI: 10.1145/3626772.3657918. Online publication date: 10-Jul-2024.
      • (2024) Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search Dataset. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1546-1556. DOI: 10.1145/3626772.3657892. Online publication date: 10-Jul-2024.
      • (2024) DiscoLQA: Zero-Shot Discourse-Based Legal Question Answering on European Legislation. Artificial Intelligence and Law. DOI: 10.1007/s10506-023-09387-2. Online publication date: 10-Jan-2024.
      • (2024) PR-Rank: A Parameter Regression Approach for Learning-to-Rank Model Adaptation Without Target Domain Data. Web Information Systems Engineering – WISE 2024, 3-18. DOI: 10.1007/978-981-96-0573-6_1. Online publication date: 27-Nov-2024.
