Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? (EMNLP 2023)
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter and Ming-Wei Chang.
[Project Page] [Annotation] [Images] [Contributed Code] [Leaderboard (Coming Soon)]
InfoSeek: A New VQA Benchmark Focusing on Visual Information-Seeking Questions
Please use the following bib entry to cite this paper if you use any resources from this repo.
@article{chen2023infoseek,
title={Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?},
author={Chen, Yang and Hu, Hexiang and Luan, Yi and Sun, Haitian and Changpinyo, Soravit and Ritter, Alan and Chang, Ming-Wei},
journal={arXiv preprint arXiv:2302.11713},
year={2023}
}
In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.
The annotations are released as JSON Lines (jsonl) files, one for each set and data split, as discussed in the paper.
Below is an example of the format for a training example:
{
"data_id": "infoseek_train_00000000",
"image_id": "oven_01963180",
"question": "Which place is this animal endemic to?",
"answer": ["People's Republic of China"],
"answer_eval": ["cn", "People's Republic of China", "China", "Mainland China", "China PR", "PR China", "CHN", "CN", "PRC", "\ud83c\udde8\ud83c\uddf3"],
"data_split": "train"
}
Here, image_id indicates which image file this annotation is associated with (note that InfoSeek images are derived from OVEN). The answer field contains the most standard term for the answer, and the answer_eval field is reserved for evaluation: it lists other acceptable equivalent forms of the answer to increase the precision of evaluation.
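For reference, here is a minimal sketch of how one might read an annotation file and check a prediction against the answer_eval field. The file name and the simple case-insensitive string match are illustrative assumptions, not the official evaluation script.

```python
import json

def load_annotations(path):
    """Read an InfoSeek annotation file in JSON Lines format (one example per line)."""
    examples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

def is_correct(prediction, example):
    """Illustrative check: accept a prediction if it matches any acceptable form
    listed in `answer_eval` (case-insensitive exact match). Not the official metric."""
    prediction = prediction.strip().lower()
    return any(prediction == ans.strip().lower() for ans in example["answer_eval"])

# Hypothetical file name; point this at the downloaded train split annotation file.
examples = load_annotations("infoseek_train.jsonl")
print(examples[0]["question"], examples[0]["answer"])
print(is_correct("China", examples[0]))  # True for the example above
```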
The following are links to each annotation file:
- Dataset Annotation
  - Train Split Link (245M)
  - Val Split Link (21M)
  - Test Split Link (44M)
- KB mapping
  - Train Split Link (84M)
  - Val Split Link (21M)
  - Human Set Link (1.1M)
We also release the text information for 6 million Wikipedia entries (derived from the Wikipedia dump of 2022/10/01).
- 6 Million Wikipedia Text Information
To use multimodal Wikipedia information, you will need to download images from the URLs in the wikipedia_image_url field.
See this guideline for downloading images.
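Below is a minimal sketch of how one could fetch the images referenced by wikipedia_image_url. The input file name, the assumption that the Wikipedia file is in JSON Lines format, and the output naming scheme are all illustrative; follow the linked guideline for the recommended procedure.

```python
import json
import os
import requests  # third-party; install with `pip install requests`

def download_wikipedia_images(wiki_jsonl_path, out_dir):
    """Download images referenced by the `wikipedia_image_url` field.
    Assumes the Wikipedia file is in JSON Lines format; adjust if yours differs."""
    os.makedirs(out_dir, exist_ok=True)
    with open(wiki_jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            url = entry.get("wikipedia_image_url")
            if not url:
                continue  # some entries may have no associated image
            # Name the local file after the last path segment of the URL.
            filename = os.path.join(out_dir, os.path.basename(url.split("?")[0]))
            if os.path.exists(filename):
                continue  # skip images that were already downloaded
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": "infoseek-image-downloader"})
            if resp.status_code == 200:
                with open(filename, "wb") as img_f:
                    img_f.write(resp.content)

# Hypothetical paths; point these at the downloaded Wikipedia file and an output folder.
download_wikipedia_images("wikipedia_infoseek.jsonl", "wikipedia_images/")
```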