Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs

chenllliang/MMEvalPro


MMEvalPro

We create MMEvalPro for more accurate and efficient evaluation of Large Multimodal Models (LMMs). It is designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. The benchmark comprises 2,138 question triplets, totaling 6,414 distinct questions.

Trilogy Evaluation

For each original question from ScienceQA, MathVista, or MMMU, MMEvalPro annotates an additional perception question and a knowledge question. Only when a multimodal model answers all three questions correctly do we consider it to demonstrate a true understanding of the problem rather than merely exploiting shortcuts. We introduce a new metric, Genuine Accuracy, to evaluate model performance on MMEvalPro.
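Read literally, the rule above counts a triplet only when all three of its answers are correct. A sketch of that definition follows; the notation is ours and is not taken from the repository. Here $N$ is the number of triplets, $y$ is a ground-truth label, $\hat{y}$ is a model output, and the superscripts $o$, $p$, $k$ mark the origin, perception, and knowledge questions.

```latex
\mathrm{GA} \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\!\left[\hat{y}^{o}_{i} = y^{o}_{i}\right] \cdot
  \mathbb{1}\!\left[\hat{y}^{p}_{i} = y^{p}_{i}\right] \cdot
  \mathbb{1}\!\left[\hat{y}^{k}_{i} = y^{k}_{i}\right]
```

Under this reading, Genuine Accuracy can never exceed the accuracy on any single question type, which is consistent with it being the stricter metric.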

Trilogy Evaluation Examples in MMEvalPro

Automatic Evaluation

🔔 To automatically evaluate a model on the dataset and compute Genuine Accuracy, average accuracy, and the other analysis metrics, we provide example code that computes the scores from the model outputs and ground-truth labels.

First, download the dataset.

Save the model outputs for all questions in a single JSON file that follows the format of ./demo_model_output.json:

[
    {
        "index": 0,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Origin"
    },
    {
        "index": 1,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Perception"
    },
    {
        "index": 2,
        "model_output": "A",
        "answer": "B",
        "triplet_id": 1,
        "eval_type": "Knowledge"
    }

]
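To make the record format concrete, here is a minimal sketch of how Genuine Accuracy could be computed from such a list. It is not the logic of auto_score.py: the function name `genuine_accuracy`, the exact string comparison between `model_output` and `answer`, and the two-triplet sample data are all assumptions made for illustration.

```python
from collections import defaultdict

REQUIRED_TYPES = {"Origin", "Perception", "Knowledge"}


def genuine_accuracy(records):
    """Percentage of complete triplets whose three questions are all correct.

    Assumption: an answer is correct when model_output equals answer exactly.
    """
    triplets = defaultdict(dict)
    for rec in records:
        is_correct = rec["model_output"] == rec["answer"]
        triplets[rec["triplet_id"]][rec["eval_type"]] = is_correct

    # Score only triplets that contain all three question types.
    complete = [t for t in triplets.values() if REQUIRED_TYPES.issubset(t)]
    if not complete:
        return 0.0
    solved = sum(all(t[name] for name in REQUIRED_TYPES) for t in complete)
    return round(100.0 * solved / len(complete), 2)


# Hypothetical sample: triplet 1 is fully correct, triplet 2 misses Perception.
sample = [
    {"index": 0, "model_output": "B", "answer": "B", "triplet_id": 1, "eval_type": "Origin"},
    {"index": 1, "model_output": "B", "answer": "B", "triplet_id": 1, "eval_type": "Perception"},
    {"index": 2, "model_output": "B", "answer": "B", "triplet_id": 1, "eval_type": "Knowledge"},
    {"index": 3, "model_output": "A", "answer": "A", "triplet_id": 2, "eval_type": "Origin"},
    {"index": 4, "model_output": "A", "answer": "C", "triplet_id": 2, "eval_type": "Perception"},
    {"index": 5, "model_output": "D", "answer": "D", "triplet_id": 2, "eval_type": "Knowledge"},
]

print(genuine_accuracy(sample))  # 50.0
```

In the sample, five of six answers are correct, yet only one of the two triplets counts toward Genuine Accuracy. The demo records above carry no benchmark field, so this sketch does not attempt a per-benchmark breakdown.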

Then run ./auto_score.py to compute the scores:

# --model_output: model output file in JSON format
# --output_path:  path to save the result
python auto_score.py \
    --model_output ./demo_model_output.json \
    --output_path ./demo_score.json

The resulting score file looks like this:

{
    "MMMU": {
        "genuine_accuracy_score": 17.11,
        "average_score": 52.7,
        "origin_score": 45.13,
        "perception_score": 62.24,
        "knowledge_score": 50.74
    },
    "MathVista": {
        "genuine_accuracy_score": 15.37,
        "average_score": 51.67,
        "origin_score": 55.93,
        "perception_score": 50.37,
        "knowledge_score": 48.7
    },
    "ScienceQA": {
        "genuine_accuracy_score": 44.96,
        "average_score": 74.61,
        "origin_score": 80.54,
        "perception_score": 72.2,
        "knowledge_score": 71.09
    },
    "Macro_Average": {
        "genuine_accuracy_score": 25.81,
        "average_score": 59.66,
        "origin_score": 60.53,
        "perception_score": 61.6,
        "knowledge_score": 56.84
    },
    "Micro_Average": {
        "genuine_accuracy_score": 33.07,
        "average_score": 65.34,
        "origin_score": 68.71,
        "perception_score": 65.11,
        "knowledge_score": 62.21
    }
}
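The numbers in this example are consistent with two simple relationships, which the sketch below checks: each benchmark's `average_score` matches the mean of its three per-type scores, and each `Macro_Average` entry matches the unweighted mean over the three benchmarks. This is an observation about the example values, not a statement of how auto_score.py computes them; the helper name `mean2` is ours. `Micro_Average` presumably pools all triplets before averaging, which cannot be reproduced here because the per-benchmark triplet counts are not listed.

```python
# Per-benchmark scores copied from the example score file.
scores = {
    "MMMU": {"genuine_accuracy_score": 17.11, "average_score": 52.7,
             "origin_score": 45.13, "perception_score": 62.24, "knowledge_score": 50.74},
    "MathVista": {"genuine_accuracy_score": 15.37, "average_score": 51.67,
                  "origin_score": 55.93, "perception_score": 50.37, "knowledge_score": 48.7},
    "ScienceQA": {"genuine_accuracy_score": 44.96, "average_score": 74.61,
                  "origin_score": 80.54, "perception_score": 72.2, "knowledge_score": 71.09},
}
TYPE_KEYS = ("origin_score", "perception_score", "knowledge_score")


def mean2(values):
    """Arithmetic mean rounded to two decimals (hypothetical helper)."""
    values = list(values)
    return round(sum(values) / len(values), 2)


# average_score of each benchmark vs. the mean of its three per-type scores.
for name, s in scores.items():
    print(name, mean2(s[k] for k in TYPE_KEYS), s["average_score"])

# Macro average: unweighted mean of each metric over the three benchmarks.
macro = {key: mean2(s[key] for s in scores.values()) for key in scores["MMMU"]}
print(macro)
```

Because the macro average weights the three benchmarks equally, it differs from the micro average whenever the benchmarks contribute different numbers of triplets.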

You can email your model outputs to [email protected] together with the reproduction method, and we will update the online benchmark as soon as possible.

All models perform poorly on the benchmark because of the rigorous metric. The best-performing LMMs (Qwen-VL-Max, GPT-4o) still lag behind humans by 30% in average Genuine Accuracy on MMEvalPro.

Acknowledgements

We thank the creators of ScienceQA, MathVista and MMMU for providing the excellent evaluation resources!

License

The new contributions to our dataset are distributed under the CC BY-SA 4.0 license, including

The copyright of the images and the original questions belongs to the authors of MMMU, ScienceQA and MathVista

  • Purpose: The dataset was primarily designed for use as a test set.
  • Commercial Use: The dataset can be used commercially as a test set, but using it as a training set is prohibited. By accessing or using this dataset, you acknowledge and agree to abide by these terms in conjunction with the CC BY-SA 4.0 license.

Citation

@misc{huang2024mmevalprocalibratingmultimodalbenchmarks,
      title={MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation}, 
      author={Jinsheng Huang and Liang Chen and Taian Guo and Fu Zeng and Yusheng Zhao and Bohan Wu and Ye Yuan and Haozhe Zhao and Zhihui Guo and Yichi Zhang and Jingyang Yuan and Wei Ju and Luchen Liu and Tianyu Liu and Baobao Chang and Ming Zhang},
      year={2024},
      eprint={2407.00468},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.00468}, 
}
