arxiv:2310.17631

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Published on Oct 26, 2023
· Submitted by akhaliq on Oct 27, 2023
#1 Paper of the day

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales (7B, 13B, and 33B parameters) and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLMs as judges and categorize them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM achieves state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities as a judge of single answers, multimodal models, multiple answers, and multi-turn chat.
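The abstract names swap augmentation as a remedy for position bias and reports agreement with the teacher judge. The following is a minimal sketch of both ideas, not the authors' implementation: the `JudgeSample` fields, the `verdict` rule, and the definition of agreement as the fraction of matching verdicts are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical training record: one question, two answers, two teacher scores."""
    question: str
    answer_1: str
    answer_2: str
    score_1: float
    score_2: float


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the two answers and their scores exchanged.

    Training on both orders is meant to discourage a judge from
    favoring an answer because of its position.
    """
    return replace(
        sample,
        answer_1=sample.answer_2,
        answer_2=sample.answer_1,
        score_1=sample.score_2,
        score_2=sample.score_1,
    )


def verdict(score_1: float, score_2: float) -> str:
    """Turn a pair of scores into a three-way verdict (assumed rule)."""
    if score_1 > score_2:
        return "answer_1"
    if score_2 > score_1:
        return "answer_2"
    return "tie"


def agreement(judge_verdicts: list[str], teacher_verdicts: list[str]) -> float:
    """Fraction of samples on which the judge and the teacher give the same verdict."""
    if not judge_verdicts or len(judge_verdicts) != len(teacher_verdicts):
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
    return matches / len(judge_verdicts)


sample = JudgeSample("What is 2 + 2?", "4", "5", score_1=10.0, score_2=2.0)
swapped = swap_augment(sample)
```

In this sketch the preferred answer follows the swap: `verdict(sample.score_1, sample.score_2)` picks `answer_1`, while the same call on `swapped` picks `answer_2`, so the label stays attached to the content rather than to the slot.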

Community


A bot for recommending similar papers is an amazing idea! @librarian-bot

@LianghuiZhu @xinggangw @xinlongwang What is the proper way to prompt JudgeLM for single-answer, reference-free evaluation?

The paper talks about this, but I don't see any explicit examples in the appendix or in the hosted demo.

Paper author

@andrewrreed

Greetings!

As mentioned in Section 6.4, "Extensions of JudgeLM: Grading a single answer", judging a single answer requires the corresponding reference answer, which is treated as the full-grade answer. Therefore, judging a single answer always needs a reference answer.

Best regards,
lianghui

Thanks!
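To make the author's reply concrete, here is a rough sketch of how a single-answer request could be assembled so that the reference answer is always present. It is an illustration only: the function name, the dictionary keys, and the choice to place the reference in the first slot are assumptions, not JudgeLM's actual prompt format, which is not shown in this thread.

```python
def build_single_answer_request(question: str, reference_answer: str, candidate_answer: str) -> dict:
    """Assemble a hypothetical single-answer grading request.

    The reference answer is mandatory because, per the author's reply,
    it serves as the full-grade answer the candidate is compared against.
    """
    if not reference_answer.strip():
        raise ValueError("single-answer grading needs a reference answer")
    return {
        "question": question,
        # Assumed layout: the reference stands in as the full-grade answer.
        "reference_answer": reference_answer,
        "candidate_answer": candidate_answer,
    }


request = build_single_answer_request(
    question="What is the capital of France?",
    reference_answer="Paris",
    candidate_answer="The capital of France is Paris.",
)
```

Calling the function with an empty reference raises `ValueError`, which mirrors the point above that a reference-free single-answer evaluation is not supported.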

JudgeLM: Revolutionizing AI Evaluation with Fine-Tuned Large Language Models

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix
