Method description

Hello!

I'm considering implementing Direct Reward Optimization (DRO) from this paper. However, I'm unsure whether it fits the contribution guidelines.

The paper includes a comparison between this approach and KTO, demonstrating DRO's superior performance.

Another advantage of this method is that it uses a dataset that assigns a score to each example, rather than relying on a pairwise preference dataset (as DPO does).

From the DRO paper:

"Second and more importantly, annotating pairwise data is more expensive and less natural than simply indicating whether a single completion is satisfactory or not, e.g., by assigning a binary thumbs up or down rating to the model completion."
Open source status
Provide useful links for the implementation
paper: https://arxiv.org/pdf/2405.19107
weights: google-t5 (t5-large and t5-3b)
dataset: openbmb/UltraFeedback
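For what it's worth, the per-example records this method consumes (as opposed to DPO's pairwise records) would look roughly like this; the field names are illustrative, not the actual UltraFeedback schema:

```python
# Scored, non-pairwise example: one completion and one scalar score.
dro_example = {
    "prompt": "Explain photosynthesis.",
    "completion": "Photosynthesis is the process by which plants convert light into energy...",
    "score": 0.8,  # e.g. a rating, or a thumbs-up/down mapped to a scalar
}

# Pairwise preference example, as DPO expects, for comparison.
dpo_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis is the process by which plants convert light into energy...",
    "rejected": "Photosynthesis is when plants eat sunlight.",
}
```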
Hi! Thanks for the suggestion. It could be a great addition. I haven't read the paper in detail yet, but what you describe sounds closer to KTO than DPO, doesn't it?
Do you have an implementation that already works?