Method description

Hello!

I'm considering implementing Direct Reward Optimization (DRO) from this paper. However, I'm unsure whether it fits the contribution guidelines.

The paper includes a comparison between this approach and KTO, demonstrating DRO's superior performance.

Another advantage of this method is that it uses a dataset that assigns a score to each example, rather than relying on a pairwise preference dataset (as DPO does).

From the DRO paper:

"Second and more importantly, annotating pairwise data is more expensive and less natural than simply indicating whether a single completion is satisfactory or not, e.g., by assigning a binary thumbs up or down rating to the model completion."
Open source status
Provide useful links for the implementation
paper: https://arxiv.org/pdf/2405.19107
weights: google-t5 (t5-large and t5-3b)
dataset: openbmb/UltraFeedback
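For what it's worth, the per-example records this method consumes (as opposed to DPO's pairwise records) would look roughly like this; the field names are illustrative, not the actual UltraFeedback schema:

```python
# Scored, non-pairwise example: one completion and one scalar score.
dro_example = {
    "prompt": "Explain photosynthesis.",
    "completion": "Photosynthesis is the process by which plants convert light into energy...",
    "score": 0.8,  # e.g. a rating, or a thumbs-up/down mapped to a scalar
}

# Pairwise preference example, as DPO expects, for comparison.
dpo_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis is the process by which plants convert light into energy...",
    "rejected": "Photosynthesis is when plants eat sunlight.",
}
```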
Hi! Thanks for the suggestion. It could be a great addition. I haven't read the paper in detail yet, but what you describe sounds closer to KTO than DPO, doesn't it?
Do you have an implementation that already works?