[GRPO] initial GRPO trainer #1954

Draft
wants to merge 15 commits into
base: main

Conversation


@saisurbehera saisurbehera commented Aug 21, 2024

Implementation of the DeepSeekMath GRPO: https://arxiv.org/pdf/2402.03300

Still a work in progress:

  • Iterative reward model training will be added.
  • Only outcome supervision is implemented so far; process supervision will follow later.
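For readers unfamiliar with the term, outcome supervision in the DeepSeekMath paper means each sampled completion gets a single scalar reward, which is normalized within its group and then assigned to every token of that completion. The sketch below illustrates that idea only; it is not the code in this PR. The function name `outcome_supervision_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def outcome_supervision_advantages(rewards, completion_lengths, eps=1e-8):
    """Hypothetical helper: per-token advantages for one prompt's group.

    rewards: one scalar reward per sampled completion in the group.
    completion_lengths: number of generated tokens in each completion.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against sigma == 0
    normalized = [(r - mu) / (sigma + eps) for r in rewards]
    # Outcome supervision: every token of a completion shares that completion's advantage.
    return [[adv] * n for adv, n in zip(normalized, completion_lengths)]


# Four completions sampled for the same prompt, with binary rewards.
per_token = outcome_supervision_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
```

Process supervision would instead score intermediate steps, so tokens within one completion could receive different advantages.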

closes #2103

@saisurbehera saisurbehera changed the title initial grpo files [GRPO] initial GRPO trainer Aug 21, 2024
@saisurbehera saisurbehera marked this pull request as draft August 21, 2024 02:24
@lewtun
Member

lewtun commented Aug 21, 2024

Thank you for working on this nifty algorithm @saisurbehera! I see you're basing your implementation on PPOTrainer, but we've recently overhauled our RL implementations to be more aligned with the rest of the library, e.g. here's the new PPO version: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py

Would you mind adapting your implementation to this new API? Since GRPO is somewhat similar to RLOO, you might find it is possible to copy-paste a large part of that code: https://github.com/huggingface/trl/blob/main/trl/trainer/rloo_trainer.py
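To make the similarity concrete, here is a minimal sketch of how the two methods turn a group of rewards for one prompt into advantages: RLOO subtracts a leave-one-out mean baseline, while GRPO subtracts the group mean and divides by the group standard deviation. The function names, the `eps` term, and the choice of population standard deviation are assumptions for illustration and are not taken from rloo_trainer.py or from this PR.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample is compared with the mean of the others."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative baseline: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [2.0, 0.0, 1.0, 1.0]  # rewards for 4 completions of the same prompt
rloo = rloo_advantages(rewards)
grpo = grpo_advantages(rewards)
```

Both variants need several completions per prompt and no learned value model, which is why much of the sampling and batching logic can plausibly be shared.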

@saisurbehera
Author

Sure, I can adapt the implementation to follow PPOv2Trainer.

@saisurbehera
Author

Hello @lewtun,

I ported the implementation to the new API; it was much simpler than the first version. I still need to do some validation and testing.

Successfully merging this pull request may close these issues.

GRPO as part of HF TRL?