[GRPO] initial GRPO trainer #1954

saisurbehera · 2024-08-21T02:17:53Z

Implementation of GRPO (Group Relative Policy Optimization) from the DeepSeekMath paper: https://arxiv.org/pdf/2402.03300

Still a work in progress

Iterative reward model training will be added.
Only outcome supervision is enabled so far; process supervision will be implemented later.
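
For context, outcome supervision in the DeepSeekMath paper assigns one scalar reward per sampled completion, normalizes the rewards within the group sampled for the same prompt, and uses the normalized value as the advantage for every token of that completion. The snippet below is a minimal sketch of that step only, not the code in this PR. The helper name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the token counts are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's scalar rewards by the group mean and std.

    Hypothetical helper for illustration; not taken from this PR.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, one scalar reward per completion.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Under outcome supervision, every token of a completion shares that
# completion's advantage.
completion_lengths = [3, 2, 4, 1]  # made-up token counts
token_advantages = [[a] * n for a, n in zip(advantages, completion_lengths)]
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed for the baseline.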

lewtun · 2024-08-21T08:19:10Z

Thank you for working on this nifty algorithm @saisurbehera ! I see you're basing your implementation on PPOTrainer but we've recently overhauled our RL implementations to be more aligned with the rest of the library, e.g. here's the new PPO version: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py

Would you mind adapting your implementation to this new API? Since GRPO is somewhat similar to RLOO, you might find it is possible to copy-paste a large part of that code: https://github.com/huggingface/trl/blob/main/trl/trainer/rloo_trainer.py
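
To make the similarity concrete: RLOO baselines each sample against the mean reward of the other samples for the same prompt, which works out to the mean-centered reward scaled by k/(k-1). GRPO, as described in the paper, also centers on the group mean but additionally divides by the group standard deviation. The sketch below checks the RLOO identity numerically. The function name `rloo_advantages` is hypothetical and this is not the code in `rloo_trainer.py`.

```python
from statistics import mean


def rloo_advantages(rewards):
    """Leave-one-out baseline: each reward minus the mean of the others.

    Hypothetical helper for illustration; not the TRL implementation.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four sampled completions for one prompt, one scalar reward each.
rewards = [1.0, 0.0, 0.0, 1.0]
k = len(rewards)

rloo = rloo_advantages(rewards)

# The same values, written as the mean-centered reward scaled by k / (k - 1).
centered = [r - mean(rewards) for r in rewards]
scaled = [k / (k - 1) * c for c in centered]
```

Both methods therefore share the same per-prompt sampling and reward bookkeeping, which is why much of the RLOO trainer structure can plausibly be reused.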

saisurbehera · 2024-08-21T14:34:23Z

Sure, I can adapt the implementation to follow PPOv2Trainer.

saisurbehera · 2024-08-22T03:03:16Z

Hello @lewtun ,

I ported the implementation to the new API, and it was much simpler than the first version. I still have to do some validation and testing.

initial grpo files

7200785

saisurbehera changed the title ~~initial grpo files~~ [GRPO] initial GRPO trainer Aug 21, 2024

fixed the commit

4cee7cb

saisurbehera marked this pull request as draft August 21, 2024 02:24

port to new format, completed basics, requires testing

96f5c07

saisurbehera and others added 6 commits August 21, 2024 23:06

Merge branch 'main' into grpo

150ba02

remove the core changes

bd5084e

Merge branch 'grpo' of github.com:saisurbehera/trl into grpo

68a42b8

added example script

eea6d4c

Added some checks and validated the results

d40a433

Merge branch 'main' into grpo

9e50e7b

This was referenced Sep 24, 2024

Group Relative Policy Optimization Trainer #1583

Closed

GRPO as part of HF TRL? #2103

Open

saisurbehera and others added 6 commits October 4, 2024 17:45

fixed vals

4b27d5d

Merge branch 'grpo' of github.com:saisurbehera/trl into grpo

f8f6c08

Merge branch 'main' into grpo

3a312e4

added delta

6cc2c9f

merge to main

33f11ef

Added some documentation

f308e3a