Are there any tips and tricks about GRPO reward function design ? #2832

MohamedAliRashad opened this issue Feb 11, 2025 · 1 comment
Labels: 🏋 GRPO, 🏋 Reward

Comments

@MohamedAliRashad

I am testing different reward functions with GRPO and I was wondering if there are things I need to watch out for.

For example, I am using a negative reward when the format I want isn't followed, plus the sum of three positive normalized reward values when certain conditions are met. Is this a good approach, or is something else preferable?

I noticed in the docs that some reward functions have no ceiling, while others produce rewards only between 0 and 1.
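
For concreteness, here is a minimal sketch of the kind of composite reward described above, written against the custom-reward-function interface that TRL's `GRPOTrainer` expects (a callable taking `completions` plus keyword arguments and returning one float per completion). The format regex and the three condition checks are hypothetical placeholders, not anything from TRL:

```python
import re

# Hypothetical sub-rewards -- stand-ins for three task-specific checks,
# each normalized to [0, 1].
def condition_a(text: str) -> float:
    return 1.0 if "answer" in text.lower() else 0.0

def condition_b(text: str) -> float:
    return min(len(text) / 500.0, 1.0)

def condition_c(text: str) -> float:
    return 1.0 if text.strip().endswith(".") else 0.0

# Example target format: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def composite_reward(completions, **kwargs):
    """Return one float per completion: a flat penalty when the expected
    format is missing, otherwise the sum of three normalized sub-rewards."""
    rewards = []
    for completion in completions:
        # Handle both plain-text and conversational completion formats.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        if FORMAT_RE.search(text) is None:
            rewards.append(-1.0)  # format not followed -> negative reward
        else:
            rewards.append(condition_a(text) + condition_b(text) + condition_c(text))
    return rewards
```

As the answer below explains, the absolute scale of these values matters less than how they order completions within a group, because GRPO normalizes rewards per group.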

@zaporter

See https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md#computing-the-advantage

It doesn't matter whether your rewards are negative or positive -- all that matters is the group-relative advantage.

Rewards of {1, 0} result in advantages of 1 and -1 respectively. That is the same as rewards of {1, -1}, which also yield advantages of 1 and -1.

Or consider rewards of {1, 1, 2}: these result in advantages of -1/sqrt(2), -1/sqrt(2), sqrt(2), since each advantage is the reward minus the group mean, divided by the group standard deviation.
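
These figures can be verified with a few lines of NumPy, using the population standard deviation, which is what matches the numbers above (TRL's actual implementation may differ in small details, such as an epsilon in the denominator, so treat this as the conceptual computation):

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()  # np.std defaults to population std (ddof=0)

print(group_advantages([1, 0]))     # [ 1. -1.]
print(group_advantages([1, -1]))    # [ 1. -1.]
print(group_advantages([1, 1, 2]))  # [-0.7071 -0.7071  1.4142] = -1/sqrt(2), -1/sqrt(2), sqrt(2)
```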
