Are there any tips and tricks about GRPO reward function design? #2832

MohamedAliRashad · 2025-02-11T20:27:26Z

I am testing different reward functions with GRPO and I was wondering if there is anything I need to watch out for.

For example, I am trying a negative reward when the format I want isn't followed, combined with the sum of three positive, normalized reward values when certain conditions are met. Is this a good approach, or is something else preferable?

I noticed in the docs that some reward functions have no ceiling, while others only produce rewards between 0 and 1.

zaporter · 2025-02-12T04:26:33Z

See https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md#computing-the-advantage

It doesn't matter whether your rewards are negative or positive; all that matters is the group-relative advantage.

Rewards of {1, 0} result in advantages of 1 and -1, respectively. That is the same as rewards of {1, -1}, which also result in advantages of 1 and -1.

Or consider rewards of {1, 1, 2}: these result in advantages of -1/sqrt(2), -1/sqrt(2), and sqrt(2).
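
To make the arithmetic above concrete, here is a minimal sketch in plain Python. It is not TRL's code: the function name `group_relative_advantages` and the `eps` parameter are made up for illustration. It assumes each reward is normalized by the group's mean and population standard deviation (dividing by n), which is what reproduces the numbers quoted above. An implementation that uses the sample standard deviation (dividing by n - 1) would scale every advantage in the group by the same constant, without changing its sign or ordering.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group: (r - mean) / (std + eps).

    Hypothetical helper for illustration only. Uses the population
    standard deviation (divide by n); eps avoids division by zero
    when every reward in the group is identical.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Shifting the rewards ({1, 0} vs {1, -1}) gives the same advantages.
print([round(a, 3) for a in group_relative_advantages([1, 0])])     # [1.0, -1.0]
print([round(a, 3) for a in group_relative_advantages([1, -1])])    # [1.0, -1.0]

# {1, 1, 2} gives -1/sqrt(2), -1/sqrt(2), sqrt(2).
print([round(a, 3) for a in group_relative_advantages([1, 1, 2])])  # [-0.707, -0.707, 1.414]
```

One consequence worth keeping in mind: if every completion in a group gets the same reward, all advantages in that group are zero, so that group carries no learning signal regardless of whether the shared reward was positive or negative.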

github-actions bot added the 🏋 GRPO (Related to GRPO) and 🏋 Reward (Related to Reward modelling) labels on Feb 11, 2025
