I am testing different reward functions with GRPO, and I was wondering whether there is anything I need to watch out for.
For example, I am combining a negative reward when the format I want is not followed with the sum of three positive normalized reward values when certain conditions are met. Is this a good approach, or is something else preferable?
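To make the setup concrete, here is a rough sketch of the kind of reward I mean. Everything in it is a placeholder chosen for illustration: the `combined_reward` name, the `<think>`/`<answer>` tag format, the `-1.0` penalty, and the assumption that the three component scores are already computed and clipped to [0, 1]. It is not code from any library.

```python
import re

# Hypothetical format: a reasoning block followed by an answer block.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)


def combined_reward(completion, component_scores, format_penalty=-1.0):
    """Return a flat penalty for a malformed completion, otherwise the
    sum of the component scores, each clipped to [0, 1]."""
    if FORMAT_PATTERN.match(completion.strip()) is None:
        return format_penalty
    clipped = [min(max(score, 0.0), 1.0) for score in component_scores]
    return sum(clipped)
```

With three components, this sketch gives a reward of either the penalty or a value between 0 and 3.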
I noticed in the docs that some reward functions have no ceiling, while others only produce rewards between 0 and 1.
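Part of why I ask about ranges: as I understand it, GRPO turns the rewards of a group of completions for the same prompt into advantages by subtracting the group mean and dividing by the group standard deviation. A minimal sketch of that step, assuming the population standard deviation and a small `eps` (both are my assumptions, and `group_advantages` is a hypothetical helper, not a library function):

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this formula, if every completion in a group gets the same reward (for example, all of them hit the format penalty), every advantage is zero. And if an unbounded reward is summed with rewards limited to [0, 1], the unbounded term can account for most of the variation within the group.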