[Question] Terminal reward/penalty shaping in variable horizon environments while using PPO #777
Comments
No, there is no "normalization" that depends on the episode length. SB3 takes the reward from the environment and uses it directly in the discounted return/value computations. As for whether you should do it: I do not have an answer, as this is not something that is commonly done in RL. I'd personally not do it unless I had a very good reason to try it out. I recommend you try with and without this reward scaling (or a completely different reward altogether) :)
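For illustration, here is a minimal sketch (not the SB3 internals, which use GAE in a rollout buffer) of how a discounted return is computed directly from the raw environment rewards, with no rescaling by episode length:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A short and a long episode with the same terminal reward of +10:
print(discounted_returns([0.0, 0.0, 10.0]))     # terminal reward is used as-is
print(discounted_returns([0.0] * 20 + [10.0]))  # no division by episode length
```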
Thanks for both the clarification of the PPO implementation and your comments on this issue. Actually, I have been trying both, and I could only successfully train my agent without scaling the rewards. However, when I read the PPO paper, I thought that a fixed terminal reward seems to create an unequal distribution between similar states and to reduce training success for longer episodes. For example, if the step reward is not adjusted well along with the terminal one, then on longer paths the agent starts slowing down and prefers collecting small step rewards, which cumulatively give a larger reward than a big terminal reward that depends on how fast it gets there. I also found a similar discussion here if someone is interested in reading further.
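As a small numerical illustration of that concern (the reward values here are made up, not taken from the original environment):

```python
# Made-up numbers: a fixed terminal reward can be outweighed by accumulated
# step rewards on long paths, so the agent may prefer slow trajectories.
step_reward = 0.5       # assumed per-step reward
terminal_reward = 10.0  # assumed fixed end-of-episode reward (undiscounted sums below)

for horizon in (5, 20, 50):
    step_part = step_reward * horizon
    total = step_part + terminal_reward
    print(f"horizon={horizon:>3}: step part={step_part:5.1f}, "
          f"terminal part={terminal_reward}, total={total}")
# At horizon=50 the step part (25.0) already dominates the terminal reward (10.0).
```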
Thanks for the link, I think I now understand the issue :). It comes down to experimentation (there are cases where a constant, small penalty/reward can be helpful), but I personally would try to keep rewards as simple as possible and build from there. One additional note: I would keep returns (the discounted sum) at a reasonable magnitude (e.g. in the interval [-10, 10]). Otherwise, the value loss will have a large magnitude, which may result in disruptive updates to the network (keep an eye on the policy loss vs. value loss for this). Closing as resolved and "no tech support". Good luck with your experiments :)
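A minimal sketch of one way to keep returns in a modest range: rescale rewards by a constant factor with a reward wrapper (the scale value below is an arbitrary example, to be tuned per environment):

```python
import gymnasium as gym

class ScaleReward(gym.RewardWrapper):
    """Multiply every reward by a fixed constant so the discounted return
    stays roughly within, e.g., [-10, 10]."""

    def __init__(self, env, scale=0.01):  # 0.01 is just an example value
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale

# Usage: wrap the environment before passing it to PPO, e.g.
# env = ScaleReward(gym.make("YourEnv-v0"), scale=0.01)
```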
Hello,
One additional remark: one way to deal with that is to treat the problem as an infinite-horizon problem (if it makes sense), as SB3 does support that.
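A minimal sketch of what that can look like in practice (here with Pendulum-v1, which has no terminal state and only ends through its time limit): when an episode ends with `truncated=True` rather than `terminated=True`, SB3 bootstraps the value of the last observation instead of treating it as a true terminal state.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Pendulum-v1 never sets terminated=True; it is registered with a TimeLimit
# of 200 steps, so episodes only end via truncation. PPO then bootstraps from
# the value function at the time limit, effectively treating the task as
# infinite horizon.
env = gym.make("Pendulum-v1")

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)
```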
Question
When the environment has a variable horizon (the number of timesteps per episode changes), does the default implementation of PPO in SB3 normalize the terminal (end-of-episode) rewards/penalties when updating the actor/critic networks?
Additional context
For example, let's say I have two trajectories that terminate at equally important goal states.
In order to update the value estimates of all states in both trajectories equally (assuming discount_factor=1), should I scale the reward during training as:
Or is this handled within the SB3 implementation of PPO (especially considering vectorized envs)?