
[Question] Terminal reward/penalty shaping in variable horizon environments while using PPO #777

Closed
akmandor opened this issue Feb 17, 2022 · 4 comments
Labels
question Further information is requested

Comments

@akmandor

Question

When the environment has a variable horizon (the number of timesteps per episode changes), does the default implementation of PPO in SB3 normalize the terminal (end-of-episode) rewards/penalties when updating the actor/critic networks?

Additional context

For example, let's say I have two trajectories that terminate at equally important goal states.

  • Traj1 consists of 10 steps
  • Traj2 consists of 20 steps

In order to update the value estimates of all states in both trajectories equally (assuming discount_factor=1), should I scale the reward during training as follows:

  • Terminal reward of Traj1 is set to 1
  • Terminal reward of Traj2 is set to 2?

Or is this already handled within the SB3 implementation of PPO (especially considering vectorized envs)? A rough sketch of the scaling I have in mind is below.
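For concreteness, the scaling I mean could look something like the following hypothetical gym wrapper (the "is_success" key is just a placeholder for however the env signals that the goal was reached; this is only a sketch, not something from SB3):

```python
import gym


class ScaledTerminalReward(gym.Wrapper):
    """Hypothetical wrapper: add a terminal reward proportional to the episode
    length, so a 20-step success gets twice the bonus of a 10-step one."""

    def __init__(self, env, base_terminal_reward=1.0, reference_length=10):
        super().__init__(env)
        self.base_terminal_reward = base_terminal_reward
        self.reference_length = reference_length
        self.episode_length = 0

    def reset(self, **kwargs):
        self.episode_length = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_length += 1
        # "is_success" is just a placeholder for however the env reports
        # that the goal state was reached.
        if done and info.get("is_success", False):
            reward += self.base_terminal_reward * (
                self.episode_length / self.reference_length
            )
        return obs, reward, done, info
```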

Checklist

  • [x] I have read the documentation (required)
  • [x] I have checked that there is no similar issue in the repo (required)
@akmandor akmandor added the question Further information is requested label Feb 17, 2022
@Miffyli
Collaborator

Miffyli commented Feb 17, 2022

No, there is no "normalization" that depends on the episode length. SB3 takes the reward from the environment and uses it directly in the discounted return/value computations.
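To illustrate, here is a minimal sketch of plain discounted-return bookkeeping (not SB3's actual RolloutBuffer/GAE code, but the rewards enter that computation unmodified in the same way):

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Plain discounted return computation for a single episode.

    Only a sketch: each reward is used as-is, and nothing here rescales
    or normalizes by the episode length.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 10-step and a 20-step episode with the same terminal reward of 1.0:
short_ep = discounted_returns([0.0] * 9 + [1.0])
long_ep = discounted_returns([0.0] * 19 + [1.0])
print(short_ep[0], long_ep[0])  # the longer episode's start state gets a smaller return
```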

As for whether you should do it: I do not have an answer, as this is not something that is commonly done in RL. I personally would not do it unless I had a very good reason to try it out. I recommend you try with and without this reward scaling (or with a completely different reward altogether) :) .

@akmandor
Author

akmandor commented Feb 18, 2022

Thanks for both the clarification of the PPO implementation and your comments on this issue.

Actually, I have been trying both, and I could only successfully train my agent without scaling the rewards. However, after reading the PPO paper, I thought that a fixed terminal reward creates an unequal distribution between similar states and reduces training success for longer episodes. For example, if the step reward is not tuned well relative to the terminal one, then for longer paths the agent starts slowing down and prefers collecting small step rewards, which cumulatively yield a larger return than a big terminal reward that depends on how fast the agent gets there.

I also found a similar discussion here, if anyone is interested in reading further.

@Miffyli
Collaborator

Miffyli commented Feb 18, 2022

Thanks for the link, I think I now understand the issue :). It comes down to experimentation (there are cases where a constant, small penalty/reward can be helpful), but I personally would try to keep the reward as simple as possible and build from there. One additional note: I would keep the returns (the discounted sums) at reasonable magnitudes (e.g. in the interval [-10, 10]). Otherwise, the value loss will have a large magnitude, which may result in disruptive updates to the network (keep an eye on the policy loss vs. the value loss for this).
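If you want a hand keeping rewards/returns in a reasonable range, a rough sketch using SB3's VecNormalize (assuming the standard API; swap in your own env) could look like:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

venv = make_vec_env("CartPole-v1", n_envs=4)
# norm_reward=True rescales rewards by a running estimate of the return scale,
# and clip_reward bounds them, which keeps the value loss magnitude in check.
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_reward=10.0)

model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=10_000)
```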

Closing as resolved and "no tech support". Good luck with your experiments :)

@Miffyli Miffyli closed this as completed Feb 18, 2022
@araffin
Member

araffin commented Feb 24, 2022

Hello,

> When the environment has a variable horizon (the number of timesteps per episode changes), does the default implementation of PPO in SB3 normalize the terminal (end-of-episode) rewards/penalties when updating the actor/critic networks?

One additional remark: one way to deal with that is to treat the problem as an infinite horizon problem (if it makes sense), and SB3 does support that.
In that case, the termination is usually due to a timeout rather than a normal termination, and info["TimeLimit.truncated"] = True is set by the env (see https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py#L20).
If you provide the "TimeLimit.truncated" key, then SB3 can automatically deal with it (you need the latest version of SB3 for that); please take a look at #633 for more details.
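A minimal sketch of a custom env setting this flag on timeout (hypothetical env, only to illustrate where the flag goes):

```python
import gym
import numpy as np


class MyVariableHorizonEnv(gym.Env):
    """Hypothetical variable-horizon env, only to illustrate the truncation flag."""

    def __init__(self, max_episode_steps=200):
        super().__init__()
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.max_episode_steps = max_episode_steps
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return self.observation_space.sample()

    def step(self, action):
        self.step_count += 1
        obs = self.observation_space.sample()
        reward = 0.0
        reached_goal = False  # placeholder for the real success condition
        done = reached_goal
        info = {}
        if not done and self.step_count >= self.max_episode_steps:
            # The episode ends only because of the time limit: set the flag so
            # SB3 bootstraps the value of the last state instead of treating
            # the timeout as a true terminal state.
            done = True
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info
```

Equivalently, wrapping an env with gym.wrappers.TimeLimit (the file linked above) sets this flag automatically when the step limit is hit before a normal termination.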
