
[Question] Terminal reward/penalty shaping in variable horizon environments while using PPO #777

Closed
akmandor opened this issue Feb 17, 2022 · 4 comments
Labels
question Further information is requested

Comments

@akmandor

Question

When the environment has a variable horizon (the number of timesteps per episode changes), does the default implementation of PPO in SB3 normalize the terminal (end-of-episode) rewards/penalties when updating the actor/critic networks?

Additional context

For example, let's say I have two trajectories that terminate at equally important goal states.

  • Traj1 consists of 10 steps
  • Traj2 consists of 20 steps

In order to update the value estimates of all states in both trajectories equally (assuming discount_factor=1), should I scale the reward during training as follows:

  • Terminal reward of Traj1 is set to 1
  • Terminal reward of Traj2 is set to 2?

Or is this already handled within the SB3 implementation of PPO (especially considering vectorized envs)? A rough sketch of the scaling I have in mind is below.
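For concreteness, the scaling I mean could look something like the following hypothetical gym wrapper (the "is_success" key is just a placeholder for however the env signals that the goal was reached; this is only a sketch, not something from SB3):

```python
import gym


class ScaledTerminalReward(gym.Wrapper):
    """Hypothetical wrapper: add a terminal reward proportional to the episode
    length, so a 20-step success gets twice the bonus of a 10-step one."""

    def __init__(self, env, base_terminal_reward=1.0, reference_length=10):
        super().__init__(env)
        self.base_terminal_reward = base_terminal_reward
        self.reference_length = reference_length
        self.episode_length = 0

    def reset(self, **kwargs):
        self.episode_length = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_length += 1
        # "is_success" is just a placeholder for however the env reports
        # that the goal state was reached.
        if done and info.get("is_success", False):
            reward += self.base_terminal_reward * (
                self.episode_length / self.reference_length
            )
        return obs, reward, done, info
```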

Checklist

  • [x] I have read the documentation (required)
  • [x] I have checked that there is no similar issue in the repo (required)
@akmandor akmandor added the question Further information is requested label Feb 17, 2022
@Miffyli
Collaborator

Miffyli commented Feb 17, 2022

No, there is no "normalization" that depends on the episode length. SB3 takes the reward from the environment and uses it directly in the discounted return/value computations.
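To illustrate, here is a minimal sketch of plain discounted-return bookkeeping (not SB3's actual RolloutBuffer/GAE code, but the rewards enter that computation unmodified in the same way):

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Plain discounted return computation for a single episode.

    Only a sketch: each reward is used as-is, and nothing here rescales
    or normalizes by the episode length.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 10-step and a 20-step episode with the same terminal reward of 1.0:
short_ep = discounted_returns([0.0] * 9 + [1.0])
long_ep = discounted_returns([0.0] * 19 + [1.0])
print(short_ep[0], long_ep[0])  # the longer episode's start state gets a smaller return
```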

As for whether you should do it: I do not have an answer, as this is not something that is commonly done in RL. I personally would not do it unless I had a very good reason to try it out. I recommend you try with and without this reward scaling (or with a completely different reward altogether) :) .

@akmandor
Author

akmandor commented Feb 18, 2022

Thanks for both the clarification of the PPO implementation and your comments on this issue.

Actually, I have been trying both, and I could only successfully train my agent without scaling the rewards. However, after reading the PPO paper, I thought that a fixed terminal reward creates an unequal distribution between similar states and reduces training success for longer episodes. For example, if the step reward is not tuned well relative to the terminal one, then for longer paths the agent starts slowing down and prefers collecting small step rewards, which cumulatively yield a larger return than a big terminal reward that depends on how fast the agent gets there.

I also found a similar discussion here, if anyone is interested in reading further.

@Miffyli
Collaborator

Miffyli commented Feb 18, 2022

Thanks for the link, I think I now understand the issue :). It comes down to experimentation (there are cases where a constant, small penalty/reward can be helpful), but I personally would try to keep the reward as simple as possible and build from there. One additional note: I would keep the returns (the discounted sums) at reasonable magnitudes (e.g. in the interval [-10, 10]). Otherwise, the value loss will have a large magnitude, which may result in disruptive updates to the network (keep an eye on the policy loss vs. the value loss for this).
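If you want a hand keeping rewards/returns in a reasonable range, a rough sketch using SB3's VecNormalize (assuming the standard API; swap in your own env) could look like:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

venv = make_vec_env("CartPole-v1", n_envs=4)
# norm_reward=True rescales rewards by a running estimate of the return scale,
# and clip_reward bounds them, which keeps the value loss magnitude in check.
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_reward=10.0)

model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=10_000)
```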

Closing as resolved and "no tech support". Good luck with your experiments :)

@Miffyli Miffyli closed this as completed Feb 18, 2022
@araffin
Member

araffin commented Feb 24, 2022

Hello,

> When the environment has a variable horizon (the number of timesteps per episode changes), does the default implementation of PPO in SB3 normalize the terminal (end-of-episode) rewards/penalties when updating the actor/critic networks?

One additional remark: one way to deal with that is to treat the problem as an infinite horizon problem (if it makes sense), and SB3 does support that.
In that case, the termination is usually due to a timeout rather than a normal termination, and info["TimeLimit.truncated"] = True is set by the env (see https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py#L20).
If you provide the "TimeLimit.truncated" key, then SB3 can automatically deal with it (you need the latest version of SB3 for that); please take a look at #633 for more details.
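A minimal sketch of a custom env setting this flag on timeout (hypothetical env, only to illustrate where the flag goes):

```python
import gym
import numpy as np


class MyVariableHorizonEnv(gym.Env):
    """Hypothetical variable-horizon env, only to illustrate the truncation flag."""

    def __init__(self, max_episode_steps=200):
        super().__init__()
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.max_episode_steps = max_episode_steps
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return self.observation_space.sample()

    def step(self, action):
        self.step_count += 1
        obs = self.observation_space.sample()
        reward = 0.0
        reached_goal = False  # placeholder for the real success condition
        done = reached_goal
        info = {}
        if not done and self.step_count >= self.max_episode_steps:
            # The episode ends only because of the time limit: set the flag so
            # SB3 bootstraps the value of the last state instead of treating
            # the timeout as a true terminal state.
            done = True
            info["TimeLimit.truncated"] = True
        return obs, reward, done, info
```

Equivalently, wrapping an env with gym.wrappers.TimeLimit (the file linked above) sets this flag automatically when the step limit is hit before a normal termination.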
