[Question] Does PPO handle timeout and bootstrap correctly? #651
Comments
Hello, It currently does not (but you can use a EDIT: but timeouts are handled properly for off-policy algorithms |
I'm happy to work on this, as it's related to my current project as well. Before I proceed, I want to clear up a few things:
|
it is not used because we don't bootstrap when
compared to off-policy algorithms, PS: I will close this to have all the discussion in #633 |
Question
SB3's PPO does not seem to distinguish between done and timeout, and only relies on done flags when computing GAE return:
stable-baselines3/stable_baselines3/common/buffers.py
Line 349 in 2bb4500
For example, when GAE lambda is set to 1, the comment says that `R - V(s)` would be computed, where R is the discounted reward with bootstrap. What if bootstrap is not appropriate for certain envs (`done = 1` means done literally)?

Also, as the documentation for VecEnv points out, the real next observation is only available in the `terminal_observation` key of the `info` dictionary. However, I don't see the real next observation being used in computing the bootstrap.
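To make the question concrete, here is a minimal sketch of the two cases, written in plain Python for a single environment. It is not SB3's implementation: the function names `gae_advantages` and `fold_timeout_bootstrap`, and the `timeouts` and `terminal_values` inputs, are hypothetical and exist only for illustration. The sketch assumes `dones[t]` marks the transition that ended an episode, and that `terminal_values[t]` holds the critic's estimate for the real next observation (the one found under `terminal_observation`) at steps that ended by timeout.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """GAE over one rollout, cutting the bootstrap wherever dones[t] is True.

    Assumption: dones[t] refers to the transition taken at step t, and
    last_value is the value estimate of the state after the final step.
    """
    n = len(rewards)
    advantages = [0.0] * n
    last_gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        # A done flag removes both the bootstrap term and the recursion.
        non_terminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * non_terminal * last_gae
        advantages[t] = last_gae
    return advantages


def fold_timeout_bootstrap(
    rewards: Sequence[float],
    timeouts: Sequence[bool],
    terminal_values: Sequence[float],
    gamma: float = 0.99,
) -> List[float]:
    """Hypothetical workaround: add gamma * V(terminal_observation) to the
    reward of every step that ended by timeout, so a done-only GAE still
    bootstraps there. True terminations are left untouched.
    """
    return [
        r + gamma * v if timed_out else r
        for r, timed_out, v in zip(rewards, timeouts, terminal_values)
    ]
```

With `gae_lambda=1`, `gae_advantages` reduces to the discounted return minus `V(s)`, with no bootstrap past a done flag. Passing the output of `fold_timeout_bootstrap` as `rewards` restores the bootstrap only at timeout steps, which is one way to separate "done literally" from "cut off by a time limit".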