
[Question] Does PPO handle timeout and bootstrap correctly? #651

Closed
2 tasks done
zhihanyang2022 opened this issue Nov 4, 2021 · 3 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

@zhihanyang2022
Copy link

zhihanyang2022 commented Nov 4, 2021

Question

SB3's PPO does not seem to distinguish between done and timeout, and only relies on done flags when computing the GAE return:

def compute_returns_and_advantage(self, last_values: th.Tensor, dones: np.ndarray) -> None:

For example, when GAE lambda is set to 1, the comment says that R - V(s) is computed, where R is the discounted return with bootstrapping. What if bootstrapping is not appropriate for certain envs (where done = 1 means a true terminal state)?
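To make the concern concrete, here is a standalone sketch of GAE in the style the issue describes, where the `dones` flags alone decide whether to bootstrap from the next value (names like `compute_gae` are illustrative, not SB3's exact code). If `dones[t] = 1` at a timeout, the `next_non_terminal` factor zeroes out the bootstrap term even though the episode did not truly end:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Sketch of GAE where dones alone gate bootstrapping (illustrative)."""
    n = len(rewards)
    advantages = np.zeros(n)
    last_gae = 0.0
    for t in reversed(range(n)):
        # done=1 kills the bootstrap term, whether the env truly
        # terminated or merely hit a time limit.
        next_non_terminal = 1.0 - dones[t]
        next_value = last_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```

With gamma = lambda = 1 and a done at the final step, the returns collapse to the plain Monte-Carlo sums and `last_value` is ignored, which is exactly the behavior being questioned for timeouts.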

Also, as the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real next observation being used in computing the bootstrap.
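For reference, a hypothetical sketch of what using `terminal_observation` for timeout bootstrapping could look like (this is not SB3's current code; `bootstrap_reward` and `value_fn` are made-up names, and `"TimeLimit.truncated"` is the key set by gym's TimeLimit wrapper):

```python
import numpy as np

def bootstrap_reward(reward, done, info, value_fn, gamma=0.99):
    """Illustrative only: fold V(terminal_obs) into the reward on timeout.

    A VecEnv auto-resets, so the obs returned after done=True belongs to
    the *next* episode; the real final obs is stashed in the info dict.
    """
    if done and info.get("TimeLimit.truncated", False):
        terminal_obs = info["terminal_observation"]
        reward += gamma * value_fn(terminal_obs)
    return reward
```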

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
@zhihanyang2022 zhihanyang2022 added the question Further information is requested label Nov 4, 2021
@zhihanyang2022 zhihanyang2022 changed the title [Question] Does PPO handle timeout correctly? [Question] Does PPO handle timeout and bootstrap correctly? Nov 4, 2021
@araffin araffin added the duplicate This issue or pull request already exists label Nov 4, 2021
@araffin
Copy link
Member

araffin commented Nov 4, 2021

Hello,
Duplicate of #633

It currently does not (but you can use a TimeFeatureWrapper as in the RL Zoo) and we would welcome a PR that implements proper handling of timeouts ;)

EDIT: but timeouts are handled properly for off-policy algorithms

@zhihanyang2022
Copy link
Author

zhihanyang2022 commented Nov 4, 2021

we would welcome a PR that implements proper handling of timeouts ;)

I'm happy to work on this, as it's related to my current project as well.

Before I proceed, I want to clear up a few things:

  • (I asked this earlier) As the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real observation being used for bootstrapping in the current code for PPO, which is a bit weird to me.
  • How should one interpret the naming convention _last_episode_starts, and why isn't the naming done used instead? I've read Add test for GAE + rename RolloutBuffer.dones for clarification #375 but I'm still not sure.

@araffin
Copy link
Member

araffin commented Nov 4, 2021

However, I don't see the real observation being used for bootstrapping in the current code for PPO, which is a bit weird to me.

it is not used because we don't bootstrap when done=True currently.

and why isn't the naming done used instead?

Compared to off-policy algorithms, last_episode_start is shifted by one step (hence the renaming), and we initialize _last_episode_starts to true:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/base_class.py#L430
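The one-step shift described above can be illustrated in a few lines (variable names are illustrative): the episode-start flag at step t is the done flag from step t-1, and the very first entry is True because a new episode starts at t=0.

```python
import numpy as np

# done after steps 2 (timeout or terminal) in a 5-step rollout
dones = np.array([0, 0, 1, 0, 0], dtype=bool)

# episode_starts is dones shifted right by one, seeded with True at t=0
episode_starts = np.concatenate(([True], dones[:-1]))
# episode_starts == [True, False, False, True, False]
```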

PS: I will close this to have all the discussion in #633
