
[Question] Does PPO handle timeout and bootstrap correctly? #651

Closed
2 tasks done
zhihanyang2022 opened this issue Nov 4, 2021 · 3 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Comments

@zhihanyang2022
Copy link

zhihanyang2022 commented Nov 4, 2021

Question

SB3's PPO does not seem to distinguish between done and timeout, and only relies on done flags when computing the GAE return:

def compute_returns_and_advantage(self, last_values: th.Tensor, dones: np.ndarray) -> None:

For example, when GAE lambda is set to 1, the comment says that R - V(s) is computed, where R is the discounted return with bootstrapping. What if bootstrapping is not appropriate for certain envs (where done = 1 means a true terminal state)?
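To make the concern concrete, here is a standalone sketch of GAE in the style the issue describes, where the `dones` flags alone decide whether to bootstrap from the next value (names like `compute_gae` are illustrative, not SB3's exact code). If `dones[t] = 1` at a timeout, the `next_non_terminal` factor zeroes out the bootstrap term even though the episode did not truly end:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Sketch of GAE where dones alone gate bootstrapping (illustrative)."""
    n = len(rewards)
    advantages = np.zeros(n)
    last_gae = 0.0
    for t in reversed(range(n)):
        # done=1 kills the bootstrap term, whether the env truly
        # terminated or merely hit a time limit.
        next_non_terminal = 1.0 - dones[t]
        next_value = last_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```

With gamma = lambda = 1 and a done at the final step, the returns collapse to the plain Monte-Carlo sums and `last_value` is ignored, which is exactly the behavior being questioned for timeouts.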

Also, as the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real next observation being used in computing the bootstrap.
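For reference, a hypothetical sketch of what using `terminal_observation` for timeout bootstrapping could look like (this is not SB3's current code; `bootstrap_reward` and `value_fn` are made-up names, and `"TimeLimit.truncated"` is the key set by gym's TimeLimit wrapper):

```python
import numpy as np

def bootstrap_reward(reward, done, info, value_fn, gamma=0.99):
    """Illustrative only: fold V(terminal_obs) into the reward on timeout.

    A VecEnv auto-resets, so the obs returned after done=True belongs to
    the *next* episode; the real final obs is stashed in the info dict.
    """
    if done and info.get("TimeLimit.truncated", False):
        terminal_obs = info["terminal_observation"]
        reward += gamma * value_fn(terminal_obs)
    return reward
```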

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
@zhihanyang2022 zhihanyang2022 added the question Further information is requested label Nov 4, 2021
@zhihanyang2022 zhihanyang2022 changed the title [Question] Does PPO handle timeout correctly? [Question] Does PPO handle timeout and bootstrap correctly? Nov 4, 2021
@araffin araffin added the duplicate This issue or pull request already exists label Nov 4, 2021
@araffin
Copy link
Member

araffin commented Nov 4, 2021

Hello,
Duplicate of #633

It currently does not (but you can use a TimeFeatureWrapper as in the RL Zoo) and we would welcome a PR that implements proper handling of timeouts ;)

EDIT: but timeouts are handled properly for off-policy algorithms

@zhihanyang2022
Copy link
Author

zhihanyang2022 commented Nov 4, 2021

we would welcome a PR that implements proper handling of timeouts ;)

I'm happy to work on this, as it's related to my current project as well.

Before I proceed, I want to clear up a few things:

  • (I asked this earlier) As the documentation for VecEnv points out, the real next observation is only available in the terminal_observation key of the info dictionary. However, I don't see the real observation being used for bootstrapping in the current code for PPO, which is a bit weird to me.
  • How should one interpret the naming convention _last_episode_starts, and why isn't the naming done used instead? I've read Add test for GAE + rename RolloutBuffer.dones for clarification #375 but I'm still not sure.

@araffin
Copy link
Member

araffin commented Nov 4, 2021

However, I don't see the real observation being used for bootstrapping in the current code for PPO, which is a bit weird to me.

it is not used because we don't bootstrap when done=True currently.

and why isn't the naming done used instead?

Compared to off-policy algorithms, last_episode_start is shifted by one step (hence the renaming), and we initialize _last_episode_starts to true:

https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/base_class.py#L430
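The one-step shift described above can be illustrated in a few lines (variable names are illustrative): the episode-start flag at step t is the done flag from step t-1, and the very first entry is True because a new episode starts at t=0.

```python
import numpy as np

# done after steps 2 (timeout or terminal) in a 5-step rollout
dones = np.array([0, 0, 1, 0, 0], dtype=bool)

# episode_starts is dones shifted right by one, seeded with True at t=0
episode_starts = np.concatenate(([True], dones[:-1]))
# episode_starts == [True, False, False, True, False]
```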

PS: I will close this to have all the discussion in #633
