[question] [Proposal] Maximum Iterations Per Episode #633

AMR- · 2019-12-27T18:18:42Z

In this Issue, I identify a common use-case not currently addressed, propose a solution (adding "maximum timesteps per episode" argument to learn), and offer to implement the change if the community is open to it.

Sometimes we have environments where the agent can get stuck, and the episode does not end. We want to abort the episode prematurely, reset, and continue training. When we train, one way to account for this is to have a maximum number of timesteps per episode in training. When this maximum is hit, the environment is reset even if the episode is not done.

It's possible that this functionality already exists in the repository and I just missed it, but I looked through the documentation and the code itself and did not find it.

I propose adding a max_timesteps_per_episode argument to the learn methods.

.learn(total_timesteps, max_timesteps_per_episode=None, ....)
.learn(total_timesteps, max_timesteps_per_episode=2000, ....)

When max_timesteps_per_episode is set to None (default), behavior is as it currently is.

When max_timesteps_per_episode is set to a positive integer, then the following should occur: after an environment is reset, and then after this integer number of timesteps on this environment, the environment will be reset again even if the state is not done.

In the case of multiple environments, max timesteps per episode is of course per-environment.

If others like this proposal, I'm happy to implement and submit a PR for it. (At least for models that use Runner.)

Miffyli · 2019-12-27T18:23:48Z

This is job of a wrapper like the classical TimeLimit wrapper, which does similar job you described. While I agree it is a common utility, I do not think it should be part of stable-baselines features as it is a feature of environments used.

AMR- · 2020-01-09T16:52:09Z

Thank you for the information @Miffyli

CeBrendel · 2022-03-20T18:00:22Z

This is job of a wrapper like the classical TimeLimit wrapper, which does similar job you described. While I agree it is a common utility, I do not think it should be part of stable-baselines features as it is a feature of environments used.

I disagree! If you prematurely terminate an episode from the side of the environment via a "done" signal (i. e. wrap your environment in a TimeLimit) this will indicate a terminal state to the agent. In particular the estimation of the expected future reward will be skewed. Consider a scenario where there are two possible future paths at the second to last time step before premature termination. The first one is an eternal loop (if we would ignore termination) with each following reward -1. The second one is an actual terminal state - but to reach it you receive a reward of size -2. Now with the artificial horizon you choose the loop and receive a reward of -1 and update your agent accordinlgy. Assume for a moment we are working with the pure value function and update it with one step lookaheads: In this case you would update the value function as V(current state) <- (-1) + g * 0 = (-1) as the next state is a terminal state and doesnt provide any further rewards (even though it should, namely -g-g^2-g^3-... = - g/(1-g)). The normal update rule V(s) <- r + g * V(s') * (1 - done) skews the update to V(s) if "done" is not an indicator of a true terminal state!

TLDR: A TimeLimit wrapper changes the underlying (PO)MDP and will (more likely than) not yield an agent that behaves poorly in the unwrapped environment!

araffin · 2022-03-20T18:19:27Z

@CeBrendel you should probably take a look at DLR-RM/stable-baselines3#633

Miffyli · 2022-03-20T18:20:12Z

@CeBrendel Yup, you are right! TimeLimit wrapper adds an boolean that tells episode was truncated, and the next iteration of stable-baselines handles this info...

Aaaand ninja'd by @araffin :). See his link. We also recommend moving to stable-baselines3.

CeBrendel · 2022-03-20T22:23:52Z

Oh! Very nice to see that SB3 handles it. Thanks for the quick responses!

araffin added enhancement New feature or request openai gym related to OpenAI Gym interface question Further information is requested labels Dec 27, 2019

araffin closed this as completed Jan 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[question] [Proposal] Maximum Iterations Per Episode #633

[question] [Proposal] Maximum Iterations Per Episode #633

AMR- commented Dec 27, 2019

Miffyli commented Dec 27, 2019

AMR- commented Jan 9, 2020

CeBrendel commented Mar 20, 2022 •

edited

Loading

araffin commented Mar 20, 2022

Miffyli commented Mar 20, 2022

CeBrendel commented Mar 20, 2022

[question] [Proposal] Maximum Iterations Per Episode #633

[question] [Proposal] Maximum Iterations Per Episode #633

Comments

AMR- commented Dec 27, 2019

Miffyli commented Dec 27, 2019

AMR- commented Jan 9, 2020

CeBrendel commented Mar 20, 2022 • edited Loading

araffin commented Mar 20, 2022

Miffyli commented Mar 20, 2022

CeBrendel commented Mar 20, 2022

CeBrendel commented Mar 20, 2022 •

edited

Loading