# [Question] What is "dead step" and when do I use `_accumulate_rewards`? #1260
## Comments
Hi wpm, I'm also a recent user of `AECEnv` and agree that these two things could be clarified. From what I've gathered, I think the following is true:
Maybe the tic-tac-toe example is poorly written and I should just stop looking at it. I have focused on it because I'm currently writing an extremely simple two-player turn-taking game that is very similar to tic-tac-toe. Has your proposed change passed the `api_test`?
I agree with you: just don't look at the tic-tac-toe example if that's the case. My proposed change does not pass the `api_test` because my observation space is a dictionary; it seems the test does not support dictionary spaces as of now. I should probably open an issue on this. If all you are doing is removing the `_was_dead_step` handling, I'm 95% sure that your environment will be fine, so long as the ending of your game is not contingent on the number of agents (only 1 agent remaining). If I were you, I'd ignore or comment out the `api_test` section if it looks like it's related to `_was_dead_step`. Happy to help further, but I'm far from a pro.
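For reference, this is roughly how I've been running the API test against my own environment; `my_game` and `MyGameEnv` are placeholders for your own module and class:

```python
# Sketch of running PettingZoo's API test against a custom AEC environment.
# `my_game` / `MyGameEnv` are placeholders for your own module and class.
from pettingzoo.test import api_test

from my_game import MyGameEnv

env = MyGameEnv()
api_test(env, num_cycles=10, verbose_progress=True)
```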
## Question
### Custom Environment Documentation Gaps
There are details about how to write the `step` function of a custom environment that appear to be undocumented and left to be inferred from examples and comments in source code. After a lot of study and experimentation I've got a partial handle on what they do, but I would like to confirm my understanding and suggest documentation to make it easier for others in the future.

### Specific Areas of Confusion
#### What is all this dead step stuff?
When an agent is terminated or truncated, it is marked as such and then goes through one more null action, during which it is removed from the agents list and from every dictionary keyed by agent, correct? I don't understand why this is. It seems to work, but it's a little convoluted. An explanation in the documentation would be helpful.
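Concretely, the pattern I keep seeing at the top of `step()` in the example environments looks like the sketch below (the class name is a placeholder and everything game-specific is elided):

```python
from pettingzoo import AECEnv


class MyGameEnv(AECEnv):  # placeholder name
    def step(self, action):
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # The selected agent already finished on an earlier turn, so this is
            # its one extra null ("dead") step: action must be None, and the
            # helper removes the agent from self.agents and from the per-agent
            # dictionaries (rewards, terminations, truncations, infos, ...).
            self._was_dead_step(action)
            return

        # ... normal game logic for a live agent goes here ...
```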
It also doesn't appear to always be necessary. The comment for `AECEnv._was_dead_step` says that it is "highly recommended" to put an `if` condition handling dead steps at the start of the `AECEnv.step` method, and several example environments do this. However, the code for the Tutorial: Environment Logic does not do this. Which is it? What does "highly recommended" mean?

I suspect that the dead-step logic is a way to handle complicated multi-agent scenarios, while simpler ones can forgo it. However, the classic Tic-Tac-Toe environment, which seems like it should be about the simplest possible multi-agent environment, has dead-step logic in its `AECEnv.step` implementation. When must you do this, and when should you skip it?

#### What should rewards look like at the end of an episode?
In addition to the initial `if` clause, the "dead step" logic appears to also require that `AECEnv._accumulate_rewards` be called when an episode is over. There's no documentation for this, not even in source code comments; you just have to infer it from looking at examples. I really don't understand what's going on here. Say you have a simple two-player game in which the winner gets one point and the loser gets none. There are no intermediate rewards. Yet I'm supposed to, what? Put the scores in the `rewards` dictionary so that they can be copied over to `_cumulative_rewards`? That seems backwards from the way it should be. Plus, `_accumulate_rewards` is marked as a private function, so I shouldn't be calling it at all.

I'd be willing to accept all this blindly, except that `_was_dead_step` deletes both `rewards` and `_cumulative_rewards`. When I run an episode of a simple two-player environment that I wrote, I see two "extra" dead steps (presumably one for each player), and afterwards all the rewards dictionaries are empty. I have no idea whether I'm doing this correctly or not. Also, the Environment Logic Tutorial has no call to `_accumulate_rewards`, and it looks as though it won't have its rewards dictionaries deleted at the end of an episode.

The best way I can think of to figure this out for myself is to pick a training algorithm and step through it with a few different environments, seeing exactly when it reads reward information from the environment, but then I'm in the situation of reverse engineering both PettingZoo and some third-party training code. I'd like to know I've got the environment correctly implemented before I bring in another source of confusion.
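To make my question concrete, here is my current best guess at how that simple winner-gets-one-point game is supposed to be written, pieced together from the tic-tac-toe source and the tutorials. The game itself is a made-up stand-in (`HighCardEnv`: each player picks a number and the higher number wins), and the whole thing may well be wrong; that is exactly what I'd like the documentation to confirm or correct.

```python
from gymnasium import spaces
from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector


class HighCardEnv(AECEnv):
    """Made-up game: each player picks one number, the higher number wins a point."""

    metadata = {"name": "high_card_v0", "render_modes": []}

    def __init__(self):
        super().__init__()
        self.possible_agents = ["player_0", "player_1"]

    def observation_space(self, agent):
        return spaces.Discrete(1)  # dummy observation, the game has no state to see

    def action_space(self, agent):
        return spaces.Discrete(10)  # pick a number from 0 to 9

    def observe(self, agent):
        return 0

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        self.rewards = {a: 0 for a in self.agents}
        self._cumulative_rewards = {a: 0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}
        self._plays = {}
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()

    def step(self, action):
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # Extra "dead step" after the game ends: removes the agent from
            # self.agents and from all of the per-agent dictionaries.
            self._was_dead_step(action)
            return

        agent = self.agent_selection
        # This agent's previous reward has already been handed out via last(),
        # so its cumulative reward starts over (the tutorials do this too).
        self._cumulative_rewards[agent] = 0
        self._plays[agent] = int(action)

        if len(self._plays) == len(self.agents):
            # Both players have moved, so the game is over: hand out the final
            # rewards (ties arbitrarily go to player_0, it is only a sketch).
            winner = max(self.agents, key=lambda a: self._plays[a])
            self.rewards = {a: (1 if a == winner else 0) for a in self.agents}
            self.terminations = {a: True for a in self.agents}

        self.agent_selection = self._agent_selector.next()
        # Copies self.rewards into self._cumulative_rewards, which is what
        # last() reports back to the training loop as the reward.
        self._accumulate_rewards()
```

If that is right, then a standard `agent_iter()` loop sees each player's final reward exactly once, on its dead step, via `last()`, and the empty dictionaries I see afterwards are just the result of `_was_dead_step` cleaning up. But again, I'm guessing.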
### Recommendations
I don't mean to sound cross; PettingZoo is really hard to write. But if I'm hitting these problems, I bet that others are too and just don't want to take the time to write down exactly what they're confused about. I also realize that some of my questions may already be answered in the documentation, but I've been reading and rereading that documentation for a week now, so if it does contain answers, the answers need to be clearer.
There needs to be more documentation to reduce the confusion around these two issues. It doesn't have to spell out every contingency; it just needs to make a newcomer feel like they're on solid ground when writing a custom environment.
I think the best place for this additional documentation would be the Tutorial: Environment Logic page. In addition to the current example, have an additional example that uses the "dead step" and `_cumulative_rewards` strategies, along with an explanation of why you'd choose one over the other. I suspect there are lots of contingencies here and a complete description in prose would be prohibitively difficult, but a pair of contrastive examples would go a long way.

I'd take a shot at writing a first draft myself, but I can't, because I don't understand how `AECEnv` works. 😀