# [Question] What is "dead step" and when do I use `_accumulate_rewards`? #1260
## Comments
Hi wpm, I'm also a recent user of `AECEnv` and agree that these two things could be clarified. From what I've gathered, I think the following is true:
Maybe the tic-tac-toe example is poorly written and I should just stop looking at it. I have focused on it because I'm currently writing an extremely simple two-player turn-taking game that is very similar to tic-tac-toe. Has your proposed change passed the `api_test`?
I agree with you: just don't look at the tic-tac-toe example if that's the case. My proposed change does not pass the `api_test` because my observation space is a dictionary; it seems the test does not support dictionary spaces as of now. I should probably open an issue on this. If all you are doing is removing the `_was_dead_step` handling, I'm 95% sure that your environment will be fine, so long as the ending of your game is not contingent on the number of agents (only 1 agent remaining). If I were you, I'd ignore or comment out the `api_test` section if it looks like it's related to `_was_dead_step`. Happy to help further, but I'm far from a pro.
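For reference, this is roughly how I've been running the API test against my own environment; `my_game` and `MyGameEnv` are placeholders for your own module and class:

```python
# Sketch of running PettingZoo's API test against a custom AEC environment.
# `my_game` / `MyGameEnv` are placeholders for your own module and class.
from pettingzoo.test import api_test

from my_game import MyGameEnv

env = MyGameEnv()
api_test(env, num_cycles=10, verbose_progress=True)
```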
## Question
### Custom Environment Documentation Gaps
There are details about how to write the `step` function of a custom environment that appear to be undocumented and left to be inferred from examples and comments in source code. After a lot of study and experimentation I've got a partial handle on what they do, but I would like to confirm my understanding and suggest documentation to make it easier for others in the future.

### Specific Areas of Confusion
#### What is all this dead step stuff?
When an agent is terminated or truncated, it is marked as such and then goes through one more null action, during which it is removed from the agents list and from every dictionary keyed by agent, correct? I don't understand why this is. It seems to work, but it's a little convoluted. An explanation in the documentation would be helpful.
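Concretely, the pattern I keep seeing at the top of `step()` in the example environments looks like the sketch below (the class name is a placeholder and everything game-specific is elided):

```python
from pettingzoo import AECEnv


class MyGameEnv(AECEnv):  # placeholder name
    def step(self, action):
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # The selected agent already finished on an earlier turn, so this is
            # its one extra null ("dead") step: action must be None, and the
            # helper removes the agent from self.agents and from the per-agent
            # dictionaries (rewards, terminations, truncations, infos, ...).
            self._was_dead_step(action)
            return

        # ... normal game logic for a live agent goes here ...
```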
It also doesn't appear to always be necessary. The comment for `AECEnv._was_dead_step` says that it is "highly recommended" to put an `if` condition handling dead steps at the start of the `AECEnv.step` method, and several example environments do this. However, the code for the Tutorial: Environment Logic does not do this. Which is it? What does "highly recommended" mean?

I suspect that the dead-step logic is a way to handle complicated multi-agent scenarios, while simpler ones can forgo it. However, the classic Tic-Tac-Toe environment, which seems like it should be about the simplest possible multi-agent environment, has dead-step logic in its `AECEnv.step` implementation. When must you do this, and when should you skip it?

#### What should rewards look like at the end of an episode?
In addition to the initial `if` clause, the "dead step" logic appears to also require that `AECEnv._accumulate_rewards` be called when an episode is over. There's no documentation for this, not even in source code comments; you just have to infer it from looking at examples. I really don't understand what's going on here. Say you have a simple two-player game in which the winner gets one point and the loser gets none. There are no intermediate rewards. Yet I'm supposed to, what? Put the scores in the `rewards` dictionary so that they can be copied over to `_cumulative_rewards`? That seems backwards from the way it should be. Plus, `_accumulate_rewards` is marked as a private function, so I shouldn't be calling it at all.

I'd be willing to accept all this blindly, except that `_was_dead_step` deletes both `rewards` and `_cumulative_rewards`. When I run an episode of a simple two-player environment that I wrote, I see two "extra" dead steps (presumably one for each player), and afterwards all the rewards dictionaries are empty. I have no idea whether I'm doing this correctly or not. Also, the Environment Logic Tutorial has no call to `_accumulate_rewards`, and it looks as though it won't have its rewards dictionaries deleted at the end of an episode.

The best way I can think of to figure this out for myself is to pick a training algorithm and step through it with a few different environments, seeing exactly when it reads reward information from the environment, but then I'm in the situation of reverse engineering both PettingZoo and some third-party training code. I'd like to know I've got the environment correctly implemented before I bring in another source of confusion.
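To make my question concrete, here is my current best guess at how that simple winner-gets-one-point game is supposed to be written, pieced together from the tic-tac-toe source and the tutorials. The game itself is a made-up stand-in (`HighCardEnv`: each player picks a number and the higher number wins), and the whole thing may well be wrong; that is exactly what I'd like the documentation to confirm or correct.

```python
from gymnasium import spaces
from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector


class HighCardEnv(AECEnv):
    """Made-up game: each player picks one number, the higher number wins a point."""

    metadata = {"name": "high_card_v0", "render_modes": []}

    def __init__(self):
        super().__init__()
        self.possible_agents = ["player_0", "player_1"]

    def observation_space(self, agent):
        return spaces.Discrete(1)  # dummy observation, the game has no state to see

    def action_space(self, agent):
        return spaces.Discrete(10)  # pick a number from 0 to 9

    def observe(self, agent):
        return 0

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        self.rewards = {a: 0 for a in self.agents}
        self._cumulative_rewards = {a: 0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}
        self._plays = {}
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()

    def step(self, action):
        if (
            self.terminations[self.agent_selection]
            or self.truncations[self.agent_selection]
        ):
            # Extra "dead step" after the game ends: removes the agent from
            # self.agents and from all of the per-agent dictionaries.
            self._was_dead_step(action)
            return

        agent = self.agent_selection
        # This agent's previous reward has already been handed out via last(),
        # so its cumulative reward starts over (the tutorials do this too).
        self._cumulative_rewards[agent] = 0
        self._plays[agent] = int(action)

        if len(self._plays) == len(self.agents):
            # Both players have moved, so the game is over: hand out the final
            # rewards (ties arbitrarily go to player_0, it is only a sketch).
            winner = max(self.agents, key=lambda a: self._plays[a])
            self.rewards = {a: (1 if a == winner else 0) for a in self.agents}
            self.terminations = {a: True for a in self.agents}

        self.agent_selection = self._agent_selector.next()
        # Copies self.rewards into self._cumulative_rewards, which is what
        # last() reports back to the training loop as the reward.
        self._accumulate_rewards()
```

If that is right, then a standard `agent_iter()` loop sees each player's final reward exactly once, on its dead step, via `last()`, and the empty dictionaries I see afterwards are just the result of `_was_dead_step` cleaning up. But again, I'm guessing.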
### Recommendations
I don't mean to sound cross; PettingZoo is really hard to write. But if I'm hitting these problems, I bet that others are too and just don't want to take the time to write down exactly what they're confused about. I also realize that some of my questions may already be answered in the documentation, but I've been reading and rereading that documentation for a week now, so if it does contain answers, the answers need to be clearer.
There needs to be more documentation to reduce the confusion around these two issues. It doesn't have to spell out every contingency; it just needs to make a newcomer feel like they're on solid ground when writing a custom environment.
I think the best place for this additional documentation would be the Tutorial: Environment Logic page. In addition to the current example, have an additional example that uses the "dead step" and `_cumulative_rewards` strategies, along with an explanation of why you'd choose one over the other. I suspect there are lots of contingencies here and a complete description in prose would be prohibitively difficult, but a pair of contrastive examples would go a long way.

I'd take a shot at writing a first draft myself, but I can't, because I don't understand how `AECEnv` works. 😀