_trust_region_loss variations #1
Comments
Also, just using TRPO over the sum of the whole rollout sequence runs significantly faster (but is less accurate) than using TRPO on each individual step like in the paper.
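For concreteness, a rough sketch of the two loop structures being compared here (the toy model, the dummy losses, and the `trust_region_step` helper are placeholders of mine, not the repo's code; the helper just backprops a plain gradient, standing in for the real constrained update):

```python
import torch

def trust_region_step(loss, policy_out):
    """Hypothetical stand-in for one trust-region-constrained update on `loss`;
    here it only backprops the plain gradient of `loss` through `policy_out`
    (the actual ACER projection is discussed later in this thread)."""
    grad, = torch.autograd.grad(loss, policy_out, retain_graph=True)
    policy_out.backward(gradient=grad, retain_graph=True)

policy = torch.nn.Linear(4, 2)                 # toy policy head, not the repo's model
rollout = torch.randn(20, 4)                   # 20 observations from one rollout
logits = policy(rollout)
log_probs = torch.log_softmax(logits, dim=-1)
step_losses = -log_probs[:, 0]                 # dummy per-step policy losses

# Per-step (as in the paper): one constrained update per timestep -- slower,
# but the trust region is enforced at every state in the rollout.
for t in range(len(step_losses)):
    trust_region_step(step_losses[t], logits)

policy.zero_grad()

# Whole-rollout: constrain the summed loss once -- cheaper, but the single
# constraint no longer bounds the policy change at each individual step.
trust_region_step(step_losses.sum(), logits)
```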
Thanks for looking into this! Looking at the Chainer code, KL is defined as follows. With PyTorch, it is defined as follows. I've done my best to make sure it matches the Chainer code, but I did it pretty quickly, so I quite possibly made a mistake. Anyway, this should make it easier to check (assuming the Chainer code is correct). I haven't checked myself if [...]. With regards to TRPO over the rollout vs. over each step, I'd go with the latter to be consistent with the paper.
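One way to cross-check the two definitions is to compute the categorical KL explicitly and compare it against `torch.distributions`; the helper names below are mine, not the repo's, and the main thing to watch is the argument order, since KL(P || Q) is not symmetric:

```python
import torch
import torch.nn.functional as F

def categorical_kl(p_logits, q_logits):
    # KL(P || Q) = sum_a P(a) * (log P(a) - log Q(a)) for softmax policies.
    # KL is not symmetric, so swapping the arguments is an easy place for a
    # Chainer -> PyTorch port to silently diverge.
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return (p_log.exp() * (p_log - q_log)).sum(dim=-1)

def categorical_kl_dist(p_logits, q_logits):
    # The same quantity via torch.distributions, for cross-checking.
    p = torch.distributions.Categorical(logits=p_logits)
    q = torch.distributions.Categorical(logits=q_logits)
    return torch.distributions.kl_divergence(p, q)

if __name__ == '__main__':
    a, b = torch.randn(3, 6), torch.randn(3, 6)
    print(torch.allclose(categorical_kl(a, b), categorical_kl_dist(a, b)))
```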
Figure 1 from 1611.01224 suggests TRPO should help on Atari when most of the threads are off-policy. Figure 3 shows an even larger improvement from TRPO when the task is a continuous control task (e.g. Reacher). Try testing TRPO on Atari with 8 off-policy threads like in Figure 1. It might be that one needs to run on Atari (instead of CartPole) to see a noticeable difference.
Figure 1 is a little bit confusing, but after consulting the text, it is actually comparing the "replay ratio" - the number of times experience replay is sampled per on-policy trajectory. They always use 16 threads. And indeed, it does seem that trust region updates are, according to the reported results, more useful in continuous domains. Now I realise we're missing a crucial part of the "efficient" trust region. Looking at the right of Figure 1, it doesn't look like there's much of a speed difference. They say: [...]
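For reference, the "efficient" trust region in the ACER paper operates on the gradient with respect to the policy's output statistics rather than the network weights: with g the gradient of the objective and k the gradient of KL(average policy || current policy), the adjusted gradient is g - max(0, (k.g - delta) / ||k||^2) * k. A minimal sketch of that projection (my own function name and epsilon, not the repo's code):

```python
import torch

def trust_region_adjust(g, k, delta=1.0, eps=1e-10):
    """ACER-style trust region projection of the policy gradient.

    g: gradient of the objective w.r.t. the policy output (e.g. action probabilities)
    k: gradient of KL(average_policy || current_policy) w.r.t. the same output
    delta: trust region constraint; eps avoids division by zero
    """
    k_dot_g = (k * g).sum(dim=-1, keepdim=True)
    k_dot_k = (k * k).sum(dim=-1, keepdim=True)
    scale = ((k_dot_g - delta) / (k_dot_k + eps)).clamp(min=0)
    return g - scale * k
```

The `clamp(min=0)` implements the max(0, .), so the gradient is left untouched whenever the linearised KL constraint is already satisfied.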
Do you know how [...]
Are you abandoning ACER TRPO in favor of OpenAI's newest version of PPO?
No - just looking at PPO in parallel, but I would like to get this fully implemented. I asked about [...]
Hey, [...]
@jingweiz I had a quick look through [...]
So the core of this is done in Chainer here, and I have a PyTorch feature request to replicate the functionality that makes it easy in Chainer, but unfortunately it doesn't look like we'll be able to do a direct port for a long time, if at all. It should be possible to build this particular algorithm in PyTorch, but yeah, it's going to be tricky. One route for debugging might be to check the [...]
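In case it helps with the port: PyTorch can already resume backpropagation from an intermediate tensor with a hand-modified gradient via `tensor.backward(gradient=...)` (or `torch.autograd.backward`), which is the piece the Chainer version leans on. A rough, self-contained sketch with a toy network and a dummy objective (none of these names are from the repo):

```python
import torch

policy = torch.nn.Linear(4, 2)       # toy policy head
obs = torch.randn(1, 4)

logits = policy(obs)                 # intermediate node whose gradient we want to override
probs = torch.softmax(logits, dim=-1)

# Stand-in for the actor-critic objective (the real one comes from the ACER loss):
dummy_objective = (probs * torch.tensor([[2.0, -1.0]])).sum()

# Gradient of the objective w.r.t. the intermediate node, keeping the graph alive:
g, = torch.autograd.grad(dummy_objective, logits, retain_graph=True)

# In ACER this gradient would now be replaced by its trust-region projection;
# it is left unchanged here just to show the mechanism.
adjusted = g

# Resume backpropagation from the intermediate node with the custom gradient --
# the PyTorch counterpart of setting .grad on a Chainer Variable and calling backward().
logits.backward(gradient=adjusted)
print(policy.weight.grad)
```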
Yeah, thanks a lot! I'll check the code in more detail :) Hope to get it to work :)
In the current form (without a minus sign in front of _trust_region_loss), the reward obtained just sits at ~9 on CartPole; it might take off after more steps, but I haven't tried it.
With a minus sign in front, the reward obtained starts changing immediately.