_trust_region_loss variations #1

Closed
ethancaballero opened this issue May 22, 2017 · 12 comments

@ethancaballero
Collaborator

ethancaballero commented May 22, 2017

In its current form (without a minus sign in front of _trust_region_loss), the reward obtained just sits at ~9 on CartPole; it might take off after more steps, but I haven't tried it.
With a minus sign in front, the reward obtained starts changing immediately.
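
For reference, a minimal sketch of the two sign conventions being compared; every name below is a hypothetical stand-in, not the actual code in this repo:

```python
import torch

# Illustrative stand-ins only -- these are not the repository's actual variables.
policy_loss = torch.tensor(0.5, requires_grad=True)
trust_region_loss = torch.tensor(0.1, requires_grad=True)  # e.g. a KL penalty term
beta = 1.0  # hypothetical trust-region coefficient

# Current form: the trust-region term is added to the policy loss.
loss_current = policy_loss + beta * trust_region_loss

# Variant with a minus sign in front, which is the one that reportedly
# makes the CartPole reward start changing immediately.
loss_minus = policy_loss - beta * trust_region_loss
```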

@ethancaballero
Collaborator Author

ethancaballero commented May 22, 2017

Also, just using TRPO over the sum of the whole rollout sequence runs significantly faster (but is less accurate) than using TRPO on each individual step as in the paper.
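
A rough sketch of the difference, using dummy gradients and the first-order trust-region adjustment from the ACER paper as the constrained step (nothing below is this repo's actual code):

```python
import torch

def trust_region_project(g, k, delta=1.0):
    # First-order trust-region adjustment from the ACER paper:
    # z* = g - max(0, (k.g - delta) / ||k||^2) * k
    scale = torch.clamp((torch.dot(g, k) - delta) / (k.pow(2).sum() + 1e-10), min=0)
    return g - scale * k

# Dummy per-step policy gradients g_t and KL gradients k_t for a 5-step rollout.
g_steps = [torch.randn(6) for _ in range(5)]
k_steps = [torch.randn(6) for _ in range(5)]

# Per-step (as in the paper): one projection per timestep, then accumulate.
per_step_update = sum(trust_region_project(g, k) for g, k in zip(g_steps, k_steps))

# Whole-rollout variant: accumulate first, then a single projection
# (one constrained step for the whole rollout -- faster, but a looser constraint).
whole_rollout_update = trust_region_project(sum(g_steps), sum(k_steps))
```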

@ethancaballero ethancaballero changed the title _trust_region_loss maybe should have minus sign in front of it _trust_region_loss variation May 22, 2017
@ethancaballero ethancaballero changed the title _trust_region_loss variation _trust_region_loss variations May 22, 2017
@Kaixhin
Owner

Kaixhin commented May 23, 2017

Thanks for looking into this! So looking at the Chainer code, KL is defined as follows, whereas with PyTorch it is as follows. I've done my best to make sure it matches the Chainer code, but I did it pretty quickly, so I quite possibly made a mistake. Anyway, this should make it easier to check (assuming the Chainer code is correct). I haven't checked myself whether size_average is better on or off. It needs thinking about, since on-policy is a single sample but off-policy is a batch.
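
For concreteness, here is a hedged sketch (not the linked Chainer or repo code) of how a manually computed KL can be checked against PyTorch's built-in kl_div; the distributions below are dummies:

```python
import torch
import torch.nn.functional as F

# Dummy categorical action distributions, purely for illustration.
logits_avg = torch.randn(1, 4)   # average (reference) policy
logits_cur = torch.randn(1, 4)   # current policy
p = F.softmax(logits_avg, dim=1)
log_p = F.log_softmax(logits_avg, dim=1)
log_q = F.log_softmax(logits_cur, dim=1)

# Manual KL(p || q), summed over actions.
kl_manual = (p * (log_p - log_q)).sum()

# PyTorch's kl_div expects log-probabilities as input and probabilities as
# target; with reduction='sum' it matches the manual sum above.
kl_builtin = F.kl_div(log_q, p, reduction='sum')

# size_average=True (nowadays reduction='mean') would instead divide by the
# number of elements, which is the on/off question raised above.
print(torch.allclose(kl_manual, kl_builtin))  # True
```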

With regards to TRPO over the rollout vs. over each step, I'd go with the latter to be consistent with the paper.

@Kaixhin
Owner

Kaixhin commented Jul 15, 2017

Trust region seems to be learning with the current code, but apart from being slow it doesn't seem to prevent collapses. Do you know if this is just a matter of tuning or something else?

[Plot: training reward curve with the current trust-region code, showing slow learning with occasional collapses]

@ethancaballero
Collaborator Author

ethancaballero commented Jul 16, 2017

Figure 1 from 1611.01224 suggests TRPO should help on Atari when most of the threads are off-policy.

Figure 3 shows that TRPO gives an even larger improvement when the task is continuous control (e.g. Reacher).

Try testing TRPO on Atari with 8 off-policy threads as in Figure 1. It might be that one needs to run on Atari (instead of CartPole) to see a noticeable difference.

@Kaixhin
Owner

Kaixhin commented Jul 16, 2017

Figure 1 is a little bit confusing, but after consulting the text, it is actually comparing the "replay ratio" - the number of times experience replay is sampled per on-policy trajectory. They always use 16 threads. And indeed, it does seem that trust region updates are, according to the reported results, more useful in continuous domains.

Now I realise we're missing a crucial part of the "efficient" trust region. Looking at the right of Figure 1, it doesn't look like there's much of a speed difference. They say:

The trust region step is carried out in the space of the statistics of the distribution f, and not in the space of the policy parameters. This is done deliberately so as to avoid an additional backpropagation step through the policy network.

Do you know how backprop_truncated can be implemented in PyTorch such that we can achieve this?
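
One possible route, sketched under the assumption that the statistics are the policy logits (the networks, dummy losses and the value of delta below are all hypothetical, not this repo's code): take gradients with respect to the statistics only, apply the first-order adjustment there, and then do a single backward pass through the policy network with the adjusted gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Linear(8, 4)       # stand-in policy network
avg_policy = nn.Linear(8, 4)   # stand-in average policy network
state = torch.randn(1, 8)
delta = 1.0                    # hypothetical trust-region threshold

phi = policy(state)                   # statistics of f (here: logits)
phi_avg = avg_policy(state).detach()  # truncated: no gradient into the average net

log_pi = F.log_softmax(phi, dim=1)
pi_avg = F.softmax(phi_avg, dim=1)

policy_loss = -log_pi[0, 0]                      # dummy per-step policy loss
kl = F.kl_div(log_pi, pi_avg, reduction='sum')   # KL(average || current)

# Gradients w.r.t. the statistics only -- no pass through the network yet.
g = torch.autograd.grad(-policy_loss, phi, retain_graph=True)[0]
k = torch.autograd.grad(kl, phi, retain_graph=True)[0]

# First-order trust-region adjustment: z* = g - max(0, (k.g - delta)/||k||^2) * k
scale = torch.clamp((torch.dot(k.flatten(), g.flatten()) - delta)
                    / (k.pow(2).sum() + 1e-10), min=0)
z = g - scale * k

# Single backward pass through the policy network with the adjusted gradient
# (negated here because z is an ascent direction and optimisers minimise).
phi.backward(gradient=-z)
```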

@ethancaballero
Collaborator Author

ethancaballero commented Jul 24, 2017

Are you abandoning ACER's TRPO in favor of OpenAI's newest version of PPO?
https://openai-public.s3-us-west-2.amazonaws.com/blog/2017-07/ppo/ppo-arxiv.pdf
https://blog.openai.com/openai-baselines-ppo/

@Kaixhin
Owner

Kaixhin commented Jul 24, 2017

No - just looking at PPO in parallel, but would like to get this fully implemented. I asked about backprop_truncated on the forums but the only answer is quite messy. Just asked on Slack, will see if anyone has other ideas. I've updated the readme to note the current status of this implementation.

@jingweiz

jingweiz commented Aug 23, 2017

Hey,
I just finished an implementation of ACER in my repo here: https://github.com/jingweiz/pytorch-rl
The backprop parts are mostly in _backward and _1st_order_trpo; if you have time, maybe you can check them? It's rather complicated, so it would be good to have a double check :) Thanks!
I haven't run tests on ACER yet but will do so in the next few days.

@Kaixhin
Owner

Kaixhin commented Aug 27, 2017

@jingweiz I had a quick look through acer_single_process.py and didn't spot any issues. Unfortunately I don't have the time to look through thoroughly. Your solution for 1st order TRPO is interesting - it'd be great if you could let us know here if it works.

@jingweiz

Hey,
thank you so much for looking into it!
Unfortunately my ACER implementation hasn't worked yet; here's what I get for MountainCar:
[Screenshot from 2017-08-28: MountainCar training results]
I'm sort of lost here, because from what I got from the paper, the policy loss is first backpropagated up to the output, and then from the output back through the whole net, so maybe something's still missing ... :/

@Kaixhin
Owner

Kaixhin commented Aug 28, 2017

So the core of this is done in Chainer here, and I have a PyTorch feature request to replicate the functionality that makes it easy in Chainer, but unfortunately it doesn't look like we'll be able to do a direct port for a long time, if at all.

It should be possible to build this particular algorithm in PyTorch, but yeah, it's going to be tricky. One route for debugging might be to check the creators of Variables to see whether or not they are truncated, and also to check that the gradients have sensible values.
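
A small sketch of both checks in current PyTorch, where the old `.creator` attribute is exposed as `.grad_fn` and `.detach()` gives the truncated version; the network and tensors below are dummies:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)      # dummy network
x = torch.randn(1, 4)

y = net(x)
y_truncated = net(x).detach()

print(y.grad_fn)            # e.g. <AddmmBackward0 ...> -> still part of the graph
print(y_truncated.grad_fn)  # None -> truncated, no gradient will flow back

# Inspect the gradient values flowing through a tensor with a hook.
y.register_hook(lambda grad: print('grad mean/max:', grad.mean().item(),
                                   grad.abs().max().item()))
y.sum().backward()
```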

@jingweiz

Yeah, thanks a lot! I'll check the code in more detail :) Hope to get it to work :)
