_trust_region_loss variations #1
Comments
Also, just using TRPO over the sum of the whole rollout sequence runs significantly faster (but is less accurate) than using TRPO on each individual step like in the paper.
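For concreteness, a rough sketch of the two loop structures being compared here (the toy model, the dummy losses, and the `trust_region_step` helper are placeholders of mine, not the repo's code; the helper just backprops a plain gradient, standing in for the real constrained update):

```python
import torch

def trust_region_step(loss, policy_out):
    """Hypothetical stand-in for one trust-region-constrained update on `loss`;
    here it only backprops the plain gradient of `loss` through `policy_out`
    (the actual ACER projection is discussed later in this thread)."""
    grad, = torch.autograd.grad(loss, policy_out, retain_graph=True)
    policy_out.backward(gradient=grad, retain_graph=True)

policy = torch.nn.Linear(4, 2)                 # toy policy head, not the repo's model
rollout = torch.randn(20, 4)                   # 20 observations from one rollout
logits = policy(rollout)
log_probs = torch.log_softmax(logits, dim=-1)
step_losses = -log_probs[:, 0]                 # dummy per-step policy losses

# Per-step (as in the paper): one constrained update per timestep -- slower,
# but the trust region is enforced at every state in the rollout.
for t in range(len(step_losses)):
    trust_region_step(step_losses[t], logits)

policy.zero_grad()

# Whole-rollout: constrain the summed loss once -- cheaper, but the single
# constraint no longer bounds the policy change at each individual step.
trust_region_step(step_losses.sum(), logits)
```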
Thanks for looking into this! Looking at the Chainer code, KL is defined as follows. With PyTorch, it is defined as follows. I've done my best to make sure it matches the Chainer code, but I did it pretty quickly, so I quite possibly made a mistake. Anyway, this should make it easier to check (assuming the Chainer code is correct). I haven't checked myself if [...]. With regards to TRPO over the rollout vs. over each step, I'd go with the latter to be consistent with the paper.
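One way to cross-check the two definitions is to compute the categorical KL explicitly and compare it against `torch.distributions`; the helper names below are mine, not the repo's, and the main thing to watch is the argument order, since KL(P || Q) is not symmetric:

```python
import torch
import torch.nn.functional as F

def categorical_kl(p_logits, q_logits):
    # KL(P || Q) = sum_a P(a) * (log P(a) - log Q(a)) for softmax policies.
    # KL is not symmetric, so swapping the arguments is an easy place for a
    # Chainer -> PyTorch port to silently diverge.
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return (p_log.exp() * (p_log - q_log)).sum(dim=-1)

def categorical_kl_dist(p_logits, q_logits):
    # The same quantity via torch.distributions, for cross-checking.
    p = torch.distributions.Categorical(logits=p_logits)
    q = torch.distributions.Categorical(logits=q_logits)
    return torch.distributions.kl_divergence(p, q)

if __name__ == '__main__':
    a, b = torch.randn(3, 6), torch.randn(3, 6)
    print(torch.allclose(categorical_kl(a, b), categorical_kl_dist(a, b)))
```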
Figure 1 from 1611.01224 suggests TRPO should help on Atari when most of the threads are off-policy. Figure 3 shows an even larger improvement from TRPO when the task is a continuous control task (e.g. Reacher). Try testing TRPO on Atari with 8 off-policy threads like in Figure 1. It might be that one needs to run on Atari (instead of CartPole) to see a noticeable difference.
Figure 1 is a little bit confusing, but after consulting the text, it is actually comparing the "replay ratio" - the number of times experience replay is sampled per on-policy trajectory. They always use 16 threads. And indeed, it does seem that trust region updates are, according to the reported results, more useful in continuous domains. Now I realise we're missing a crucial part of the "efficient" trust region. Looking at the right of Figure 1, it doesn't look like there's much of a speed difference. They say: [...]
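For reference, the "efficient" trust region in the ACER paper operates on the gradient with respect to the policy's output statistics rather than the network weights: with g the gradient of the objective and k the gradient of KL(average policy || current policy), the adjusted gradient is g - max(0, (k.g - delta) / ||k||^2) * k. A minimal sketch of that projection (my own function name and epsilon, not the repo's code):

```python
import torch

def trust_region_adjust(g, k, delta=1.0, eps=1e-10):
    """ACER-style trust region projection of the policy gradient.

    g: gradient of the objective w.r.t. the policy output (e.g. action probabilities)
    k: gradient of KL(average_policy || current_policy) w.r.t. the same output
    delta: trust region constraint; eps avoids division by zero
    """
    k_dot_g = (k * g).sum(dim=-1, keepdim=True)
    k_dot_k = (k * k).sum(dim=-1, keepdim=True)
    scale = ((k_dot_g - delta) / (k_dot_k + eps)).clamp(min=0)
    return g - scale * k
```

The `clamp(min=0)` implements the max(0, .), so the gradient is left untouched whenever the linearised KL constraint is already satisfied.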
Do you know how [...]
Are you abandoning ACER TRPO in favor of OpenAI's newest version of PPO?
No - just looking at PPO in parallel, but I would like to get this fully implemented. I asked about [...]
Hey, [...]
@jingweiz I had a quick look through [...]
So the core of this is done in Chainer here, and I have a PyTorch feature request to replicate the functionality that makes it easy in Chainer, but unfortunately it doesn't look like we'll be able to do a direct port for a long time, if at all. It should be possible to build this particular algorithm in PyTorch, but yeah, it's going to be tricky. One route for debugging might be to check the [...]
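In case it helps with the port: PyTorch can already resume backpropagation from an intermediate tensor with a hand-modified gradient via `tensor.backward(gradient=...)` (or `torch.autograd.backward`), which is the piece the Chainer version leans on. A rough, self-contained sketch with a toy network and a dummy objective (none of these names are from the repo):

```python
import torch

policy = torch.nn.Linear(4, 2)       # toy policy head
obs = torch.randn(1, 4)

logits = policy(obs)                 # intermediate node whose gradient we want to override
probs = torch.softmax(logits, dim=-1)

# Stand-in for the actor-critic objective (the real one comes from the ACER loss):
dummy_objective = (probs * torch.tensor([[2.0, -1.0]])).sum()

# Gradient of the objective w.r.t. the intermediate node, keeping the graph alive:
g, = torch.autograd.grad(dummy_objective, logits, retain_graph=True)

# In ACER this gradient would now be replaced by its trust-region projection;
# it is left unchanged here just to show the mechanism.
adjusted = g

# Resume backpropagation from the intermediate node with the custom gradient --
# the PyTorch counterpart of setting .grad on a Chainer Variable and calling backward().
logits.backward(gradient=adjusted)
print(policy.weight.grad)
```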
Yeah, thanks a lot! I'll check the code in more detail :) Hope to get it to work :)
In the current form (without a minus sign in front of _trust_region_loss), the reward obtained just sits at ~9 on CartPole; it might take off after more steps, but I haven't tried it.
With a minus sign in front, the reward obtained starts changing immediately.