Various minor PPO refactors #167

Closed
Tracked by #206
vwxyzjn opened this issue Apr 21, 2022 · 1 comment

vwxyzjn commented Apr 21, 2022

Problem Description

Many of these formatting changes were suggested by @Howuhh.

1. Refactor next_done

The current code to handle done looks like this

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but it became an issue when I tried to adapt the code for isaacgym. Specifically, I thought the to(device) calls were no longer needed, so I just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong because I should have also added next_done = done. The current next_done = torch.Tensor(done).to(device) just does not make a lot of sense.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)
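
With this rename in place, an IsaacGym-style adaptation (where envs.step already returns tensors on the right device) only needs to drop the conversions. A rough sketch, assuming envs.step returns torch tensors:

            next_obs, reward, next_done, info = envs.step(action)  # already device tensors
            rewards[step] = reward.view(-1)                         # no torch.tensor(...).to(device) needed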

2. make_env refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

to

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

to

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)
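
The two forms produce the same tensor whenever the first two dimensions are (num_steps, num_envs); a quick sanity check, with made-up shapes and obs.shape[2:] standing in for envs.single_observation_space.shape:

import torch

obs = torch.randn(128, 4, 8)                        # (num_steps, num_envs, obs_dim)
flat_reshape = obs.reshape((-1,) + obs.shape[2:])   # current form
flat_flatten = obs.flatten(0, 1)                    # proposed form
assert torch.equal(flat_reshape, flat_flatten)      # both are (512, 8)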

4. Simplify the target_kl check


            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

to

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5. Simplify the global_step increment

global_step += 1 * args.num_envs

to

global_step += args.num_envs

6. Move num_updates to argparse

Move

num_updates = args.total_timesteps // args.batch_size

to the argparse setup.
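
A sketch of what that could look like, assuming the script already derives batch_size and minibatch_size right after parsing (flag names and defaults here are illustrative):

import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--total-timesteps", type=int, default=500000)
    parser.add_argument("--num-envs", type=int, default=4)
    parser.add_argument("--num-steps", type=int, default=128)
    parser.add_argument("--num-minibatches", type=int, default=4)
    args = parser.parse_args()
    # derived values computed once, next to the other batch-size math
    args.batch_size = int(args.num_envs * args.num_steps)
    args.minibatch_size = int(args.batch_size // args.num_minibatches)
    args.num_updates = args.total_timesteps // args.batch_size
    return args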


vwxyzjn commented Nov 28, 2023

Closed by #424

vwxyzjn closed this as completed Nov 28, 2023