Choice of RL Algorithms #1498

Open
MaikRe opened this issue Nov 11, 2024 · 0 comments

MaikRe commented Nov 11, 2024

There are several options for RL algorithms that would work with this. I am going to break down the two main ones that I think apply and explain the differences between them:

Proximal Policy Optimization (PPO)

  • PPO is an on-policy algorithm, meaning it only learns from state transitions generated by the current policy itself
  • Experiences are gathered in a rollout buffer for a fixed number of time steps and learned from; then the buffer is discarded and new experiences are collected
  • It uses Generalized Advantage Estimation (GAE): once the rollout buffer is filled, the rewards are propagated backwards through time to calculate advantages and returns. If we are dealing with an RL environment that has sparse rewards (0 if the robot is on the floor or trying to get up, 1 if standing upright), the advantages and returns make it much easier for the policy to identify which transitions lead to that reward of 1 (a sketch of GAE and the clipped objective follows this list)
  • The algorithm itself is focused on a specific loss function composed of three parts
    1. The policy loss, based on a clipped surrogate objective. In simple terms: it pushes the policy in the direction of the advantages calculated with GAE, but clips how far a single update may move the new policy away from the policy that collected the data
    2. The entropy loss, which is based on how much the policy has converged. At the beginning the policy has not converged, so the entropy term reduces the total loss (assuming we are calculating with positive loss) so that the policy is not updated too heavily. It essentially encourages exploration of differing states early on.
    3. The value loss. Alongside the actor policy, PPO also trains a critic that assigns a value to the current state. The value target is the return calculated through GAE, so the critic learns to predict the correct value of a state (i.e. it can judge whether having a leg in a certain position is beneficial)
  • Since PPO throws away the rollout buffer every n steps, having the right hyperparameters is critical: updates that modify the policy too quickly cause wild fluctuations in the training data, while updates that modify it too slowly mean it converges to a local optimum, deciding that doing nothing is safer than doing something.
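
To make the GAE and clipped-objective points more concrete, here is a minimal NumPy sketch (not proposed training code); the gamma/lambda values, clip range and toy rollout numbers are illustrative assumptions, not tuned for our environment:

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Propagate rewards backwards through the rollout to get advantages and returns."""
    advantages = np.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = 0.0 if dones[t] else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]              # TD error
        last_adv = delta + gamma * lam * (0.0 if dones[t] else last_adv)
        advantages[t] = last_adv
    returns = advantages + values[:-1]   # targets for the critic (value loss)
    return advantages, returns

def ppo_policy_loss(new_logp, old_logp, advantages, clip=0.2):
    """Clipped surrogate objective: limits how far one update can move the policy."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Toy rollout: sparse reward of 1 only at the final step ("standing upright").
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values  = np.array([0.1, 0.2, 0.3, 0.5, 0.0])   # critic estimates plus one bootstrap value
dones   = np.array([False, False, False, True])
adv, ret = gae(rewards, values, dones)
print(adv, ret)                                  # the final reward is propagated back in time

old_logp = np.array([-1.2, -1.0, -0.9, -0.8])
new_logp = old_logp + 0.1                        # pretend the policy moved slightly
print(ppo_policy_loss(new_logp, old_logp, adv))
```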

Soft Actor-Critic (SAC)

  • SAC is an off-policy algorithm, meaning it learns from data that older versions of the policy may have generated, and it can also learn from external sources such as transitions created by an expert policy that are then distilled into the final policy
  • SAC continuously gathers experiences in an experience (replay) buffer, which can be as large as RAM will allow.
  • During training SAC samples from this experience buffer, preferring new, high-value transitions over old ones, but it never throws away experiences until the buffer is full (FIFO). This means SAC can continue to be positively or negatively reinforced even by entirely exploratory transitions that happened to be good by coincidence. Generally speaking this makes SAC very stable
  • SAC also has an actor network, but two critic networks that, instead of PPO's value function, learn a Q-function: essentially a value function that takes the action as input in addition to the state
  • SAC trains with entropy regularization, i.e. it tries to maximize both the reward and the randomness (entropy) of its actions in order to ensure that the policy keeps exploring
  • SAC can be used with a Hindsight Experience Replay (HER) buffer. It requires some compatibility on the environment side, but it essentially lets SAC function with extremely sparse rewards by occasionally relabeling transitions based on the progress made towards a goal. Think of soccer, where scoring a goal is +1 but being scored on is -1 for the goal difference, so during play all rewards are 0. Even if an episode does not end in a goal, HER can take the last observation (player position), the achieved goal (location of the ball) and the desired goal (location of the goal), and pretend that getting the ball to, say, the corner of the goal box was what we wanted all along, assigning rewards accordingly (a relabeling sketch follows this list)
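
To illustrate the relabeling idea, here is a minimal sketch of HER's "final" goal-selection strategy; the transition format, field names and sparse reward function are assumptions made up for the soccer example, not a real environment interface:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    """+1 only if the achieved goal (ball position) is close enough to the desired goal."""
    return 1.0 if np.linalg.norm(achieved_goal - desired_goal) < tolerance else 0.0

def her_relabel(episode):
    """'final' strategy: pretend the last achieved goal was the desired goal all along."""
    new_goal = episode[-1]["achieved_goal"]        # e.g. where the ball actually ended up
    relabeled = []
    for t in episode:
        relabeled.append({
            **t,
            "desired_goal": new_goal,
            "reward": sparse_reward(t["achieved_goal"], new_goal),
        })
    return relabeled

# Episode that never scores: every original reward is 0.
episode = [
    {"obs": np.zeros(4), "achieved_goal": np.array([1.0, 0.5]), "desired_goal": np.array([4.5, 0.0]), "reward": 0.0},
    {"obs": np.zeros(4), "achieved_goal": np.array([2.0, 0.8]), "desired_goal": np.array([4.5, 0.0]), "reward": 0.0},
    {"obs": np.zeros(4), "achieved_goal": np.array([3.0, 1.0]), "desired_goal": np.array([4.5, 0.0]), "reward": 0.0},
]
print([t["reward"] for t in her_relabel(episode)])   # last transition now gets reward 1.0
```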

Application to RL in Walking

  • If the entire goal is to train specific behaviors for walking purposes (such as a standing-up motion) based entirely on the simulated NAO, so that there is a clear goal with dense rewards, then PPO is likely the better candidate to use
  • If the goal is to combine every possible set of actions (walking, standing up, dribbling, passing, etc.) into one policy, SAC is likely the better choice. It deals much better with changing goals, and the fact that it can be trained off-policy, i.e. from previously collected data, means our current walking engine could provide starting data to train from (a rough sketch of both setups follows below)
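
For a rough idea of what either choice could look like in practice, here is a sketch using stable-baselines3 (assuming we were to use it) with Gymnasium-style environments; the environment IDs and all hyperparameters are hypothetical placeholders, not a worked-out proposal:

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC, HerReplayBuffer

# Option 1: dense-reward stand-up task -> PPO with a rollout buffer and GAE.
standup_env = gym.make("NaoStandUp-v0")            # hypothetical environment ID
ppo = PPO("MlpPolicy", standup_env,
          n_steps=2048, gae_lambda=0.95, gamma=0.99, clip_range=0.2)
ppo.learn(total_timesteps=1_000_000)

# Option 2: sparse, goal-conditioned soccer-style task -> SAC with a HER replay buffer.
soccer_env = gym.make("NaoSoccerGoal-v0")          # hypothetical goal-based env (dict observations)
sac = SAC("MultiInputPolicy", soccer_env,
          buffer_size=1_000_000,
          replay_buffer_class=HerReplayBuffer,
          replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"))
sac.learn(total_timesteps=1_000_000)
```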