This README contains all the details required to run the code and reproduce the results of the paper *Goal-conditioned Batch Reinforcement Learning for Rotation Invariant Locomotion*. The following packages are required:
- Python 3.5+
- PyTorch (>= 0.4.0, with CUDA version >= 9.0)
- Tensorflow (>= 1.14)
- NumPy
- OpenAI Gym
- MuJoCo
- PyBullet
The standard RL and goal-conditioned RL baselines use existing code from OpenAI baselines and stable-baselines. The standard RL baseline for the Ant is trained using the Proximal Policy Optimization (PPO) code from baselines. For the Humanoid and Minitaur, the standard RL baselines are trained using the Soft Actor-Critic (SAC) code from stable-baselines with default hyperparameters.
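As a rough illustration (not the exact training setup used for the paper), a standard RL baseline can be trained with SAC from stable-baselines along the following lines; the environment id, timestep budget, and save path below are placeholders:

```python
# Minimal sketch of training a standard RL baseline with SAC from
# stable-baselines (default hyperparameters). The environment id and the
# number of timesteps are illustrative placeholders.
import gym
import pybullet_envs  # registers the PyBullet locomotion environments

from stable_baselines import SAC

env = gym.make('HumanoidBulletEnv-v0')    # or 'MinitaurBulletEnv-v0'
model = SAC('MlpPolicy', env, verbose=1)  # default hyperparameters
model.learn(total_timesteps=2000000)
model.save('sac_humanoid_baseline')
```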
The environments used for the locomotion tasks are:
- Ant: OpenAI Gym MuJoCo
- Humanoid: PyBullet
- Minitaur: PyBullet
The scripts for the goal-conditioned batch RL methods discussed in the paper are in the directory scripts/. The modified environments that convert the locomotion task to goal-directed locomotion are present in modified_envs/.
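For intuition only, here is a hypothetical sketch of what such a goal-directed conversion might look like; the class name, goal-sampling scheme, and the distance_to_goal info key are assumptions, and the actual implementations in modified_envs/ may differ:

```python
# Hypothetical sketch only; the real goal-directed environments are in
# modified_envs/. Samples a 2D goal at reset and reports the distance to it.
import gym
import numpy as np

class GoalDirectedWrapper(gym.Wrapper):
    def __init__(self, env, min_distance=5.0, max_distance=10.0):
        super().__init__(env)
        self.min_distance = min_distance
        self.max_distance = max_distance
        self.goal = np.zeros(2)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Sample a goal at a random distance and angle from the start position.
        r = np.random.uniform(self.min_distance, self.max_distance)
        theta = np.random.uniform(-np.pi, np.pi)
        self.goal = np.array([r * np.cos(theta), r * np.sin(theta)])
        return np.concatenate([obs, self.goal])  # observation_space not updated here

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        xy = self.env.unwrapped.sim.data.qpos[:2]  # MuJoCo Ant; PyBullet differs
        info['distance_to_goal'] = float(np.linalg.norm(self.goal - xy))
        return np.concatenate([obs, self.goal]), reward, done, info
```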
Pretrained models of all methods (including standard RL and goal-conditioned
RL baselines) can be downloaded here.
The data used to train all baselines can be downloaded here.
To train or test any of the goal-conditioned batch RL methods (using either the equivalence approach or the naive goal-conditioned policy), use the following command with the appropriate flags:
python main.py
-e env-name
--resume path-to-checkpoint-file (only if you want to evaluate or resume training)
-s random-seed
--no-gpu (to train using CPU)
# Training
-i path-to-data-file
--dir-name save-path (path to save checkpoints, optional)
--log-perf-file log (path to log file to record losses/metrics)
--n-epochs ne (number of training epochs)
--learning-rate lr
--batch-size b
--exp-name ex (to organize models and results)
--start-index st (optionally start training from index != 0)
--n-training-samples n-tr
-k embedding-dim (dimension of the embeddings produced by the encoder; used only in this approach)
# Testing
--test-only (evaluate an existing model)
--visualize (render episodes)
--n-test-steps n (number of test episodes during model evaluation)
--min-distance min (minimum distance from the agent at which the goal should lie)
--max-distance max (maximum distance from the agent at which the goal should lie)
--y-range y (deviation of the goal position from the agent's initial direction)
--threshold th (distance between the agent and the goal at which the agent is considered to have successfully reached the goal)
For instance, to train an Ant for 2M timesteps using the data file ant-on-policy-samples with learning rate 0.001 and batch size 512, run the following:
cd scripts/naive-gcp/
python main.py -e Ant-v2 -i ant-on-policy-samples --learning-rate 0.001 --batch-size 512
If not specified, the other hyperparameters will assume default values.
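Similarly, an existing checkpoint can be evaluated using the testing flags described above; the checkpoint path and episode count below are placeholders:
cd scripts/naive-gcp/
python main.py -e Ant-v2 --test-only --resume path-to-checkpoint-file --n-test-steps 100 --visualize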
The performance.py script in the scripts/ directory measures the performance of the goal-conditioned batch RL methods. It computes the performance metric, namely the closest distance to the goal that the agent is able to achieve, over 1000 episodes spread uniformly across 10 random seeds. To ensure a fair comparison, the same random seeds are used across all methods. The details of goal generation are provided in the Appendix.
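As an illustration of what this metric captures (not the actual contents of performance.py), the per-episode quantity could be computed roughly as follows; make_goal_env, policy, and the distance_to_goal info key are hypothetical placeholders:

```python
# Illustrative sketch of the reported metric, not the repo's performance.py:
# record the closest distance to the goal reached within each episode, then
# average over 100 episodes for each of 10 seeds (1000 episodes in total).
import numpy as np

def closest_distance(env, policy, max_steps=1000):
    obs = env.reset()
    best = np.inf
    for _ in range(max_steps):
        obs, _, done, info = env.step(policy(obs))
        best = min(best, info['distance_to_goal'])  # assumed info key
        if done:
            break
    return best

# distances = [closest_distance(make_goal_env(seed), policy)
#              for seed in range(10) for _ in range(100)]
# print(np.mean(distances))
```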
Both quantitative and qualitative results are reported in the paper. Furthermore, video comparisons between this approach and the best standard RL baseline can be found here. Note that the Humanoid trained using this approach walks towards the goals much faster, and in a much better manner, than the one trained using standard RL techniques, even though both agents ultimately reach the goals in the example provided.
The following GIFs show qualitative examples of this approach in the Ant and Humanoid environments:
If you use this work, please consider citing the following paper:
@misc{mavalankar2020goalconditioned,
      title={Goal-conditioned Batch Reinforcement Learning for Rotation Invariant Locomotion},
      author={Aditi Mavalankar},
      year={2020},
      eprint={2004.08356},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}