In this tutorial we are interested in reproducible reinforcement learning research. The experiments in this repository aim to reproduce some deep reinforcement learning results from the paper Learning Value Functions in Deep Policy Gradients using Residual Variance. To do so we use a specific experimental protocol and open-source libraries that we introduce next.
Reinforcement learning (RL) suffers from a lack of reproducibility. Hidden implementation details, small numbers of seeds, and differences between machines hinder progress in the field. In this paper, the authors recommend an experimental protocol that ensures a thorough comparison of agents, say PPO vs. SAC. By following such a standardized protocol, RL research can move forward faster and more safely: if you claim your agent is the new state of the art on some benchmark, following such a protocol can serve as a validity stamp in your paper!
We recommend using the well-maintained, well-documented, and stable agent implementations from stable-baselines3. We implement AVEC-PPO from Learning Value Functions in Deep Policy Gradients using Residual Variance by overriding the train() method of the base PPO agent.
One needs to find the best instantiation of each agent to compare; this is done with hyperparameter optimization. For each set of hyperparameters, each agent is trained 3 times, and for each agent the set of hyperparameters giving the best score averaged over runs is kept for the actual comparison. The hyperparameter optimization is done with simple Python nested loops (add link), as sketched below.
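As a toy illustration of such a sweep, the sketch below trains a stable-baselines3 PPO agent on a small grid; the environment, hyperparameter grid, training budget, and the train_and_evaluate helper are hypothetical placeholders, not the repository's actual setup.

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def train_and_evaluate(learning_rate, clip_range, seed, budget=10_000):
    # Hypothetical helper: train a PPO agent briefly and return its mean return.
    model = PPO("MlpPolicy", "Pendulum-v1", learning_rate=learning_rate,
                clip_range=clip_range, seed=seed, verbose=0)
    model.learn(total_timesteps=budget)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward


learning_rates = [1e-4, 3e-4, 1e-3]  # illustrative grid, not the repo's
clip_ranges = [0.1, 0.2]
n_runs = 3                           # each configuration is trained 3 times

best_mean, best_config = -np.inf, None
for lr in learning_rates:
    for clip in clip_ranges:
        scores = [train_and_evaluate(lr, clip, seed=run) for run in range(n_runs)]
        mean_score = float(np.mean(scores))
        if mean_score > best_mean:
            best_mean, best_config = mean_score, {"learning_rate": lr, "clip_range": clip}

print("Best configuration:", best_config, "with mean score", best_mean)
```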
We give an overview of the protocol behind the training code. Briefly, rlberry provides tools to evaluate agents. In particular, it handles running multiple seeds of agent instantiations and saves the training data. One can do so with the ExperimentManager class by feeding it a seed, an agent class, the agent hyperparameters found during the hyperparameter optimization phase, the number of training steps, and the number of runs (this paper recommends 15 runs, but we will see later how to choose the number of runs adaptively to minimize it).
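A minimal usage sketch follows, assuming rlberry's StableBaselinesAgent wrapper around stable-baselines3; the environment, budget, and hyperparameters are illustrative, and import paths may vary between rlberry versions.

```python
# Minimal sketch of setting up a training run with rlberry's ExperimentManager.
from rlberry.envs import gym_make
from rlberry.manager import ExperimentManager
from rlberry.agents.stable_baselines import StableBaselinesAgent
from stable_baselines3 import PPO

xp_default_ppo = ExperimentManager(
    agent_class=StableBaselinesAgent,                # rlberry wrapper around SB3 agents
    train_env=(gym_make, dict(id="Pendulum-v1")),    # environment constructor + kwargs
    fit_budget=100_000,                              # number of training steps
    init_kwargs=dict(algo_cls=PPO, policy="MlpPolicy",
                     learning_rate=3e-4),            # tuned hyperparameters
    n_fit=15,                                        # number of independent runs
    seed=42,
    agent_name="default_ppo",
)
xp_default_ppo.fit()
```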
Seeding is a key component of reproducible research, beyond RL. By fixing seeds, one can ensure that the stochastic processes in the empirical protocol give the same results, e.g. the same neural-network weights, the same environment starting states, and so on. However, each machine learning library has its own way of seeding. For example, RL environments from gymnasium use the seeding from numpy, while the actor and critic neural networks from stable-baselines3 use the seeding from torch. The seeding of both environments and agents is handled automatically by the ExperimentManager from rlberry!
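Without rlberry, one would have to seed each library separately. A rough illustration (the environment name and seed value are just examples):

```python
import gymnasium as gym
import numpy as np
import torch

SEED = 42
np.random.seed(SEED)               # numpy RNG
torch.manual_seed(SEED)            # torch RNG (e.g. actor/critic weight initialization)

env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=SEED)   # seeds the environment's own RNG
env.action_space.seed(SEED)        # seeds action-space sampling
```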
Tested on Python 3.10
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cd empirical_rl
python3 training.py
python3 plotting.py
python3 evaluating.py
python3 statistical_comparing.py
[INFO] 13:10: Test finished
[INFO] 13:10: Results are
| Agent1 vs Agent2 | mean Agent1 | mean Agent2 | mean diff | decisions |
|---|---|---|---|---|
| default_ppo vs avec_ppo | -86.636 | -118.6952 | 32.0592 | equal |
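The table above is the kind of DataFrame produced by rlberry's statistical comparison utilities. A rough sketch of such a call is below; the manager variable names are placeholders, and the exact compare_agents signature may vary between rlberry versions.

```python
from rlberry.manager import compare_agents

# xp_default_ppo and xp_avec_ppo are placeholders for the two trained
# ExperimentManager instances from the training phase.
results = compare_agents([xp_default_ppo, xp_avec_ppo], method="tukey_hsd")
print(results)  # columns include "mean Agent1", "mean Agent2", "mean diff", "decisions"
```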
- Ant-v4
- Loop over hyperparameters and expand boundaries (hyperparameter optimization as per Patterson 2023)
- Docstrings?
- Fix the data-loading bug in the plotting script.