This is a small and simple collection of some reinforcement learning algorithms. The core idea of this repo is to have minimal structure, such that each algorithm is easy to understand and to modify. For this reason, each algorithm has a separate folder, independent from the others. Only approximators (neural network, linear functions, ...), policy classes, and auxiliary functions (for plotting or collecting data with gym-like environments) are shared.
Note that an algorithm can have different versions. For example, SPG can learn the critic by using Monte-Carlo estimates or by temporal difference.
The repository has a modular structure and no installation is needed. To run an algorithm, execute the following from the root folder:

```
python3 -m <ALG>.<RUN_SCRIPT> <ENV_NAME> <SEED>
```

(The seed is optional; the default is 1.) At each iteration, the most important statistics (average return, value function loss, entropy, ...) are saved in `data-trial/<ALG_NAME>/<ENV_NAME>/<DATE_TIME>.dat`.

For example, running

```
python3 -m ddpg.ddpg Pendulum-v0 0
```

will generate `data-trial/ddpg/Pendulum-v0/180921_155842.dat`.
You can also save/load the learned model and visualize the graph. For more info, check `demo.py`. The demo also shows how to use the LQR environment and how to plot value functions.
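For reference, a minimal TensorFlow 1.x save/load pattern looks like the sketch below; `demo.py` may use a different interface, and the checkpoint and log paths here are only illustrative.

```python
import tensorflow as tf

# Build the graph first (placeholders, networks, losses, ...), then:
saver = tf.train.Saver()  # by default saves/restores all global variables

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # ... train ...
    saver.save(session, './checkpoints/model.ckpt')    # save the learned model
    tf.summary.FileWriter('./logs', session.graph)     # dump the graph for TensorBoard

with tf.Session() as session:
    saver.restore(session, './checkpoints/model.ckpt')  # load the model back
    # ... evaluate or continue training ...
```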
Finally, use any of the run scripts in the root folder to run several trials of the same algorithm in parallel (see the scripts for instructions). With the data generated from the runs, you can plot the average results with a 95% confidence interval using `plot_shaded.py`, or you can plot all learning curves together with `plot_all.py` (see the scripts for instructions).
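To give an idea of what `plot_shaded.py` produces, here is a minimal NumPy/Matplotlib sketch that averages several learning curves and shades a 95% confidence interval. It assumes each `.dat` file contains one average-return value per iteration, one run per file; the repo's actual file layout may differ, so check the plotting scripts for the real format.

```python
import glob
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical layout: one run per .dat file, one average-return value per line.
files = glob.glob('data-trial/ddpg/Pendulum-v0/*.dat')
runs = np.stack([np.loadtxt(f) for f in files])           # shape: (n_runs, n_iterations)

mean = runs.mean(axis=0)
sem = stats.sem(runs, axis=0)                             # standard error of the mean
half_width = sem * stats.t.ppf(0.975, df=len(files) - 1)  # 95% confidence half-width

iters = np.arange(mean.size)
plt.plot(iters, mean)
plt.fill_between(iters, mean - half_width, mean + half_width, alpha=0.3)
plt.xlabel('Iteration')
plt.ylabel('Average return')
plt.show()
```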
Note that all scripts use flexible GPU memory allocation, i.e.,

```python
config_tf = tf.ConfigProto()
config_tf.gpu_options.allow_growth = True
session = tf.Session(config=config_tf)
```
Required packages:

- python 3.5
- tensorflow 1.12.0
- tensorflow probability 0.5
- gym 0.12.5
- numpy 1.16
- scipy 1.2
- matplotlib
- seaborn
Later versions of tensorflow may raise warnings.
You can also use other physics simulators, such as Roboschool, PyBullet and MuJoCo.
Shared files:

- `approximators.py`: neural network, random Fourier features, polynomial features
- `average_env.py`: introduces state resets to consider average return MDPs
- `cross_validation.py`: function to minimize a loss function with cross-validation
- `data_collection.py`: functions for sampling MDP transitions and getting mini-batches
- `filter_env.py`: modifies a gym environment to have states and actions normalized in [-1,1]
- `logger.py`: creates folders for saving data
- `noise.py`: noise functions
- `plotting.py`: to plot value functions
- `policy.py`: implementation of common policies
- `rl_utils.py`: RL functions, such as generalized advantage estimation and retrace (see the sketch below)
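As background on the first of those, generalized advantage estimation accumulates discounted TD errors backwards along a trajectory. The NumPy sketch below is a generic version; its signature is not necessarily the one used in `rl_utils.py`.

```python
import numpy as np

def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generic generalized advantage estimation over one trajectory.

    rewards, dones: arrays of shape (T,); values: V(s_t) of shape (T,);
    last_value: V(s_T) used to bootstrap the final step.
    """
    values = np.append(values, last_value)
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae_t = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        gae_t = delta + gamma * lam * not_done * gae_t
        advantages[t] = gae_t
    return advantages
```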
Each algorithm folder contains:

- `solver.py`: (optional) defines optimization routines required by the algorithm
- `hyperparameters.py`: defines the hyperparameters (e.g., number of transitions per iteration, network sizes, and learning rates)
- `<NAME>.py`: script to run the algorithm (e.g., `ppo.py` or `ddpg.py`)
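As an illustration only, a `hyperparameters.py` for a DDPG-style agent might look like the following; all names and values here are hypothetical and do not reflect the repo's actual files.

```python
# Hypothetical hyperparameters for a DDPG-style agent (illustrative only).
actor_lr = 1e-4            # learning rate of the actor network
critic_lr = 1e-3           # learning rate of the critic network
gamma = 0.99               # discount factor
tau = 0.005                # soft update coefficient for the target networks
batch_size = 64            # mini-batch size sampled from the replay buffer
buffer_size = 1000000      # maximum number of stored transitions
steps_per_iter = 1000      # environment transitions collected per iteration
hidden_sizes = (400, 300)  # layer sizes of actor and critic networks
```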
- Stochastic policy gradient (SPG). The folder includes REINFORCE and two actor-critic versions.
- Deep deterministic policy gradient (DDPG).
- Twin delayed DDPG (TD3).
- Trust region policy optimization (TRPO).
- Proximal policy optimization (PPO).
- Asynchronous advantage actor-critic (A3C).
- Soft actor-critic (SAC), first and second versions.
- Relative entropy policy search (REPS).
- Actor-critic REPS (AC-REPS).
- TD-regularized actor-critic methods (TD-REG and GAE-REG) are implemented for PPO, TRPO, and DDPG.
- Curiosity-driven exploration by self-supervised prediction (ICM) is implemented for PPO.
- Prioritized experience replay (PER) is implemented for DDPG (a generic PER sketch follows this list).
- Projections for approximate policy iteration algorithms (HPROJ) are implemented for PPO.
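For readers unfamiliar with PER, below is a minimal proportional prioritized replay buffer in NumPy (no sum-tree, O(n) sampling). It is a generic illustration, not the buffer used in this repo's DDPG.

```python
import numpy as np

class ProportionalReplay:
    """Minimal proportional prioritized experience replay (illustrative only)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha             # how strongly priorities skew sampling
        self.storage = []              # list of transitions
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def add(self, transition):
        max_p = self.priorities.max() if self.storage else 1.0
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_p  # new samples get the current max priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.storage)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=p)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.storage) * p[idx]) ** (-beta)
        weights /= weights.max()
        return [self.storage[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        self.priorities[idx] = np.abs(td_errors) + eps
```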
All implementations are very basic: there is no reward/gradient clipping, hyperparameter tuning, decaying KL/entropy coefficient, batch normalization, standardization with a running mean and std, ...