Reinforcement learning streamlined.
Easier and faster reinforcement learning with RLOps. Visit our website. View documentation.
Join the Discord Server for questions, help and collaboration.
AgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.
This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.
AgileRL offers 10x faster hyperparameter optimization than SOTA.
To see the full AgileRL documentation, including tutorials, visit our documentation site. To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the AgileRL Discord Server.
Install as a package with pip:
pip install agilerl
Or install in development mode:
git clone && cd AgileRL
pip install -e .
cd demos
Reinforcement learning algorithms and libraries are usually benchmarked once the optimal hyperparameters for training are known, but it often takes hundreds or thousands of experiments to discover these. This is unrealistic and does not reflect the true, total time taken for training. What if we could remove the need to conduct all these prior experiments?
In the charts below, a single AgileRL run, which tunes hyperparameters automatically, is benchmarked against the multiple training runs that hyperparameter optimization with Optuna traditionally requires, demonstrating the real time savings possible. Global steps are the sum of every step taken by any agent in the environment, counted across the entire population.
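As a rough illustration of the global-steps definition, the sketch below sums environment steps over every agent in a population. The function name and the step counts are made up for this example; they are not AgileRL output.

```python
def global_steps(steps_per_agent):
    """Total environment steps taken across a whole population of agents."""
    return sum(steps_per_agent)


# Hypothetical population of 6 agents that each took 10,000 environment steps.
population_steps = [10_000] * 6
print(global_steps(population_steps))  # 60000
```

Under this definition, a population of six agents consumes global steps six times faster than a single agent stepping the same environment.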
AgileRL offers an order of magnitude speed up in hyperparameter optimization vs popular reinforcement learning training frameworks combined with Optuna. Remove the need for multiple training runs and save yourself hours.
AgileRL also supports multi-agent reinforcement learning using the PettingZoo-style parallel API. The charts below highlight the performance of our MADDPG and MATD3 algorithms with evolutionary hyperparameter optimization (HPO), benchmarked against epymarl's MADDPG algorithm with grid-search HPO, for the simple speaker listener and simple spread environments.
We are in the process of creating tutorials on how to use AgileRL and train agents on a variety of tasks.
Currently, we have tutorials for single-agent tasks that guide you through training both on-policy and off-policy agents to beat a variety of Gymnasium environments. We also have multi-agent tutorials that use PettingZoo environments, such as training DQN to play Connect Four with curriculum learning and self-play, and tutorials for multi-agent tasks in MPE environments. The tutorial on hierarchical curriculum learning shows how to teach agents skills and combine them to achieve an end goal. There are also tutorials for contextual multi-armed bandits, which learn to make the correct decision in environments that have only one timestep.
The demo files in demos also provide examples of how to train agents using AgileRL, and more information can be found in our documentation.
RL | Algorithm |
Multi-agent | Multi-Agent Deep Deterministic Policy Gradient (MADDPG); Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient (MATD3) |
Bandits | Neural Contextual Bandits with UCB-based Exploration (NeuralUCB); Neural Contextual Bandits with Thompson Sampling (NeuralTS) |
Before starting training, some meta-hyperparameters and settings must be set. These are defined in INIT_HP, for general parameters; MUTATION_PARAMS, which defines the evolutionary probabilities; and NET_CONFIG, which defines the network architecture. For example:
INIT_HP = {
    'ENV_NAME': 'LunarLander-v2',   # Gym environment name
    'ALGO': 'DQN',                  # Algorithm
    'DOUBLE': True,                 # Use double Q-learning
    'CHANNELS_LAST': False,         # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    'BATCH_SIZE': 256,              # Batch size
    'LR': 1e-3,                     # Learning rate
    'MAX_STEPS': 1_000_000,         # Max no. steps
    'TARGET_SCORE': 200.,           # Early training stop at avg score of last 100 episodes
    'GAMMA': 0.99,                  # Discount factor
    'MEMORY_SIZE': 10000,           # Max memory buffer size
    'LEARN_STEP': 1,                # Learning frequency
    'TAU': 1e-3,                    # For soft update of target parameters
    'TOURN_SIZE': 2,                # Tournament size
    'ELITISM': True,                # Elitism in tournament selection
    'POP_SIZE': 6,                  # Population size
    'EVO_STEPS': 10_000,            # Evolution frequency
    'EVAL_STEPS': None,             # Evaluation steps
    'EVAL_LOOP': 1,                 # Evaluation episodes
    'LEARNING_DELAY': 1000,         # Steps before starting learning
    'WANDB': True,                  # Log with Weights and Biases
}
MUTATION_PARAMS = {
    # Relative probabilities
    'NO_MUT': 0.4,                              # No mutation
    'ARCH_MUT': 0.2,                            # Architecture mutation
    'NEW_LAYER': 0.2,                           # New layer mutation
    'PARAMS_MUT': 0.2,                          # Network parameters mutation
    'ACT_MUT': 0,                               # Activation layer mutation
    'RL_HP_MUT': 0.2,                           # Learning HP mutation
    'RL_HP_SELECTION': ['lr', 'batch_size'],    # Learning HPs to choose from
    'MUT_SD': 0.1,                              # Mutation strength
    'RAND_SEED': 1,                             # Random seed
}
NET_CONFIG = {
    'arch': 'mlp',              # Network architecture
    'hidden_size': [32, 32],    # Actor hidden size
}
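The mutation values are described as relative probabilities, so one plausible reading is that they act as weights that are normalized before a mutation type is drawn. The sketch below shows that idea with the standard library only. It assumes that the five mutation types (NO_MUT, ARCH_MUT, PARAMS_MUT, ACT_MUT, RL_HP_MUT) compete with each other and that NEW_LAYER is applied separately within an architecture mutation; both points are assumptions, and `normalize` and `sample_mutation` are hypothetical helpers, not AgileRL functions.

```python
import random

# Relative weights for the mutation types, copied from the example MUTATION_PARAMS.
# NEW_LAYER is left out on the assumption that it is applied separately.
relative_weights = {
    "NO_MUT": 0.4,
    "ARCH_MUT": 0.2,
    "PARAMS_MUT": 0.2,
    "ACT_MUT": 0,
    "RL_HP_MUT": 0.2,
}


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def sample_mutation(weights, rng):
    """Draw one mutation type in proportion to its relative weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


probabilities = normalize(relative_weights)
rng = random.Random(1)  # seeded for repeatability, in the spirit of RAND_SEED
print(probabilities)
print(sample_mutation(relative_weights, rng))
```

With these weights, a type whose weight is 0 (here ACT_MUT) is never drawn, and raising NO_MUT relative to the others makes agents more likely to pass through an evolution step unchanged.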
First, use utils.utils.create_population to create a list of agents: our population that will evolve and mutate to the optimal hyperparameters.
from agilerl.utils.utils import make_vect_envs, create_population
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_envs = 16
env = make_vect_envs(env_name=INIT_HP['ENV_NAME'], num_envs=num_envs)

try:
    state_dim = env.single_observation_space.n          # Discrete observation space
    one_hot = True                                      # Requires one-hot encoding
except Exception:
    state_dim = env.single_observation_space.shape      # Continuous observation space
    one_hot = False                                     # Does not require one-hot encoding

try:
    action_dim = env.single_action_space.n              # Discrete action space
except Exception:
    action_dim = env.single_action_space.shape[0]       # Continuous action space

# Only reorder image observations from [H, W, C] to [C, H, W] when requested
if INIT_HP['CHANNELS_LAST']:
    state_dim = (state_dim[2], state_dim[0], state_dim[1])

agent_pop = create_population(
    algo=INIT_HP['ALGO'],                   # Algorithm
    state_dim=state_dim,                    # State dimension
    action_dim=action_dim,                  # Action dimension
    one_hot=one_hot,                        # One-hot encoding
    net_config=NET_CONFIG,                  # Network configuration
    INIT_HP=INIT_HP,                        # Initial hyperparameters
    population_size=INIT_HP['POP_SIZE'],    # Population size
    num_envs=num_envs,                      # Number of vectorized environments
)
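The CHANNELS_LAST setting and the state_dim reordering describe moving the channel axis of an image shape from last to first. A minimal, library-free sketch of that index shuffle follows; `to_channels_first` is a hypothetical helper written for illustration and is not part of AgileRL.

```python
def to_channels_first(shape):
    """Reorder an image shape from (H, W, C) to (C, H, W)."""
    if len(shape) != 3:
        raise ValueError(f"expected a 3-D image shape, got {shape!r}")
    height, width, channels = shape
    return (channels, height, width)


# A hypothetical 84x84 RGB observation.
print(to_channels_first((84, 84, 3)))  # (3, 84, 84)
```

The length check matters because vector observations, such as an 8-dimensional state, have a 1-D shape and cannot be reordered this way.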
Next, create the tournament, mutations and experience replay buffer objects that allow agents to share memory and efficiently perform evolutionary HPO.
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.tournament import TournamentSelection
from agilerl.hpo.mutation import Mutations
field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(
    memory_size=INIT_HP['MEMORY_SIZE'],     # Max replay buffer size
    field_names=field_names,                # Field names to store in memory
)

tournament = TournamentSelection(
    tournament_size=INIT_HP['TOURN_SIZE'],  # Tournament selection size
    elitism=INIT_HP['ELITISM'],             # Elitism in tournament selection
    population_size=INIT_HP['POP_SIZE'],    # Population size
    eval_loop=INIT_HP['EVAL_LOOP'],         # Evaluate using last N fitness scores
)

mutations = Mutations(
    algo=INIT_HP['ALGO'],                                   # Algorithm
    no_mutation=MUTATION_PARAMS['NO_MUT'],                  # No mutation
    architecture=MUTATION_PARAMS['ARCH_MUT'],               # Architecture mutation
    new_layer_prob=MUTATION_PARAMS['NEW_LAYER'],            # New layer mutation
    parameters=MUTATION_PARAMS['PARAMS_MUT'],               # Network parameters mutation
    activation=MUTATION_PARAMS['ACT_MUT'],                  # Activation layer mutation
    rl_hp=MUTATION_PARAMS['RL_HP_MUT'],                     # Learning HP mutation
    rl_hp_selection=MUTATION_PARAMS['RL_HP_SELECTION'],     # Learning HPs to choose from
    mutation_sd=MUTATION_PARAMS['MUT_SD'],                  # Mutation strength
    arch=NET_CONFIG['arch'],                                # Network architecture
    rand_seed=MUTATION_PARAMS['RAND_SEED'],                 # Random seed
)
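To make the selection step more concrete, here is a small, self-contained sketch of tournament selection with elitism over a list of fitness scores. It illustrates the general technique only. It is not AgileRL's TournamentSelection implementation, and the function name, signature and fitness values are invented for this example.

```python
import random


def tournament_select(fitnesses, tournament_size, population_size, elitism, rng):
    """Return indices of the individuals chosen for the next generation.

    Assumes population_size >= 1 and a non-empty list of fitness scores.
    """
    if not fitnesses:
        raise ValueError("fitnesses must not be empty")
    selected = []
    if elitism:
        # Always carry the single best individual forward.
        selected.append(max(range(len(fitnesses)), key=fitnesses.__getitem__))
    while len(selected) < population_size:
        # Draw a random tournament (with replacement) and keep its fittest member.
        contenders = [rng.randrange(len(fitnesses)) for _ in range(tournament_size)]
        selected.append(max(contenders, key=fitnesses.__getitem__))
    return selected


rng = random.Random(0)
scores = [12.0, 48.5, 30.1, 7.3, 51.2, 19.9]  # made-up fitness values
next_gen = tournament_select(
    scores, tournament_size=2, population_size=6, elitism=True, rng=rng
)
print(next_gen)
```

A small tournament size, such as 2, keeps selection pressure low, so weaker individuals still have a chance of surviving and the population stays diverse for longer.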
The easiest training loop implementation is to use our train_off_policy() function. It requires the agent to have get_action() and learn() methods.
from import train_off_policy
trained_pop, pop_fitnesses = train_off_policy(
    env=env,                                    # Gym-style environment
    env_name=INIT_HP['ENV_NAME'],               # Environment name
    algo=INIT_HP['ALGO'],                       # Algorithm
    pop=agent_pop,                              # Population of agents
    memory=memory,                              # Replay buffer
    swap_channels=INIT_HP['CHANNELS_LAST'],     # Swap image channel from last to first
    max_steps=INIT_HP['MAX_STEPS'],             # Max number of training steps
    evo_steps=INIT_HP['EVO_STEPS'],             # Evolution frequency
    eval_steps=INIT_HP['EVAL_STEPS'],           # Number of steps in evaluation episode
    eval_loop=INIT_HP['EVAL_LOOP'],             # Number of evaluation episodes
    learning_delay=INIT_HP['LEARNING_DELAY'],   # Steps before starting learning
    target=INIT_HP['TARGET_SCORE'],             # Target score for early stopping
    tournament=tournament,                      # Tournament selection object
    mutation=mutations,                         # Mutations object
    wb=INIT_HP['WANDB'],                        # Weights and Biases tracking
)
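The general pattern behind an evolutionary training run (train each agent for a while, evaluate, select, mutate, repeat) can be sketched on a toy problem. The example below is not AgileRL code and makes no claim about how train_off_policy() is implemented: each "agent" is just a learning rate and a single parameter that minimizes f(x) = x^2 by gradient descent, and every name in it is made up for illustration.

```python
import random


def train(agent, steps):
    """Run gradient descent on f(x) = x**2 for a number of steps."""
    for _ in range(steps):
        agent["x"] -= agent["lr"] * 2 * agent["x"]  # gradient of x**2 is 2x


def fitness(agent):
    """Higher is better, so use the negative loss."""
    return -(agent["x"] ** 2)


def evolve(population, rng, mutation_sd=0.1):
    """Keep the best agent, then fill the rest with mutated tournament winners."""
    elite = max(population, key=fitness)
    next_gen = [dict(elite)]  # elitism: the best agent survives unchanged
    while len(next_gen) < len(population):
        a, b = rng.choice(population), rng.choice(population)
        child = dict(max(a, b, key=fitness))  # tournament of size 2
        # Mutate the learning rate multiplicatively, clamped to a stable range.
        mutated_lr = child["lr"] * (1 + rng.gauss(0, mutation_sd))
        child["lr"] = min(0.9, max(1e-4, mutated_lr))
        next_gen.append(child)
    return next_gen


rng = random.Random(1)
population = [{"lr": 10 ** rng.uniform(-4, -1), "x": 5.0} for _ in range(6)]

for generation in range(10):
    for agent in population:
        train(agent, steps=20)          # analogous to training for EVO_STEPS
    population = evolve(population, rng)

best = max(population, key=fitness)
print(f"best lr={best['lr']:.4f}, loss={best['x'] ** 2:.6f}")
```

Because children inherit both the parameter and the learning rate of a tournament winner, training progress is never thrown away between generations, which is the property that lets a single run stand in for many separate tuning runs.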
If you use AgileRL in your work, please cite the repository:
@misc{agilerl,
author = {Ustaran-Anderegg, Nicholas and Pratt, Michael},
license = {Apache-2.0},
title = {{AgileRL}},
url = {}
}