Algorithms¶
ApexRL provides implementations of state-of-the-art reinforcement learning algorithms.
Available Algorithms¶
| Algorithm | Type | Status | Description |
|---|---|---|---|
| PPO | On-policy | ✅ Available | Proximal Policy Optimization |
| RecurrentPPO | On-policy | ✅ Available | PPO with sequence minibatches and recurrent actor/critic state |
| DQN | Off-policy | ✅ Available | Deep Q-Network |
| SAC | Off-policy | ✅ Available | Soft Actor-Critic |
| TD3 | Off-policy | ✅ Available | Twin Delayed DDPG for continuous control |
| MAPPO | Multi-agent on-policy | ✅ Available | Multi-Agent PPO with centralized critic support |
| IPPO | Multi-agent on-policy | ✅ Available | Independent PPO with decentralized critics |
| HAPPO | Multi-agent on-policy | ✅ Available | Heterogeneous-Agent PPO with sequential policy updates |
PPO (Proximal Policy Optimization)¶
PPO is an on-policy algorithm known for its stability and ease of use.
Key Features¶
Clipped surrogate objective for stable updates
Generalized Advantage Estimation (GAE)
Support for both continuous and discrete actions
Correct timeout bootstrapping with terminated/truncated semantics
Asymmetric actor-critic (privileged information for critic)
Separate or joint policy/value optimizers
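The interaction between GAE and terminated/truncated semantics listed above can be sketched in plain Python. This is a minimal illustration for a single environment, not ApexRL's implementation: the function name `compute_gae` and its argument layout are assumptions made for the example. It assumes `next_values[t]` is the critic's estimate for the true next observation (the final observation before any reset).

```python
def compute_gae(rewards, values, next_values, terminated, truncated,
                gamma=0.99, lam=0.95):
    """Hypothetical helper: GAE over one environment's rollout (plain lists)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Only a true terminal state has zero future value; a timeout still bootstraps.
        bootstrap = 0.0 if terminated[t] else 1.0
        # The advantage recursion must not leak across either kind of episode boundary.
        carry = 0.0 if (terminated[t] or truncated[t]) else 1.0
        delta = rewards[t] + gamma * bootstrap * next_values[t] - values[t]
        gae = delta + gamma * lam * carry * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With a truncated final step, the last TD error still includes `gamma * next_values[t]`; with a terminated final step it does not. That difference is what "correct timeout bootstrapping" refers to.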
Basic Usage¶
from apexrl.algorithms.ppo import PPO, PPOConfig
from apexrl.envs.vecenv import DummyVecEnv
from apexrl.models.mlp import MLPActor, MLPCritic

# Create environment
env = DummyVecEnv(num_envs=4096, num_obs=48, num_actions=12)

# Configure PPO
cfg = PPOConfig(
    num_steps=24,
    num_epochs=5,
    learning_rate=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
)

# Create agent
agent = PPO(
    env=env,
    cfg=cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

# Train
# PPO.learn() is a thin convenience wrapper around OnPolicyRunner.
agent.learn(total_timesteps=10_000_000)
For new projects, prefer OnPolicyRunner as the primary training entrypoint
and treat PPO as the algorithm implementation plugged into that runner.
Recurrent PPO¶
RecurrentPPO keeps actor and critic hidden state during rollout collection
and trains on contiguous sequence minibatches instead of shuffled single-step
transitions. It accepts custom recurrent actor_class and critic_class
arguments, matching the normal PPO construction pattern.
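To illustrate what "contiguous sequence minibatches" means in practice, the sketch below cuts a rollout of shape `(num_steps, num_envs)` into fixed-length windows and shuffles whole windows rather than individual transitions. The function `sequence_minibatches` and its parameters are hypothetical names for this example and are not part of the ApexRL API. A recurrent update would typically start each window from the hidden state recorded at `start_step`, which is also an assumption here.

```python
import random


def sequence_minibatches(num_steps, num_envs, seq_len, seqs_per_batch, seed=0):
    """Yield minibatches of (env_index, start_step, end_step) windows."""
    windows = [
        (env, start, min(start + seq_len, num_steps))
        for env in range(num_envs)
        for start in range(0, num_steps, seq_len)
    ]
    # Shuffle the order of windows; steps inside a window stay contiguous.
    random.Random(seed).shuffle(windows)
    for i in range(0, len(windows), seqs_per_batch):
        yield windows[i:i + seqs_per_batch]
```

Shuffling windows keeps the decorrelation benefit of minibatching while preserving the temporal order that the recurrent state depends on.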
Multi-Agent PPO Algorithms¶
MAPPO, IPPO and HAPPO share the same multi-agent runner and rollout storage.
MAPPO uses centralized training with decentralized execution: each actor
consumes local agent observations, while critics can consume a centralized
environment state. IPPO keeps the same per-agent actor interface but uses local
observations for each critic by setting centralized_critic=False. HAPPO uses
separate actors and sequential policy updates with correction factors from
agents updated earlier in the current update order.
from apexrl.models import MLPActor, MLPCritic
from apexrl.multiagent import HAPPO, HAPPOConfig, IPPO, IPPOConfig, MAPPO, MAPPOConfig

# multiagent_env is a multi-agent environment created elsewhere.

# MAPPO: shared actor, centralized critic
mappo_cfg = MAPPOConfig(centralized_critic=True, share_actor=True)
mappo_agent = MAPPO(
    env=multiagent_env,
    cfg=mappo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

# IPPO: shared actor, local-observation critics
ippo_cfg = IPPOConfig(share_actor=True)
ippo_agent = IPPO(
    env=multiagent_env,
    cfg=ippo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

# HAPPO: separate actors, sequential policy updates
happo_cfg = HAPPOConfig(centralized_critic=True, share_actor=False)
happo_agent = HAPPO(
    env=multiagent_env,
    cfg=happo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)
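The "correction factors from agents updated earlier" in HAPPO can be sketched as a running product of probability ratios. This is an illustrative sketch for a single sample, following the scheme described in the HAPPO paper. The function name and dictionary layout are assumptions, and in a real update each agent's new log-probability only becomes available after that agent's own policy step.

```python
import math


def happo_correction_factors(old_log_probs, new_log_probs, update_order):
    """Return, per agent, the factor that scales its advantage for one sample.

    old_log_probs / new_log_probs map agent id -> log-prob of that agent's
    taken action before / after the agent's own policy update.
    """
    factors = {}
    running = 1.0
    for agent in update_order:
        # Product of ratios over agents that were updated earlier in the order.
        factors[agent] = running
        running *= math.exp(new_log_probs[agent] - old_log_probs[agent])
    return factors
```

The first agent in the order always receives a factor of 1.0, so its update reduces to an ordinary PPO step.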
Paper References¶
| Algorithm | Reference |
|---|---|
| PPO | Proximal Policy Optimization Algorithms |
| DQN | Playing Atari with Deep Reinforcement Learning |
| SAC | Soft Actor-Critic Algorithms and Applications |
| FlashSAC | FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control |
| TD3 | Addressing Function Approximation Error in Actor-Critic Methods |
| MAPPO | The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games |
| IPPO | Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? |
| HAPPO | Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning |
Configuration¶
API Reference¶
Algorithm Details¶
PPO-Clip Objective¶
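The PPO-Clip objective function:
\[L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]\]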
where:
\(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio
\(\hat{A}_t\) is the estimated advantage
\(\epsilon\) is the clip range (typically 0.2)
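As a worked illustration of the definitions above, the following sketch evaluates the clipped surrogate on plain Python lists and returns its negative, so that it can be minimized. The function name `ppo_clip_loss` is hypothetical; a real implementation would operate on batched tensors.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # r_t(theta)
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

For a ratio of 1.5 and a positive advantage the clipped term is used, so increasing the ratio further brings no additional gain. For a negative advantage the unclipped term is used, so the penalty is not capped.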
Total Loss Function¶
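\[L(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1 L^{VF}_t(\theta) + c_2 S[\pi_\theta](s_t)\right]\]
This combined objective is maximized; implementations usually minimize its negative.
where: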
\(L^{VF}\) is the value function loss (MSE)
\(S\) is the entropy bonus
\(c_1\), \(c_2\) are coefficients
Hyperparameter Tuning¶
General Guidelines¶
| Parameter | Typical Range | Description |
|---|---|---|
| num_steps | 2048-8192 | Steps per environment per update |
| num_epochs | 3-10 | Optimization epochs per batch |
| learning_rate | 1e-5 to 1e-3 | Step size for optimization |
| gamma | 0.99-0.999 | Discount factor |
| gae_lambda | 0.9-0.99 | GAE lambda parameter |
| clip_range | 0.1-0.3 | Clipping parameter |
| ent_coef | 0.0-0.01 | Entropy coefficient |
| | 0.25-1.0 | Value function loss coefficient |
Environment-Specific Recommendations¶
Isaac Gym (Legged Robots):
cfg = PPOConfig(
    num_steps=24,
    num_epochs=5,
    learning_rate=1e-3,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    batch_size=98304,
    minibatch_size=32768,
)
Gymnasium (Atari):
cfg = PPOConfig(
    num_steps=128,
    num_epochs=4,
    learning_rate=2.5e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.1,
    ent_coef=0.01,
)
Gymnasium (Mujoco):
cfg = PPOConfig(
    num_steps=2048,
    num_epochs=10,
    learning_rate=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
)
For continuous control tasks, the default PPO configuration uses
use_tanh_squash=False and clamps learned log standard deviations with
min_log_std / max_log_std to keep the policy numerically stable.
Advanced Features¶
Learning Rate Scheduling¶
ApexRL supports multiple learning rate schedules:
cfg = PPOConfig(
    learning_rate_schedule="adaptive",  # or "linear", "constant"
    max_learning_rate=1e-3,
    min_learning_rate=1e-5,
)
constant: Fixed learning rate
linear: Linear decay from initial to 0
adaptive: Custom decay schedule
Value Function Clipping¶
Enable value function clipping for more stable training:
cfg = PPOConfig(
    clip_range_vf=0.2,  # None to disable
)
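One common form of value function clipping limits how far the new value prediction may move from the prediction recorded at rollout time, then takes the larger of the clipped and unclipped squared errors. The sketch below shows that form on plain lists. Whether ApexRL uses exactly this variant is an assumption, and `clipped_value_loss` is a hypothetical name.

```python
def clipped_value_loss(values, old_values, returns, clip_range_vf=0.2):
    """Mean of max(unclipped, clipped) squared value errors."""
    total = 0.0
    for v, v_old, ret in zip(values, old_values, returns):
        # Restrict the change relative to the rollout-time value estimate.
        delta = min(max(v - v_old, -clip_range_vf), clip_range_vf)
        v_clipped = v_old + delta
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return total / len(values)
```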
Early Stopping¶
Stop updates when KL divergence exceeds threshold:
cfg = PPOConfig(
    target_kl=0.015,  # None to disable
)
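A KL-based stopping check can be sketched as follows. The estimator `(r - 1) - log r` is a widely used low-variance approximation of the KL divergence between the old and new policy. Whether ApexRL uses this exact estimator is an assumption, and both function names are hypothetical.

```python
import math


def approx_kl(old_log_probs, new_log_probs):
    """Sample estimate of KL(old || new) using (r - 1) - log r."""
    total = 0.0
    for old_lp, new_lp in zip(old_log_probs, new_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_log_probs)


def should_stop(old_log_probs, new_log_probs, target_kl):
    """Return True when the remaining update epochs should be skipped."""
    if target_kl is None:  # early stopping disabled
        return False
    return approx_kl(old_log_probs, new_log_probs) > target_kl
```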
Separate Optimizers¶
Use different learning rates for policy and value:
cfg = PPOConfig(
    use_policy_optimizer=True,
    policy_learning_rate=1e-4,
    value_learning_rate=3e-4,
)
See Also¶
Your First RL Agent - Basic usage tutorial
Custom Network Architectures - Custom network architectures
apexrl.algorithms.ppo package - Full API reference
DQN (Deep Q-Network)¶
DQN is available for discrete-action environments through ReplayBuffer,
OffPolicyRunner, and MLP-based Q networks. The current implementation
supports standard DQN, Double DQN, and Dueling DQN.
Key Features¶
Experience replay with device-resident sampling
Target network updates with hard or soft synchronization
double_dqn target computation
dueling Q-network architecture
Epsilon-greedy exploration
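The difference between the standard and Double DQN targets can be shown for a single transition. This is a minimal sketch on plain lists; `dqn_target` and its arguments are hypothetical names, not the ApexRL API.

```python
def dqn_target(reward, terminated, next_q_online, next_q_target,
               gamma=0.99, double_dqn=True):
    """TD target for one transition, given next-state Q-values per action."""
    if double_dqn:
        # Online network selects the action, target network evaluates it.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        next_value = next_q_target[best]
    else:
        # Standard DQN: target network both selects and evaluates.
        next_value = max(next_q_target)
    not_done = 0.0 if terminated else 1.0
    return reward + gamma * not_done * next_value
```

Decoupling selection from evaluation is what reduces the overestimation that arises when a single noisy estimate is both maximized over and used as the value.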
Basic Usage¶
import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.dqn import DQNConfig
from apexrl.envs.gym_wrapper import GymVecEnv
from apexrl.models import MLPQNetwork

env = GymVecEnv([lambda: make("CartPole-v1") for _ in range(4)], device="cpu")

cfg = DQNConfig(
    double_dqn=True,
    dueling=True,
    learning_starts=1_000,
    batch_size=128,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    q_network_class=MLPQNetwork,
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)
DQN.learn() is also available as a convenience wrapper, but
OffPolicyRunner is the canonical training entrypoint for off-policy methods.
Configuration¶
API Reference¶
Implementation Notes¶
Set double_dqn=True to reduce overestimation bias.
Set dueling=True to split value and advantage estimation in the Q network.
MLPQNetwork supports both standard and dueling layouts through config only.
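The value/advantage split of a dueling head is usually recombined with a mean-subtracted advantage, as in the original dueling architecture paper. The sketch below shows that aggregation for one state. Whether MLPQNetwork uses the mean or another normalization is an assumption, and `dueling_q_values` is a hypothetical name.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-centered advantages."""
    mean_adv = sum(advantages) / len(advantages)
    # Subtracting the mean makes the V/A decomposition identifiable.
    return [state_value + a - mean_adv for a in advantages]
```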
Smoke Benchmarks¶
The benchmark script includes lightweight DQN and Dueling DQN smoke tasks:
python benchmarks/run_smoke_benchmarks.py --iterations 1 --num-envs 1
Included off-policy smoke tasks:
CartPole-v1 (DQN)
CartPole-v1 (Dueling DQN)
Acrobot-v1 (DQN)
Acrobot-v1 (Dueling DQN)
Pendulum-v1 (SAC)
MountainCarContinuous-v0 (SAC)
SAC (Soft Actor-Critic)¶
SAC is available for continuous-control environments through
ReplayBuffer, OffPolicyRunner, a squashed Gaussian actor, and
twin Q(s, a) critics.
Key Features¶
Off-policy continuous control with replay reuse
Squashed Gaussian actor with action-bound rescaling
Twin critics and target critics
Automatic entropy-temperature tuning
Shared OffPolicyRunner training entrypoint
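The squashed Gaussian actor with action-bound rescaling can be illustrated for a single scalar action. The sketch applies `tanh` to a pre-squash sample, rescales it to `[low, high]`, and corrects the log-density with the change-of-variables term. The function name, the `1e-6` stabilizer, and the inclusion of the constant scale term are assumptions for this example; some implementations drop the constant.

```python
import math


def squash_and_rescale(u, log_prob_u, low, high):
    """Map a pre-tanh sample u into [low, high] and adjust its log-density."""
    squashed = math.tanh(u)
    action = low + 0.5 * (squashed + 1.0) * (high - low)
    # log |d action / d u| = log(0.5 * (high - low)) + log(1 - tanh(u)^2)
    log_det = math.log(0.5 * (high - low)) + math.log(1.0 - squashed ** 2 + 1e-6)
    return action, log_prob_u - log_det
```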
Basic Usage¶
import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.sac import SACConfig
from apexrl.envs.gym_wrapper import GymVecEnvContinuous

env = GymVecEnvContinuous(
    [lambda: make("Pendulum-v1") for _ in range(2)],
    device="cpu",
)

cfg = SACConfig(
    batch_size=256,
    buffer_size=100_000,
    learning_starts=5_000,
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    alpha_learning_rate=3e-4,
    tau=0.005,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="sac",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)
SAC.learn() is also available as a convenience wrapper, but
OffPolicyRunner remains the canonical training entrypoint.
Configuration¶
API Reference¶
Algorithm Details¶
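SAC critic target:
\[y = r + \gamma (1 - d)\left(\min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a'|s')\right), \quad a' \sim \pi_\phi(\cdot|s')\]
Twin critic losses:
\[L_{Q_i} = \mathbb{E}\left[\left(Q_{\theta_i}(s, a) - y\right)^2\right], \quad i = 1, 2\]
Actor loss:
\[L_\pi = \mathbb{E}\left[\alpha \log \pi_\phi(a|s) - \min_{i=1,2} Q_{\theta_i}(s, a)\right], \quad a \sim \pi_\phi(\cdot|s)\]
Temperature loss:
\[L_\alpha = \mathbb{E}\left[-\alpha\left(\log \pi_\phi(a|s) + \bar{\mathcal{H}}\right)\right]\]
Here \(d\) is 1 for true terminals and 0 otherwise, \(\bar{\theta}_i\) are the target critic parameters, and \(\bar{\mathcal{H}}\) is the target entropy.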
Implementation Notes¶
The default actor is MLPSquashedGaussianActor.
The default critics are twin MLPContinuousQNetwork instances.
ReplayBuffer stores continuous vector actions by setting action_shape=env.action_space.shape.
Bootstrap masking follows Gymnasium semantics: true terminals stop bootstrapping; truncation should preserve the final observation for value estimation.
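The `tau=0.005` setting in the configuration above is conventionally the coefficient of a soft (Polyak) target update, in which the target critics track the online critics slowly. The sketch below shows that rule on flat parameter lists. Treating `tau` this way in ApexRL is an assumption, and `soft_update` is a hypothetical name.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return new target parameters: (1 - tau) * target + tau * online."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]
```

Setting `tau=1.0` copies the online parameters outright, which corresponds to a hard synchronization.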
Smoke Benchmarks¶
The benchmark script includes lightweight SAC smoke tasks:
python benchmarks/run_smoke_benchmarks.py --iterations 1 --num-envs 1
Included SAC smoke tasks:
Pendulum-v1 (SAC)
MountainCarContinuous-v0 (SAC)
FlashSAC¶
FlashSAC is a SAC-style algorithm for high-throughput continuous control. It uses the same off-policy runner interface as SAC, but defaults to larger batches and networks and adds optional critic feature and weight norm controls.
import torch

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.flash_sac import FlashSACConfig

cfg = FlashSACConfig(
    batch_size=2048,
    buffer_size=2_000_000,
    learning_starts=32_768,
    gradient_steps=1,
    critic_feature_norm_coef=1e-4,
)

# env is a continuous-action vectorized environment created elsewhere,
# for example as in the SAC usage above.
runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="flash_sac",
    device=torch.device("cuda"),
)
Configuration¶
API Reference¶
TD3 (Twin Delayed DDPG)¶
TD3 is available for continuous-control environments through
ReplayBuffer, OffPolicyRunner, a deterministic actor, and twin
Q(s, a) critics. It improves DDPG with clipped double-Q targets, delayed
policy updates, and target policy smoothing.
Key Features¶
Deterministic bounded actor
Twin critics with conservative min(Q1, Q2) targets
Delayed actor and target-network updates
Clipped target policy smoothing noise
Shared OffPolicyRunner training entrypoint
Basic Usage¶
import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.td3 import TD3Config
from apexrl.envs.gym_wrapper import GymVecEnvContinuous

env = GymVecEnvContinuous(
    [lambda: make("Pendulum-v1") for _ in range(2)],
    device="cpu",
)

cfg = TD3Config(
    batch_size=256,
    buffer_size=100_000,
    learning_starts=5_000,
    policy_delay=2,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="td3",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)
Configuration¶
API Reference¶
Algorithm Details¶
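TD3 critic target:
\[y = r + \gamma (1 - d) \min_{i=1,2} Q_{\bar{\theta}_i}(s', \tilde{a})\]
Target action with policy smoothing:
\[\tilde{a} = \operatorname{clip}\left(\mu_{\bar{\phi}}(s') + \operatorname{clip}(\epsilon, -c, c),\ a_{low},\ a_{high}\right), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\]
Actor loss:
\[L_\mu = -\mathbb{E}\left[Q_{\theta_1}(s, \mu_\phi(s))\right]\]
Here \(d\) is 1 for true terminals and 0 otherwise, \(\bar{\theta}_i\) and \(\bar{\phi}\) are the target critic and target actor parameters, \(\sigma\) is the smoothing noise scale, and \(c\) is the noise clip.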
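Target policy smoothing and the clipped double-Q target can be sketched for a scalar action as below. The function names, the default noise scale of 0.2, and the noise clip of 0.5 follow the values commonly quoted for TD3 and are assumptions here, not ApexRL defaults.

```python
import random


def smoothed_target_action(target_action, low, high,
                           noise_std=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to the target actor's action, then bound it."""
    noise = rng.gauss(0.0, noise_std)
    noise = min(max(noise, -noise_clip), noise_clip)
    return min(max(target_action + noise, low), high)


def td3_target(reward, terminated, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller target critic value."""
    not_done = 0.0 if terminated else 1.0
    return reward + gamma * not_done * min(q1_next, q2_next)
```

In use, `q1_next` and `q2_next` would be the two target critics evaluated at the next state and the smoothed action.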