Algorithms

ApexRL provides implementations of state-of-the-art reinforcement learning algorithms.

Available Algorithms

Algorithm

Type

Status

Description

PPO

On-policy

✅ Available

Proximal Policy Optimization

RecurrentPPO

On-policy

✅ Available

PPO with sequence minibatches and recurrent actor/critic state

DQN

Off-policy

✅ Available

Deep Q-Network

SAC

Off-policy

✅ Available

Soft Actor-Critic

TD3

Off-policy

✅ Available

Twin Delayed DDPG for continuous control

MAPPO

Multi-agent on-policy

✅ Available

Multi-Agent PPO with centralized critic support

IPPO

Multi-agent on-policy

✅ Available

Independent PPO with decentralized critics

HAPPO

Multi-agent on-policy

✅ Available

Heterogeneous-Agent PPO with sequential policy updates

PPO (Proximal Policy Optimization)

PPO is an on-policy algorithm known for its stability and ease of use.

Key Features

  • Clipped surrogate objective for stable updates

  • Generalized Advantage Estimation (GAE)

  • Support for both continuous and discrete actions

  • Correct timeout bootstrapping with terminated / truncated semantics

  • Asymmetric actor-critic (privileged information for critic)

  • Separate or joint policy/value optimizers

Basic Usage

from apexrl.algorithms.ppo import PPO, PPOConfig
from apexrl.envs.vecenv import DummyVecEnv
from apexrl.models.mlp import MLPActor, MLPCritic

# Create environment
env = DummyVecEnv(num_envs=4096, num_obs=48, num_actions=12)

# Configure PPO
cfg = PPOConfig(
    num_steps=24,
    num_epochs=5,
    learning_rate=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
)

# Create agent
agent = PPO(
    env=env,
    cfg=cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

# Train
# PPO.learn() is a thin convenience wrapper around OnPolicyRunner.
agent.learn(total_timesteps=10_000_000)

For new projects, prefer OnPolicyRunner as the primary training entrypoint and treat PPO as the algorithm implementation plugged into that runner.

Recurrent PPO

RecurrentPPO keeps actor and critic hidden state during rollout collection and trains on contiguous sequence minibatches instead of shuffled single-step transitions. It accepts custom recurrent actor_class and critic_class arguments, matching the normal PPO construction pattern.

Multi-Agent PPO Algorithms

MAPPO, IPPO and HAPPO share the same multi-agent runner and rollout storage. MAPPO uses centralized training with decentralized execution: each actor consumes local agent observations, while critics can consume a centralized environment state. IPPO keeps the same per-agent actor interface but uses local observations for each critic by setting centralized_critic=False. HAPPO uses separate actors and sequential policy updates with correction factors from agents updated earlier in the current update order.

from apexrl.models import MLPActor, MLPCritic
from apexrl.multiagent import HAPPO, HAPPOConfig, IPPO, IPPOConfig, MAPPO, MAPPOConfig

mappo_cfg = MAPPOConfig(centralized_critic=True, share_actor=True)
mappo_agent = MAPPO(
    env=multiagent_env,
    cfg=mappo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

ippo_cfg = IPPOConfig(share_actor=True)
ippo_agent = IPPO(
    env=multiagent_env,
    cfg=ippo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

happo_cfg = HAPPOConfig(centralized_critic=True, share_actor=False)
happo_agent = HAPPO(
    env=multiagent_env,
    cfg=happo_cfg,
    actor_class=MLPActor,
    critic_class=MLPCritic,
)

Paper References

Algorithm

Reference

Link

PPO

Proximal Policy Optimization Algorithms

https://arxiv.org/abs/1707.06347

DQN

Playing Atari with Deep Reinforcement Learning

https://arxiv.org/abs/1312.5602

SAC

Soft Actor-Critic Algorithms and Applications

https://arxiv.org/abs/1812.05905

FlashSAC

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

https://arxiv.org/abs/2604.04539

TD3

Addressing Function Approximation Error in Actor-Critic Methods

https://arxiv.org/abs/1802.09477

MAPPO

The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

https://arxiv.org/abs/2103.01955

IPPO

Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

https://arxiv.org/abs/2011.09533

HAPPO

Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

https://arxiv.org/abs/2109.11251

Configuration

API Reference

Algorithm Details

PPO-Clip Objective

The PPO-Clip objective function:

\[L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]\]

where:

  • \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio

  • \(\hat{A}_t\) is the estimated advantage

  • \(\epsilon\) is the clip range (typically 0.2)

Total Loss Function

\[L^{TOTAL}(\theta) = L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\]

where:

  • \(L^{VF}\) is the value function loss (MSE)

  • \(S\) is the entropy bonus

  • \(c_1\), \(c_2\) are coefficients

Hyperparameter Tuning

General Guidelines

PPO Hyperparameters

Parameter

Typical Range

Description

num_steps

2048-8192

Steps per environment per update

num_epochs

3-10

Optimization epochs per batch

learning_rate

1e-5 to 1e-3

Step size for optimization

gamma

0.99-0.999

Discount factor

gae_lambda

0.9-0.99

GAE lambda parameter

clip_range

0.1-0.3

Clipping parameter

ent_coef

0.0-0.01

Entropy coefficient

vf_coef

0.25-1.0

Value function loss coefficient

Environment-Specific Recommendations

Isaac Gym (Legged Robots):

cfg = PPOConfig(
    num_steps=24,
    num_epochs=5,
    learning_rate=1e-3,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    batch_size=98304,
    minibatch_size=32768,
)

Gymnasium (Atari):

cfg = PPOConfig(
    num_steps=128,
    num_epochs=4,
    learning_rate=2.5e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.1,
    ent_coef=0.01,
)

Gymnasium (Mujoco):

cfg = PPOConfig(
    num_steps=2048,
    num_epochs=10,
    learning_rate=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
)

For continuous control tasks, the default PPO configuration uses use_tanh_squash=False and clamps learned log standard deviations with min_log_std / max_log_std to keep the policy numerically stable.

Advanced Features

Learning Rate Scheduling

ApexRL supports multiple learning rate schedules:

cfg = PPOConfig(
    learning_rate_schedule="adaptive",  # or "linear", "constant"
    max_learning_rate=1e-3,
    min_learning_rate=1e-5,
)
  • constant: Fixed learning rate

  • linear: Linear decay from initial to 0

  • adaptive: Custom decay schedule

Value Function Clipping

Enable value function clipping for more stable training:

cfg = PPOConfig(
    clip_range_vf=0.2,  # None to disable
)

Early Stopping

Stop updates when KL divergence exceeds threshold:

cfg = PPOConfig(
    target_kl=0.015,  # None to disable
)

Separate Optimizers

Use different learning rates for policy and value:

cfg = PPOConfig(
    use_policy_optimizer=True,
    policy_learning_rate=1e-4,
    value_learning_rate=3e-4,
)

See Also

DQN (Deep Q-Network)

DQN is available for discrete-action environments through ReplayBuffer, OffPolicyRunner, and MLP-based Q networks. The current implementation supports standard DQN, Double DQN, and Dueling DQN.

Key Features

  • Experience replay with device-resident sampling

  • Target network updates with hard or soft synchronization

  • double_dqn target computation

  • dueling Q-network architecture

  • Epsilon-greedy exploration

Basic Usage

import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.dqn import DQNConfig
from apexrl.envs.gym_wrapper import GymVecEnv
from apexrl.models import MLPQNetwork

env = GymVecEnv([lambda: make("CartPole-v1") for _ in range(4)], device="cpu")

cfg = DQNConfig(
    double_dqn=True,
    dueling=True,
    learning_starts=1_000,
    batch_size=128,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    q_network_class=MLPQNetwork,
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)

DQN.learn() is also available as a convenience wrapper, but OffPolicyRunner is the canonical training entrypoint for off-policy methods.

Configuration

API Reference

Implementation Notes

  • Set double_dqn=True to reduce overestimation bias.

  • Set dueling=True to split value and advantage estimation in the Q network.

  • MLPQNetwork supports both standard and dueling layouts through config only.

Smoke Benchmarks

The benchmark script includes lightweight DQN and Dueling DQN smoke tasks:

python benchmarks/run_smoke_benchmarks.py --iterations 1 --num-envs 1

Included off-policy smoke tasks:

  • CartPole-v1 (DQN)

  • CartPole-v1 (Dueling DQN)

  • Acrobot-v1 (DQN)

  • Acrobot-v1 (Dueling DQN)

  • Pendulum-v1 (SAC)

  • MountainCarContinuous-v0 (SAC)

SAC (Soft Actor-Critic)

SAC is available for continuous-control environments through ReplayBuffer, OffPolicyRunner, a squashed Gaussian actor, and twin Q(s, a) critics.

Key Features

  • Off-policy continuous control with replay reuse

  • Squashed Gaussian actor with action-bound rescaling

  • Twin critics and target critics

  • Automatic entropy-temperature tuning

  • Shared OffPolicyRunner training entrypoint

Basic Usage

import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.sac import SACConfig
from apexrl.envs.gym_wrapper import GymVecEnvContinuous

env = GymVecEnvContinuous(
    [lambda: make("Pendulum-v1") for _ in range(2)],
    device="cpu",
)

cfg = SACConfig(
    batch_size=256,
    buffer_size=100_000,
    learning_starts=5_000,
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    alpha_learning_rate=3e-4,
    tau=0.005,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="sac",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)

SAC.learn() is also available as a convenience wrapper, but OffPolicyRunner remains the canonical training entrypoint.

Configuration

API Reference

Algorithm Details

SAC critic target:

\[y = r + \gamma (1-d)\left(\min(Q_1'(s', a'), Q_2'(s', a')) - \alpha \log \pi(a'|s')\right)\]

Twin critic losses:

\[L_{Q_i} = \mathbb{E}\left[(Q_i(s, a) - y)^2\right]\]

Actor loss:

\[L_{\pi} = \mathbb{E}\left[\alpha \log \pi(a|s) - \min(Q_1(s, a), Q_2(s, a))\right]\]

Temperature loss:

\[L_{\alpha} = -\mathbb{E}\left[\log \alpha \cdot (\log \pi(a|s) + \mathcal{H}_{target})\right]\]

Implementation Notes

  • The default actor is MLPSquashedGaussianActor.

  • The default critics are twin MLPContinuousQNetwork instances.

  • ReplayBuffer stores continuous vector actions by setting action_shape=env.action_space.shape.

  • Bootstrap masking follows Gymnasium semantics: true terminals stop bootstrapping; truncation should preserve the final observation for value estimation.

Smoke Benchmarks

The benchmark script includes lightweight SAC smoke tasks:

python benchmarks/run_smoke_benchmarks.py --iterations 1 --num-envs 1

Included SAC smoke tasks:

  • Pendulum-v1 (SAC)

  • MountainCarContinuous-v0 (SAC)

FlashSAC

FlashSAC is a SAC-style algorithm for high-throughput continuous control. It uses the same off-policy runner interface as SAC, but defaults to larger batches and networks and adds optional critic feature and weight norm controls.

import torch

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.flash_sac import FlashSACConfig

cfg = FlashSACConfig(
    batch_size=2048,
    buffer_size=2_000_000,
    learning_starts=32_768,
    gradient_steps=1,
    critic_feature_norm_coef=1e-4,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="flash_sac",
    device=torch.device("cuda"),
)

Configuration

API Reference

TD3 (Twin Delayed DDPG)

TD3 is available for continuous-control environments through ReplayBuffer, OffPolicyRunner, a deterministic actor, and twin Q(s, a) critics. It improves DDPG with clipped double-Q targets, delayed policy updates, and target policy smoothing.

Key Features

  • Deterministic bounded actor

  • Twin critics with conservative min(Q1, Q2) targets

  • Delayed actor and target-network updates

  • Clipped target policy smoothing noise

  • Shared OffPolicyRunner training entrypoint

Basic Usage

import torch
from gymnasium import make

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.td3 import TD3Config
from apexrl.envs.gym_wrapper import GymVecEnvContinuous

env = GymVecEnvContinuous(
    [lambda: make("Pendulum-v1") for _ in range(2)],
    device="cpu",
)

cfg = TD3Config(
    batch_size=256,
    buffer_size=100_000,
    learning_starts=5_000,
    policy_delay=2,
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="td3",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=200_000)

Configuration

API Reference

Algorithm Details

TD3 critic target:

\[y = r + \gamma (1-d)\min(Q_1'(s', a'), Q_2'(s', a'))\]

Target action with policy smoothing:

\[a' = \mathrm{clip}(\mu'(s') + \epsilon, a_{low}, a_{high})\]

Actor loss:

\[L_{\mu} = -\mathbb{E}[Q_1(s, \mu(s))]\]