Changelog¶
All notable changes to ApexRL will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.3.0] - 2026-05-24¶
Added¶
single-agent:
add support for Recurrent-PPO
multi-agent:
Full multi-agent reinforcement learning support with
MAPPO,IPPO, andHAPPOalgorithms, including dedicated configs and training loops.MultiAgentRunnerfor unified multi-agent training orchestration with logging, checkpointing, and callback support on par with single-agent runners.MultiAgentVecEnvbase class and cooperative environment wrappers for batched multi-agent episode collection.MultiAgentRolloutBufferfor structured storage of multi-agent observations, actions, rewards, and terminal flags.
TODO¶
Recurrent-Network support for multi-agent RL.
JAX support.
[0.2.2] - 2026-05-13¶
Changed¶
Improved runner logging so environment extras are recorded only from user-selected
extra_log_keysinstead of hard-coded extras names.Refined the logging documentation to describe configurable extras logging more clearly.
[0.2.1] - 2026-04-22¶
Added¶
Official
Muonoptimizer support across PPO, DQN, and SAC using the bundled mixed Muon-plus-AuxAdam implementation.Smoke-test coverage for PPO, DQN, and SAC training with
optimizer="muon".
Changed¶
Optimizer construction now routes
Muonthrough parameter grouping so matrix-like hidden weights use Muon while scalar, bias, and output-head parameters stay on the auxiliary Adam path.
Fixed¶
Structured observation tensors now preserve leaf dtypes end to end, so multimodal environments keep
uint8image leaves and other non-float modalities intact through wrappers, buffers, and algorithm input paths.API reference now includes the missing DQN pages, removes duplicate SAC entries, and restores the missing OffPolicyRunner documentation entry.
Environment documentation drops the unverified Brax, Isaac Gym, and Isaac Lab examples until tested integration guides are added back.
[0.2.0] - 2026-04-22¶
Added¶
End-to-end
TensorDictand nested-dict observation support across PPO, DQN, SAC, vectorized Gymnasium wrappers, replay buffers, and rollout buffers.Multimodal observation support for default training stacks, including common combinations such as image plus vector inputs.
Privileged critic observation support for asymmetric actor-critic training, with separate actor and critic observation branches in PPO and SAC.
Structured-observation smoke and regression coverage for PPO, DQN, SAC, and buffer behavior.
Changed¶
Default MLP actor, critic, and Q-network implementations now flatten nested observation leaves recursively, so they can be used directly with structured inputs.
README and bilingual documentation now describe the structured observation format, PPO training flow, SAC critic branches, and multimodal custom-network authoring.
Fixed¶
Off-policy runners and Gymnasium wrappers now preserve structured final observations consistently when episodes terminate or truncate.
[0.0.3] - 2026-04-20¶
Fixed¶
Release workflow now skips PyPI publishing when
PYPI_API_TOKENis not configured.
Documentation¶
Updated English and Chinese docs to reflect the new PPO training flow, continuous-action defaults, and timeout semantics.
[0.0.2] - 2026-04-20¶
Changed¶
PPO.learn()now delegates toOnPolicyRunnerso there is a single on-policy training loop for logging, checkpointing, and callbacks.Continuous-action PPO defaults now use an unsquashed Gaussian policy with bounded log standard deviation.
Fixed¶
RolloutBuffernow stores multi-dimensional continuous actions correctly.Gymnasium wrappers now expose
terminated,truncated, andfinal_observationso PPO can bootstrap truncated episodes correctly.Added smoke coverage for
CartPole-v1,Pendulum-v1, andMountainCarContinuous-v0.
[0.0.1] - 2026-02-11¶
Initial release of ApexRL.
Added¶
Core Features¶
PPO (Proximal Policy Optimization) algorithm implementation
OnPolicyRunner for managing training loops
Vectorized environment interface (VecEnv)
Gymnasium environment wrappers (GymVecEnv, GymVecEnvContinuous)
Networks¶
Base classes: Actor, ContinuousActor, DiscreteActor, Critic
MLP implementations: MLPActor, MLPCritic, MLPDiscreteActor
CNN implementations: CNNActor, CNNCritic
Network construction utilities (build_mlp)
Buffers¶
RolloutBuffer for on-policy algorithms
ReplayBuffer for off-policy algorithms (planned)
DistillationBuffer for policy distillation (planned)
Optimizers¶
Support for Adam, AdamW optimizers
Experimental Muon optimizer support
Configuration¶
PPOConfig dataclass with comprehensive hyperparameters
Learning rate scheduling (constant, linear, adaptive)
Documentation¶
Sphinx documentation with Furo theme
API reference documentation
Tutorial guides
English and Chinese documentation
Planned¶
Algorithms¶
DQN (Deep Q-Network)
SAC (Soft Actor-Critic)
TD3 (Twin Delayed DDPG)
Features¶
Observation normalization
Reward normalization
Multi-GPU training support
Distributed training