Changelog¶

All notable changes to ApexRL will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.3.0] - 2026-05-24¶

Added¶

single-agent:

add support for Recurrent-PPO

multi-agent:

Full multi-agent reinforcement learning support with MAPPO, IPPO, and HAPPO algorithms, including dedicated configs and training loops.
MultiAgentRunner for unified multi-agent training orchestration with logging, checkpointing, and callback support on par with single-agent runners.
MultiAgentVecEnv base class and cooperative environment wrappers for batched multi-agent episode collection.
MultiAgentRolloutBuffer for structured storage of multi-agent observations, actions, rewards, and terminal flags.

TODO¶

Recurrent-Network support for multi-agent RL.
JAX support.

[0.2.2] - 2026-05-13¶

Changed¶

Improved runner logging so environment extras are recorded only from user-selected extra_log_keys instead of hard-coded extras names.
Refined the logging documentation to describe configurable extras logging more clearly.

[0.2.1] - 2026-04-22¶

Added¶

Official Muon optimizer support across PPO, DQN, and SAC using the bundled mixed Muon-plus-AuxAdam implementation.
Smoke-test coverage for PPO, DQN, and SAC training with optimizer="muon".

Changed¶

Optimizer construction now routes Muon through parameter grouping so matrix-like hidden weights use Muon while scalar, bias, and output-head parameters stay on the auxiliary Adam path.

Fixed¶

Structured observation tensors now preserve leaf dtypes end to end, so multimodal environments keep uint8 image leaves and other non-float modalities intact through wrappers, buffers, and algorithm input paths.
API reference now includes the missing DQN pages, removes duplicate SAC entries, and restores the missing OffPolicyRunner documentation entry.
Environment documentation drops the unverified Brax, Isaac Gym, and Isaac Lab examples until tested integration guides are added back.

[0.2.0] - 2026-04-22¶

Added¶

End-to-end TensorDict and nested-dict observation support across PPO, DQN, SAC, vectorized Gymnasium wrappers, replay buffers, and rollout buffers.
Multimodal observation support for default training stacks, including common combinations such as image plus vector inputs.
Privileged critic observation support for asymmetric actor-critic training, with separate actor and critic observation branches in PPO and SAC.
Structured-observation smoke and regression coverage for PPO, DQN, SAC, and buffer behavior.

Changed¶

Default MLP actor, critic, and Q-network implementations now flatten nested observation leaves recursively, so they can be used directly with structured inputs.
README and bilingual documentation now describe the structured observation format, PPO training flow, SAC critic branches, and multimodal custom-network authoring.

Fixed¶

Off-policy runners and Gymnasium wrappers now preserve structured final observations consistently when episodes terminate or truncate.

[0.0.3] - 2026-04-20¶

Fixed¶

Release workflow now skips PyPI publishing when PYPI_API_TOKEN is not configured.

Documentation¶

Updated English and Chinese docs to reflect the new PPO training flow, continuous-action defaults, and timeout semantics.

[0.0.2] - 2026-04-20¶

Changed¶

PPO.learn() now delegates to OnPolicyRunner so there is a single on-policy training loop for logging, checkpointing, and callbacks.
Continuous-action PPO defaults now use an unsquashed Gaussian policy with bounded log standard deviation.

Fixed¶

RolloutBuffer now stores multi-dimensional continuous actions correctly.
Gymnasium wrappers now expose terminated, truncated, and final_observation so PPO can bootstrap truncated episodes correctly.
Added smoke coverage for CartPole-v1, Pendulum-v1, and MountainCarContinuous-v0.

[0.0.1] - 2026-02-11¶

Initial release of ApexRL.

Added¶

Core Features¶

PPO (Proximal Policy Optimization) algorithm implementation
OnPolicyRunner for managing training loops
Vectorized environment interface (VecEnv)
Gymnasium environment wrappers (GymVecEnv, GymVecEnvContinuous)

Networks¶

Base classes: Actor, ContinuousActor, DiscreteActor, Critic
MLP implementations: MLPActor, MLPCritic, MLPDiscreteActor
CNN implementations: CNNActor, CNNCritic
Network construction utilities (build_mlp)

Buffers¶

RolloutBuffer for on-policy algorithms
ReplayBuffer for off-policy algorithms (planned)
DistillationBuffer for policy distillation (planned)

Optimizers¶

Support for Adam, AdamW optimizers
Experimental Muon optimizer support

Configuration¶

PPOConfig dataclass with comprehensive hyperparameters
Learning rate scheduling (constant, linear, adaptive)

Documentation¶

Sphinx documentation with Furo theme
API reference documentation
Tutorial guides
English and Chinese documentation

Planned¶

Algorithms¶

DQN (Deep Q-Network)
SAC (Soft Actor-Critic)
TD3 (Twin Delayed DDPG)

Features¶

Observation normalization
Reward normalization
Multi-GPU training support
Distributed training

[Unreleased]¶

Added¶

Changed¶

Deprecated¶

Removed¶

Fixed¶

Security¶