Quick Start
This guide shows the current recommended training entrypoints in ApexRL.
Installation
Install from source:
git clone https://github.com/Atticlmr/Apex_rl.git
cd Apex_rl
pip install -e .
or with uv:
git clone https://github.com/Atticlmr/Apex_rl.git
cd Apex_rl
uv pip install -e .
Core requirements:
Python >= 3.10
PyTorch >= 2.0
Gymnasium >= 0.29
TensorDict >= 0.6
Optional logging extras:
pip install -e ".[wandb]"
pip install -e ".[swanlab]"
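After installing, it can help to confirm that the environment meets the core requirements listed above. The following is a minimal sketch using only the standard library; it is not part of ApexRL. The helper names (`parse_version`, `at_least`, `check_requirements`) are hypothetical, and the distribution names `torch`, `gymnasium`, and `tensordict` are assumed to match the packages named in the requirements list.

```python
import sys
from importlib import metadata

# Minimum versions copied from the core requirements list.
# Keys are assumed PyPI distribution names.
MINIMUMS = {"torch": "2.0", "gymnasium": "0.29", "tensordict": "0.6"}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(found, minimum):
    """True if version string `found` is >= `minimum` (numeric parts only)."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUMS):
    """Map each package to 'ok', 'too old (found X)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if at_least(found, minimum) else f"too old (found {found})"
    return report


if __name__ == "__main__":
    print("python", "ok" if sys.version_info >= (3, 10) else "too old")
    for name, status in check_requirements().items():
        print(name, status)
```

The comparison ignores pre-release and local suffixes such as `+cu121`, which is usually enough for a quick sanity check but is not a full version parser.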
Training Entry Points
OnPolicyRunner is the canonical entrypoint for PPO
OffPolicyRunner is the canonical entrypoint for DQN and SAC
PPO.learn(), DQN.learn(), and SAC.learn() remain available as thin wrappers
First PPO Agent
Discrete control:
import gymnasium as gym
import torch

from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.algorithms.ppo import PPOConfig
from apexrl.envs.gym_wrapper import GymVecEnv
from apexrl.models import MLPDiscreteActor, MLPCritic


def make_env():
    return gym.make("CartPole-v1")


# Eight CartPole environments stepped together as one vectorized env.
env = GymVecEnv([make_env for _ in range(8)], device="cpu")

runner = OnPolicyRunner(
    env=env,
    cfg=PPOConfig(device="cpu", learning_rate_schedule="constant"),
    actor_class=MLPDiscreteActor,
    critic_class=MLPCritic,
    log_dir="./logs/cartpole_ppo",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=100_000)
runner.close()
Continuous control:
import gymnasium as gym
import torch

from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.algorithms.ppo import PPOConfig
from apexrl.envs.gym_wrapper import GymVecEnvContinuous
from apexrl.models import MLPActor, MLPCritic


def make_env():
    return gym.make("Pendulum-v1")


# Eight Pendulum environments stepped together as one vectorized env.
env = GymVecEnvContinuous([make_env for _ in range(8)], device="cpu")

runner = OnPolicyRunner(
    env=env,
    cfg=PPOConfig(device="cpu"),
    actor_class=MLPActor,
    critic_class=MLPCritic,
    log_dir="./logs/pendulum_ppo",
    device=torch.device("cpu"),
)
runner.learn(total_timesteps=100_000)
runner.close()
Structured Observations
The current repository version supports structured observations all the way through environment wrappers, buffers, algorithms, and default MLP models.
Recommended environment output format:
{
    "obs": {
        "image": image,
        "vector": vector,
    },
    "privileged_obs": {
        "state": state,
        "context": context,
    },
}
In this format:
the actor receives obs
PPO with use_asymmetric=True sends privileged_obs to the critic
SAC stores actor and critic branches separately in replay
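The routing described above can be illustrated with a small, dependency-free sketch. This is not ApexRL's internal code: `route_observations` is a hypothetical helper, plain lists stand in for tensors, and the fallback of feeding `obs` to the critic when `use_asymmetric` is off is an assumption rather than documented behavior.

```python
def route_observations(env_output, use_asymmetric=False):
    """Return (actor_input, critic_input) for one environment output.

    Illustrative only: mirrors the routing described in the text.
    """
    # The actor always sees the regular observation branch.
    actor_input = env_output["obs"]

    # With asymmetric training, the critic sees the privileged branch.
    # Assumption: otherwise the critic shares the actor's observations.
    if use_asymmetric and "privileged_obs" in env_output:
        critic_input = env_output["privileged_obs"]
    else:
        critic_input = env_output["obs"]
    return actor_input, critic_input


# Placeholder values; a real environment would return tensors.
example_output = {
    "obs": {"image": [[0.0, 0.1]], "vector": [0.5, -0.5]},
    "privileged_obs": {"state": [1.0, 2.0], "context": [3.0]},
}

actor_in, critic_in = route_observations(example_output, use_asymmetric=True)
print(sorted(actor_in))   # keys seen by the actor
print(sorted(critic_in))  # keys seen by the critic
```

Keeping the two branches under separate top-level keys means the actor never needs to know that privileged information exists, which matters when the policy is later deployed without it.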
Logging Backends
Runner and algorithm configs support tensorboard, wandb, and
swanlab through the shared logger_backend and logger_kwargs fields.
Single backend example:
cfg = PPOConfig(
    logger_backend="wandb",
    logger_kwargs={
        "project": "apexrl",
        "entity": "your_team",
        "tags": ["ppo", "cartpole"],
    },
)
tensorboard is available in the default install. wandb and swanlab
require the optional extras from the installation guide.
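Because wandb and swanlab are optional, a script shared across machines may want to fall back to tensorboard when the preferred backend is not installed. The sketch below shows one way to do that with the standard library. `resolve_logger` is a hypothetical helper, not an ApexRL function; it only builds the `logger_backend` and `logger_kwargs` values named above. Dropping the backend-specific kwargs on fallback is an assumption, on the grounds that options such as `project` or `entity` may not apply to tensorboard.

```python
from importlib.util import find_spec

# Backends named in this guide, mapped to the module expected to provide them
# (module names are assumptions).
BACKEND_MODULES = {
    "tensorboard": "tensorboard",
    "wandb": "wandb",
    "swanlab": "swanlab",
}


def _module_available(module_name):
    """True if the top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def resolve_logger(preferred, logger_kwargs=None, is_installed=_module_available):
    """Return config fields for the preferred backend, or a tensorboard fallback."""
    if preferred not in BACKEND_MODULES:
        raise ValueError(f"unknown logger backend: {preferred!r}")

    if is_installed(BACKEND_MODULES[preferred]):
        return {
            "logger_backend": preferred,
            "logger_kwargs": dict(logger_kwargs or {}),
        }

    # Preferred backend is missing: fall back and drop backend-specific kwargs.
    return {"logger_backend": "tensorboard", "logger_kwargs": {}}


fields = resolve_logger("wandb", {"project": "apexrl", "tags": ["ppo", "cartpole"]})
print(fields["logger_backend"])
```

The returned dictionary could then be unpacked into a config, for example `PPOConfig(device="cpu", **fields)`.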
Next Steps
Read Train PPO for the standard PPO flow
Read Train DQN for the standard DQN flow
Read Train SAC for the standard SAC flow
Read Custom Network Architectures for multimodal custom actors and critics
Read Custom Environment Integration for TensorDict-based environment integration