Quick Start

This guide walks through installation and the currently recommended training entry points in ApexRL.

Installation

Install from source:

git clone https://github.com/Atticlmr/Apex_rl.git
cd Apex_rl
pip install -e .

Or, with uv:

git clone https://github.com/Atticlmr/Apex_rl.git
cd Apex_rl
uv pip install -e .

Core requirements:

Python >= 3.10
PyTorch >= 2.0
Gymnasium >= 0.29
TensorDict >= 0.6
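
To confirm that the active environment meets these minimums, a short check like the one below can help. It is a minimal sketch that uses only the standard library. The helper names `parse_version` and `check_requirements` are illustrative and not part of ApexRL, and the keys `torch`, `gymnasium`, and `tensordict` are assumed to be the PyPI distribution names of the packages listed above.

```python
import re
import sys
from importlib import metadata

# Minimum (major, minor) versions copied from the requirements list above.
# The keys are assumed to be the PyPI distribution names.
MIN_VERSIONS = {"torch": (2, 0), "gymnasium": (0, 29), "tensordict": (0, 6)}


def parse_version(text):
    """Return the leading (major, minor) pair of a version string."""
    match = re.match(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def check_requirements(minimums=MIN_VERSIONS):
    """Map each package name to 'ok', 'missing', or 'too old (<version>)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= minimum:
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    print("python:", "ok" if sys.version_info >= (3, 10) else "too old")
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Only the major and minor components are compared, which is enough for the minimums above; local build suffixes such as `+cu118` are ignored.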

Optional logging extras:

pip install -e ".[wandb]"
pip install -e ".[swanlab]"

Training Entry Points

OnPolicyRunner is the canonical entrypoint for PPO
OffPolicyRunner is the canonical entrypoint for DQN and SAC
PPO.learn(), DQN.learn(), and SAC.learn() remain available as thin wrappers

First PPO Agent

Discrete control:

import gymnasium as gym
import torch

from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.algorithms.ppo import PPOConfig
from apexrl.envs.gym_wrapper import GymVecEnv
from apexrl.models import MLPDiscreteActor, MLPCritic


def make_env():
    return gym.make("CartPole-v1")


env = GymVecEnv([make_env for _ in range(8)], device="cpu")

runner = OnPolicyRunner(
    env=env,
    cfg=PPOConfig(device="cpu", learning_rate_schedule="constant"),
    actor_class=MLPDiscreteActor,
    critic_class=MLPCritic,
    log_dir="./logs/cartpole_ppo",
    device=torch.device("cpu"),
)

runner.learn(total_timesteps=100_000)
runner.close()

Continuous control:

import gymnasium as gym
import torch

from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.algorithms.ppo import PPOConfig
from apexrl.envs.gym_wrapper import GymVecEnvContinuous
from apexrl.models import MLPActor, MLPCritic


def make_env():
    return gym.make("Pendulum-v1")


env = GymVecEnvContinuous([make_env for _ in range(8)], device="cpu")

runner = OnPolicyRunner(
    env=env,
    cfg=PPOConfig(device="cpu"),
    actor_class=MLPActor,
    critic_class=MLPCritic,
    log_dir="./logs/pendulum_ppo",
    device=torch.device("cpu"),
)

runner.learn(total_timesteps=100_000)
runner.close()

Structured Observations

The current repository version supports structured observations end to end: in the environment wrappers, buffers, algorithms, and default MLP models.

Recommended environment output format:

{
    "obs": {
        "image": image,
        "vector": vector,
    },
    "privileged_obs": {
        "state": state,
        "context": context,
    },
}

In this format:

the actor receives obs
PPO with use_asymmetric=True sends privileged_obs to the critic
SAC stores actor and critic branches separately in the replay buffer
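
The routing described above can be sketched in plain Python. This is an illustration only, not ApexRL's implementation: `route_observations` is a hypothetical helper, plain lists stand in for tensors, and the fallback where the critic reuses `obs` when `use_asymmetric` is false is an assumption.

```python
def route_observations(env_output, use_asymmetric):
    """Return (actor_input, critic_input) for one environment output.

    Hypothetical helper for illustration; not part of ApexRL.
    """
    actor_input = env_output["obs"]
    if use_asymmetric:
        # Asymmetric setup: the critic sees the privileged branch.
        critic_input = env_output["privileged_obs"]
    else:
        # Assumption: without asymmetry the critic shares the actor's view.
        critic_input = env_output["obs"]
    return actor_input, critic_input


# Plain lists stand in for the image/vector/state/context tensors.
env_output = {
    "obs": {"image": [[0.0, 0.1], [0.2, 0.3]], "vector": [1.0, 2.0]},
    "privileged_obs": {"state": [0.5, 0.5, 0.5], "context": [1.0]},
}

actor_input, critic_input = route_observations(env_output, use_asymmetric=True)
print(sorted(actor_input))   # keys seen by the actor
print(sorted(critic_input))  # keys seen by the critic
```

Keeping the two branches under separate top-level keys means the privileged inputs never need to be filtered out of what the actor receives.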

Logging Backends

Runner and algorithm configs support tensorboard, wandb, and swanlab through the shared logger_backend and logger_kwargs fields.

Single backend example:

from apexrl.algorithms.ppo import PPOConfig

cfg = PPOConfig(
    logger_backend="wandb",
    logger_kwargs={
        "project": "apexrl",
        "entity": "your_team",
        "tags": ["ppo", "cartpole"],
    },
)

tensorboard is available in the default install. wandb and swanlab require the optional extras listed under Installation.
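
Because wandb and swanlab are optional, a script may want to fall back to tensorboard when the requested backend is not installed. The sketch below uses only the standard library. `pick_backend` and `logger_settings` are hypothetical helpers rather than ApexRL functions, and the code assumes the importable module names are `wandb` and `swanlab`. Dropping the backend-specific keyword arguments on fallback is also an assumption about what tensorboard would accept.

```python
from importlib.util import find_spec

# Backends named in this guide; the optional ones need an extra package.
# Assumption: each optional backend is importable under its own name.
OPTIONAL_BACKENDS = {"wandb": "wandb", "swanlab": "swanlab"}
SUPPORTED_BACKENDS = {"tensorboard", *OPTIONAL_BACKENDS}


def pick_backend(requested):
    """Return `requested` if usable, otherwise fall back to tensorboard."""
    if requested not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown logger backend: {requested!r}")
    module_name = OPTIONAL_BACKENDS.get(requested)
    if module_name is not None and find_spec(module_name) is None:
        return "tensorboard"
    return requested


def logger_settings(requested, **logger_kwargs):
    """Build the two logger fields; kwargs are dropped on fallback."""
    backend = pick_backend(requested)
    if backend != requested:
        logger_kwargs = {}  # backend-specific options no longer apply
    return {"logger_backend": backend, "logger_kwargs": logger_kwargs}


settings = logger_settings("wandb", project="apexrl", tags=["ppo", "cartpole"])
print(settings["logger_backend"])
```

The returned dictionary uses the `logger_backend` and `logger_kwargs` field names described above, so it could be unpacked into a config, for example `PPOConfig(**settings)`.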

Next Steps

Read Train PPO for the standard PPO flow
Read Train DQN for the standard DQN flow
Read Train SAC for the standard SAC flow
Read Custom Network Architectures for multimodal custom actors and critics
Read Custom Environment Integration for TensorDict-based environment integration