Train SAC¶
This tutorial shows the current SAC workflow in ApexRL.
Overview¶
Recommended stack:
GymVecEnvContinuous for continuous Gymnasium tasks
OffPolicyRunner as the canonical training entrypoint
MLPSquashedGaussianActor as the default actor
MLPContinuousQNetwork as the default twin-critic baseline
Standard Example¶
import gymnasium as gym
import torch

from apexrl.agent.off_policy_runner import OffPolicyRunner
from apexrl.algorithms.sac import SACConfig
from apexrl.envs.gym_wrapper import GymVecEnvContinuous


def make_env():
    return gym.make("Pendulum-v1")


env = GymVecEnvContinuous([make_env for _ in range(2)], device="cpu")

cfg = SACConfig(
    batch_size=256,
    buffer_size=100_000,
    learning_starts=5_000,
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    alpha_learning_rate=3e-4,
    tau=0.005,
    device="cpu",
)

runner = OffPolicyRunner(
    env=env,
    cfg=cfg,
    algorithm="sac",
    log_dir="./logs/sac_pendulum",
    save_dir="./checkpoints/sac_pendulum",
    device=torch.device("cpu"),
)

runner.learn(total_timesteps=100_000)
print(runner.eval(num_episodes=10))
runner.close()
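The tau=0.005 setting in the configuration above is the coefficient SAC conventionally uses for the soft (Polyak) target-network update. The following is a minimal sketch of that rule on plain Python floats, assuming ApexRL follows the standard formulation; soft_update is a hypothetical helper for illustration, not an ApexRL function.

```python
def soft_update(target_params, online_params, tau):
    """Return target parameters moved a fraction `tau` toward the online ones."""
    # Hypothetical helper for illustration; not part of ApexRL.
    # Standard Polyak rule: target <- (1 - tau) * target + tau * online.
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 3.0]
target = soft_update(target, online, tau=0.005)
```

A small tau keeps the target critics changing slowly, which is the usual reason for values around 0.005.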
Structured Observations¶
SAC still requires Box actions, but observations no longer need to be a single flat tensor.
The current implementation supports structured actor observations and optional critic-only privileged observations:
{
    "obs": {
        "image": image,
        "vector": vector,
    },
    "privileged_obs": {
        "state": state,
        "context": context,
    },
}
Internally, SAC now:
sends obs to the actor
sends privileged_obs to both critics when present
stores actor and critic branches separately in replay
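The routing described above can be sketched in plain Python. This is an illustration only: route_observations is a hypothetical helper, not part of ApexRL, and the fallback to obs for the critics when privileged_obs is absent is an assumption about the intended behavior.

```python
def route_observations(observation):
    """Split an observation dict into (actor_input, critic_input)."""
    # Hypothetical helper for illustration; not part of ApexRL.
    actor_input = observation["obs"]
    privileged = observation.get("privileged_obs")
    # Assumption: critics fall back to the actor observation when no
    # privileged branch is provided.
    critic_input = actor_input if privileged is None else privileged
    return actor_input, critic_input


sample = {
    "obs": {"image": [[0.0]], "vector": [0.1, 0.2]},
    "privileged_obs": {"state": [1.0], "context": [2.0]},
}
actor_in, critic_in = route_observations(sample)

# Without a privileged branch, both outputs refer to the same observation.
plain_actor_in, plain_critic_in = route_observations({"obs": sample["obs"]})
```

Keeping the two branches separate means the actor never depends on information that is unavailable at deployment time, while the critics can still use it during training.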
Notes¶
SAC supports Box action spaces
observations can be plain tensors or structured TensorDict / nested dict trees
the default policy is a squashed Gaussian actor, unlike PPO’s unsquashed Gaussian
OffPolicyRunner remains the preferred entrypoint
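To make the squashed Gaussian note concrete, here is a minimal scalar sketch of the standard tanh-squashed sampling step and its log-probability correction. It assumes the textbook SAC formulation rather than describing MLPSquashedGaussianActor itself; sample_squashed_gaussian and the 1e-6 stabilizer are illustrative choices.

```python
import math
import random


def sample_squashed_gaussian(mean, std, rng):
    """Sample a tanh-squashed Gaussian action and its log-probability."""
    # Hypothetical helper for illustration; not part of ApexRL.
    u = rng.gauss(mean, std)  # pre-squash Gaussian sample
    action = math.tanh(u)  # squashed into (-1, 1)
    # Log-density of the Gaussian evaluated at u.
    log_prob_u = (
        -0.5 * ((u - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change-of-variables correction: d tanh(u) / du = 1 - tanh(u)^2.
    log_prob = log_prob_u - math.log(1.0 - action ** 2 + 1e-6)
    return action, log_prob


action, log_prob = sample_squashed_gaussian(mean=0.0, std=1.0, rng=random.Random(0))
```

An unsquashed Gaussian skips the tanh and the correction term, so its samples are unbounded. With squashing, the bounded action is typically rescaled to the limits of the environment's Box space.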
Next Steps¶
Read Custom Network Architectures for custom actor / critic implementations
Read Custom Environment Integration for structured observation environment design
Read Algorithms for SAC-specific details