Your First RL Agent
This tutorial walks you through creating and training your first reinforcement learning agent using ApexRL.
Overview
By the end of this tutorial, you will:
Understand the basic components of ApexRL
Create a vectorized environment
Configure and train a PPO agent
Evaluate and save your trained model
Prerequisites
Make sure ApexRL is installed. From the root of the ApexRL source checkout, install it in editable mode:
pip install -e .
Step 1: Import Libraries
import gymnasium as gym
import torch
from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.envs.gym_wrapper import GymVecEnvContinuous
from apexrl.models.mlp import MLPActor, MLPCritic
Step 2: Create the Environment
ApexRL uses vectorized environments for parallel training. Let’s create 8 parallel instances of Pendulum-v1:
def make_env():
    """Factory function to create a single environment."""
    return gym.make("Pendulum-v1")

# Create 8 parallel environments
num_envs = 8
env = GymVecEnvContinuous([make_env for _ in range(num_envs)], device="cpu")
print(f"Number of environments: {env.num_envs}")
print(f"Observation dimension: {env.num_obs}")
print(f"Action dimension: {env.num_actions}")
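To make the idea of a vectorized environment concrete, here is a minimal pure-Python sketch of stepping several environment copies in lockstep. `ToyEnv` and `ToyVecEnv` are hypothetical names invented for this illustration. The sketch is not how GymVecEnvContinuous is implemented, and it leaves out tensors, devices, and the Gymnasium API entirely.

```python
import random


class ToyEnv:
    """Hypothetical one-dimensional environment, for illustration only."""

    def __init__(self, seed, max_steps=5):
        self.rng = random.Random(seed)
        self.max_steps = max_steps
        self.state = 0.0
        self.t = 0

    def reset(self):
        self.state = self.rng.uniform(-1.0, 1.0)
        self.t = 0
        return self.state

    def step(self, action):
        self.state += action
        self.t += 1
        reward = -abs(self.state)  # closer to zero is better
        done = self.t >= self.max_steps
        return self.state, reward, done


class ToyVecEnv:
    """Steps a batch of environments together and resets finished ones."""

    def __init__(self, env_fns):
        # Each factory is called once, mirroring the list of make_env callables.
        self.envs = [fn() for fn in env_fns]
        self.num_envs = len(self.envs)

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                o = env.reset()  # start a new episode so the batch stays full
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


vec_env = ToyVecEnv([lambda i=i: ToyEnv(seed=i) for i in range(8)])
observations = vec_env.reset()
observations, rewards, dones = vec_env.step([0.1] * vec_env.num_envs)
print(len(observations), len(rewards), len(dones))
```

Each call to `step` takes one action per environment and returns one observation, reward, and done flag per environment, which is what lets a single policy forward pass drive all copies at once.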
Step 3: Configure the Runner
The OnPolicyRunner handles the training loop, logging, and checkpointing.
It is also the canonical training entrypoint for PPO:
runner = OnPolicyRunner(
    env=env,
    algorithm="ppo",             # Algorithm to use
    actor_class=MLPActor,        # Actor network class
    critic_class=MLPCritic,      # Critic network class
    log_dir="./logs",            # TensorBoard log directory
    save_dir="./checkpoints",    # Checkpoint save directory
    log_interval=10,             # Log every 10 iterations
    save_interval=100,           # Save every 100 iterations
)
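The two interval arguments can be read as "do this every N iterations". The sketch below assumes an action fires whenever the iteration index is a multiple of the interval, which matches the sample output in Step 4 (log lines at iterations 0 and 10). `fires_at` is a hypothetical helper, not part of ApexRL, and the runner's actual scheduling may differ, for example by always saving at the final iteration.

```python
def fires_at(iteration, interval):
    """Return True when an every-`interval` action would run (assumed rule)."""
    return interval > 0 and iteration % interval == 0


log_interval = 10
save_interval = 100
total_iterations = 520  # iteration count taken from the sample output in Step 4

log_iters = [i for i in range(total_iterations) if fires_at(i, log_interval)]
save_iters = [i for i in range(total_iterations) if fires_at(i, save_interval)]

print(log_iters[:3])
print(save_iters)
```

Under this assumption, a smaller `save_interval` gives more frequent checkpoints at the cost of extra disk writes.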
Step 4: Train the Agent
Train the agent for a specified number of timesteps:
# Train for 100,000 timesteps
runner.learn(total_timesteps=100_000)
During training, you’ll see output like:
Training for 520 iterations (104,000 steps)
Iter 0/520 | Steps 0 | FPS 0 | Policy Loss -0.0012 | Value Loss 0.0234 | KL 0.0012
Iter 10/520 | Steps 1,920 | FPS 3421 | Policy Loss -0.0023 | Value Loss 0.0187 | KL 0.0008 | Reward -456.23
...
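If you want to post-process console output, lines in the format shown above can be split into numeric fields. The following is a sketch that assumes the exact `Iter i/n | Name value | ...` layout of the sample lines; `parse_log_line` is a hypothetical helper and the real log format may vary between versions.

```python
import re


def parse_log_line(line):
    """Split an 'Iter i/n | Name value | ...' line into a dict of numbers."""
    fields = [part.strip() for part in line.split("|")]
    match = re.fullmatch(r"Iter (\d+)/(\d+)", fields[0])
    if match is None:
        raise ValueError(f"not a training log line: {line!r}")
    parsed = {"iter": int(match.group(1)), "total_iters": int(match.group(2))}
    for field in fields[1:]:
        # The value is the last space-separated token; the rest is the name.
        name, _, value = field.rpartition(" ")
        parsed[name] = float(value.replace(",", ""))
    return parsed


sample = (
    "Iter 10/520 | Steps 1,920 | FPS 3421 | Policy Loss -0.0023 "
    "| Value Loss 0.0187 | KL 0.0008 | Reward -456.23"
)
metrics = parse_log_line(sample)
print(metrics["Steps"], metrics["Reward"])
```

For anything beyond quick inspection, the TensorBoard logs written to `log_dir` are likely a more robust source of metrics than parsed console text.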
Step 5: Evaluate the Agent
Evaluate the trained agent:
eval_stats = runner.eval(num_episodes=10)
print(f"Mean reward: {eval_stats['eval/mean_reward']:.2f}")
print(f"Std reward: {eval_stats['eval/std_reward']:.2f}")
print(f"Min reward: {eval_stats['eval/min_reward']:.2f}")
print(f"Max reward: {eval_stats['eval/max_reward']:.2f}")
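The four statistics above are presumably aggregates over the per-episode returns. As a sketch of what such a summary looks like, the snippet below computes them from a list of returns. `summarize_returns` is a hypothetical helper, the use of the population standard deviation is an assumption, and the return values are made up for illustration rather than taken from a real run.

```python
import statistics


def summarize_returns(episode_returns):
    """Aggregate per-episode returns into mean/std/min/max (assumed definitions)."""
    return {
        "eval/mean_reward": statistics.fmean(episode_returns),
        # Population standard deviation; the library may use the sample version.
        "eval/std_reward": statistics.pstdev(episode_returns),
        "eval/min_reward": min(episode_returns),
        "eval/max_reward": max(episode_returns),
    }


# Made-up returns for illustration; these are not results from a real run.
returns = [-200.0, -150.0, -250.0, -200.0]
stats = summarize_returns(returns)
print(f"Mean reward: {stats['eval/mean_reward']:.2f}")
print(f"Std reward: {stats['eval/std_reward']:.2f}")
```

A large gap between the minimum and maximum relative to the mean usually suggests evaluating over more episodes before comparing runs.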
Step 6: Save and Load
Save the trained model:
runner.save_checkpoint("final_model.pt")
Load a saved model:
runner.load_checkpoint("final_model.pt")
Complete Code
Here’s the complete training script:
import gymnasium as gym
from apexrl.agent.on_policy_runner import OnPolicyRunner
from apexrl.envs.gym_wrapper import GymVecEnvContinuous
from apexrl.models.mlp import MLPActor, MLPCritic


def main():
    # Create environment
    def make_env():
        return gym.make("Pendulum-v1")

    env = GymVecEnvContinuous([make_env for _ in range(8)], device="cpu")

    # Create runner
    runner = OnPolicyRunner(
        env=env,
        algorithm="ppo",
        actor_class=MLPActor,
        critic_class=MLPCritic,
        log_dir="./logs",
    )

    # Train
    runner.learn(total_timesteps=100_000)

    # Evaluate
    stats = runner.eval(num_episodes=10)
    print(f"Final mean reward: {stats['eval/mean_reward']:.2f}")

    # Save
    runner.save_checkpoint("pendulum_model.pt")
    env.close()


if __name__ == "__main__":
    main()
Visualizing Training
View training metrics with TensorBoard:
tensorboard --logdir=./logs
Open your browser at http://localhost:6006 to see:
Episode rewards
Policy and value losses
KL divergence
Gradient norms
Learning rate schedule
PPO.learn() remains available, but it delegates to the same
OnPolicyRunner implementation shown here.
Next Steps
Follow Train PPO for a dedicated PPO tutorial
Follow Train DQN for a dedicated DQN tutorial
Follow Train SAC for a dedicated SAC tutorial
Learn about Custom Environment Integration
Explore Custom Network Architectures
Read about advanced features in Algorithms