Nisha, 16, from Hyderabad had beaten Lesson 3. Her CartPole Q-agent could balance for 500 steps. She was proud, until her classmate showed her LunarLander-v2, where a rocket must fire its thrusters to land softly on a pad. The state space was continuous, the continuous-control variant of the task had continuous actions too, and her Q-table approach was completely useless.
"Q-Learning is a value method," her teacher explained. "It learns what each action is worth in each state. Policy Gradient methods are different: they directly optimise the probability of taking good actions. That's what PPO does."
Three hours after installing stable-baselines3, Nisha's PPO agent was landing the rocket cleanly. She had moved from hand-crafted Q-tables to production-grade RL in an afternoon.
Lesson 3 covered value-based RL: learn Q(s,a), then pick the action with highest Q. This works for discrete action spaces but fails for continuous ones: you can't take an argmax over infinitely many actions.
Value-Based (Q-Learning, DQN)
- Learns Q(s,a): how good each action is
- Policy is implicit: π(s) = argmax Q(s,a)
- Works for discrete actions only
- Off-policy: can learn from old experience
- Examples: DQN, Rainbow, C51
Policy-Based (REINFORCE, PPO)
- Learns π(a|s): the probability of each action
- Policy is explicit: sample from π
- Works for continuous AND discrete actions
- On-policy: must use fresh experience
- Examples: REINFORCE, A2C, PPO (SAC also learns an explicit policy, but it is off-policy)
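To make the contrast concrete, here is a minimal sketch in plain Python. The Q-values, action probabilities, and Gaussian parameters are made-up numbers for illustration; in a real agent a table or network would produce them, and the function names are hypothetical.

```python
import random

def greedy_action(q_values):
    """Value-based: the policy is implicit, pick the index of the largest Q."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sample_discrete(probs, rng):
    """Policy-based, discrete: sample an action index from pi(a|s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_continuous(mean, std, rng):
    """Policy-based, continuous: sample a real-valued action from a Gaussian."""
    return rng.gauss(mean, std)

rng = random.Random(0)
q_row = [0.1, 0.7, 0.2]                       # hypothetical Q(s, a) for three actions
print(greedy_action(q_row))                   # prints 1, the index of the largest Q
print(sample_discrete([0.2, 0.5, 0.3], rng))  # a random action index
print(sample_continuous(0.0, 0.1, rng))       # a real number, e.g. a thruster command
```

The argmax only makes sense when the actions can be enumerated. Sampling from a distribution works in both cases, which is why policy-based methods extend naturally to continuous control.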
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. Collect a full episode, compute the return (sum of discounted rewards) at each step, then update the policy to make good actions more probable and bad actions less probable.
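The return computation is easiest to follow on a tiny episode. The sketch below uses three rewards of 1.0 and a discount of 0.5, chosen only so the arithmetic can be checked by hand; the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] using the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):   # walk backwards from the last step
        g = r + gamma * g
        returns.append(g)
    returns.reverse()             # restore chronological order
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Working backwards, the last step is worth 1.0, the middle step 1 + 0.5 * 1.0 = 1.5, and the first step 1 + 0.5 * 1.5 = 1.75. Earlier actions get credit for everything that followed them.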
REINFORCE Policy Gradient Theorem
∇J(θ) = E[ Σ_t ∇log π(a_t|s_t; θ) * G_t ]
G_t = Σ_{k=t}^{T} γ^(k-t) * r_k    (return from step t)
∇log π(...) = gradient of log probability of action taken
If G_t > 0: increase probability of action a_t in state s_t
If G_t < 0: decrease probability of action a_t in state s_t

import torch, torch.nn as nn, torch.optim as optim
import gymnasium as gym, numpy as np
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Maps a state vector to a probability for each discrete action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
policy = PolicyNet(4, 2)  # CartPole: 4 state values, 2 actions
optimizer = optim.Adam(policy.parameters(), lr=1e-3)  # named so it does not shadow the optim module
GAMMA = 0.99

for episode in range(2000):
    states, actions, rewards = [], [], []
    state, _ = env.reset()

    # --- Collect one episode ---
    done = False
    while not done:
        s_tensor = torch.FloatTensor(state)
        probs = policy(s_tensor)
        dist = Categorical(probs)
        action = dist.sample().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        states.append(s_tensor)
        actions.append(action)
        rewards.append(reward)
        done = terminated or truncated
        state = next_state

    # --- Compute discounted returns G_t ---
    G, returns = 0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalise

    # --- Compute policy gradient loss ---
    loss = 0
    for s, a, g in zip(states, actions, returns):
        probs = policy(s)
        log_prob = Categorical(probs).log_prob(torch.tensor(a))
        loss += -log_prob * g  # negative because we maximise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
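The loss `-log_prob * g` in the loop above leaves the gradient to autograd, which hides what the update actually does. For a softmax policy, the gradient of log π(a) with respect to the logits has a closed form: one-hot(a) minus the current probabilities. The sketch below applies one such step by hand in plain Python; the starting logits, the step size, and the function names are assumptions made for illustration, not part of the training script.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, g, lr):
    """One gradient-ascent step on g * log pi(action) with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    """
    probs = softmax(logits)
    return [
        z + lr * g * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0]  # hypothetical start: both actions equally likely
up = reinforce_step(logits, action=0, g=+1.0, lr=0.5)
down = reinforce_step(logits, action=0, g=-1.0, lr=0.5)
print(softmax(up)[0])    # above 0.5: a positive return made action 0 more likely
print(softmax(down)[0])  # below 0.5: a negative return made it less likely
```

The same action is pushed in opposite directions depending only on the sign of the return, which is exactly the "increase if G_t > 0, decrease if G_t < 0" rule.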
PPO (Proximal Policy Optimisation, Schulman et al. 2017) is one of the most widely used deep RL algorithms. It trains on multiple mini-batches of recent experience while preventing updates that change the policy too drastically: the "proximal" (near/close) constraint.
PPO Clipped Objective
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)    (probability ratio)
L_CLIP(θ) = E[ min(r_t*A_t, clip(r_t, 1-ε, 1+ε)*A_t) ]
A_t = advantage = Q(s,a) - V(s)    (how much better than average?)
ε (epsilon) = clipping parameter, typically 0.2
clip(r_t, 0.8, 1.2): no extra credit once the ratio moves more than 20% away from 1.0
This prevents catastrophic policy updates that destroy good behaviour

The clipping is the key insight: if an update would push the policy too far in the direction the advantage favours (ratio outside [0.8, 1.2]), the objective is clipped and that sample stops contributing gradient. This makes PPO conservative: it improves reliably without the instability of older policy gradient methods.
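A few numbers make the clipped objective less abstract. The sketch below evaluates the per-sample term min(r_t*A_t, clip(r_t, 1-ε, 1+ε)*A_t) for hand-picked ratio and advantage pairs; the function name and the test values are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# (ratio, advantage) pairs chosen for illustration
for ratio, adv in [(1.1, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(ratio, adv, clipped_surrogate(ratio, adv))
```

With a positive advantage, raising the ratio from 1.1 to 1.5 stops paying off once it passes 1.2, so there is no incentive to move further. With a negative advantage and a ratio of 0.5, the value is held at -0.8 for the same reason. The last case (ratio 1.5, advantage -1.0) is not clipped: the `min` keeps the more pessimistic value, so a change that makes a bad action more likely is still penalised in full.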
# --- PPO with Stable Baselines 3 (industry library) ---
# pip install "stable-baselines3[extra]" "gymnasium[box2d]"   (box2d is needed for LunarLander)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# --- CartPole ---
env = make_vec_env("CartPole-v1", n_envs=4)  # 4 parallel environments
model = PPO(
    "MlpPolicy",          # multi-layer perceptron policy
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,         # steps to collect per rollout per env
    batch_size=64,
    n_epochs=10,          # PPO update epochs per rollout
    gamma=0.99,
    clip_range=0.2,       # ε, the clipping parameter
    ent_coef=0.0,         # entropy bonus weight; raise above 0 to encourage exploration
)
model.learn(total_timesteps=200_000)
model.save("ppo_cartpole")

# --- LunarLander ---
# LunarLander-v2 has four discrete actions; the same PPO class also handles continuous ones.
# Note: newer Gymnasium releases register this environment as "LunarLander-v3".
lunar_env = make_vec_env("LunarLander-v2", n_envs=8)
lunar_model = PPO("MlpPolicy", lunar_env, verbose=1,
                  learning_rate=3e-4, n_steps=1024,
                  batch_size=64, n_epochs=4, gamma=0.999)
lunar_model.learn(total_timesteps=1_000_000)

# --- Evaluate ---
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(lunar_model, lunar_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
# Target: mean_reward > 200 for LunarLander-v2 = "solved"
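The rollout settings above interact, and it helps to see what they imply before launching a long run. The sketch below is back-of-the-envelope arithmetic only. It assumes each rollout gathers n_steps transitions from every parallel environment, each epoch is one pass over that rollout in mini-batches, and training proceeds in whole rollouts until the step budget is met. The helper name `ppo_budget` is hypothetical and not part of Stable Baselines 3.

```python
import math

def ppo_budget(n_envs, n_steps, batch_size, n_epochs, total_timesteps):
    """Rough update counts implied by PPO rollout settings (hypothetical helper)."""
    rollout_size = n_envs * n_steps                     # transitions per rollout
    minibatches = math.ceil(rollout_size / batch_size)  # mini-batches per epoch
    return {
        "rollout_size": rollout_size,
        "gradient_steps_per_rollout": minibatches * n_epochs,
        "rollouts": math.ceil(total_timesteps / rollout_size),
    }

print(ppo_budget(4, 2048, 64, 10, 200_000))    # CartPole settings
print(ppo_budget(8, 1024, 64, 4, 1_000_000))   # LunarLander settings
```

Both configurations collect 8192 transitions per rollout, but the CartPole run takes 1280 gradient steps on each rollout while the LunarLander run takes 512. Under these assumptions, fewer epochs per rollout means each batch of experience is reused less before fresh data is collected.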
| Algorithm | Action Space | Pros | Cons | Use Case |
|---|---|---|---|---|
| Q-Learning | Discrete, small | Simple, provably convergent, no neural network needed | Fails on large/continuous spaces | Tabular problems, teaching |
| DQN | Discrete | Scales to complex visual inputs (Atari) | No continuous actions, experience replay needed | Atari games, grid worlds |
| REINFORCE | Both | Conceptually clean, directly optimises policy | High variance, slow convergence | Teaching policy gradients |
| A2C | Both | Faster than REINFORCE (baseline reduces variance) | Less stable than PPO, synchronous updates | When PPO is overkill |
| PPO | Both | Stable, reliable, reuses each rollout for several epochs, parallel envs | On-policy (must discard old experience) | Almost everything; the default choice |
| SAC | Continuous | Off-policy, sample-efficient, entropy maximisation | Harder to tune than PPO | Robotics, continuous control |