Deep RL: Policy Gradients and PPO 🚀

Class 11 · Age 15–16 · Lesson 4 of 12 · Free
Figure: Nisha from Hyderabad training a PPO agent on LunarLander, with reward curves climbing from -200 to +200 over training steps

Story
Nisha Takes on LunarLander

Nisha, 16, from Hyderabad had beaten Lesson 3. Her CartPole Q-agent could balance for 500 steps. She was proud, until her classmate showed her LunarLander-v2, where a rocket must fire its thrusters to land softly on a pad. The state space was continuous, and in the continuous-control variant so were the actions, so her Q-table approach was completely useless.

"Q-Learning is a value method," her teacher explained. "It learns what states are worth. Policy Gradient methods are different: they directly optimise the probability of taking good actions. That's what PPO does."

Three hours after installing stable-baselines3, Nisha's PPO agent was landing the rocket cleanly. She had moved from hand-crafted Q-tables to production-grade RL in an afternoon.

Section 1
Value Methods vs Policy Methods

Lesson 3 covered value-based RL: learn Q(s,a), then pick the action with the highest Q. This works for discrete action spaces but fails for continuous ones: you can't take an argmax over infinitely many actions.

Value-Based (Q-Learning, DQN)

  • Learns Q(s,a): how good each action is
  • Policy is implicit: π(s) = argmax Q(s,a)
  • Works for discrete actions only
  • Off-policy: can learn from old experience
  • Examples: DQN, Rainbow, C51

Policy-Based (REINFORCE, PPO)

  • Learns π(a|s): the probability of each action
  • Policy is explicit: sample from π
  • Works for continuous AND discrete actions
  • On-policy: must use fresh experience
  • Examples: REINFORCE, A2C, PPO (SAC also learns an explicit policy, but it is off-policy)
Actor-Critic methods combine both: an Actor network learns the policy π(a|s), and a Critic network learns V(s) (state value). The Critic's value estimates reduce the variance of the Actor's gradient, so training is usually more stable than with a pure policy gradient.
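The contrast between the two families can be sketched in a few lines of plain Python. The Q-values, action probabilities and Gaussian parameters below are made-up illustration numbers, not output from a trained agent, and the helper names are hypothetical.

```python
import random

def greedy_action(q_values):
    """Value-based: pick the index of the largest Q-value (argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sample_discrete(probs, rng):
    """Policy-based, discrete: sample an action index from pi(a|s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_continuous(mean, std, rng):
    """Policy-based, continuous: sample a real-valued action from a Gaussian."""
    return rng.gauss(mean, std)

rng = random.Random(0)                     # fixed seed so runs are repeatable
q_values = [1.2, 3.4, 0.5]                 # illustrative Q(s, a) for 3 actions
probs = [0.1, 0.7, 0.2]                    # illustrative pi(a|s) for 3 actions

best = greedy_action(q_values)             # always action 1 for these numbers
sampled = sample_discrete(probs, rng)      # most often 1, sometimes 0 or 2
thrust = sample_continuous(0.3, 0.1, rng)  # a real number, typically near 0.3
print(best, sampled, thrust)
```

The greedy rule needs a finite list of actions to compare, while the Gaussian sampler needs only a mean and a standard deviation, which is why the policy-based approach carries over to continuous actions.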
Section 2
REINFORCE: The First Policy Gradient

REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. Collect a full episode, compute the return (sum of discounted rewards) at each step, then update the policy to make good actions more probable and bad actions less probable.

REINFORCE Policy Gradient Theorem

∇J(θ) = E[ Σ_t ∇log π(a_t|s_t; θ) * G_t ]

G_t = Σ_{k=t}^{T} γ^(k-t) * r_k    (return from step t)
∇log π(...) = gradient of the log probability of the action taken

If G_t > 0: increase probability of action a_t in state s_t
If G_t < 0: decrease probability of action a_t in state s_t
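As a small worked example of the return G_t defined above, the sketch below computes discounted returns for a three-step episode. The reward values and the discount of 0.5 are chosen only to keep the arithmetic easy to check by hand, and the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] where G_t = sum over k >= t of gamma^(k-t) * r_k."""
    returns = []
    g = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 * 1 = 1.5, and the first step sees 1 + 0.5 * 1.5 = 1.75.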
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Softmax(dim=-1)
        )
    def forward(self, x):
        return self.net(x)

env    = gym.make("CartPole-v1")
policy = PolicyNet(4, 2)
optim  = torch.optim.Adam(policy.parameters(), lr=1e-3)
GAMMA  = 0.99

for episode in range(2000):
    states, actions, rewards = [], [], []
    state, _ = env.reset()

    # ── Collect one episode ──────────────────────────────
    done = False
    while not done:
        s_tensor = torch.FloatTensor(state)
        probs    = policy(s_tensor)
        dist     = Categorical(probs)
        action   = dist.sample().item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        states.append(s_tensor)
        actions.append(action)
        rewards.append(reward)
        done = terminated or truncated
        state = next_state

    # ── Compute discounted returns G_t ───────────────────
    G, returns = 0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalise

    # ── Compute policy gradient loss ─────────────────────
    loss = 0
    for s, a, g in zip(states, actions, returns):
        probs = policy(s)
        log_prob = Categorical(probs).log_prob(torch.tensor(a))
        loss += -log_prob * g   # negative because we maximise

    optim.zero_grad()
    loss.backward()
    optim.step()
REINFORCE problem: high variance. Returns G_t can vary wildly between episodes, causing noisy gradient updates. Solutions: (1) subtract a baseline V(s) from G_t (this is Actor-Critic), (2) use advantage estimation (this is A2C/PPO).
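To see why a baseline helps, here is a minimal sketch on a toy two-action problem (an assumption made for illustration, not part of the lesson's CartPole code). The policy picks action 0 with probability p = sigmoid(θ), and the rewards 11 and 10 are made up. Because there are only two actions, the mean and variance of the one-sample gradient estimate can be computed exactly.

```python
def gradient_stats(p, rewards, baseline):
    """Exact mean and variance of the one-sample policy gradient estimate.

    Two-action policy with pi(a=0) = p = sigmoid(theta). The score
    d/dtheta log pi(a) is (1 - p) for action 0 and -p for action 1.
    """
    probs = [p, 1.0 - p]
    scores = [1.0 - p, -p]
    # One-sample estimate for each action: score * (return - baseline)
    samples = [s * (r - baseline) for s, r in zip(scores, rewards)]
    mean = sum(w * g for w, g in zip(probs, samples))
    var = sum(w * (g - mean) ** 2 for w, g in zip(probs, samples))
    return mean, var

rewards = [11.0, 10.0]  # illustrative returns for action 0 and action 1
no_baseline = gradient_stats(0.5, rewards, baseline=0.0)
with_baseline = gradient_stats(0.5, rewards, baseline=10.5)
print(no_baseline)    # (0.25, 27.5625)
print(with_baseline)  # (0.25, 0.0)
```

Both versions point the same way on average (0.25), but subtracting the average return removes the noise in this toy case. Real problems will not reach zero variance, though the same effect is what a learned V(s) aims for.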
Section 3
PPO: The Industry-Standard Algorithm

PPO (Proximal Policy Optimisation, Schulman et al. 2017) is one of the most widely used deep RL algorithms. It trains on multiple mini-batches of recent experience while preventing updates that change the policy too drastically: the "proximal" (near/close) constraint.

PPO Clipped Objective

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)    (probability ratio)
L_CLIP(θ) = E[ min(r_t*A_t, clip(r_t, 1-ε, 1+ε)*A_t) ]

A_t = advantage = Q(s,a) - V(s)    (how much better than average?)
ε (epsilon) = clipping parameter, typically 0.2
clip(r_t, 0.8, 1.2): the objective gives no extra credit for moving the ratio more than 20% away from 1.0
This discourages catastrophic policy updates that destroy good behaviour

The clipping is the key insight: once the ratio leaves [0.8, 1.2] in the direction that would increase the objective, the clipped term stops rewarding further movement, so that sample no longer pushes the policy. This makes PPO conservative: it improves reliably without the instability of older policy gradient methods.
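The per-sample term inside L_CLIP can be sketched in plain Python to show the four cases that matter. The ratios and advantages below are illustrative numbers only, and the function names are hypothetical rather than taken from any library.

```python
def clip(x, low, high):
    """Limit x to the interval [low, high]."""
    return max(low, min(x, high))

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# Ratio inside the band: the term is just ratio * advantage.
inside = ppo_clip_term(1.1, 2.0)
# Good action (A > 0) already made much more likely: credit is capped at 1.2 * A.
capped_up = ppo_clip_term(1.5, 2.0)
# Bad action (A < 0) already made much less likely: term is held at 0.8 * A.
capped_down = ppo_clip_term(0.5, -2.0)
# Bad action made MORE likely: no clipping relief, the full penalty applies.
penalised = ppo_clip_term(1.5, -2.0)
print(inside, capped_up, capped_down, penalised)
```

In the two capped cases the term no longer depends on the ratio, so its gradient with respect to the policy parameters is zero. In the penalised case the min picks the unclipped value, so mistakes are still corrected at full strength.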

# ── PPO with Stable Baselines 3 (industry library) ──────────────
# pip install stable-baselines3[extra] gymnasium

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# ── CartPole ─────────────────────────────────────────────────────
env = make_vec_env("CartPole-v1", n_envs=4)  # 4 parallel environments
model = PPO(
    "MlpPolicy",       # multi-layer perceptron policy
    env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=2048,      # steps to collect per rollout per env
    batch_size=64,
    n_epochs=10,       # PPO update epochs per rollout
    gamma=0.99,
    clip_range=0.2,    # ε: clipping parameter
    ent_coef=0.0,      # entropy bonus (encourage exploration)
)
model.learn(total_timesteps=200_000)
model.save("ppo_cartpole")

# ── LunarLander (continuous observations, 4 discrete actions by default) ──
# Needs Box2D: pip install "gymnasium[box2d]"
# Newer Gymnasium releases register this environment as "LunarLander-v3";
# use that ID if "LunarLander-v2" is reported as deprecated.
lunar_env = make_vec_env("LunarLander-v2", n_envs=8)
lunar_model = PPO("MlpPolicy", lunar_env, verbose=1,
                  learning_rate=3e-4, n_steps=1024,
                  batch_size=64, n_epochs=4, gamma=0.999)
lunar_model.learn(total_timesteps=1_000_000)

# ── Evaluate ─────────────────────────────────────────────────────
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(lunar_model, lunar_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
# Target: mean_reward > 200 for LunarLander-v2 = "solved"
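It can help to work out what the hyperparameters above imply before starting a long run. The sketch below is plain bookkeeping, not library code: the function name is hypothetical, and it assumes that each rollout holds n_envs * n_steps transitions, that this divides evenly by batch_size, and that training collects whole rollouts until the timestep budget is reached.

```python
import math

def ppo_update_budget(n_envs, n_steps, batch_size, n_epochs, total_timesteps):
    """Rough bookkeeping for one PPO run (a sketch under the stated assumptions)."""
    rollout_size = n_envs * n_steps           # transitions collected per rollout
    minibatches = rollout_size // batch_size  # minibatches per pass over the rollout
    grad_steps = minibatches * n_epochs       # optimiser steps per rollout
    rollouts = math.ceil(total_timesteps / rollout_size)  # whole rollouts needed
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_rollout": grad_steps,
        "rollouts": rollouts,
    }

# Settings copied from the CartPole and LunarLander configurations above
cartpole = ppo_update_budget(4, 2048, 64, 10, 200_000)
lunar = ppo_update_budget(8, 1024, 64, 4, 1_000_000)
print(cartpole)
print(lunar)
```

Both configurations collect 8192 transitions per rollout, but the CartPole settings take 1280 optimiser steps on each rollout while the LunarLander settings take 512, because n_epochs drops from 10 to 4.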
Section 4
Algorithm Comparison: When to Use What
Algorithm | Action Space | Pros | Cons | Use Case
--- | --- | --- | --- | ---
Q-Learning | Discrete, small | Simple, provably convergent, no neural network needed | Fails on large/continuous spaces | Tabular problems, teaching
DQN | Discrete | Scales to complex visual inputs (Atari) | No continuous actions, experience replay needed | Atari games, grid worlds
REINFORCE | Both | Conceptually clean, directly optimises policy | High variance, slow convergence | Teaching policy gradients
A2C | Both | Faster than REINFORCE (baseline reduces variance) | Less stable than PPO, synchronous updates | When PPO is overkill
PPO | Both | Stable, reliable, reuses experience, parallel envs | On-policy (must discard old experience) | Almost everything: the default choice
SAC | Continuous | Off-policy, sample-efficient, entropy maximisation | Harder to tune than PPO | Robotics, continuous control
When NOT to use RL: If you have labelled data, use supervised learning, which is far simpler to train and debug. RL shines when (1) you can simulate the environment cheaply, (2) the reward signal is clear, and (3) there are no labelled examples. Don't use RL to classify images; train a CNN instead.

🚀 Lesson 4 Quiz: Deep RL and PPO

1. Policy gradient methods learn π(a|s) directly rather than Q(s,a). The key advantage over Q-Learning for continuous control (robot joints) is:
a) Policy networks train 10x faster than Q-networks on all hardware
b) For continuous actions, taking argmax over Q(s,a) is impossible: there are infinitely many possible actions. A policy network outputs action probabilities or a Gaussian distribution, from which you sample. No argmax required, so it scales naturally to continuous action spaces like joint torques.
c) Policy methods require less memory because they don't store a Q-table
d) Policy gradient theorem is mathematically simpler than the Bellman equation
2. REINFORCE has high variance because:
a) It uses random initialisation for the policy network
b) The return G_t is computed from a single episode's rewards, which vary significantly between episodes due to environment stochasticity and exploration. A single lucky (or unlucky) trajectory can cause a large policy update in the wrong direction. Actor-Critic reduces this by replacing G_t with an advantage estimate A_t = G_t - V(s_t).
c) The Softmax output layer adds multiplicative noise to gradients
d) REINFORCE uses a different optimiser (SGD) that causes high variance by design
3. The PPO clipping parameter ε=0.2 prevents updates where r_t (the ratio new/old policy) exceeds 1.2 or falls below 0.8. This is necessary because:
a) Ratios outside [0.8, 1.2] cause numerical overflow in floating point arithmetic
b) Large policy updates can catastrophically destroy previously learned behaviour. If the policy changes drastically in one update, the agent enters a bad region and may never recover: because PPO uses on-policy data, the next rollout will come from this now-worse policy. The clip constrains each update to a neighbourhood of the current policy.
c) The gym environment raises an error when policy ratios exceed ±0.2
d) Clipping ensures the advantage function A_t is always normalised to [-1, 1]
4. PPO uses n_envs=4 (4 parallel environments) instead of a single environment. The reason is:
a) Gymnasium requires exactly 4 environments for correct physics simulation
b) Collecting experience from multiple environments simultaneously reduces the correlation between consecutive samples (more diverse rollout batch), allows more transitions per wall-clock second, and stabilises gradient estimates. Each environment independently explores the state space.
c) 4 environments exactly matches a modern 4-core CPU for maximum efficiency
d) Parallel environments allow PPO to be used as an off-policy algorithm
5. In the Actor-Critic architecture, the Advantage A_t = Q(s,a) - V(s) measures:
a) The absolute value of the reward received at each step
b) How much better (or worse) taking action a in state s is compared to the average action in that state. A_t > 0 means "this action was better than average, so increase its probability." A_t < 0 means "this action was worse than average, so decrease its probability." This centering reduces gradient variance.
c) The difference in reward between two parallel environments
d) The error in the critic's value estimate compared to the true Q-function
6. SAC (Soft Actor-Critic) is preferred over PPO for robotics with continuous actions because:
a) SAC does not require a GPU; it runs entirely on CPU
b) SAC is off-policy: it learns from a replay buffer of past experience, making it much more sample-efficient than PPO (which discards experience after each update). Physical robots are expensive to run; you want to extract maximum learning from every second of real-world interaction.
c) SAC was invented specifically for robotics and cannot be applied to other domains
d) PPO's clipping breaks continuous action distributions, so SAC is the only option
7. Stable Baselines3's PPO uses n_epochs=10, meaning it performs 10 gradient update passes on each collected rollout batch. This reuse of data is possible (compared to REINFORCE which uses each sample once) because:
a) Stable Baselines3 uses gradient checkpointing to avoid memory overflow during 10 passes
b) The PPO clipping constraint limits how much the policy can change per pass, so the data remains "approximately on-policy" even after multiple updates. Without clipping, updating 10 times would make the data stale and lead to divergence, because the importance-sampling ratio r_t would no longer be close to 1.
c) Each of the 10 epochs uses a completely different neural network initialisation
d) n_epochs=10 refers to 10 full environment training runs, not gradient passes
8. You want to train a trading bot that decides buy/sell/hold every minute based on 50 market features. The reward is daily P&L. The correct algorithm choice is:
a) Q-Learning with a Q-table (50 features × 3 actions)
b) PPO with an MLP policy: it handles the continuous state space (50 features), discrete action space (3 actions), and delayed sparse rewards (daily P&L). The on-policy constraint means experience collected with old policies is discarded, which is acceptable since the market environment is non-stationary anyway.
c) REINFORCE: simpler algorithms are always better in finance
d) Supervised learning with labelled buy/sell/hold decisions from a profitable human trader