Vikram, 16, from Jaipur was obsessed with chess. He had read that DeepMind's AlphaZero had taught itself chess from scratch โ no human games, just the rules โ and became the strongest chess player in history in 24 hours. He wanted to understand how.
"AlphaZero uses deep reinforcement learning," his computer science teacher explained. "But let's start somewhere simpler. Can you teach a computer to balance a pole on a cart?"
Vikram spent three days building a Q-learning agent for CartPole, a classic RL benchmark. His first agent was terrible โ the pole fell immediately. But after tuning epsilon decay and the learning rate, his agent could balance the pole for 500 steps โ the maximum possible. That's when he understood why DeepMind could conquer chess.
Reinforcement Learning is different from supervised learning. There are no labelled examples. Instead, an agent takes actions in an environment, observes the resulting state, and receives a reward signal. It learns a policy โ a mapping from states to actions โ that maximises cumulative reward over time.
| RL Term | CartPole Example | Real-World AI Example |
|---|---|---|
| State (s) | Cart position, velocity, pole angle, angular velocity (4 floats) | Chess board position (8ร8 grid with piece encoding) |
| Action (a) | Push cart left or right (2 discrete actions) | Place a chess piece (up to 4672 legal moves) |
| Reward (r) | +1 for every step pole stays upright, -100 for falling | +1 win, 0 draw, -1 loss (only at game end) |
| Policy (ฯ) | Given current state, which direction to push | Given board position, which move to make |
| Episode | One game โ from reset until pole falls or 500 steps | One chess game โ from start to checkmate |
An MDP is the formal mathematical framework for RL problems. It has five components: (S, A, P, R, ฮณ) where S is states, A is actions, P is transition probabilities, R is rewards, and ฮณ is discount factor.
The Markov Property is the key assumption: the next state depends only on the current state and action โ not on any history. For CartPole, (position, velocity, angle, angular velocity) captures all the information you need. Past positions don't matter if you know the current velocity.
The Bellman Equation โ the heart of RL
Q(s, a) = R(s, a) + ฮณ * max_a'[ Q(s', a') ] Q(s, a) = value of taking action a in state s R(s, a) = immediate reward received ฮณ (gamma) = discount factor (0โ1): how much future rewards matter max_a'[Q(s',a')] = best Q-value in the next state (greedy)Intuitively: the value of an action = immediate reward + discounted best future value. If ฮณ=0, the agent only cares about immediate rewards (short-sighted). If ฮณ=1, it values future rewards equally to current ones (far-sighted).
Q-Learning maintains a table Q[state][action] of estimated values. After each step, it updates using the Bellman equation. Over many episodes, Q-values converge to their true values.
import numpy as np
import gymnasium as gym # pip install gymnasium
# โโ CartPole with Q-Learning (discretised state space) โโโโโโโโโโ
env = gym.make("CartPole-v1")
# CartPole state is continuous โ we must discretise it for Q-table
# State: [cart_pos, cart_vel, pole_angle, pole_vel]
N_BINS = 10 # bins per dimension
BINS = [
np.linspace(-4.8, 4.8, N_BINS), # cart position
np.linspace(-4, 4, N_BINS), # cart velocity
np.linspace(-0.418, 0.418,N_BINS), # pole angle (radians)
np.linspace(-4, 4, N_BINS), # pole angular velocity
]
def discretise(state):
"""Convert continuous state to discrete bin indices."""
return tuple(
int(np.digitize(s, bins) - 1)
for s, bins in zip(state, BINS)
)
# Q-table: shape (10, 10, 10, 10, 2) โ 10^4 states ร 2 actions
Q = np.zeros([N_BINS]*4 + [env.action_space.n])
# Hyperparameters
ALPHA = 0.1 # learning rate
GAMMA = 0.99 # discount factor
EPSILON = 1.0 # start fully exploratory
EPS_MIN = 0.01 # minimum exploration
EPS_DECAY = 0.995
EPISODES = 5000
for ep in range(EPISODES):
state, _ = env.reset()
s = discretise(state)
total_reward = 0
for step in range(500): # max 500 steps per episode
# Epsilon-greedy action selection
if np.random.random() < EPSILON:
action = env.action_space.sample() # explore
else:
action = np.argmax(Q[s]) # exploit
next_state, reward, terminated, truncated, _ = env.step(action)
ns = discretise(next_state)
# Penalise termination heavily
if terminated:
reward = -100
# Q-Learning update (Bellman equation)
best_next = np.max(Q[ns])
Q[s][action] += ALPHA * (reward + GAMMA * best_next - Q[s][action])
s = ns
total_reward += reward
if terminated or truncated:
break
# Decay epsilon
EPSILON = max(EPS_MIN, EPSILON * EPS_DECAY)
if ep % 500 == 0:
print(f"Episode {ep}: reward={total_reward:.1f}, eps={EPSILON:.3f}")
print("Training complete! Q-table converged.")
env.close()
Every RL agent faces this dilemma: should it exploit what it already knows (pick the action with highest Q-value) or explore new actions that might be better?
- Pure exploitation (ฮต=0): Gets stuck in local optima. Misses better strategies it never tried.
- Pure exploration (ฮต=1): Random policy. Never uses what it learned.
- Epsilon-greedy decay: Start exploratory, gradually shift to exploitative as the Q-table improves.
# Epsilon decay strategies
import math
# โโ Linear decay โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# eps = max(EPS_MIN, EPS_START - episode * (EPS_START - EPS_MIN) / TOTAL_EPISODES)
# โโ Exponential decay (more common) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# eps = max(EPS_MIN, EPS_START * EPS_DECAY**episode)
# โโ Episode-aware schedule โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
def get_epsilon(episode, total=5000, start=1.0, end=0.01):
"""Linearly decay from start to end over total episodes."""
return max(end, start - (start - end) * episode / total)
# โโ Epsilon values at key points:
print(f"Episode 0: ฮต = {get_epsilon(0):.3f}") # 1.000 (fully random)
print(f"Episode 500: ฮต = {get_epsilon(500):.3f}") # 0.911
print(f"Episode 2500: ฮต = {get_epsilon(2500):.3f}") # 0.506
print(f"Episode 4500: ฮต = {get_epsilon(4500):.3f}") # 0.110
print(f"Episode 5000: ฮต = {get_epsilon(5000):.3f}") # 0.010 (mostly greedy)
Q-tables work beautifully for small, discretisable state spaces like CartPole (10^4 states). But they break down for real problems:
- Chess: ~10^43 possible positions โ a Q-table would require more memory than atoms in the universe
- Atari Pong: 84ร84 pixel frames = 256^7056 possible states
- Continuous control: Robot arm with 7 joints โ infinite continuous state space
The solution: replace the Q-table with a neural network that approximates Q(s, a) โ this is Deep Q-Network (DQN), which we'll build in Lesson 4. The network takes state as input and outputs Q-values for all actions. It generalises across states, so it never needs to have seen the exact state before.
# The key insight: Q-table vs DQN
# Q-table:
Q[s][a] = Q[s][a] + alpha * (reward + gamma*max_Q_next - Q[s][a])
# Problem: requires explicit state lookup โ impossible for continuous/large spaces
# DQN โ replace lookup with neural network:
# Q_network(state) โ [Q(s,a1), Q(s,a2), ..., Q(s,aN)]
# Loss = MSE( Q_network(s)[a], target )
# target = reward + gamma * max( Q_network(next_state) )
# The gradient of this MSE loss trains the network to estimate Q-values
# Without ever building an explicit table!