RL: MDPs and Q-Learning ๐ŸŽฎ

Class 11Age 15โ€“16Lesson 3 of 12๐Ÿ†“ Free
Vikram from Jaipur training a CartPole RL agent โ€” Q-table grids, reward curves, and exploration vs exploitation diagrams
Watch first - 2-3 minutes

Class 11 Lesson 3 - RL: MDPs and Q-Learning

No sign-in needed - English narration - Safe for all school ages

Story
Vikram's Game-Playing Agent

Vikram, 16, from Jaipur was obsessed with chess. He had read that DeepMind's AlphaZero had taught itself chess from scratch โ€” no human games, just the rules โ€” and became the strongest chess player in history in 24 hours. He wanted to understand how.

"AlphaZero uses deep reinforcement learning," his computer science teacher explained. "But let's start somewhere simpler. Can you teach a computer to balance a pole on a cart?"

Vikram spent three days building a Q-learning agent for CartPole, a classic RL benchmark. His first agent was terrible โ€” the pole fell immediately. But after tuning epsilon decay and the learning rate, his agent could balance the pole for 500 steps โ€” the maximum possible. That's when he understood why DeepMind could conquer chess.

Section 1
The RL Framework: Agent, Environment, Reward

Reinforcement Learning is different from supervised learning. There are no labelled examples. Instead, an agent takes actions in an environment, observes the resulting state, and receives a reward signal. It learns a policy โ€” a mapping from states to actions โ€” that maximises cumulative reward over time.

๐Ÿค– Agent
selects action
โ†’
๐ŸŒ Environment
CartPole, chess, factory
โ†’
๐Ÿ“Š State + Reward
new observation + +1 or -100
โ†’
๐Ÿค– Agent learns
updates Q-values
RL TermCartPole ExampleReal-World AI Example
State (s)Cart position, velocity, pole angle, angular velocity (4 floats)Chess board position (8ร—8 grid with piece encoding)
Action (a)Push cart left or right (2 discrete actions)Place a chess piece (up to 4672 legal moves)
Reward (r)+1 for every step pole stays upright, -100 for falling+1 win, 0 draw, -1 loss (only at game end)
Policy (ฯ€)Given current state, which direction to pushGiven board position, which move to make
EpisodeOne game โ€” from reset until pole falls or 500 stepsOne chess game โ€” from start to checkmate
Key difference from supervised learning: In SL, you get immediate feedback "this label was wrong." In RL, you may not know if an action was good until 100 steps later. This is the credit assignment problem โ€” which of the 100 past actions deserves credit for this reward?
Section 2
Markov Decision Processes (MDPs)

An MDP is the formal mathematical framework for RL problems. It has five components: (S, A, P, R, ฮณ) where S is states, A is actions, P is transition probabilities, R is rewards, and ฮณ is discount factor.

The Markov Property is the key assumption: the next state depends only on the current state and action โ€” not on any history. For CartPole, (position, velocity, angle, angular velocity) captures all the information you need. Past positions don't matter if you know the current velocity.

The Bellman Equation โ€” the heart of RL

Q(s, a) = R(s, a) + ฮณ * max_a'[ Q(s', a') ] Q(s, a) = value of taking action a in state s R(s, a) = immediate reward received ฮณ (gamma) = discount factor (0โ€“1): how much future rewards matter max_a'[Q(s',a')] = best Q-value in the next state (greedy)

Intuitively: the value of an action = immediate reward + discounted best future value. If ฮณ=0, the agent only cares about immediate rewards (short-sighted). If ฮณ=1, it values future rewards equally to current ones (far-sighted).

Section 3
Q-Learning: Tabular Solution

Q-Learning maintains a table Q[state][action] of estimated values. After each step, it updates using the Bellman equation. Over many episodes, Q-values converge to their true values.

import numpy as np
import gymnasium as gym   # pip install gymnasium

# โ”€โ”€ CartPole with Q-Learning (discretised state space) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
env = gym.make("CartPole-v1")

# CartPole state is continuous โ€” we must discretise it for Q-table
# State: [cart_pos, cart_vel, pole_angle, pole_vel]
N_BINS = 10  # bins per dimension
BINS = [
    np.linspace(-4.8,   4.8,  N_BINS),   # cart position
    np.linspace(-4,     4,    N_BINS),   # cart velocity
    np.linspace(-0.418, 0.418,N_BINS),   # pole angle (radians)
    np.linspace(-4,     4,    N_BINS),   # pole angular velocity
]

def discretise(state):
    """Convert continuous state to discrete bin indices."""
    return tuple(
        int(np.digitize(s, bins) - 1)
        for s, bins in zip(state, BINS)
    )

# Q-table: shape (10, 10, 10, 10, 2) โ€” 10^4 states ร— 2 actions
Q = np.zeros([N_BINS]*4 + [env.action_space.n])

# Hyperparameters
ALPHA   = 0.1    # learning rate
GAMMA   = 0.99   # discount factor
EPSILON = 1.0    # start fully exploratory
EPS_MIN = 0.01   # minimum exploration
EPS_DECAY = 0.995

EPISODES = 5000

for ep in range(EPISODES):
    state, _ = env.reset()
    s = discretise(state)
    total_reward = 0

    for step in range(500):   # max 500 steps per episode
        # Epsilon-greedy action selection
        if np.random.random() < EPSILON:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[s])            # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        ns = discretise(next_state)

        # Penalise termination heavily
        if terminated:
            reward = -100

        # Q-Learning update (Bellman equation)
        best_next = np.max(Q[ns])
        Q[s][action] += ALPHA * (reward + GAMMA * best_next - Q[s][action])

        s = ns
        total_reward += reward
        if terminated or truncated:
            break

    # Decay epsilon
    EPSILON = max(EPS_MIN, EPSILON * EPS_DECAY)

    if ep % 500 == 0:
        print(f"Episode {ep}: reward={total_reward:.1f}, eps={EPSILON:.3f}")

print("Training complete! Q-table converged.")
env.close()
Section 4
Exploration vs Exploitation: The Core Tradeoff

Every RL agent faces this dilemma: should it exploit what it already knows (pick the action with highest Q-value) or explore new actions that might be better?

# Epsilon decay strategies
import math

# โ”€โ”€ Linear decay โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# eps = max(EPS_MIN, EPS_START - episode * (EPS_START - EPS_MIN) / TOTAL_EPISODES)

# โ”€โ”€ Exponential decay (more common) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# eps = max(EPS_MIN, EPS_START * EPS_DECAY**episode)

# โ”€โ”€ Episode-aware schedule โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
def get_epsilon(episode, total=5000, start=1.0, end=0.01):
    """Linearly decay from start to end over total episodes."""
    return max(end, start - (start - end) * episode / total)

# โ”€โ”€ Epsilon values at key points:
print(f"Episode    0: ฮต = {get_epsilon(0):.3f}")     # 1.000 (fully random)
print(f"Episode  500: ฮต = {get_epsilon(500):.3f}")   # 0.911
print(f"Episode 2500: ฮต = {get_epsilon(2500):.3f}")  # 0.506
print(f"Episode 4500: ฮต = {get_epsilon(4500):.3f}")  # 0.110
print(f"Episode 5000: ฮต = {get_epsilon(5000):.3f}")  # 0.010 (mostly greedy)
Rule of thumb: Use EPS_DECAY โ‰ˆ 0.995 for problems that need ~1000 episodes to solve, and 0.9999 for problems needing 100,000+ episodes. If your agent never improves, epsilon might be decaying too fast (not enough exploration). If it learns slowly, epsilon decays too slowly.
Section 5
Limits of Tabular Q-Learning: Why We Need Deep RL

Q-tables work beautifully for small, discretisable state spaces like CartPole (10^4 states). But they break down for real problems:

The solution: replace the Q-table with a neural network that approximates Q(s, a) โ€” this is Deep Q-Network (DQN), which we'll build in Lesson 4. The network takes state as input and outputs Q-values for all actions. It generalises across states, so it never needs to have seen the exact state before.

# The key insight: Q-table vs DQN
# Q-table:
Q[s][a] = Q[s][a] + alpha * (reward + gamma*max_Q_next - Q[s][a])
# Problem: requires explicit state lookup โ€” impossible for continuous/large spaces

# DQN โ€” replace lookup with neural network:
# Q_network(state) โ†’ [Q(s,a1), Q(s,a2), ..., Q(s,aN)]
# Loss = MSE( Q_network(s)[a], target )
# target = reward + gamma * max( Q_network(next_state) )

# The gradient of this MSE loss trains the network to estimate Q-values
# Without ever building an explicit table!
This is the bridge from Lesson 3 to Lesson 4. Q-Learning is the algorithm. DQN is what happens when you swap the table for a neural network. Policy Gradient (PPO) is what happens when you swap the Q-value for a direct policy โ€” predicting action probabilities instead of action values.

๐ŸŽฎ Lesson 3 Quiz โ€” MDPs and Q-Learning

1. The Markov Property states that the next state depends only on the current state and action, not history. For CartPole, this means:
a) The cart must reset to the centre position after every step
b) Knowing (position, velocity, pole angle, angular velocity) at time t gives all information needed to predict the next state โ€” past positions and past actions are irrelevant if you know the current full state vector
c) The agent cannot look more than one step ahead when planning
d) The environment resets after every step to maintain the Markov Property
2. In the Bellman equation Q(s,a) = R(s,a) + ฮณ * max_a'[Q(s',a')], setting ฮณ=0.99 vs ฮณ=0.5 means:
a) ฮณ=0.99 trains 2x faster because the gradient is larger
b) ฮณ=0.99 gives the agent a long-term perspective โ€” a reward 100 steps away is still worth 0.99^100 โ‰ˆ 37% of its face value. ฮณ=0.5 makes the agent short-sighted โ€” that same reward is worth only 0.5^100 โ‰ˆ 0%. For chess (reward only at game end), ฮณ must be close to 1.
c) ฮณ=0.99 means the Q-table is 99% accurate while ฮณ=0.5 is 50% accurate
d) ฮณ controls the fraction of states that are randomly explored per episode
3. An epsilon-greedy policy with ฮต=1.0 means the agent always chooses randomly. Why does training start with ฮต=1.0?
a) Random actions produce lower variance in the Q-table updates
b) The Q-table is initialised to all zeros โ€” at the start, Q(s,a)=0 for every state and action. There is no useful information to exploit yet. Random exploration is the only way to collect diverse experience and populate the Q-table with meaningful values.
c) ฮต=1.0 is required by the OpenAI Gym CartPole environment during the first episode
d) High epsilon ensures the learning rate alpha converges to its optimal value faster
4. Why must we discretise CartPole's continuous state (4 floats) before using a Q-table?
a) Gymnasium requires integer states for compatibility with all RL algorithms
b) A Q-table stores values indexed by exact state values. Continuous states have infinitely many possible values โ€” you can never visit the exact same float twice, so Q-table lookups would always miss. Discretisation maps continuous values to finite bins, enabling tractable table indexing.
c) Float states cause numerical instability in the Bellman update formula
d) Python dictionaries cannot use float tuples as keys
5. The Q-Learning update rule is: Q[s][a] += ฮฑ * (reward + ฮณ*max_Q_next - Q[s][a]). The term (reward + ฮณ*max_Q_next - Q[s][a]) is called the TD error. When TD error = 0:
a) The agent has found an optimal policy and should stop training
b) The current Q-value estimate perfectly matches the Bellman target โ€” the agent's prediction of the value of taking action a in state s is exactly consistent with what it received plus the best future value. This is the fixed point Q-Learning converges to.
c) The learning rate alpha has become zero and no more updates occur
d) The episode has ended and no future rewards are possible
6. The core reason tabular Q-Learning cannot solve chess is:
a) Chess is too fast โ€” the environment transitions happen in microseconds
b) Chess has approximately 10^43 legal positions. Storing one float (8 bytes) per state-action pair would require ~10^44 bytes of memory โ€” many orders of magnitude more than all storage ever built by humanity. DQN solves this by approximating Q with a neural network that generalises across unseen states.
c) Chess rewards are too sparse โ€” tabular Q-Learning requires rewards every step
d) Chess has a branching factor greater than 2, which tabular Q-Learning cannot handle
7. In the CartPole environment, a reward of -100 is given on termination (pole falls) instead of the default 0. This modification:
a) Causes the Q-Learning algorithm to violate the Markov Property
b) Provides a strong negative signal for terminal states, giving the agent a clear incentive to avoid falling. Without it, falling earns the same reward as continuing โ€” the agent has no gradient signal distinguishing good balance from immediate failure.
c) Is required to prevent the Q-table from having negative values
d) Doubles the effective learning rate by increasing the TD error magnitude
8. AlphaZero learns chess "from scratch with only the rules." In RL terms, this means:
a) AlphaZero starts with a policy trained on 10 million human grandmaster games
b) AlphaZero's only reward signal is +1 for a win, 0 for a draw, -1 for a loss at game end โ€” no human move labels, no opening books. It learns entirely through self-play: it plays against itself, generates experience, and uses policy gradient methods (MCTS + neural network) to improve from that experience alone.
c) AlphaZero ignores the chess rules and discovers them through trial and error
d) "From scratch" means AlphaZero uses a randomly initialised Q-table that converges in 24 hours
โ† Lesson 2: Deep Learning Theory Lesson 4: Deep RL and PPO โ†’