Lesson 2 — Deep Learning Theory | Class 11

Story

Ananya Wants to Understand Why

Ananya, 15, from Ahmedabad, had been using Keras for six months. She could write a CNN. She could fine-tune MobileNetV2. But when her model's loss exploded during training, she had no idea why. When her validation accuracy plateaued at 72%, she didn't know whether changing the optimiser would help.

"You're using a machine you don't understand," her maths teacher said. "Let's open it up."

Ananya spent two weeks working through the maths underneath Keras. She derived backpropagation with pen and paper. She implemented a two-layer neural network in pure NumPy — without Keras — and watched the gradient flow. After that, when her loss exploded, she knew exactly why (learning rate too high). When it plateaued, she switched from SGD to Adam and broke through. Theory turned debugging from guessing into diagnosing.

Section 1

Forward Pass: From Input to Loss

A neural network forward pass is just matrix multiplication + a non-linearity + a loss function. Let's make that concrete:

Forward Pass — 2-layer network

Layer 1:  z1 = W1 @ X + b1       # linear transform
Layer 1:  a1 = ReLU(z1)          # activation: max(0, z1)
Layer 2:  z2 = W2 @ a1 + b2      # output logits
Output:   ŷ  = softmax(z2)       # class probabilities
Loss:     L  = -Σ y * log(ŷ)     # cross-entropy

import numpy as np

# ── Pure NumPy 2-layer neural network ──────────────────────────
class TinyNN:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Xavier / Glorot initialisation: variance = 2/(fan_in + fan_out)
        scale1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale2 = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W1 = np.random.randn(hidden_dim, input_dim) * scale1
        self.b1 = np.zeros((hidden_dim, 1))
        self.W2 = np.random.randn(output_dim, hidden_dim) * scale2
        self.b2 = np.zeros((output_dim, 1))

    def relu(self, z):
        return np.maximum(0, z)

    def softmax(self, z):
        # Numerically stable: subtract max before exp
        e = np.exp(z - z.max(axis=0, keepdims=True))
        return e / e.sum(axis=0, keepdims=True)

    def forward(self, X):
        """X: (input_dim, batch_size)"""
        self.X  = X
        self.z1 = self.W1 @ X + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.yhat = self.softmax(self.z2)
        return self.yhat

    def cross_entropy_loss(self, yhat, y_onehot):
        """y_onehot: (num_classes, batch_size)"""
        m = y_onehot.shape[1]
        return -np.sum(y_onehot * np.log(yhat + 1e-8)) / m

Xavier initialisation matters. If weights start too large, activations saturate and gradients vanish. If they start too small, the signal shrinks towards zero as it passes through each layer and the network barely learns. Xavier initialisation draws weights with variance = 2/(fan_in + fan_out), which keeps the signal magnitude roughly consistent across layers.
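The effect of the starting scale can be checked directly. The following is a minimal sketch, assuming a stack of 10 tanh layers of width 256 fed with random inputs; the depth, width, batch size, and the two "bad" scales (0.01 and 1.0) are arbitrary choices for this demo and are not part of TinyNN. It prints the standard deviation of the final layer's activations for each choice of scale.

```python
import numpy as np

def final_activation_std(scale_fn, depth=10, width=256, batch=512, seed=0):
    """Push random inputs through `depth` tanh layers and return the std
    of the last layer's activations. `scale_fn(fan_in, fan_out)` returns the
    standard deviation used to draw each weight matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((width, batch))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width, width)
        a = np.tanh(W @ a)
    return a.std()

too_small = final_activation_std(lambda fi, fo: 0.01)
xavier    = final_activation_std(lambda fi, fo: np.sqrt(2.0 / (fi + fo)))
too_large = final_activation_std(lambda fi, fo: 1.0)

print(f"std with scale 0.01 : {too_small:.2e}")   # signal shrinks towards 0
print(f"std with Xavier     : {xavier:.2e}")      # signal stays in a usable range
print(f"std with scale 1.0  : {too_large:.2e}")   # tanh outputs pile up near +1 and -1
```

With the tiny scale the signal has all but disappeared by the last layer, and with the large scale almost every unit sits in the flat part of tanh, where the derivative is close to zero.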

Section 2

Backpropagation: The Chain Rule in Action

Backpropagation is just the chain rule from calculus applied to a computation graph. We want to find ∂L/∂W for every weight so we can move it in the direction that reduces loss.

Chain Rule — backward pass

∂L/∂z2 = ŷ - y                          # softmax + cross-entropy gradient (simplifies beautifully)
∂L/∂W2 = (1/m) * (∂L/∂z2) @ a1.T
∂L/∂b2 = (1/m) * sum(∂L/∂z2, axis=1)
∂L/∂a1 = W2.T @ (∂L/∂z2)
∂L/∂z1 = ∂L/∂a1 * ReLU'(z1)             # ReLU'(z) = 1 if z>0 else 0
∂L/∂W1 = (1/m) * (∂L/∂z1) @ X.T
∂L/∂b1 = (1/m) * sum(∂L/∂z1, axis=1)

    def backward(self, y_onehot, lr=0.01):
        """Compute gradients and update weights."""
        m = y_onehot.shape[1]

        # Output layer gradient (softmax + cross-entropy combined)
        dz2 = self.yhat - y_onehot           # (output_dim, m)

        dW2 = (dz2 @ self.a1.T) / m
        db2 = dz2.sum(axis=1, keepdims=True) / m

        # Backprop through layer 2 to layer 1
        da1 = self.W2.T @ dz2                # (hidden_dim, m)
        dz1 = da1 * (self.z1 > 0)           # ReLU derivative: 1 where z1 > 0

        dW1 = (dz1 @ self.X.T) / m
        db1 = dz1.sum(axis=1, keepdims=True) / m

        # Gradient descent update
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1

# Training loop (assumes X_train of shape (784, m) and one-hot Y_train of shape (10, m) are already loaded)
nn = TinyNN(input_dim=784, hidden_dim=128, output_dim=10)
for epoch in range(100):
    yhat = nn.forward(X_train)             # X_train: (784, m)
    loss = nn.cross_entropy_loss(yhat, Y_train)
    nn.backward(Y_train, lr=0.01)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")

Why does ∂L/∂z2 = ŷ - y simplify so cleanly? When the softmax Jacobian and the cross-entropy gradient are multiplied together via the chain rule, almost all terms cancel, and the combined gradient is just the prediction error. This is one of the most elegant results in deep learning maths, and it is why softmax is almost always paired with cross-entropy loss.
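The result can be verified numerically without doing any algebra. This is a small sketch, assuming a single example with 5 classes and randomly drawn logits (both are arbitrary choices for the check). It compares ŷ - y against a finite-difference estimate of the gradient of the loss with respect to each logit.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """Loss as a function of the logits z, for a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(5)          # logits for one example, 5 classes
y = np.zeros(5)
y[2] = 1.0                          # true class is index 2

analytic = softmax(z) - y           # the claimed gradient: prediction minus truth

# Central finite differences: nudge each logit and measure the change in loss
h = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = h
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print("analytic:", np.round(analytic, 6))
print("numeric :", np.round(numeric, 6))
print("max abs difference:", max_diff)
```

The same finite-difference trick is a handy way to test any hand-written backward pass before trusting it.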

Section 3

Optimisers: SGD vs Adam vs RMSprop

Vanilla SGD updates every weight by the same learning rate: W = W - lr * dW. The problem: some weights need large updates, others need tiny ones. Adaptive optimisers solve this.

Optimiser	Key Idea	When to Use	Keras Code
SGD	Fixed learning rate for all weights. Momentum smooths noisy updates and helps roll through plateaus and shallow local minima.	CNNs with careful LR tuning, research reproducibility	SGD(learning_rate=0.01, momentum=0.9)
RMSprop	Divides gradient by running RMS of past gradients. Adapts per-weight.	RNNs, noisy gradients	RMSprop(learning_rate=1e-3)
Adam	RMSprop + momentum. Maintains moving average of gradients AND squared gradients. Most robust.	Almost everything; the default choice	Adam(learning_rate=1e-3)
AdamW	Adam + weight decay decoupled from gradient. Better generalisation.	Fine-tuning LLMs, Transformers	AdamW(learning_rate=3e-4, weight_decay=0.01)
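To make "adapts per-weight" concrete, here is a minimal sketch of a single update step on a toy two-weight loss. The loss L(w) = 0.5 * (100 * w0**2 + w1**2), the starting point, and the decay rate rho = 0.9 are assumptions chosen for illustration. The RMSprop step shown is the very first update, with the running average starting at zero.

```python
import numpy as np

# Toy loss: L(w) = 0.5 * (100 * w0**2 + w1**2), so the two gradients differ by 100x
w = np.array([1.0, 1.0])
grad = np.array([100.0 * w[0], 1.0 * w[1]])

lr, rho, eps = 0.01, 0.9, 1e-8

# Plain SGD: the step is proportional to the raw gradient
sgd_step = lr * grad

# RMSprop-style: divide by the running RMS of the gradient (v starts at 0)
v = (1 - rho) * grad**2
rms_step = lr * grad / (np.sqrt(v) + eps)

print("SGD step    :", sgd_step)   # step sizes differ by a factor of 100
print("RMSprop step:", rms_step)   # step sizes are almost identical
```

SGD moves the steep weight 100 times further than the shallow one, so a single learning rate cannot suit both. Dividing by the running RMS puts both weights on the same footing.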

# Adam update equations (implement it once to understand it forever)
import numpy as np

class Adam:
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m, self.v, self.t = {}, {}, 0

    def update(self, params, grads):
        self.t += 1
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

            # Momentum: exponential moving average of gradients
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # RMS: exponential moving average of squared gradients
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key]**2

            # Bias correction (important early in training when m, v ≈ 0)
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)

            # Update: normalise by sqrt(v_hat) → adaptive learning rate per weight
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params
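The bias-correction step is easiest to appreciate with numbers. The sketch below assumes a gradient that stays constant at 0.5 (an artificial choice that makes the right answer obvious) and tracks the raw and corrected moving averages for the first five steps.

```python
beta1, beta2 = 0.9, 0.999
g = 0.5                      # pretend the gradient is constant at 0.5
m = v = 0.0                  # both moving averages start at zero

rows = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)     # corrected estimate of the gradient mean
    v_hat = v / (1 - beta2**t)     # corrected estimate of the squared gradient
    rows.append((t, m, m_hat, v, v_hat))
    print(f"t={t}: m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.6f}")
```

At t=1 the raw m is 0.05, ten times smaller than the true gradient, and the raw v is a thousand times smaller than g². After correction, m_hat stays at 0.5 and v_hat at 0.25 on every step, which is what an unbiased estimate of a constant gradient should give.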

Section 4

Batch Normalisation and Dropout

Batch Normalisation (2015, Ioffe & Szegedy) normalises activations within each mini-batch to have zero mean and unit variance, then scales and shifts with learned parameters γ and β. This:

Often allows substantially higher learning rates
Makes training less sensitive to weight initialisation
Acts as a mild regulariser (reduces need for dropout)

# Batch Norm: what Keras does under the hood in TRAINING mode
# (at inference Keras uses moving averages of mu and var collected during training)
def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """
    z: (batch_size, features) activations BEFORE the activation function
    gamma, beta: learned scale and shift (trainable parameters)
    """
    mu    = z.mean(axis=0)                   # mean per feature
    var   = z.var(axis=0)                    # variance per feature
    z_hat = (z - mu) / np.sqrt(var + eps)    # normalise
    out   = gamma * z_hat + beta             # scale and shift
    return out

# In Keras, always place it BEFORE the activation:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),
    tf.keras.layers.BatchNormalization(),    # ← before activation
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# ── Dropout: random neuron deactivation during training ─────────
# During TRAINING: randomly zero out a fraction p of neurons → forces redundancy;
#   Keras scales the surviving outputs by 1/(1-p) ("inverted dropout")
# During INFERENCE: all neurons active, no extra scaling needed

# In Keras — Dropout goes AFTER the activation:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.4),   # ← after activation, p=0.4
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation='softmax')
])
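For intuition, here is a minimal NumPy sketch of inverted dropout, the variant in which the rescaling happens during training so that inference needs none. The function name dropout_forward and the all-ones dummy activations are illustrative assumptions, not Keras internals.

```python
import numpy as np

def dropout_forward(a, p, training, rng):
    """Inverted dropout. a: activations, p: fraction of units to drop."""
    if not training:
        return a                          # inference: identity, no scaling
    keep_mask = rng.random(a.shape) >= p  # True with probability 1 - p
    return a * keep_mask / (1.0 - p)      # rescale survivors to keep the mean

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                  # dummy activations, all equal to 1

train_out = dropout_forward(a, p=0.4, training=True, rng=rng)
infer_out = dropout_forward(a, p=0.4, training=False, rng=rng)

print("fraction zeroed in training :", np.mean(train_out == 0))
print("mean activation in training :", train_out.mean())
print("mean activation at inference:", infer_out.mean())
```

About 40% of the units are zeroed in training, yet the average activation stays close to 1 in both modes, which is the point of the rescaling.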

BatchNorm placement rule: Dense → BatchNorm → Activation, NOT Dense → Activation → BatchNorm. This is the placement used in the original paper. The reason: BatchNorm normalises the raw pre-activation values, which tend to have a more symmetric, stable distribution than post-activation outputs.

Section 5

Learning Rate Schedules

A fixed learning rate is rarely optimal. Start large to explore, then decay to fine-tune around a minimum. The most important schedules:

import tensorflow as tf

# ── 1. Cosine Annealing (most popular for transformers) ─────────
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=10_000,   # steps over which lr decays to its floor
    alpha=1e-6            # floor as a FRACTION of the initial lr (min lr = 1e-3 * 1e-6)
)

# ── 2. ReduceLROnPlateau (automatic, metric-based) ──────────────
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,           # multiply lr by 0.5 when plateau detected
    patience=5,           # wait 5 epochs before reducing
    min_lr=1e-7
)

# ── 3. Warmup + Decay (transformers standard) ───────────────────
class WarmupCosineSchedule(
    tf.keras.optimizers.schedules.LearningRateSchedule):

    def __init__(self, peak_lr, warmup_steps, total_steps):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

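    def __call__(self, step):
        # The optimiser passes an integer step; cast so the maths is done in floats
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * (step / self.warmup_steps)
        # Fraction of the decay phase completed, clipped to [0, 1] so the lr
        # stays at its floor after total_steps instead of rising again
        progress = tf.clip_by_value(
            (step - self.warmup_steps)
            / (self.total_steps - self.warmup_steps), 0.0, 1.0)
        cosine = self.peak_lr * 0.5 * (1.0 + tf.cos(np.pi * progress))
        return tf.where(step < self.warmup_steps, warmup, cosine)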

# Usage:
schedule = WarmupCosineSchedule(peak_lr=3e-4, warmup_steps=1000, total_steps=10000)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

Rule of thumb: Use warmup when fine-tuning pretrained models. Keeping the lr tiny for the first 1000–2000 steps limits how far the pretrained weights can move while the freshly initialised head is still producing large, noisy gradients.
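To see what such a schedule looks like in numbers, here is a plain-Python sketch of linear warmup followed by cosine decay to zero. The function name warmup_cosine_lr is hypothetical, and the defaults simply reuse the values from the usage example (peak_lr=3e-4, warmup_steps=1000, total_steps=10000).

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (0, 500, 1000, 5500, 10000):
    print(f"step {s:>5}: lr = {warmup_cosine_lr(s):.2e}")
```

The lr climbs from 0 to the peak over the first 1000 steps, is back at half the peak by step 5500 (the midpoint of the decay phase), and reaches 0 at step 10000.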

🧮 Lesson 2 Quiz — Deep Learning Theory

1. In Xavier/Glorot initialisation, weights are sampled with variance = 2/(fan_in + fan_out). This specific formula is important because:

a) It ensures all weights start at zero to avoid breaking symmetry

b) It keeps the variance of activations and gradients approximately equal across every layer — too large causes saturation and vanishing gradients, too small means the signal decays to zero. The formula is derived from keeping variance(output) = variance(input) for a linear layer.

c) It guarantees the network will converge in a fixed number of epochs

d) It makes the weight matrix orthogonal, which prevents gradient explosion

2. The backpropagation gradient at the output layer simplifies to ∂L/∂z2 = ŷ - y (prediction minus truth). This clean result occurs because:

a) The output layer always uses ReLU, which has a derivative of exactly 1

b) The softmax and cross-entropy loss are mathematical conjugates — when you multiply their individual Jacobians together via the chain rule, nearly all terms cancel, leaving just the prediction error. This is why they are almost always paired together.

c) Cross-entropy loss is linear, so its gradient is always 1

d) The simplification only applies when the network achieves 100% accuracy

3. Adam optimiser maintains two moving averages m (gradient) and v (squared gradient). The bias correction step divides by (1 - β^t) because:

a) It compensates for floating point rounding errors in the gradient accumulation

b) At the start of training (t=1, 2...), both m and v are initialised at zero, so they are biased towards zero. Dividing by (1-β^t) corrects this bias, so that early steps use properly scaled estimates of the gradient mean and squared gradient instead of values that haven't yet accumulated enough history.

c) It normalises the learning rate to be between 0 and 1 at all times

d) It prevents the learning rate from ever exceeding the initial value

4. Batch Normalisation should be placed BEFORE the activation function (Dense → BN → ReLU) rather than after. The correct reason is:

a) BatchNorm cannot process negative values, so it must run before ReLU eliminates them

b) BatchNorm normalises the raw pre-activation distribution (which is approximately Gaussian and well-defined) to zero mean and unit variance — this produces the most stable normalisation. Post-activation values after ReLU are half-rectified and no longer Gaussian, making normalisation less effective.

c) Keras raises an error if BatchNorm is placed after Activation

d) Placing BN before activation makes the model train exactly 2x faster

5. Dropout with p=0.4 means 40% of neurons are zeroed during training. During inference, the correct behaviour is:

a) 40% of neurons are still randomly zeroed to maintain consistency with training

b) All neurons are active and outputs are scaled by (1-p)=0.6 — OR equivalently (the modern approach), outputs are scaled by 1/(1-p) during training so no scaling is needed at inference. This ensures expected activations match between training and inference.

c) The dropout layer is removed entirely and replaced with a BatchNorm layer

d) Dropout is applied only to bias terms, not weight outputs, during inference

6. When fine-tuning a pretrained BERT model, you should use a warmup LR schedule (tiny LR for first 1000 steps, then larger). This is because:

a) BERT's tokeniser requires 1000 warmup steps to load completely

b) The pretrained weights encode valuable representations. Large updates in the first steps can catastrophically overwrite them before the new task-specific head is properly initialised. Warmup lets the head converge slightly first, then both head and body fine-tune together.

c) Google's BERT paper required warmup as a legal condition of use

d) Warmup steps prevent the Adam optimiser from computing gradients before the model has seen at least 1000 samples

7. The ReLU derivative used in backprop is: dz1 = da1 * (z1 > 0). The (z1 > 0) mask means:

a) Neurons that were positive during the forward pass receive full gradient; neurons that were negative receive zero gradient. Dead neurons (always negative) never learn — this is the "dying ReLU" problem.

b) Only the largest 50% of activations are allowed to propagate gradients

c) Negative activations receive a gradient of -1, not 0

d) The mask ensures gradient values are always between 0 and 1

8. AdamW is preferred over Adam for fine-tuning Transformers because it decouples weight decay from the gradient update. The problem with standard Adam + L2 regularisation is:

a) Adam cannot process L2 regularisation terms mathematically

b) In standard Adam, L2 gradient is divided by √v̂ (the adaptive term), making effective weight decay weaker for frequently-updated weights and stronger for rarely-updated ones — the opposite of desired. AdamW applies weight decay directly to the weights, independent of the gradient history.

c) L2 regularisation increases the required warmup steps by 10x

d) Standard Adam + L2 is identical to AdamW — they are equivalent implementations
