Ananya, 15, from Ahmedabad had been using Keras for six months. She could write a CNN. She could fine-tune MobileNetV2. But when her model's loss exploded during training, she had no idea why. When her validation accuracy plateaued at 72%, she didn't know if changing the optimiser would help.
"You're using a machine you don't understand," her maths teacher said. "Let's open it up."
Ananya spent two weeks working through the maths underneath Keras. She derived backpropagation with pen and paper. She implemented a two-layer neural network in pure NumPy, without Keras, and watched the gradient flow. After that, when her loss exploded, she knew exactly why (learning rate too high). When it plateaued, she switched from SGD to Adam and broke through. Theory turned debugging from guessing into diagnosing.
A neural network forward pass is just matrix multiplication followed by a non-linearity, repeated for each layer, with a loss function at the end to score the output. Let's make that concrete:
Forward Pass: 2-layer network

```text
Layer 1:  z1 = W1 @ X + b1      # linear transform
Layer 1:  a1 = ReLU(z1)         # activation: max(0, z1)
Layer 2:  z2 = W2 @ a1 + b2     # output logits
Output:   ŷ  = softmax(z2)      # class probabilities
Loss:     L  = -Σ y * log(ŷ)    # cross-entropy
```

```python
import numpy as np

# -- Pure NumPy 2-layer neural network --
class TinyNN:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Xavier / Glorot initialisation: variance = 2/(fan_in + fan_out)
        scale1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale2 = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W1 = np.random.randn(hidden_dim, input_dim) * scale1
        self.b1 = np.zeros((hidden_dim, 1))
        self.W2 = np.random.randn(output_dim, hidden_dim) * scale2
        self.b2 = np.zeros((output_dim, 1))

    def relu(self, z):
        return np.maximum(0, z)

    def softmax(self, z):
        # Numerically stable: subtract max before exp
        e = np.exp(z - z.max(axis=0, keepdims=True))
        return e / e.sum(axis=0, keepdims=True)

    def forward(self, X):
        """X: (input_dim, batch_size)"""
        self.X = X
        self.z1 = self.W1 @ X + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.yhat = self.softmax(self.z2)
        return self.yhat

    def cross_entropy_loss(self, yhat, y_onehot):
        """y_onehot: (num_classes, batch_size)"""
        m = y_onehot.shape[1]
        # The small constant guards against log(0)
        return -np.sum(y_onehot * np.log(yhat + 1e-8)) / m
```
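The max-subtraction inside `softmax` matters more than it looks. The following is a minimal sketch, assuming only NumPy, of what goes wrong without it. The helper names `softmax_naive` and `softmax_stable` are illustrative and not part of the class above.

```python
import numpy as np

def softmax_naive(z):
    # Direct formula: np.exp overflows to inf for large logits
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def softmax_stable(z):
    # Subtracting the column max does not change the result mathematically,
    # but it keeps every exponent <= 0, so exp never overflows
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

small = np.array([[1.0], [2.0], [3.0]])
large = small + 1000.0   # same differences between logits, much larger values

# Silence the overflow warnings so the nan result is easy to see
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)   # inf / inf gives nan

stable_small = softmax_stable(small)
stable_large = softmax_stable(large)

print("naive softmax on large logits:", naive_large.ravel())
print("stable softmax on large logits:", stable_large.ravel())
print("same as on small logits:", np.allclose(stable_small, stable_large))
```

Softmax only depends on the differences between logits, which is why shifting every logit by the same amount is safe.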
Backpropagation is just the chain rule from calculus applied to a computation graph. We want to find ∂L/∂W for every weight so we can move it in the direction that reduces loss.
Chain Rule: backward pass

```text
∂L/∂z2 = ŷ - y                          # softmax + cross-entropy gradient (simplifies beautifully)
∂L/∂W2 = (1/m) * (∂L/∂z2) @ a1.T
∂L/∂b2 = (1/m) * sum(∂L/∂z2, axis=1)
∂L/∂a1 = W2.T @ (∂L/∂z2)
∂L/∂z1 = ∂L/∂a1 * ReLU'(z1)             # ReLU'(z) = 1 if z>0 else 0
∂L/∂W1 = (1/m) * (∂L/∂z1) @ X.T
∂L/∂b1 = (1/m) * sum(∂L/∂z1, axis=1)
```

```python
    # Method of TinyNN: add it inside the class defined above
    def backward(self, y_onehot, lr=0.01):
        """Compute gradients and update weights."""
        m = y_onehot.shape[1]

        # Output layer gradient (softmax + cross-entropy combined)
        dz2 = self.yhat - y_onehot           # (output_dim, m)
        dW2 = (dz2 @ self.a1.T) / m
        db2 = dz2.sum(axis=1, keepdims=True) / m

        # Backprop through layer 2 to layer 1
        da1 = self.W2.T @ dz2                # (hidden_dim, m)
        dz1 = da1 * (self.z1 > 0)            # ReLU derivative: 1 where z1 > 0
        dW1 = (dz1 @ self.X.T) / m
        db1 = dz1.sum(axis=1, keepdims=True) / m

        # Gradient descent update
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        self.W1 -= lr * dW1
        self.b1 -= lr * db1
```

```python
# Training loop
# Assumes X_train (784, m) and one-hot Y_train (10, m) are already loaded,
# for example flattened 28x28 images with 10 classes
nn = TinyNN(input_dim=784, hidden_dim=128, output_dim=10)
for epoch in range(100):
    yhat = nn.forward(X_train)               # X_train: (784, m)
    loss = nn.cross_entropy_loss(yhat, Y_train)
    nn.backward(Y_train, lr=0.01)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")
```
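A standard way to convince yourself that hand-derived gradients are right is a finite-difference gradient check. The sketch below is one possible version, assuming a tiny network with made-up sizes (4 inputs, 5 hidden units, 3 classes) and random data. It re-implements the same forward and backward equations as standalone functions (`forward`, `analytic_grads` and `numeric_grads` are illustrative names) and compares the analytic gradients with numerical ones.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    z1 = W1 @ X + b1
    a1 = np.maximum(0, z1)
    z2 = W2 @ a1 + b2
    e = np.exp(z2 - z2.max(axis=0, keepdims=True))
    return z1, a1, e / e.sum(axis=0, keepdims=True)

def loss_fn(params, X, Y):
    _, _, yhat = forward(params, X)
    return -np.sum(Y * np.log(yhat)) / X.shape[1]

def analytic_grads(params, X, Y):
    # Same chain-rule equations as the backward pass above
    W1, b1, W2, b2 = params
    m = X.shape[1]
    z1, a1, yhat = forward(params, X)
    dz2 = yhat - Y
    dW2 = dz2 @ a1.T / m
    db2 = dz2.sum(axis=1, keepdims=True) / m
    dz1 = (W2.T @ dz2) * (z1 > 0)
    dW1 = dz1 @ X.T / m
    db1 = dz1.sum(axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, Y, h=1e-5):
    # Central differences: nudge one entry at a time and watch the loss
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + h
            plus = loss_fn(params, X, Y)
            p[idx] = old - h
            minus = loss_fn(params, X, Y)
            p[idx] = old                      # restore the original value
            g[idx] = (plus - minus) / (2 * h)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                       # 4 features, 6 examples
Y = np.eye(3)[:, rng.integers(0, 3, size=6)]      # one-hot labels, shape (3, 6)
params = [rng.normal(size=(5, 4)), rng.normal(size=(5, 1)),
          rng.normal(size=(3, 5)), np.zeros((3, 1))]

analytic = analytic_grads(params, X, Y)
numeric = numeric_grads(params, X, Y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest |analytic - numeric| = {max_err:.2e}")
```

If the two disagree by more than a tiny amount, the usual suspects are a missing transpose, a forgotten `1/m`, or a wrong ReLU mask.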
Vanilla SGD updates every weight by the same learning rate: W = W - lr * dW. The problem: some weights need large updates, others need tiny ones. Adaptive optimisers solve this.
| Optimiser | Key Idea | When to Use | Keras Code |
|---|---|---|---|
| SGD | Fixed learning rate for all weights. Add momentum to smooth updates and help escape shallow local minima. | CNNs with careful LR tuning, research reproducibility | SGD(learning_rate=0.01, momentum=0.9) |
| RMSprop | Divides gradient by running RMS of past gradients. Adapts per-weight. | RNNs, noisy gradients | RMSprop(learning_rate=1e-3) |
| Adam | RMSprop + momentum. Maintains moving average of gradients AND squared gradients. Robust across a wide range of problems. | Almost everything; the default choice | Adam(learning_rate=1e-3) |
| AdamW | Adam + weight decay decoupled from gradient. Often generalises better. | Fine-tuning LLMs, Transformers | AdamW(learning_rate=3e-4, weight_decay=0.01) |
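To see why a single learning rate can be a poor fit, here is a toy sketch on a hypothetical two-weight loss, `0.5 * (1 * w0**2 + 100 * w1**2)`, where one direction is 100 times steeper than the other. The loss, the starting point and the helper `run_sgd` are all made up for illustration. The last run uses an idealised per-weight step (1 / curvature), which is not what Adam computes but shows what per-weight adaptation is aiming for.

```python
import numpy as np

# Hypothetical loss: 0.5 * sum(curvature * w**2), so grad = curvature * w
curvature = np.array([1.0, 100.0])

def run_sgd(lr, steps=100):
    """Plain gradient descent from w = [1, 1]; lr may be a scalar or per-weight array."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * w
        w = w - lr * grad
    return w

w_safe = run_sgd(lr=0.019)        # just below the stability limit 2/100 for w1
w_diverged = run_sgd(lr=0.021)    # just above it: w1 blows up
w_per_weight = run_sgd(lr=1.0 / curvature, steps=1)   # idealised per-weight step

print("lr=0.019 :", w_safe)       # w1 is tiny, w0 has barely moved
print("lr=0.021 :", w_diverged)   # w1 has exploded
print("per-weight lr, 1 step:", w_per_weight)
```

The steep direction caps the learning rate, and the shallow direction then crawls. This is also the mechanism behind an exploding loss when the learning rate is too high.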
```python
# Adam update equations (implement it once to understand it forever)
import numpy as np

class Adam:
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m, self.v, self.t = {}, {}, 0

    def update(self, params, grads):
        self.t += 1
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

            # Momentum: exponential moving average of gradients
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # RMS: exponential moving average of squared gradients
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key]**2

            # Bias correction (important early in training when m, v ≈ 0)
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)

            # Update: normalise by sqrt(v_hat) -> adaptive learning rate per weight
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params
```
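One consequence of these equations is easy to check by hand: on the very first step, the bias-corrected update has a size of roughly `lr`, whatever the scale of the gradient. The sketch below works this out for a single scalar weight. The function name `adam_first_step` is illustrative, and the hyperparameters are the same defaults as in the class above.

```python
import math

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1) for one scalar gradient."""
    m = (1 - beta1) * grad           # m starts at 0
    v = (1 - beta2) * grad ** 2      # v starts at 0
    m_hat = m / (1 - beta1 ** 1)     # bias correction undoes the (1 - beta) shrinkage
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

steps = {g: adam_first_step(g) for g in (1e-3, 1.0, 1e3)}
for g, step in steps.items():
    print(f"gradient {g:>8}: first update {step:.6e}")
```

The gradient's magnitude cancels between `m_hat` and `sqrt(v_hat)`, so only its sign and `lr` decide the step. Plain SGD would take steps a million times apart for the smallest and largest gradient here.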
Batch Normalisation (Ioffe & Szegedy, 2015) normalises activations within each mini-batch to have zero mean and unit variance, then scales and shifts them with learned parameters γ and β. This:
- Often allows much higher learning rates (up to around 10×)
- Makes training less sensitive to weight initialisation
- Acts as a mild regulariser (can reduce the need for dropout)
```python
import numpy as np
import tensorflow as tf

# Batch Norm: what Keras does under the hood in training mode
# (at inference it uses moving averages of the mean and variance instead)
def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """
    z: (batch_size, features) activations BEFORE the activation function
    gamma, beta: learned scale and shift (trainable parameters)
    """
    mu = z.mean(axis=0)                      # mean per feature
    var = z.var(axis=0)                      # variance per feature
    z_hat = (z - mu) / np.sqrt(var + eps)    # normalise
    out = gamma * z_hat + beta               # scale and shift
    return out

# In Keras, commonly placed BEFORE the activation (as in the original paper):
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256),
    tf.keras.layers.BatchNormalization(),    # <- before activation
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
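`batch_norm_forward` normalises with the statistics of the current batch, which only makes sense during training. At inference time a batch may contain a single example, so layers such as Keras's `BatchNormalization` fall back on moving averages collected during training. The class below is a simplified sketch of that idea, not Keras's actual implementation: `BatchNorm1D` is an illustrative name, γ and β are left at their initial values rather than learned, and the `momentum=0.99` and `eps=1e-3` defaults are chosen to mirror the Keras layer's defaults.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(features)        # scale (would be learned)
        self.beta = np.zeros(features)        # shift (would be learned)
        self.moving_mean = np.zeros(features)
        self.moving_var = np.ones(features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training):
        if training:
            # Use batch statistics and fold them into the moving averages
            mu, var = z.mean(axis=0), z.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mu
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference: rely on what was accumulated during training
            mu, var = self.moving_mean, self.moving_var
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(features=3)
for _ in range(2000):                          # simulated training batches
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    out_train = bn(batch, training=True)

single = np.array([[5.0, 5.0, 5.0]])           # one example sitting at the data mean
out_infer = bn(single, training=False)

print("moving mean:", bn.moving_mean)
print("moving var :", bn.moving_var)
print("inference output for a mean-valued input:", out_infer)
```

With batch statistics, that single example would have zero variance and the output would collapse to β regardless of the input. The moving averages avoid this.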
```python
import tensorflow as tf

# -- Dropout: random neuron deactivation during training --
# During TRAINING: randomly zero out a fraction p of neurons -> forces redundancy.
#   Keras scales the surviving activations by 1/(1-p) ("inverted dropout").
# During INFERENCE: all neurons active, no extra scaling needed
# In Keras, Dropout usually goes AFTER the activation:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.4),            # <- after activation, p=0.4
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
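The following is a minimal NumPy sketch of inverted dropout, the variant in which the scaling happens at training time so that inference needs no adjustment. The function name `dropout` and the all-ones test activations are illustrative assumptions.

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: drop with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                          # inference: identity, no scaling
    mask = rng.random(a.shape) >= p       # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)           # keeps the expected activation unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 256))                  # pretend activations, all equal to 1

train_out = dropout(a, p=0.4, training=True, rng=rng)
infer_out = dropout(a, p=0.4, training=False, rng=rng)

print("fraction dropped      :", (train_out == 0).mean())
print("mean during training  :", train_out.mean())
print("mean during inference :", infer_out.mean())
```

Because the surviving units are scaled up during training, the average activation seen by the next layer is about the same in both modes.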
A fixed learning rate is rarely optimal. Start large to explore, then decay to fine-tune around a minimum. The most important schedules:
```python
import numpy as np
import tensorflow as tf

# -- 1. Cosine Annealing (widely used for transformers) --
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=10_000,   # steps over which the lr decays
    alpha=1e-6            # final lr as a FRACTION of initial_learning_rate
)

# -- 2. ReduceLROnPlateau (automatic, metric-based) --
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,           # multiply lr by 0.5 when plateau detected
    patience=5,           # wait 5 epochs without improvement before reducing
    min_lr=1e-7
)

# -- 3. Warmup + Decay (transformers standard) --
class WarmupCosineSchedule(
        tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr, warmup_steps, total_steps):
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        # The optimizer passes an integer step; cast so float maths works
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * (step / self.warmup_steps)
        cosine = self.peak_lr * 0.5 * (
            1 + tf.cos(np.pi * (step - self.warmup_steps)
                       / (self.total_steps - self.warmup_steps))
        )
        return tf.where(step < self.warmup_steps, warmup, cosine)

# Usage:
schedule = WarmupCosineSchedule(peak_lr=3e-4, warmup_steps=1000, total_steps=10000)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```
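To get a feel for the shape of the warmup-plus-cosine schedule, here is the same formula as a plain Python function evaluated at a few steps. It is a sketch for inspection only: `warmup_cosine_lr` is an illustrative name, and it uses the same `peak_lr`, `warmup_steps` and `total_steps` values as the usage line above.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then half-cosine decay towards 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

checkpoints = (0, 500, 1000, 5500, 10000)
lrs = [warmup_cosine_lr(s, peak_lr=3e-4, warmup_steps=1000, total_steps=10000)
       for s in checkpoints]
for s, lr in zip(checkpoints, lrs):
    print(f"step {s:>5}: lr = {lr:.2e}")
```

The rate climbs linearly to the peak at step 1000, is back at half the peak midway through the decay, and reaches roughly zero at `total_steps`. Note that the formula is not clamped: if training runs past `total_steps`, the cosine starts rising again, so it is worth either stopping there or holding the rate at its final value.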