Meera's younger brother has a mango tree in the backyard. Every season, some mangoes develop a fungal disease. Her father usually identifies it by eye โ dark spots with a yellowish ring. Meera wondered: "Could an AI identify diseased fruit from a photo?" She'd heard of apps like Plantix. But how exactly does a computer see an image?
She opened her Class 9 notes. She knew a regular neural network flattens every pixel into a 1D list. For a 224ร224 image with 3 colour channels, that's 150,528 numbers โ before even one neuron processes them. "No wonder they struggle," she thought. "They lose all the spatial structure." Then she read about Convolutional Neural Networks โ and everything clicked.
In Class 9, you learned how a dense (fully-connected) neural network works: every input connects to every neuron. That's fine for small data, but disastrous for images:
- Too many parameters: A 224ร224 RGB image flattened = 150,528 inputs. With 512 neurons in layer 1, that's 77 million weights โ just in one layer.
- No spatial awareness: Flattening destroys position. A pixel at top-left has no relationship to its neighbour once it's in a list.
- Not translation invariant: If a cat moves from centre to corner of the photo, a dense network treats it as a completely different image.
CNNs solve all three problems with one key idea: local connections + shared weights.
A filter (also called a kernel) is a small grid of weights โ typically 3ร3. It slides across the image, computing a dot product at each position. The result is a feature map that highlights where that pattern appears.
In a trained CNN, the network learns the filter weights automatically from the data. You don't hand-design them โ they emerge from backpropagation.
- Conv2D layer: Applies learnable filters across the image. Each filter detects one pattern (edge, curve, texture). More filters = richer feature set.
- ReLU activation: Applied after each convolution. Sets all negative values to 0 โ adds non-linearity without killing spatial structure.
- MaxPool2D: Takes the maximum value from each 2ร2 region. Reduces spatial dimensions by half, keeping the strongest features, reducing parameters.
- Flatten: After the conv/pool stack, converts the 3D feature volume to a 1D vector for the dense classifier.
- Dense + Softmax: The final classifier โ same as Class 9 neural networks but much smaller because CNNs already extracted the key features.
# CNN for Image Classification โ Google Colab
# Task: Classify images into 10 categories (CIFAR-10 dataset)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
# โโ Step 1: Load and normalise CIFAR-10 โโ
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255.0 # scale pixels to 0-1
X_test = X_test.astype('float32') / 255.0
class_names = ['airplane','automobile','bird','cat','deer',
'dog','frog','horse','ship','truck']
print(f"Training set: {X_train.shape}") # (50000, 32, 32, 3)
print(f"Test set: {X_test.shape}") # (10000, 32, 32, 3)
# โโ Step 2: Build CNN architecture โโ
model = keras.Sequential([
# Block 1: Detect edges and textures
layers.Conv2D(32, (3,3), activation='relu', padding='same',
input_shape=(32,32,3)),
layers.Conv2D(32, (3,3), activation='relu', padding='same'),
layers.MaxPool2D(2, 2), # 32x32 -> 16x16
layers.Dropout(0.25),
# Block 2: Detect shapes from edges
layers.Conv2D(64, (3,3), activation='relu', padding='same'),
layers.Conv2D(64, (3,3), activation='relu', padding='same'),
layers.MaxPool2D(2, 2), # 16x16 -> 8x8
layers.Dropout(0.25),
# Classifier head
layers.Flatten(),
layers.Dense(512, activation='relu'),
layers.Dropout(0.5),
layers.Dense(10, activation='softmax') # 10 CIFAR-10 classes
])
model.summary() # see total parameters
# โโ Step 3: Compile โโ
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# โโ Step 4: Train โโ
history = model.fit(
X_train, y_train,
epochs=20,
batch_size=64,
validation_split=0.1,
verbose=1
)
# โโ Step 5: Evaluate โโ
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.2%}")
# โโ Step 6: Plot training curves โโ
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['accuracy'], label='Train')
ax1.plot(history.history['val_accuracy'], label='Val')
ax1.set_title('Accuracy'); ax1.legend()
ax2.plot(history.history['loss'], label='Train')
ax2.plot(history.history['val_loss'], label='Val')
ax2.set_title('Loss'); ax2.legend()
plt.tight_layout(); plt.show()
# โโ Step 7: Predict on one image โโ
import numpy as np
idx = 42
pred = model.predict(X_test[idx:idx+1])
print(f"Predicted: {class_names[np.argmax(pred)]} "
f"Actual: {class_names[y_test[idx][0]]}")- Parameter sharing: One filter's weights are shared across the whole image. A 3ร3 filter with 32 output channels = only 3ร3ร3ร32 = 864 parameters, regardless of image size.
- Hierarchical features: Early layers detect edges โ middle layers detect shapes โ deep layers detect objects (eyes, wheels, leaves).
- Data augmentation: Flip, rotate, zoom training images to artificially increase dataset size. Keras has
ImageDataGeneratorfor this. - Batch Normalisation: Often added after Conv layers to stabilise training โ speeds up convergence significantly.