Lesson 1 — Convolutional Neural Networks | Class 10

Meet Meera — Class 10, Delhi

Meera's younger brother has a mango tree in the backyard. Every season, some mangoes develop a fungal disease. Her father usually identifies it by eye — dark spots with a yellowish ring. Meera wondered: "Could an AI identify diseased fruit from a photo?" She'd heard of apps like Plantix. But how exactly does a computer see an image?

She opened her Class 9 notes. She knew a regular neural network flattens every pixel into a 1D list. For a 224×224 image with 3 colour channels, that's 150,528 numbers — before even one neuron processes them. "No wonder they struggle," she thought. "They lose all the spatial structure." Then she read about Convolutional Neural Networks — and everything clicked.

The Core Problem

Why Regular Networks Fail on Images

In Class 9, you learned how a dense (fully-connected) neural network works: every input connects to every neuron. That's fine for small data, but disastrous for images:

Too many parameters: A 224×224 RGB image flattened = 150,528 inputs. With 512 neurons in layer 1, that's 77 million weights — just in one layer.
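No spatial awareness: Flattening destroys position. Two pixels that sit directly above one another in the image end up 672 entries apart in the flattened list (224 pixels × 3 channels), and nothing tells the network they were neighbours.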
Not translation invariant: If a cat moves from centre to corner of the photo, a dense network treats it as a completely different image.
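The parameter count in the first point is plain arithmetic. The short sketch below reproduces it; the variable names are illustrative, and biases are left out to match the figure quoted above.

```python
# Weight count for a dense first layer on a flattened 224x224 RGB image.
height, width, channels = 224, 224, 3
inputs = height * width * channels   # one input per pixel value
neurons = 512
dense_weights = inputs * neurons     # every input connects to every neuron

print(f"Flattened inputs: {inputs:,}")
print(f"Dense weights:    {dense_weights:,}")
```

Doubling the image width and height would multiply both numbers by four, which is why dense layers scale so badly with image size.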

CNNs solve all three problems with one key idea: local connections + shared weights.

Core Concept

How a Convolutional Filter Works

A filter (also called a kernel) is a small grid of weights — typically 3×3. It slides across the image, computing a dot product at each position. The result is a feature map that highlights where that pattern appears.

Edge Detector (Vertical): Activates strongly at vertical edges in the image.
Blur Filter: Averages neighbouring pixels, which smooths noise.
Sharpen Filter: Amplifies the centre pixel relative to its neighbours, which sharpens detail.

In a trained CNN, the network learns the filter weights automatically from the data. You don't hand-design them — they emerge from backpropagation.
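To make the sliding dot product concrete, here is a minimal pure-Python sketch. The function name `convolve2d`, the 6×6 test image, and the kernel values (a common Prewitt-style vertical edge detector) are assumptions chosen for illustration, not values taken from this lesson. Real CNN layers also handle padding, strides, several channels, and a bias.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1).

    This is the cross-correlation that deep learning libraries call convolution.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product of the kernel with the k x k patch at (r, c)
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# 6x6 greyscale image: dark (0) on the left half, bright (1) on the right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Assumed example kernel: responds where brightness increases left to right
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)   # each row is [0, 3, 3, 0]
```

The feature map is zero over the flat regions and non-zero only where the window straddles the dark-to-bright boundary, which is what "highlights where that pattern appears" means in practice.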

Architecture

Layers of a CNN

Input (224×224×3) → Conv2D (32 filters, 3×3) → MaxPool2D (2×2, halves size) → Conv2D + Pool (64 filters) → Flatten + Dense (128 neurons) → Softmax Output (N classes)

Conv2D layer: Applies learnable filters across the image. Each filter detects one pattern (edge, curve, texture). More filters = richer feature set.
ReLU activation: Applied after each convolution. Sets all negative values to 0 — adds non-linearity without killing spatial structure.
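MaxPool2D: Takes the maximum value from each 2×2 region. This halves the height and width of every feature map while keeping the strongest activations. The pooling layer has no weights of its own, but it shrinks the input to later layers, which reduces their parameters and computation.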
Flatten: After the conv/pool stack, converts the 3D feature volume to a 1D vector for the dense classifier.
Dense + Softmax: The final classifier — same as Class 9 neural networks but much smaller because CNNs already extracted the key features.
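The ReLU, MaxPool2D and Flatten steps can each be written in a few lines of plain Python. This is a sketch on a made-up 4×4 feature map, with hypothetical helper names `relu` and `max_pool_2x2`; Keras applies the same operations to whole batches of multi-channel tensors.

```python
def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[r][c],     feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Made-up output of one convolutional filter
feature_map = [[ 1, -2,  3,  0],
               [-1,  5, -3,  2],
               [ 0, -4,  6, -1],
               [ 2,  1, -2,  4]]

activated = relu(feature_map)                    # negatives become 0
pooled = max_pool_2x2(activated)                 # 4x4 -> 2x2
flattened = [v for row in pooled for v in row]   # 2x2 -> list of 4 values

print(activated)
print(pooled)      # [[5, 3], [2, 6]]
print(flattened)   # [5, 3, 2, 6]
```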

Why pooling matters: A 224×224 feature map becomes 56×56 after two rounds of MaxPool2D. The dense layer then sees 56×56 × (number of filters) values instead of 224×224 × (number of filters), which cuts its weights by 16×. Pooling also makes the model more tolerant of small shifts in position, and because the same filters scan the whole image, a cat in the corner triggers the same feature detectors as a cat at the centre.
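The sizes in this note can be checked with a little arithmetic. The sketch assumes `padding='same'` convolutions (so only pooling changes the size), 64 filters in the last conv block, and a 128-neuron dense layer, as in the diagram above. The helper name `pooled_side` is made up for illustration.

```python
def pooled_side(side, rounds, pool=2):
    """Side length of a square feature map after `rounds` of pool x pool max pooling."""
    for _ in range(rounds):
        side //= pool
    return side

filters, neurons = 64, 128   # assumed from the architecture diagram

side = pooled_side(224, rounds=2)
with_pooling = side * side * filters * neurons     # dense weights after two pools
without_pooling = 224 * 224 * filters * neurons    # dense weights with no pooling

print(f"Feature map side after pooling: {side}")
print(f"Dense weights with pooling:    {with_pooling:,}")
print(f"Dense weights without pooling: {without_pooling:,}")
print(f"Reduction factor: {without_pooling // with_pooling}x")
```

The factor of 16 comes from halving both height and width twice: 2 × 2 × 2 × 2.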

Python Code

Build a CNN in Keras

# CNN for Image Classification — Google Colab
# Task: Classify images into 10 categories (CIFAR-10 dataset)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# ── Step 1: Load and normalise CIFAR-10 ──
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255.0  # scale pixels to 0-1
X_test  = X_test.astype('float32')  / 255.0

class_names = ['airplane','automobile','bird','cat','deer',
               'dog','frog','horse','ship','truck']

print(f"Training set: {X_train.shape}")   # (50000, 32, 32, 3)
print(f"Test set:     {X_test.shape}")    # (10000, 32, 32, 3)

# ── Step 2: Build CNN architecture ──
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),   # one 32x32 RGB image per sample
    # Block 1: Detect edges and textures
    layers.Conv2D(32, (3,3), activation='relu', padding='same'),
    layers.Conv2D(32, (3,3), activation='relu', padding='same'),
    layers.MaxPool2D(2, 2),     # 32x32 -> 16x16
    layers.Dropout(0.25),

    # Block 2: Detect shapes from edges
    layers.Conv2D(64, (3,3), activation='relu', padding='same'),
    layers.Conv2D(64, (3,3), activation='relu', padding='same'),
    layers.MaxPool2D(2, 2),     # 16x16 -> 8x8
    layers.Dropout(0.25),

    # Classifier head
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 CIFAR-10 classes
])

model.summary()  # see total parameters

# ── Step 3: Compile ──
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# ── Step 4: Train ──
history = model.fit(
    X_train, y_train,
    epochs=20,
    batch_size=64,
    validation_split=0.1,
    verbose=1
)

# ── Step 5: Evaluate ──
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.2%}")

# ── Step 6: Plot training curves ──
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['accuracy'], label='Train')
ax1.plot(history.history['val_accuracy'], label='Val')
ax1.set_title('Accuracy'); ax1.legend()
ax2.plot(history.history['loss'], label='Train')
ax2.plot(history.history['val_loss'], label='Val')
ax2.set_title('Loss'); ax2.legend()
plt.tight_layout(); plt.show()

# ── Step 7: Predict on one image ──
import numpy as np
idx = 42
pred = model.predict(X_test[idx:idx+1])
print(f"Predicted: {class_names[np.argmax(pred)]}  "
      f"Actual: {class_names[y_test[idx][0]]}")

Expected result: With 20 epochs on CIFAR-10, this architecture reaches roughly 75–80% test accuracy. A pretrained ResNet-50 with transfer learning (covered in Lesson 2) can reach 93%+ on the same task with far less training.
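Step 7 picks the class with the highest softmax probability. As a rough illustration of what the final layer does, the sketch below applies softmax to three made-up raw scores and then takes the arg-max. The shortened class list and the score values are assumptions for the example, not output from the model above.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

class_names = ['airplane', 'automobile', 'bird']   # shortened, hypothetical list
scores = [1.0, 3.0, 0.5]                           # made-up raw scores

probs = softmax(scores)
predicted = class_names[probs.index(max(probs))]   # same idea as np.argmax

print([round(p, 3) for p in probs])
print(f"Predicted: {predicted}")
```

The largest raw score always gets the largest probability, so the predicted class here is the one with score 3.0.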

Key Ideas Summary

What Makes CNNs Powerful

Parameter sharing: One filter's weights are shared across the whole image. A Conv2D layer with 32 filters of size 3×3 on an RGB input has only 3×3×3×32 = 864 weights (896 parameters once the 32 biases are included), regardless of image size.
Hierarchical features: Early layers detect edges → middle layers detect shapes → deep layers detect objects (eyes, wheels, leaves).
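Data augmentation: Flip, rotate, zoom training images to artificially increase dataset size. Keras provides preprocessing layers such as RandomFlip, RandomRotation and RandomZoom for this; the older ImageDataGenerator still appears in many tutorials but is deprecated in recent TensorFlow releases.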
Batch Normalisation: Often added after Conv layers to stabilise training — speeds up convergence significantly.
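Parameter sharing can be checked by counting parameters by hand. The sketch below uses the layer sizes of the CIFAR-10 model built earlier in this lesson; the helper names `conv2d_params` and `dense_params` are made up for illustration, and the total should agree with what `model.summary()` reports, assuming the architecture is unchanged.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter. Image size does not appear anywhere."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, neurons):
    """One weight per input per neuron, plus one bias per neuron."""
    return (inputs + 1) * neurons

layer_params = {
    "Conv2D  3 -> 32":  conv2d_params(3, 3, 32),
    "Conv2D 32 -> 32":  conv2d_params(3, 32, 32),
    "Conv2D 32 -> 64":  conv2d_params(3, 32, 64),
    "Conv2D 64 -> 64":  conv2d_params(3, 64, 64),
    "Dense 4096 -> 512": dense_params(8 * 8 * 64, 512),   # flattened 8x8x64 volume
    "Dense  512 -> 10":  dense_params(512, 10),
}

for name, count in layer_params.items():
    print(f"{name:<18} {count:>10,}")

total = sum(layer_params.values())
print(f"{'Total':<18} {total:>10,}")
```

Pooling, dropout and flatten layers contribute no parameters. Nearly all of the total sits in the first dense layer, while the four convolutional layers together stay small.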

Meera's mango project: To build a mango disease detector, you need roughly 500–1000 labelled photos (healthy / diseased). With a CNN and transfer learning (next lesson), 90%+ accuracy is a realistic target in Colab, and no expensive GPU is required for inference.
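With only a few hundred photos, augmentation helps stretch the dataset. The sketch below shows the idea behind a random horizontal flip on a tiny made-up image stored as nested lists. The function names and the 0.5 flip probability are assumptions for illustration; in a real project the Keras augmentation tools would do this on image tensors.

```python
import random

def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]

def augment(image, rng):
    """Return the image flipped with probability 0.5, otherwise an unchanged copy."""
    if rng.random() < 0.5:
        return horizontal_flip(image)
    return [row[:] for row in image]

# Tiny made-up 2x3 "image"
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
print(flipped)   # [[3, 2, 1], [6, 5, 4]]

rng = random.Random(0)   # fixed seed so runs are repeatable
augmented_batch = [augment(image, rng) for _ in range(4)]
for sample in augmented_batch:
    print(sample)
```

A mirrored mango is still healthy or diseased, so the label stays the same while the model sees a new arrangement of pixels.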

🧪 Check Your Understanding — Lesson 1 Quiz

1. The main reason regular (dense) neural networks struggle with images is:

a) They are too slow to train

b) They require GPU hardware

c) They lose spatial structure when images are flattened, and have too many parameters for image-sized inputs

d) They can only work with greyscale images

2. A convolutional filter (kernel) is typically:

a) The same size as the full input image

b) A small grid of learnable weights (e.g., 3×3) that slides across the image

c) A pre-defined mathematical function that cannot be changed by training

d) A list of pixel brightness values from the image

3. The purpose of MaxPool2D(2,2) in a CNN is to:

a) Add more feature maps to increase model capacity

b) Apply an activation function to remove negative values

c) Reduce the spatial dimensions of feature maps by half, keeping the strongest activations and reducing parameters

d) Connect all neurons from one layer to all neurons in the next

4. "Parameter sharing" in CNNs means:

a) Multiple users share the same trained model

b) One filter's weights are used at every position as it slides across the image, drastically reducing the number of learnable parameters

c) All layers in the network use the same weight values

d) Dense layers share weights with convolutional layers

5. In a deep CNN, what does the network typically learn in its earliest layers?

a) High-level concepts like "cat" or "car"

b) Simple patterns like edges and colour gradients, which later layers combine into shapes and objects

c) The softmax probabilities for each class

d) The exact pixel colours in the training images

6. In Keras, `padding='same'` in Conv2D means:

a) All layers use the same filter size

b) The output feature map has the same spatial dimensions as the input by adding zeros around the border

c) The model uses the same weights as the previous layer

d) No activation function is applied

7. Why is data augmentation useful when training a CNN?

a) It increases the number of test images to evaluate accuracy more precisely

b) It makes images smaller so training is faster

c) It artificially increases training set diversity (flips, rotations, zooms) so the model generalises better and overfits less

d) It normalises pixel values to the range 0–1

8. After two MaxPool2D(2,2) layers, a 224×224 feature map becomes:

a) 224×224 (pooling doesn't change size)

b) 112×112

c) 56×56

d) 28×28
