Inside Neural Networks 🧠

Class 9 · Age 13-14 · Lesson 1 of 12 · Free

Meet Arjun: Class 9, Visakhapatnam

Arjun did great in Class 8. He knows AI learns from data. He's used ChatGPT, tried Python in Colab, and even built a simple sorting algorithm. But one question still bothers him: "When I feed data into a neural network, what is actually happening inside?"

His teacher says "the network adjusts its weights." But what does that really mean? Today we'll open the black box together. By the end, Arjun (and you) will be able to draw a diagram of a neural network and explain what happens at each step.

First: Quick Recap
What We Already Know

From Class 8, we know that a neural network is inspired by the human brain. It takes input, processes it through layers, and produces an output (a prediction). But we treated the inside as a "black box." Now we open it.

Class 9 Promise: After this lesson you will never say "the AI just figures it out." You'll be able to explain which mathematical operations happen, in enough detail to actually write code.
Part 1
The Three Kinds of Layers

Every neural network is made of layers stacked one after another. There are exactly three types:

Input Layer
Receives your raw data. One node per feature. No computation happens here.
Hidden Layers
Where the real computation happens. A network can have anywhere from 1 to thousands of them. "Deep learning" just means many hidden layers.
Output Layer
Produces the final prediction. One node per class for multi-class classification, or a single node for a yes/no probability or a number (regression).
Example: Predict if a student will pass an exam (3 features → 1 output)
Input (3): H, A, S → Hidden (4) → Output (1): P

H = Hours studied · A = Attendance % · S = Sleep hours → P = Pass/Fail probability

Part 2
Weights: The "Knobs" of a Network

Every connection between two nodes has a number called a weight. The weight says: "how much should this input influence the next node?" A large positive weight means "pay a lot of attention." A weight near zero means "almost ignore this."

🎚 Analogy: A Music Mixer

A DJ's mixing board has sliders (a volume control for each instrument). The neural network's weights are those sliders. Training adjusts the sliders until the output sounds right. Before training, the sliders are set randomly. After training, they sit at positions that produce good predictions.

In a tiny network with 3 inputs and 4 hidden nodes, there are already 3 × 4 = 12 weights just for the first layer. Large modern networks have billions of weights: GPT-3, for example, has about 175 billion parameters.
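To make the counting concrete, here is a small sketch that tallies the numbers a fully connected network has to learn. It assumes every node in one layer connects to every node in the next, and it also counts one bias per hidden or output node (the bias is introduced in Part 3). The function name count_parameters is just a name chosen for this example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of nodes in each layer, input layer first.
    """
    weights = 0
    biases = 0
    # Pair each layer with the layer that follows it: (3, 4), then (4, 1).
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights += n_in * n_out  # one weight per connection
        biases += n_out          # one bias per receiving node
    return weights, biases


# The student example: 3 inputs, 4 hidden nodes, 1 output node.
weights, biases = count_parameters([3, 4, 1])
print(f"Weights: {weights}, biases: {biases}, total: {weights + biases}")
```

For the 3 → 4 → 1 network this gives 12 + 4 = 16 weights and 4 + 1 = 5 biases. Try adding more layers to the list and watch how quickly the total grows.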

Key Insight: When we say "the model learned," we mean: the values of all the weights were adjusted until the output became correct (or close enough). The weights are the knowledge stored in the network.
Part 3
What Happens Inside One Node

Each node in a hidden or output layer does two simple things:

1. Weighted Sum

Multiply each input by its weight, add them all together, then add one more number called the bias. Result: z = (w₁ × x₁) + (w₂ × x₂) + ... + bias

2. Activation Function

Pass the sum z through an activation function. This decides whether this node "fires" and with what intensity. Common functions: ReLU, Sigmoid, Softmax.

ReLU
max(0, z): if z is negative, output 0; if z is positive, pass it through unchanged. Used in most hidden layers. Fast and simple.
Sigmoid
Squeezes any number into the range 0 to 1. Used in the output layer for yes/no (binary) classification.
Softmax
Converts a list of output scores into probabilities that add up to 1. Used in the output layer when classifying into many categories.
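The three functions above are short enough to write by hand. The sketch below uses only Python's built-in math module; real libraries ship their own optimised versions, so treat this as an illustration of the formulas rather than how a framework implements them. The input numbers are made up.

```python
import math


def relu(z):
    # Negative values become 0; positive values pass through unchanged.
    return max(0.0, z)


def sigmoid(z):
    # Squeezes any number into the range 0 to 1.
    # (For extremely negative z, math.exp can overflow; fine for a demo.)
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtracting the largest score keeps exp() from overflowing
    # and does not change the final probabilities.
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), relu(3.5))
print(sigmoid(0.0))

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

Notice that softmax keeps the ranking of the scores: the biggest score gets the biggest probability.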
Why do we need activation functions? Without them, stacking layers is mathematically the same as one layer. Activation functions add non-linearity: they let networks learn complex, curved patterns, not just straight lines.
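You can check the "same as one layer" claim yourself. The sketch below uses the simplest possible case, one input and one node per layer, with made-up weights. Two layers without an activation always match a single layer whose weight is w2 × w1 and whose bias is w2 × b1 + b2. Putting ReLU in between breaks that match.

```python
def layer(x, w, b):
    # One node with one input: just the weighted sum, no activation.
    return w * x + b


def relu(z):
    return max(0.0, z)


# Made-up weights and biases for two stacked layers.
w1, b1 = 2.0, -4.0
w2, b2 = 3.0, 1.0

# The single layer that the two stacked layers collapse into.
w, b = w2 * w1, w2 * b1 + b2

for x in [-1.0, 0.0, 1.0, 3.0]:
    stacked = layer(layer(x, w1, b1), w2, b2)
    single = layer(x, w, b)
    with_relu = layer(relu(layer(x, w1, b1)), w2, b2)
    print(f"x={x:5.1f}  stacked={stacked:6.1f}  single={single:6.1f}  with ReLU={with_relu:6.1f}")
```

The "stacked" and "single" columns are identical for every x. The "with ReLU" column stays flat at 1.0 for the first three inputs and only then rises: that bend is the non-linearity a straight line cannot produce.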
Part 4
Forward Pass: From Input to Prediction

When you give a trained network an input, data flows forward through the layers, one layer at a time. Each layer produces a set of numbers that become the input to the next layer. This one-way trip is called the forward pass.

# Tiny neural network forward pass (concept only, no framework)

# Input: [hours_studied=6, attendance=0.85, sleep=7]
x = [6, 0.85, 7]

# Weights for 2 hidden nodes (made-up values for illustration)
w = [[0.4, 0.1, 0.3],   # weights for hidden node 1
     [0.2, 0.5, 0.1]]   # weights for hidden node 2
bias = [0.2, 0.1]

# Step 1: Weighted sum
z1 = sum(w[0][i]*x[i] for i in range(3)) + bias[0]
z2 = sum(w[1][i]*x[i] for i in range(3)) + bias[1]

# Step 2: ReLU activation
def relu(z): return max(0, z)
h1, h2 = relu(z1), relu(z2)

print(f"Hidden node 1 output: {h1:.4f}")
print(f"Hidden node 2 output: {h2:.4f}")
# These then feed forward to the output layer...

Run this in Google Colab and try changing the weights to see how the output changes. Training makes exactly this kind of adjustment automatically, using calculus to decide which way to move each weight.
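The snippet above stops at the hidden layer. As a rough sketch of the remaining step, the code below repeats the same inputs and hidden weights, then adds one output node with a sigmoid activation. The output weights w_out and bias b_out are made-up values chosen only for illustration, and weighted_sum is a helper name invented for this example.

```python
import math

# Same input and hidden-layer values as in the lesson's snippet.
x = [6, 0.85, 7]
w_hidden = [[0.4, 0.1, 0.3],
            [0.2, 0.5, 0.1]]
b_hidden = [0.2, 0.1]

# Assumed (made-up) values for the single output node.
w_out = [0.5, -0.3]
b_out = -1.0


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(weights, inputs, bias):
    # Multiply each input by its weight, add them up, then add the bias.
    return sum(w * v for w, v in zip(weights, inputs)) + bias


# Hidden layer: weighted sum followed by ReLU, once per hidden node.
hidden = [relu(weighted_sum(w, x, b)) for w, b in zip(w_hidden, b_hidden)]

# Output layer: the hidden outputs are now the inputs.
z_out = weighted_sum(w_out, hidden, b_out)
p = sigmoid(z_out)

print("Hidden outputs:", [round(h, 4) for h in hidden])
print(f"Pass probability: {p:.4f}")
```

Because the weights are invented, the probability means nothing yet. Training is what turns these numbers into useful ones.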

Part 5
How Training Works: Backpropagation

Training a neural network means finding the right weights. The standard method pairs backpropagation (which works out how each weight should change) with gradient descent (which makes the change). Here's the intuition:

1. Make a prediction (forward pass)

Feed an input through the network to get a predicted output.

2. Measure the error (loss function)

Compare the prediction to the correct answer. The "loss" is a number that tells us how wrong we are. Lower loss = better model.

3. Propagate the error backwards

Using calculus (the chain rule), calculate how much each weight contributed to the error. This is the "gradient."

4. Update the weights (gradient descent)

Adjust each weight by a tiny amount in the direction that reduces the loss. The size of the step is controlled by the learning rate.

5. Repeat thousands of times

Each pass through the full training dataset is called an epoch. After many epochs, the weights stabilise and the loss is low.
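The five steps can be sketched for the smallest possible network: one input, one weight, one bias, and a sigmoid output. Everything here is an assumption made for illustration: the data is invented, the loss is squared error (picked because its chain rule is easy to read), and the learning rate and epoch count are arbitrary. Real libraries do the same bookkeeping for millions of weights at once.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up training data: hours studied -> failed (0) or passed (1).
hours = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0          # start with untrained values
learning_rate = 0.1      # assumed step-size setting
losses = []              # average loss per epoch

for epoch in range(2000):                    # Step 5: repeat
    total_loss = 0.0
    for x, y in zip(hours, passed):
        # Step 1: forward pass
        z = w * x + b
        p = sigmoid(z)

        # Step 2: measure the error (squared error)
        total_loss += (p - y) ** 2

        # Step 3: chain rule, working backwards from the loss
        dloss_dp = 2 * (p - y)       # how the loss changes with p
        dp_dz = p * (1 - p)          # how p changes with z (sigmoid slope)
        grad_w = dloss_dp * dp_dz * x
        grad_b = dloss_dp * dp_dz

        # Step 4: gradient descent update
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    losses.append(total_loss / len(hours))

print(f"Loss in first epoch: {losses[0]:.4f}")
print(f"Loss in last epoch:  {losses[-1]:.4f}")
print(f"P(pass | 1 hour)  = {sigmoid(w * 1.0 + b):.2f}")
print(f"P(pass | 8 hours) = {sigmoid(w * 8.0 + b):.2f}")
```

Try changing learning_rate or the number of epochs and compare the first and last loss values.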

Analogy: Rolling Downhill in the Dark

Imagine you're blindfolded on a hilly landscape. Your goal: find the lowest valley. You feel the slope under your feet (the gradient) and take one small step downhill. Then check the slope again, take another step. Repeat until you can't go any lower. That valley is a point of low loss: a good set of weights, though not always the very lowest valley on the whole landscape.
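Here is the downhill walk as a sketch on a made-up, bowl-shaped landscape with a single weight. The loss (w − 3)² is an assumption chosen because its lowest point is obviously at w = 3; real networks have far bumpier landscapes. The function names are invented for this example.

```python
def loss(w):
    # A simple bowl whose lowest point is at w = 3.
    return (w - 3.0) ** 2


def slope(w):
    # The gradient of the bowl: positive means "uphill to the right".
    return 2.0 * (w - 3.0)


def descend(learning_rate, steps, start=0.0):
    w = start
    for _ in range(steps):
        # Step in the opposite direction of the slope.
        w -= learning_rate * slope(w)
    return w


careful = descend(learning_rate=0.1, steps=50)
reckless = descend(learning_rate=1.1, steps=50)

print(f"Small steps end at w = {careful:.4f}, loss = {loss(careful):.6f}")
print(f"Huge steps end at w = {reckless:.1f}, loss = {loss(reckless):.1f}")
```

With small steps the walker settles very close to w = 3. With steps that are too large, each step overshoots the valley by more than the last and the walker ends up far away, which is why the learning rate has to be chosen with care.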

Real World: You don't write backpropagation yourself. Libraries like TensorFlow and PyTorch calculate all the gradients automatically. You define the network shape, loss function, and learning rate; in Keras, the library used in the Colab example below, training is then a single call to model.fit(). But now you know what's happening inside!
Part 6
Why "Deep" Learning?

A shallow network (1-2 hidden layers) can learn simple patterns, like whether a student passes based on study hours. A deep network (many hidden layers) can learn complex patterns, like recognising a face in a photo, translating a language, or generating realistic text.

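Each layer learns to detect increasingly complex features. In an image network, for example, early layers pick up edges, middle layers combine them into shapes, and later layers recognise whole objects.

GPT-3, for example, has 96 transformer layers. Each layer refines the network's representation of the text a little more, so that by the final layer it has built up enough abstraction to generate coherent text.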
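In code, "deep" just means the same layer step repeated in a loop. The sketch below runs an input through a list of fully connected layers; the layer sizes and all the weights are made up, and the helper names dense and forward are invented for this example. It shows plain dense layers only, not the transformer layers used by language models.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    # "No activation": pass the weighted sum through unchanged.
    return z


def dense(inputs, weights, biases, activation):
    # One output per row of weights: weighted sum, plus bias, then activation.
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    values = inputs
    for weights, biases, activation in layers:
        # The output of each layer becomes the input to the next.
        values = dense(values, weights, biases, activation)
    return values


# A made-up network: 2 inputs -> 3 nodes -> 2 nodes -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, 0.0], relu),
    ([[0.2, 0.7, -0.5], [0.6, -0.1, 0.3]], [0.1, 0.0], relu),
    ([[1.0, -1.0]], [0.0], identity),
]

output = forward([1.0, 2.0], layers)
print("Network output:", [round(v, 4) for v in output])
```

Adding another hidden layer means adding one more entry to the list; the forward function does not change at all.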
Try It in Colab
Your First Keras Neural Network

Here's a minimal neural network you can run right now in Google Colab. It trains on made-up student data to predict exam pass/fail:

# Open Colab: colab.research.google.com → New Notebook

import numpy as np
from tensorflow import keras

# Fake student data: [hours_studied, attendance, sleep_hours]
X = np.array([[2, 0.5, 5], [7, 0.9, 8], [4, 0.7, 6],
              [1, 0.3, 4], [8, 0.95, 8], [3, 0.6, 6]])
y = np.array([0, 1, 1, 0, 1, 0])  # 0=fail, 1=pass

# Build a simple neural network
model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(3,)),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(X, y, epochs=100, verbose=0)
print(f"Final accuracy: {history.history['accuracy'][-1]:.2f}")

# Predict for a new student: [5 hours, 80% attendance, 7 hours sleep]
new_student = np.array([[5, 0.8, 7]])  # one row per student, as a NumPy array
prediction = model.predict(new_student)
print(f"Pass probability: {prediction[0][0]:.2f}")
Try: Change the number 4 (hidden nodes) to 8 or 16. See if accuracy improves. Change epochs=100 to epochs=200. Watch what happens. You're doing hyperparameter tuning!

🧪 Check Your Understanding: Lesson 1 Quiz

1. Which layer of a neural network receives the raw input data and does NO computation?
a) Output layer
b) Hidden layer
c) Input layer
d) Loss layer
2. What are "weights" in a neural network?
a) The size of the dataset used for training
b) Numbers on each connection that control how much influence one node has on the next
c) The number of layers in the network
d) The output probabilities from the last layer
3. Why do we need activation functions like ReLU?
a) To speed up data loading
b) To reduce the size of the dataset
c) To add non-linearity so the network can learn complex patterns
d) To convert weights into probabilities
4. In backpropagation, what does the "gradient" tell us?
a) The number of training epochs completed
b) The total number of weights in the network
c) How much each weight contributed to the prediction error
d) The final accuracy of the model
5. What is one "epoch" in neural network training?
a) A single weight update
b) One full pass through the entire training dataset
c) The time taken for one forward pass
d) Adding one new layer to the network
6. What activation function should you use in the output layer for binary (yes/no) classification?
a) ReLU
b) Softmax
c) Sigmoid
d) Tanh
7. What is "gradient descent"?
a) A method for loading datasets faster
b) The process of adjusting weights in small steps to reduce prediction error
c) A way to visualise neural network layers
d) The formula for computing weighted sums
8. Why is a "deep" network better than a "shallow" network for complex tasks?
a) Deep networks have more training data
b) Deep networks run faster on computers
c) Each layer learns increasingly complex features, enabling the network to understand patterns shallow networks cannot
d) Deep networks use less memory