Lesson 4 — Transformers and Attention | Class 10

Meet Zara — Class 10, Srinagar

Zara loves Kashmiri poetry — ghazals, shayari, folk songs. She wanted to build a tool that could translate Kashmiri poetry to Hindi while preserving the feeling, not just the words. She tried Google Translate but the nuance was always lost.

Her AI teacher explained: "Old sequence models process words one-by-one and forget context from the beginning of a long poem by the time they reach the end. Transformers read the entire poem at once and use 'attention' to link words across any distance — even the first line to the last." Zara was fascinated. "So when it reads 'river', it knows to connect it to 'longing' twenty words later?" "Exactly."

The Problem with RNNs

Why Sequence Models Struggled

Before transformers, text was processed by Recurrent Neural Networks (RNNs) and LSTMs. They process tokens one at a time, left-to-right, carrying a "hidden state" forward.

Vanishing gradient: Gradients shrink as they flow back through many time steps. The model forgets early context when processing long sequences.
No parallelisation: Each token must wait for the previous one, so the time steps of a sequence cannot be processed in parallel on a GPU and training is slow.
Long-range dependency failure: In the sentence "The musical group that won the prize after competing for ten years were happy" — "group" and "were" are far apart. RNNs often get the agreement wrong.
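To get a feel for the vanishing gradient problem listed above, here is a toy calculation, not a real RNN. It assumes (purely for illustration) that every backward step multiplies the gradient by the same factor of 0.9; the function name `gradient_after_steps` is made up for this sketch.

```python
# Toy illustration of the vanishing gradient (not a real RNN).
# Assumption: each step back in time scales the gradient by the same factor.

def gradient_after_steps(per_step_factor: float, steps: int) -> float:
    """Fraction of the gradient that survives after `steps` backward steps."""
    return per_step_factor ** steps

for steps in (1, 10, 50, 100):
    surviving = gradient_after_steps(0.9, steps)
    print(f"after {steps:3d} steps: {surviving:.6f}")
```

Even with a factor as mild as 0.9, almost nothing is left after 100 steps, so words at the start of a long poem barely influence what the model learns.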

The 2017 paper "Attention Is All You Need" (Vaswani et al.) introduced the Transformer, which was designed to address all three problems.

Core Mechanism

The Attention Mechanism Explained

Attention lets each token "look at" every other token in the sequence simultaneously and decide how much to weight each one. Imagine reading the sentence: "The apple trees in the valley it grew in were beautiful."

When processing "it", attention asks: which word does "it" refer to? The model looks at every other word in the sentence, assigns high weight to "valley" and "apple trees", and low weight to "grew" and "were".

Self-attention uses three learned weight matrices to turn each token's embedding into three vectors:

Query (Q): "What am I looking for?" — represents the current token asking a question.
Key (K): "What do I have to offer?" — represents each other token's "identity".
Value (V): "What information do I carry?" — the actual content to aggregate if attended to.

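The attention output for the whole sequence is computed as: $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Row i of $QK^T$ holds the raw scores between token i and every token j, and softmax turns each row into weights that sum to 1. Dividing by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) prevents very large dot products when that dimension is high, which would otherwise push softmax into regions with near-zero gradients.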
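The formula can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements it: the sizes, the random embeddings `X`, and the weight matrices `W_q`, `W_k`, `W_v` are all made-up stand-ins for values a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) raw scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Hypothetical sizes: 6 tokens, 8-dim embeddings, 4-dim queries/keys/values.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.shape, output.shape)       # (6, 6) (6, 4)
```

Each row of `weights` says how much one token attends to every token, and `output` is the matching weighted mix of the value vectors.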

Below is a simplified attention weight heatmap for the sentence "The river in Srinagar flows swiftly":

	The	river	in	Srinagar	flows	swiftly
The	0.52	0.21	0.09	0.07	0.06	0.05
river	0.18	0.44	0.11	0.17	0.06	0.04
in	0.06	0.22	0.38	0.28	0.04	0.02
Srinagar	0.05	0.29	0.14	0.41	0.07	0.04
flows	0.06	0.31	0.08	0.19	0.27	0.09
swiftly	0.04	0.17	0.05	0.09	0.38	0.27

Read each row as: "When processing this word, how much attention is paid to each word in the sentence, including itself?" Each row sums to 1, and a higher number means more attention. Notice that "flows" attends strongly to "river" (0.31): it knows what is flowing.
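If you want to explore the table yourself, the sketch below copies the numbers into plain Python lists and, for each word, finds the other word it attends to most. The helper name `strongest_other` is invented for this example.

```python
tokens = ["The", "river", "in", "Srinagar", "flows", "swiftly"]

# Attention weights copied from the table (row = word being processed).
weights = [
    [0.52, 0.21, 0.09, 0.07, 0.06, 0.05],
    [0.18, 0.44, 0.11, 0.17, 0.06, 0.04],
    [0.06, 0.22, 0.38, 0.28, 0.04, 0.02],
    [0.05, 0.29, 0.14, 0.41, 0.07, 0.04],
    [0.06, 0.31, 0.08, 0.19, 0.27, 0.09],
    [0.04, 0.17, 0.05, 0.09, 0.38, 0.27],
]

def strongest_other(row_index):
    """Word (other than itself) that gets the most attention from this row."""
    row = weights[row_index]
    others = [(w, tokens[j]) for j, w in enumerate(row) if j != row_index]
    return max(others)[1]

# Every row should sum to 1 (allowing for floating-point rounding).
row_sums_ok = all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
print("rows sum to 1:", row_sums_ok)

for i, word in enumerate(tokens):
    print(f"{word} -> {strongest_other(i)}")
```

For example, the row for "flows" points to "river", and the row for "swiftly" points to "flows".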

Architecture Details

Multi-Head Attention and Positional Encoding

Multi-head attention: Run attention multiple times in parallel with different Q, K, V weight matrices. Each "head" learns to attend to different kinds of relationships. For example, one head might track subject-verb agreement, another tracks coreference (pronoun → noun).
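A rough NumPy sketch of the idea: run attention once per head with its own Q, K, V weight matrices, join the results side by side, and mix them with one more matrix. All sizes and random weights here are assumptions for illustration, and the output matrix `W_o` follows the design in the original Transformer paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)             # (seq, d_head)
    concat = np.concatenate(head_outputs, axis=-1)   # (seq, heads * d_head)
    return concat @ W_o                              # mix the heads together

# Hypothetical sizes: 6 tokens, 8-dim embeddings, 2 heads of 4 dims each.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
W_o = rng.normal(size=(d_model, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (6, 8)
```

Because each head has its own weights, each one can end up with a different attention pattern over the same sentence.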

Positional encoding: Since transformers process all tokens in parallel, there's no inherent sense of order. Positional encodings are added to token embeddings to inject position information. The original Transformer builds them from sine and cosine functions at different frequencies, while some models, such as BERT, learn their position embeddings instead.
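Here is a small sketch of the sine and cosine scheme from the original Transformer paper, where even dimensions use $\sin(pos / 10000^{2i/d_{model}})$ and odd dimensions use the matching cosine. The function name and the tiny sizes are chosen only for this example, and the code assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of position encodings."""
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = positions / (10000 ** (even_dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)         # (6, 8)
print(pe[0].round(2))   # position 0: sine terms are 0, cosine terms are 1
```

Every position gets a different row, so adding `pe` to the token embeddings lets the model tell the first word from the last.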

# Visualising Attention with HuggingFace Transformers — Google Colab
!pip install transformers bertviz -q

from transformers import BertTokenizer, BertModel
from bertviz import head_view
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

sentence = "The river in Srinagar flows swiftly through the valley"
inputs   = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

attention = outputs.attentions      # tuple with one tensor per layer, each of shape (batch, heads, seq, seq)
tokens    = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print(f"Layers: {len(attention)}")  # 12 layers in BERT-base
print(f"Heads per layer: {attention[0].shape[1]}")   # 12 heads
print(f"Sequence length: {attention[0].shape[-1]}")  # number of tokens, including [CLS] and [SEP]

# Interactive visualisation — run in Colab to see coloured attention arcs
head_view(attention, tokens)

BERT vs GPT

Two Transformer Architectures

Encoder-only

BERT (Bidirectional)

Reads entire sentence left ↔ right simultaneously
Pre-trained with Masked Language Model (MLM) — predict hidden words
Great for: classification, NER, QA, sentiment
Used in: Google Search, many Indian NLP tools
Not designed to generate new text

Decoder-only

GPT (Auto-regressive)

Reads left → right only (causal masking)
Pre-trained by predicting next token
Great for: text generation, code, chatbots
Used in: ChatGPT, GitHub Copilot
Can generate fluent, long-form text

The full transformer (encoder + decoder) is used for sequence-to-sequence tasks like translation and summarisation. BERT uses only the encoder; GPT uses only the decoder. For translation, you'd use a model like T5 or mBART (encoder-decoder).
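The difference between bidirectional (BERT-style) and causal (GPT-style) attention comes down to a mask on the scores. The sketch below uses random scores for a made-up 4-token sequence; the function name `attention_weights` is invented for this example and is not taken from any library.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw scores into weights, optionally hiding future tokens."""
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal marks positions that lie in the future.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))   # stand-in raw scores for 4 tokens

bert_style = attention_weights(scores, causal=False)
gpt_style = attention_weights(scores, causal=True)

print(bert_style.round(2))   # every token can see every token
print(gpt_style.round(2))    # zeros above the diagonal: no peeking ahead
```

In the causal version the first token can only attend to itself, which is what lets a GPT-style model be trained to predict the next token without seeing it.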

🧪 Check Your Understanding — Lesson 4 Quiz

1. The main problem with RNNs that transformers solve is:

a) RNNs can only process images, not text

b) RNNs struggle with long-range dependencies due to vanishing gradients and cannot be parallelised, making them slow on long sequences

c) RNNs require too much labelled data

d) RNNs produce bounding boxes instead of text

2. In self-attention, the "Query" vector represents:

a) The database of all training examples

b) The current token asking "which other tokens should I attend to?"

c) The output label for classification

d) The gradient update direction

3. Why is the attention score divided by √d_k before softmax?

a) To convert attention scores from degrees to radians

b) To ensure all attention scores sum to exactly 1

c) To prevent very large dot products when embedding dimension is high — large values would push softmax into near-zero gradient regions

d) To speed up matrix multiplication on GPU

4. Multi-head attention uses multiple Q, K, V projections in parallel because:

a) It reduces memory usage by splitting computation

b) Each head can learn to attend to different kinds of relationships simultaneously — e.g., one head for syntax, another for coreference

c) It is required to make the model work on multiple languages

d) Each head handles one sentence at a time

5. Positional encoding is added to transformer input because:

a) Transformers process all tokens in parallel and have no inherent sense of word order — positional encodings inject this information

b) It converts words to numbers for the model

c) It normalises token embeddings to unit length

d) It assigns attention weights based on word frequency

6. BERT is described as "bidirectional" because:

a) It can generate text forwards and backwards

b) It reads text in both Devanagari and Latin scripts

c) Its encoder attention can attend to tokens on both the left and right simultaneously — unlike GPT which is causal (left-to-right only)

d) It has two separate models that are merged

7. GPT-style models are better suited than BERT for:

a) Sentence classification tasks like spam detection

b) Named Entity Recognition (NER)

c) Text generation tasks — writing stories, answering open-ended questions, code generation — because they predict the next token autoregressively

d) Object detection in images

8. The landmark paper that introduced the transformer architecture is titled:

a) "Deep Residual Learning for Image Recognition"

b) "Generative Adversarial Networks"

c) "Attention Is All You Need"

d) "BERT: Pre-training of Deep Bidirectional Transformers"
