Transformers and Attention 🔮

Class 10 · Age 14–15 · Lesson 4 of 12 · 🆓 Free

Meet Zara: Class 10, Srinagar

Zara loves Kashmiri poetry: ghazals, shayari, folk songs. She wanted to build a tool that could translate Kashmiri poetry to Hindi while preserving the feeling, not just the words. She tried Google Translate, but the nuance was always lost.

Her AI teacher explained: "Old sequence models process words one-by-one and forget context from the beginning of a long poem by the time they reach the end. Transformers read the entire poem at once and use 'attention' to link words across any distance, even the first line to the last." Zara was fascinated. "So when it reads 'river', it knows to connect it to 'longing' twenty words later?" "Exactly."

The Problem with RNNs
Why Sequence Models Struggled

Before transformers, text was processed by Recurrent Neural Networks (RNNs) and LSTMs. They process tokens one at a time, left-to-right, carrying a "hidden state" forward.
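
To make the one-token-at-a-time idea concrete, here is a minimal sketch of a plain RNN in NumPy. The function name `rnn_forward`, the sizes, and the random weights are illustrative assumptions, not taken from any particular library. The point to notice is that each step needs the hidden state from the step before it.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain (Elman-style) RNN over a sequence, one token at a time.

    inputs: array of shape (seq_len, input_dim)
    Returns the hidden state after every step, shape (seq_len, hidden_dim).
    """
    hidden = np.zeros(W_h.shape[0])       # initial hidden state
    states = []
    for x_t in inputs:                    # strictly left-to-right
        # The new state depends on the current token AND the previous state,
        # so the steps cannot run in parallel.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

# Toy example: 6 tokens, 4-dimensional embeddings, 3-dimensional hidden state.
rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 3
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.5
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (6, 3)
```

Everything the model knows about the first token has to survive inside `hidden` through every later step, which is why early context tends to fade on long sequences.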

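The 2017 paper "Attention Is All You Need" (Vaswani et al.) introduced the Transformer, which addressed the three main weaknesses of this design at once: context from early tokens fades over long sequences (the vanishing-gradient problem), tokens cannot be processed in parallel, and training is therefore slow on long sequences.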

Core Mechanism
The Attention Mechanism Explained

Attention lets each token "look at" every other token in the sequence simultaneously and decide how much weight to give each one. Imagine reading the sentence: "The apple tree in the valley it grew in was beautiful."

When processing "it", attention asks: which word does "it" refer to? The model scores every other word in the sentence, and a well-trained model would assign high weight to "apple tree" and "valley", and low weight to "grew" and "was".
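
Attention weights are raw relevance scores turned into positive numbers that sum to 1, which is what the softmax function does. The sketch below shows that step in plain Python. The scores are made-up numbers for the token "it", chosen only for illustration; they are not output from a trained model.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw relevance scores for the token "it" (made-up numbers).
words = ["apple", "valley", "grew", "beautiful"]
raw_scores = [2.0, 2.5, 0.1, 0.3]

weights = softmax(raw_scores)
for word, weight in zip(words, weights):
    print(f"{word:>10}: {weight:.2f}")
```

Words with higher raw scores end up with a larger share of the total weight, and the shares always add up to 1.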

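Self-attention uses three learned weight matrices, which project each token's embedding into three vectors:

  • Query (Q): what the current token is looking for ("which other tokens should I attend to?")
  • Key (K): what each token offers to be matched against a query
  • Value (V): the information a token passes on when it is attended to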

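The attention output for all tokens is computed as: $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ The raw score between token i and token j is the dot product of token i's query with token j's key, i.e. entry (i, j) of $QK^T$. Dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, keeps these dot products from growing very large when that dimension is high; large values would push the softmax into regions where gradients are nearly zero.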
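
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices `W_q`, `W_k`, `W_v` are random here (a real model learns them), and the sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K: shape (seq_len, d_k); V: shape (seq_len, d_v).
    Returns (output, weights), where weights[i, j] is how much
    token i attends to token j.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) score matrix
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

# Toy input: 6 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(42)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))   # token embeddings
W_q = rng.normal(size=(d_model, d_k))     # learned in a real model; random here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)          # (6, 4)
print(weights.sum(axis=1))   # every row sums to 1 (up to rounding)
```

Each row of `weights` is one token's view of the whole sentence, and each row of `output` is the corresponding weighted mix of value vectors.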

Below is a simplified attention weight heatmap for the sentence "The river in Srinagar flows swiftly":

          The   river  in    Srinagar  flows  swiftly
The       0.52  0.21   0.09  0.07      0.06   0.05
river     0.18  0.44   0.11  0.17      0.06   0.04
in        0.06  0.22   0.38  0.28      0.04   0.02
Srinagar  0.05  0.29   0.14  0.41      0.07   0.04
flows     0.06  0.31   0.08  0.19      0.27   0.09
swiftly   0.04  0.17   0.05  0.09      0.38   0.27

Read each row as: "When processing this word, how much attention is paid to each word in the sentence?" Darker green = higher attention. Notice that "flows" attends strongly to "river" (0.31): it knows what is flowing.
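
As a quick way to practise reading the heatmap, the sketch below stores the same numbers in a Python list, checks that every row sums to 1, and finds which other word each token attends to most. The helper name `strongest_other` is made up for this example.

```python
tokens = ["The", "river", "in", "Srinagar", "flows", "swiftly"]

# Rows: the token being processed. Columns: the token being attended to.
attention = [
    [0.52, 0.21, 0.09, 0.07, 0.06, 0.05],  # The
    [0.18, 0.44, 0.11, 0.17, 0.06, 0.04],  # river
    [0.06, 0.22, 0.38, 0.28, 0.04, 0.02],  # in
    [0.05, 0.29, 0.14, 0.41, 0.07, 0.04],  # Srinagar
    [0.06, 0.31, 0.08, 0.19, 0.27, 0.09],  # flows
    [0.04, 0.17, 0.05, 0.09, 0.38, 0.27],  # swiftly
]

# Each row is a softmax output, so it should sum to 1.
for token, row in zip(tokens, attention):
    assert abs(sum(row) - 1.0) < 1e-9, f"row for {token!r} does not sum to 1"

def strongest_other(query_index):
    """Return the token (other than itself) that the query token attends to most."""
    row = attention[query_index]
    candidates = [j for j in range(len(tokens)) if j != query_index]
    best = max(candidates, key=lambda j: row[j])
    return tokens[best]

for i, token in enumerate(tokens):
    print(f"{token:>9} -> {strongest_other(i)}")
```

For the row "flows", the strongest other token is "river", matching the observation above.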

Architecture Details
Multi-Head Attention and Positional Encoding

Multi-head attention: Run attention multiple times in parallel with different Q, K, V weight matrices. Each "head" learns to attend to different kinds of relationships. For example, one head might track subject-verb agreement while another tracks coreference (pronoun → noun).
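
One common way to implement this is to project the input once, split the result into equal slices (one per head), run attention on each slice, and then concatenate the heads and mix them with an output matrix. The NumPy sketch below follows that layout; the name `multi_head_attention`, the output matrix `W_o`, the sizes, and the random weights are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention with several heads running in parallel.

    X: (seq_len, d_model). W_q, W_k, W_v, W_o: (d_model, d_model).
    Returns (output, weights); weights has shape (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one score matrix per head
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (num_heads, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model) and mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

# Toy example: 6 tokens, 8-dimensional embeddings, 2 heads, random weights.
rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 6, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)    # (6, 8)
print(weights.shape)   # (2, 6, 6)
```

Because each head gets its own slice of the projections, the two 6×6 weight matrices in `weights` are generally different, which is what allows heads to specialise.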

Positional encoding: Since transformers process all tokens in parallel, they have no inherent sense of word order. Positional encodings are added to the token embeddings to inject position information; the original Transformer uses sine and cosine functions at different frequencies.
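
A minimal NumPy sketch of the sinusoidal scheme from "Attention Is All You Need" is shown below: even dimensions use sine, odd dimensions use cosine, and the wavelength grows with the dimension index. The function name and the toy sizes are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings (d_model is assumed to be even)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)   # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return encoding

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8)
print(pe[0])      # position 0: sines are 0, cosines are 1
```

In use, the encoding is simply added to the token embeddings (for example `X + pe` for an embedding matrix `X` of the same shape), so that two identical words at different positions get different input vectors.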

# Visualising Attention with HuggingFace Transformers (Google Colab)
!pip install transformers bertviz -q

from transformers import BertTokenizer, BertModel
from bertviz import head_view
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model     = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

sentence = "The river in Srinagar flows swiftly through the valley"
inputs   = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

attention = outputs.attentions      # tuple with one tensor per layer, each (batch, heads, seq, seq)
tokens    = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print(f"Layers: {len(attention)}")  # 12 layers in BERT-base
print(f"Heads per layer: {attention[0].shape[1]}")   # 12 heads
print(f"Sequence length: {attention[0].shape[-1]}")  # tokens

# Interactive visualisation: run in Colab to see coloured attention arcs
head_view(attention, tokens)

BERT vs GPT
Two Transformer Architectures

Encoder-only: BERT (Bidirectional)
  • Reads the entire sentence in both directions (left ↔ right) simultaneously
  • Pre-trained with Masked Language Modelling (MLM): predict hidden words
  • Great for: classification, NER, QA, sentiment
  • Used in: Google Search, many Indian NLP tools
  • Not designed to generate new text

Decoder-only: GPT (Auto-regressive)
  • Reads left → right only (causal masking)
  • Pre-trained by predicting the next token
  • Great for: text generation, code, chatbots
  • Used in: ChatGPT, GitHub Copilot
  • Can generate fluent, long-form text

The full transformer (encoder + decoder) is used for sequence-to-sequence tasks like translation and summarisation. BERT uses only the encoder; GPT uses only the decoder. For translation, you'd use a model like T5 or mBART (encoder-decoder).
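
The difference between bidirectional (BERT-style) and causal (GPT-style) attention comes down to a mask applied to the score matrix before the softmax. The NumPy sketch below illustrates this with made-up random scores; the function name `attention_weights` is an assumption for this example.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over each row of a (seq_len, seq_len) score matrix.

    With causal=True, token i may only attend to tokens 0..i (GPT-style).
    """
    scores = scores.astype(float).copy()
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal, i.e. for every "future" position.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores[future] = -np.inf      # exp(-inf) = 0, so future tokens get zero weight
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))      # made-up scores for 4 tokens

bidirectional = attention_weights(scores)          # every token sees all tokens
causal = attention_weights(scores, causal=True)    # no peeking at later tokens
print(np.round(causal, 2))
```

In the causal result, everything above the diagonal is zero, so the first token can only attend to itself, while in the bidirectional result every entry is positive.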

🧪 Check Your Understanding: Lesson 4 Quiz

1. The main problem with RNNs that transformers solve is:
a) RNNs can only process images, not text
b) RNNs struggle with long-range dependencies due to vanishing gradients and cannot be parallelised, making them slow on long sequences
c) RNNs require too much labelled data
d) RNNs produce bounding boxes instead of text
2. In self-attention, the "Query" vector represents:
a) The database of all training examples
b) The current token asking "which other tokens should I attend to?"
c) The output label for classification
d) The gradient update direction
3. Why is the attention score divided by √d_k before softmax?
a) To convert attention scores from degrees to radians
b) To ensure all attention scores sum to exactly 1
c) To prevent very large dot products when the embedding dimension is high; large values would push softmax into near-zero gradient regions
d) To speed up matrix multiplication on GPU
4. Multi-head attention uses multiple Q, K, V projections in parallel because:
a) It reduces memory usage by splitting computation
b) Each head can learn to attend to different kinds of relationships simultaneously, e.g., one head for syntax, another for coreference
c) It is required to make the model work on multiple languages
d) Each head handles one sentence at a time
5. Positional encoding is added to transformer input because:
a) Transformers process all tokens in parallel and have no inherent sense of word order; positional encodings inject this information
b) It converts words to numbers for the model
c) It normalises token embeddings to unit length
d) It assigns attention weights based on word frequency
6. BERT is described as "bidirectional" because:
a) It can generate text forwards and backwards
b) It reads text in both Devanagari and Latin scripts
c) Its encoder attention can attend to tokens on both the left and right simultaneously, unlike GPT, which is causal (left-to-right only)
d) It has two separate models that are merged
7. GPT-style models are better suited than BERT for:
a) Sentence classification tasks like spam detection
b) Named Entity Recognition (NER)
c) Text generation tasks (writing stories, answering open-ended questions, code generation), because they predict the next token autoregressively
d) Object detection in images
8. The landmark paper that introduced the transformer architecture is titled:
a) "Deep Residual Learning for Image Recognition"
b) "Generative Adversarial Networks"
c) "Attention Is All You Need"
d) "BERT: Pre-training of Deep Bidirectional Transformers"
โ† Lesson 3: Object Detection Lesson 5: Fine-Tuning LLMs โ†’