Zara loves Kashmiri poetry: ghazals, shayari, folk songs. She wanted to build a tool that could translate Kashmiri poetry to Hindi while preserving the feeling, not just the words. She tried Google Translate but the nuance was always lost.
Her AI teacher explained: "Old sequence models process words one-by-one and forget context from the beginning of a long poem by the time they reach the end. Transformers read the entire poem at once and use 'attention' to link words across any distance, even the first line to the last." Zara was fascinated. "So when it reads 'river', it knows to connect it to 'longing' twenty words later?" "Exactly."
Before transformers, text was processed by Recurrent Neural Networks (RNNs) and LSTMs. They process tokens one at a time, left-to-right, carrying a "hidden state" forward. This design has three main weaknesses:
- Vanishing gradient: Gradients shrink as they flow back through many time steps. The model forgets early context when processing long sequences.
- No parallelisation: Each token must wait for the previous one. You can't use GPU parallelism, so training is slow.
- Long-range dependency failure: In the sentence "The musical group that won the prize after competing for ten years were happy", "group" and "were" are far apart. RNNs often get the agreement wrong.
The 2017 paper "Attention Is All You Need" (Vaswani et al.) introduced the Transformer, which replaced recurrence with attention and addressed all three problems at once.
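To see why recurrence blocks parallelism, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_forward`, the weight names and the sizes are illustrative assumptions, not code from any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Toy vanilla-RNN forward pass: one hidden state carried step by step."""
    h = np.zeros(W_h.shape[0])          # initial hidden state
    states = []
    for x_t in inputs:                  # step t cannot start until step t-1 is done
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 3       # illustrative sizes (assumed)
inputs = rng.normal(size=(seq_len, d_in))
W_x = rng.normal(size=(d_hidden, d_in)) * 0.5
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.5
b = np.zeros(d_hidden)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)                     # (6, 3): one hidden state per token
```

Everything the model knows about the first token has to survive every pass through the loop, which is where long sequences start to hurt.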
Attention lets each token "look at" every other token in the sequence simultaneously and decide how much to weight each one. Imagine reading the sentence: "The apple trees in the valley it grew in were beautiful."
When processing "it", attention asks: which word does "it" refer to? The model looks at every other word in the sentence, and a well-trained model would assign high weight to "valley" and "apple trees", and low weight to "grew" and "were".
Self-attention uses three learned weight matrices to turn each token's embedding into three vectors:
- Query (Q): "What am I looking for?" Represents the current token asking a question.
- Key (K): "What do I have to offer?" Represents each other token's "identity".
- Value (V): "What information do I carry?" The actual content to aggregate if attended to.
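The three roles above can be sketched in a few lines of NumPy. This follows the scaled dot-product formulation from Vaswani et al. (scores divided by the square root of the key size, then a softmax over each row); the random embeddings, weight names and sizes are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention."""
    Q = X @ W_q                          # what each token is looking for
    K = X @ W_k                          # what each token offers
    V = X @ W_v                          # the content each token carries
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 8, 4         # illustrative sizes (assumed)
X = rng.normal(size=(n_tokens, d_model)) # stand-in for token embeddings
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)                     # (6, 6): one row of weights per token
print(output.shape)                      # (6, 4): one blended vector per token
```

In a real model the three matrices are learned during training; here they are random, so the weights carry no linguistic meaning.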
Below is a simplified attention weight heatmap for the sentence "The river in Srinagar flows swiftly":
| | The | river | in | Srinagar | flows | swiftly |
|---|---|---|---|---|---|---|
| The | 0.52 | 0.21 | 0.09 | 0.07 | 0.06 | 0.05 |
| river | 0.18 | 0.44 | 0.11 | 0.17 | 0.06 | 0.04 |
| in | 0.06 | 0.22 | 0.38 | 0.28 | 0.04 | 0.02 |
| Srinagar | 0.05 | 0.29 | 0.14 | 0.41 | 0.07 | 0.04 |
| flows | 0.06 | 0.31 | 0.08 | 0.19 | 0.27 | 0.09 |
| swiftly | 0.04 | 0.17 | 0.05 | 0.09 | 0.38 | 0.27 |
Read each row as: "When processing this word, how much attention is paid to each other word?" Each row sums to 1, and a larger value means higher attention. Notice "flows" attends strongly to "river": it knows what is flowing.
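As a quick way to practise reading the table, the short script below copies the numbers in as plain Python lists, checks that every row sums to 1, and finds the word each token attends to most strongly apart from itself. The variable names are illustrative only.

```python
tokens = ["The", "river", "in", "Srinagar", "flows", "swiftly"]
weights = [
    [0.52, 0.21, 0.09, 0.07, 0.06, 0.05],  # The
    [0.18, 0.44, 0.11, 0.17, 0.06, 0.04],  # river
    [0.06, 0.22, 0.38, 0.28, 0.04, 0.02],  # in
    [0.05, 0.29, 0.14, 0.41, 0.07, 0.04],  # Srinagar
    [0.06, 0.31, 0.08, 0.19, 0.27, 0.09],  # flows
    [0.04, 0.17, 0.05, 0.09, 0.38, 0.27],  # swiftly
]

strongest = {}
for i, (token, row) in enumerate(zip(tokens, weights)):
    # Attention weights in a row form a probability distribution.
    assert abs(sum(row) - 1.0) < 1e-9
    # Index of the largest weight, ignoring the token's attention to itself.
    j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
    strongest[token] = tokens[j]
    print(f"{token:>8} -> {tokens[j]} ({row[j]:.2f})")
```

For this table, "flows" points to "river" and "swiftly" points to "flows", which matches the intuition that a verb looks for its subject and an adverb looks for its verb.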
Multi-head attention: Run attention multiple times in parallel with different Q, K, V weight matrices. Each "head" learns to attend to different kinds of relationships. For example, one head might track subject-verb agreement, while another tracks coreference (pronoun → noun).
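A rough NumPy sketch of the idea: the projected vectors are split into `n_heads` smaller slices, each slice runs its own attention, and the results are concatenated and mixed by an output matrix (called `W_o` here, following the original paper). All names, sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention; d_model must be divisible by n_heads."""
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # one score grid per head
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 8, 2                      # illustrative sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)            # (6, 8)
print(head_weights.shape)   # (2, 6, 6): a separate attention grid for each head
```

Because each head has its own slice of the weights, the two attention grids generally differ, which is what lets heads specialise.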
Positional encoding: Since transformers process all tokens in parallel, there's no inherent sense of order. Positional encodings are added to token embeddings to inject position information. The original Transformer builds them from sine and cosine functions at different frequencies; some later models, including BERT, learn their position embeddings instead.
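The sinusoidal scheme from the original Transformer paper can be sketched as follows: even dimensions use a sine, odd dimensions use a cosine, and the frequency drops as the dimension index grows. The function name and the small sizes are illustrative assumptions, and the sketch assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sine/cosine positional encodings; assumes d_model is even."""
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
print(pe.shape)   # (6, 8)
print(pe[0])      # position 0: every sine is 0 and every cosine is 1

# The encoding is simply added to the token embeddings:
embeddings = np.zeros((6, 8))                          # stand-in embeddings (assumed)
model_input = embeddings + pe
```

Each position gets a distinct pattern, so two identical words at different places in a poem no longer look the same to the model.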
# Visualising Attention with HuggingFace Transformers (Google Colab)
!pip install transformers bertviz -q
from transformers import BertTokenizer, BertModel
from bertviz import head_view
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
sentence = "The river in Srinagar flows swiftly through the valley"
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Tuple with one tensor per layer, each of shape (batch, heads, seq, seq)
attention = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print(f"Layers: {len(attention)}") # 12 layers in BERT-base
print(f"Heads per layer: {attention[0].shape[1]}") # 12 heads
print(f"Sequence length: {attention[0].shape[-1]}") # tokens
# Interactive visualisation: run in Colab to see coloured attention arcs
head_view(attention, tokens)

BERT (encoder-only):
- Reads the entire sentence in both directions (left ↔ right) simultaneously
- Pre-trained with Masked Language Modelling (MLM): predict hidden words
- Great for: classification, NER, QA, sentiment
- Used in: Google Search, many Indian NLP tools
- Not designed to generate new text

GPT (decoder-only):
- Reads left → right only (causal masking)
- Pre-trained by predicting the next token
- Great for: text generation, code, chatbots
- Used in: ChatGPT, GitHub Copilot
- Can generate fluent, long-form text
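The difference between reading in both directions and reading left to right only (causal masking) comes down to which attention scores are allowed to survive the softmax. The NumPy sketch below uses random scores and a hypothetical helper `attention_weights` to show the effect; it is an illustration of the masking idea, not the code either model family actually ships.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal=False):
    """Turn raw scores into attention weights, optionally hiding future tokens."""
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
        scores = np.where(future, -np.inf, scores)          # future positions get zero weight
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))               # stand-in scores for 4 tokens (assumed)

bidirectional = attention_weights(scores)      # every token sees every token
causal = attention_weights(scores, causal=True)  # token i sees only tokens 0..i

print(np.round(bidirectional, 2))
print(np.round(causal, 2))                     # upper triangle is all zeros
```

With the mask on, the first token can only attend to itself, which is why a left-to-right model can be trained to predict the next token without peeking at the answer.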