Lesson 9 — Reading and Implementing AI Research Papers | Class 11

Story

Kunal's Quest to Understand Attention

Kunal, 16, from Bengaluru, kept hearing about "transformers" and "attention." He found the famous paper "Attention Is All You Need" (Vaswani et al., 2017), opened it, and immediately felt overwhelmed: 15 pages of dense math, acronyms he didn't know, and figures with no obvious entry point.

His computer science teacher said: "Every expert was once confused by their first research paper. The skill is not intuition — it's a reading strategy." She taught him the 3-pass method used by PhD students worldwide.

Three days later, Kunal had a working implementation of scaled dot-product attention in NumPy — 30 lines of code that matched the equations in the paper exactly. He presented it at his school's AI club to standing applause.

Section 1

Finding Papers on arXiv

arXiv (pronounced "archive") is the free preprint server where many AI researchers post their papers, often before formal peer review. It hosts over 2 million papers across physics, mathematics, computer science (including AI), and related fields.

URL pattern: arxiv.org/abs/YYMM.NNNNN — e.g., arxiv.org/abs/1706.03762 is "Attention Is All You Need"
Search: arxiv.org/search/ — filter by category cs.LG (machine learning), cs.CV (vision), cs.CL (NLP), cs.AI
PDF link: Change /abs/ to /pdf/ in the URL for direct PDF download
Semantic Scholar: semanticscholar.org — finds related papers and citation counts automatically
Papers With Code: paperswithcode.com, which links papers to public code implementations where they exist
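
The URL pattern above can be sketched as a small helper. This is only an illustration: the function name `arxiv_urls` and the regular expression are assumptions made for this example, and the pattern only covers new-style identifiers of the form YYMM.NNNNN with an optional version suffix.

```python
import re

# Assumed pattern for new-style arXiv ids such as 1706.03762 or 1706.03762v5.
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")

def arxiv_urls(arxiv_id: str) -> dict[str, str]:
    """Build the abstract-page and PDF URLs for a new-style arXiv id."""
    if not ARXIV_ID.fullmatch(arxiv_id):
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",   # abstract page
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",   # same id, /abs/ swapped for /pdf/
    }

urls = arxiv_urls("1706.03762")
print(urls["abs"])
print(urls["pdf"])
```

Validating the id first means a typo fails loudly instead of producing a URL that silently leads nowhere.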

Start with highly cited papers. A paper with 10,000+ citations has been read, reimplemented, and built on by many researchers, so its core ideas have been tested far more thoroughly than those of a typical new preprint. Starting with influential papers (BERT, GPT-2, ResNet, Attention Is All You Need) builds a foundation for understanding newer work.

Section 2

The 3-Pass Reading Method

Pass 1

5–10 minutes: Survey

Read title, abstract, introduction. Skim section headings and figures. Read conclusion. Goal: understand what problem they solve and what the main result is.

Pass 2

60 minutes: Read

Read all sections carefully. Write notes for every figure and equation. Skip proofs. Mark terms you don't understand but keep moving. Goal: understand the method at a high level.

Pass 3

3–5 hours: Reproduce

Implement the core equations in code. Look up every unknown term. Verify your code produces the same shapes/values as the paper claims. Goal: deep mastery.
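
One way to carry out the "verify" step in Pass 3 is to check a new function against a case small enough to compute by hand, then check shapes and invariants on random input. The sketch below does this for softmax; the hand-computed input and the random test sizes are choices made for this example, not something a paper prescribes.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Check 1: a hand-computed case.
# exp(0) = 1 and exp(ln 3) = 3, so the result must be [1/4, 3/4].
x = np.array([0.0, np.log(3.0)])
np.testing.assert_allclose(softmax(x), [0.25, 0.75])

# Check 2: shape and invariants on random input.
rng = np.random.default_rng(0)
batch = rng.standard_normal((5, 7))
probs = softmax(batch)
assert probs.shape == batch.shape                            # shape is preserved
np.testing.assert_allclose(probs.sum(axis=-1), np.ones(5))   # each row sums to 1
print("all checks passed")
```

If either check fails, the mismatch usually points at a wrong axis or a missing `keepdims`, which is much easier to find here than inside a full model.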

Abstract

A one-paragraph summary of the problem, method, and result. Read this FIRST: it tells you whether the paper is relevant to your needs.

Introduction

Motivation and context. Written for a general ML audience. Usually accessible without specialist knowledge.

Related Work

How this paper differs from prior work. Read this LAST (after you understand the method) — it makes more sense in context.

Method

The actual contribution. Often the hardest section. Every equation here corresponds to code you can write.

Experiments

Benchmark results, ablation studies ("what if we remove component X?"). Tells you how much each design choice matters.

Conclusion

Summary and limitations. Authors are often more honest here than in the abstract. Look for "limitation" mentions.

Section 3

Implementing Scaled Dot-Product Attention

From "Attention Is All You Need" (Vaswani et al., 2017), equation 1:

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(
    Q: np.ndarray,   # (seq_len, d_k)
    K: np.ndarray,   # (seq_len, d_k)
    V: np.ndarray,   # (seq_len, d_v)
    mask: np.ndarray = None   # (seq_len, seq_len); 1 = blocked, 0 = allowed
) -> tuple[np.ndarray, np.ndarray]:
    """
    Scaled dot-product attention.
    Returns (output, attention_weights).
    """
    d_k = Q.shape[-1]

    # Step 1: Compute similarity scores  (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)   # QKᵀ / √d_k

    # Step 2: Apply mask (optional; used for causal/decoder attention)
    # Convention: mask is 1 where attention is blocked and 0 where it is allowed.
    if mask is not None:
        scores = scores + mask * -1e9   # large negative score ≈ -∞ after softmax

    # Step 3: Softmax → attention weights
    weights = softmax(scores)           # (seq_len, seq_len)

    # Step 4: Weighted sum of values
    output = weights @ V               # (seq_len, d_v)

    return output, weights

# ── Test it ──────────────────────────────────────────────────────
seq_len = 4    # 4 tokens
d_k = 8        # key/query dimension
d_v = 8        # value dimension

np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

output, attn = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")    # (4, 8)
print(f"Attention weights:\n{attn.round(3)}")
# Each row sums to 1.0 — verify:
print(f"Row sums: {attn.sum(axis=1)}")    # [1. 1. 1. 1.]

# ── Multi-head attention ──────────────────────────────────────────
class MultiHeadAttention:
    """
    Implements multi-head attention from Vaswani et al. 2017.
    Paper uses: h=8 heads, d_model=512, d_k=d_v=64
    """
    def __init__(self, d_model: int, h: int):
        self.h = h
        self.d_k = d_model // h
        self.d_v = d_model // h

        # Parameter matrices — randomly initialised
        self.W_Q = np.random.randn(d_model, d_model) * 0.01
        self.W_K = np.random.randn(d_model, d_model) * 0.01
        self.W_V = np.random.randn(d_model, d_model) * 0.01
        self.W_O = np.random.randn(d_model, d_model) * 0.01

    def forward(self, X: np.ndarray) -> np.ndarray:
        """X shape: (seq_len, d_model)"""
        seq_len, d_model = X.shape

        Q = X @ self.W_Q   # (seq_len, d_model)
        K = X @ self.W_K
        V = X @ self.W_V

        # Split into h heads
        def split_heads(M):
            M = M.reshape(seq_len, self.h, self.d_k)
            return M.transpose(1, 0, 2)   # (h, seq_len, d_k)

        Q_h = split_heads(Q)   # (h, seq_len, d_k)
        K_h = split_heads(K)
        V_h = split_heads(V)

        # Attention for each head
        head_outputs = []
        for i in range(self.h):
            out, _ = scaled_dot_product_attention(Q_h[i], K_h[i], V_h[i])
            head_outputs.append(out)   # each (seq_len, d_k)

        # Concatenate heads
        concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)

        # Final projection
        return concat @ self.W_O   # (seq_len, d_model)

# Test multi-head attention
d_model = 32
h = 4
mha = MultiHeadAttention(d_model=d_model, h=h)
X = np.random.randn(6, d_model)   # 6 tokens, d_model=32
out = mha.forward(X)
print(f"Multi-head attention output: {out.shape}")   # (6, 32)

Each equation = runnable code. When you see Attention(Q,K,V) = softmax(QKᵀ/√d_k)V in the paper, that is exactly what Steps 1–4 inside scaled_dot_product_attention implement. Read the paper with a Colab notebook open and implement each equation as you encounter it. Checking shapes and values as you go makes the math much easier to follow.
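
The test code above never passes a mask, so here is a minimal sketch of causal masking. It assumes the same convention as the function above (mask entries of 1 are blocked, 0 are allowed) and repeats a compact copy of the attention function under the shorter name `attention` so the snippet runs on its own.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    # Compact copy of the scaled dot-product attention computation.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9      # 1 = blocked, 0 = allowed
    weights = softmax(scores)
    return weights @ V, weights

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))

# Causal mask: 1 strictly above the diagonal (future tokens), 0 elsewhere.
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

out, w = attention(Q, K, V, mask=causal_mask)
print(w.round(3))   # lower-triangular: token i only attends to tokens j <= i
```

The first token can only attend to itself, so the first row of the weights is 1 followed by zeros and the first output row equals the first row of V. That makes a convenient extra check.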

Section 4

Citing Papers with BibTeX

# Get the BibTeX citation for any arXiv paper:
# 1. Go to the paper's abstract page e.g. arxiv.org/abs/1706.03762
# 2. Click "Export BibTeX citation" on the right panel
# 3. Copy the BibTeX into your references.bib file

# For "Attention Is All You Need":
@misc{vaswani2017attention,
  title   = {Attention Is All You Need},
  author  = {Ashish Vaswani and Noam Shazeer and Niki Parmar and
             Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and
             Łukasz Kaiser and Illia Polosukhin},
  year    = {2017},
  eprint  = {1706.03762},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

# In your paper (LaTeX), cite with: \cite{vaswani2017attention}

# Looking up citation counts programmatically with the Semantic Scholar API:
import requests

def get_paper_info(arxiv_id: str) -> dict:
    url = f"https://api.semanticscholar.org/graph/v1/paper/ARXIV:{arxiv_id}"
    params = {"fields": "title,authors,year,citationCount"}
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

info = get_paper_info("1706.03762")
print(f"{info['title']} — {info['citationCount']:,} citations")
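
As a follow-up, the record returned by the lookup can be turned into a BibTeX entry like the one shown earlier. This is a rough sketch: the helper `to_bibtex` and its citation-key scheme are hypothetical, and `sample_info` is a hand-written dictionary in the assumed response shape (a `title`, a `year`, and `authors` as a list of objects with a `name` field) so the example runs without a network call.

```python
def to_bibtex(info: dict, arxiv_id: str) -> str:
    """Format a paper record as an arXiv @misc BibTeX entry."""
    surname = info["authors"][0]["name"].split()[-1].lower()
    first_word = info["title"].split()[0].lower()
    key = f"{surname}{info['year']}{first_word}"   # hypothetical key scheme
    authors = " and ".join(a["name"] for a in info["authors"])
    lines = [
        f"@misc{{{key},",
        f"  title = {{{info['title']}}},",
        f"  author = {{{authors}}},",
        f"  year = {{{info['year']}}},",
        f"  eprint = {{{arxiv_id}}},",
        "  archivePrefix = {arXiv}",
        "}",
    ]
    return "\n".join(lines)

# Hand-written stand-in for an API response (not fetched from the network).
sample_info = {
    "title": "Attention Is All You Need",
    "year": 2017,
    "authors": [
        {"name": "Ashish Vaswani"}, {"name": "Noam Shazeer"},
        {"name": "Niki Parmar"}, {"name": "Jakob Uszkoreit"},
        {"name": "Llion Jones"}, {"name": "Aidan N. Gomez"},
        {"name": "Łukasz Kaiser"}, {"name": "Illia Polosukhin"},
    ],
}

entry = to_bibtex(sample_info, "1706.03762")
print(entry)
```

For real citations, the entry exported from the arXiv abstract page remains the safer source. A generated entry like this is mainly useful for building a reading list quickly.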

📄 Lesson 9 Quiz — Reading and Implementing AI Papers

1. In the 3-pass reading method, what is the primary goal of the FIRST pass through a paper?

a) Memorise all equations and their derivations

b) Survey the paper in 5–10 minutes to understand what problem it solves and what the main result is — title, abstract, intro, section headings, figures, conclusion. This lets you decide whether the paper is worth the 3–5 hours needed for full implementation. Most papers fail this filter and can be safely set aside.

c) Critique the experimental methodology and identify statistical flaws

d) Implement the core equations immediately while the reading is fresh

2. The scaling factor √d_k in Attention(Q,K,V) = softmax(QKᵀ/√d_k)V prevents:

a) The attention weights from summing to more than 1.0

b) The dot products from growing too large in magnitude when d_k is large. Without scaling, QKᵀ values can become very large, pushing the softmax into a region of near-zero gradients where all the weight collapses onto one token (argmax behaviour). If the query and key components have roughly zero mean and unit variance, dividing by √d_k keeps the variance of the scores near 1 regardless of d_k, preserving gradient flow during training.

c) Matrix multiplication from being computationally expensive on GPUs

d) The output from having a different shape than the input sequence
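
The scaling argument in this question can be checked numerically. The sketch below assumes, as the paper's own justification does, that query and key components are independent with mean 0 and variance 1. The sample size and the values of d_k are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5_000            # number of random (q, k) pairs per setting

results = {}
for d_k in (4, 64, 512):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    dots = np.sum(q * k, axis=-1)                 # one dot product per pair
    raw_var = dots.var()                          # grows roughly like d_k
    scaled_var = (dots / np.sqrt(d_k)).var()      # stays roughly at 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  var(q.k)={raw_var:8.2f}  var(q.k/sqrt(d_k))={scaled_var:5.2f}")
```

The unscaled variance should track d_k, while the scaled variance should stay close to 1 for every setting.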

3. Multi-head attention with h=8 heads and d_model=512 uses d_k=64 per head. The key benefit of multiple heads over one large attention layer is:

a) Eight heads reduce total computation by 8x compared to single-head attention

b) Each head can attend to different types of relationships simultaneously — one head might capture syntactic structure (subject-verb), another might capture coreference (pronoun-noun), another semantic similarity. A single head with the full d_model would be forced to combine all these signals in one representation, losing specificity.

c) Multiple heads allow the model to process different sentence lengths at the same time

d) Each head is trained on a different subset of the training data for ensemble diversity
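
To see what "different heads" means mechanically, the sketch below splits random Q and K matrices into heads and computes one attention-weight matrix per head. The sizes match the lesson's test code, and the inputs are random rather than learned projections, so the weights themselves carry no meaning. Only the shapes and the slicing are the point.

```python
import numpy as np

seq_len, d_model, h = 6, 32, 4
d_k = d_model // h
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))

def split_heads(M):
    # (seq_len, d_model) -> (h, seq_len, d_k): head i gets columns i*d_k .. (i+1)*d_k - 1
    return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

Q_h, K_h = split_heads(Q), split_heads(K)

# Batched scores: one (seq_len, seq_len) matrix per head, no Python loop needed.
scores = Q_h @ K_h.transpose(0, 2, 1) / np.sqrt(d_k)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.shape)    # (h, seq_len, seq_len)

# Merging heads reverses the split exactly.
merged = Q_h.transpose(1, 0, 2).reshape(seq_len, d_model)
```

Because each head sees a different d_k-wide slice, each produces its own attention pattern over the same tokens.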

4. In the numerically stable softmax implementation, why do we subtract x.max(axis=-1, keepdims=True) before taking exp()?

a) To normalise the inputs to zero mean before the softmax

b) Without the max subtraction, large attention scores cause exp(x) to overflow to infinity (np.inf): this happens above roughly 88 in float32 and roughly 709 in float64. Since softmax is shift-invariant (softmax(x) = softmax(x - c) for any constant c), we can subtract the maximum value to guarantee exp(x - max) ∈ (0, 1], preventing overflow while producing mathematically identical results.

c) The subtraction ensures all softmax outputs are positive values

d) This is a normalisation step required by the paper's equations
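
The overflow described in this question is easy to reproduce. The sketch below compares a naive softmax with the stable version on deliberately large scores. The function names and the input values are chosen for this example only.

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)                                   # overflows for large x
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # largest exponent is 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow warnings so the output is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)       # exp(1000) = inf, and inf / inf = nan

stable = stable_softmax(scores)
shifted = stable_softmax(scores - 1000.0)   # same scores shifted by a constant

print("naive: ", naive)
print("stable:", stable.round(4))
```

The stable result matches the softmax of the shifted scores [0, 1, 2], which is the shift-invariance property in action.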

5. The mask in scaled_dot_product_attention fills masked positions with -1e9 (large negative value) before softmax. The purpose is:

a) To prevent the model from learning from those positions by zeroing their gradients

b) Adding -1e9 makes exp(-1e9) ≈ 0, so masked positions get effectively zero attention weight after softmax. This implements causal (autoregressive) masking in decoder attention — token i cannot attend to tokens j>i during training. The model must predict each token without "seeing the future."

c) The large negative value acts as a regulariser to encourage sparser attention

d) It prevents division by zero in the subsequent matrix multiplication with V

6. Papers With Code (paperswithcode.com) is particularly useful because it:

a) Only shows papers that have been peer-reviewed and accepted at top venues

b) Links papers to their code implementations (official ones where available) and tracks benchmark leaderboards. When you want to understand and extend a paper, having a reference implementation lets you check your own implementation against it, run the published experiments on your own data, and see how the authors handled edge cases the paper doesn't describe.

c) Automatically translates papers into Python pseudocode using AI

d) Provides a summary of each paper written by the original authors for beginners

7. The Related Work section of a paper is recommended for reading LAST (after understanding the method). The reason is:

a) Related work contains the least important information in a research paper

b) Related work compares the paper's contribution to prior approaches — but without understanding the method first, the comparisons are abstract and meaningless. After you understand the method deeply, the related work section becomes highly informative: you can precisely understand in what way this paper improves on each prior approach rather than just reading generic comparison claims.

c) Authors often include incorrect citations in the related work section

d) Related work is written in a more technical style that requires reading the method first to understand the vocabulary

8. Ablation studies in the Experiments section (e.g., "removing positional encodings drops BLEU by 6 points") are important because they:

a) Show that the paper's authors tested all possible model configurations exhaustively

b) Isolate the contribution of each design choice by removing it and measuring the performance drop. This tells you which components are essential to the model's success. If you are implementing the model for a new task, ablation results guide which components to include and which can be safely removed without major performance loss.

c) Provide confidence intervals that prove the results are statistically significant

d) Show that the model works across many different types of data beyond the original benchmark
