Kunal, 16, from Bengaluru, kept hearing about "transformers" and "attention." He found the famous paper, "Attention Is All You Need" (Vaswani et al., 2017), opened it, and immediately felt overwhelmed: 15 pages of dense math, acronyms he didn't know, and figures with no obvious entry point.
His computer science teacher said: "Every expert was once confused by their first research paper. The skill is not intuition; it's a reading strategy." She taught him the 3-pass method used by PhD students worldwide.
Three days later, Kunal had a working implementation of scaled dot-product attention in NumPy: 30 lines of code that matched the equations in the paper exactly. He presented it at his school's AI club to standing applause.
arXiv (pronounced "archive") is the free preprint server where most AI researchers publish papers immediately, often before formal peer review. It hosts over 2 million papers in physics, mathematics, CS, and AI.
- URL pattern: arxiv.org/abs/YYMM.NNNNN (e.g., arxiv.org/abs/1706.03762 is "Attention Is All You Need")
- Search: arxiv.org/search/ (filter by category: cs.LG for machine learning, cs.CV for vision, cs.CL for NLP, cs.AI for general AI)
- PDF link: Change /abs/ to /pdf/ in the URL for direct PDF download
- Semantic Scholar: semanticscholar.org (finds related papers and citation counts automatically)
- Papers With Code: paperswithcode.com (links papers to public code implementations, where one exists)
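The URL conventions in the list above can be wrapped in a few small helpers. This is a minimal sketch: the function names (`extract_arxiv_id`, `arxiv_abs_url`, `arxiv_pdf_url`) are hypothetical, and the regular expression assumes the modern identifier format (four digits, a dot, four or five digits, optional version suffix), not the older pre-2007 style.

```python
import re

# Matches modern arXiv identifiers such as 1706.03762 or 1706.03762v5.
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")

def extract_arxiv_id(text: str) -> str:
    """Return the first arXiv identifier found in a URL or string (version suffix dropped)."""
    match = _ARXIV_ID.search(text)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {text!r}")
    return match.group(1)

def arxiv_abs_url(arxiv_id: str) -> str:
    """Abstract page for a paper."""
    return f"https://arxiv.org/abs/{arxiv_id}"

def arxiv_pdf_url(arxiv_id: str) -> str:
    """Direct PDF link: same identifier, /pdf/ instead of /abs/."""
    return f"https://arxiv.org/pdf/{arxiv_id}"

paper_id = extract_arxiv_id("arxiv.org/abs/1706.03762")
print(arxiv_abs_url(paper_id))
print(arxiv_pdf_url(paper_id))
```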
5-10 minutes: Survey
Read title, abstract, introduction. Skim section headings and figures. Read conclusion. Goal: understand what problem they solve and what the main result is.
60 minutes: Read
Read all sections carefully. Write notes for every figure and equation. Skip proofs. Mark terms you don't understand but keep moving. Goal: understand the method at a high level.
3-5 hours: Reproduce
Implement the core equations in code. Look up every unknown term. Verify your code produces the same shapes/values as the paper claims. Goal: deep mastery.
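One way to approach the verification step in the third pass is to write an equation twice, once as a slow, literal loop and once in vectorized form, and check that the two agree. The sketch below does this for Equation 1 of the Vaswani et al. paper, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The function names and the random test sizes are illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

def attention_vectorized(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V using matrix operations."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def attention_loops(Q, K, V):
    """Same formula written element by element: slow, but easy to check by eye."""
    n_q, d_k = Q.shape
    n_k, d_v = V.shape
    out = np.zeros((n_q, d_v))
    for i in range(n_q):
        # Scaled dot product of query i with every key
        scores = np.array([Q[i] @ K[j] / np.sqrt(d_k) for j in range(n_k)])
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        # Weighted sum of the value vectors
        for j in range(n_k):
            out[i] += weights[j] * V[j]
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))

fast = attention_vectorized(Q, K, V)
slow = attention_loops(Q, K, V)
print(fast.shape)
print(np.allclose(fast, slow))
```

If the two versions disagree, the loop version is usually the easier one to compare line by line against the paper.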
From "Attention Is All You Need" (Vaswani et al., 2017), Equation 1:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
import numpy as np
from typing import Optional

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(
    Q: np.ndarray,  # (seq_len, d_k)
    K: np.ndarray,  # (seq_len, d_k)
    V: np.ndarray,  # (seq_len, d_v)
    mask: Optional[np.ndarray] = None,  # (seq_len, seq_len), 1 = blocked, 0 = visible
) -> tuple[np.ndarray, np.ndarray]:
    """
    Scaled dot-product attention.
    Returns (output, attention_weights).
    """
    d_k = Q.shape[-1]
    # Step 1: Compute similarity scores (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)  # QKᵀ / √d_k
    # Step 2: Apply mask (optional, for causal/decoder attention)
    if mask is not None:
        scores = scores + mask * -1e9  # push masked positions toward -∞
    # Step 3: Softmax → attention weights
    weights = softmax(scores)  # (seq_len, seq_len)
    # Step 4: Weighted sum of values
    output = weights @ V  # (seq_len, d_v)
    return output, weights

# --- Test it ---
seq_len = 4  # 4 tokens
d_k = 8      # key/query dimension
d_v = 8      # value dimension
np.random.seed(42)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)
output, attn = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")  # (4, 8)
print(f"Attention weights:\n{attn.round(3)}")
# Each row sums to 1.0; verify:
print(f"Row sums: {attn.sum(axis=1)}")  # [1. 1. 1. 1.]
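The `mask` argument above is not exercised by the test. As a sketch of how it might be used for causal (decoder-style) attention, the snippet below builds an upper-triangular mask with `np.triu`, assuming the same convention as the function above: 1 marks a position to block, 0 a position to keep. The attention function is restated in compact form only so the snippet runs on its own.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Compact restatement of scaled dot-product attention (mask: 1 = blocked)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

seq_len, d_k = 4, 8
rng = np.random.default_rng(42)
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

# Entries above the diagonal are 1: token i may not attend to tokens j > i.
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

_, causal_weights = attention(Q, K, V, mask=causal_mask)
print(causal_weights.round(3))
```

The printed matrix should be lower-triangular, and the first row should put all of its weight on the first token, since that is the only position it is allowed to see.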

# --- Multi-head attention ---
class MultiHeadAttention:
    """
    Implements multi-head attention from Vaswani et al. 2017.
    Paper uses: h=8 heads, d_model=512, d_k=d_v=64
    """
    def __init__(self, d_model: int, h: int):
        self.h = h
        self.d_k = d_model // h
        self.d_v = d_model // h
        # Parameter matrices, randomly initialised
        self.W_Q = np.random.randn(d_model, d_model) * 0.01
        self.W_K = np.random.randn(d_model, d_model) * 0.01
        self.W_V = np.random.randn(d_model, d_model) * 0.01
        self.W_O = np.random.randn(d_model, d_model) * 0.01

    def forward(self, X: np.ndarray) -> np.ndarray:
        """X shape: (seq_len, d_model)"""
        seq_len, d_model = X.shape
        Q = X @ self.W_Q  # (seq_len, d_model)
        K = X @ self.W_K
        V = X @ self.W_V

        # Split into h heads
        def split_heads(M):
            M = M.reshape(seq_len, self.h, self.d_k)
            return M.transpose(1, 0, 2)  # (h, seq_len, d_k)

        Q_h = split_heads(Q)  # (h, seq_len, d_k)
        K_h = split_heads(K)
        V_h = split_heads(V)
        # Attention for each head
        head_outputs = []
        for i in range(self.h):
            out, _ = scaled_dot_product_attention(Q_h[i], K_h[i], V_h[i])
            head_outputs.append(out)  # each (seq_len, d_k)
        # Concatenate heads
        concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
        # Final projection
        return concat @ self.W_O  # (seq_len, d_model)

# Test multi-head attention
d_model = 32
h = 4
mha = MultiHeadAttention(d_model=d_model, h=h)
X = np.random.randn(6, d_model)  # 6 tokens, d_model=32
out = mha.forward(X)
print(f"Multi-head attention output: {out.shape}")  # (6, 32)
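A quick way to gain confidence in the head-splitting logic is to check that splitting and re-merging is lossless. The sketch below mirrors the reshape/transpose in `split_heads` and uses the dimensions quoted from the paper (d_model = 512, h = 8, so d_k = 64). `merge_heads` is a hypothetical helper, equivalent to the loop-and-concatenate step above.

```python
import numpy as np

def split_heads(M: np.ndarray, h: int) -> np.ndarray:
    """(seq_len, d_model) -> (h, seq_len, d_model // h)"""
    seq_len, d_model = M.shape
    return M.reshape(seq_len, h, d_model // h).transpose(1, 0, 2)

def merge_heads(M_h: np.ndarray) -> np.ndarray:
    """(h, seq_len, d_k) -> (seq_len, h * d_k); inverse of split_heads."""
    h, seq_len, d_k = M_h.shape
    return M_h.transpose(1, 0, 2).reshape(seq_len, h * d_k)

d_model, h = 512, 8
X = np.random.default_rng(0).standard_normal((6, d_model))
X_h = split_heads(X, h)
print(X_h.shape)                            # (8, 6, 64)
print(np.array_equal(merge_heads(X_h), X))  # True
```

Each head sees a contiguous 64-column slice of the projected matrix, so head 0 corresponds to columns 0 to 63, head 1 to columns 64 to 127, and so on.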
# Get the BibTeX citation for any arXiv paper:
# 1. Go to the paper's abstract page e.g. arxiv.org/abs/1706.03762
# 2. Click "Export BibTeX citation" on the right panel
# 3. Copy the BibTeX into your references.bib file
# For "Attention Is All You Need":
@misc{vaswani2017attention,
  title         = {Attention Is All You Need},
  author        = {Ashish Vaswani and Noam Shazeer and Niki Parmar and
                   Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and
                   Łukasz Kaiser and Illia Polosukhin},
  year          = {2017},
  eprint        = {1706.03762},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
# In your paper (LaTeX), cite with: \cite{vaswani2017attention}
# Looking up citation counts programmatically with the Semantic Scholar API:
import requests

def get_paper_info(arxiv_id: str) -> dict:
    """Fetch title, authors, year and citation count for an arXiv paper."""
    url = f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"
    params = {"fields": "title,authors,year,citationCount"}
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()  # raise on HTTP errors instead of returning bad data
    return resp.json()

info = get_paper_info("1706.03762")
print(f"{info['title']} - {info['citationCount']:,} citations")
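Public web APIs can fail transiently or throttle repeated requests, so a lookup like the one above may occasionally raise an error. One common pattern is to retry with exponential backoff. The helper below is a hypothetical sketch, not part of `requests` or the Semantic Scholar API. It accepts any zero-argument callable, and for brevity it retries on every exception, where real code would more likely catch only `requests.RequestException`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # assumption: every error is treated as retryable
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")

# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"title": "Attention Is All You Need"}

delays = []
result = with_retries(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
print(result["title"], delays)
```

With the function defined earlier, a call might look like `with_retries(lambda: get_paper_info("1706.03762"))`.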