Lesson 1 — Advanced Python for AI | Class 11

Story

Rohan's Slow Training Loop

Rohan, 16, from Pune had just finished Class 10. His plant disease classifier took 45 minutes to train on Google Colab's free tier. He asked his older cousin Priya, who works at a Pune AI startup, why it was so slow.

Priya looked at his code and laughed — not unkindly. "Rohan, you're using Python loops inside your data pipeline. NumPy vectorisation would make this 200x faster. And your model function re-loads the tokeniser every call — that's why your inference is slow too."

Rohan spent one week learning advanced Python for AI. His training loop went from 45 minutes to 4 minutes. His inference API went from 800ms per request to 12ms. The model didn't change. The Python did.

The lesson: In AI, the bottleneck is rarely the algorithm. It's almost always the code around the algorithm. Fast Python is a superpower.

Section 1

NumPy Vectorisation: Stop Writing Python Loops

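Python loops are slow because every iteration goes through the interpreter: each element is a separate Python object, and each operation pays for a type check and dynamic dispatch. NumPy operations run in compiled C over a contiguous block of memory, so a single call processes the entire array. The difference is dramatic: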

❌ SLOW — Python Loop

# Process 1 million pixels
result = []
for pixel in image_data:
    result.append(pixel / 255.0)
# Time: ~2.1 seconds

✅ FAST — NumPy Vectorised

import numpy as np
result = image_data / 255.0
# Time: ~0.003 seconds
# 700x faster!

The rule: Never loop over NumPy arrays element-by-element. Use broadcasting, universal functions (ufuncs), and vectorised operations instead.

import numpy as np
import time

# Generate 1 million random values
data = np.random.randn(1_000_000).astype(np.float32)

# SLOW: Python loop
start = time.time()
result_slow = [x**2 + 2*x + 1 for x in data]
print(f"Loop: {time.time()-start:.3f}s")

# FAST: NumPy vectorised (broadcasted polynomial)
start = time.time()
result_fast = data**2 + 2*data + 1
print(f"NumPy: {time.time()-start:.4f}s")

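# ALTERNATIVE: np.polyval evaluates the same polynomial with Horner's method
# (coefficients listed from highest degree down); more concise, not necessarily faster
# result = np.polyval([1, 2, 1], data)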

# Broadcasting example: normalise a (1000, 224, 224, 3) image batch
images = np.random.randint(0, 256, (1000, 224, 224, 3), dtype=np.uint8)
mean = np.array([0.485, 0.456, 0.406])   # ImageNet mean
std  = np.array([0.229, 0.224, 0.225])   # ImageNet std

# Broadcasting: (1000, 224, 224, 3) with (3,) works; NumPy aligns shapes from the last axis
# Note: the result is float64, roughly 1.2 GB for this batch
normalised = (images / 255.0 - mean) / std
print(f"normalised shape: {normalised.shape}")  # (1000, 224, 224, 3)
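
The broadcasting rule used above can be checked on a small array. This is a minimal sketch with a hypothetical batch of two 4x4 images (the sizes and the names `small_batch`, `per_channel_mean`, and `manual` are illustrative, not from the lesson). NumPy compares shapes from the trailing axis backwards and treats missing leading axes as size 1.

```python
import numpy as np

# Hypothetical tiny batch: 2 images, 4x4 pixels, 3 channels
small_batch = np.ones((2, 4, 4, 3), dtype=np.float32)
per_channel_mean = np.array([0.5, 0.25, 0.125], dtype=np.float32)

# Shapes are aligned from the right: (2, 4, 4, 3) vs (3,) is treated as (1, 1, 1, 3)
print(np.broadcast_shapes(small_batch.shape, per_channel_mean.shape))

centred = small_batch - per_channel_mean

# The same result written with an explicit reshape, for comparison
manual = small_batch - per_channel_mean.reshape(1, 1, 1, 3)
print(np.array_equal(centred, manual))

# A shape that does not match the trailing axis fails loudly
try:
    small_batch - np.zeros(4, dtype=np.float32)
except ValueError as err:
    print("broadcast error:", err)
```

If the per-channel values lived on a different axis (for example channels-first data), the reshape would have to be written out explicitly, because broadcasting only pads on the left.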

Key NumPy operations every AI developer must know:

np.einsum('ij,jk->ik', A, B) — matrix multiply written in Einstein notation (same result as A @ B; the notation also covers batched and higher-rank contractions)
np.where(condition, x, y) — vectorised if-else
np.argsort, np.argmax, np.argmin — indices of sorted/extreme values
np.clip(arr, 0, 1) — clamp values (used in normalisation)
np.concatenate, np.stack, np.vstack — combining arrays without loops
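
The operations listed above can be tried together on a toy example. This is a sketch only: the `scores` array stands in for hypothetical model outputs (3 samples, 4 classes), and all variable names are illustrative.

```python
import numpy as np

# Hypothetical model outputs: 3 samples, 4 classes
scores = np.array([[0.1, 2.0, -1.0, 0.5],
                   [1.5, 0.2,  0.3, 0.0],
                   [-0.5, 0.1, 0.4, 3.0]])

pred_class = np.argmax(scores, axis=1)          # best class per row
ranking = np.argsort(scores, axis=1)[:, ::-1]   # classes ordered best-first
relu = np.where(scores > 0, scores, 0.0)        # vectorised if-else
clipped = np.clip(scores, 0.0, 1.0)             # clamp into [0, 1]

# einsum matrix multiply gives the same result as the @ operator
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
product = np.einsum('ij,jk->ik', A, B)

# stack adds a new axis; concatenate joins along an existing one
stacked = np.stack([scores, scores])            # shape (2, 3, 4)
joined = np.concatenate([scores, scores])       # shape (6, 4)

print(pred_class)
print(np.allclose(product, A @ B))
print(stacked.shape, joined.shape)
```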

Section 2

Python Decorators for AI Code

Decorators let you add functionality to functions without modifying their code. They're used everywhere in AI: caching model outputs, timing functions, validating inputs, logging predictions.

import time
import functools

# ── Decorator 1: Timer ──────────────────────────────────────────
def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timer
def run_inference(model, image_batch):
    return model.predict(image_batch)

# ── Decorator 2: Cache with TTL ─────────────────────────────────
import hashlib, json, time as _time

def cache_result(ttl_seconds=300):
    """Cache function results for ttl_seconds."""
    cache = {}
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.md5(
                json.dumps((args, sorted(kwargs.items())),
                default=str).encode()
            ).hexdigest()
            if key in cache:
                result, ts = cache[key]
                if _time.time() - ts < ttl_seconds:
                    return result
            result = func(*args, **kwargs)
            cache[key] = (result, _time.time())
            return result
        return wrapper
    return decorator

@cache_result(ttl_seconds=600)
def embed_text(text: str) -> list[float]:
    """Embed text — expensive API call, cached for 10 min."""
    # calls Gemini / OpenAI embedding API
    return [0.1, 0.3, ...]  # placeholder

# ── Decorator 3: Input validator ────────────────────────────────
import numpy as np  # module-level import so classify_image can use np too

def validate_image(func):
    @functools.wraps(func)
    def wrapper(image_array, *args, **kwargs):
        if not isinstance(image_array, np.ndarray):
            raise TypeError(f"Expected np.ndarray, got {type(image_array)}")
        if image_array.ndim not in (3, 4):
            raise ValueError(f"Expected (H,W,C) or (N,H,W,C), got {image_array.shape}")
        if image_array.max() > 1.0:
            image_array = image_array / 255.0  # auto-normalise 0-255 pixel values
        return func(image_array, *args, **kwargs)
    return wrapper

@validate_image
def classify_image(image_array, model):
    # Add a batch axis only for a single (H,W,C) image
    if image_array.ndim == 3:
        image_array = image_array[np.newaxis, ...]
    return model.predict(image_array)

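Pattern to remember: @functools.wraps(func) inside your decorator preserves the original function's __name__ and __doc__, and sets __wrapped__ so that inspect.signature still sees the original parameters. Always include it: FastAPI relies on this metadata when it builds its Swagger docs.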
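
The effect of functools.wraps is easy to see by comparing two otherwise identical decorators. This is a small sketch; `without_wraps`, `with_wraps`, `predict_a`, and `predict_b` are hypothetical names used only for the comparison.

```python
import functools

def without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def with_wraps(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@without_wraps
def predict_a(x):
    """Return a prediction for x."""
    return x * 2

@with_wraps
def predict_b(x):
    """Return a prediction for x."""
    return x * 2

print(predict_a.__name__, predict_a.__doc__)  # metadata of the inner wrapper
print(predict_b.__name__, predict_b.__doc__)  # metadata of the original function
```

Both decorated functions return the same values; only the metadata that tools and tracebacks read differs.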

Section 3

Generators for Memory-Efficient Data Pipelines

When your dataset is 50GB of images, you cannot load it all into RAM. Generators produce data one item at a time, keeping memory usage flat regardless of dataset size.

from pathlib import Path
from PIL import Image
import numpy as np

# ── Generator-based image loader ───────────────────────────────
def image_batch_generator(image_dir: str, batch_size: int = 32):
    """Yields (images, labels) batches without loading full dataset."""
    paths = list(Path(image_dir).rglob("*.jpg"))
    # Extract labels from folder names: dataset/roses/img1.jpg → "roses"
    labels = [p.parent.name for p in paths]

    for i in range(0, len(paths), batch_size):
        batch_paths = paths[i:i+batch_size]
        batch_labels = labels[i:i+batch_size]

        images = []
        for p in batch_paths:
            img = Image.open(p).convert("RGB").resize((224, 224))  # force 3 channels
            images.append(np.array(img) / 255.0)

        yield np.array(images, dtype=np.float32), batch_labels

# Usage — memory stays flat even for 100k images:
for images, labels in image_batch_generator("dataset/", batch_size=32):
    predictions = model.predict(images)   # process one batch, discard
    # 'images' is rebound on the next iteration, so the previous batch can be freed

# ── Generator expression (even simpler) ────────────────────────
squares = (x**2 for x in range(1_000_000))  # generator object: a few hundred bytes at most
# vs list comprehension: [x**2 for x in range(1_000_000)] needs ~8 MB for the list alone, plus the int objects

# ── yield from — delegate to sub-generators ────────────────────
def multi_dir_loader(*dirs):
    for d in dirs:
        yield from image_batch_generator(d)

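Keras and PyTorch data loaders follow the same lazy-loading idea. tf.keras.utils.Sequence and map-style torch.utils.data.Dataset are index-based (you implement __len__ and __getitem__) rather than generators, but the loader still pulls one batch at a time. torch.utils.data.IterableDataset is the variant written with __iter__ and yield.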
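
A rough sketch of how an index-based dataset class and a batching generator can fit together, written in plain Python so it runs without TensorFlow or PyTorch. `SquaresDataset` and `iter_batches` are hypothetical names, and this is not how either framework implements its loaders internally.

```python
from typing import Iterator

class SquaresDataset:
    """Hypothetical index-based dataset: item i is computed on demand."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index ** 2

def iter_batches(dataset, batch_size: int) -> Iterator[list]:
    """Yield lists of items; only one batch exists in memory at a time."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

dataset = SquaresDataset(10)
batches = iter_batches(dataset, batch_size=4)
print(next(batches))   # nothing is computed until a batch is requested
print(list(batches))   # the remaining batches
```

The dataset answers "what is item i?", while the generator decides the order and grouping. Keeping those two roles separate is what lets a loader add shuffling or parallel workers without touching the dataset code.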

Section 4

Profiling: Finding the Bottleneck

Before optimising, measure. Premature optimisation wastes time. Profiling tells you exactly which lines are slow.

# ── Method 1: cProfile (function-level) ────────────────────────
import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()

# --- code to profile ---
run_training_loop(model, dataset, epochs=1)
# -----------------------

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # top 10 functions by cumulative time
# Output example:
#   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
#    1000    5.231    0.005    5.231    0.005    data_utils.py:42(load_image)
# → load_image is called 1000 times taking 5s total → optimise it first (cache, batch, or parallelise the loading)

# ── Method 2: line_profiler (line-by-line) ──────────────────────
# pip install line_profiler
# Add @profile decorator, run: kernprof -l -v your_script.py

# ── Method 3: memory_profiler ──────────────────────────────────
# pip install memory_profiler
from memory_profiler import profile

@profile
def prepare_dataset(path):
    images = load_all_images(path)   # line 1: +2.4 GB
    features = extract_features(images)  # line 2: +800 MB
    del images                           # line 3: -2.4 GB  ← important!
    return features

# ── Method 4: timeit for micro-benchmarks ──────────────────────
import timeit
loop_time  = timeit.timeit('[x**2 for x in range(1000)]', number=10000)
# Put the import and array creation in setup so only the squaring is timed
numpy_time = timeit.timeit('arr**2', setup='import numpy as np; arr = np.arange(1000)', number=10000)
print(f"Loop: {loop_time:.3f}s | NumPy: {numpy_time:.3f}s")

Tool	Best for	Output
cProfile	Finding slowest functions	Cumulative time per function
line_profiler	Finding slowest lines inside a function	Time per line
memory_profiler	Tracking RAM growth	MB increment per line
timeit	Comparing two implementations	Seconds for N repetitions
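
The cProfile method can be tried end to end on a toy workload. This is a minimal sketch: `slow_sum_of_squares` and `toy_workload` are hypothetical stand-ins for a real training step, and the report is written to a string buffer so it can be printed or logged.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n: int) -> int:
    """Deliberately slow: pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def toy_workload() -> int:
    # Hypothetical stand-in for a training step
    return sum(slow_sum_of_squares(20_000) for _ in range(20))

profiler = cProfile.Profile()
profiler.enable()
result = toy_workload()
profiler.disable()

# Write the report to a string buffer instead of stdout
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

In the report, the tottime column shows time spent inside a function itself, while cumtime also includes the functions it calls. A function with high cumtime but low tottime is usually just a caller of the real bottleneck.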

Section 5

Type Hints and Dataclasses for AI Config

Type hints make your AI code self-documenting and enable IDE autocomplete. Dataclasses give you structured configuration objects — much better than passing 15 arguments to a function.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingConfig:
    """All hyperparameters for a training run."""
    model_name: str = "MobileNetV2"
    num_classes: int = 10
    image_size: int = 224
    batch_size: int = 32
    epochs: int = 20
    learning_rate: float = 1e-4
    dropout_rate: float = 0.3
    weight_decay: float = 1e-5
    augment: bool = True
    checkpoint_dir: str = "checkpoints/"
    tags: list[str] = field(default_factory=list)
    notes: Optional[str] = None

    def __post_init__(self):
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.batch_size not in [8, 16, 32, 64, 128]:
            raise ValueError("batch_size must be a power of 2 between 8–128")

# Clean function signatures with type hints
def train(
    config: TrainingConfig,
    train_dir: str,
    val_dir: str,
) -> dict[str, float]:
    """Returns {val_accuracy, val_loss, best_epoch}."""
    ...

# Usage — clear, self-documenting:
cfg = TrainingConfig(
    model_name="EfficientNetB0",
    num_classes=38,
    learning_rate=3e-4,
    tags=["plantvillage", "class10-demo"],
    notes="First run with augmentation"
)
results = train(cfg, "data/train", "data/val")
print(results["val_accuracy"])  # IDE knows this is float
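
One practical benefit of a config dataclass is that it converts to a plain dict for logging or saving. The sketch below uses a hypothetical, cut-down `RunConfig` rather than the full TrainingConfig above, and only standard-library calls (`dataclasses.asdict`, `dataclasses.replace`, `json`).

```python
import json
from dataclasses import asdict, dataclass, field, replace

@dataclass
class RunConfig:
    """Hypothetical, cut-down config for illustration."""
    model_name: str = "MobileNetV2"
    learning_rate: float = 1e-4
    batch_size: int = 32
    tags: list[str] = field(default_factory=list)

base = RunConfig(tags=["baseline"])

# asdict gives a plain dict that can be logged or written to disk
as_json = json.dumps(asdict(base), sort_keys=True)
print(as_json)

# Round-trip: rebuild the config from the saved JSON
restored = RunConfig(**json.loads(as_json))
print(restored == base)

# replace() copies the config with selected fields changed
variant = replace(base, learning_rate=3e-4)
print(variant.learning_rate, base.learning_rate)
```

Saving the JSON next to each checkpoint is one simple way to record which hyperparameters produced which model.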

Section 6

Context Managers: Safe Resource Handling

AI code regularly deals with resources that must be cleaned up: GPU memory, file handles, database connections, MLflow runs. Context managers guarantee cleanup even when exceptions occur.

import contextlib
import json  # used by the JSONL example below
import mlflow

# ── Custom context manager using contextlib.contextmanager ──────
@contextlib.contextmanager
def gpu_memory_guard(name: str):
    """Clears GPU cache before and after a block."""
    import gc
    try:
        import torch
        torch.cuda.empty_cache()
    except ImportError:
        pass
    print(f"[GPU] Starting: {name}")
    try:
        yield
    finally:
        gc.collect()  # drop unreachable Python objects first so their GPU tensors are released
        try:
            import torch
            torch.cuda.empty_cache()
        except ImportError:
            pass
        print(f"[GPU] Done: {name}")

# Usage:
with gpu_memory_guard("inference_batch_100"):
    predictions = model(large_batch)
    # Even if model() raises an exception, GPU cache is cleared

# ── MLflow run as context manager ──────────────────────────────
with mlflow.start_run(run_name="experiment_v3"):
    mlflow.log_param("lr", 3e-4)
    history = model.fit(train_ds, validation_data=val_ds, epochs=20)
    mlflow.log_metric("val_accuracy", max(history.history["val_accuracy"]))
    mlflow.keras.log_model(model, "model")
# Run ends automatically — even on exception

# ── File management with context manager ───────────────────────
with open("predictions.jsonl", "w") as f:
    for batch in data_loader:
        preds = model.predict(batch)
        for p in preds:
            f.write(json.dumps(p.tolist()) + "\n")
# File is closed automatically
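
The cleanup guarantee can be verified without a GPU or MLflow. In this sketch, `tracked_resource` is a hypothetical context manager that only records events, so the order of open, work, and close is visible for both a normal exit and an exception.

```python
import contextlib

events = []

@contextlib.contextmanager
def tracked_resource(name: str):
    """Hypothetical resource that records when it is opened and closed."""
    events.append(f"open {name}")
    try:
        yield name
    finally:
        # Runs on normal exit and when the with-block raises
        events.append(f"close {name}")

with tracked_resource("ok"):
    events.append("work ok")

try:
    with tracked_resource("failing"):
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    events.append("error handled by caller")

print(events)
```

The exception still propagates to the caller after the cleanup runs; the context manager does not swallow it.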

⚡ Lesson 1 Quiz — Advanced Python for AI

1. Rohan's Python loop took 2.1 seconds for 1 million pixels. The equivalent NumPy operation took 0.003 seconds. This speedup happens because:

a) NumPy uses multiple CPU cores by default while Python loops are single-threaded

b) NumPy operations execute compiled C code on contiguous memory blocks in one call — the interpreter overhead of Python's per-object dispatch is eliminated for the entire array at once

c) NumPy uses the GPU automatically when it detects large arrays

d) Python loops allocate new memory for each element, while NumPy reuses a single pointer

2. The @functools.wraps(func) line inside a decorator is important because:

a) It prevents the decorator from running more than once on the same function

b) It copies the original function's __name__, __doc__, and other metadata to the wrapper — without it, FastAPI would generate incorrect Swagger docs and debugging tracebacks would show "wrapper" instead of the real function name

c) It makes the decorator thread-safe by acquiring a lock

d) It ensures the return type of the wrapper matches the original function

3. A generator function using `yield` is preferable to a list comprehension for loading a 50GB image dataset because:

a) Generators are faster because they use compiled C internally

b) A generator produces items one at a time on demand — memory usage stays flat (one batch in RAM) regardless of dataset size. A list comprehension materialises all items at once, requiring the full 50GB in RAM simultaneously.

c) Generators automatically handle multi-threading so batches load in parallel

d) List comprehensions do not support file I/O operations

4. You run cProfile on a training loop and see that load_image() is called 50,000 times and accounts for 78% of total runtime. The correct optimisation strategy is:

a) Rewrite the model architecture to use fewer parameters

b) Focus first on speeding up load_image(): profile, then optimise the proven bottleneck. Optimising anything else first could reduce total runtime by at most 22%.

c) Switch from Python 3.11 to PyPy for a general interpreter speedup

d) Reduce the batch size so load_image() is called fewer times

5. A @dataclass with __post_init__ validation is better than passing 15 keyword arguments directly to a training function because:

a) Dataclasses run faster than function keyword arguments

b) The config object is a single, versioned, shareable entity — you can log it to MLflow with one line, validate it in __post_init__, reuse it across functions, and type-check it statically. Function arguments cannot provide these guarantees.

c) Python functions cannot accept more than 10 keyword arguments

d) Dataclasses automatically generate YAML config files for reproducibility

6. When broadcasting `images / 255.0 - mean` where images is (1000, 224, 224, 3) and mean is (3,), NumPy:

a) Raises a shape mismatch error because the arrays have different numbers of dimensions

b) Automatically aligns (3,) to (1, 1, 1, 3) and subtracts element-wise across all 1000 images — no explicit reshaping or loops needed

c) Computes a dot product instead of element-wise subtraction

d) Repeats the subtraction 1000×224×224 times using Python's iteration protocol

7. A context manager created with @contextlib.contextmanager guarantees cleanup code in the `finally` block will run:

a) Only when the with block exits normally without any exceptions

b) Both on normal exit AND when an exception is raised inside the with block — this is the critical safety guarantee that makes context managers the correct pattern for GPU memory, file handles, and MLflow runs

c) Only when explicitly called with context_manager.close()

d) After a garbage collection cycle, not immediately on block exit

8. `np.einsum('ij,jk->ik', A, B)` is equivalent to matrix multiplication A @ B. The main practical advantage of einsum for AI code is:

a) einsum is always faster than @ on all hardware

b) einsum handles arbitrary tensor contractions expressible in one line — you can compute batch outer products, batched matrix multiplies, trace operations, and multi-dimensional dot products without reshaping, making complex attention operations readable

c) einsum automatically distributes computation across multiple GPUs

d) einsum supports complex numbers while @ does not
