Lesson 05 — Diffusion Models & Image Generation | Class 12

Story

Tara's Konkani Heritage Posters

👩‍🎨 Tara · Goa · Age 17

Tara's grandmother teaches Konkani folk-art workshops. She wanted to create posters in the traditional Warli + Goan style, but custom illustrations cost ₹2,000 each. Tara fine-tuned Stable Diffusion on 200 photos of her grandmother's paintings and used ControlNet to keep the composition under her control.

She also learned a hard truth: image generation can be misused. This lesson includes the safety controls she added: content filters, watermarks, and a clear "AI-generated" credit on every poster.

Theory

How Diffusion Models Work

The big idea: instead of generating an image directly (hard), train a model to remove noise from a noisy image (easier). Then start from pure noise and gradually denoise.

Forward process: add tiny amounts of Gaussian noise over T=1000 steps until the image is pure noise. This is just math — no neural network needed.
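The forward process has a closed form, so a noisy version of the image at any step t can be sampled directly from the clean image without looping through the earlier steps. The sketch below is a minimal plain-Python illustration. It assumes a linear beta schedule from 1e-4 to 0.02 (a common choice, not something this lesson specifies), and the names `linear_alpha_bar` and `add_noise` are made up for the example, not a library API.

```python
import math
import random

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bar, running = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps in one shot."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bar = linear_alpha_bar()
pixels = [0.8, -0.2, 0.5, 0.1]  # a tiny stand-in "image", values scaled to [-1, 1]
rng = random.Random(0)
slightly_noisy, _ = add_noise(pixels, 10, alpha_bar, rng)
almost_pure_noise, _ = add_noise(pixels, 999, alpha_bar, rng)

# Fraction of the original signal left early on versus at the final step.
print(alpha_bar[10], alpha_bar[999])
```

Early in the schedule almost all of the signal remains. By the last step the signal term is close to zero, which is why sampling can start from pure Gaussian noise.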

Reverse process: train a neural network (typically a U-Net) to predict the noise that was added at each step. At inference, start from noise and call the network 1000 (or with DDIM, just 50) times to denoise back to a clean image.
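Training reduces to a regression problem: noise a clean image to some step, ask the network which noise was added, and penalise the squared error. The sketch below shows that loss with two stand-in predictors instead of a U-Net. `noise_prediction_loss`, `zero_predictor`, and `oracle_predictor` are hypothetical names for illustration, and the signal level of 0.5 is an arbitrary assumption.

```python
import math
import random

def noise_prediction_loss(predict_noise, x0, t, alpha_bar_t, rng):
    """Mean squared error between the true noise and the predicted noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    predicted = predict_noise(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(predicted, eps)) / len(eps)

x0 = [0.8, -0.2, 0.5, 0.1]
alpha_bar_t = 0.5  # assumed signal level at the chosen step

def zero_predictor(x_t, t):
    """An untrained stand-in that always predicts 'no noise'."""
    return [0.0 for _ in x_t]

def oracle_predictor(x_t, t):
    """Cheats by knowing x0: solves x_t = sqrt(a) * x0 + sqrt(1 - a) * eps for eps."""
    return [(xt - math.sqrt(alpha_bar_t) * x) / math.sqrt(1.0 - alpha_bar_t)
            for xt, x in zip(x_t, x0)]

loss_untrained = noise_prediction_loss(zero_predictor, x0, 500, alpha_bar_t, random.Random(0))
loss_oracle = noise_prediction_loss(oracle_predictor, x0, 500, alpha_bar_t, random.Random(0))
print(loss_untrained, loss_oracle)
```

A real network never sees x0 at inference time, so it has to learn to land between these two extremes from the noisy image and the step index alone.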

DDPM

Original diffusion. 1000 denoising steps. Slow but high quality.

DDIM

Same trained model, but a deterministic sampler that skips steps: about 50 network calls instead of 1000, so roughly 20× faster.
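To make "same model, fewer steps" concrete: the sampler picks a short subset of the 1000 training timesteps, and at each one it uses the predicted noise to estimate the clean image and jump straight to the next chosen timestep. The sketch below shows the deterministic DDIM update (the eta = 0 case). The function names and the evenly spaced timestep selection are assumptions for illustration; real schedulers offer several spacing options.

```python
import math

def ddim_timesteps(train_steps=1000, sample_steps=50):
    """Evenly spaced timesteps in descending order (an assumed spacing rule)."""
    stride = train_steps // sample_steps
    return [i * stride for i in reversed(range(sample_steps))]

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update from signal level a_bar_t to a_bar_prev."""
    x_next = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean value implied by the predicted noise.
        x0_pred = (x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
        # Re-noise that estimate to the (lower) noise level of the next timestep.
        x_next.append(math.sqrt(a_bar_prev) * x0_pred + math.sqrt(1.0 - a_bar_prev) * e)
    return x_next

steps = ddim_timesteps()
print(len(steps), steps[0], steps[-1])

# Sanity check with a perfect noise prediction: jumping to a_bar_prev = 1.0
# (no noise left) recovers the clean values.
x0, eps, a_bar_t = [0.8, -0.2], [1.5, -0.7], 0.3
x_t = [math.sqrt(a_bar_t) * x + math.sqrt(1.0 - a_bar_t) * e for x, e in zip(x0, eps)]
recovered = ddim_step(x_t, eps, a_bar_t, a_bar_prev=1.0)
print(recovered)
```

Because the update adds no fresh randomness, the same starting noise and prompt give the same image, which makes results reproducible from a seed.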

Latent Diffusion (SD)

Diffuse in a compressed latent space produced by a VAE (8× smaller along each side of the image) instead of in pixel space. Stable Diffusion's secret sauce.
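The saving is easy to check with arithmetic. The sketch below assumes the usual Stable Diffusion VAE layout of 8× downsampling per side and 4 latent channels; the channel count is an assumption the lesson does not state, and `tensor_sizes` is an illustrative helper name.

```python
def tensor_sizes(height, width, downsample=8, latent_channels=4, rgb_channels=3):
    """Return (number of pixel values, number of latent values) for one image."""
    pixel_values = height * width * rgb_channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values, latent_values

pixel_values, latent_values = tensor_sizes(1024, 1024)

# How many fewer spatial positions the denoiser has to process.
spatial_ratio = (1024 * 1024) // ((1024 // 8) * (1024 // 8))

print(pixel_values, latent_values, pixel_values / latent_values, spatial_ratio)
```

Under these assumptions a 1024×1024 image has 64× fewer spatial positions in latent space and 48× fewer values overall, which is where most of the compute saving comes from.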

Classifier-Free Guidance

At each step, run the model both with and without the prompt, then push the prediction toward the conditional one. The guidance scale sets how hard to push, which controls prompt strength.
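The combination step itself is one line of arithmetic. A minimal sketch, with made-up numbers standing in for the two noise predictions; `classifier_free_guidance` is an illustrative name, not a library function:

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.30, 0.50]  # prediction with an empty prompt (made-up numbers)
eps_cond = [0.20, -0.10, 0.40]    # prediction with the real prompt (made-up numbers)

unconditional = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
no_guidance = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
typical = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(unconditional, no_guidance, typical)
```

A scale of 1.0 reproduces the conditional prediction unchanged. Larger scales extrapolate past it, which strengthens prompt adherence and also explains why very large values distort the image.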

Code

Generate with Stable Diffusion XL

!pip install -q diffusers transformers accelerate

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "Goan Warli folk art style poster, women dancing in coconut grove, white figures on terracotta background, traditional patterns, festive"
negative_prompt = "blurry, low quality, modern style, 3d render, photograph"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024, width=1024,
).images[0]
image.save("warli_poster.png")

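Guidance scale controls prompt adherence. As a rule of thumb, very low values let the image drift away from the prompt, while very high values (roughly 12–15 and above) produce over-saturated, "deep-fried" images. A value around 7–9 is a reasonable starting point; tune it per model and prompt.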

Code

ControlNet for Composition Control

Pure prompt generation is unpredictable. ControlNet adds a second image (edge map, pose skeleton, depth map) that the model must follow:

import cv2                      # provided by the opencv-python package
import numpy as np
import torch                    # needed for torch.float16 below
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

# Load Tara's pencil sketch and convert to edge map
sketch = np.array(Image.open("grandmother_sketch.jpg").convert("RGB"))
edges = cv2.Canny(sketch, 100, 200)
edge_image = Image.fromarray(np.stack([edges]*3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="Warli folk art style, traditional Goan motifs, terracotta and white",
    image=edge_image,                  # composition controlled by edges
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]

Why ControlNet matters: Tara's grandmother sketched the layout she wanted. ControlNet preserves that sketch's composition while filling in the artistic style. This makes the AI a collaborator, not a replacement.

Ethics

Safety Controls Tara Added

Image generation is powerful and dangerous. The same Stable Diffusion that paints folk art can generate harmful content. Tara built these safeguards:

Negative prompts: Always include "nsfw, nude, child, violence, gore" in negative prompts.
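Safety checker: The Hugging Face StableDiffusionSafetyChecker screens outputs for NSFW content. It ships with the Stable Diffusion 1.x pipelines; the SDXL pipelines used above do not bundle it, so run it (or another image classifier) on the outputs as a separate step. Where it is enabled, don't disable it.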
Watermark: Add a visible "AI-generated · Mitra Heritage Project" watermark to every output.
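Invisible watermark: Use the invisible-watermark package (module name imwatermark), which Stability AI's pipelines also use, to embed a hidden provenance mark. How well it survives cropping, resizing, or re-encoding depends on the method, so treat it as a complement to the visible watermark, not a guarantee.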
No real faces: Refuse to generate identifiable people without consent. Use generic figures only.
Source credit: Always credit "Trained on grandmother's original paintings, with permission."
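The negative-prompt rule is easy to forget when prompts are typed by hand, so it helps to enforce it in code. The helper below is a hypothetical sketch (`with_safety_terms` and `SAFETY_TERMS` are illustrative names) that appends the terms from the list above to whatever negative prompt is supplied. Negative prompts only steer the model away from content; they are not a filter, so this complements the other safeguards and does not replace them.

```python
SAFETY_TERMS = ["nsfw", "nude", "child", "violence", "gore"]

def with_safety_terms(negative_prompt=""):
    """Append the safety terms to a negative prompt, skipping duplicates."""
    terms = [t.strip() for t in negative_prompt.split(",") if t.strip()]
    seen = {t.lower() for t in terms}
    for term in SAFETY_TERMS:
        if term not in seen:
            terms.append(term)
            seen.add(term)
    return ", ".join(terms)

print(with_safety_terms("blurry, low quality, NSFW"))
print(with_safety_terms())
```

The result can be passed as the `negative_prompt` argument in the generation calls shown earlier.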

Indian law context: The IT Rules 2021 (amended 2023), together with later government advisories on deepfakes, push for AI-generated content that could be mistaken for real to be clearly labelled. The DPDPA 2023 generally requires consent (or another lawful ground) before processing personal data, which can include a person's likeness. Generating images of identifiable people without consent is unethical and likely illegal.

Tara's outcome: 50 posters printed for her grandmother's heritage workshop. Zero NSFW incidents. The credit line "AI-generated, trained on grandmother's original art with her permission" became the workshop's tagline.

📝 Check Your Understanding (8 Questions)

1. What is the core insight that makes diffusion models work?

a) Images are easier to generate when broken into 8x8 pixel blocks

b) Generating an image directly is hard, but predicting the noise added at each step of a noising process is much easier; train a network to denoise, then start from pure noise and iteratively denoise to get a generated image

c) Random initialisation of pixel values produces realistic images after enough training epochs

d) Diffusion models replace GANs because they have no discriminator

2. Why does Stable Diffusion use latent diffusion instead of pixel-space diffusion?

a) Latent space is required by the licensing agreement

b) Diffusing in a compressed VAE latent space (8× smaller in each dimension) gives the denoiser roughly 64× fewer spatial positions to process, which makes high-resolution generation such as 1024×1024 practical on a single consumer GPU

c) Latent diffusion produces strictly higher quality than pixel diffusion

d) Pixel-space diffusion is unstable and never converges

3. What is the role of guidance_scale (CFG) in image generation?

a) It sets the maximum image dimensions the model can generate

b) Classifier-Free Guidance: at each step, the model predicts noise with and without the prompt and the result is pushed toward the conditional prediction; higher values force closer prompt adherence but tend to oversaturate the image from about 12 upward

c) It controls how many GPUs are used in parallel

d) It is the random seed for the noise schedule

4. Why does Tara use ControlNet with an edge map of her grandmother's sketch?

a) ControlNet is required for any non-English prompt

b) ControlNet conditions the generation on the structural information in the edge map, preserving the composition Tara's grandmother sketched while letting the diffusion model fill in the Warli folk-art style

c) ControlNet runs faster than vanilla Stable Diffusion

d) It allows generating multiple images in parallel

5. Why should the StableDiffusionSafetyChecker generally not be disabled in user-facing apps?

a) Disabling it voids the Apache 2.0 licence

b) It blocks the most obvious NSFW outputs; disabling it removes a safety layer that protects users (especially minors) from harmful content and protects the developer from legal liability

c) It is required for the model to converge during inference

d) It produces higher-resolution images when enabled

6. Why does Tara add a visible watermark and an invisible watermark to every generated poster?

a) Watermarks improve the model's training process

b) Visible watermarks prevent casual misrepresentation as human-made art; invisible (steganographic) watermarks can survive some edits and help establish AI provenance. Both are best practice and increasingly expected under rules such as India's IT Rules and the EU AI Act

c) Watermarks help search engines find the images

d) Hugging Face's terms of service mandate watermarks

7. What is the legal context in India for AI-generated images of identifiable people?

a) There are no laws that apply to AI-generated images in India

b) The DPDPA 2023 generally requires consent (or another lawful ground) to process personal data, which can include likenesses, and the IT Rules 2021 (2023 amendment) together with later government advisories push for synthetic media to be clearly labelled; generating images of identifiable people without consent is unethical and likely illegal

c) Only commercial use of AI-generated images is regulated

d) Indian law allows any AI image generation as long as it is not commercial

8. What is the difference between DDPM (1000 steps) and DDIM (50 steps) at inference time?

a) DDIM uses a different and incompatible model architecture

b) DDIM is a deterministic sampler that uses the same trained model but a non-Markovian schedule, achieving comparable quality in ~20× fewer steps — turning interactive prompting from minutes per image to seconds per image

c) DDIM produces black-and-white images only

d) DDIM requires 20× more GPU memory than DDPM
