Lesson 7 — Multi-modal AI: Vision + Language + Speech | Class 11

Story

Aryan's Accessibility App for Delhi

Aryan, a 16-year-old from Delhi, had a grandmother who was losing her sight to cataracts. She could no longer read labels at the market, identify medicine packaging, or read the WhatsApp messages her family sent. Aryan decided to build something for her.

The app he imagined: point your phone at anything, speak a question in Hindi, and hear a description spoken back. "What medicine is this?" → the app reads the label and answers in Hindi. "What does this sign say?" → it translates and speaks the answer.

Three models working together: CLIP to understand what's in the image, Whisper to transcribe the spoken Hindi question, and a vision-language model to answer. Aryan's grandmother now uses it every day at the market.

Section 1

What is Multi-modal AI?

Traditional AI models process one modality: a CNN sees images, an LLM reads text, a speech model processes audio. Multi-modal AI combines modalities — it can reason across images, text, and sound simultaneously.

The key challenge: how do you put an image and a sentence into the same "space" so you can compare them? OpenAI's CLIP solved this with contrastive learning — training a vision encoder and a text encoder to produce embeddings where matching image-text pairs are close together and non-matching pairs are far apart.
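To make the idea of a shared "space" concrete, here is a minimal sketch that compares one image embedding with two caption embeddings using cosine similarity. The three-dimensional vectors are made-up values chosen for illustration; real CLIP embeddings have hundreds of dimensions and come from the trained encoders.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
image_of_medicine_box = [0.9, 0.1, 0.2]
text_embeddings = {
    "a medicine box": [0.8, 0.2, 0.1],
    "a vegetable stall": [0.1, 0.9, 0.3],
}

# Score every caption against the image and keep the closest one.
scores = {
    caption: cosine_similarity(image_of_medicine_box, vector)
    for caption, vector in text_embeddings.items()
}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

The caption whose vector points in nearly the same direction as the image vector wins, which is the comparison contrastive training is designed to make meaningful.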

Section 2

CLIP: Contrastive Image-Language Pre-training

# pip install transformers torch pillow

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# ── Zero-shot image classification ───────────────────────────────
def classify_image(image_path: str, candidate_labels: list[str]) -> dict:
    image = Image.open(image_path).convert("RGB")

    inputs = processor(
        text=candidate_labels,
        images=image,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image: similarity of the image to each text label.
    # squeeze(0) drops only the batch dimension, so a single label still works.
    probs = outputs.logits_per_image.softmax(dim=1).squeeze(0)

    return {label: float(prob) for label, prob in zip(candidate_labels, probs)}

# Example: identify medicine packaging
labels = [
    "a Paracetamol tablet box",
    "a Metformin medicine bottle",
    "an Aspirin blister pack",
    "a vitamin supplement bottle"
]
result = classify_image("medicine.jpg", labels)
for label, prob in sorted(result.items(), key=lambda x: -x[1]):
    print(f"{prob:.1%}  {label}")

# ── Image-text similarity (for accessibility app) ─────────────────
def describe_scene(image_path: str, descriptions: list[str]) -> str:
    """Return the candidate description that best matches the image."""
    scores = classify_image(image_path, descriptions)
    return max(scores, key=scores.get)

scene = describe_scene("street.jpg", [
    "a busy road with cars and traffic",
    "a quiet park with trees and benches",
    "a market with vegetable stalls",
    "a hospital entrance with people"
])
print(f"Scene: {scene}")

How CLIP works: It was trained on 400 million (image, caption) pairs from the internet. The image encoder (ViT) and text encoder (Transformer) were trained contrastively — the model learned to pull matching pairs together in embedding space and push non-matching pairs apart. Zero-shot classification emerges for free: just compare the image embedding to text embeddings of your labels.
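The "pull together, push apart" objective can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a simplified illustration, not CLIP's actual training code: the similarity values are invented, and the fixed `scale` of 10.0 stands in for the temperature that CLIP learns during training.

```python
import math

def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability that softmax(logits) assigns to the target index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(similarity: list[list[float]], scale: float = 10.0) -> float:
    """Symmetric loss; similarity[i][j] compares image i with text j."""
    n = len(similarity)
    logits = [[scale * value for value in row] for row in similarity]
    # Image -> text: each row should favour its diagonal (matching) entry.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: the same check along each column.
    columns = [list(column) for column in zip(*logits)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

def count_negative_pairs(batch_size: int) -> int:
    """Off-diagonal (mismatched) entries in a batch_size x batch_size matrix."""
    return batch_size * (batch_size - 1)

# Made-up similarity matrices for a batch of three (image, text) pairs.
well_aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
untrained = [[0.5, 0.5, 0.5] for _ in range(3)]

loss_aligned = contrastive_loss(well_aligned)
loss_untrained = contrastive_loss(untrained)
print(loss_aligned < loss_untrained)
print(count_negative_pairs(32))
```

A matrix with high values on the diagonal gives a lower loss than one where every pair looks equally similar, and the number of mismatched pairs grows roughly with the square of the batch size.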

Section 3

Whisper: Speech-to-Text in Any Language

# pip install openai-whisper

import whisper

# Load model (base = fast, large = most accurate)
# Models: tiny, base, small, medium, large, large-v2, large-v3
asr_model = whisper.load_model("base")

# ── Transcribe Hindi audio ────────────────────────────────────────
def transcribe_hindi(audio_path: str) -> dict:
    result = asr_model.transcribe(
        audio_path,
        language="hi",           # force Hindi
        task="transcribe",       # transcribe in original language
        fp16=False               # use fp32 on CPU
    )
    return {
        "text": result["text"],
        "language": result["language"],
        "segments": result["segments"]  # timestamped segments
    }

# ── Translate Hindi audio to English ─────────────────────────────
def translate_hindi_to_english(audio_path: str) -> str:
    result = asr_model.transcribe(
        audio_path,
        language="hi",
        task="translate",  # translate to English
        fp16=False         # use fp32 on CPU, matching transcribe_hindi
    )
    return result["text"]

# Supported Indian languages: Hindi, Bengali, Tamil, Telugu,
# Marathi, Kannada, Malayalam, Gujarati, Punjabi, Urdu (98 total)

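Whisper is genuinely impressive for Indian languages. It was trained on 680,000 hours of labelled audio collected from the web, covering many languages. Word error rates of roughly 5–7% for Hindi and 8–12% for Tamil and Telugu are sometimes quoted, but treat them as rough, best-case figures: accuracy depends on the model size, the audio quality, and the benchmark, and the small base checkpoint loaded above makes noticeably more errors than large. Whisper copes reasonably well with accented English, code-switching (Hinglish), and noisy environments, but always measure on your own recordings.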
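Word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance table so you can measure Whisper on your own recordings; the two Hindi sentences are made-up examples, not real Whisper output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Hypothetical example: the transcript drops one of five words.
reference_text = "यह कौन सी दवा है"
whisper_text = "यह कौन दवा है"
wer = word_error_rate(reference_text, whisper_text)
print(f"WER: {wer:.0%}")
```

Averaging this over a few dozen recordings from the people who will actually use the app gives a far more useful number than any published benchmark.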

Section 4

Building the Visual Q&A Accessibility App

# Full pipeline: image + spoken Hindi question → Hindi answer
# pip install openai-whisper openai

import whisper
import openai
import base64

asr_model = whisper.load_model("base")
client = openai.OpenAI()  # uses OPENAI_API_KEY env var

def image_to_base64(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def visual_qa_hindi(image_path: str, audio_question_path: str) -> str:
    """
    1. Transcribe the spoken Hindi question with Whisper
    2. Send the image and the question to GPT-4o (vision-capable)
    3. Return the answer in Hindi
    """
    # Step 1: Speech → Text
    question_result = asr_model.transcribe(
        audio_question_path, language="hi", task="transcribe", fp16=False
    )
    hindi_question = question_result["text"]
    print(f"Question: {hindi_question}")

    # Step 2: Vision-Language model answers
    image_b64 = image_to_base64(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",    # supports vision natively
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant for visually impaired users. "
                    "Describe images clearly and answer questions in Hindi. "
                    "Be concise and practical. Mention important text visible in the image."
                )
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}",
                            "detail": "high"   # full resolution analysis
                        }
                    },
                    {
                        "type": "text",
                        "text": hindi_question
                    }
                ]
            }
        ],
        max_tokens=300
    )

    hindi_answer = response.choices[0].message.content
    print(f"Answer: {hindi_answer}")
    return hindi_answer

# Usage:
answer = visual_qa_hindi(
    image_path="medicine_box.jpg",
    audio_question_path="question_hindi.wav"
)
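# Example output:
# "यह Paracetamol 500mg की गोलियाँ हैं। इसे बुखार और दर्द के लिए लिया जाता है।"
# (English: "These are Paracetamol 500mg tablets. They are taken for fever and pain.")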
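The pipeline above labels every image as `image/jpeg` in the data URL, which is wrong for a PNG or WebP photo. One possible refinement, sketched here with the standard-library `mimetypes` module, is to guess the type from the file name. The helper name `image_bytes_to_data_url` and the JPEG fallback are assumptions made for this example, not part of the OpenAI SDK.

```python
import base64
import mimetypes

def image_bytes_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/jpeg"  # assumption: fall back to JPEG when unsure
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Tiny placeholder bytes stand in for real image data.
png_url = image_bytes_to_data_url(b"abc", "label.png")
jpg_url = image_bytes_to_data_url(b"abc", "medicine_box.jpg")
print(png_url)
```

The returned string can be used as the `url` value of the `image_url` entry in the request.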

Accessibility and ethics: Apps for visually impaired users must be extremely reliable. If the model confidently gives wrong medicine information, it could cause serious harm. Always include a disclaimer, support human verification for medical content, and test extensively with real visually impaired users before deployment. Review India's Rights of Persons with Disabilities Act 2016 for accessibility standards.
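One way to act on this is to make the app abstain when the model is unsure instead of always naming a medicine. The sketch below applies a confidence gate to a label-to-probability dictionary such as the one returned by classify_image in Section 2. The function name and both thresholds are illustrative assumptions; softmax scores over a handful of candidate labels are not calibrated probabilities, so real thresholds would have to be tuned on test data.

```python
from typing import Optional

def pick_label_or_abstain(
    probs: dict[str, float],
    min_confidence: float = 0.80,  # assumed threshold, tune on real data
    min_margin: float = 0.20,      # assumed gap required over the runner-up
) -> Optional[str]:
    """Return the top label only when it clearly beats the alternatives."""
    if not probs:
        return None
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    best_label, best_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_prob < min_confidence or best_prob - runner_up_prob < min_margin:
        return None  # caller should ask the user to verify with a person
    return best_label

# Made-up scores for illustration.
confident = {"a Paracetamol tablet box": 0.93, "an Aspirin blister pack": 0.05,
             "a vitamin supplement bottle": 0.02}
ambiguous = {"a Paracetamol tablet box": 0.48, "an Aspirin blister pack": 0.45,
             "a vitamin supplement bottle": 0.07}

print(pick_label_or_abstain(confident))
print(pick_label_or_abstain(ambiguous))
```

When the function returns None, the app can say that it is not sure and suggest checking with a pharmacist or family member rather than guessing.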

👁️ Lesson 7 Quiz — Multi-modal AI

1. CLIP achieves zero-shot image classification without ever being explicitly trained on "classify images into categories." This is possible because:

a) CLIP was secretly fine-tuned on ImageNet before being released

b) CLIP's contrastive training on 400M (image, caption) pairs created a shared embedding space where image and text representations are comparable. At inference, you compute similarity between the image embedding and text embeddings of candidate labels — no classification head needed. Any text you can write becomes a potential "class."

c) Zero-shot classification requires no training — CLIP uses rule-based matching

d) The softmax over logits_per_image automatically creates categorical predictions from any model

2. Whisper was trained on 680,000 hours of audio from the internet. The key benefit for Indian language users compared to older ASR models is:

a) Whisper is faster — it processes audio 10x faster than older speech recognition systems

b) The massive multilingual training set means Whisper natively handles 98 languages including Hindi, Tamil, Telugu, and Bengali — with strong robustness to Indian accents, noisy environments, and code-switching (Hinglish). Older ASR models typically had separate, smaller, lower-quality models per language.

c) Whisper uses a transformer trained only on Indian language audio for maximum accuracy

d) Whisper can translate between all Indian languages directly without an English intermediate

3. In the visual Q&A app, "detail": "high" is passed with the image to the vision model (gpt-4o in the code above). This parameter:

a) Increases the image resolution stored in OpenAI's servers

b) Tells the model to look at the image in more detail: besides a low-resolution overview, the image is split into 512×512 tiles that are each processed for fine-grained analysis, which matters for reading small text (like medicine labels). "low" detail uses a single 512×512 version, which is cheaper and faster but cannot read small text accurately.

c) Enables the model to describe colours and textures in higher precision

d) Routes the request to a higher tier of API that has access to more recent training data

4. CLIP's logits_per_image is computed by taking the dot product of the normalised image embedding with each normalised text embedding (their cosine similarity), multiplied by a learned scale factor. A high score means:

a) The image and text are literally identical in pixel and character representation

b) The two embeddings point in similar directions in the shared embedding space — indicating semantic alignment. CLIP learned that images containing Paracetamol tablets produce embeddings that point in the same direction as the text "a Paracetamol tablet box" because training repeatedly paired such images with such captions.

c) The text label appears as visible text in the image

d) The image was explicitly included in CLIP's training dataset

5. For an accessibility app helping visually impaired users identify medicine, the most critical safety requirement is:

a) The app must achieve 99.9% accuracy on all medicine labels before any user testing

b) Always present uncertainty clearly, include a disclaimer that the AI can make mistakes, provide a way to verify critical medical information through a pharmacist or the manufacturer's helpline, and never present dosage instructions as definitive without human verification. AI errors in medical contexts have serious safety consequences.

c) The app should only support a list of pre-approved medicines to prevent errors

d) Audio output must be louder than 80 decibels to ensure users with hearing impairment can hear it

6. Contrastive learning in CLIP uses a loss that brings matching (image, text) pairs closer and pushes non-matching pairs apart. A batch of 32 (image, text) pairs produces how many negative examples?

a) 32 negative examples — one per pair

b) 32×31 = 992 negative pairs: the 32×32 similarity matrix has 32 matching pairs on its diagonal and 992 mismatched pairs off it. Each image is contrasted with the 31 other texts in the batch, and each text with the 31 other images. This is why large batch sizes are crucial for contrastive learning: more negatives per step means the model must be more discriminative.

c) 32² = 1024 negative examples — every possible (image, text) combination including positives

d) 1 negative example per positive — contrastive learning uses only pairwise comparisons

7. Whisper's task="translate" option translates speech directly from Hindi to English text without an intermediate Hindi text step. The advantage over a two-step Whisper → translation API pipeline is:

a) Direct translation is always more accurate than two-step for all language pairs

b) One model pass instead of two means lower latency and cost. More importantly, direct translation avoids cascading errors: the model produces English straight from the raw audio, so there is no intermediate Hindi transcript whose mistakes could mislead it. In the two-step pipeline, transcription errors are frozen and passed on as bad input to the translator.

c) Two-step pipelines violate OpenAI's terms of service for non-English languages

d) Direct translation uses the same model as transcription, so no additional API key is needed

8. The system prompt in the visual Q&A app instructs the vision model to "mention important text visible in the image." This is especially important for accessibility because:

a) The vision model cannot process images without explicit instructions to read text

b) Visually impaired users often specifically need to know what text is written on objects (medicine labels, signs, product names) — without this instruction, the model might describe visual features (colour, shape, size) but omit the critical text content. Tailoring the system prompt to the use case substantially improves real-world usefulness.

c) OCR (reading text in images) requires a special model that only activates with this instruction

d) The instruction enables the model to respond in Hindi instead of English
