Lesson 6 — Building a RAG Chatbot | Class 10

Meet Priya — Class 10, Chandigarh

Priya's mother is a nurse. Every day she gets WhatsApp questions from relatives asking about medicines from doctor prescriptions — "What is this tablet for?", "Can it be taken with food?" She has a shelf of medical reference books that she wishes she could search instantly.

Priya wanted to build a chatbot that could answer questions about a specific medical PDF — not from the internet, not from the LLM's training memory, but directly from the pages of that document. Her computer science teacher told her: "What you need is called RAG — Retrieval-Augmented Generation. It's how enterprise AI assistants work over company documents."

Why LLMs Hallucinate

The Problem RAG Solves

LLMs like Gemini or GPT store knowledge in their weights from pre-training. But they don't know about:

Documents uploaded after their training cutoff
Private company documents, textbooks, or research papers
Specific numbers (drug doses, pricing tables, exam schedules)

When asked about specific facts they don't know, LLMs often hallucinate: they generate confident-sounding but wrong answers. RAG reduces this risk by retrieving the relevant text from your document and placing it in the prompt as context for the LLM.

RAG rule: The LLM's answer can only be as correct as the retrieved context. If the document doesn't contain the answer, a well-built RAG system should say "Not found in document" rather than guess.
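As a rough illustration of this rule, the sketch below shows the "augment" step in plain Python: retrieved chunks are pasted into the prompt, and when nothing was retrieved the function returns the fallback message instead of a prompt. The names NOT_FOUND, build_prompt, and prepare are hypothetical, chosen only for this example, and no LLM is called.

```python
NOT_FOUND = "Not found in document"

def build_prompt(chunks, question):
    """Paste the retrieved chunks above the question so the LLM answers from them."""
    context = "\n\n".join(chunks)
    return (
        "Answer ONLY from the context below. "
        f'If the answer is missing, reply "{NOT_FOUND}".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def prepare(chunks, question):
    """Return (prompt, fallback); exactly one of the two is None."""
    if not chunks:
        # Nothing retrieved: do not ask the LLM to guess.
        return None, NOT_FOUND
    return build_prompt(chunks, question), None

prompt, fallback = prepare(
    ["Food: Can be taken with or without food."],
    "Can paracetamol be taken with food?",
)
print(prompt)
```

In a real system the retriever almost always returns something, so the instruction inside the prompt does most of the work; the empty-list branch is only a safety net.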

How RAG Works

The Retrieve → Augment → Generate Pipeline

📄 PDF Document → ✂️ Split into Chunks (500 chars) → 🔢 Embed Each Chunk → 🗄️ Store in Vector DB (ChromaDB)

↓ At Query Time ↓

❓ User Question → 🔢 Embed Question → 🔍 Find Similar Chunks (cosine similarity) → 📝 Prompt = Context + Question → 🤖 LLM Answer

Embeddings: An embedding converts text to a list of numbers, called a vector (e.g., 768 numbers; the all-MiniLM-L6-v2 model used in this lesson produces 384), such that semantically similar texts get vectors that lie close together. "tablet" and "medicine" are closer in embedding space than "tablet" and "river". This is how the vector search finds relevant chunks even when the words don't exactly match the query.
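To make "closer in embedding space" concrete, here is a minimal sketch of cosine similarity and top-k search in plain Python. The three-number vectors are made up by hand for illustration (a real embedding model outputs hundreds of numbers), and toy_embeddings and top_k are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT output from a real embedding model.
toy_embeddings = {
    "tablet":   [0.9, 0.8, 0.1],
    "medicine": [0.8, 0.9, 0.2],
    "river":    [0.1, 0.2, 0.9],
}

def top_k(query_word, k=2):
    """Rank every other word by similarity to the query word, best first."""
    query = toy_embeddings[query_word]
    scored = [
        (cosine_similarity(query, vec), word)
        for word, vec in toy_embeddings.items()
        if word != query_word
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

print(top_k("tablet"))  # "medicine" ranks ahead of "river"
```

A vector database does the same ranking, only over thousands of chunk vectors and with an index that avoids comparing the query against every one of them.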

Full Code

Build a PDF Q&A Chatbot in Google Colab

# RAG Chatbot over a PDF — Google Colab
# Uses: LangChain + ChromaDB + Gemini API (free tier)

!pip install langchain langchain-community langchain-google-genai \
             chromadb pypdf sentence-transformers -q

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# ── Step 1: Add your Gemini API key ──
# Get free key at: https://makersuite.google.com/app/apikey
GOOGLE_API_KEY = "your-gemini-api-key-here"    # replace with your key
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

# ── Step 2: Load your document ──
# With a real PDF: upload it to Colab first, then set the path
# from google.colab import files; files.upload()  # then check filename
PDF_PATH = "medicine_reference.pdf"   # change to your file
# documents = PyPDFLoader(PDF_PATH).load()   # one Document per page

# For this demo, we create a small text file and load that instead:
demo_text = """
PARACETAMOL 500mg Tablets
Uses: Relief of mild to moderate pain including headache, toothache,
      fever, and cold symptoms.
Dosage: Adults and children over 12: 1-2 tablets every 4-6 hours.
        Maximum 8 tablets in 24 hours.
        Children under 12: Not recommended.
Side effects: Rare at recommended doses. Overdose causes liver damage.
Food: Can be taken with or without food.
Contraindications: Do not take if allergic to paracetamol or
                   if you have severe liver disease.

METFORMIN 500mg Tablets
Uses: Type 2 diabetes. Reduces blood sugar levels.
Dosage: 500mg twice daily with meals. Dose may be increased by doctor.
Side effects: Nausea, diarrhoea (usually temporary). Rarely lactic acidosis.
Food: Take with meals to reduce stomach upset.
Contraindications: Kidney disease, liver disease, excessive alcohol use.
"""
with open("medicine_reference.txt", "w") as f:
    f.write(demo_text)

from langchain_community.document_loaders import TextLoader
loader = TextLoader("medicine_reference.txt")
documents = loader.load()
print(f"Loaded {len(documents)} document(s)")

# ── Step 3: Split into chunks ──
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,       # overlap ensures context isn't cut at boundaries
    length_function=len
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# ── Step 4: Create embeddings and vector store ──
# Using a lightweight sentence-transformers model (runs locally, free)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

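# Note: re-running this cell adds the same chunks to ./chroma_db again.
# Delete the folder first if you start seeing duplicate results.
vectorstore = Chroma.from_documents(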
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # top 3 chunks
print("Vector store created!")

# ── Step 5: Set up Gemini LLM ──
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",    # free tier, fast (check which models your key can access)
    temperature=0.1              # low temperature = more deterministic, less creative
)

# ── Step 6: Custom prompt — force grounded answers ──
PROMPT_TEMPLATE = """You are a medical information assistant helping a nurse
quickly look up drug information from a reference document.
Answer ONLY based on the context provided below.
If the answer is not in the context, say "This information is not in the document."
Do not use outside knowledge. Be concise and accurate.

Context from document:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

# ── Step 7: Build RAG chain ──
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",          # "stuff" = concatenate all chunks into prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# ── Step 8: Ask questions ──
def ask(question):
    result = rag_chain.invoke({"query": question})
    print(f"\nQ: {question}")
    print(f"A: {result['result']}")
    print(f"   (from {len(result['source_documents'])} chunk(s))")

ask("Can paracetamol be taken with food?")
ask("What is the maximum daily dose of paracetamol for adults?")
ask("What are the side effects of metformin?")
ask("Can I take metformin if I have kidney disease?")
ask("What is the dose of amoxicillin 500mg?")   # not in document

Priya's result: When asked "Can paracetamol be taken with food?" the chatbot answered: "Paracetamol can be taken with or without food." When asked about a drug not in the document, it said: "This information is not in the document." That is exactly the grounded behaviour needed for a safe medical assistant.
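One way to check this behaviour repeatably is to keep a short list of questions together with a phrase each answer should contain. The sketch below is an assumption about how such a check might look and is not part of Priya's notebook: run_checks and fake_answer are hypothetical names, and fake_answer only stands in for a function that would call rag_chain.invoke and return result['result'].

```python
REFUSAL = "This information is not in the document."

def run_checks(answer_fn, checks):
    """Return the questions whose answer lacks the expected phrase (case-insensitive)."""
    failures = []
    for question, expected in checks:
        answer = answer_fn(question)
        if expected.lower() not in answer.lower():
            failures.append(question)
    return failures

checks = [
    ("Can paracetamol be taken with food?", "with or without food"),
    ("What is the dose of amoxicillin 500mg?", REFUSAL),  # not in the document
]

def fake_answer(question):
    # Stand-in for the real chatbot so this sketch runs without an API key.
    if "paracetamol" in question.lower():
        return "Paracetamol can be taken with or without food."
    return REFUSAL

print(run_checks(fake_answer, checks))  # [] means every check passed
```

Including at least one question the document cannot answer is the important part, because that is the case where an ungrounded chatbot would guess.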

Production Tips

Making Your RAG Chatbot Better

Chunk size: 500–1000 characters with an overlap of about 100 characters is a good starting point. Too small = not enough context in each chunk; too large = each retrieved chunk drags in irrelevant text.
Top-k retrieval: k=3–5 is common. More chunks = more context but higher API cost and risk of confusing the LLM.
Reranking: After initial retrieval, use a cross-encoder model to rerank results by relevance before passing to the LLM.
Metadata filtering: Add chapter/page metadata to chunks and filter by document section when relevant.
Persistent ChromaDB: Save embeddings to disk with persist_directory — avoid re-embedding the whole document on each run.
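To see what chunk size and overlap mean in practice, here is a simplified sliding-window splitter in plain Python. It is only a sketch of the idea: split_with_overlap is a hypothetical helper, and LangChain's RecursiveCharacterTextSplitter behaves differently because it prefers to cut at paragraph, line, and word boundaries, so its chunks are often shorter than chunk_size.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1200 characters of dummy text: "012345678901234..."
text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(text)

print([len(c) for c in chunks])            # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True: 100 characters are shared
```

Because the window advances by chunk_size minus chunk_overlap, a larger overlap produces more chunks for the same document, which means more embeddings to compute and store.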

Other LLM options: Replace Gemini with Ollama (fully local, free, no API key) to run models like Llama 3 or Mistral entirely on your own machine — ideal when working with private documents that should never leave your computer.

🧪 Check Your Understanding — Lesson 6 Quiz

1. The main reason LLMs hallucinate facts is:

a) They are poorly trained on English grammar

b) They generate text based on statistical patterns in training data — they don't "look up" facts and will confidently produce plausible-sounding but wrong answers for things they don't know

c) They run too fast to check their answers

d) Hallucination only happens with older GPT models, not modern ones

2. In RAG, the "Retrieval" step retrieves:

a) The user's conversation history

b) The most relevant chunks from your document's vector store based on semantic similarity to the user's question

c) The latest version of the LLM from the internet

d) All pages of the PDF every time a question is asked

3. Text embeddings are used in RAG because:

a) They compress the PDF to a smaller file size

b) They convert text to numeric vectors where semantically similar text is close in vector space — enabling similarity search that works even when exact words don't match

c) They translate text between languages automatically

d) They are required by the Gemini API to format text input

4. In the RAG prompt template, why does the prompt say "Answer ONLY based on the context provided"?

a) To prevent the LLM from being creative in its writing style

b) To ground the LLM's answer in the retrieved document text and prevent it from making up facts from its training data

c) Because the Gemini API only processes text from the context field

d) To reduce the length of the answer

5. `chunk_overlap=100` in RecursiveCharacterTextSplitter means:

a) 100 characters are duplicated between adjacent chunks to ensure context isn't lost at chunk boundaries

b) Each chunk is exactly 100 characters long

c) The splitter creates 100 chunks per page

d) The first 100 chunks are ignored during retrieval

6. ChromaDB is used in this pipeline as:

a) A PDF reader that parses text from files

b) A vector database that stores text chunk embeddings and supports fast semantic similarity search

c) The LLM that generates answers

d) A code formatter for Python notebooks

7. Why is `temperature=0.1` set for the Gemini LLM in a medical chatbot?

a) Low temperature limits context window size

b) Low temperature makes the LLM's responses more deterministic, with less random variation, so it stays closer to the retrieved text, which is essential when accuracy matters for medical information

c) It speeds up Gemini API response time

d) Temperature controls the number of chunks retrieved

8. If a user asks about a drug not mentioned in the PDF, the correct RAG chatbot behaviour is:

a) Search the internet for the answer and add it to the response

b) Use the LLM's training knowledge to answer the question anyway

c) Say "This information is not in the document" — the retrieved chunks won't contain relevant text, and the prompt instruction prevents the LLM from guessing

d) Crash with an error because no chunks are retrieved
