Lesson 02 — Vector Databases & Production RAG | Class 12

Story

Karthik's Tamil Legal Aid Chatbot

👨‍⚖️ Karthik · Chennai · Age 17

Karthik volunteered with a Chennai NGO that helps daily-wage workers with legal questions in Tamil. They had 12,000 pages of Tamil legal Q&A documents. A simple ChromaDB prototype worked for 100 documents but timed out at 5,000 queries/day.

He rebuilt with Pinecone (managed vector DB), hybrid search (BM25 + dense embeddings), and a reranker. Result: P95 latency dropped from 4 seconds to 280ms, and answer accuracy improved from 62% to 89%.

Concepts

Why Vector Databases?

A vector database stores high-dimensional vectors (embeddings) and finds approximate nearest neighbours in milliseconds, even across billions of vectors. The key is an index such as HNSW (Hierarchical Navigable Small World), a layered graph that lets the search hop towards the query's neighbourhood instead of comparing the query to every vector.
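
To see what an index saves you from, here is a minimal sketch of the exhaustive alternative: exact cosine search that scores the query against every stored vector. The function names, the random toy data and the sizes are assumptions for illustration only; a real vector database replaces this full scan with an index.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k=3):
    """Exact search: one similarity computation per stored vector."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy data: 1,000 random 32-dimensional vectors (hypothetical, not real embeddings)
random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(32)] for _ in range(1000)]

# A query that is a slightly noisy copy of vector 42
query = [x + random.gauss(0, 0.01) for x in vectors[42]]

hits = brute_force_top_k(query, vectors, k=3)
print(hits[0])  # the closest match should be index 42
```

The cost of this scan grows linearly with the number of vectors, which is why it stops being practical long before you reach millions of chunks.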

Vector DB	Type	Best For	Approx. Cost (USD)
ChromaDB	Embedded / Self-hosted	Prototypes, <100K vectors	Free (your server)
Pinecone	Managed cloud	Production, low ops	~$70/mo starter
Weaviate	Self/managed, hybrid native	Hybrid search apps	Free OSS or $25/mo cloud
Qdrant	Self/managed, fast filters	Filtered search at scale	Free OSS or $25/mo
pgvector	Postgres extension	Apps already on Postgres	Free with your DB

Karthik's choice: Pinecone Starter — ₹6,000/month, zero ops, scales to 5M vectors. The NGO budget allowed it; ChromaDB self-hosted would have cost more in volunteer DevOps time.

Code

Build the Indexing Pipeline

Step 1 — Embed the Tamil legal documents using a multilingual model:

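import os

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

# Multilingual model: covers Tamil, Hindi and English
model = SentenceTransformer("intfloat/multilingual-e5-large")

# Read the API key from the environment instead of hard-coding it
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index only on the first run; create_index fails if the name already exists
if "tamil-legal" not in pc.list_indexes().names():
    pc.create_index(
        name="tamil-legal",
        dimension=1024,  # multilingual-e5-large output dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index("tamil-legal")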

Step 2 — Chunk and upsert (recursive character chunking with overlap):

from langchain.text_splitter import RecursiveCharacterTextSplitter

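# chunk_size and chunk_overlap are measured in characters, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
docs = []  # list of {"id", "text", "source", "law_section"}
# load_legal_docs is a helper not shown in this lesson; it is assumed to yield
# objects with .id, .text, .source and .section attributes
for raw_doc in load_legal_docs("./legal_pdfs/"):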
    chunks = splitter.split_text(raw_doc.text)
    for i, chunk in enumerate(chunks):
        docs.append({
            "id": f"{raw_doc.id}-{i}", "text": chunk,
            "source": raw_doc.source, "law_section": raw_doc.section,
        })

# Embed in batches of 64 to avoid OOM
batch_size = 64
for start in range(0, len(docs), batch_size):
    batch = docs[start:start+batch_size]
    texts = [f"passage: {d['text']}" for d in batch]  # e5 requires "passage:" prefix
    embeddings = model.encode(texts, normalize_embeddings=True)
    index.upsert(vectors=[
        {"id": d["id"], "values": emb.tolist(),
         "metadata": {"text": d["text"], "source": d["source"], "law_section": d["law_section"]}}
        for d, emb in zip(batch, embeddings)
    ])
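
If the overlap setting feels abstract, the sketch below shows the idea with a plain fixed-width sliding window. It is a simplification and not what RecursiveCharacterTextSplitter does internally: the real splitter prefers to cut at paragraph, line and word boundaries. The function name and the toy text are assumptions for illustration.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=80):
    """Split text into fixed-width chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy text of 1,200 characters (hypothetical stand-in for a legal document)
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(text)

print(len(chunks))                 # number of chunks
print([len(c) for c in chunks])    # chunk lengths
print(chunks[0][-80:] == chunks[1][:80])  # the shared 80-character overlap
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.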

Code

Hybrid Search + Reranking

Pure dense search misses exact-match queries (e.g. "Section 138 of the Negotiable Instruments Act"). Hybrid search combines BM25 (keyword) and dense embeddings:

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# BM25 index built once on all chunk texts
bm25 = BM25Okapi([d["text"].split() for d in docs])

# Reranker — small but high-precision cross-encoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def hybrid_rag(query: str, top_k: int = 5):
    # 1. Dense recall — get top 30 candidates
    q_emb = model.encode(f"query: {query}", normalize_embeddings=True).tolist()
    dense_hits = index.query(vector=q_emb, top_k=30, include_metadata=True)
    candidates = {h["id"]: h["metadata"] for h in dense_hits["matches"]}

    # 2. BM25 recall: add up to 20 keyword matches, skipping chunks with no term overlap
    bm25_scores = bm25.get_scores(query.split())
    top_bm25_idx = bm25_scores.argsort()[-20:][::-1]
    for idx in top_bm25_idx:
        if bm25_scores[idx] > 0:
            candidates[docs[idx]["id"]] = docs[idx]

    # 3. Rerank — let the cross-encoder pick the best top_k
    pairs = [(query, c["text"]) for c in candidates.values()]
    rerank_scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates.values(), rerank_scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]
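
To make the keyword half of hybrid search concrete, here is a minimal BM25 sketch in plain Python. It uses the common Lucene-style IDF formula, which differs slightly from the one inside rank_bm25, and the three-line corpus is invented toy text, so treat the numbers as illustrative only. The point is that a chunk containing the exact token "138" outscores one that only shares the word "section".

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against the query tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Invented toy corpus, already lower-cased
corpus = [
    "section 138 negotiable instruments act dishonour of cheque",
    "section 25 placeholder text about some other topic",
    "a bounced cheque can lead to a complaint",
]
tokens = [doc.split() for doc in corpus]

scores = bm25_scores("section 138".split(), tokens)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best, scores)
```

The third chunk is about the same subject but shares no query token, so BM25 gives it a score of zero. That is the gap dense retrieval fills, and why the two are combined.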

Final step — feed top chunks to an LLM with grounded prompting:

def answer(query: str) -> str:
    chunks = hybrid_rag(query, top_k=5)
    context = "\n\n".join([f"[{c['source']}, {c['law_section']}]\n{c['text']}" for c in chunks])
    prompt = f"""You are a Tamil legal aid assistant. Answer ONLY using the context.
If the context does not contain the answer, say "என்னிடம் இந்த கேள்விக்கான தகவல் இல்லை" (I don't have information on this).
Always cite the source in [brackets].

Context:
{context}

Question: {query}
Answer in Tamil:"""
    # llm is a placeholder for whichever LLM client you use; it is not defined in this lesson
    return llm.generate(prompt)

Production

What Made Karthik's Latency Drop 14×

Embedding cache: A Redis cache of hash(query) → embedding for repeated queries (legal aid users ask similar questions). Cache hit rate: 47%.
Pre-filter by law section: Pinecone metadata filter {"law_section": "labour"} reduces the search space 8×.
Async API + connection pool: FastAPI with async endpoints, reusing a single Pinecone client across requests instead of creating a new client (and a new connection) for every request.
Reranker on GPU: The cross-encoder is small (568M params) and runs in 40ms on a T4 for 50 candidates. CPU would take 800ms.
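
The embedding cache from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dictionary stands in for Redis, fake_embed stands in for model.encode, and the class and method names are hypothetical. The key is a SHA-256 digest rather than Python's built-in hash(), because hash() on strings is salted per process and would not match across workers.

```python
import hashlib


class EmbeddingCache:
    """Cache of hash(query) -> embedding; a dict stands in for Redis here."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Collapse whitespace so trivially different spellings share one entry
        normalised = " ".join(query.split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._embed_fn(query)  # the expensive call we want to skip
        self._store[key] = embedding
        return embedding


def fake_embed(query):
    """Hypothetical stand-in for the real embedding model."""
    return [float(len(query)), float(sum(map(ord, query)) % 97)]


cache = EmbeddingCache(fake_embed)
first = cache.get("unpaid wages complaint")
second = cache.get("  unpaid   wages complaint ")  # same query, extra spaces

print(cache.hits, cache.misses)
```

With a real Redis backend you would also set an expiry on each key and clear the cache whenever the embedding model changes, since old vectors would no longer match the index.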

Public good outcome: The NGO now serves 5,000+ queries per day with one volunteer engineer maintaining the system. Karthik's GitHub repo became a template for 4 other Indian legal aid NGOs.

📝 Check Your Understanding (8 Questions)

1. What problem does an HNSW index solve in vector databases?

a) It compresses vectors to take less storage space

b) It enables fast approximate nearest-neighbour search (milliseconds) across millions or billions of vectors by avoiding exhaustive comparison with every stored vector

c) It encrypts vectors so embeddings cannot be reverse-engineered

d) It synchronises vectors across multiple geographic regions

2. Why does Karthik use a multilingual embedding model rather than an English-only one?

a) Multilingual models are always faster than English-only models

b) His documents and user queries are in Tamil; an English-only embedding model would not place semantically similar Tamil sentences close together in vector space

c) Multilingual models are required by Indian data localisation law

d) English-only models cannot store more than 1024 dimensions

3. Why does pure dense retrieval fail for queries like 'Section 138 of the Negotiable Instruments Act'?

a) Dense models cannot encode numbers

b) Exact identifiers (section numbers, code IDs, legal citations) carry critical semantic weight that dense embeddings often dilute; BM25 keyword matching catches them precisely

c) Pinecone strips numerals from queries before searching

d) The phrase is too long to embed

4. What is the role of the cross-encoder reranker in Karthik's pipeline?

a) It encrypts the retrieved chunks before sending them to the LLM

b) It re-scores the top candidates from dense + BM25 retrieval using a more accurate (but slower) model that processes query and candidate together, picking the best top_k

c) It translates Tamil text into English before reranking

d) It removes duplicate candidates that appear in both dense and BM25 results

5. Why does the e5 embedding model require 'passage:' and 'query:' prefixes?

a) The prefixes are arbitrary metadata for the vector database

b) The model was trained with these prefixes to produce different embeddings for the same text depending on whether it is being indexed (passage) or searched (query), improving asymmetric retrieval

c) The prefixes are required by Pinecone's API

d) The prefixes activate special multilingual tokens in the tokeniser

6. Why does the grounded prompt instruct the model to say 'I don't have information' when the context is insufficient?

a) It is a marketing requirement from the LLM provider

b) To prevent hallucination — without this instruction the LLM may invent legal advice from its pre-training, which is dangerous in a legal aid context where wrong answers can harm vulnerable users

c) To save tokens and reduce API costs

d) To comply with Pinecone's terms of service

7. What does the metadata filter {'law_section': 'labour'} achieve?

a) It changes the embedding model to a labour-law-specific model

b) It restricts the vector search to only chunks tagged with that law section, shrinking the search space and improving both speed and relevance

c) It assigns higher weights to labour-related vectors during scoring

d) It is purely cosmetic — Pinecone uses metadata only for display

8. What is the most important reliability lesson from Karthik's production work?

a) Always use the most expensive vector database available

b) A production RAG system requires the full stack — embeddings + retrieval (dense + BM25 hybrid) + reranking + grounded prompting + caching + monitoring; missing any layer creates a different failure mode at scale

c) Tamil legal questions should always be translated to English before indexing

d) ChromaDB cannot be used for any production application
