Karthik volunteered with a Chennai NGO that helps daily-wage workers with legal questions in Tamil. They had 12,000 pages of Tamil legal Q&A documents. A simple ChromaDB prototype worked for 100 documents but timed out at 5,000 queries/day.
He rebuilt with Pinecone (managed vector DB), hybrid search (BM25 + dense embeddings), and a reranker. Result: P95 latency dropped from 4 seconds to 280ms, and answer accuracy improved from 62% to 89%.
A vector database stores high-dimensional vectors (embeddings) and finds approximate nearest neighbours in milliseconds, even across billions of vectors. The key is the HNSW index (Hierarchical Navigable Small World), a layered graph that the search walks from coarse to fine layers, so the query is compared with only a small fraction of the stored vectors instead of every one.
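To make the cost that an index avoids concrete, here is a minimal brute-force baseline in plain Python. It is a sketch under stated assumptions, not how any of the databases below are implemented: the three-dimensional vectors and their ids are made up for illustration, and real embeddings from the model used later have 1024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, top_k=2):
    """Score the query against every stored vector: O(N) comparisons per query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
toy_vectors = {
    "wages": [0.9, 0.1, 0.0],
    "eviction": [0.1, 0.9, 0.0],
    "cheque-bounce": [0.0, 0.2, 0.9],
}
nearest = brute_force_search([0.8, 0.2, 0.0], toy_vectors, top_k=2)
print(nearest)
```

Every query touches every vector here, which is fine for a few thousand chunks and hopeless for billions. A graph index such as HNSW trades a small amount of recall for skipping most of those comparisons.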
| Vector DB | Type | Best For | Approx. Cost (USD) |
|---|---|---|---|
| ChromaDB | Embedded / Self-hosted | Prototypes, <100K vectors | Free (your server) |
| Pinecone | Managed cloud | Production, low ops | ~$70/mo starter |
| Weaviate | Self/managed, hybrid native | Hybrid search apps | Free OSS or $25/mo cloud |
| Qdrant | Self/managed, fast filters | Filtered search at scale | Free OSS or $25/mo |
| pgvector | Postgres extension | Apps already on Postgres | Free with your DB |
Step 1 — Embed the Tamil legal documents using a multilingual model:
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec

# Multilingual model: supports Tamil, Hindi and English natively
model = SentenceTransformer("intfloat/multilingual-e5-large")

pc = Pinecone(api_key="YOUR_KEY")
pc.create_index(
    name="tamil-legal",
    dimension=1024,  # e5-large output dim
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("tamil-legal")
Step 2: Chunk and upsert (recursive character chunking with overlap):
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)

docs = []  # list of {"id", "text", "source", "law_section"}
# load_legal_docs is the project's own loader (not shown here); each item
# is expected to expose .id, .text, .source and .section
for raw_doc in load_legal_docs("./legal_pdfs/"):
    chunks = splitter.split_text(raw_doc.text)
    for i, chunk in enumerate(chunks):
        docs.append({
            "id": f"{raw_doc.id}-{i}", "text": chunk,
            "source": raw_doc.source, "law_section": raw_doc.section,
        })

# Embed in batches of 64 to avoid OOM
batch_size = 64
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    texts = [f"passage: {d['text']}" for d in batch]  # e5 expects the "passage: " prefix on documents
    embeddings = model.encode(texts, normalize_embeddings=True)
    index.upsert(vectors=[
        {"id": d["id"], "values": emb.tolist(),
         "metadata": {"text": d["text"], "source": d["source"], "law_section": d["law_section"]}}
        for d, emb in zip(batch, embeddings)
    ])
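The effect of `chunk_size` and `chunk_overlap` is easier to see on a toy string. The sketch below is a simplified fixed-window splitter, not a reimplementation of `RecursiveCharacterTextSplitter` (which first tries to split on separators such as paragraph and line breaks); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. With the settings in Step 2, the same idea applies with 500-character chunks and an 80-character overlap.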
Pure dense search misses exact-match queries (e.g. "Section 138 of the Negotiable Instruments Act"). Hybrid search combines BM25 (keyword) and dense embeddings:
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# BM25 index built once on all chunk texts
bm25 = BM25Okapi([d["text"].split() for d in docs])

# Reranker: small but high-precision cross-encoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def hybrid_rag(query: str, top_k: int = 5):
    # 1. Dense recall: get top 30 candidates
    q_emb = model.encode(f"query: {query}", normalize_embeddings=True).tolist()
    dense_hits = index.query(vector=q_emb, top_k=30, include_metadata=True)
    candidates = {h["id"]: h["metadata"] for h in dense_hits["matches"]}

    # 2. BM25 recall: add top 20 keyword matches
    bm25_scores = bm25.get_scores(query.split())
    top_bm25_idx = bm25_scores.argsort()[-20:][::-1]
    for idx in top_bm25_idx:
        if bm25_scores[idx] > 0:  # skip chunks that share no term with the query
            candidates[docs[idx]["id"]] = docs[idx]

    # 3. Rerank: let the cross-encoder pick the best top_k
    pairs = [(query, c["text"]) for c in candidates.values()]
    rerank_scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates.values(), rerank_scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k]]
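To see why the keyword side rescues exact-match queries, here is a small BM25 scorer in plain Python. It is a sketch, not the `rank_bm25` implementation: it uses the common non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`, the defaults `k1=1.5` and `b=0.75` are assumptions, and the three English sentences are invented stand-ins for the Tamil chunks.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter()  # number of documents containing each term
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len  # penalise long documents
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Invented example corpus (illustration only)
corpus = [
    "Section 138 of the Negotiable Instruments Act covers cheque dishonour",
    "Minimum wages must be paid to daily wage workers",
    "A tenant may challenge an unlawful eviction notice",
]
tokenized = [doc.lower().split() for doc in corpus]
scores = bm25_scores("section 138".split(), tokenized)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best, scores)
```

Only the first document contains the literal tokens `section` and `138`, so it is the only one with a non-zero score. A dense model may rank a semantically similar but wrong section higher, which is the gap the BM25 candidates fill before reranking.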
Final step — feed top chunks to an LLM with grounded prompting:
def answer(query: str) -> str:
    chunks = hybrid_rag(query, top_k=5)
    context = "\n\n".join([f"[{c['source']}, {c['law_section']}]\n{c['text']}" for c in chunks])
    prompt = f"""You are a Tamil legal aid assistant. Answer ONLY using the context.
If the context does not contain the answer, say "என்னிடம் இந்த கேள்விக்கான தகவல் இல்லை" (I don't have information on this).
Always cite the source in [brackets].
Context:
{context}
Question: {query}
Answer in Tamil:"""
    # llm is the application's LLM client (not shown in this walkthrough)
    return llm.generate(prompt)
- Embedding cache: A Redis cache of `hash(query) → embedding` for repeated queries (legal aid users ask similar questions). Cache hit rate: 47%.
- Pre-filter by law section: Pinecone metadata filter `{"law_section": "labour"}` reduces the search space 8×.
- Async API + connection pool: FastAPI with `asyncio` + a single Pinecone client across requests.
- Reranker on GPU: The cross-encoder is small (568M params) and runs in 40ms on a T4 for 50 candidates. CPU would take 800ms.
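
The embedding cache from the first bullet can be sketched as follows. This is a minimal illustration under assumptions, not the production code: a plain dict stands in for Redis, `fake_embed` is a hypothetical placeholder for `model.encode`, and the key is a SHA-256 digest because Python's built-in `hash()` for strings is randomized per process and so cannot serve as a shared cache key.

```python
import hashlib

class EmbeddingCache:
    """In-memory stand-in for a Redis cache: sha256(query) -> embedding."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}  # a real deployment would use Redis GET/SET here
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Normalise case and whitespace so trivially different queries share a key
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = self._embed_fn(query)  # only computed on a cache miss
        self._store[key] = embedding
        return embedding

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def fake_embed(text):
    # Hypothetical placeholder for model.encode(...); returns a tiny fake vector
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
cache.get("minimum wage complaint")
cache.get("Minimum  wage complaint")  # same query after normalisation: a hit
cache.get("eviction notice")
print(cache.hits, cache.misses, cache.hit_rate())
```

Exact-match keys only catch repeated wording; paraphrased questions still miss and go through the encoder.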