Rohit's school built a Hindi study chatbot. Serving 50,000 daily students through OpenAI's GPT-3.5 API cost ₹3.2 lakh/month, which the school could not afford. Rohit migrated to a self-hosted Llama-3-8B served with vLLM on a single A10 GPU rented for ₹40,000/month from an Indian cloud provider (Yotta or E2E).
The service handles 50,000 daily users with a P99 latency of 1.4 seconds. The trick: vLLM's PagedAttention plus continuous batching delivers roughly 14× the throughput of naive Hugging Face Transformers serving.
Naive LLM serving reserves one contiguous KV-cache slot per request, sized for the worst case (the maximum sequence length). Most requests use only a fraction of that slot. Result: 60–80% of the KV-cache memory is wasted, and an A10 GPU can batch only 8–16 concurrent requests.
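To see where the 8–16 figure can come from, here is a rough back-of-the-envelope calculation. The layer and head counts are the published Llama-3-8B configuration; the 6 GiB KV-cache budget and the 1,000-token average request are assumptions chosen for illustration, not measurements from Rohit's deployment.

```python
# Llama-3-8B attention shape (published model configuration).
NUM_LAYERS = 32
NUM_KV_HEADS = 8         # grouped-query attention: 8 KV heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2      # fp16 / bf16

MAX_MODEL_LEN = 4096     # worst-case length reserved per request
KV_BUDGET_BYTES = 6 * 1024**3   # ASSUMPTION: ~6 GiB left after weights
AVG_TOKENS_USED = 1000          # ASSUMPTION: typical prompt + answer

# Keys and values (factor 2) are cached for every layer and KV head.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bytes_per_slot = kv_bytes_per_token * MAX_MODEL_LEN

# Worst-case preallocation vs. paying only for tokens actually cached.
preallocated_requests = KV_BUDGET_BYTES // kv_bytes_per_slot
on_demand_requests = KV_BUDGET_BYTES // (kv_bytes_per_token * AVG_TOKENS_USED)
wasted_fraction = 1 - AVG_TOKENS_USED / MAX_MODEL_LEN

print(kv_bytes_per_token // 1024, "KiB of KV-cache per token")
print(preallocated_requests, "requests with worst-case slots")
print(on_demand_requests, "requests if memory follows actual usage")
print(f"{wasted_fraction:.0%} of each worst-case slot unused")
```

Under these assumptions a worst-case slot costs 512 MiB, so only about a dozen requests fit, while the same memory would hold roughly four times as many requests if it were handed out as needed.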
vLLM's two innovations:
PagedAttention
Treats the KV-cache like virtual memory pages: the cache is split into fixed-size blocks that are allocated on demand and need not be contiguous. Memory waste falls below 4%, which allows 5–10× more concurrent requests.
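The on-demand block idea can be sketched with a toy allocator. This is an illustration of the bookkeeping only, not vLLM's implementation; the class name `BlockAllocator` and its methods are hypothetical, and the block size of 16 tokens is an illustrative choice.

```python
class BlockAllocator:
    """Toy block table: maps each request to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Account for one more cached token; grab a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots (only in each last block)."""
        return sum(len(table) * self.block_size - self.num_tokens[rid]
                   for rid, table in self.block_tables.items())


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")   # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("req-b")   # 5 tokens occupy 1 block
print(len(alloc.free_blocks), "blocks free,", alloc.wasted_slots(), "slots unused")
```

Waste is bounded by one partly filled block per request, which is why it stays small compared with reserving a full worst-case slot.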
Continuous Batching
Doesn't wait for all requests in a batch to finish. As soon as one request completes, a new one takes its slot, so the GPU stays busy.
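A small simulation makes the difference visible. It counts decode steps for a queue of requests under both policies, assuming one token per request per step; the function names and the request lengths are made up for illustration, and real schedulers also handle prefill, preemption, and memory limits.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue on the next step."""
    queue = list(lengths)
    running = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before running the decode step.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Requests with one token left finish in this step and leave.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical output lengths (tokens) for eight queued requests.
lengths = [10, 200, 15, 20, 180, 12, 25, 30]
print(static_batching_steps(lengths, 4))      # 380
print(continuous_batching_steps(lengths, 4))  # 200
```

In the static case the short requests sit idle behind the 200-token and 180-token ones; with continuous batching the second long request starts as soon as the first short one frees its slot.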
| Server | Throughput (tok/s) | Concurrent Requests | Setup |
|---|---|---|---|
| HF Transformers | ~600 | 4–8 | 1 line |
| HF TGI | ~3,500 | 32 | Docker |
| vLLM | ~9,000 | 128+ | Docker + config |
| SGLang / TensorRT-LLM | ~12,000 | 128+ | Complex setup |
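As a sanity check on whether ~9,000 tok/s is enough for 50,000 daily users, the figures can be combined into a rough daily token budget. This sketch assumes the table's vLLM throughput holds on the deployed hardware, that load is spread evenly over the day (real traffic peaks), and that every reply uses the full `max_tokens=400` from this section's client call.

```python
THROUGHPUT_TOK_S = 9_000       # vLLM row of the table above
DAILY_USERS = 50_000           # from the case study
MAX_TOKENS_PER_REPLY = 400     # max_tokens in the client example

SECONDS_PER_DAY = 24 * 60 * 60

tokens_per_day = THROUGHPUT_TOK_S * SECONDS_PER_DAY
tokens_per_user = tokens_per_day / DAILY_USERS
replies_per_user = tokens_per_user / MAX_TOKENS_PER_REPLY

print(f"{tokens_per_day:,} generated tokens per day")
print(f"{tokens_per_user:,.0f} tokens per user per day")
print(f"{replies_per_user:.1f} full-length replies per user per day")
```

Even with generous headroom for peak hours, this leaves room for dozens of full-length answers per student per day on one server.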
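# 1. Start the vLLM server with Llama-3-8B
# Llama 3 is a gated model: pass a Hugging Face token that has been granted access.
docker run --gpus all -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92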
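The container needs a couple of minutes to download and load the model before it answers requests. One way to wait for it is to poll the same `/health` path that the Kubernetes readiness probe in this section uses. The helper names below are hypothetical and the timeouts are arbitrary defaults.

```python
import time
import urllib.request


def server_is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(probe, max_wait_s: float = 300.0, interval_s: float = 5.0,
                     sleep=time.sleep) -> bool:
    """Call probe() every interval_s seconds until it succeeds or time runs out."""
    waited = 0.0
    while waited <= max_wait_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Example (needs the server from the docker command above):
# ready = wait_until_ready(lambda: server_is_healthy("http://localhost:8000"))
```

Passing the probe in as a callable keeps the waiting logic testable without a running server.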
vLLM exposes an OpenAI-compatible API. Existing OpenAI client code usually works after pointing `base_url` at the vLLM server and setting `model` to the served model name:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't check keys by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a Hindi study tutor."},
        {"role": "user", "content": "प्रकाश संश्लेषण समझाइए"},
    ],
    temperature=0.7,
    max_tokens=400,
)
print(response.choices[0].message.content)
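Continuous batching only pays off when several requests are in flight at once, so a load test or a backend handling many students should send requests concurrently rather than one after another. A minimal sketch with a thread pool follows; `ask` and `ask_many` are hypothetical helpers, and `client` is assumed to be an `OpenAI` client configured as above.

```python
from concurrent.futures import ThreadPoolExecutor

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
SYSTEM_PROMPT = "You are a Hindi study tutor."


def ask(client, question: str) -> str:
    """Send one chat request and return the reply text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=400,
    )
    return response.choices[0].message.content


def ask_many(client, questions, max_workers: int = 32):
    """Issue the requests in parallel; replies come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda question: ask(client, question), questions))


# Example (needs a running vLLM server and the openai package):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
# answers = ask_many(client, ["प्रकाश संश्लेषण समझाइए", "न्यूटन के नियम बताइए"])
```

The server batches whatever arrives together, so the client needs no batching logic of its own beyond issuing requests in parallel.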
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  selector:
    matchLabels: {app: vllm}
  template:
    metadata:
      labels: {app: vllm}
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A10
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Meta-Llama-3-8B-Instruct
            - --max-model-len=4096
            - --gpu-memory-utilization=0.92
          ports: [{containerPort: 8000}]
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
          readinessProbe:
            httpGet: {path: /health, port: 8000}
            initialDelaySeconds: 120  # vLLM takes ~2 min to load 8B model
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Pods metrics require a custom-metrics adapter (e.g. Prometheus Adapter)
    # that exposes the running-request count under this name.
    - type: Pods
      pods:
        metric: {name: vllm_requests_running}
        target: {type: AverageValue, averageValue: "60"}
kubectl apply -f vllm-deployment.yaml
kubectl get pods -l app=vllm
# vllm-llama3-7c6f4d89b8-x2k4p Running
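To get a feel for how the autoscaler in the manifest reacts, the documented HorizontalPodAutoscaler rule (desired = ceil(current × metric / target), skipped when the ratio is within a 10% tolerance) can be written out directly. This sketch uses the manifest's target of 60 running requests per pod and its 1 to 4 replica bounds; it ignores stabilization windows and pod readiness, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, avg_metric: float,
                     target: float = 60.0, min_replicas: int = 1,
                     max_replicas: int = 4, tolerance: float = 0.1) -> int:
    """Approximate the HPA scaling rule for an AverageValue pods metric."""
    ratio = avg_metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))    # 3: scale up under load
print(desired_replicas(2, 20))    # 1: scale down when quiet
print(desired_replicas(3, 150))   # 4: capped at maxReplicas
print(desired_replicas(2, 63))    # 2: within tolerance
```

Because each new replica needs about two minutes to load the model, scale-up is not instant; the target of 60 leaves headroom below the 128+ concurrent requests a single vLLM instance is listed as handling.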
| Approach | Cost / Month (50K daily users) | Notes |
|---|---|---|
| OpenAI GPT-3.5 | ~₹3.2 lakh | Pay-per-token, scales linearly |
| OpenAI GPT-4o | ~₹12 lakh | Higher quality, much higher cost |
| vLLM Llama-3-8B (1× A10) | ~₹40,000 | Self-hosted, fixed cost |
| vLLM Llama-3-70B (4× A100) | ~₹3.5 lakh | Approaches GPT-4-class quality at roughly the GPT-3.5 price |
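The table's figures translate into a per-student cost with simple arithmetic. The sketch below uses only the monthly costs from the table and the 50,000 daily users from the case study; it treats the self-hosted costs as GPU rental only, which is an assumption (engineering time, storage, and bandwidth are not included).

```python
DAILY_USERS = 50_000

# Monthly cost in rupees, taken from the table (1 lakh = 100,000).
monthly_cost_inr = {
    "OpenAI GPT-3.5": 320_000,
    "OpenAI GPT-4o": 1_200_000,
    "vLLM Llama-3-8B (1x A10)": 40_000,
    "vLLM Llama-3-70B (4x A100)": 350_000,
}

cost_per_user = {name: cost / DAILY_USERS for name, cost in monthly_cost_inr.items()}
savings_factor = (monthly_cost_inr["OpenAI GPT-3.5"]
                  / monthly_cost_inr["vLLM Llama-3-8B (1x A10)"])

for name, cost in cost_per_user.items():
    print(f"{name}: Rs {cost:.2f} per user per month")
print(f"GPT-3.5 API costs {savings_factor:.0f}x the single-A10 setup")
```

The self-hosted 8B setup comes to under one rupee per student per month, an 8× reduction from the GPT-3.5 bill.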