Lesson 08 — AI at Scale: Kubernetes + vLLM | Class 12

Story

Rohit's ₹40,000/month LLM Service

👨‍💻 Rohit · Mumbai · Age 17

Rohit's school built a Hindi study chatbot. Serving 50,000 daily students through OpenAI GPT-3.5 cost ₹3.2 lakh/month, which the school could not afford. Rohit migrated to a self-hosted Llama-3-8B served with vLLM on a single A10 GPU, rented for ₹40,000/month from an Indian cloud provider (Yotta or E2E).

The service handles 50,000 daily users with P99 latency of 1.4 seconds. The trick: vLLM's PagedAttention + continuous batching gets 14× more throughput than naive HuggingFace serving.

Why vLLM

The Throughput Problem

Naive LLM serving reserves one contiguous KV-cache slot per request, sized for the worst case (the maximum sequence length). Most requests use only a fraction of that slot. Result: 60–80% of the KV-cache memory sits unused, and you can only batch 8–16 concurrent requests on an A10 GPU.
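A rough back-of-the-envelope sketch shows where a figure like 8–16 can come from. The model shape (32 layers, 8 KV heads, head size 128, fp16 values) and the ~16 GB weight footprint are assumptions based on Llama-3-8B's public configuration, not numbers from this lesson, and the estimate ignores activation memory.

```python
# Assumed Llama-3-8B shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, fp16 values (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2


def kv_bytes_per_token() -> int:
    # Both keys and values are cached, hence the leading factor of 2.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def worst_case_slots(gpu_gb=24.0, utilization=0.92, weights_gb=16.0, max_len=4096) -> int:
    # Memory left for the KV cache after the weights are loaded (assumed sizes).
    cache_budget = (gpu_gb * utilization - weights_gb) * 1e9
    # Naive serving reserves max_len tokens of cache for every request.
    per_request = kv_bytes_per_token() * max_len
    return int(cache_budget // per_request)


print(kv_bytes_per_token())  # 131072 bytes = 128 KiB per token
print(worst_case_slots())    # 11 requests under these assumptions
```

Under these assumptions each request reserves 512 MiB of cache whether or not it uses it, which lands inside the 8–16 range quoted above.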

vLLM's two innovations:

PagedAttention

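Treats the KV cache like virtual memory: the cache is split into fixed-size blocks that are allocated on demand as a sequence grows, so a request's blocks need not be contiguous. Memory waste drops below 4%, which supports 5–10× more concurrent requests.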
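The idea can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 tokens are all chosen here for illustration.

```python
class PagedKVCache:
    """Toy allocator: hands out fixed-size KV blocks only when they are needed."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # request id -> list of block ids
        self.lengths = {}                           # request id -> tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no room left in the last block
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())    # any free block will do
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        # Return all of a finished request's blocks to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self):
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.lengths.values())


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-a")
print(len(cache.block_tables["req-a"]), cache.wasted_slots())  # 2 12
```

A 20-token request holds two blocks and wastes at most one partly filled block, rather than a slot sized for the maximum sequence length.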

Continuous Batching

The scheduler doesn't wait for all requests in a batch to finish. As soon as one request completes, a new one takes its slot, so the GPU stays busy instead of idling on the slowest request.
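A small simulation makes the difference concrete. This is a simplified sketch that counts decode steps only (one token per running request per step); the function names are illustrative and it does not model vLLM's real scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request has finished.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    queue = deque(lengths)   # output lengths of waiting requests
    running = []             # tokens still to generate for each running request
    steps = 0
    while queue or running:
        # Fill every free slot before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step: every running request emits one token;
        # requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [10, 2, 2, 2]  # one long answer and three short ones
print(static_batching_steps(lengths, batch_size=2))      # 12
print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

With static batching the short requests in the second batch wait behind the 10-token answer; with continuous batching they run in the slot freed next to it.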

Server	Throughput (tok/s)	Concurrent Requests	Setup
HF Transformers	~600	4–8	1 line
HF TGI	~3,500	32	Docker
vLLM	~9,000	128+	Docker + config
SGLang / TensorRT-LLM	~12,000	128+	Complex setup

Code

Run vLLM in 3 Commands

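# 1. Start vLLM server with Llama-3-8B
# Llama 3 is a gated model: accept its licence on Hugging Face and pass your token.
docker run --gpus all -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your-token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92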

vLLM exposes an OpenAI-compatible API. Existing OpenAI client code works with a one-line change: point base_url at the vLLM server.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't check keys by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a Hindi study tutor."},
        {"role": "user", "content": "प्रकाश संश्लेषण समझाइए"},
    ],
    temperature=0.7,
    max_tokens=400,
)
print(response.choices[0].message.content)
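One way to keep that switch reversible is to read the endpoint from the environment. This is a minimal sketch; the variable names `LLM_BASE_URL` and `LLM_API_KEY` and the helper `client_settings` are hypothetical, chosen for this example.

```python
import os


def client_settings(env=os.environ):
    """Build keyword arguments for openai.OpenAI(...) from environment variables."""
    base_url = env.get("LLM_BASE_URL")  # hypothetical variable name
    if base_url:
        # Self-hosted vLLM: any key is accepted unless the server is started with one.
        return {"base_url": base_url, "api_key": env.get("LLM_API_KEY", "dummy")}
    # Hosted OpenAI: the client falls back to its default base URL.
    return {"api_key": env["OPENAI_API_KEY"]}


print(client_settings({"LLM_BASE_URL": "http://localhost:8000/v1"}))
```

The chatbot would then create its client with `OpenAI(**client_settings())`, and moving back to the hosted API means unsetting one variable.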

Code

Kubernetes Deployment with Autoscaling

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  selector:
    matchLabels: {app: vllm}
  template:
    metadata:
      labels: {app: vllm}
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A10
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Meta-Llama-3-8B-Instruct
        - --max-model-len=4096
        - --gpu-memory-utilization=0.92
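        ports: [{containerPort: 8000}]
        env:
        - name: HUGGING_FACE_HUB_TOKEN  # Llama 3 is gated on Hugging Face
          valueFrom:
            secretKeyRef: {name: hf-token, key: token}  # hypothetical Secret name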
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 120  # vLLM takes ~2 min to load 8B model
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3
  minReplicas: 1
  maxReplicas: 4
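  metrics:
  # Pods metrics require a custom-metrics adapter (e.g. Prometheus Adapter).
  # vLLM itself exports vllm:num_requests_running; the name below assumes
  # the adapter exposes it under this name.
  - type: Pods
    pods:
      metric: {name: vllm_requests_running}
      target: {type: AverageValue, averageValue: "60"}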

kubectl apply -f vllm-deployment.yaml
kubectl get pods -l app=vllm
# vllm-llama3-7c6f4d89b8-x2k4p  Running
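To get a feel for how the autoscaler reacts to the target of 60 running requests per pod, the scaling rule documented for the HorizontalPodAutoscaler (desired = ceil(current × metric / target), with a default 10% tolerance) can be sketched in a few lines. The function below is an illustration of that rule with this manifest's limits filled in as defaults; it ignores stabilisation windows and pods that are not yet ready.

```python
import math


def desired_replicas(current, avg_metric, target=60.0, min_r=1, max_r=4, tolerance=0.1):
    ratio = avg_metric / target
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    # Scale proportionally, then clamp to minReplicas / maxReplicas.
    return max(min_r, min(max_r, math.ceil(current * ratio)))


print(desired_replicas(2, 90))   # 3: each pod is at 1.5x the target
print(desired_replicas(2, 62))   # 2: inside the 10% tolerance
print(desired_replicas(2, 200))  # 4: capped at maxReplicas
```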

Indian cloud tip: Yotta, E2E, and Tata Cloud all offer A10/L40 GPUs at 30–50% lower price than AWS for Indian-hosted workloads. Data localisation under DPDPA 2023 also favours Indian infrastructure for student data.

Cost

The Economics — Why Self-Hosting Wins at Scale

Approach	Cost / Month	Notes (at 50K daily users)
OpenAI GPT-3.5	~₹3.2 lakh	Pay-per-token, scales linearly
OpenAI GPT-4o	~₹12 lakh	Higher quality, much higher cost
vLLM Llama-3-8B (1× A10)	~₹40,000	Self-hosted, fixed cost
vLLM Llama-3-70B (4× A100)	~₹3.5 lakh	GPT-4-class quality at GPT-3.5 price

Crossover point: Self-hosting wins when you have ~5,000+ daily active users. Below that, OpenAI / Anthropic / Google APIs are cheaper because the GPU sits idle most of the time. Don't self-host prematurely.
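The crossover can be sanity-checked from the table's own numbers. This is a rough sketch that assumes API cost scales linearly with users and ignores engineering time, so treat the result as an order-of-magnitude check rather than a precise threshold.

```python
API_COST_PER_MONTH = 320_000  # ₹3.2 lakh for GPT-3.5 at 50K daily users (from the table)
API_DAILY_USERS = 50_000
GPU_COST_PER_MONTH = 40_000   # one A10 running vLLM (from the table)


def break_even_daily_users(api_cost=API_COST_PER_MONTH,
                           api_users=API_DAILY_USERS,
                           gpu_cost=GPU_COST_PER_MONTH):
    # API cost per user is api_cost / api_users; the fixed GPU bill is
    # cheaper once users * (api_cost / api_users) exceeds gpu_cost.
    return gpu_cost * api_users / api_cost


print(break_even_daily_users())  # 6250.0
```

That is about ₹6.40 per user per month on the API, which puts the break-even at roughly 6,250 daily users, in line with the ~5,000+ rule of thumb above.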

Operational reality: Self-hosting requires DevOps capability — GPU monitoring, autoscaling, model upgrades, security patches. Budget 0.25 FTE engineering time. Below that, the API price is worth it.

📝 Check Your Understanding (8 Questions)

1. What problem does PagedAttention solve in LLM serving?

a) It speeds up the matrix multiplications inside attention

b) Naive serving allocates a worst-case-sized contiguous KV cache per request, wasting 60-80% of the KV-cache memory; PagedAttention treats KV cache like virtual-memory pages allocated on demand, reducing waste to under 4% and enabling 5-10× more concurrent requests

c) It removes the attention mechanism entirely

d) It encrypts the KV cache for privacy

2. What is continuous batching in vLLM?

a) A scheduling policy that batches all requests every 10 milliseconds

b) The scheduler doesn't wait for every request in a batch to finish — as soon as one completes, a new request slots into its place; the GPU stays continuously busy instead of idling on the slowest request

c) A way to batch training and inference workloads on the same GPU

d) A continuous integration test that runs each batch through CI

3. Why does Rohit use the OpenAI-compatible API exposed by vLLM?

a) It is the only API that vLLM supports

b) OpenAI's chat completions API is the de-facto standard; vLLM exposing the same interface means the existing chatbot code works with a single base_url change — zero rewrite, easy migration back to OpenAI if needed

c) OpenAI requires it for licensing reasons

d) The OpenAI API is the only one that supports Hindi

4. Why does Rohit's HorizontalPodAutoscaler scale on vllm_requests_running rather than CPU usage?

a) CPU metrics are not exposed by Kubernetes for GPU pods

b) LLM workloads are GPU-bound, so CPU usage stays low even at full GPU saturation; vLLM's count of running requests tracks the real load on each pod and is the right scaling signal

c) The pods do not have CPU limits set

d) Memory is the only meaningful metric for LLMs

5. Why does the readiness probe have initialDelaySeconds: 120?

a) It is a Kubernetes-imposed minimum

b) vLLM takes ~2 minutes to download (if not cached) and load the 8B-parameter model into GPU memory and warm up CUDA kernels; the /health endpoint cannot answer before then, so probing earlier only produces failed checks, and the pod should receive traffic only once the server responds

c) The Llama license requires a 2-minute waiting period

d) Kubernetes garbage collection runs every 120 seconds

6. At what scale does self-hosting an LLM with vLLM become cheaper than OpenAI's API?

a) Self-hosting is always cheaper than any API

b) Approximately 5,000+ daily active users — below that, the GPU sits idle most of the time and pay-per-token APIs win on cost; above that, fixed self-hosted infrastructure becomes much cheaper per query

c) Self-hosting is never cheaper because of DevOps overhead

d) At 100 users per day

7. Why is gpu-memory-utilization=0.92 a good default for vLLM?

a) 0.92 is the maximum value vLLM accepts

b) vLLM pre-allocates 92% of GPU memory for KV cache + model weights; leaving 8% headroom prevents OOM crashes from temporary spikes (long prompts, beam search) while maximising concurrent request capacity

c) It matches the recommended ratio in NVIDIA's CUDA documentation

d) 92% is the average utilisation seen in production

8. What hidden cost does Rohit's lesson warn about with self-hosting?

a) Electricity bills become prohibitively expensive

b) DevOps engineering time — GPU monitoring, autoscaling, model upgrades, security patches require ~0.25 FTE; teams without that capability should pay for the API even if the per-request cost is higher, because operational outages cost more than savings

c) GPU drivers must be re-licensed monthly

d) Indian cloud providers do not offer A10 GPUs
