LLM Fine-tuning with LoRA & QLoRA 🔧

Class 12 · Age 16-17 · Lesson 01 of 12 · 🆓 Free

Story
Aishwarya's Marathi Tutor Bot
👩‍🎓 Aishwarya · Pune · Age 17

Aishwarya wanted to build a Marathi-English study tutor for her school's juniors. Llama-3-8B roughly understands Marathi but answers awkwardly. Full fine-tuning needs more than 64GB of GPU memory (in practice an 80GB A100), far beyond the roughly 15GB on Colab's free T4.

She discovered LoRA: train only 0.5% of the parameters, freeze the rest. With QLoRA (4-bit quantisation), she fine-tuned Llama-3-8B on a free Colab GPU in 2 hours using 1,000 Marathi Q&A pairs. The result: a tutor that explains concepts in fluent Marathi-English code-switching.

Why LoRA
The Problem with Full Fine-tuning

Llama-3-8B has 8 billion parameters. Fine-tuning all of them means holding three things in GPU memory: the model weights (16GB in fp16), the gradients (16GB), and the Adam optimiser state (32GB, two values per parameter). That adds up to 64GB before counting activations. A free Colab T4 has about 15GB. An A100 80GB costs ₹150/hour.
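
The memory figures above can be reproduced with simple arithmetic. This is a minimal sketch that assumes 2 bytes per value (fp16) for weights, gradients, and both Adam moment estimates, as the lesson's numbers imply; real mixed-precision setups often keep optimiser state in fp32, which needs more memory, and activations are not counted here.

```python
# Rough GPU memory budget for full fine-tuning of an 8B-parameter model.
PARAMS = 8e9          # parameter count, rounded as in the lesson
BYTES_FP16 = 2        # assumption: every stored value is fp16

def to_gb(n_bytes):
    return n_bytes / 1e9

weights_gb = to_gb(PARAMS * BYTES_FP16)         # model weights
gradients_gb = to_gb(PARAMS * BYTES_FP16)       # one gradient per weight
adam_gb = to_gb(PARAMS * BYTES_FP16 * 2)        # Adam keeps two moments per weight
total_gb = weights_gb + gradients_gb + adam_gb

T4_GB = 15  # free Colab T4, as quoted in the lesson
print(f"weights={weights_gb:.0f}GB gradients={gradients_gb:.0f}GB adam={adam_gb:.0f}GB")
print(f"total={total_gb:.0f}GB, which is {total_gb / T4_GB:.1f}x a free T4")
```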

Insight from research (Hu et al., 2021): the weight updates learned during fine-tuning tend to have very low intrinsic rank. So the update ΔW to a d×k weight matrix can be approximated as a product of two small matrices: ΔW ≈ B·A, where A is r×k and B is d×r, with r=8 or 16. Only r·(d+k) values are trained instead of d·k.
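
The low-rank idea can be tried on a toy layer. The sketch below uses NumPy with made-up small dimensions (real Llama layers are far larger) and the alpha/r scaling from the LoRA paper; the variable names and the random "trained" values are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4      # toy sizes, chosen only for illustration
alpha = 8                # scaling numerator; the update is scaled by alpha / r

W = rng.normal(size=(d, k))                # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))    # trainable, small random init
B = np.zeros((d, r))                       # trainable, zero init so the update starts at 0

def forward(x, B_matrix):
    # Base output plus the low-rank correction; W itself is never modified.
    return W @ x + (alpha / r) * (B_matrix @ (A @ x))

x = rng.normal(size=k)
# Before any training, B is zero, so the adapted layer matches the base layer.
assert np.allclose(forward(x, B), W @ x)

# Stand-in for values learned during training (random here, not real training).
B_trained = rng.normal(size=(d, r))
delta_W = (alpha / r) * (B_trained @ A)    # full d x k update, but rank at most r
rank = np.linalg.matrix_rank(delta_W)

full_update_params = d * k                 # values in a dense update
lora_params = r * (d + k)                  # values in A and B together
print(delta_W.shape, rank, full_update_params, lora_params)
```

Even at these toy sizes the adapter holds 448 values against 3,072 for a dense update, and the gap widens as d and k grow.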

Full FT: 8B trainable params · 64GB VRAM · ₹150/hr A100

LoRA r=16: ~40M trainable (0.5%) · 18GB VRAM · 1× A10

QLoRA r=16: ~40M trainable + 4-bit base · 12GB VRAM · free T4
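
The trainable-parameter count depends on which layers receive adapters. Here is a rough check, assuming the published Llama-3-8B shapes (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of 128 dimensions); the helper name `lora_params` is hypothetical. Under these assumptions the ~40M figure corresponds to adapting all seven linear layers in each block, while the four attention projections alone give about 13.6M.

```python
# Count LoRA parameters for Llama-3-8B at rank 16 (shapes are assumptions
# taken from the model's published configuration).
HIDDEN, MLP, LAYERS = 4096, 14336, 32
KV_DIM = 8 * 128          # grouped-query attention: 8 key/value heads x 128 dims
R = 16

# (in_features, out_features) for each linear layer in one decoder block
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(modules, r=R):
    # Each adapted layer adds A (r x in) and B (out x r): r * (in + out) values.
    per_block = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_block * LAYERS

attention_only = lora_params(["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = lora_params(list(SHAPES))

for name, n in [("attention only", attention_only), ("all linear", all_linear)]:
    # fp32 storage: 4 bytes per value
    print(f"{name}: {n:,} params, about {n * 4 / 1e6:.0f} MB in fp32")
```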

Code
Fine-tune Llama-3 with QLoRA in Colab

Install dependencies (run once in Colab):

!pip install -q transformers peft bitsandbytes accelerate datasets trl

Load Llama-3 in 4-bit and add LoRA adapters:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

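bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantisation
    bnb_4bit_compute_dtype="float16",   # the free T4 has no native bfloat16; use "bfloat16" on Ampere or newer
    bnb_4bit_use_double_quant=True,     # also quantise the quantisation constants
)

# Gated model: accept Meta's licence on Hugging Face and log in before downloading.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # readies the quantised model for training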

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
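model.print_trainable_parameters()
# With r=16 on only these four attention projections, expect about 13.6M
# trainable parameters (~0.17% of the model). The ~40M (0.5%) figure quoted in
# this lesson corresponds to also adapting gate_proj, up_proj and down_proj.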

Train with TRL's SFTTrainer on Aishwarya's Marathi Q&A dataset:

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("json", data_files="marathi_qa.jsonl", split="train")

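# Argument names below follow older TRL releases; newer releases move
# dataset_text_field and max_seq_length into SFTConfig and rename tokenizer.
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text", max_seq_length=512,
    args=TrainingArguments(
        output_dir="./marathi-tutor", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,  # effective batch size 16
        learning_rate=2e-4, logging_steps=10, save_strategy="epoch",
        fp16=True,  # the free T4 lacks bfloat16 support; use bf16=True on Ampere or newer
        optim="paged_adamw_8bit",
    ),
)
trainer.train()
model.save_pretrained("./marathi-tutor-lora")  # adapter weights only, not the 16 GB base model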
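
The trainer reads one `text` field from each line of the JSONL file. The sketch below shows one way to build such a file from question-answer pairs. The template (mirroring the विद्यार्थी/शिक्षक, "Student/Teacher", prompt this lesson uses at inference), the helper names, the sample file name, and the placeholder answer are all illustrative assumptions rather than Aishwarya's actual data.

```python
import json

# Hypothetical template: "Student: <question>\nTeacher: <answer>" in Marathi.
TEMPLATE = "विद्यार्थी: {question}\nशिक्षक: {answer}"

def to_record(question, answer):
    # One training example with the single "text" field the trainer expects.
    return {"text": TEMPLATE.format(question=question.strip(), answer=answer.strip())}

def write_jsonl(pairs, path):
    # ensure_ascii=False keeps Devanagari readable instead of \u escapes.
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            f.write(json.dumps(to_record(question, answer), ensure_ascii=False) + "\n")

# Placeholder pair; a real dataset would hold about 1,000 reviewed answers.
pairs = [("प्रकाश संश्लेषण म्हणजे काय?", "<teacher's answer goes here>")]
write_jsonl(pairs, "marathi_qa_sample.jsonl")
```

Using the same template for training and inference matters: the model learns to continue text after "शिक्षक:", so the inference prompt should end at that marker.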

Inference
Loading and Using Your Adapter

Reload the 4-bit base model, attach the saved adapter, and generate an answer:

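from peft import PeftModel

# Reuses model_id, bnb_config and tokenizer from the training cells above.
base = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./marathi-tutor-lora")
model.eval()

# Prompt in Marathi: "Student: What is photosynthesis?\nTeacher:"
prompt = "विद्यार्थी: प्रकाश संश्लेषण म्हणजे काय?\nशिक्षक:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))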

Tip: The LoRA adapter is tiny next to the 16GB base model (roughly 150MB or less, depending on which layers are adapted). You can keep multiple adapters (Marathi tutor, Hindi tutor, English summariser) and swap them on the same base model. This is the foundation of multi-tenant LLM serving.
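
Adapter swapping can be pictured with a toy NumPy model: one shared base weight and a dictionary of small adapters selected per request. This is a sketch of the idea only, with made-up sizes, random values in place of trained ones, and hypothetical adapter names; it is not the PEFT API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4
W_base = rng.normal(size=(d, k))     # shared frozen weight, loaded once

def make_adapter():
    # Random stand-ins for trained A (r x k) and B (d x r) matrices.
    return {"A": rng.normal(size=(r, k)), "B": rng.normal(size=(d, r))}

# Hypothetical adapter names, one per use case.
adapters = {"marathi_tutor": make_adapter(), "hindi_tutor": make_adapter()}

def forward(x, adapter_name=None):
    y = W_base @ x                   # base computation, identical for everyone
    if adapter_name is not None:
        ad = adapters[adapter_name]
        y = y + ad["B"] @ (ad["A"] @ x)   # add only the selected low-rank update
    return y

x = rng.normal(size=k)
adapter_size = r * (d + k)           # values stored per adapter
print(adapter_size, W_base.size)     # each adapter is much smaller than the base
```

In PEFT itself, the corresponding operations are `load_adapter` and `set_adapter` on a `PeftModel`, which attach additional named adapters to one loaded base model and switch between them.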
What's Next
Production Considerations

📝 Check Your Understanding (8 Questions)

1. Why does Aishwarya choose LoRA over full fine-tuning for her Marathi tutor?
a) LoRA always produces higher quality models than full fine-tuning
b) Full fine-tuning of Llama-3-8B requires ~64GB of GPU memory, which exceeds Colab's free T4 (15GB); LoRA trains only about 0.5% of the parameters and, combined with a 4-bit base model (QLoRA), fits in about 12GB
c) LoRA is the only method that supports Indian languages
d) Full fine-tuning is illegal under Meta's Llama license
2. What is the core mathematical idea behind LoRA?
a) Replace all matrix multiplications with element-wise operations
b) Approximate weight updates ΔW as a low-rank product B·A where A is r×k and B is d×r, with r much smaller than d and k
c) Use Fourier transforms to compress the weight matrices
d) Train the model only on the most important 1% of training data
3. What does QLoRA add on top of LoRA?
a) Quantum-inspired noise injection during training
b) 4-bit quantisation (NF4) of the frozen base model, dramatically reducing memory while still training the LoRA adapters in higher precision
c) Q-learning style reward shaping during fine-tuning
d) Quality-of-service guarantees for production deployment
4. In Aishwarya's LoraConfig, why does she target q_proj, k_proj, v_proj, o_proj?
a) Those are the only modules that PEFT library supports
b) These are the attention projection matrices — empirically the highest-impact modules to adapt; touching them changes how the model attends to information
c) LoRA cannot be applied to feed-forward layers due to a mathematical limitation
d) These four modules collectively contain over 90% of the model's parameters
5. Why is the saved LoRA adapter only ~150 MB despite the base model being 16 GB?
a) The PEFT library compresses the base model alongside the adapter
b) The adapter only contains the small A and B matrices (millions of parameters), not the frozen 8 billion base parameters which remain unchanged
c) Hugging Face automatically removes redundant weights when saving adapters
d) The adapter is stored as a delta against a publicly known checksum
6. If Aishwarya trains for 30 epochs instead of 3 on 1,000 Marathi examples, what is the most likely outcome?
a) The model becomes 10× more accurate because it sees the data more times
b) Catastrophic overfitting — the model memorises and regurgitates training answers and loses general knowledge from pre-training
c) The model converts to a different language by accident
d) The training crashes because PEFT does not support more than 5 epochs
7. Why is paged_adamw_8bit a good optimiser choice in QLoRA training?
a) It runs natively on Apple Silicon and Intel CPUs
b) It pages optimiser state between GPU and CPU memory and stores it in 8-bit, drastically reducing GPU memory use vs standard AdamW
c) It is the only optimiser compatible with NF4 quantisation
d) It uses pages of memory rather than continuous allocation, which improves cache hit rates
8. What is the most reliable way to know if Aishwarya's fine-tuned model is actually better at Marathi?
a) Compare the final training loss to the starting loss
b) Build a held-out test set of ~50 Marathi questions and have a Marathi speaker rate the answers from base model vs fine-tuned model side-by-side
c) Check the perplexity score on a generic English benchmark
d) Count the number of Marathi tokens in the model's output vocabulary