Aishwarya wanted to build a Marathi-English study tutor for her school's juniors. Llama-3-8B understands Marathi roughly but answers awkwardly. Full fine-tuning needs more than 64GB of GPU memory, which in practice means an 80GB A100, far beyond the 15GB T4 on Colab's free tier.
She discovered LoRA: freeze the base model and train small adapter matrices that add up to only about 0.5% of its parameter count. With QLoRA (LoRA on top of a 4-bit quantised base model), she fine-tuned Llama-3-8B on a free Colab GPU in 2 hours using 1,000 Marathi Q&A pairs. The result: a tutor that explains concepts in fluent Marathi-English code-switching.
Llama-3-8B has 8 billion parameters. To fine-tune all of them you need to store the model weights (16GB in fp16), the gradients (16GB), and the optimiser state (32GB for Adam, which keeps two extra values per parameter). That is 64GB of GPU memory before counting activations. A free Colab T4 has 15GB. An A100 80GB costs ₹150/hour.
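The arithmetic above can be reproduced in a few lines. This is a rough sketch, assuming 2 bytes per stored value throughout (as the 16GB fp16 figure implies) and ignoring activations; the function name is made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the text

def full_finetune_memory_gb(n_params, bytes_per_value=2):
    """Rough memory for full fine-tuning with Adam, ignoring activations."""
    weights = n_params * bytes_per_value / GB
    gradients = n_params * bytes_per_value / GB        # one gradient per weight
    adam_state = 2 * n_params * bytes_per_value / GB   # first and second moments
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }

breakdown = full_finetune_memory_gb(8e9)
for name, gb in breakdown.items():
    print(f"{name:>10}: {gb:.0f} GB")
```

If the optimiser state is kept in fp32 instead, the Adam term doubles, so 64GB is best read as a lower bound.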
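Insight from research (Hu et al., 2021): the weight updates learned during fine-tuning have very low intrinsic rank. For a frozen d×k weight matrix W, you can approximate the update ΔW as a product of two small matrices: ΔW ≈ B·A, where B is d×r and A is r×k, with r typically 8 or 16. Only A and B are trained, so each adapted matrix needs r·(d+k) trainable parameters instead of d·k.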
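To see where the parameter savings come from, the sketch below counts LoRA parameters for r=16. The layer shapes are assumptions taken from Llama-3-8B's public configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); the helper names are made up for illustration.

```python
def lora_params(d, k, r):
    """Trainable parameters for one adapted d x k weight: B is d x r, A is r x k."""
    return d * r + r * k

# Assumed Llama-3-8B shapes, written as (d, k) = (output width, input width)
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (KV, HIDDEN),
    "v_proj": (KV, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}

def total_lora_params(modules, r=16):
    """Sum the adapter parameters over the chosen modules in every layer."""
    return LAYERS * sum(lora_params(*shapes[m], r) for m in modules)

attention_only = total_lora_params(["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = total_lora_params(list(shapes))
full_matrix = HIDDEN * HIDDEN  # one full 4096 x 4096 update, for comparison

print(attention_only)  # 13631488
print(all_linear)      # 41943040
print(lora_params(HIDDEN, HIDDEN, 16), "vs", full_matrix)  # 131072 vs 16777216
```

Under these assumed shapes, the roughly 40M figure quoted in this guide corresponds to adapting all seven projection matrices in each layer; adapting only the four attention projections gives about a third of that.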
- Full FT: 8B trainable params · 64GB VRAM · ₹150/hr A100
- LoRA r=16: ~40M trainable (0.5%) · 18GB VRAM · 1× A10
- QLoRA r=16: ~40M trainable + 4-bit base · 12GB VRAM · free T4
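A back-of-the-envelope comparison shows why the 4-bit base is what brings the job within reach of a T4. This sketch counts only the static memory (frozen base weights plus everything stored for the adapter) and assumes 16 bytes per trainable adapter parameter (fp32 weight, fp32 gradient, two fp32 Adam moments). That byte count and the function name are assumptions, not measurements.

```python
GB = 1e9
N_BASE = 8e9      # frozen base parameters
N_ADAPTER = 40e6  # approximate LoRA parameter count quoted above

def static_memory_gb(base_bytes_per_param, adapter_bytes_per_param=16):
    """Frozen base weights plus adapter weights, gradients and Adam state.

    Activations, quantisation constants and CUDA overhead are not counted.
    """
    base = N_BASE * base_bytes_per_param / GB
    adapter = N_ADAPTER * adapter_bytes_per_param / GB
    return base + adapter

lora_fp16 = static_memory_gb(2.0)   # base stored in fp16
qlora_4bit = static_memory_gb(0.5)  # base stored in 4-bit

print(f"LoRA  (fp16 base):  {lora_fp16:.1f} GB")
print(f"QLoRA (4-bit base): {qlora_4bit:.1f} GB")
```

The gap between these static estimates and the VRAM figures listed above would be taken up by activations and runtime overhead, which grow with batch size and sequence length.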
Install dependencies (run once in Colab):
!pip install -q transformers peft bitsandbytes accelerate datasets trl
Load Llama-3 in 4-bit and add LoRA adapters:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
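bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # store the base weights in 4-bit
    bnb_4bit_quant_type="nf4",         # NormalFloat4 quantisation
    bnb_4bit_compute_dtype="float16",  # the free T4 has no bfloat16 support; use "bfloat16" on Ampere or newer
    bnb_4bit_use_double_quant=True,    # also quantise the quantisation constants
)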
# Gated model: accept Meta's license on Hugging Face and log in before downloading
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
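lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    # All seven linear projections per layer; this is what gives ~42M trainable
    # parameters. Attention-only (q/k/v/o) would give roughly 13.6M.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (exact totals may vary with library versions):
# trainable params: 41,943,040 || all params: 8,071,061,504 || trainable%: 0.52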
Train with TRL's SFTTrainer on Aishwarya's Marathi Q&A dataset:
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
dataset = load_dataset("json", data_files="marathi_qa.jsonl", split="train")
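# Note: passing tokenizer, dataset_text_field and max_seq_length directly matches
# older TRL releases; newer ones rename or move these settings (see SFTConfig).
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text",  # each JSONL record needs a "text" field
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="./marathi-tutor", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,  # effective batch of 16
        learning_rate=2e-4, logging_steps=10, save_strategy="epoch",
        fp16=True,  # the T4 does not support bf16; use bf16=True on Ampere or newer
        optim="paged_adamw_8bit",
    ),
)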
trainer.train()
model.save_pretrained("./marathi-tutor-lora") # only ~150 MB
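The training code above reads marathi_qa.jsonl and expects every record to carry a "text" field. Below is a minimal sketch of how such a file could be written. The speaker-label template is an assumption, chosen to match the "विद्यार्थी: … शिक्षक:" prompt used at inference time, and the answer is a placeholder rather than real data.

```python
import json

def to_text(question, answer):
    """Join one Q&A pair into the single string stored under the "text" key."""
    return f"विद्यार्थी: {question}\nशिक्षक: {answer}"

# Hypothetical placeholder pair; the real file would hold the collected Q&A pairs
pairs = [("प्रकाश संश्लेषण म्हणजे काय?", "<teacher's answer goes here>")]

with open("marathi_qa.jsonl", "w", encoding="utf-8") as f:
    for question, answer in pairs:
        # ensure_ascii=False keeps Devanagari readable instead of \u escapes
        record = {"text": to_text(question, answer)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Using the same template for training and inference matters: the model learns to continue after "शिक्षक:", so the prompt should end the same way.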
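Load the saved adapter on top of the 4-bit base model and ask a question:
# On a T4, free the training model first (for example, restart and re-run the
# setup cells); two copies of the base model may not fit in memory.
from peft import PeftModel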
base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, "./marathi-tutor-lora")
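# Prompt in English: "Student: What is photosynthesis?\nTeacher:"
prompt = "विद्यार्थी: प्रकाश संश्लेषण म्हणजे काय?\nशिक्षक:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)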
print(tokenizer.decode(output[0], skip_special_tokens=True))
- Evaluation: Don't rely on loss numbers alone. Build a held-out Marathi test set of 50 questions and have a Marathi speaker rate each answer.
- Merge for serving: model.merge_and_unload() bakes the adapter into the base weights and returns the merged model, for faster inference.
- Llama license: Meta's Llama-3 community license allows commercial use up to 700M monthly active users, which is fine for any Indian school project.
- Beware overfitting: 3 epochs on 1,000 examples works. 30 epochs makes the model parrot training data and forget general knowledge.
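The evaluation advice in the list above could be organised along these lines. This is a sketch only: the 1-to-5 rating scale, the pass mark of 4, and both helper names are assumptions, and the ratings shown are illustrative rather than real results.

```python
import random
from statistics import mean

def split_holdout(examples, n_test=50, seed=0):
    """Shuffle a copy of the examples and set aside n_test of them for evaluation."""
    if n_test >= len(examples):
        raise ValueError("need more examples than the held-out size")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def summarise_ratings(ratings, pass_mark=4):
    """Summarise 1-5 human ratings: mean score and share at or above pass_mark."""
    if not ratings:
        raise ValueError("no ratings to summarise")
    return {
        "mean": mean(ratings),
        "pass_rate": sum(r >= pass_mark for r in ratings) / len(ratings),
    }

# Stand-in ids for the Q&A pairs; the held-out 50 must never be trained on
examples = [f"qa-{i}" for i in range(1000)]
train, test = split_holdout(examples)
print(len(train), len(test))  # 950 50

ratings = [5, 4, 3, 4, 2]  # illustrative only
print(summarise_ratings(ratings))
```

Holding the questions out before training is the important part; a test set drawn from the training data would hide exactly the parroting described in the last bullet.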