Recommender Systems at Scale 🎯

Class 12 · Age 16-17 · Lesson 07 of 12 · 🆓 Free

Story
Priya's Telugu OTT Recommender
👩‍💻 Priya · Hyderabad · Age 17

Priya interned at a Hyderabad startup building a Telugu-first OTT app. Their original "trending" feed showed everyone the same 20 films. Engagement was 4 minutes/day. After Priya built a personalised recommender, engagement grew to 19 minutes/day in 8 weeks.

She started with simple collaborative filtering and ran into the cold-start problem: new users had no history to learn from. She then built a two-tower neural model that uses side features to serve new users and scales to 200K users and 8K films.

Concepts
The Recommender Toolbox

Popularity

Show everyone what is trending. No personalisation. This is the baseline that any personalised model should beat.

Collaborative Filtering

"Users who liked X also liked Y." Works when you have ratings or other interaction data. Fails on cold start, because a new user or film has no interactions yet.
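
To make "users who liked X also liked Y" concrete, here is a minimal sketch that counts how often two films are liked by the same user. The function names and the toy data are illustrative assumptions, not the startup's code.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_liked(likes_by_user):
    """For every film X, count how often each other film Y was liked by the same user."""
    co_liked = defaultdict(Counter)
    for films in likes_by_user.values():
        for x, y in permutations(set(films), 2):
            co_liked[x][y] += 1
    return co_liked


def also_liked(co_liked, film, k=2):
    """Return up to k films most often liked together with `film`."""
    return [y for y, _ in co_liked[film].most_common(k)]


# Hypothetical toy data: user -> films they liked
likes_by_user = {
    "asha": ["Baahubali", "RRR", "Eega"],
    "ravi": ["Baahubali", "RRR"],
    "meena": ["RRR", "Eega"],
    "kiran": ["Baahubali", "RRR", "Magadheera"],
}
co_liked = build_co_liked(likes_by_user)
print(also_liked(co_liked, "Baahubali", k=1))  # ['RRR']
print(also_liked(co_liked, "Brand New Film"))  # [] because nobody has liked it yet
```

The empty list for the new film is the cold-start failure in miniature: with no interactions there is nothing to count.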

Matrix Factorisation

Decompose the user-item matrix into latent factors using SVD, ALS, or NMF. The classic approach.
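
As a rough illustration of latent factors, the sketch below learns a small vector per user and per film with plain stochastic gradient descent, so that their dot product approximates each known rating. This is a simplified stand-in for SVD/ALS/NMF; the hyperparameters and the toy ratings are assumptions chosen only to keep the example small.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_squared_error(P, Q, ratings):
    """Average squared gap between known ratings and predicted ratings."""
    return sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings) / len(ratings)


def factorise(ratings, num_users, num_items, dim=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(dim):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with a small L2 penalty on both factors
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Hypothetical ratings: users 0-1 prefer films 0-1, users 2-3 prefer films 2-3
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (2, 2, 5),
           (2, 3, 4), (3, 2, 4), (3, 3, 5), (0, 2, 1), (2, 0, 1)]

P0, Q0 = factorise(ratings, 4, 4, epochs=0)  # untrained factors, for comparison
P, Q = factorise(ratings, 4, 4)
print(mean_squared_error(P0, Q0, ratings), mean_squared_error(P, Q, ratings))
print(round(dot(P[1], Q[2]), 2))  # predicted rating for a pair never seen in training
```

The last line is the point of the method: once the factors are learned, any user-film pair gets a score, including pairs with no rating.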

Content-Based

Recommend items similar in features (genre, cast, language). Works for new items.
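
A minimal sketch of content-based matching, assuming each film is described by a set of feature tags. Jaccard overlap is one simple similarity choice among many; the catalogue and tag names are made up for illustration.

```python
def jaccard(a, b):
    """Overlap between two feature sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def similar_items(catalogue, liked_title, k=2):
    """Rank the other films by feature overlap with a film the user liked."""
    target = catalogue[liked_title]
    scored = [(jaccard(target, feats), title)
              for title, feats in catalogue.items() if title != liked_title]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by title
    return [title for _, title in scored[:k]]


# Hypothetical catalogue; "New Release" has no views or ratings yet
catalogue = {
    "Film A": {"genre:action", "lang:telugu", "cast:actor_1"},
    "Film B": {"genre:action", "lang:telugu", "cast:actor_2"},
    "Film C": {"genre:romance", "lang:telugu", "cast:actor_3"},
    "New Release": {"genre:action", "lang:telugu", "cast:actor_1"},
}
print(similar_items(catalogue, "Film A", k=2))  # ['New Release', 'Film B']
```

Because the score depends only on features, the brand-new film can be recommended before anyone has watched it.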

Two-Tower Neural

One tower encodes users, the other encodes items. Both map into a shared embedding space, so retrieval scales to billions of items.

Transformer-based

Sequence of past interactions → next-item prediction. State of the art (SASRec, BERT4Rec).

Code
Two-Tower Neural Recommender
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerRecommender(nn.Module):
    """User tower + Item tower → cosine similarity in shared 64-dim space."""

    def __init__(self, num_users, num_items, num_genres, num_languages, dim=64):
        super().__init__()
        # User features: ID + age_bucket + city + watch_history (last 10 items)
        self.user_id_emb = nn.Embedding(num_users, 32)
        self.age_emb = nn.Embedding(8, 8)            # 8 age buckets
        self.city_emb = nn.Embedding(50, 16)         # 50 cities
        self.history_emb = nn.Embedding(num_items + 1, 32, padding_idx=0)
        self.user_mlp = nn.Sequential(
            nn.Linear(32 + 8 + 16 + 32, 128), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(128, dim),
        )
        # Item features: ID + genre + language + release_year_bucket
        self.item_id_emb = nn.Embedding(num_items, 32)
        self.genre_emb = nn.Embedding(num_genres, 16)
        self.lang_emb = nn.Embedding(num_languages, 8)
        self.year_emb = nn.Embedding(10, 8)
        self.item_mlp = nn.Sequential(
            nn.Linear(32 + 16 + 8 + 8, 128), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(128, dim),
        )

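    def encode_user(self, user_id, age, city, history):
        # history: (B, 10) item IDs shifted by +1, with 0 reserved for padding
        hist = self.history_emb(history)                                 # (B, 10, 32); padding rows are zero
        n_real = (history != 0).sum(dim=1, keepdim=True).clamp(min=1)    # (B, 1) count of real items
        history_avg = hist.sum(dim=1) / n_real   # mean over real items only, so padding does not dilute it
        x = torch.cat([self.user_id_emb(user_id), self.age_emb(age),
                       self.city_emb(city), history_avg], dim=-1)
        return F.normalize(self.user_mlp(x), dim=-1)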

    def encode_item(self, item_id, genre, lang, year):
        x = torch.cat([self.item_id_emb(item_id), self.genre_emb(genre),
                       self.lang_emb(lang), self.year_emb(year)], dim=-1)
        return F.normalize(self.item_mlp(x), dim=-1)

    def forward(self, user_inputs, item_inputs):
        u = self.encode_user(*user_inputs)
        v = self.encode_item(*item_inputs)
        return (u * v).sum(dim=-1)  # cosine similarity (since both normalised)
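
To see why both towers end with F.normalize, here is a small worked example in plain Python (no PyTorch needed). The vectors are made-up numbers: one item points in the same direction as the user but is short, the other points elsewhere but is long.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


user = [3.0, 4.0]
aligned_item = [1.5, 2.0]   # same direction as the user, small magnitude
long_item = [10.0, 0.0]     # different direction, large magnitude

# Raw dot product rewards magnitude: the long vector wins
print(dot(user, aligned_item), dot(user, long_item))  # 12.5 30.0

# After normalising both sides, the score is the cosine of the angle: the aligned vector wins
cos_aligned = dot(l2_normalise(user), l2_normalise(aligned_item))
cos_long = dot(l2_normalise(user), l2_normalise(long_item))
print(round(cos_aligned, 3), round(cos_long, 3))  # 1.0 0.6
```

With unit-length vectors the score stays between -1 and 1 and depends only on direction, which is what the forward method relies on.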
Code
Train with Sampled Softmax (Negative Sampling)
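def train_step(model, batch, item_tables, num_items, num_negatives=4):
    """For each positive (user, item) pair, sample N random items as negatives.

    item_tables: (genre, lang, year) lookup tensors of shape (num_items,),
    on the same device as the batch.
    """
    user_inputs = batch["user"]   # (user_id, age, city, history)
    pos_item = batch["pos_item"]  # (item_id, genre, lang, year)
    item_genre_table, item_lang_table, item_year_table = item_tables

    # Sample negatives uniformly (or proportional to popularity^0.75).
    # A sampled ID can occasionally equal the positive; this sketch tolerates that.
    batch_size = pos_item[0].size(0)
    device = pos_item[0].device
    neg_ids = torch.randint(0, num_items, (batch_size, num_negatives), device=device)
    neg_items = (neg_ids,
                 item_genre_table[neg_ids],
                 item_lang_table[neg_ids],
                 item_year_table[neg_ids])

    # Encode each user once and reuse the vector for positive and negative scores
    u = model.encode_user(*user_inputs)                      # (B, dim)
    v_pos = model.encode_item(*pos_item)                     # (B, dim)
    v_neg = model.encode_item(*neg_items)                    # (B, N, dim)
    pos_score = (u * v_pos).sum(dim=-1)                      # (B,)
    neg_scores = (u.unsqueeze(1) * v_neg).sum(dim=-1)        # (B, N)

    # Column 0 holds the positive item, so the target class is 0 in every row
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (B, 1+N)
    targets = torch.zeros(batch_size, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)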
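
The code comment mentions sampling negatives "proportional to popularity^0.75" as an alternative to uniform sampling. Below is a minimal standard-library sketch of that idea; the function name, the view counts, and the choice to redraw the positive item are assumptions for illustration, not part of Priya's training loop.

```python
import random


def make_negative_sampler(view_counts, power=0.75, seed=None):
    """Return a function that samples item IDs with probability proportional to count**power."""
    item_ids = list(view_counts)
    weights = [view_counts[i] ** power for i in item_ids]
    rng = random.Random(seed)

    def sample(num_negatives, exclude=None):
        """Draw item IDs, redrawing any that equal `exclude` (the positive item).

        Assumes at least one other item has a non-zero weight.
        """
        negatives = []
        while len(negatives) < num_negatives:
            candidate = rng.choices(item_ids, weights=weights, k=1)[0]
            if candidate != exclude:
                negatives.append(candidate)
        return negatives

    return sample


# Hypothetical view counts: one blockbuster, one mid-tier film, one niche film
view_counts = {0: 10000, 1: 100, 2: 1}
raw_share = view_counts[0] / sum(view_counts.values())
smoothed_share = view_counts[0] ** 0.75 / sum(c ** 0.75 for c in view_counts.values())
print(round(raw_share, 3), round(smoothed_share, 3))  # blockbuster share: about 99% raw, about 97% smoothed

sample = make_negative_sampler(view_counts, seed=0)
print(sample(4, exclude=0))  # four negatives, never the positive item 0
```

Raising counts to the power 0.75 flattens the distribution a little, so popular films still appear often as negatives but rare films are not ignored.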

At inference, encode all 8K films once and store the vectors in a FAISS index. For each request, encode only the user vector and look up the top-K nearest films, which takes under 5 ms.
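
The lookup itself is a nearest-neighbour search. The sketch below does it by brute force in plain Python to show what is being computed; it is not FAISS and will not reach FAISS speeds. The item vectors and names are made-up unit vectors.

```python
import heapq


def top_k(user_vec, item_vecs, k=3):
    """Score every item by dot product with the user vector and keep the k best."""
    scored = ((sum(u * v for u, v in zip(user_vec, vec)), item_id)
              for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Hypothetical precomputed item embeddings (already unit length)
item_vecs = {
    "film_a": [1.0, 0.0],
    "film_b": [0.0, 1.0],
    "film_c": [0.6, 0.8],
}
user_vec = [0.8, 0.6]  # encoded once per request

print(top_k(user_vec, item_vecs, k=2))  # ['film_c', 'film_a']
```

The item vectors never change between requests, which is why they can be encoded once and indexed; only the user side is computed live.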

Production
Cold Start, Diversity, and Filter Bubbles
Ethics for OTT recommenders:
  • Allow "show me everything chronologically" so users can opt out of personalisation.
  • Don't recommend extreme/harmful content even if engagement is high.
  • Make the "Why am I seeing this?" explanation accessible (e.g., "Because you watched X").
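
On the filter-bubble side, the quiz in this lesson mentions that Priya injects 1-2 random "exploration" items into every list of 10. A minimal sketch of that idea follows; the function name, the random placement, and the toy catalogue are assumptions.

```python
import random


def inject_exploration(ranked, catalogue, list_size=10, num_explore=2, rng=None):
    """Build a feed from the top-ranked items plus a few random picks from the rest of the catalogue."""
    rng = rng or random.Random()
    top = list(ranked[: list_size - num_explore])
    chosen = set(top)
    pool = [item for item in catalogue if item not in chosen]
    explore = rng.sample(pool, min(num_explore, len(pool)))
    feed = list(top)
    for item in explore:
        # Place each exploration item at a random position in the feed
        feed.insert(rng.randrange(len(feed) + 1), item)
    return feed, explore


# Hypothetical catalogue of 50 films; the model's ranking is the first 20
catalogue = [f"film_{n}" for n in range(50)]
ranked = catalogue[:20]
feed, explore = inject_exploration(ranked, catalogue, rng=random.Random(7))
print(len(feed), len(explore))  # 10 2
```

Returning the exploration picks separately makes it easy to log how users respond to them, and to label them honestly in a "Why am I seeing this?" panel.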
Priya's outcome: Engagement 4 → 19 min/day. The startup raised a Series A based on the engagement metric. Priya kept her summer internship and is now building the next iteration: a transformer-based sequential model (SASRec).

📝 Check Your Understanding (8 Questions)

1. Why does popularity-based 'trending' work poorly for personalisation?
a) Trending lists are always wrong
b) It shows everyone the same 20 items, so it cannot capture individual taste: a Telugu user who only watches romance gets the same feed as a comedy fan, hurting engagement and retention
c) Trending requires expensive computation
d) Trending lists violate privacy laws
2. What problem does collaborative filtering have that the two-tower model solves?
a) Collaborative filtering does not support GPU acceleration
b) Collaborative filtering cannot handle cold-start (new users with no history); the two-tower model uses side features (age, city, genre preferences from onboarding) so it can produce reasonable recommendations on day 1
c) Collaborative filtering only works for English content
d) Collaborative filtering has been deprecated by FAISS
3. Why does the two-tower architecture put the user encoder and item encoder in separate towers?
a) It runs faster because the towers can be trained on different GPUs
b) At serving time you precompute and FAISS-index all item embeddings once; for each query you only encode the user and do a fast nearest-neighbour search. This scales to billions of items, which a single combined model cannot
c) Two-tower is required by the recommender system patent
d) It allows the model to handle multiple languages
4. Why does training use negative sampling (sampled softmax) instead of computing softmax over all 8,000 films?
a) Negative sampling produces strictly higher accuracy
b) Computing the full softmax over every item per batch is prohibitively expensive at scale; sampling 4-10 negatives per positive gives a good approximation of the gradient at a fraction of the cost. It is the standard scaling trick from word2vec
c) Negative sampling is required for cosine similarity to work
d) The 8,000 films cannot fit in GPU memory simultaneously
5. Why does Priya inject 1-2 random 'exploration' items into every list of 10?
a) To slow down the algorithm and reduce server load
b) To break filter bubbles and continually gather signal on items the user might like but the model doesn't yet know about; pure exploitation collapses the recommendation space and causes engagement decay over time
c) To meet a regulatory requirement for randomness
d) Because the model has bugs that random items mask
6. Why are explicit 'Not Interested' clicks more valuable than 1-minute watches?
a) Skip data is required by Indian law to be tracked
b) A 'Not Interested' click is a clear, intentional negative signal; a 1-minute watch is ambiguous (interrupted? pre-roll? reading subtitles?), and clear labels train better models
c) Skips happen more frequently than watches in the dataset
d) The OTT platform pays content owners only on long watches
7. Why does Priya use F.normalize on both user and item embeddings before computing the dot product?
a) Normalisation is required by PyTorch when using two-tower models
b) Normalising both vectors to unit length means the dot product equals cosine similarity, which is bounded in [-1, 1] and behaves well for ranking and FAISS retrieval; unnormalised dot products are dominated by vector magnitudes and skew rankings toward popular items
c) It reduces GPU memory by half
d) It is the only way to combine user and item features
8. What is the most important ethical principle Priya applies to her recommender?
a) Always recommend content with the highest predicted watch time
b) Engagement optimisation has guardrails: allow users to opt out of personalisation, refuse to amplify extreme/harmful content even when engaging, and make 'Why am I seeing this?' easy to access. Engagement at any cost causes long-term harm to users and to the business
c) Always show items in alphabetical order to be fair
d) Recommend only content from the user's home state