
Recommender Systems from Scratch

The Recommendation Problem — Why It’s Harder Than It Looks

Netflix’s recommendation engine is worth an estimated $1 billion per year in retained subscribers. YouTube’s “Up Next” drives 70% of all watch time. Amazon attributes 35% of revenue to “customers who bought this also bought.” Recommender systems are quietly the most profitable machine learning application ever deployed — and the math behind them is surprisingly beautiful.

Here’s the core problem: you have a giant matrix of users and items. Each cell might contain a rating — or, far more likely, nothing at all. A user with 200 movies watched out of 15,000 available means 98.7% of that row is empty. Your job is to predict the missing entries: which of those 14,800 unseen movies would this person actually enjoy?

Three paradigms attack this problem from different angles. Collaborative filtering says users who agreed before will agree again. Content-based filtering says recommend items similar to what you already liked. Hybrid methods combine both signals. We’ll build all three from scratch, starting with the simplest possible baseline and ending with a neural network.

Let’s start with the data. We’ll use a small rating matrix — 8 users rating 10 movies — with enough structure that patterns are discoverable but enough sparsity that it’s non-trivial.

# Code Block 1: The Rating Matrix

# 8 users x 10 movies, ratings 1-5 (0 = unrated)
# Movies have hidden genre structure:
#   0-2: Sci-Fi (Nebula Quest, Star Forge, Void Runners)
#   3-5: Comedy (Laugh Track, Dad Joke: The Movie, Pun Intended)
#   6-7: Drama (Quiet River, The Long Goodbye)
#   8-9: Action (Fist Planet, Thunder Road)
ratings = [
    [5, 4, 5, 1, 0, 0, 2, 0, 4, 3],  # User 0: sci-fi + action fan
    [4, 5, 4, 2, 1, 2, 0, 1, 3, 4],  # User 1: sci-fi + action fan
    [1, 2, 1, 5, 4, 5, 3, 0, 0, 1],  # User 2: comedy lover
    [2, 1, 0, 4, 5, 4, 0, 3, 1, 0],  # User 3: comedy lover
    [0, 1, 0, 2, 0, 0, 5, 4, 0, 2],  # User 4: drama fan
    [2, 0, 1, 0, 3, 0, 4, 5, 0, 1],  # User 5: drama + some comedy
    [5, 4, 4, 1, 1, 0, 2, 0, 5, 5],  # User 6: sci-fi + action
    [0, 1, 2, 5, 4, 5, 2, 0, 0, 1],  # User 7: comedy lover
]
n_users, n_items = len(ratings), len(ratings[0])

# Count rated entries
rated = sum(1 for u in ratings for r in u if r > 0)
total = n_users * n_items
print(f"Rated: {rated}/{total} ({100*rated/total:.1f}%)")
print(f"Sparsity: {100*(1 - rated/total):.1f}%")

# Naive baseline: predict global average for all missing entries
all_ratings = [r for u in ratings for r in u if r > 0]
global_avg = sum(all_ratings) / len(all_ratings)
print(f"\nGlobal average rating: {global_avg:.2f}")

# RMSE of naive baseline on known ratings
mse = sum((r - global_avg)**2 for u in ratings for r in u if r > 0)
rmse_naive = (mse / rated) ** 0.5
print(f"Naive baseline RMSE: {rmse_naive:.3f}")

# Per-user average baseline
user_avgs = []
for u in ratings:
    vals = [r for r in u if r > 0]
    user_avgs.append(sum(vals) / len(vals) if vals else global_avg)

mse_user = sum((r - user_avgs[i])**2
               for i, u in enumerate(ratings) for r in u if r > 0)
rmse_user = (mse_user / rated) ** 0.5
print(f"Per-user avg RMSE:   {rmse_user:.3f}")
# Output:
# Rated: 57/80 (71.2%)
# Sparsity: 28.7%
# Global average rating: 2.96
# Naive baseline RMSE: 1.544
# Per-user avg RMSE:   1.521

The per-user average barely improves on the global average (1.544 → 1.521). It captures that some people rate higher than others, but misses the real signal: which specific items a user would like. That's our baseline: every method we build should beat 1.521.

User-Based Collaborative Filtering — Your Neighbors Know Best

The oldest recommendation algorithm is also the most intuitive: find users who rate things similarly to you, then recommend what they liked. If you and your coworker both gave five stars to the same three sci-fi movies, and she loved a fourth one you haven’t seen, it’s a reasonable bet you’ll enjoy it too.

The key decision is how to measure “similar.” We’ll use cosine similarity on mean-centered rating vectors. Mean-centering handles the fact that some people are generous raters (everything is 4-5) while others are harsh (everything is 2-3). After centering, we only compare items both users have rated — you can’t compute similarity from absence.
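To see why mean-centering matters, consider a tiny made-up example: a generous rater and a harsh rater who share exactly the same taste. Their raw vectors look different, but after subtracting each rater's own mean they coincide:

```python
# Hypothetical example: same taste, different rating scales
generous = [5, 4, 5]   # rates everything high
harsh    = [3, 2, 3]   # rates everything low

def centered(v):
    """Subtract the vector's own mean from each entry."""
    m = sum(v) / len(v)
    return [x - m for x in v]

print([round(x, 2) for x in centered(generous)])  # [0.33, -0.67, 0.33]
print([round(x, 2) for x in centered(harsh)])     # [0.33, -0.67, 0.33]
```

After centering, the cosine similarity between these two users is essentially 1.0: they agree perfectly once the scale offset is removed.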

# Code Block 2: User-Based Collaborative Filtering
def cosine_sim(u1, u2, ratings):
    """Cosine similarity between two users on co-rated items, mean-centered."""
    r1, r2 = ratings[u1], ratings[u2]
    # Find items both users rated
    co_rated = [j for j in range(len(r1)) if r1[j] > 0 and r2[j] > 0]
    if len(co_rated) < 2:
        return 0.0  # not enough overlap

    # Mean-center each user's ratings
    mean1 = sum(r1[j] for j in co_rated) / len(co_rated)
    mean2 = sum(r2[j] for j in co_rated) / len(co_rated)

    num = sum((r1[j] - mean1) * (r2[j] - mean2) for j in co_rated)
    d1 = sum((r1[j] - mean1)**2 for j in co_rated) ** 0.5
    d2 = sum((r2[j] - mean2)**2 for j in co_rated) ** 0.5
    return num / (d1 * d2) if d1 * d2 > 0 else 0.0

def predict_user_cf(user, item, ratings, k=3):
    """Predict rating using k most similar users who rated this item."""
    if ratings[user][item] > 0:
        return ratings[user][item]  # already rated

    # Find users who rated this item
    candidates = [u for u in range(len(ratings))
                  if u != user and ratings[u][item] > 0]
    if not candidates:
        return user_avgs[user]  # fallback to user average

    # Compute similarities and pick top-k
    sims = [(u, cosine_sim(user, u, ratings)) for u in candidates]
    sims.sort(key=lambda x: -x[1])
    top_k = [(u, s) for u, s in sims[:k] if s > 0]

    if not top_k:
        return user_avgs[user]

    # Weighted average of neighbors' ratings
    num = sum(s * ratings[u][item] for u, s in top_k)
    den = sum(abs(s) for _, s in top_k)
    return num / den

# Predict User 0's rating for Movie 4 (a comedy)
pred = predict_user_cf(0, 4, ratings, k=3)
print(f"User 0 rating for 'Dad Joke: The Movie': {pred:.2f}")

# RMSE on all known ratings (leave-one-out style)
errors = []
for i in range(n_users):
    for j in range(n_items):
        if ratings[i][j] > 0:
            # Temporarily hide rating, predict, restore
            true_val = ratings[i][j]
            ratings[i][j] = 0
            pred = predict_user_cf(i, j, ratings, k=3)
            ratings[i][j] = true_val
            errors.append((true_val - pred) ** 2)

rmse_ucf = (sum(errors) / len(errors)) ** 0.5
print(f"User-CF RMSE: {rmse_ucf:.3f}")
# Output:
# User 0 rating for 'Dad Joke: The Movie': 1.00
# User-CF RMSE: 1.090

User-CF predicts that User 0 (a sci-fi fan) would rate “Dad Joke: The Movie” at 1.00 — not their cup of tea. The algorithm found that similar sci-fi fans (Users 6, 1) also rated the comedy a 1. RMSE of 1.090 beats our per-user average baseline of 1.521 by 28%.

The big downside? Scaling. Computing similarity between every pair of users is O(n²), and user preferences change over time. Item-based CF solves both problems.
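To get a feel for the gap, here's a back-of-the-envelope count of similarity pairs at a hypothetical streaming-service scale (the numbers are illustrative, not from any real catalog):

```python
from math import comb

# Hypothetical scale: far more users than items, as is typical
n_users_big = 1_000_000
n_items_big = 50_000

print(f"User-user pairs: {comb(n_users_big, 2):,}")  # ~5.0e11
print(f"Item-item pairs: {comb(n_items_big, 2):,}")  # ~1.2e9
```

Roughly 400× fewer similarities to maintain, and because item similarities drift slowly, the item-item matrix can be refreshed on a relaxed offline schedule.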

Item-Based Collaborative Filtering — Amazon’s Secret Weapon

In 2003, Amazon published their approach: instead of finding similar users, find similar items. To predict whether you’ll like Movie X, look at how you rated movies similar to X. This insight made recommendations practical at scale — item similarities are far more stable than user similarities (The Matrix is always similar to Inception, but your taste might shift next month).

Even better, you can precompute the entire item-item similarity matrix offline and serve predictions with a simple lookup plus weighted average.

# Code Block 3: Item-Based Collaborative Filtering
def item_cosine_sim(i1, i2, ratings):
    """Cosine similarity between two items based on users who rated both."""
    co_users = [u for u in range(len(ratings))
                if ratings[u][i1] > 0 and ratings[u][i2] > 0]
    if len(co_users) < 2:
        return 0.0

    mean1 = sum(ratings[u][i1] for u in co_users) / len(co_users)
    mean2 = sum(ratings[u][i2] for u in co_users) / len(co_users)

    num = sum((ratings[u][i1] - mean1) * (ratings[u][i2] - mean2)
              for u in co_users)
    d1 = sum((ratings[u][i1] - mean1)**2 for u in co_users) ** 0.5
    d2 = sum((ratings[u][i2] - mean2)**2 for u in co_users) ** 0.5
    return num / (d1 * d2) if d1 * d2 > 0 else 0.0

# Precompute item-item similarity matrix
item_sims = [[0.0] * n_items for _ in range(n_items)]
for i in range(n_items):
    for j in range(i + 1, n_items):
        s = item_cosine_sim(i, j, ratings)
        item_sims[i][j] = s
        item_sims[j][i] = s

def predict_item_cf(user, item, ratings, k=3):
    """Predict rating from user's ratings of the k most similar items."""
    if ratings[user][item] > 0:
        return ratings[user][item]

    # Items this user has rated
    rated_items = [j for j in range(n_items)
                   if j != item and ratings[user][j] > 0]
    if not rated_items:
        return user_avgs[user]

    # Pick top-k most similar items the user has rated
    sims = [(j, item_sims[item][j]) for j in rated_items]
    sims.sort(key=lambda x: -x[1])
    top_k = [(j, s) for j, s in sims[:k] if s > 0]

    if not top_k:
        return user_avgs[user]

    num = sum(s * ratings[user][j] for j, s in top_k)
    den = sum(abs(s) for _, s in top_k)
    return num / den

# Evaluate item-based CF
errors_icf = []
for i in range(n_users):
    for j in range(n_items):
        if ratings[i][j] > 0:
            true_val = ratings[i][j]
            ratings[i][j] = 0
            pred = predict_item_cf(i, j, ratings, k=3)
            ratings[i][j] = true_val
            errors_icf.append((true_val - pred) ** 2)

rmse_icf = (sum(errors_icf) / len(errors_icf)) ** 0.5
print(f"Item-CF RMSE: {rmse_icf:.3f}")
print(f"\nItem similarities (Nebula Quest vs ...):")
names = ["Nebula Quest", "Star Forge", "Void Runners",
         "Laugh Track", "Dad Joke", "Pun Intended",
         "Quiet River", "Long Goodbye", "Fist Planet", "Thunder Road"]
for j in range(1, n_items):
    print(f"  {names[j]:<14s} {item_sims[0][j]:+.3f}")
# Output:
# Item-CF RMSE: 1.016
# Item similarities (Nebula Quest vs ...):
#   Star Forge     +0.804
#   Void Runners   +0.956
#   Laugh Track    -1.000
#   Dad Joke       -0.868
#   Pun Intended   -1.000
#   Quiet River    -0.802
#   Long Goodbye   -0.866
#   Fist Planet    +0.966
#   Thunder Road   +0.877

Look at those item similarities — they discovered genre structure purely from rating patterns. Nebula Quest (sci-fi) is highly similar to Fist Planet (+0.97, action) and Void Runners (+0.96, sci-fi), but negatively correlated with every comedy and drama. Nobody told the algorithm about genres. It figured it out from who likes what.

Matrix Factorization — The Netflix Prize Breakthrough

In 2006, Netflix offered $1 million to anyone who could beat their recommendation algorithm by 10%. The key insight, popularized by Simon Funk's famous blog post, was matrix factorization: decompose the rating matrix R ≈ U × Vᵀ, where U is a user-factor matrix and V is an item-factor matrix.

Each user becomes a vector of k latent factors, and so does each item. A predicted rating is just the dot product plus bias terms: r̂ᵢⱼ = μ + bᵢ + bⱼ + uᵢ · vⱼ. The global mean μ captures the overall tendency. The user bias bᵢ captures whether this user rates high or low. The item bias bⱼ captures whether this item gets high or low ratings. The dot product captures the interesting part: the interaction between user preferences and item attributes.
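Plugging hypothetical numbers into the prediction formula makes it concrete. The biases and factor vectors below are made up for illustration, not taken from a trained model:

```python
# Hypothetical learned parameters for one (user, item) pair
mu = 2.96              # global mean (matches our data's average)
b_user = 0.40          # this user rates slightly high
b_item = 0.25          # this item is slightly better-liked than average
u = [1.1, -0.4, 0.2]   # user's latent factors (made up)
v = [0.9,  0.3, -0.1]  # item's latent factors (made up)

interaction = sum(a * b for a, b in zip(u, v))
pred = mu + b_user + b_item + interaction
print(f"Interaction: {interaction:+.2f}, predicted rating: {pred:.2f}")
# Interaction: +0.85, predicted rating: 4.46
```

The biases alone already predict 3.61; the interaction term is what pushes the prediction up or down for this particular user-item pairing.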

We train with SGD — for each observed rating, nudge the vectors to reduce the prediction error, plus L2 regularization to prevent overfitting.

# Code Block 4: Matrix Factorization via SGD
import random
random.seed(42)
k = 5          # latent factors
lr = 0.01      # learning rate
reg = 0.1      # regularization strength
epochs = 80

# Initialize randomly
U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
bu = [0.0] * n_users   # user biases
bi = [0.0] * n_items   # item biases
mu = global_avg         # global mean

# Collect observed (user, item, rating) triples
observed = [(i, j, ratings[i][j])
            for i in range(n_users) for j in range(n_items)
            if ratings[i][j] > 0]

for epoch in range(epochs):
    random.shuffle(observed)
    total_loss = 0.0
    for i, j, r in observed:
        # Predict: mu + bu[i] + bi[j] + dot(U[i], V[j])
        dot = sum(U[i][f] * V[j][f] for f in range(k))
        pred = mu + bu[i] + bi[j] + dot
        err = r - pred
        total_loss += err ** 2

        # Update biases
        bu[i] += lr * (err - reg * bu[i])
        bi[j] += lr * (err - reg * bi[j])

        # Update latent factors
        for f in range(k):
            u_old = U[i][f]
            U[i][f] += lr * (err * V[j][f] - reg * U[i][f])
            V[j][f] += lr * (err * u_old - reg * V[j][f])

    if (epoch + 1) % 20 == 0:
        rmse = (total_loss / len(observed)) ** 0.5
        print(f"Epoch {epoch+1:3d}: RMSE = {rmse:.4f}")

# Final RMSE
final_loss = sum((r - mu - bu[i] - bi[j] -
                  sum(U[i][f]*V[j][f] for f in range(k)))**2
                 for i, j, r in observed)
rmse_mf = (final_loss / len(observed)) ** 0.5
print(f"\nMatrix Factorization RMSE: {rmse_mf:.3f}")
print(f"\nUser biases: {[f'{b:+.2f}' for b in bu]}")
print(f"Item biases: {[f'{b:+.2f}' for b in bi]}")
# Output:
# Epoch  20: RMSE = 1.4388
# Epoch  40: RMSE = 0.9741
# Epoch  60: RMSE = 0.6827
# Epoch  80: RMSE = 0.6149
# Matrix Factorization RMSE: 0.601

An RMSE of 0.601 on the training data is a big leap from item-CF’s 1.016. But the real magic is what the latent factors learn. Without being told about genres, the model discovers latent dimensions that correspond to concepts like “sci-fi affinity” or “comedy preference.” Users and items that share high values on the same factor end up near each other in latent space — which is exactly what the interactive demo below will visualize.
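As a sketch of what "near each other in latent space" means, here's how nearest neighbors would be looked up given trained item vectors. The 2-factor vectors below are invented for illustration; in practice you would use the rows of V from the model above:

```python
# Invented 2-factor item vectors (not the trained V)
V_demo = {
    "Nebula Quest": [1.2, -0.3],
    "Star Forge":   [1.0, -0.2],
    "Laugh Track":  [-0.9, 0.8],
}

def cos_vec(a, b):
    """Plain cosine similarity between two factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

query = V_demo["Nebula Quest"]
for name, vec in V_demo.items():
    if name != "Nebula Quest":
        print(f"{name:<12s} {cos_vec(query, vec):+.3f}")
```

Items that the same users rate highly get pulled toward the same region of factor space, so a high cosine here mirrors the item-item similarities we computed directly from ratings.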

Content-Based Filtering — When You Know What Items Are

Collaborative filtering fails when an item has no ratings at all — the cold-start problem. A brand-new movie added today has zero interactions, so CF can’t compute any similarities. Content-based filtering solves this by using item features instead of rating patterns.

We build a profile for each user from the features of items they’ve rated (weighted by their ratings). Then we score new items by cosine similarity to that profile. If you’ve consistently liked sci-fi movies, a new sci-fi movie gets a high score even before anyone rates it.

# Code Block 5: Content-Based Filtering with TF-IDF
# Item features: [action, comedy, drama, sci-fi, romance, thriller]
features = [
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.2],  # Nebula Quest (sci-fi)
    [0.2, 0.0, 0.0, 0.8, 0.0, 0.3],  # Star Forge (sci-fi)
    [0.3, 0.0, 0.0, 0.7, 0.0, 0.4],  # Void Runners (sci-fi/action)
    [0.0, 0.9, 0.1, 0.0, 0.1, 0.0],  # Laugh Track (comedy)
    [0.0, 0.8, 0.2, 0.0, 0.2, 0.0],  # Dad Joke: The Movie (comedy)
    [0.0, 0.9, 0.0, 0.0, 0.0, 0.1],  # Pun Intended (comedy)
    [0.0, 0.1, 0.9, 0.0, 0.3, 0.0],  # Quiet River (drama)
    [0.0, 0.0, 0.9, 0.0, 0.4, 0.0],  # The Long Goodbye (drama)
    [0.9, 0.0, 0.1, 0.2, 0.0, 0.7],  # Fist Planet (action)
    [0.8, 0.1, 0.0, 0.1, 0.0, 0.8],  # Thunder Road (action)
]
n_features = len(features[0])

# TF-IDF weighting: upweight rare genres, downweight common ones
import math
doc_freq = [sum(1 for item in features if item[f] > 0.3)
            for f in range(n_features)]
n_docs = len(features)
idf = [math.log(n_docs / (1 + df)) for df in doc_freq]

# Apply IDF weights
tfidf = [[features[i][f] * idf[f] for f in range(n_features)]
         for i in range(n_items)]

def build_user_profile(user, ratings, tfidf):
    """Weighted average of liked items' TF-IDF vectors."""
    profile = [0.0] * n_features
    weight_sum = 0.0
    for j in range(n_items):
        if ratings[user][j] >= 3:  # only items the user liked
            w = ratings[user][j]
            for f in range(n_features):
                profile[f] += w * tfidf[j][f]
            weight_sum += w
    if weight_sum > 0:
        profile = [p / weight_sum for p in profile]
    return profile

def content_score(profile, item_vec):
    """Cosine similarity between user profile and item."""
    dot = sum(a * b for a, b in zip(profile, item_vec))
    na = sum(a**2 for a in profile) ** 0.5
    nb = sum(b**2 for b in item_vec) ** 0.5
    return dot / (na * nb) if na * nb > 0 else 0.0

# Score all unrated items for User 0
profile_u0 = build_user_profile(0, ratings, tfidf)
print("User 0's profile (sci-fi/action fan):")
genre_names = ["action", "comedy", "drama", "sci-fi", "romance", "thriller"]
for f in range(n_features):
    if profile_u0[f] > 0.1:  # hide tiny residual genre weights
        print(f"  {genre_names[f]:<10s} {profile_u0[f]:.3f}")

print("\nContent-based scores for User 0's unrated movies:")
for j in range(n_items):
    if ratings[0][j] == 0:
        score = content_score(profile_u0, tfidf[j])
        print(f"  {names[j]:<14s} score={score:.3f}")
# Output:
# User 0's profile (sci-fi/action fan):
#   action     0.505
#   sci-fi     0.537
#   thriller   0.410
# Content-based scores for User 0's unrated movies:
#   Dad Joke       score=0.021
#   Pun Intended   score=0.069
#   Long Goodbye   score=0.023

The content-based model correctly identifies that User 0 (sci-fi/action fan) would have very low interest in the comedies and dramas. It can also score a brand-new movie — add a new sci-fi feature vector and it immediately gets a high score, no ratings needed.
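Here's that cold-start claim as code: score a hypothetical brand-new sci-fi movie against User 0's printed profile. The new movie's feature vector is invented, and for simplicity the IDF weighting is skipped:

```python
# User 0's profile, taken from the printed output above
# [action, comedy, drama, sci-fi, romance, thriller]
profile_u0 = [0.505, 0.0, 0.0, 0.537, 0.0, 0.410]

# A hypothetical new release with zero ratings so far
new_scifi = [0.2, 0.0, 0.0, 0.9, 0.0, 0.3]

dot = sum(a * b for a, b in zip(profile_u0, new_scifi))
na = sum(a * a for a in profile_u0) ** 0.5
nb = sum(b * b for b in new_scifi) ** 0.5
print(f"Cold-start score: {dot / (na * nb):.3f}")  # well above the 0.02-0.07 range
```

Compare that to the 0.02-0.07 range the comedies and drama scored: the new movie lands at the top of User 0's list before a single person has rated it.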

The limitation is obvious: content-based filtering has no serendipity. It can only recommend more of the same. A collaborative filter might surprise you with an unexpected gem that people like you enjoyed — content-based never will.

Neural Collaborative Filtering — Deep Learning Meets Recommendations

Matrix factorization captures interactions through a dot product, which is inherently linear. What if the relationship between user preferences and item attributes is more complex? Neural collaborative filtering replaces the dot product with a neural network that can learn arbitrary non-linear interaction functions.

The architecture is simple: embed each user and item into a dense vector (just like MF), concatenate them, and feed the result through hidden layers to predict the rating. Replace the concatenation with an element-wise product of the two embeddings and fix the output weights to ones, and you recover matrix factorization exactly, so this family of architectures strictly generalizes MF.
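One way to make the MF connection concrete: with an element-wise product of the embeddings and fixed all-ones output weights, the "network" computes exactly MF's dot-product interaction. A minimal sketch with made-up embeddings:

```python
u = [0.5, -0.2, 0.8]   # user embedding (made up)
v = [0.4,  0.1, 0.3]   # item embedding (made up)

# "Network": element-wise product layer, then all-ones linear output
elementwise = [a * b for a, b in zip(u, v)]
out_weights = [1.0, 1.0, 1.0]
nn_out = sum(e * w for e, w in zip(elementwise, out_weights))

dot = sum(a * b for a, b in zip(u, v))
print(nn_out == dot)  # True: this special case IS matrix factorization
```

Once the output weights become learnable and hidden layers with non-linearities are stacked on top, the model can express interactions a fixed dot product cannot.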

# Code Block 6: Neural Collaborative Filtering (pure numpy-style)
random.seed(42)
emb_dim = 5     # embedding dimension
h1_size = 16    # first hidden layer
h2_size = 8     # second hidden layer

# Initialize weights (Xavier-like)
def rand_matrix(rows, cols, scale=None):
    if scale is None:
        scale = (2.0 / (rows + cols)) ** 0.5
    return [[random.gauss(0, scale) for _ in range(cols)]
            for _ in range(rows)]

# Embeddings
E_user = rand_matrix(n_users, emb_dim)
E_item = rand_matrix(n_items, emb_dim)
# Hidden layers: input = 2*emb_dim, output = 1
W1 = rand_matrix(2 * emb_dim, h1_size)
b1 = [0.0] * h1_size
W2 = rand_matrix(h1_size, h2_size)
b2 = [0.0] * h2_size
W3 = rand_matrix(h2_size, 1)
b3 = [0.0]

def relu(x): return max(0.0, x)

def forward(user, item):
    """Forward pass: embed, concat, 2 hidden layers, linear output."""
    x = E_user[user] + E_item[item]  # concatenated input
    # Hidden layer 1
    h1 = [relu(sum(x[i] * W1[i][j] for i in range(2*emb_dim)) + b1[j])
          for j in range(h1_size)]
    # Hidden layer 2
    h2 = [relu(sum(h1[i] * W2[i][j] for i in range(h1_size)) + b2[j])
          for j in range(h2_size)]
    # Output
    out = sum(h2[i] * W3[i][0] for i in range(h2_size)) + b3[0]
    return out, x, h1, h2

# Train with SGD and backpropagation
lr_ncf = 0.005
for epoch in range(100):
    random.shuffle(observed)
    total_loss = 0.0
    for u, j, r in observed:
        pred, x, h1, h2 = forward(u, j)
        err = r - pred
        total_loss += err ** 2

        # Backprop through output layer
        d_out = -2 * err
        for i in range(h2_size):
            grad_w3 = d_out * h2[i]
            W3[i][0] -= lr_ncf * grad_w3
        b3[0] -= lr_ncf * d_out

        # Backprop through hidden layer 2
        d_h2 = [d_out * W3[i][0] * (1 if h2[i] > 0 else 0)
                for i in range(h2_size)]
        for i in range(h1_size):
            for o in range(h2_size):
                W2[i][o] -= lr_ncf * d_h2[o] * h1[i]
        for o in range(h2_size):
            b2[o] -= lr_ncf * d_h2[o]

        # Backprop through hidden layer 1
        d_h1 = [sum(d_h2[o] * W2[i][o] for o in range(h2_size))
                * (1 if h1[i] > 0 else 0) for i in range(h1_size)]
        for i in range(2 * emb_dim):
            for o in range(h1_size):
                W1[i][o] -= lr_ncf * d_h1[o] * x[i]
        for o in range(h1_size):
            b1[o] -= lr_ncf * d_h1[o]

        # Update embeddings
        d_x = [sum(d_h1[o] * W1[i][o] for o in range(h1_size))
               for i in range(2 * emb_dim)]
        for i in range(emb_dim):
            E_user[u][i] -= lr_ncf * d_x[i]
            E_item[j][i] -= lr_ncf * d_x[emb_dim + i]

    if (epoch + 1) % 25 == 0:
        rmse = (total_loss / len(observed)) ** 0.5
        print(f"Epoch {epoch+1:3d}: RMSE = {rmse:.4f}")
# Output:
# Epoch  25: RMSE = 0.9441
# Epoch  50: RMSE = 0.4587
# Epoch  75: RMSE = 0.2965
# Epoch 100: RMSE = 0.2418

Neural CF reaches a training RMSE of 0.242, lower than MF’s 0.601 — the neural network’s extra capacity lets it fit the training data more tightly. On larger datasets with complex interaction patterns, the neural approach can capture non-linear relationships that MF misses — like “this user likes sci-fi comedies but not sci-fi dramas,” which is a feature interaction that a simple dot product can’t model.

The tradeoff is classic: more parameters, more data needed, and much less interpretable. For most practical cases, matrix factorization with well-tuned biases is remarkably hard to beat — which is why it remained the backbone of Netflix’s system for years.

Your Netflix queue is basically linear algebra. A dot product between two vectors — one describing who you are and one describing what a movie is — determines what you watch tonight.

Interactive Demos

Now let’s make all of this tangible. The first demo lets you explore how collaborative filtering makes predictions in real time. The second shows how matrix factorization discovers hidden structure in the data.

Try It: Collaborative Filtering Explorer

Click any gray cell to predict its rating. Click a colored cell to cycle its rating (1→2→3→4→5→unrated). Switch methods with the buttons below.


Try It: Latent Factor Space Visualizer

After matrix factorization, each movie becomes a point in latent space. Drag the slider to change the number of factors — watch genre clusters emerge as k increases.

Hover over a movie to see its latent vector and nearest neighbors.

References & Further Reading

For the foundations these methods build on, see our earlier posts on embeddings (user and item vectors are embeddings), PCA (SVD is a special case of matrix factorization), k-means clustering (how items cluster in latent space), gradient descent (the SGD we used to train MF), and neural networks (the architecture behind neural CF).