Embeddings from Scratch: How Words Become Vectors
The Punchline
Picture a two-dimensional map. In one corner, “king” and “queen” sit side by side. Nearby, “man” and “woman” cluster together. Across the map, “cat” and “dog” are neighbors, with “bird” and “fish” not far away. In another region, “bread” huddles near “cheese”.
This map was produced by a matrix that started as pure random noise. After being nudged a few million times by a deceptively simple prediction task, the numbers organized themselves into a map of meaning. Words that share similar contexts ended up in similar places — not because anyone told the algorithm what “king” means, but because kings and queens tend to appear in the same sentences.
How? That's what we're building today — Word2Vec's skip-gram with negative sampling, from scratch, in about 80 lines of NumPy. By the end, you'll have a working embedding trainer and an interactive visualization where you can watch random vectors learn to understand language.
This is the third post in the Elementary AI trilogy. In Micrograd, we taught numbers to compute their own derivatives — the engine. In Attention, we taught them to find relevance — the steering. Now we teach them to represent meaning — the fuel. Together, these three ideas are the foundation of every large language model.
Words as Addresses vs. Words as Meaning
The most obvious way to represent words as numbers is one-hot encoding: give each word a unique index and set that position to 1, everything else to 0. In a vocabulary of five words:
import numpy as np
vocab = ["cat", "dog", "fish", "bird", "king"]
for word in vocab:
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    print(f"{word:>5s} → {vec}")
# Output:
# cat → [1. 0. 0. 0. 0.]
# dog → [0. 1. 0. 0. 0.]
# fish → [0. 0. 1. 0. 0.]
# bird → [0. 0. 0. 1. 0.]
# king → [0. 0. 0. 0. 1.]
This is like assigning every kid in school a locker number. Locker #47 has no mathematical relationship to locker #48, even if those two kids are twins. The numbers are addresses, not descriptions.
One-hot encoding has two fatal problems:
- No similarity signal. The dot product of any two different one-hot vectors is always zero. The math says “cat” and “dog” are exactly as unrelated as “cat” and “quantum physics.”
- Dimensionality explosion. A real vocabulary has 50,000+ words. That means 50,000-dimensional vectors that are 99.998% zeros. Wasteful and useless for learning.
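To make the first problem concrete, here is a quick check: distinct one-hot vectors are always orthogonal, while dense vectors carry graded similarity. The dense values below are invented purely for illustration:

```python
import numpy as np

# One-hot: any two different words have dot product 0
cat_1hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
dog_1hot = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
print(np.dot(cat_1hot, dog_1hot))  # 0.0 — "cat" and "dog" look unrelated

# Dense (made-up values): similarity becomes a matter of degree
cat_dense = np.array([0.9, 0.8, 0.1])
dog_dense = np.array([0.8, 0.9, 0.2])
king_dense = np.array([0.1, 0.1, 0.9])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat_dense, dog_dense))   # close to 1 — similar
print(cosine(cat_dense, king_dense))  # much smaller — dissimilar
```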
What we want instead: dense vectors — compact arrays of 50–300 floating-point numbers — where similar words land in similar positions. We want the geometry of the vector space to reflect the meaning of the words.
But how do you teach a matrix of random numbers to capture meaning? You need a source of truth. Enter the distributional hypothesis.
Context Is Meaning
“You shall know a word by the company it keeps.” — J.R. Firth, 1957
This single sentence is the philosophical foundation of everything that follows. Consider: you have never seen the word “gloopy” before. But read this sentence: “The gloopy porridge stuck to the spoon.” Suddenly you know quite a lot — it's something thick, sticky, probably not a compliment. No dictionary required. Context alone delivered meaning.
The distributional hypothesis takes this further: words that appear in similar contexts tend to have similar meanings. “Cat” and “dog” both appear near “pet,” “fed,” “furry,” and “played with.” “King” and “queen” both appear near “ruled,” “throne,” and “kingdom.” If we could somehow convert these co-occurrence patterns into numbers, we'd have meaningful vectors.
That's exactly what Word2Vec does. No grammar rules. No human-curated dictionaries. Just a massive amount of text and a clever training trick.
The Clever Trick: Predict Your Neighbors
Word2Vec's skip-gram model has a beautifully simple objective: given a center word, predict the words that appear nearby.
Take the sentence “the cat sat on the mat” with a context window of 2. For each word, we look at the words within two positions on either side:
def generate_training_pairs(sentences, window=2):
    """For each center word, pair it with every context word in the window."""
    pairs = []
    for sentence in sentences:
        words = sentence.lower().split()
        for i, center in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if i != j:
                    pairs.append((center, words[j]))
    return pairs

pairs = generate_training_pairs(["the cat sat on the mat"])
for center, context in pairs[:8]:
    print(f" center={center:>5s} → context={context}")
# center= the → context=cat
# center= the → context=sat
# center= cat → context=the
# center= cat → context=sat
# center= cat → context=on
# center= sat → context=the
# center= sat → context=cat
# center= sat → context=on
Here's the key intuition: if “cat” and “dog” both frequently appear near words like “the,” “sat,” “pet,” and “played,” then to predict those same context words, they'll need similar internal representations. The prediction task is the pretext — the embeddings are the real prize.
A Lookup Table That Learns
The skip-gram “neural network” is surprisingly simple. It has one hidden layer and no activation function — just two weight matrices:
- W_center with shape (V, N) — one row per word in the vocabulary, each row N numbers long
- W_context with shape (V, N) — same size, but used for context words
Where V is vocabulary size and N is the embedding dimension we choose (typically 50–300).
vocab_size = 50 # number of unique words
embedding_dim = 20 # dimensions per embedding
# The only learnable parameters — initialized with small random values
W_center = np.random.randn(vocab_size, embedding_dim) * 0.1
W_context = np.random.randn(vocab_size, embedding_dim) * 0.1
# "Looking up" a word's embedding — just index the row
cat_idx = 7 # suppose "cat" is word #7
cat_embedding = W_center[cat_idx] # shape: (20,)
Here's the moment that makes it all click: remember one-hot encoding, where multiplying a one-hot vector by a matrix just selects one row? That's all that happens here. The “forward pass” of this network is a row lookup. No actual matrix multiplication needed.
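You can verify the equivalence in a couple of lines (a tiny matrix with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4)) * 0.1  # tiny (V=5, N=4) embedding matrix

one_hot = np.zeros(5)
one_hot[2] = 1.0  # "select word #2"

# Multiplying by a one-hot vector and indexing the row give the same result
via_matmul = one_hot @ W   # the "neural network" view
via_lookup = W[2]          # what implementations actually do

print(np.allclose(via_matmul, via_lookup))  # True
```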
The AHA moment: W_center has one row per word. Each row is that word's embedding. The entire prediction task exists only to give these rows a reason to organize themselves meaningfully. The prediction is the excuse — the embedding matrix is the product.
To predict whether a center word and context word belong together, we compute their dot product and push it through a sigmoid. High dot product = high probability they co-occur. Low dot product = probably not.
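In code, that scoring rule is two lines; the vectors below are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = np.array([0.5, -0.2, 0.8])   # made-up embeddings
v_context = np.array([0.4, -0.1, 0.9])  # similar direction → high score
v_random = np.array([-0.6, 0.3, -0.7])  # opposite direction → low score

print(sigmoid(np.dot(v_center, v_context)))  # well above 0.5
print(sigmoid(np.dot(v_center, v_random)))   # well below 0.5
```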
The Speed Hack: Negative Sampling
In theory, for each center word we should compute a probability distribution over the entire vocabulary using softmax. For a 50,000-word vocabulary, that means 50,000 exponentials and a normalizing sum — per training pair. With millions of pairs, this is computationally brutal.
Mikolov et al. introduced an elegant fix: negative sampling. Instead of “predict the right word out of 50,000,” reframe it as binary classification:
- “Is (cat, sat) a real co-occurrence pair?” → Yes (label 1)
- “Is (cat, refrigerator) a real pair?” → No (label 0)
For each positive pair, we sample K=5 random “negative” words from a noise distribution. Now we only update K+2 embedding rows — the center word, the real context word, and K negatives — instead of all V. The noise distribution uses word frequencies raised to the 3/4 power — a trick that boosts rare words (giving them more training signal) and dampens the most common ones:
def build_noise_distribution(word_counts):
    """Frequency^(3/4) — balances rare and common words."""
    counts = np.array(list(word_counts.values()), dtype=np.float64)
    powered = counts ** 0.75
    return powered / powered.sum()

def get_negative_samples(noise_dist, k=5, exclude=None):
    """Sample k words that probably aren't real context words."""
    negatives = []
    while len(negatives) < k:
        idx = np.random.choice(len(noise_dist), p=noise_dist)
        if idx != exclude:
            negatives.append(idx)
    return negatives
Why the 3/4 power? Consider a corpus where “the” appears 10,000 times and “giraffe” appears 10 times. Raw frequency would almost never sample “giraffe” as a negative. But 10000^0.75 ≈ 1000 while 10^0.75 ≈ 5.6 — the ratio shrinks from 1000:1 to about 178:1. Rare words get a meaningful chance of being sampled, which helps the model learn sharper distinctions.
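The arithmetic is easy to check:

```python
# Raw frequencies: "the" = 10,000, "giraffe" = 10
raw_ratio = 10000 / 10                  # 1000:1
damped_ratio = 10000**0.75 / 10**0.75   # ≈ 178:1 after the 3/4 power
print(raw_ratio, round(damped_ratio, 1))  # → 1000.0 177.8
```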
Surprisingly Simple Gradients
The forward pass for each training pair is: look up two embeddings, compute their dot product, push through sigmoid. The loss is binary cross-entropy with negative sampling:
J = −log σ(v_context · v_center) − Σ_k log σ(−v_neg_k · v_center)
The gradient has a beautiful form. For every pair — positive or negative — it's the same pattern:
gradient = (σ(score) − target) × other_vector
For a positive pair the target is 1, so the error is (σ − 1) — always negative, pushing the dot product up. For a negative pair the target is 0, so the error is just σ — always positive, pushing the dot product down. This is identical to logistic regression, and if you worked through the micrograd post, you'll recognize the chain rule at work:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -15, 15)))

def train_step(center_idx, context_idx, W_center, W_context,
               noise_dist, lr=0.05, neg_k=5):
    v_center = W_center[center_idx]
    v_context = W_context[context_idx]
    # --- Positive pair: push dot product UP ---
    score = np.dot(v_center, v_context)
    sig = sigmoid(score)
    loss = -np.log(sig + 1e-10)
    # Gradient: (σ - 1) × other_vector
    grad_center = (sig - 1) * v_context
    W_context[context_idx] -= lr * (sig - 1) * v_center
    # --- Negative pairs: push dot products DOWN ---
    for neg_idx in get_negative_samples(noise_dist, neg_k, exclude=context_idx):
        v_neg = W_context[neg_idx]
        score = np.dot(v_center, v_neg)
        sig = sigmoid(score)
        loss += -np.log(1 - sig + 1e-10)
        # Gradient: σ × other_vector
        grad_center += sig * v_neg
        W_context[neg_idx] -= lr * sig * v_center
    W_center[center_idx] -= lr * grad_center
    return loss
Each training step updates exactly K+2 rows: the center word, the context word, and K negative samples. The entire vocabulary of potentially thousands of words is untouched. This is what makes skip-gram with negative sampling fast enough to train on billions of words.
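To see the push-up/push-down dynamic in isolation, here is a minimal sketch that applies the same update rule to one positive pair and a few fixed negatives (fixed indices instead of sampled ones, purely for illustration) and confirms the positive pair's score rises:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -15, 15)))

rng = np.random.default_rng(42)
W_center = rng.standard_normal((10, 8)) * 0.1
W_context = rng.standard_normal((10, 8)) * 0.1

center, context, negatives = 0, 1, [2, 3, 4]  # fixed for illustration
lr = 0.1

before = np.dot(W_center[center], W_context[context])
for _ in range(50):
    vc = W_center[center].copy()
    # Positive pair: error (σ − 1) pushes the dot product up
    s = sigmoid(np.dot(vc, W_context[context]))
    grad = (s - 1) * W_context[context]
    W_context[context] -= lr * (s - 1) * vc
    # Negative pairs: error σ pushes the dot products down
    for n in negatives:
        s = sigmoid(np.dot(vc, W_context[n]))
        grad += s * W_context[n]
        W_context[n] -= lr * s * vc
    W_center[center] -= lr * grad
after = np.dot(W_center[center], W_context[context])

print(before, "→", after)  # the positive pair's score increases
```

Only rows 0 through 4 of the two matrices change; the other five rows are never touched, which is the whole point of the speed hack.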
The Full Implementation
Let's assemble all the pieces into a complete Word2Vec class. This is the entire algorithm — vocabulary building, training pair generation, negative sampling, and gradient descent — in one self-contained block:
class Word2Vec:
    def __init__(self, sentences, dim=20, window=2, neg_k=5, lr=0.05):
        # Build vocabulary and count word frequencies
        self.word2idx, self.idx2word, counts = {}, [], {}
        for s in sentences:
            for w in s.lower().split():
                counts[w] = counts.get(w, 0) + 1
                if w not in self.word2idx:
                    self.word2idx[w] = len(self.idx2word)
                    self.idx2word.append(w)
        V = len(self.idx2word)
        self.W_center = np.random.randn(V, dim) * 0.1
        self.W_context = np.random.randn(V, dim) * 0.1
        # Noise distribution: frequency^(3/4)
        freq = np.array([counts[w] for w in self.idx2word], dtype=np.float64)
        self.noise = freq ** 0.75
        self.noise /= self.noise.sum()
        # Pre-generate all (center, context) training pairs
        self.pairs = []
        for s in sentences:
            ids = [self.word2idx[w] for w in s.lower().split()]
            for i, c in enumerate(ids):
                for j in range(max(0, i-window), min(len(ids), i+window+1)):
                    if i != j:
                        self.pairs.append((c, ids[j]))
        self.lr, self.neg_k = lr, neg_k

    def _sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -15, 15)))

    def train(self, epochs=1000):
        losses = []
        for epoch in range(epochs):
            np.random.shuffle(self.pairs)
            total = 0.0
            for ci, co in self.pairs:
                vc = self.W_center[ci].copy()
                # Positive
                s = self._sigmoid(np.dot(vc, self.W_context[co]))
                total += -np.log(s + 1e-10)
                grad = (s - 1) * self.W_context[co]
                self.W_context[co] -= self.lr * (s - 1) * vc
                # Negatives
                negs = np.random.choice(len(self.idx2word), self.neg_k, p=self.noise)
                for ni in negs:
                    s = self._sigmoid(np.dot(vc, self.W_context[ni]))
                    total += -np.log(1 - s + 1e-10)
                    grad += s * self.W_context[ni]
                    self.W_context[ni] -= self.lr * s * vc
                self.W_center[ci] -= self.lr * grad
            losses.append(total / len(self.pairs))
            if epoch % 200 == 0:
                print(f"Epoch {epoch:4d} | Loss: {losses[-1]:.4f}")
        return losses

    def most_similar(self, word, k=5):
        v = self.W_center[self.word2idx[word]]
        norms = np.linalg.norm(self.W_center, axis=1)
        sims = (self.W_center @ v) / (norms * np.linalg.norm(v) + 1e-10)
        top = np.argsort(-sims)[1:k+1]
        return [(self.idx2word[i], f"{sims[i]:.3f}") for i in top]
Let's train it on a curated corpus with clear semantic clusters:
corpus = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "king and queen are royalty",
    "the man and woman walked together",
    "a boy and girl played together",
    "the cat sat on the mat",
    "the dog ran in the park",
    "cat and dog are pets",
    "a bird flew over the tree",
    "fish swam in the river",
    "bird and fish are animals",
    "the river flows past the tree",
    "sun shone over the park",
    "bread and cheese for lunch",
    "rice and soup for dinner",
    "the king ate bread and cheese",
    "the queen ate rice and soup",
    "the boy played with the dog",
    "the girl played with the cat",
    "the man walked in the park",
    "the woman walked by the river",
]
model = Word2Vec(corpus, dim=20, window=2, neg_k=5, lr=0.05)
model.train(epochs=1000)
Epoch 0 | Loss: 2.5139
Epoch 200 | Loss: 1.6253
Epoch 400 | Loss: 1.3018
Epoch 600 | Loss: 1.1672
Epoch 800 | Loss: 1.0815
The loss drops steadily as the random embeddings organize themselves. Now the moment of truth — do similar words have similar vectors?
print(model.most_similar("king"))
# [('queen', '0.872'), ('ruled', '0.741'), ('kingdom', '0.693'), ...]
print(model.most_similar("cat"))
# [('dog', '0.845'), ('bird', '0.712'), ('fish', '0.668'), ...]
print(model.most_similar("bread"))
# [('cheese', '0.823'), ('rice', '0.756'), ('soup', '0.701'), ...]
From random noise to semantic understanding. “King” learned that it's most like “queen” — not because anyone labeled them as related, but because they appear in nearly identical contexts. The prediction task was the pretext; the meaningful geometry was the emergent reward.
Try It: Watch Words Learn
Interactive: 2D Embedding Space
This runs Word2Vec training live in your browser. Each dot is a word, projected from 20 dimensions down to 2 using PCA. Click Train to watch random noise organize into clusters of meaning. Words are colored by semantic category.
Try training multiple times — each run starts from different random noise and finds a different arrangement, but the clusters remain consistent. Animals land near animals, royalty near royalty. The specific coordinates change; the semantic structure doesn't. That's the signal emerging from the noise.
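If you want the same 2-D view offline, a minimal PCA via SVD does the projection. This is a sketch: the random matrix below stands in for a trained W_center.

```python
import numpy as np

def pca_2d(embeddings):
    """Project (V, N) embeddings to (V, 2) along the top two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

# Stand-in for a trained W_center matrix (random here for illustration)
W = np.random.randn(50, 20) * 0.1
coords = pca_2d(W)
print(coords.shape)  # (50, 2) — one (x, y) point per word
```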
The Famous Analogy
The most celebrated property of word embeddings is vector arithmetic on meaning. The idea: if “king” and “queen” differ mainly by gender, and “man” and “woman” differ in the same way, then:
king − man + woman ≈ queen
This works because training encodes relationships as consistent directions in the embedding space. The “gender” direction is roughly the same vector whether applied to royalty or commoners. Let's implement the test:
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)

def analogy(model, a, b, c, k=3):
    """a is to b as c is to ???"""
    va = model.W_center[model.word2idx[a]]
    vb = model.W_center[model.word2idx[b]]
    vc = model.W_center[model.word2idx[c]]
    target = vb - va + vc  # the vector arithmetic
    exclude = {a, b, c}
    results = []
    for word, idx in model.word2idx.items():
        if word not in exclude:
            sim = cosine_sim(target, model.W_center[idx])
            results.append((word, sim))
    results.sort(key=lambda x: -x[1])
    return results[:k]
# king is to queen as man is to ???
print(analogy(model, "king", "queen", "man"))
# [('woman', 0.68), ('girl', 0.52), ...]
An important caveat: on our tiny 21-sentence corpus, analogies are approximate at best. The famous king − man + woman ≈ queen demo was trained on billions of words from Google News. With our toy data, we see the right general direction — “woman” often appears near the top — but it's noisy. The method is sound; the data is just too small for crisp results. Feed in a million sentences and the analogies sharpen dramatically.
Connecting the Dots: The Trilogy
The three Elementary AI posts form a stack that mirrors how real language models work:
- Micrograd — Numbers learn to compute their own derivatives. This is the engine of training: automatic differentiation, backpropagation, gradient descent.
- Embeddings (this post) — Numbers learn to represent meaning. This is the fuel: dense vector representations where geometry encodes semantics.
- Attention — Numbers learn which other numbers matter. This is the steering: the mechanism that lets every token attend to every other token based on relevance.
In a real Transformer like GPT or Claude, all three work together: embeddings are the input to attention layers, and gradients (backpropagation) are how you train the whole thing. The dot product — the operation at the heart of Word2Vec's similarity computation — is the same dot product that drives attention's query-key scoring. It appears everywhere because it's the fundamental operation of meaning: how much do these two vectors point in the same direction?
We started with random noise and ended with a map of meaning. No rules. No grammar. No dictionaries. Just context and gradients.
References & Further Reading
- Mikolov et al. — “Efficient Estimation of Word Representations in Vector Space” (2013) — The original Word2Vec paper introducing CBOW and skip-gram.
- Mikolov et al. — “Distributed Representations of Words and Phrases and their Compositionality” (2013) — Introduces negative sampling and the frequency^(3/4) trick.
- Goldberg & Levy — “word2vec Explained” (2014) — The clearest mathematical derivation of the skip-gram objective.
- Rong — “word2vec Parameter Learning Explained” (2014) — Detailed gradient derivations for every Word2Vec variant.
- Jay Alammar — “The Illustrated Word2Vec” — The gold standard visual explanation of Word2Vec.
- J.R. Firth (1957) — “A synopsis of linguistic theory 1930–1955” — The origin of “You shall know a word by the company it keeps.”