
Knowledge Graphs from Scratch

The Graph of All Knowledge

Google "who directed Inception" and a Knowledge Panel appears instantly: Christopher Nolan, his photo, filmography, and related people. Google didn't scan web pages for your answer. It looked up a fact in a structured knowledge graph — billions of triples encoding a machine-readable model of real-world knowledge.

A knowledge graph stores facts as (head, relation, tail) triples. (Marie_Curie, won, Nobel_Prize_Physics) is one fact. (Nobel_Prize_Physics, field_of, Physics) is another. Chain them together and you can answer "What field did Curie's prize belong to?" by traversing two edges. Entities become nodes. Relations become labeled, directed edges. Every edge is a verifiable fact.

The major knowledge graphs are staggering in scale: Wikidata holds over a hundred million items, DBpedia extracts structured facts from every Wikipedia page, and Google's Knowledge Graph spans billions of entities and facts.

But here's the problem: even the largest knowledge graphs are vastly incomplete. In Freebase, an estimated 71% of people were missing a place of birth and 75% a nationality (Dong et al., 2014). Under the open-world assumption, a missing triple doesn't mean the fact is false — it just hasn't been added yet. This motivates the central question: can we learn to predict missing facts from the existing ones?

from collections import defaultdict

# A knowledge graph: list of (head, relation, tail) triples
triples = [
    ("Einstein", "born_in", "Ulm"),
    ("Einstein", "field", "Physics"),
    ("Einstein", "won", "Nobel_Physics"),
    ("Curie", "born_in", "Warsaw"),
    ("Curie", "field", "Physics"),
    ("Curie", "field", "Chemistry"),
    ("Curie", "won", "Nobel_Physics"),
    ("Curie", "won", "Nobel_Chemistry"),
    ("Bohr", "born_in", "Copenhagen"),
    ("Bohr", "field", "Physics"),
    ("Bohr", "won", "Nobel_Physics"),
    ("Ulm", "located_in", "Germany"),
    ("Warsaw", "located_in", "Poland"),
    ("Copenhagen", "located_in", "Denmark"),
]

# Build adjacency: entity -> [(relation, neighbor)]
graph = defaultdict(list)
for h, r, t in triples:
    graph[h].append((r, t))

# One-hop: "What did Einstein win?"
wins = [t for r, t in graph["Einstein"] if r == "won"]
print(f"Einstein won: {wins}")
# -> Einstein won: ['Nobel_Physics']

# Two-hop: "What country was Einstein born in?"
cities = [t for r, t in graph["Einstein"] if r == "born_in"]
countries = [t for c in cities for r, t in graph[c] if r == "located_in"]
print(f"Einstein's birth country: {countries}")
# -> Einstein's birth country: ['Germany']

With 14 triples, we already answer multi-hop questions. Real knowledge graphs have billions — but the data structure is the same. The challenge is filling in the gaps.


TransE — Relations as Translations

The most elegant idea in knowledge graph embeddings comes from Bordes et al. (2013): represent entities and relations as vectors, then train so that for every true triple (h, r, t), the equation h + r ≈ t holds.

Think of it like Word2Vec's famous analogy: vec(king) - vec(man) + vec(woman) ≈ vec(queen). TransE applies the same logic to graph relations. The vector for "capital_of" is a consistent offset from every city to its country: vec(Berlin) + vec(capital_of) ≈ vec(Germany) and vec(Paris) + vec(capital_of) ≈ vec(France).

Training uses a margin-based ranking loss. For each true triple, create a corrupted triple by randomly replacing head or tail. Then push the true triple's distance down and the corrupted one's distance up:

L = Σ max(0, γ + d(h + r, t) - d(h' + r, t'))

Here γ is a margin and d is L2 distance. The model learns embeddings where true facts have small translation distance and false facts have large distance.

TransE has one notable limitation: relations that aren't one-to-one. If Einstein and Curie both have field = Physics, then vec(Einstein) + vec(field) ≈ vec(Physics) and vec(Curie) + vec(field) ≈ vec(Physics), which forces vec(Einstein) ≈ vec(Curie). Extensions like TransR and RotatE address this, but TransE remains the foundational building block.
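To see how a projection-based extension relaxes this, here is a minimal sketch of TransR-style scoring. The dimensions, random matrices, and lack of norm constraints are illustrative assumptions, not the paper's setup: entities are mapped into a relation-specific space before the translation check, and because that projection can lose information, distinct entities may behave identically under one relation without their entity-space vectors collapsing.

```python
import numpy as np

def score_transr(h, r, t, M_r):
    """TransR sketch: project entities into a relation-specific
    space with M_r, then apply TransE's translation check there."""
    return -np.linalg.norm(M_r @ h + r - M_r @ t)

rng = np.random.RandomState(0)
d_ent, d_rel = 6, 3
M_r = rng.randn(d_rel, d_ent)            # relation-specific projection
h, t = rng.randn(d_ent), rng.randn(d_ent)
r = rng.randn(d_rel)

# Because M_r is 3x6 it has a null space: add a null-space vector
# to h and the score is unchanged -- two distinct entities, same
# behavior under this relation, no forced collapse in entity space.
_, _, Vt = np.linalg.svd(M_r)
null_vec = Vt[-1]                        # maps to ~0 under M_r
h2 = h + 2.0 * null_vec
print(np.isclose(score_transr(h, r, t, M_r),
                 score_transr(h2, r, t, M_r)))  # -> True
```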

import numpy as np

def train_transe(triples, entities, relations, dim=20, lr=0.01,
                 margin=1.0, epochs=500):
    """Train TransE: learn embeddings where h + r ≈ t."""
    rng = np.random.RandomState(42)
    ent_emb = rng.randn(len(entities), dim)
    ent_emb /= np.linalg.norm(ent_emb, axis=1, keepdims=True)
    rel_emb = rng.randn(len(relations), dim)
    rel_emb /= np.linalg.norm(rel_emb, axis=1, keepdims=True)

    ent2id = {e: i for i, e in enumerate(entities)}
    rel2id = {r: i for i, r in enumerate(relations)}
    ent_list = list(entities)

    for epoch in range(epochs):
        for h, r, t in triples:
            hi, ri, ti = ent2id[h], rel2id[r], ent2id[t]
            # Corrupt: replace head or tail with a random entity
            # (occasionally this yields a true triple; real
            # implementations filter those out)
            ci = ent2id[rng.choice(ent_list)]
            corrupt_head = rng.rand() < 0.5
            pos = ent_emb[hi] + rel_emb[ri] - ent_emb[ti]
            if corrupt_head:
                neg = ent_emb[ci] + rel_emb[ri] - ent_emb[ti]
            else:
                neg = ent_emb[hi] + rel_emb[ri] - ent_emb[ci]
            d_pos, d_neg = np.linalg.norm(pos), np.linalg.norm(neg)

            loss = max(0, margin + d_pos - d_neg)
            if loss > 0:
                # Pull the true triple together...
                g_pos = pos / (d_pos + 1e-8)
                ent_emb[hi] -= lr * g_pos
                rel_emb[ri] -= lr * g_pos
                ent_emb[ti] += lr * g_pos
                # ...and push the corrupted triple apart
                g_neg = neg / (d_neg + 1e-8)
                rel_emb[ri] += lr * g_neg
                if corrupt_head:
                    ent_emb[ci] += lr * g_neg
                    ent_emb[ti] -= lr * g_neg
                else:
                    ent_emb[hi] += lr * g_neg
                    ent_emb[ci] -= lr * g_neg

        ent_emb /= np.linalg.norm(ent_emb, axis=1, keepdims=True)

    return ent_emb, rel_emb, ent2id, rel2id

# Derive the vocabularies from the data, and train on all 14 triples
# so that located_in (needed for the multi-hop queries later) is
# learned too. Sorting keeps the id assignment deterministic.
entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
rels = sorted({r for _, r, _ in triples})
ent_emb, rel_emb, ent2id, rel2id = train_transe(triples, entities, rels)

# Test: h + r should land closest to correct tail
h_vec = ent_emb[ent2id["Einstein"]] + rel_emb[rel2id["won"]]
for e in ["Nobel_Physics", "Nobel_Chemistry", "Warsaw"]:
    d = np.linalg.norm(h_vec - ent_emb[ent2id[e]])
    print(f"  d(Einstein+won, {e}) = {d:.3f}")
# Nobel_Physics should have smallest distance

The training loop is remarkably simple: for each triple, create a negative, compute the margin loss, nudge embeddings via gradient descent. After training, vec(Einstein) + vec(won) lands closest to vec(Nobel_Physics).

Bilinear Models — DistMult and ComplEx

TransE treats relations as translations. An alternative family — bilinear models — treats them as multiplicative interactions.

DistMult (Yang et al., 2015) scores a triple using an element-wise product:

score(h, r, t) = Σᵢ hᵢ · rᵢ · tᵢ

Each dimension captures one latent factor. The relation vector r amplifies dimensions that matter and suppresses irrelevant ones — equivalent to the bilinear form hᵀ diag(r) t.

But DistMult has a fatal flaw: its scoring is symmetric. Since multiplication commutes, score(h, r, t) = score(t, r, h). DistMult thinks (Paris, capital_of, France) equals (France, capital_of, Paris).

ComplEx (Trouillon et al., 2016) fixes this with complex-valued embeddings. Each entity and relation lives in ℂ^d:

score(h, r, t) = Re(Σᵢ hᵢ · rᵢ · t̄ᵢ)

The complex conjugate breaks symmetry: score(h, r, t) ≠ score(t, r, h). Real parts capture symmetric patterns, imaginary parts capture asymmetric ones. Set all imaginary components to zero and ComplEx reduces to DistMult.

def score_distmult(h, r, t):
    """DistMult: element-wise product, then sum. Symmetric."""
    return np.sum(h * r * t)

def score_complex(h_re, h_im, r_re, r_im, t_re, t_im):
    """ComplEx: uses complex conjugate of tail. Asymmetric."""
    # Re(h * r * conj(t))
    return (np.sum(h_re * r_re * t_re) + np.sum(h_re * r_im * t_im)
          + np.sum(h_im * r_re * t_im) - np.sum(h_im * r_im * t_re))

rng = np.random.RandomState(7)
dim = 10

# DistMult: same score both directions
paris, france, capital = rng.randn(dim), rng.randn(dim), rng.randn(dim)
fwd = score_distmult(paris, capital, france)
bwd = score_distmult(france, capital, paris)
print(f"DistMult (Paris, capital_of, France): {fwd:.3f}")
print(f"DistMult (France, capital_of, Paris): {bwd:.3f}")
print(f"Same? {abs(fwd - bwd) < 1e-10}")  # True!

# ComplEx: different scores (correctly asymmetric)
p_re, p_im = rng.randn(dim), rng.randn(dim)
f_re, f_im = rng.randn(dim), rng.randn(dim)
c_re, c_im = rng.randn(dim), rng.randn(dim)
fwd = score_complex(p_re, p_im, c_re, c_im, f_re, f_im)
bwd = score_complex(f_re, f_im, c_re, c_im, p_re, p_im)
print(f"\nComplEx (Paris, capital_of, France): {fwd:.3f}")
print(f"ComplEx (France, capital_of, Paris): {bwd:.3f}")
print(f"Same? {abs(fwd - bwd) < 1e-10}")  # False!

Even with random embeddings, ComplEx produces different scores for the two directions. After training, the correct direction scores high and the reverse scores low. DistMult can never learn this — it's structurally blind to direction.
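The reduction to DistMult is just as easy to verify numerically: with every imaginary component zero, the conjugate does nothing, and the two scores agree exactly. A self-contained check:

```python
import numpy as np

# With all-real embeddings, conj(t) = t, so
# Re(sum(h * r * conj(t))) = sum(h * r * t): ComplEx == DistMult.
rng = np.random.RandomState(0)
h, r, t = rng.randn(8), rng.randn(8), rng.randn(8)

distmult = np.sum(h * r * t)
h_c, r_c, t_c = h + 0j, r + 0j, t + 0j   # zero imaginary parts
complex_score = np.real(np.sum(h_c * r_c * np.conj(t_c)))
print(abs(distmult - complex_score) < 1e-12)  # -> True
```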


Link Prediction and Evaluation

The core task for KG embeddings is link prediction: given an incomplete graph, predict which missing triples are likely true. For (Einstein, won, ?), rank all entities by their embedding score and check where the correct answer falls.

The evaluation protocol: for each test triple (h, r, t), replace the tail with every entity, score them all, and record the rank of the true tail. Three standard metrics summarize these ranks: Mean Rank (MR), the average rank of the true entity; Mean Reciprocal Rank (MRR), the average of 1/rank; and Hits@k, the fraction of test triples whose true entity ranks in the top k.

A critical detail: the filtered setting removes other known true triples from the ranking. If both (Einstein, won, Nobel_Physics) and (Einstein, won, Nobel_Chemistry) are true, don't penalize the model for ranking Nobel_Chemistry highly when testing Nobel_Physics.

On standard benchmarks, the best models achieve MRR ~0.35 on FB15k-237 (Freebase, 237 relations, ~310K triples) and ~0.49 on WN18RR (WordNet, 11 relations, ~93K triples).

def evaluate_link_prediction(ent_emb, rel_emb, ent2id, rel2id,
                              test_triples, all_true):
    """Compute MR, MRR, Hits@1/3/10 with filtered ranking."""
    ranks = []
    n_ent = len(ent2id)
    id2ent = {v: k for k, v in ent2id.items()}

    for h, r, t in test_triples:
        hi, ri, ti = ent2id[h], rel2id[r], ent2id[t]
        # Score all entities as possible tails (TransE)
        scores = np.zeros(n_ent)
        for ei in range(n_ent):
            scores[ei] = -np.linalg.norm(
                ent_emb[hi] + rel_emb[ri] - ent_emb[ei])

        # Filter: mask other true tails
        for ei in range(n_ent):
            if ei != ti and (h, r, id2ent[ei]) in all_true:
                scores[ei] = -1e9

        rank = 1 + np.sum(scores > scores[ti])
        ranks.append(rank)

    ranks = np.array(ranks)
    print(f"MR: {np.mean(ranks):.1f} | MRR: {np.mean(1.0/ranks):.3f} | "
          f"H@1: {np.mean(ranks <= 1):.2f} | "
          f"H@3: {np.mean(ranks <= 3):.2f} | "
          f"H@10: {np.mean(ranks <= 10):.2f}")

MRR is especially useful: it's bounded between 0 and 1 and rewards top-ranked predictions heavily — rank 1 contributes 1.0, rank 2 contributes 0.5, rank 100 contributes only 0.01.
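The contrast with Mean Rank is worth seeing once in numbers: a single terrible rank dominates MR, while MRR stays anchored by the top-ranked hits.

```python
import numpy as np

ranks = np.array([1, 2, 100])        # ranks of three test triples
mr = np.mean(ranks)                  # (1 + 2 + 100) / 3
mrr = np.mean(1.0 / ranks)           # (1.0 + 0.5 + 0.01) / 3
print(f"MR = {mr:.1f}, MRR = {mrr:.3f}")  # -> MR = 34.3, MRR = 0.503
```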

Multi-Hop Reasoning

Real questions rarely involve a single edge. "Where was the director of Inception educated?" requires two hops: (Inception, directed_by, Nolan) followed by (Nolan, educated_at, UCL). Can we compose relation embeddings to answer without explicit traversal?

With TransE, the idea is natural: if h + r1 ≈ mid and mid + r2 ≈ t, then h + r1 + r2 ≈ t. The sum of relation vectors creates a "virtual" two-hop relation. More sophisticated approaches like Query2Box (Ren et al., 2020) represent entire queries as hyperrectangles in embedding space — entities inside the box are answers, and intersecting boxes computes logical AND.

def multi_hop_query(ent_emb, rel_emb, ent2id, rel2id,
                    start, relation_path):
    """Answer multi-hop queries by composing TransE translations."""
    id2ent = {v: k for k, v in ent2id.items()}

    # Compose: start entity + sum of relation vectors
    composed = ent_emb[ent2id[start]].copy()
    for rel in relation_path:
        composed += rel_emb[rel2id[rel]]

    # Rank all entities by distance
    distances = np.linalg.norm(ent_emb - composed, axis=1)
    ranked = np.argsort(distances)

    path_str = " -> ".join(relation_path)
    print(f"Query: {start} -> {path_str} -> ?")
    for i in range(3):
        eid = ranked[i]
        print(f"  #{i+1}: {id2ent[eid]} (d={distances[eid]:.3f})")

# "What country was Einstein born in?" = born_in + located_in
multi_hop_query(ent_emb, rel_emb, ent2id, rel2id,
                "Einstein", ["born_in", "located_in"])
# Should rank 'Germany' highest

We never explicitly traversed the graph. Two relation vectors composed into a virtual two-hop relation, and the nearest entity in embedding space is the answer. This generalizes to paths that don't even exist in the training data — the model infers from geometric structure.
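The Query2Box idea can also be sketched in a few lines. The containment rule and the intersection below are my simplification: the real model learns centers and offsets end-to-end and intersects boxes with an attention mechanism, but the geometry is the same — a query is an axis-aligned box, entities inside it are answers, and logical AND shrinks the box to the overlap.

```python
import numpy as np

def in_box(entity, center, offset):
    """An entity answers the query if it lies inside the box."""
    return bool(np.all(np.abs(entity - center) <= offset))

def intersect(c1, o1, c2, o2):
    """Logical AND of two queries: intersect their boxes."""
    lo = np.maximum(c1 - o1, c2 - o2)
    hi = np.minimum(c1 + o1, c2 + o2)
    center = (lo + hi) / 2
    offset = np.maximum((hi - lo) / 2, 0)  # disjoint boxes clamp to 0
    return center, offset

# Toy check in 2-D: entity at the overlap of two query boxes
e = np.array([0.5, 0.5])
c, o = intersect(np.zeros(2), np.ones(2),   # box [-1, 1]^2
                 np.ones(2), np.ones(2))    # box [0, 2]^2
print(in_box(e, c, o))  # -> True: e lies in the overlap [0, 1]^2
```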

Knowledge Graphs Meet LLMs

LLMs hallucinate facts. Knowledge graphs have verified facts but rigid structure and limited coverage. Combining them creates systems that are both fluent and factual.

GraphRAG (Microsoft, 2024) takes this furthest: instead of retrieving text chunks, it builds a knowledge graph from the document corpus, uses community detection for hierarchical summaries, and retrieves structured subgraphs as multi-hop context. For synthesis queries spanning many documents, GraphRAG dramatically outperforms vector-based RAG.

The pipeline also runs in reverse: LLMs can construct knowledge graphs from text, extracting triples far faster than manual curation:

import re

def extract_triples(sentences):
    """Extract (subject, predicate, object) triples from text."""
    patterns = [
        (r"(\w+) was born in (\w+)",
         lambda m: (m.group(1), "born_in", m.group(2))),
        (r"(\w+) discovered (\w+)",
         lambda m: (m.group(1), "discovered", m.group(2))),
        (r"(\w+) works at (\w+)",
         lambda m: (m.group(1), "works_at", m.group(2))),
        (r"(\w+) is a (\w+)",
         lambda m: (m.group(1), "is_a", m.group(2))),
    ]
    extracted = []
    for sent in sentences:
        for pattern, extractor in patterns:
            match = re.search(pattern, sent)
            if match:
                extracted.append(extractor(match))
    return extracted

sentences = [
    "Einstein was born in Ulm",
    "Einstein discovered Relativity",
    "Relativity is a Theory",
    "Curie discovered Radium",
    "Curie works at Sorbonne",
]

kg = extract_triples(sentences)
for h, r, t in kg:
    print(f"  ({h}, {r}, {t})")

# Two-hop: "What type of thing did Einstein discover?"
graph = defaultdict(list)
for h, r, t in kg:
    graph[h].append((r, t))

mid = [t for r, t in graph["Einstein"] if r == "discovered"]
answer = [t for m in mid for r, t in graph[m] if r == "is_a"]
print(f"\nEinstein discovered a {answer[0]}.")
# -> Einstein discovered a Theory.

In production, you'd replace regex with dependency parsing or LLM-based extraction. The resulting knowledge graph serves as a factual backbone for RAG — grounding responses in verified triples rather than hallucinations.
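As a sketch of what LLM-based extraction might look like (the `llm_complete` callable, the prompt wording, and the JSON output format are all hypothetical placeholders, not a real API):

```python
import json

def extract_triples_llm(text, llm_complete):
    """Sketch of LLM-based triple extraction. `llm_complete` is a
    hypothetical callable (prompt -> completion string); swap in
    any real client. The JSON-list reply format is an assumption."""
    prompt = (
        "Extract (head, relation, tail) facts from the text below. "
        "Reply with a JSON list of [head, relation, tail] triples.\n\n"
        f"Text: {text}"
    )
    reply = llm_complete(prompt)
    return [tuple(t) for t in json.loads(reply)]

# Stub standing in for a real model, just to show the plumbing:
fake_llm = lambda prompt: '[["Curie", "born_in", "Warsaw"]]'
print(extract_triples_llm("Marie Curie was born in Warsaw.", fake_llm))
# -> [('Curie', 'born_in', 'Warsaw')]
```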

Conclusion

Knowledge graphs turn the messy world of facts into clean, queryable structure. We've built the full stack: from the basic (h, r, t) triple through TransE's translation geometry, DistMult's symmetric bilinear scoring, ComplEx's symmetry-breaking complex conjugate, to multi-hop reasoning by composing relation vectors.

The deeper insight is geometric: each embedding method imposes a different shape on fact-space. TransE carves parallel translation channels. DistMult creates symmetric correlation patterns. ComplEx adds rotational asymmetry through complex numbers. Choosing the right geometry for your domain is the art of knowledge graph engineering.

As LLMs continue to struggle with factual consistency, knowledge graphs are experiencing a renaissance — not as replacements for neural language models, but as the structured factual backbone that keeps them honest.
