
Tokenization from Scratch: How LLMs Read Text

The Invisible First Step

Every conversation with an LLM starts with a silent transformation. Before the model can think about your question, before attention heads weigh relevance, before embeddings encode meaning — your text has to become a sequence of integers. Not one integer per word. Not one per character. Something cleverer.

Type “I’m tokenizing this” into GPT-4, and behind the scenes it sees something like [40, 2846, 5765, 2995, 420]. The word “tokenizing” has been split into ["token", "izing"] — two subword pieces that each have their own integer ID. The model has never seen the word “tokenizing” as a unit, but it knows “token” and “izing” separately, so it can handle the combination just fine.

This splitting process is called tokenization, and the dominant algorithm behind it is Byte Pair Encoding (BPE). It’s the very first step in the pipeline we’ve been building in this series. In Embeddings, we turned words into meaningful vectors. In Attention, we taught those vectors to find relevance. In Micrograd, we built the training engine. But all of those assumed the text was already chopped into pieces. Today we build the chopper.

Two Extremes That Don’t Work

Before we get to BPE, let’s understand why it exists by trying the two obvious approaches and watching them fail.

Attempt 1: One character at a time

The simplest possible tokenizer splits text into individual characters:

def char_tokenize(text):
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_id[ch] for ch in text]

tokens = char_tokenize("the cat sat on the mat")
print(tokens)   # [9, 4, 3, 0, 2, 1, 9, 0, 8, 1, 9, 0, 7, 6, 0, 9, 4, 3, 0, 5, 1, 9]
print(f"Vocab size: {len(set('the cat sat on the mat'))}")  # 10

This has one beautiful property: a small, fixed vocabulary (under a hundred distinct characters for typical English text) means you'll essentially never encounter an unknown token. But there's a crushing cost — the sequence is enormous. That 6-word sentence produces 22 tokens. A typical paragraph would generate hundreds. The model has to figure out that the characters t, h, e appearing together mean “the” — spending precious attention capacity on reassembling words instead of understanding meaning.

Attempt 2: One word at a time

The opposite extreme splits on whitespace:

def word_tokenize(text):
    words = text.split()
    vocab = sorted(set(words))
    word_to_id = {w: i for i, w in enumerate(vocab)}
    return [word_to_id[w] for w in words]

tokens = word_tokenize("the cat sat on the mat")
print(tokens)   # [4, 0, 3, 2, 4, 1]
print(f"Vocab size: {len(set('the cat sat on the mat'.split()))}")  # 5

Now our 6-word sentence is a tight 6 tokens. Each token carries meaning. But what happens when the model encounters “tokenizing” and it wasn’t in the training data? It maps to a special [UNK] token and all information is lost. To cover a real language, you’d need a vocabulary of over 100,000 words — and you’d still hit unknowns on novel words, typos, names, and technical jargon.
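To see the failure mode explicitly, here's a small sketch (the [UNK] fallback is my own illustration, not code from this series) of a word tokenizer that reserves an extra id for anything outside its vocabulary:

```python
# Sketch: a word-level tokenizer with an explicit [UNK] id for
# out-of-vocabulary words. (Illustrative; the names are my own.)
def word_tokenize_unk(text, vocab):
    word_to_id = {w: i for i, w in enumerate(sorted(vocab))}
    unk_id = len(word_to_id)  # reserve the last id for [UNK]
    return [word_to_id.get(w, unk_id) for w in text.split()]

vocab = {"the", "cat", "sat", "on", "mat"}
print(word_tokenize_unk("the cat sat on the hat", vocab))
# [4, 0, 3, 2, 4, 5] ... "hat" collapses to the [UNK] id 5
```

Whatever "hat" meant is gone: every unseen word, typo, and name maps to the same id 5.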

The dilemma: characters give you infinite coverage but painfully long sequences. Words give you compact sequences but can’t handle the unexpected. We need something in between.

The Key Insight: Let the Data Decide

In 1994, Philip Gage published a short paper describing a compression algorithm called Byte Pair Encoding. The idea was disarmingly simple: scan your data for the most frequently occurring pair of adjacent bytes, replace every occurrence with a new symbol, and repeat. Each replacement shortens the data by removing one symbol per occurrence of that pair.
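Gage's replacement loop can be sketched in a few lines. This is a simplified illustration with assumptions of mine: symbols are Python strings rather than raw bytes, the "new symbol" is just the concatenated pair, and it performs a single pass where the original repeated until no pair occurred twice:

```python
# A simplified sketch of Gage's 1994 byte-pair replacement step.
from collections import Counter

def compress_once(data):
    """Replace every occurrence of the most frequent adjacent pair."""
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return data, None  # no pair repeats; replacing gains nothing
    merged = a + b         # stand-in for Gage's unused byte value
    out, i = [], 0
    while i < len(data):
        if i < len(data) - 1 and data[i] == a and data[i + 1] == b:
            out.append(merged)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out, (a, b)

data, pair = compress_once(list("abababcab"))
print(pair, data)  # ('a', 'b') ['ab', 'ab', 'ab', 'c', 'ab']
```

Nine symbols become five. Run it again and the sequence keeps shrinking until no pair repeats.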

Twenty-two years later, Rico Sennrich and colleagues realized this same idea could solve the tokenization problem for neural machine translation. Instead of compressing bytes, they compressed characters within words. Common sequences like “th”, “ing”, and “tion” would naturally merge into single tokens. Whole common words like “the” would eventually become single tokens too. But rare words would stay split into recognizable pieces.

The result is a vocabulary where:

  - the most common words (like “the”) become single tokens,
  - frequent subword chunks (“th”, “ing”, “tion”) get their own tokens,
  - rare words stay split into a few recognizable pieces,
  - and nothing has to map to an unknown token.

This is the sweet spot. Let’s build it.

BPE: The Algorithm, Step by Step

BPE has two phases: training (learning which pairs to merge) and encoding (applying those merges to new text). Let’s walk through training with a concrete example.

Imagine our training corpus contains these words with their frequencies:

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

We start by splitting every word into individual characters and collecting all unique characters as our initial vocabulary:

# Split words into character sequences (with frequencies)
# "hug" (10 times) -> ["h", "u", "g"]
# "pug" (5 times)  -> ["p", "u", "g"]
# "pun" (12 times) -> ["p", "u", "n"]
# "bun" (4 times)  -> ["b", "u", "n"]
# "hugs" (5 times) -> ["h", "u", "g", "s"]

initial_vocab = ["b", "g", "h", "n", "p", "s", "u"]  # 7 characters

Now we count every adjacent pair, weighted by word frequency:
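Computing those counts takes a few lines with collections.Counter; here is a quick sketch for this corpus:

```python
from collections import Counter

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in corpus}

# Count adjacent symbol pairs, weighted by each word's frequency
pair_counts = Counter()
for word, freq in corpus.items():
    symbols = splits[word]
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

for pair, count in pair_counts.most_common():
    print(pair, count)
# ('u', 'g') 20    <- hug (10) + pug (5) + hugs (5)
# ('p', 'u') 17    <- pug (5) + pun (12)
# ('u', 'n') 16    <- pun (12) + bun (4)
# ('h', 'u') 15    <- hug (10) + hugs (5)
# ('g', 's') 5     <- hugs (5)
# ('b', 'u') 4     <- bun (4)
```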

Merge 1: The most frequent pair is ("u", "g") with count 20. We merge it into a new token "ug" and replace everywhere:

# After merge 1: ("u", "g") -> "ug"
# "hug"  -> ["h", "ug"]       (was ["h", "u", "g"])
# "pug"  -> ["p", "ug"]       (was ["p", "u", "g"])
# "pun"  -> ["p", "u", "n"]   (unchanged - no "u","g" pair)
# "bun"  -> ["b", "u", "n"]   (unchanged)
# "hugs" -> ["h", "ug", "s"]  (was ["h", "u", "g", "s"])

vocab = ["b", "g", "h", "n", "p", "s", "u", "ug"]  # 8 tokens

Merge 2: Recount pairs. After the first merge, the most frequent remaining pair is ("u", "n") with count 16 (12 from “pun” plus 4 from “bun”). Merge it:

# After merge 2: ("u", "n") -> "un"
# "pun" -> ["p", "un"]
# "bun" -> ["b", "un"]

vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]  # 9 tokens

Merge 3: Now ("h", "ug") appears 15 times (in “hug” and “hugs”). Merge it into "hug":

# After merge 3: ("h", "ug") -> "hug"
# "hug"  -> ["hug"]           (fully compressed!)
# "hugs" -> ["hug", "s"]

vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]  # 10 tokens

Three merges and the word “hug” is already a single token. The algorithm naturally promoted the most common words to single-token status first.

Encoding new text is straightforward: split into characters, then apply the learned merges in order. To encode “bug”:

  1. Start: ["b", "u", "g"]
  2. Apply merge 1 (u + g → ug): ["b", "ug"]
  3. No more applicable merges.
  4. Result: ["b", "ug"] — the model has never seen “bug” as a whole word, but it recognizes both pieces.
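Those steps are just the merge list replayed over a character split. A minimal sketch, using the three merges learned above:

```python
# Replay learned BPE merges over a character-split word, in
# training order (a sketch of the encoding phase).
def apply_merges(word, merges):
    """Apply each (pair, merged) rule, in order."""
    symbols = list(word)
    for (a, b), merged in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [(("u", "g"), "ug"), (("u", "n"), "un"), (("h", "ug"), "hug")]
print(apply_merges("bug", merges))   # ['b', 'ug']
print(apply_merges("hugs", merges))  # ['hug', 's']
```

Note that merge order matters: "hugs" only compresses to ["hug", "s"] because ("u", "g") fires before ("h", "ug").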

Building BPE from Scratch in Python

Here’s a complete BPE tokenizer in about 70 lines. We pre-tokenize on whitespace, train by repeatedly merging the most common pair, and encode by replaying the merges in order:

from collections import Counter

class BPETokenizer:
    def __init__(self):
        self.merges = []          # ordered list of (pair, merged_token)
        self.vocab = {}           # token -> id

    def train(self, text, num_merges):
        """Learn `num_merges` merge rules from a training corpus."""
        # Pre-tokenize: split on whitespace, count word frequencies
        word_freqs = Counter(text.split())

        # Split each word into a list of characters
        splits = {word: list(word) for word in word_freqs}

        for i in range(num_merges):
            # Count every adjacent pair, weighted by word frequency
            pair_counts = Counter()
            for word, freq in word_freqs.items():
                symbols = splits[word]
                for j in range(len(symbols) - 1):
                    pair_counts[(symbols[j], symbols[j + 1])] += freq

            if not pair_counts:
                break

            # Find the most frequent pair
            best_pair = pair_counts.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            self.merges.append((best_pair, merged))

            # Apply this merge everywhere
            for word in splits:
                symbols = splits[word]
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == best_pair[0]
                            and symbols[j + 1] == best_pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                splits[word] = new_symbols

        # Build vocabulary: all unique tokens across all splits + base chars
        all_tokens = set()
        for symbols in splits.values():
            all_tokens.update(symbols)
        for (pair, merged) in self.merges:
            all_tokens.add(pair[0])
            all_tokens.add(pair[1])
            all_tokens.add(merged)
        self.vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

    def encode(self, text):
        """Encode a string into a list of token IDs."""
        tokens = []
        for word in text.split():
            symbols = list(word)
            # Replay merges in training order
            for (pair, merged) in self.merges:
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == pair[0]
                            and symbols[j + 1] == pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                symbols = new_symbols
            tokens.extend(symbols)
        # Note: a character never seen in training raises a KeyError
        # here; byte-level BPE (covered below) closes that gap.
        return [self.vocab[t] for t in tokens]

    def decode(self, ids):
        """Decode token IDs back to a string.

        Note: we pre-tokenized on whitespace, so word boundaries are
        lost; every token (including subword pieces) gets re-joined
        with a space. Production tokenizers track boundaries explicitly.
        """
        id_to_tok = {i: t for t, i in self.vocab.items()}
        return " ".join(id_to_tok[i] for i in ids)

Let’s train it on a small corpus and see what it learns:

text = ("the cat sat on the mat the cat and the bat "
        "the rat sat on the flat mat that the cat sat on")

tokenizer = BPETokenizer()
tokenizer.train(text, num_merges=15)

# What merges did it learn?
for i, (pair, merged) in enumerate(tokenizer.merges):
    print(f"  Merge {i+1}: '{pair[0]}' + '{pair[1]}' -> '{merged}'")

# Merge 1: 'a' + 't' -> 'at'    (count: 12 — cat, sat, mat, bat, rat, flat, that)
# Merge 2: 't' + 'h' -> 'th'    (count: 8 — from "the" x7 + "that" x1)
# Merge 3: 'th' + 'e' -> 'the'  (count: 7)
# Merge 4: 'c' + 'at' -> 'cat'  (count: 3)
# Merge 5: 's' + 'at' -> 'sat'  (count: 3)
# Merge 6: 'o' + 'n' -> 'on'    (count: 3)
# ...

# Encode new text
ids = tokenizer.encode("the cat sat on the mat")
print(f"Token IDs: {ids}")
# Each common word is now a single token!

# Encode unseen words (note: "bat" was in the training corpus, so by
# now it's already a single token -- we need genuinely new words)
ids = tokenizer.encode("hat chat")
# "hat"  -> ["h", "at"]       (never saw "hat", but knows "at")
# "chat" -> ["c", "h", "at"]  (built entirely from known pieces)

Notice how the algorithm discovered English morphology on its own. Our corpus is full of rhyming words (cat, sat, mat, bat, rat), so it merged a + t → at first — building the shared suffix that six different words have in common. Then it merged t + h → th and th + e → the, compressing the most frequent word into a single token. Nobody told it about these patterns — frequency did all the work.

Byte-Level BPE: The GPT-2 Innovation

Our character-level BPE has a subtle remaining problem: if the training data never included the character “ü” or “é” or the emoji “🚀”, we can’t tokenize text that contains them. In 2019, OpenAI’s GPT-2 paper introduced an elegant fix: start with bytes instead of characters.

Every string in every language can be encoded as a sequence of UTF-8 bytes — integers from 0 to 255. By using these 256 values as the base vocabulary, byte-level BPE can represent any text, in any language, including emojis, mathematical notation, and scripts it has never seen. Zero unknown tokens, ever.

# Character-level: "café" -> ['c', 'a', 'f', 'é']
# But what if 'é' wasn't in training data? -> [UNK]!

# Byte-level: "café" -> [99, 97, 102, 195, 169]
# 'é' is 2 bytes in UTF-8: [195, 169] — both in 0-255 range
# Every possible character decomposes into known bytes

text = "caf\u00e9 \U0001F680"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # [99, 97, 102, 195, 169, 32, 240, 159, 154, 128]
# 'c'=99, 'a'=97, 'f'=102, 'é'=[195,169], ' '=32, '🚀'=[240,159,154,128]
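The mapping is lossless, too: reassemble the byte values and UTF-8 decodes them straight back to the original string.

```python
# Round-trip check: the byte sequence from above decodes exactly.
data = bytes([99, 97, 102, 195, 169, 32, 240, 159, 154, 128])
print(data.decode("utf-8"))  # café 🚀
```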

GPT-2’s final vocabulary: 256 base byte tokens + 50,000 learned merges + 1 special end-of-text token = 50,257 tokens. Every single one of those 50,000 merges was learned from data — the algorithm decided which byte sequences appeared frequently enough to deserve their own token.

The Modern Landscape

Today’s production LLMs use highly optimized BPE implementations. Here’s how the vocabulary sizes have grown:

Model     Vocab Size   Tokenizer           Year
GPT-2     50,257       Byte-level BPE      2019
T5        32,000       SentencePiece       2019
LLaMA 2   32,000       SentencePiece BPE   2023
GPT-4     100,256      tiktoken (BPE)      2023
LLaMA 3   128,000      BPE                 2024
GPT-4o    ~200,000     tiktoken (BPE)      2024

Larger vocabularies mean more merges, which means better compression — more text fits into the same context window. GPT-4o’s 200K vocabulary represents the same English text in significantly fewer tokens than GPT-2’s 50K vocabulary, effectively giving it a larger context window for the same model architecture.

Two production-grade tokenizer libraries dominate: OpenAI's tiktoken, a fast BPE implementation covering the GPT encodings, and Hugging Face's tokenizers, a Rust-backed library used across the open-source model ecosystem.

Surprising gotchas

Tokenization has consequences most people don't expect:

  - Whitespace matters: “hello” and “ hello” are usually different tokens, so a stray leading space changes what the model sees.
  - Numbers fragment unpredictably: digits get grouped by frequency rather than place value, which is one reason LLMs stumble on basic arithmetic.
  - Non-English text costs more: the same sentence can take several times as many tokens in less-represented languages and scripts, eating both context window and API budget.


The Invisible Foundation

Tokenization is the lens through which every language model perceives text. It determines what a “word” is, how many tokens your prompt costs, how well the model handles non-English input, and even whether it can do basic arithmetic. It's arguably the most consequential preprocessing step in deep learning — and it runs on an algorithm from a 1994 data compression paper.

BPE’s genius is letting frequency do all the work. No linguistic rules, no hand-crafted dictionaries — just count pairs and merge the winners. The result is a vocabulary that naturally captures the statistical structure of language: common patterns compress into single tokens, while rare patterns stay decomposed into recognizable pieces.

With this post, we’ve now built every major piece of the modern LLM pipeline from scratch. Tokenization chops text into subwords. Embeddings turn those subwords into vectors with meaning. Attention lets those vectors communicate and find relevance. Autograd trains the whole thing end to end. Four posts, four building blocks, one complete foundation.

References & Further Reading

  - Philip Gage, “A New Algorithm for Data Compression,” The C Users Journal (1994). The original byte pair encoding paper.
  - Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural Machine Translation of Rare Words with Subword Units,” ACL (2016). The paper that brought BPE to NLP.
  - Alec Radford et al., “Language Models are Unsupervised Multitask Learners” (2019). The GPT-2 paper, which introduced byte-level BPE.