Tokenization from Scratch: How LLMs Read Text
The Invisible First Step
Every conversation with an LLM starts with a silent transformation. Before the model can think about your question, before attention heads weigh relevance, before embeddings encode meaning — your text has to become a sequence of integers. Not one integer per word. Not one per character. Something cleverer.
Type “I’m tokenizing this” into GPT-4, and behind the scenes it sees something like [40, 2846, 5765, 2995, 420]. The word “tokenizing” has been split into ["token", "izing"] — two subword pieces that each have their own integer ID. The model has never seen the word “tokenizing” as a unit, but it knows “token” and “izing” separately, so it can handle the combination just fine.
This splitting process is called tokenization, and the dominant algorithm behind it is Byte Pair Encoding (BPE). It’s the very first step in the pipeline we’ve been building in this series. In Embeddings, we turned words into meaningful vectors. In Attention, we taught those vectors to find relevance. In Micrograd, we built the training engine. But all of those assumed the text was already chopped into pieces. Today we build the chopper.
Two Extremes That Don’t Work
Before we get to BPE, let’s understand why it exists by trying the two obvious approaches and watching them fail.
Attempt 1: One character at a time
The simplest possible tokenizer splits text into individual characters:
def char_tokenize(text):
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_id[ch] for ch in text]

tokens = char_tokenize("the cat sat on the mat")
print(tokens)  # [9, 4, 3, 0, 2, 1, 9, 0, 8, 1, 9, 0, 7, 6, 0, 9, 4, 3, 0, 5, 1, 9]
print(f"Vocab size: {len(set('the cat sat on the mat'))}")  # 10
This has one beautiful property: a vocabulary of about 256 characters covers every possible English text. You’ll never encounter an unknown token. But there’s a crushing cost — the sequence is enormous. That 6-word sentence produces 22 tokens. A typical paragraph would generate hundreds. The model has to figure out that the characters t, h, e appearing together mean “the” — spending precious attention capacity on reassembling words instead of understanding meaning.
Attempt 2: One word at a time
The opposite extreme splits on whitespace:
def word_tokenize(text):
    words = text.split()
    vocab = sorted(set(words))
    word_to_id = {w: i for i, w in enumerate(vocab)}
    return [word_to_id[w] for w in words]

tokens = word_tokenize("the cat sat on the mat")
print(tokens)  # [4, 0, 3, 2, 4, 1]
print(f"Vocab size: {len(set('the cat sat on the mat'.split()))}")  # 5
Now our 6-word sentence is a tight 6 tokens. Each token carries meaning. But what happens when the model encounters “tokenizing” and it wasn’t in the training data? It maps to a special [UNK] token and all information is lost. To cover a real language, you’d need a vocabulary of over 100,000 words — and you’d still hit unknowns on novel words, typos, names, and technical jargon.
The dilemma: characters give you infinite coverage but painfully long sequences. Words give you compact sequences but can’t handle the unexpected. We need something in between.
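The trade-off is easy to measure. A quick sketch comparing the two extremes on the sentence from above:

```python
# Compare sequence length vs. vocabulary size for the two extremes.
sentence = "the cat sat on the mat"

char_tokens = list(sentence)       # one token per character
word_tokens = sentence.split()     # one token per word

print(len(char_tokens), len(set(char_tokens)))  # 22 tokens, 10-symbol vocab
print(len(word_tokens), len(set(word_tokens)))  # 6 tokens, 5-word vocab
```

Characters win on vocabulary size, words win on sequence length; BPE will interpolate between the two.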
The Key Insight: Let the Data Decide
In 1994, Philip Gage published a short paper describing a compression algorithm called Byte Pair Encoding. The idea was disarmingly simple: scan your data for the most frequently occurring pair of adjacent bytes, replace every occurrence with a new symbol, and repeat. Each replacement shortens the data by removing one symbol per occurrence of that pair.
Twenty-two years later, Rico Sennrich and colleagues realized this same idea could solve the tokenization problem for neural machine translation. Instead of compressing bytes, they compressed characters within words. Common sequences like “th”, “ing”, and “tion” would naturally merge into single tokens. Whole common words like “the” would eventually become single tokens too. But rare words would stay split into recognizable pieces.
The result is a vocabulary where:
- Frequent words are single tokens: “the”, “and”, “is”
- Common subwords are tokens: “ing”, “tion”, “un”
- Rare words decompose into known pieces: “tokenizing” → “token” + “izing”
- Nothing is unknown: even a word the model has never seen breaks down into subwords it has
This is the sweet spot. Let’s build it.
BPE: The Algorithm, Step by Step
BPE has two phases: training (learning which pairs to merge) and encoding (applying those merges to new text). Let’s walk through training with a concrete example.
Imagine our training corpus contains these words with their frequencies:
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
We start by splitting every word into individual characters and collecting all unique characters as our initial vocabulary:
# Split words into character sequences (with frequencies)
# "hug" (10 times) -> ["h", "u", "g"]
# "pug" (5 times) -> ["p", "u", "g"]
# "pun" (12 times) -> ["p", "u", "n"]
# "bun" (4 times) -> ["b", "u", "n"]
# "hugs" (5 times) -> ["h", "u", "g", "s"]
initial_vocab = ["b", "g", "h", "n", "p", "s", "u"] # 7 characters
Now we count every adjacent pair, weighted by word frequency:
- ("u", "g"): appears in “hug” (10) + “pug” (5) + “hugs” (5) = 20
- ("p", "u"): appears in “pug” (5) + “pun” (12) = 17
- ("u", "n"): appears in “pun” (12) + “bun” (4) = 16
- ("h", "u"): appears in “hug” (10) + “hugs” (5) = 15
- ("g", "s"): appears in “hugs” (5) = 5
- ("b", "u"): appears in “bun” (4) = 4
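These counts are quick to verify. A small sketch that recomputes them from the toy corpus:

```python
from collections import Counter

# Recompute the adjacent-pair counts from the toy corpus.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

pair_counts = Counter()
for word, freq in corpus.items():
    chars = list(word)
    for a, b in zip(chars, chars[1:]):   # every adjacent pair in the word
        pair_counts[(a, b)] += freq      # weighted by word frequency

print(pair_counts.most_common(3))
# [(('u', 'g'), 20), (('p', 'u'), 17), (('u', 'n'), 16)]
```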
Merge 1: The most frequent pair is ("u", "g") with count 20. We merge it into a new token "ug" and replace everywhere:
# After merge 1: ("u", "g") -> "ug"
# "hug" -> ["h", "ug"] (was ["h", "u", "g"])
# "pug" -> ["p", "ug"] (was ["p", "u", "g"])
# "pun" -> ["p", "u", "n"] (unchanged - no "u","g" pair)
# "bun" -> ["b", "u", "n"] (unchanged)
# "hugs" -> ["h", "ug", "s"] (was ["h", "u", "g", "s"])
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug"] # 8 tokens
Merge 2: Recount pairs. ("u", "n") now leads with 16, since the ("p", "u") count dropped to 12 once “pug” became ["p", "ug"]. Merge it:
# After merge 2: ("u", "n") -> "un"
# "pun" -> ["p", "un"]
# "bun" -> ["b", "un"]
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un"] # 9 tokens
Merge 3: Now ("h", "ug") appears 15 times (in “hug” and “hugs”). Merge it into "hug":
# After merge 3: ("h", "ug") -> "hug"
# "hug" -> ["hug"] (fully compressed!)
# "hugs" -> ["hug", "s"]
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] # 10 tokens
Three merges and the word “hug” is already a single token. The algorithm naturally promoted the most common words to single-token status first.
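The whole training loop fits in a few lines. A minimal sketch that reproduces the three merges above on the toy corpus:

```python
from collections import Counter

# Run three BPE merges on the toy corpus from the walkthrough.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {w: list(w) for w in corpus}   # each word starts as characters
merges = []

for _ in range(3):
    # Count adjacent pairs, weighted by word frequency
    counts = Counter()
    for w, freq in corpus.items():
        s = splits[w]
        for pair in zip(s, s[1:]):
            counts[pair] += freq
    # Pick the most frequent pair and record the merge
    (a, b), _ = counts.most_common(1)[0]
    merges.append((a, b))
    # Apply the merge to every word
    for w, s in splits.items():
        out, j = [], 0
        while j < len(s):
            if j < len(s) - 1 and (s[j], s[j + 1]) == (a, b):
                out.append(a + b)
                j += 2
            else:
                out.append(s[j])
                j += 1
        splits[w] = out

print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

The three learned merges come out in exactly the order the walkthrough derived: "ug", then "un", then "hug".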
Encoding new text is straightforward: split into characters, then apply the learned merges in order. To encode “bug”:
- Start: ["b", "u", "g"]
- Apply merge 1 (u + g → ug): ["b", "ug"]
- No more applicable merges.
- Result: ["b", "ug"] — the model has never seen “bug” as a whole word, but it recognizes both pieces.
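As a sketch, the replay logic looks like this, using the three merges learned above:

```python
# Encoding replays the learned merges, in training order, on a new word.
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

def apply_merges(word, merges):
    symbols = list(word)
    for a, b in merges:
        out, j = [], 0
        while j < len(symbols):
            # Merge this pair wherever it occurs
            if j < len(symbols) - 1 and (symbols[j], symbols[j + 1]) == (a, b):
                out.append(a + b)
                j += 2
            else:
                out.append(symbols[j])
                j += 1
        symbols = out
    return symbols

print(apply_merges("bug", merges))  # ['b', 'ug']
print(apply_merges("hug", merges))  # ['hug']
```

Note that order matters: "hug" only collapses to a single token because the u + g merge runs before h + ug.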
Building BPE from Scratch in Python
Here’s a complete BPE tokenizer in about 70 lines. We pre-tokenize on whitespace, train by repeatedly merging the most common pair, and encode by replaying the merges in order:
from collections import Counter

class BPETokenizer:
    def __init__(self):
        self.merges = []  # ordered list of (pair, merged_token)
        self.vocab = {}   # token -> id

    def train(self, text, num_merges):
        """Learn `num_merges` merge rules from a training corpus."""
        # Pre-tokenize: split on whitespace, count word frequencies
        word_freqs = Counter(text.split())
        # Split each word into a list of characters
        splits = {word: list(word) for word in word_freqs}

        for i in range(num_merges):
            # Count every adjacent pair, weighted by word frequency
            pair_counts = Counter()
            for word, freq in word_freqs.items():
                symbols = splits[word]
                for j in range(len(symbols) - 1):
                    pair_counts[(symbols[j], symbols[j + 1])] += freq
            if not pair_counts:
                break

            # Find the most frequent pair
            best_pair = pair_counts.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            self.merges.append((best_pair, merged))

            # Apply this merge everywhere
            for word in splits:
                symbols = splits[word]
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == best_pair[0]
                            and symbols[j + 1] == best_pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                splits[word] = new_symbols

        # Build vocabulary: all unique tokens across all splits + base chars
        all_tokens = set()
        for symbols in splits.values():
            all_tokens.update(symbols)
        for (pair, merged) in self.merges:
            all_tokens.add(pair[0])
            all_tokens.add(pair[1])
            all_tokens.add(merged)
        self.vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

    def encode(self, text):
        """Encode a string into a list of token IDs."""
        tokens = []
        for word in text.split():
            symbols = list(word)
            # Replay merges in training order
            for (pair, merged) in self.merges:
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == pair[0]
                            and symbols[j + 1] == pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                symbols = new_symbols
            tokens.extend(symbols)
        return [self.vocab[t] for t in tokens]

    def decode(self, ids):
        """Decode token IDs back to a string. Space-joining is a
        simplification: it also inserts spaces between subword pieces
        of the same word."""
        id_to_tok = {i: t for t, i in self.vocab.items()}
        return " ".join(id_to_tok[i] for i in ids)
Let’s train it on a small corpus and see what it learns:
text = ("the cat sat on the mat the cat and the bat "
"the rat sat on the flat mat that the cat sat on")
tokenizer = BPETokenizer()
tokenizer.train(text, num_merges=15)
# What merges did it learn?
for i, (pair, merged) in enumerate(tokenizer.merges):
    print(f"  Merge {i+1}: '{pair[0]}' + '{pair[1]}' -> '{merged}'")
# Merge 1: 'a' + 't' -> 'at' (count: 12 — cat, sat, mat, bat, rat, flat, that)
# Merge 2: 't' + 'h' -> 'th' (count: 8 — from "the" x7 + "that" x1)
# Merge 3: 'th' + 'e' -> 'the' (count: 7)
# Merge 4: 'c' + 'at' -> 'cat' (count: 3)
# Merge 5: 's' + 'at' -> 'sat' (count: 3)
# Merge 6: 'o' + 'n' -> 'on' (count: 3)
# ...
# Encode new text
ids = tokenizer.encode("the cat sat on the mat")
print(f"Token IDs: {ids}")
# Each common word is now a single token!
# Encode a word the corpus never saw
ids = tokenizer.encode("hat")
# "hat" -> ["h", "at"] — the whole word is new, but both pieces are known
# ("bat" itself appeared in training, so it ends up fully merged instead)
Notice how the algorithm discovered English morphology on its own. Our corpus is full of rhyming words (cat, sat, mat, bat, rat, flat, that), so it merged a + t → at first — building the shared suffix that seven different words have in common. Then it merged t + h → th and th + e → the, compressing the most frequent word into a single token. Nobody told it about these patterns — frequency did all the work.
Byte-Level BPE: The GPT-2 Innovation
Our character-level BPE has a subtle remaining problem: if the training data never included the character “ü” or “é” or the emoji “🚀”, we can’t tokenize text that contains them. In 2019, OpenAI’s GPT-2 paper introduced an elegant fix: start with bytes instead of characters.
Every string in every language can be encoded as a sequence of UTF-8 bytes — integers from 0 to 255. By using these 256 values as the base vocabulary, byte-level BPE can represent any text, in any language, including emojis, mathematical notation, and scripts it has never seen. Zero unknown tokens, ever.
# Character-level: "café" -> ['c', 'a', 'f', 'é']
# But what if 'é' wasn't in training data? -> [UNK]!
# Byte-level: "café" -> [99, 97, 102, 195, 169]
# 'é' is 2 bytes in UTF-8: [195, 169] — both in 0-255 range
# Every possible character decomposes into known bytes
text = "caf\u00e9 \U0001F680"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens) # [99, 97, 102, 195, 169, 32, 240, 159, 154, 128]
# 'c'=99, 'a'=97, 'f'=102, 'é'=[195,169], ' '=32, '🚀'=[240,159,154,128]
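A property worth checking: the byte sequence is lossless, so decoding always recovers the original string exactly, no matter the script:

```python
# Byte-level tokens round-trip perfectly: raw UTF-8 bytes decode back to
# the original string, so nothing is ever [UNK].
text = "caf\u00e9 \U0001F680"
byte_tokens = list(text.encode("utf-8"))
assert bytes(byte_tokens).decode("utf-8") == text
print(len(text), "characters ->", len(byte_tokens), "byte tokens")
```

This is exactly why byte-level BPE can claim zero unknown tokens: every character, however exotic, decomposes into bytes the vocabulary already contains.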
GPT-2’s final vocabulary: 256 base byte tokens + 50,000 learned merges + 1 special end-of-text token = 50,257 tokens. Every single one of those 50,000 merges was learned from data — the algorithm decided which byte sequences appeared frequently enough to deserve their own token.
The Modern Landscape
Today’s production LLMs use highly optimized BPE implementations. Here’s how the vocabulary sizes have grown:
| Model | Vocab Size | Tokenizer | Year |
|---|---|---|---|
| GPT-2 | 50,257 | Byte-level BPE | 2019 |
| T5 | 32,000 | SentencePiece | 2019 |
| LLaMA 2 | 32,000 | SentencePiece BPE | 2023 |
| GPT-4 | 100,256 | tiktoken (BPE) | 2023 |
| LLaMA 3 | 128,000 | BPE | 2024 |
| GPT-4o | ~200,000 | tiktoken (BPE) | 2024 |
Larger vocabularies mean more merges, which means better compression — more text fits into the same context window. GPT-4o’s 200K vocabulary represents the same English text in significantly fewer tokens than GPT-2’s 50K vocabulary, effectively giving it a larger context window for the same model architecture.
Two production-grade tokenizer libraries dominate:
- tiktoken (OpenAI): Byte-level BPE implemented in Rust for speed. Powers GPT-3.5, GPT-4, and GPT-4o. Uses a regex-based pre-tokenizer that splits text into chunks before applying BPE merges — preventing merges across word boundaries.
- SentencePiece (Google): Supports both BPE and an alternative algorithm called Unigram. Works directly on raw text without pre-tokenization, using a special “▁” character to mark word boundaries. Powers T5, LLaMA 1/2, and XLNet.
Surprising gotchas
Tokenization has consequences most people don’t expect:
- Arithmetic is hard because numbers split inconsistently. “380” might be one token while “381” splits into “38” + “1”. The model never sees digits aligned for column addition.
- Non-English text is expensive. A tokenizer trained mostly on English produces 3–4x more tokens for the same meaning in Hindi or Japanese, directly increasing cost and reducing effective context length.
- Glitch tokens appear when the tokenizer’s training data differs from the model’s. Some tokens exist in the vocabulary but were never seen during model training, producing unpredictable behavior when triggered.
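The non-English cost starts before any merges are even learned: UTF-8 itself uses more bytes per character for non-Latin scripts, so a byte-level tokenizer begins from a longer base sequence. A quick sketch (example words chosen for illustration):

```python
# UTF-8 uses 1 byte per ASCII character but 3 bytes per character for
# Devanagari and Japanese kana, so byte-level BPE starts from longer
# sequences for non-Latin text.
for word in ["hello", "नमस्ते", "こんにちは"]:
    print(word, len(word), "chars ->", len(word.encode("utf-8")), "bytes")
```

On top of this base expansion, merges learned mostly from English compress English far more aggressively, which compounds the gap.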
The Invisible Foundation
Tokenization is the lens through which every language model perceives text. It determines what a “word” is, how many tokens your prompt costs, how well the model handles non-English input, and even whether it can do basic arithmetic. It’s arguably the most consequential preprocessing step in the LLM pipeline — and it runs on an algorithm from a 1994 data compression paper.
BPE’s genius is letting frequency do all the work. No linguistic rules, no hand-crafted dictionaries — just count pairs and merge the winners. The result is a vocabulary that naturally captures the statistical structure of language: common patterns compress into single tokens, while rare patterns stay decomposed into recognizable pieces.
With this post, we’ve now built every major piece of the modern LLM pipeline from scratch. Tokenization chops text into subwords. Embeddings turn those subwords into vectors with meaning. Attention lets those vectors communicate and find relevance. Autograd trains the whole thing end to end. Four posts, four building blocks, one complete foundation.
References & Further Reading
- Philip Gage — “A New Algorithm for Data Compression” (1994) — the original BPE paper that started it all
- Sennrich, Haddow & Birch — “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016) — adapted BPE for NLP tokenization
- Radford et al. — “Language Models are Unsupervised Multitask Learners” (2019) — GPT-2 paper introducing byte-level BPE
- Kudo & Richardson — “SentencePiece” (EMNLP 2018) — Google’s language-independent tokenizer
- Andrej Karpathy — minbpe — clean, minimal BPE implementations for learning
- OpenAI — tiktoken — production BPE tokenizer for GPT models
- Google — SentencePiece — C++ library supporting BPE and Unigram algorithms