Tokenization from Scratch: How LLMs Read Text
The Invisible First Step
Every conversation with an LLM starts with a silent transformation. Before the model can think about your question, before attention heads weigh relevance, before embeddings encode meaning — your text has to become a sequence of integers. Not one integer per word. Not one per character. Something cleverer.
Type “I’m tokenizing this” into GPT-4, and behind the scenes it sees something like [40, 2846, 5765, 2995, 420]. The word “tokenizing” has been split into ["token", "izing"] — two subword pieces that each have their own integer ID. The model has never seen the word “tokenizing” as a unit, but it knows “token” and “izing” separately, so it can handle the combination just fine.
This splitting process is called tokenization, and the dominant algorithm behind it is Byte Pair Encoding (BPE). It’s the very first step in the pipeline we’ve been building in this series. In Embeddings, we turned words into meaningful vectors. In Attention, we taught those vectors to find relevance. In Micrograd, we built the training engine. But all of those assumed the text was already chopped into pieces. Today we build the chopper.
Two Extremes That Don’t Work
Before we get to BPE, let’s understand why it exists by trying the two obvious approaches and watching them fail.
Attempt 1: One character at a time
The simplest possible tokenizer splits text into individual characters:
def char_tokenize(text):
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_id[ch] for ch in text]

tokens = char_tokenize("the cat sat on the mat")
print(tokens)  # [9, 4, 3, 0, 2, 1, 9, 0, 8, 1, 9, 0, 7, 6, 0, 9, 4, 3, 0, 5, 1, 9]
print(f"Vocab size: {len(set('the cat sat on the mat'))}")  # 10
This has one beautiful property: a vocabulary of about 256 characters covers every possible English text. You’ll never encounter an unknown token. But there’s a crushing cost — the sequence is enormous. That 6-word sentence produces 22 tokens. A typical paragraph would generate hundreds. The model has to figure out that the characters t, h, e appearing together mean “the” — spending precious attention capacity on reassembling words instead of understanding meaning.
Attempt 2: One word at a time
The opposite extreme splits on whitespace:
def word_tokenize(text):
    words = text.split()
    vocab = sorted(set(words))
    word_to_id = {w: i for i, w in enumerate(vocab)}
    return [word_to_id[w] for w in words]

tokens = word_tokenize("the cat sat on the mat")
print(tokens)  # [4, 0, 3, 2, 4, 1]
print(f"Vocab size: {len(set('the cat sat on the mat'.split()))}")  # 5
Now our 6-word sentence is a tight 6 tokens. Each token carries meaning. But what happens when the model encounters “tokenizing” and it wasn’t in the training data? It maps to a special [UNK] token and all information is lost. To cover a real language, you’d need a vocabulary of over 100,000 words — and you’d still hit unknowns on novel words, typos, names, and technical jargon.
The dilemma: characters give you infinite coverage but painfully long sequences. Words give you compact sequences but can’t handle the unexpected. We need something in between.
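The trade-off is easy to measure. A quick sketch comparing the two extremes on the sentence from above:

```python
# Compare sequence length vs. vocabulary size for the two extremes.
sentence = "the cat sat on the mat"

char_tokens = list(sentence)       # one token per character
word_tokens = sentence.split()     # one token per word

print(len(char_tokens), len(set(char_tokens)))  # 22 tokens, 10-symbol vocab
print(len(word_tokens), len(set(word_tokens)))  # 6 tokens, 5-word vocab
```

Characters win on vocabulary size, words win on sequence length; BPE will interpolate between the two.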
The Key Insight: Let the Data Decide
In 1994, Philip Gage published a short paper describing a compression algorithm called Byte Pair Encoding. The idea was disarmingly simple: scan your data for the most frequently occurring pair of adjacent bytes, replace every occurrence with a new symbol, and repeat. Each replacement shortens the data by removing one symbol per occurrence of that pair.
Twenty-two years later, Rico Sennrich and colleagues realized this same idea could solve the tokenization problem for neural machine translation. Instead of compressing bytes, they compressed characters within words. Common sequences like “th”, “ing”, and “tion” would naturally merge into single tokens. Whole common words like “the” would eventually become single tokens too. But rare words would stay split into recognizable pieces.
The result is a vocabulary where:
- Frequent words are single tokens: “the”, “and”, “is”
- Common subwords are tokens: “ing”, “tion”, “un”
- Rare words decompose into known pieces: “tokenizing” → “token” + “izing”
- Nothing is unknown: even a word the model has never seen breaks down into subwords it has
This is the sweet spot. Let’s build it.
BPE: The Algorithm, Step by Step
BPE has two phases: training (learning which pairs to merge) and encoding (applying those merges to new text). Let’s walk through training with a concrete example.
Imagine our training corpus contains these words with their frequencies:
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
We start by splitting every word into individual characters and collecting all unique characters as our initial vocabulary:
# Split words into character sequences (with frequencies)
# "hug" (10 times) -> ["h", "u", "g"]
# "pug" (5 times) -> ["p", "u", "g"]
# "pun" (12 times) -> ["p", "u", "n"]
# "bun" (4 times) -> ["b", "u", "n"]
# "hugs" (5 times) -> ["h", "u", "g", "s"]
initial_vocab = ["b", "g", "h", "n", "p", "s", "u"] # 7 characters
Now we count every adjacent pair, weighted by word frequency:
- ("u", "g"): appears in “hug” (10) + “pug” (5) + “hugs” (5) = 20
- ("p", "u"): appears in “pug” (5) + “pun” (12) = 17
- ("u", "n"): appears in “pun” (12) + “bun” (4) = 16
- ("h", "u"): appears in “hug” (10) + “hugs” (5) = 15
- ("g", "s"): appears in “hugs” (5) = 5
- ("b", "u"): appears in “bun” (4) = 4
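These counts are quick to verify. A small sketch that recomputes them from the toy corpus:

```python
from collections import Counter

# Recompute the adjacent-pair counts from the toy corpus.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

pair_counts = Counter()
for word, freq in corpus.items():
    chars = list(word)
    for a, b in zip(chars, chars[1:]):   # every adjacent pair in the word
        pair_counts[(a, b)] += freq      # weighted by word frequency

print(pair_counts.most_common(3))
# [(('u', 'g'), 20), (('p', 'u'), 17), (('u', 'n'), 16)]
```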
Merge 1: The most frequent pair is ("u", "g") with count 20. We merge it into a new token "ug" and replace everywhere:
# After merge 1: ("u", "g") -> "ug"
# "hug" -> ["h", "ug"] (was ["h", "u", "g"])
# "pug" -> ["p", "ug"] (was ["p", "u", "g"])
# "pun" -> ["p", "u", "n"] (unchanged - no "u","g" pair)
# "bun" -> ["b", "u", "n"] (unchanged)
# "hugs" -> ["h", "ug", "s"] (was ["h", "u", "g", "s"])
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug"] # 8 tokens
Merge 2: Recount pairs. ("u", "n") now leads with 16, since the ("p", "u") count dropped to 12 once “pug” became ["p", "ug"]. Merge it:
# After merge 2: ("u", "n") -> "un"
# "pun" -> ["p", "un"]
# "bun" -> ["b", "un"]
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un"] # 9 tokens
Merge 3: Now ("h", "ug") appears 15 times (in “hug” and “hugs”). Merge it into "hug":
# After merge 3: ("h", "ug") -> "hug"
# "hug" -> ["hug"] (fully compressed!)
# "hugs" -> ["hug", "s"]
vocab = ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] # 10 tokens
Three merges and the word “hug” is already a single token. The algorithm naturally promoted the most common words to single-token status first.
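The whole training loop fits in a few lines. A minimal sketch that reproduces the three merges above on the toy corpus:

```python
from collections import Counter

# Run three BPE merges on the toy corpus from the walkthrough.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {w: list(w) for w in corpus}   # each word starts as characters
merges = []

for _ in range(3):
    # Count adjacent pairs, weighted by word frequency
    counts = Counter()
    for w, freq in corpus.items():
        s = splits[w]
        for pair in zip(s, s[1:]):
            counts[pair] += freq
    # Pick the most frequent pair and record the merge
    (a, b), _ = counts.most_common(1)[0]
    merges.append((a, b))
    # Apply the merge to every word
    for w, s in splits.items():
        out, j = [], 0
        while j < len(s):
            if j < len(s) - 1 and (s[j], s[j + 1]) == (a, b):
                out.append(a + b)
                j += 2
            else:
                out.append(s[j])
                j += 1
        splits[w] = out

print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

The three learned merges come out in exactly the order the walkthrough derived: "ug", then "un", then "hug".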
Encoding new text is straightforward: split into characters, then apply the learned merges in order. To encode “bug”:
- Start: ["b", "u", "g"]
- Apply merge 1 (u + g → ug): ["b", "ug"]
- No more applicable merges.
- Result: ["b", "ug"] — the model has never seen “bug” as a whole word, but it recognizes both pieces.
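As a sketch, the replay logic looks like this, using the three merges learned above:

```python
# Encoding replays the learned merges, in training order, on a new word.
merges = [("u", "g"), ("u", "n"), ("h", "ug")]

def apply_merges(word, merges):
    symbols = list(word)
    for a, b in merges:
        out, j = [], 0
        while j < len(symbols):
            # Merge this pair wherever it occurs
            if j < len(symbols) - 1 and (symbols[j], symbols[j + 1]) == (a, b):
                out.append(a + b)
                j += 2
            else:
                out.append(symbols[j])
                j += 1
        symbols = out
    return symbols

print(apply_merges("bug", merges))  # ['b', 'ug']
print(apply_merges("hug", merges))  # ['hug']
```

Note that order matters: "hug" only collapses to a single token because the u + g merge runs before h + ug.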
Building BPE from Scratch in Python
Here’s a complete BPE tokenizer in about 70 lines. We pre-tokenize on whitespace, train by repeatedly merging the most common pair, and encode by replaying the merges in order:
from collections import Counter

class BPETokenizer:
    def __init__(self):
        self.merges = []  # ordered list of (pair, merged_token)
        self.vocab = {}   # token -> id

    def train(self, text, num_merges):
        """Learn `num_merges` merge rules from a training corpus."""
        # Pre-tokenize: split on whitespace, count word frequencies
        word_freqs = Counter(text.split())
        # Split each word into a list of characters
        splits = {word: list(word) for word in word_freqs}

        for i in range(num_merges):
            # Count every adjacent pair, weighted by word frequency
            pair_counts = Counter()
            for word, freq in word_freqs.items():
                symbols = splits[word]
                for j in range(len(symbols) - 1):
                    pair_counts[(symbols[j], symbols[j + 1])] += freq
            if not pair_counts:
                break

            # Find the most frequent pair
            best_pair = pair_counts.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            self.merges.append((best_pair, merged))

            # Apply this merge everywhere
            for word in splits:
                symbols = splits[word]
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == best_pair[0]
                            and symbols[j + 1] == best_pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                splits[word] = new_symbols

        # Build vocabulary: all unique tokens across all splits + base chars
        all_tokens = set()
        for symbols in splits.values():
            all_tokens.update(symbols)
        for (pair, merged) in self.merges:
            all_tokens.add(pair[0])
            all_tokens.add(pair[1])
            all_tokens.add(merged)
        self.vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

    def encode(self, text):
        """Encode a string into a list of token IDs."""
        tokens = []
        for word in text.split():
            symbols = list(word)
            # Replay merges in training order
            for (pair, merged) in self.merges:
                new_symbols = []
                j = 0
                while j < len(symbols):
                    if (j < len(symbols) - 1
                            and symbols[j] == pair[0]
                            and symbols[j + 1] == pair[1]):
                        new_symbols.append(merged)
                        j += 2
                    else:
                        new_symbols.append(symbols[j])
                        j += 1
                symbols = new_symbols
            tokens.extend(symbols)
        return [self.vocab[t] for t in tokens]

    def decode(self, ids):
        """Decode token IDs back to a string. Space-joining is a
        simplification: it also inserts spaces between subword pieces
        of the same word."""
        id_to_tok = {i: t for t, i in self.vocab.items()}
        return " ".join(id_to_tok[i] for i in ids)
Let’s train it on a small corpus and see what it learns:
text = ("the cat sat on the mat the cat and the bat "
"the rat sat on the flat mat that the cat sat on")
tokenizer = BPETokenizer()
tokenizer.train(text, num_merges=15)
# What merges did it learn?
for i, (pair, merged) in enumerate(tokenizer.merges):
    print(f"  Merge {i+1}: '{pair[0]}' + '{pair[1]}' -> '{merged}'")
# Merge 1: 'a' + 't' -> 'at' (count: 12 — cat, sat, mat, bat, rat, flat, that)
# Merge 2: 't' + 'h' -> 'th' (count: 8 — from "the" x7 + "that" x1)
# Merge 3: 'th' + 'e' -> 'the' (count: 7)
# Merge 4: 'c' + 'at' -> 'cat' (count: 3)
# Merge 5: 's' + 'at' -> 'sat' (count: 3)
# Merge 6: 'o' + 'n' -> 'on' (count: 3)
# ...
# Encode new text
ids = tokenizer.encode("the cat sat on the mat")
print(f"Token IDs: {ids}")
# Each common word is now a single token!
# Encode a word the corpus never saw
ids = tokenizer.encode("hat")
# "hat" -> ["h", "at"] — the whole word is new, but both pieces are known
# ("bat" itself appeared in training, so it ends up fully merged instead)
Notice how the algorithm discovered English morphology on its own. Our corpus is full of rhyming words (cat, sat, mat, bat, rat, flat, that), so it merged a + t → at first — building the shared suffix that seven different words have in common. Then it merged t + h → th and th + e → the, compressing the most frequent word into a single token. Nobody told it about these patterns — frequency did all the work.
Byte-Level BPE: The GPT-2 Innovation
Our character-level BPE has a subtle remaining problem: if the training data never included the character “ü” or “é” or the emoji “🚀”, we can’t tokenize text that contains them. In 2019, OpenAI’s GPT-2 paper introduced an elegant fix: start with bytes instead of characters.
Every string in every language can be encoded as a sequence of UTF-8 bytes — integers from 0 to 255. By using these 256 values as the base vocabulary, byte-level BPE can represent any text, in any language, including emojis, mathematical notation, and scripts it has never seen. Zero unknown tokens, ever.
# Character-level: "café" -> ['c', 'a', 'f', 'é']
# But what if 'é' wasn't in training data? -> [UNK]!
# Byte-level: "café" -> [99, 97, 102, 195, 169]
# 'é' is 2 bytes in UTF-8: [195, 169] — both in 0-255 range
# Every possible character decomposes into known bytes
text = "caf\u00e9 \U0001F680"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens) # [99, 97, 102, 195, 169, 32, 240, 159, 154, 128]
# 'c'=99, 'a'=97, 'f'=102, 'é'=[195,169], ' '=32, '🚀'=[240,159,154,128]
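A property worth checking: the byte sequence is lossless, so decoding always recovers the original string exactly, no matter the script:

```python
# Byte-level tokens round-trip perfectly: raw UTF-8 bytes decode back to
# the original string, so nothing is ever [UNK].
text = "caf\u00e9 \U0001F680"
byte_tokens = list(text.encode("utf-8"))
assert bytes(byte_tokens).decode("utf-8") == text
print(len(text), "characters ->", len(byte_tokens), "byte tokens")
```

This is exactly why byte-level BPE can claim zero unknown tokens: every character, however exotic, decomposes into bytes the vocabulary already contains.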
GPT-2’s final vocabulary: 256 base byte tokens + 50,000 learned merges + 1 special end-of-text token = 50,257 tokens. Every single one of those 50,000 merges was learned from data — the algorithm decided which byte sequences appeared frequently enough to deserve their own token.
The Modern Landscape
Today’s production LLMs use highly optimized BPE implementations. Here’s how the vocabulary sizes have grown:
| Model | Vocab Size | Tokenizer | Year |
|---|---|---|---|
| GPT-2 | 50,257 | Byte-level BPE | 2019 |
| T5 | 32,000 | SentencePiece | 2019 |
| LLaMA 2 | 32,000 | SentencePiece BPE | 2023 |
| GPT-4 | 100,256 | tiktoken (BPE) | 2023 |
| LLaMA 3 | 128,000 | BPE | 2024 |
| GPT-4o | ~200,000 | tiktoken (BPE) | 2024 |
Larger vocabularies mean more merges, which means better compression — more text fits into the same context window. GPT-4o’s 200K vocabulary represents the same English text in significantly fewer tokens than GPT-2’s 50K vocabulary, effectively giving it a larger context window for the same model architecture.
Two production-grade tokenizer libraries dominate:
- tiktoken (OpenAI): Byte-level BPE implemented in Rust for speed. Powers GPT-3.5, GPT-4, and GPT-4o. Uses a regex-based pre-tokenizer that splits text into chunks before applying BPE merges — preventing merges across word boundaries.
- SentencePiece (Google): Supports both BPE and an alternative algorithm called Unigram. Works directly on raw text without pre-tokenization, using a special “▁” character to mark word boundaries. Powers T5, LLaMA 1/2, and XLNet.
Surprising gotchas
Tokenization has consequences most people don’t expect:
- Arithmetic is hard because numbers split inconsistently. “380” might be one token while “381” splits into “38” + “1”. The model never sees digits aligned for column addition.
- Non-English text is expensive. A tokenizer trained mostly on English produces 3–4x more tokens for the same meaning in Hindi or Japanese, directly increasing cost and reducing effective context length.
- Glitch tokens appear when the tokenizer’s training data differs from the model’s. Some tokens exist in the vocabulary but were never seen during model training, producing unpredictable behavior when triggered.
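The non-English cost starts before any merges are even learned: UTF-8 itself uses more bytes per character for non-Latin scripts, so a byte-level tokenizer begins from a longer base sequence. A quick sketch (example words chosen for illustration):

```python
# UTF-8 uses 1 byte per ASCII character but 3 bytes per character for
# Devanagari and Japanese kana, so byte-level BPE starts from longer
# sequences for non-Latin text.
for word in ["hello", "नमस्ते", "こんにちは"]:
    print(word, len(word), "chars ->", len(word.encode("utf-8")), "bytes")
```

On top of this base expansion, merges learned mostly from English compress English far more aggressively, which compounds the gap.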
The Invisible Foundation
Tokenization is the lens through which every language model perceives text. It determines what a “word” is, how many tokens your prompt costs, how well the model handles non-English input, and even whether it can do basic arithmetic. It’s arguably the most consequential preprocessing step in the LLM pipeline — and it runs on an algorithm from a 1994 data compression paper.
BPE’s genius is letting frequency do all the work. No linguistic rules, no hand-crafted dictionaries — just count pairs and merge the winners. The result is a vocabulary that naturally captures the statistical structure of language: common patterns compress into single tokens, while rare patterns stay decomposed into recognizable pieces.
With this post, we’ve now built every major piece of the modern LLM pipeline from scratch. Tokenization chops text into subwords. Embeddings turn those subwords into vectors with meaning. Attention lets those vectors communicate and find relevance. Autograd trains the whole thing end to end. Four posts, four building blocks, one complete foundation.
References & Further Reading
- Philip Gage — “A New Algorithm for Data Compression” (1994) — the original BPE paper that started it all
- Sennrich, Haddow & Birch — “Neural Machine Translation of Rare Words with Subword Units” (ACL 2016) — adapted BPE for NLP tokenization
- Radford et al. — “Language Models are Unsupervised Multitask Learners” (2019) — GPT-2 paper introducing byte-level BPE
- Kudo & Richardson — “SentencePiece” (EMNLP 2018) — Google’s language-independent tokenizer
- Andrej Karpathy — minbpe — clean, minimal BPE implementations for learning
- OpenAI — tiktoken — production BPE tokenizer for GPT models
- Google — SentencePiece — C++ library supporting BPE and Unigram algorithms