Byte-Level Models from Scratch
1. Why Tokenization Is a Bottleneck
Open a terminal. Type “hello” — that is 5 characters, 5 bytes, and your language model sees it as 1 or 2 subword tokens. Now type “สวัสดี” (Thai for “hello”) — that is 6 characters, 18 bytes, and a typical BPE tokenizer shatters it into 6 separate tokens. Same greeting, same semantic content, 6× the token cost. This is not a quirk — it is a structural flaw in how we preprocess text for neural networks.
Subword tokenizers like BPE and WordPiece are trained on a specific corpus — usually English-heavy web text. The merge rules they learn encode corpus-specific biases: common English words (“the”, “and”) get single tokens, while equivalent words in Thai, Amharic, or Georgian are fragmented into many byte-level pieces. This creates a tokenization tax: non-Latin scripts pay 3-5× more in sequence length for the same semantic content.
Three problems compound this:
- Vocabulary mismatch: a tokenizer trained on English handles code, chemistry SMILES notation, and non-Latin scripts poorly, fragmenting them into meaningless subword pieces.
- Fragile to noise: “Hello” and “Helo” may map to completely different token sequences, so a model that understands one may be baffled by the other.
- Fixed vocabulary: once trained, a tokenizer cannot adapt. New domains, new languages, new jargon — all get the worst-case tokenization.
The byte-level alternative is radically simple: feed raw bytes into the model. The vocabulary is always exactly 256 (the possible values of a single byte). No out-of-vocabulary tokens, no tokenizer training, no preprocessing pipeline.
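To make this concrete, the entire "tokenizer" a byte-level model needs is a sketch like this (plain Python, standard library only; the function names are ours, for illustration):

```python
def byte_tokenize(text):
    """Tokenize by encoding to UTF-8: every string maps to IDs in 0-255."""
    return list(text.encode('utf-8'))

def byte_detokenize(ids):
    """Exact inverse: the bytes decode back to the original string."""
    return bytes(ids).decode('utf-8')

ids = byte_tokenize("héllo")
print(ids)                    # [104, 195, 169, 108, 108, 111]
print(byte_detokenize(ids))   # héllo -- lossless round trip
```

No merge tables, no training corpus, no out-of-vocabulary handling: every possible input, in every script, already lies inside the 256-ID vocabulary.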
# Simulated BPE tokenizer: English-biased merge rules
# Real BPE would use learned merges; we simulate the token count pattern
texts = {
    'English': 'Hello, how are you?',
    'Chinese': '\u4f60\u597d\uff0c\u4f60\u600e\u4e48\u6837\uff1f',
    'Arabic': '\u0645\u0631\u062d\u0628\u0627\u060c \u0643\u064a\u0641 \u062d\u0627\u0644\u0643\u061f',
    'Thai': '\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u0e04\u0e38\u0e13\u0e40\u0e1b\u0e47\u0e19\u0e2d\u0e22\u0e48\u0e32\u0e07\u0e44\u0e23?',
    'Code': 'def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)',
}
# Approximate subword token counts (based on GPT-2 tokenizer patterns)
approx_tokens = {'English': 6, 'Chinese': 11, 'Arabic': 16, 'Thai': 24, 'Code': 19}
print(f"{'Language':<10} {'Bytes':>6} {'Tokens':>7} {'Tax':>6}")
print("-" * 32)
for lang, text in texts.items():
    byte_count = len(text.encode('utf-8'))
    token_count = approx_tokens[lang]
    tax = token_count / byte_count
    print(f"{lang:<10} {byte_count:>6} {token_count:>7} {tax:>6.2f}")
# Output:
# Language Bytes Tokens Tax
# --------------------------------
# English 19 6 0.32
# Chinese 24 11 0.46
# Arabic 30 16 0.53
# Thai 62 24 0.39
# Code 54 19 0.35
English gets the best deal: 6 tokens for 19 bytes. Arabic needs 16 tokens for 30 bytes. With byte-level models, sequence length is simply the byte count — fair to every language, proportional to actual text size.
2. UTF-8 Encoding and the Byte Vocabulary
Before building byte-level models, we need to understand what bytes actually represent. ASCII (1963) encoded 128 characters in 7 bits — enough for English but nothing else. Unicode expanded this to over 149,000 characters across 161 scripts. But Unicode code points are integers of varying sizes, so we need an encoding to pack them into bytes.
UTF-8 (Ken Thompson & Rob Pike, 1992) is the elegant solution: a variable-length encoding using 1–4 bytes per character. Its design is brilliant:
- ASCII characters (U+0000–U+007F): 1 byte, of the form 0xxxxxxx. Fully backward-compatible with ASCII.
- Latin extended, Greek, Cyrillic (U+0080–U+07FF): 2 bytes, a leader 110xxxxx plus one continuation byte 10xxxxxx.
- Most CJK, Arabic, Thai, Devanagari (U+0800–U+FFFF): 3 bytes, a leader 1110xxxx plus two continuation bytes.
- Emoji and rare scripts (U+10000–U+10FFFF): 4 bytes, a leader 11110xxx plus three continuation bytes.
The prefix pattern makes UTF-8 self-synchronizing: jump to any byte in a stream and you can immediately determine where the next character boundary is. For byte-level models, this structure creates learnable local patterns — when the model sees a 3-byte leader (0xE4), it knows two continuation bytes follow.
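The self-synchronization property can be checked in a few lines. The helper names here (is_continuation, next_char_boundary) are ours, chosen for illustration:

```python
def is_continuation(b):
    """Continuation bytes always match the bit pattern 10xxxxxx."""
    return b & 0xC0 == 0x80

def next_char_boundary(data, i):
    """From any byte offset, skip continuation bytes forward to the
    start of the next character. Works from any position in the stream."""
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i

data = "中文ab".encode('utf-8')  # E4 B8 AD | E6 96 87 | 61 | 62
# Land in the middle of '中' (offset 1 is a continuation byte):
print(next_char_boundary(data, 1))  # 3 -- the start of '文'
print(next_char_boundary(data, 6))  # 6 -- 'a' is already a boundary
```

No other position information is needed: the leading two bits of each byte are enough to resynchronize, which is exactly the kind of local regularity a byte-level model can learn.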
def utf8_encode_char(char):
    """Encode a single character to UTF-8 bytes from first principles."""
    cp = ord(char)  # Unicode code point
    if cp <= 0x7F:        # 1-byte: 0xxxxxxx
        return [cp]
    elif cp <= 0x7FF:     # 2-byte: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    elif cp <= 0xFFFF:    # 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    else:                 # 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return [0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
examples = [('A', 'ASCII'), ('\u00e9', 'Latin'), ('\u4e2d', 'CJK'), ('\U0001f600', 'Emoji')]
for char, script in examples:
    encoded = utf8_encode_char(char)
    hex_str = ' '.join(f'0x{b:02X}' for b in encoded)
    binary = ' '.join(f'{b:08b}' for b in encoded)
    print(f"'{char}' ({script:<6}) U+{ord(char):04X} -> [{hex_str}]")
    print(f"   binary: {binary}")
# 'A' (ASCII ) U+0041 -> [0x41]
# 'é' (Latin ) U+00E9 -> [0xC3 0xA9]
# '中' (CJK ) U+4E2D -> [0xE4 0xB8 0xAD]
# '😀' (Emoji ) U+1F600 -> [0xF0 0x9F 0x98 0x80]
The byte vocabulary for a language model is just tokens 0–255 plus a few special tokens (PAD, BOS, EOS). Compared to GPT-2’s 50,257-token vocabulary, this is nearly 200× smaller: the embedding matrix shrinks from 50,257 × d_model to 259 × d_model, a substantial parameter saving that partially offsets the increased sequence length.
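A sketch of that vocabulary, with hypothetical special-token IDs (the specific layout, PAD = 256 and so on, is an assumption for illustration; implementations differ):

```python
# Hypothetical special-token layout -- an illustrative choice, not taken
# from any specific paper. Bytes occupy 0-255; specials go above.
PAD, BOS, EOS = 256, 257, 258
VOCAB_SIZE = 259

def encode_with_specials(text):
    """Wrap the raw byte IDs of a string with sequence markers."""
    return [BOS] + list(text.encode('utf-8')) + [EOS]

print(encode_with_specials("Hi"))  # [257, 72, 105, 258]
print(VOCAB_SIZE)                  # 259
```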
3. Naive Byte Transformers and ByT5
The simplest approach: take a standard transformer, replace the 50K-token vocabulary with 256 bytes, and train on raw byte sequences. This works — but the sequence length problem is brutal.
“The quick brown fox jumps over the lazy dog” is 9 subword tokens but 43 bytes — 4.8× longer. For transformer self-attention with O(n²) complexity, a 4.8× longer sequence means 23× more computation in the attention layers. A 2,048-token context window sees the equivalent of only 400–500 subword tokens — dramatically less context.
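The arithmetic behind those numbers can be verified directly (the subword count of 9 is the typical BPE figure quoted above):

```python
# Back-of-envelope check of the sequence-length and attention-cost claims
text = "The quick brown fox jumps over the lazy dog"
subword_tokens = 9                        # typical BPE token count
byte_tokens = len(text.encode('utf-8'))   # 43 bytes (ASCII-only text)
ratio = byte_tokens / subword_tokens
print(f"Length ratio:   {ratio:.1f}x")    # ~4.8x longer in bytes
print(f"Attention cost: {ratio**2:.0f}x") # ~23x (quadratic in length)
print(f"2048-byte context ~ {2048 / ratio:.0f} subword tokens")  # ~429
```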
ByT5 (Xue et al., 2022) solves this with three design choices:
- Heavier encoder, lighter decoder: since most “understanding” happens in the encoder, make it 3× deeper than the decoder.
- Downsampling: after the initial encoder layers, mean-pool groups of bytes to reduce sequence length before the expensive deep layers.
- Shorter outputs: the decoder operates on typically shorter output sequences, so byte-level decoding is less costly.
The results are striking: ByT5 matches the accuracy of mT5 (its subword sibling) on standard benchmarks, significantly outperforms it on noisy text and multilingual tasks, and is more robust to character-level perturbations — all without any tokenizer. The cost: roughly 1.5× slower per example, but more accurate per FLOP.
import numpy as np
class ByteEncoder:
    """Simplified ByT5-style byte-level encoder with downsampling."""

    def __init__(self, d_model=128, n_heads=4, pool_factor=4):
        self.d_model = d_model
        self.n_heads = n_heads
        self.pool_factor = pool_factor
        # Byte embedding: only 256 entries!
        self.embed = np.random.randn(256, d_model) * 0.02

    def encode(self, text):
        """Encode text as raw bytes, embed, and downsample."""
        byte_ids = list(text.encode('utf-8'))
        seq_len = len(byte_ids)
        # Step 1: Byte embedding lookup
        x = np.array([self.embed[b] for b in byte_ids])  # (seq_len, d_model)
        print(f"Input: '{text}'")
        print(f"Bytes: {seq_len} positions, shape {x.shape}")
        # Step 2: Add sinusoidal positional encoding
        pos = np.arange(seq_len)[:, None]
        dim = np.arange(0, self.d_model, 2)[None, :]
        angles = pos / (10000 ** (dim / self.d_model))
        x[:, 0::2] += np.sin(angles)
        x[:, 1::2] += np.cos(angles)
        # Step 3: Mean-pool downsampling (ByT5's key trick)
        k = self.pool_factor
        trimmed = seq_len - (seq_len % k)  # trim to a multiple of k
        x_trim = x[:trimmed].reshape(-1, k, self.d_model)
        x_pooled = x_trim.mean(axis=1)  # (seq_len/k, d_model)
        print(f"Pooled: {x_pooled.shape[0]} positions (/{k}), shape {x_pooled.shape}")
        return x_pooled
enc = ByteEncoder(d_model=128, pool_factor=4)
enc.encode("Hello, world!")
print()
enc.encode("\u4f60\u597d\u4e16\u754c") # Chinese: "Hello world"
print()
# Embedding table comparison
print(f"Byte embedding: 256 x 128 = {256*128:,} params")
print(f"Subword embedding: 50257 x 128 = {50257*128:,} params")
print(f"Reduction: {50257*128 // (256*128)}x fewer embedding params")
After downsampling by 4×, a 52-byte input becomes 13 positions for the expensive deep layers, comparable to the 9 subword tokens a standard tokenizer would produce for a sentence of that length. Mean-pooling merges neighboring bytes, often folding multi-byte UTF-8 sequences into roughly character-level representations (pool boundaries need not align exactly with character boundaries), and the tiny embedding table (256 rows instead of 50K) saves millions of parameters.
4. MegaByte: Patching Bytes for Efficiency
ByT5 uses pooling — a lossy compression that discards sub-character detail. MegaByte (Yu et al., 2023) takes a different approach: patching. The idea, borrowed from Vision Transformers, is deceptively simple: split the byte sequence into fixed-size patches and process them hierarchically.
The architecture has two components:
- Global model: a large transformer that processes patch embeddings. If the sequence has N bytes and each patch has P bytes, the global model sees only N/P positions — reducing attention cost by P².
- Local model: a small transformer that generates individual bytes within each patch. It only needs to attend to P positions (e.g., 8), so it is very fast. During training, local models for different patches run in parallel.
The attention cost drops from O(N²) for a flat byte transformer to O((N/P)² + N·P) for MegaByte. For P=8 and N=8,192, that is roughly 60× less attention computation. The global model captures long-range dependencies across the document, while the local model handles the fine-grained byte patterns within each patch.
def megabyte_costs(N, P):
    """Compute attention operations for MegaByte vs flat transformer."""
    flat_attn = N * N              # O(N^2) for flat byte transformer
    global_attn = (N // P) ** 2    # O((N/P)^2) for global model
    local_attn = N * P             # O(N*P) total for all local models
    mega_attn = global_attn + local_attn
    return flat_attn, mega_attn
print(f"{'Seq Len':>8} {'Flat':>12} {'MegaByte':>12} {'Speedup':>8}")
print("-" * 44)
for N in [512, 1024, 2048, 4096, 8192]:
    flat, mega = megabyte_costs(N, P=8)
    print(f"{N:>8} {flat:>12,} {mega:>12,} {flat/mega:>7.1f}x")
# Output:
# Seq Len Flat MegaByte Speedup
# --------------------------------------------
# 512 262,144 8,192 32.0x
# 1024 1,048,576 24,576 42.7x
# 2048 4,194,304 81,920 51.2x
# 4096 16,777,216 294,912 56.9x
# 8192 67,108,864 1,114,112 60.2x
# MegaByte's advantage grows with sequence length!
# Patch embedding: embed each of the P bytes, concatenate, and project
# through a linear layer to a single patch vector
P = 8
d_model = 512
byte_embed_params = 256 * d_model              # shared byte embedding table
projection_params = (P * d_model) * d_model    # concat(P bytes) -> d_model
patch_embed_params = byte_embed_params + projection_params
print(f"\nPatch embedding: {patch_embed_params:,} params")
print(f"Equivalent to a learned {P}-gram byte encoder")
On long-document benchmarks (PG-19, arXiv), MegaByte matches or outperforms subword transformers at the same compute budget while processing raw bytes. The local-global decomposition is not just efficient — it is architecturally natural. UTF-8 multi-byte sequences, word morphology, and common n-grams all happen at the local scale, while cross-sentence coherence and topic structure happen at the global scale.
5. MambaByte: Linear-Time Byte Processing
MegaByte still uses quadratic attention in its global model. MambaByte (Wang et al., 2024) asks: what if we skip attention entirely?
Mamba’s selective state-space model processes sequences in O(N) time through a scan operation — no attention matrix at all. For byte sequences, this is transformative: a 10K-byte sequence that would require 100 million attention operations in a transformer requires only 10K sequential scan steps in Mamba.
The key insight: since byte-level models face a 3-5× sequence length increase over subword models, the choice of sequence model matters enormously. With quadratic attention, 4× longer means 16× more compute. With linear-complexity SSMs, 4× longer means just 4× more compute.
def compute_costs(N_values, P=8):
    """Compare operation counts: attention, MegaByte, and SSM."""
    results = []
    for N in N_values:
        flat = N * N                   # O(N^2) self-attention
        mega = (N // P) ** 2 + N * P   # MegaByte
        ssm = N * 64                   # O(N*d_state) for SSM scan
        results.append((N, flat, mega, ssm))
    return results
N_values = [256, 512, 1024, 2048, 4096, 8192, 16384]
results = compute_costs(N_values)
print(f"{'N':>6} {'Attention':>12} {'MegaByte':>12} {'SSM':>12}")
print("-" * 46)
for N, flat, mega, ssm in results:
    print(f"{N:>6} {flat:>12,} {mega:>12,} {ssm:>12,}")
# At N=8192:
# Attention: 67,108,864
# MegaByte: 1,114,112 (60x less)
# SSM: 524,288 (128x less)
#
# SSM wins at long sequences because it's truly linear.
# The crossover point where SSM beats MegaByte depends on
# the constant factors (SSM has larger per-step cost).
MambaByte achieves competitive perplexity with subword transformers on language modeling benchmarks — without any patching, pooling, or architectural tricks. It simply processes raw bytes with a selective scan. The simplicity is the point: you do not need to engineer around the quadratic bottleneck when you can eliminate it.
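To see why a scan is linear, here is a toy (non-selective) recurrence over a byte sequence. This is a structural sketch only, not Mamba's actual mechanism, which uses a learned multi-dimensional state and makes the transition parameters input-dependent:

```python
import numpy as np

def linear_scan(x, a=0.9, b=0.1):
    """Toy SSM recurrence: h_t = a*h_{t-1} + b*x_t.
    One constant-cost state update per byte -- total work grows linearly
    with sequence length, and no attention matrix is ever formed."""
    h = 0.0
    out = np.empty(len(x))
    for t, xt in enumerate(x):
        h = a * h + b * xt
        out[t] = h
    return out

byte_ids = np.frombuffer("Hello, world!".encode('utf-8'), dtype=np.uint8)
states = linear_scan(byte_ids.astype(float))
print(states.shape)  # (13,) -- exactly one state per byte, N steps total
```

Doubling the input doubles the work; there is no N² term to engineer around.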
The emerging frontier is hybrid architectures: use local attention within byte patches (where full expressiveness matters) and global SSMs across patches (where linear scaling matters). This combines the best of both worlds — the expressiveness of attention for short-range byte patterns and the efficiency of SSMs for long-range document structure.
6. The Tradeoff Landscape
Having built the key architectures, let us compare them systematically:
| Dimension | Subword | ByT5 | MegaByte | MambaByte |
|---|---|---|---|---|
| Vocabulary | 32K–100K | 256 | 256 | 256 |
| Seq Length | 1× (baseline) | ~4× | ~4× (local), 0.5× (global) | ~4× |
| Attention Cost | O(T²) | O(N²/k²) | O((N/P)² + NP) | O(N) — no attention |
| Multilingual | Biased | Fair | Fair | Fair |
| Robustness | Fragile | Robust | Robust | Robust |
| Throughput | Fastest | ~1.5× slower | Competitive | Competitive |
The emerging consensus: byte-level modeling is not uniformly better or worse than subword modeling — it is a tradeoff. Byte-level wins on multilingual fairness, robustness, and simplicity. Subword wins on raw throughput for English-dominant tasks. The research frontier is closing the efficiency gap through architectural innovation, making the tokenizer increasingly optional.
# Robustness comparison: byte-level vs subword under perturbations
def char_edit_distance(a, b):
    """Simple Levenshtein distance."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[n][m]
original = "The transformer architecture"
perturbed = "Teh transfomer achitecture" # 3 typos
# Byte-level: edit distance in byte space
orig_bytes = list(original.encode('utf-8'))
pert_bytes = list(perturbed.encode('utf-8'))
byte_dist = char_edit_distance(orig_bytes, pert_bytes)
# Subword-level: simulated tokens change drastically
# "transformer" -> 1 token; "transfomer" -> 2 tokens (split differently)
orig_tokens = ["The", " transform", "er", " architecture"] # 4 tokens
pert_tokens = ["T", "eh", " trans", "f", "omer", " ach", "ite", "cture"] # 8 tokens
token_dist = char_edit_distance(orig_tokens, pert_tokens)
print(f"Original: '{original}'")
print(f"Perturbed: '{perturbed}'")
print(f"Byte edit distance: {byte_dist} (small, proportional to typos)")
print(f"Token edit distance: {token_dist} (large, tokenization shatters)")
print(f"Byte stability ratio: {byte_dist / len(orig_bytes):.2%}")
print(f"Token stability ratio: {token_dist / len(orig_tokens):.2%}")
Three typos change 14% of bytes but restructure 200% of the token sequence (edit distance exceeds the original token count). This fragility is why byte-level models outperform subword models on noisy text — OCR output, social media, historical documents, and any domain where perfect spelling cannot be assumed.
7. The Future of Text Representation
The trajectory is clear. Early vision models used hand-crafted features (SIFT, HOG), then learned features from raw pixels. Early NLP relied on hand-crafted tokenizers — BPE merge rules are the SIFT features of language modeling. Byte-level models learn to extract structure from raw data, just as ConvNets learned to extract features from raw pixels.
Tokenization was always a compression step — trading information for computational convenience. As models and hardware grow more capable, the need for this lossy compression diminishes. We are not there yet: subword models still dominate production systems because they are faster. But the gap is closing — MegaByte and MambaByte already match subword accuracy at comparable compute, and the next generation of efficient architectures may close it entirely.
The question is not whether byte-level models will replace tokenizers. It is when. Every byte carries information. Every merge rule throws some away. The architectures that learn from all 256 bytes will ultimately outperform those that start by discarding structure.
References & Further Reading
- Xue et al. — ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (2022) — the foundational byte-level encoder-decoder
- Yu et al. — MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers (2023) — local-global patched architecture for efficient byte modeling
- Wang et al. — MambaByte: Token-free Selective State Space Model (2024) — applying Mamba to raw byte sequences
- Sennrich et al. — Neural Machine Translation of Rare Words with Subword Units (2016) — the BPE method that byte models aim to replace
- Clark et al. — CANINE: Pre-training an Efficient Tokenization-Free Encoder (2022) — character-level BERT alternative with downsampling
- Tay et al. — Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (2022) — learned soft character grouping
- Sreedhar et al. — SpaceByte: Towards Deleting Tokenization (2024) — using whitespace as natural patch boundaries
- Gu & Dao — Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023) — the SSM architecture powering MambaByte
- Petrov et al. — Language Model Tokenizers Introduce Unfairness Between Languages (2023) — quantifying the multilingual tokenization tax