
Byte-Level Models from Scratch

1. Why Tokenization Is a Bottleneck

Open a terminal. Type “hello” — that is 5 characters, 5 bytes, and your language model sees it as 1 or 2 subword tokens. Now type “สวัสดี” (Thai for “hello”) — that is 6 characters, 18 bytes, and a typical BPE tokenizer shatters it into 6 separate tokens. Same greeting, same semantic content, 6× the token cost. This is not a quirk — it is a structural flaw in how we preprocess text for neural networks.

Subword tokenizers like BPE and WordPiece are trained on a specific corpus — usually English-heavy web text. The merge rules they learn encode corpus-specific biases: common English words (“the”, “and”) get single tokens, while equivalent words in Thai, Amharic, or Georgian are fragmented into many byte-level pieces. This creates a tokenization tax: non-Latin scripts pay 3-5× more in sequence length for the same semantic content.

Three problems compound this:

  1. Out-of-vocabulary text: anything outside the tokenizer's training distribution (rare scripts, novel symbols, unusual spellings) gets shattered into byte fallbacks or mishandled entirely.
  2. A frozen preprocessing step: the tokenizer is trained once, on one corpus, and its merge rules cannot adapt when the model is applied to new domains.
  3. Fragility: a single typo can restructure the entire token sequence, so small character-level perturbations look like large input changes to the model.

The byte-level alternative is radically simple: feed raw bytes into the model. The vocabulary is always exactly 256 (the possible values of a single byte). No out-of-vocabulary tokens, no tokenizer training, no preprocessing pipeline.
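A minimal sketch of what this looks like in practice (the helper names are illustrative, standard library only):

```python
def bytes_to_ids(text: str) -> list[int]:
    """'Tokenize' by taking raw UTF-8 bytes: every id is in [0, 255]."""
    return list(text.encode('utf-8'))

def ids_to_text(ids: list[int]) -> str:
    """Decoding is a lossless inverse: no merge rules, no vocab file."""
    return bytes(ids).decode('utf-8')

for text in ['hello', '\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35']:  # English and Thai greetings
    ids = bytes_to_ids(text)
    assert ids_to_text(ids) == text            # round-trip is exact
    assert all(0 <= i < 256 for i in ids)      # fixed 256-symbol vocabulary
    print(f"{text!r}: {len(text)} chars -> {len(ids)} byte ids")
# 'hello': 5 chars -> 5 byte ids
# 'สวัสดี': 6 chars -> 18 byte ids
```

The entire "tokenizer" is an encode call; the only cost is longer sequences for multi-byte scripts.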

# Simulated BPE tokenizer: English-biased merge rules
# Real BPE would use learned merges; we simulate the token count pattern
texts = {
    'English': 'Hello, how are you?',
    'Chinese': '\u4f60\u597d\uff0c\u4f60\u600e\u4e48\u6837\uff1f',
    'Arabic':  '\u0645\u0631\u062d\u0628\u0627\u060c \u0643\u064a\u0641 \u062d\u0627\u0644\u0643\u061f',
    'Thai':    '\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35 \u0e04\u0e38\u0e13\u0e40\u0e1b\u0e47\u0e19\u0e2d\u0e22\u0e48\u0e32\u0e07\u0e44\u0e23?',
    'Code':    'def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)',
}

# Approximate subword token counts (based on GPT-2 tokenizer patterns)
approx_tokens = {'English': 6, 'Chinese': 11, 'Arabic': 16, 'Thai': 24, 'Code': 19}

print(f"{'Language':<10} {'Bytes':>6} {'Tokens':>7} {'Tax':>6}")
print("-" * 32)
for lang, text in texts.items():
    byte_count = len(text.encode('utf-8'))
    token_count = approx_tokens[lang]
    tax = token_count / byte_count
    print(f"{lang:<10} {byte_count:>6} {token_count:>7} {tax:>6.2f}")

# Output:
# Language   Bytes  Tokens    Tax
# --------------------------------
# English       19       6   0.32
# Chinese       24      11   0.46
# Arabic        30      16   0.53
# Thai          62      24   0.39
# Code          54      19   0.35

English gets the best deal: 6 tokens for 19 bytes. Arabic needs 16 tokens for 30 bytes. With byte-level models, sequence length is simply the byte count — fair to every language, proportional to actual text size.

2. UTF-8 Encoding and the Byte Vocabulary

Before building byte-level models, we need to understand what bytes actually represent. ASCII (1963) encoded 128 characters in 7 bits — enough for English but nothing else. Unicode expanded this to over 149,000 characters across 161 scripts. But Unicode code points are integers of varying sizes, so we need an encoding to pack them into bytes.

UTF-8 (Ken Thompson & Rob Pike, 1992) is the elegant solution: a variable-length encoding using 1–4 bytes per character. Its design is brilliant:

  1. ASCII compatibility: code points 0–127 are encoded as themselves, so every valid ASCII file is already valid UTF-8.
  2. Self-describing length: the leading byte announces the sequence length by its prefix: 0xxxxxxx (1 byte), 110xxxxx (2), 1110xxxx (3), 11110xxx (4).
  3. Marked continuations: every non-leading byte starts with 10xxxxxx, so a continuation byte can never be confused with a leader.

The prefix pattern makes UTF-8 self-synchronizing: jump to any byte in a stream and you can immediately determine where the next character boundary is. For byte-level models, this structure creates learnable local patterns — when the model sees a 3-byte leader (0xE4), it knows two continuation bytes follow.

def utf8_encode_char(char):
    """Encode a single character to UTF-8 bytes from first principles."""
    cp = ord(char)  # Unicode code point
    if cp <= 0x7F:          # 1-byte: 0xxxxxxx
        return [cp]
    elif cp <= 0x7FF:       # 2-byte: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    elif cp <= 0xFFFF:      # 3-byte: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    else:                    # 4-byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return [0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]

examples = [('A', 'ASCII'), ('\u00e9', 'Latin'), ('\u4e2d', 'CJK'), ('\U0001f600', 'Emoji')]
for char, script in examples:
    encoded = utf8_encode_char(char)
    hex_str = ' '.join(f'0x{b:02X}' for b in encoded)
    binary = ' '.join(f'{b:08b}' for b in encoded)
    print(f"'{char}' ({script:<6}) U+{ord(char):04X} -> [{hex_str}]")
    print(f"  binary: {binary}")
# 'A'  (ASCII ) U+0041 -> [0x41]
# 'é'  (Latin ) U+00E9 -> [0xC3 0xA9]
# '中' (CJK   ) U+4E2D -> [0xE4 0xB8 0xAD]
# '😀' (Emoji ) U+1F600 -> [0xF0 0x9F 0x98 0x80]
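The self-synchronizing property is easy to verify: a byte is a continuation byte exactly when its top two bits are 10, so from any offset a decoder can skip forward to the next character boundary. A small sketch (the helper name is illustrative):

```python
def next_char_boundary(data: bytes, offset: int) -> int:
    """Index of the first character boundary at or after offset."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1  # skip continuation bytes (10xxxxxx)
    return offset

data = '\u4e2d\u6587ab'.encode('utf-8')  # [E4 B8 AD] [E6 96 87] [61] [62]
for i in range(len(data)):
    print(i, '->', next_char_boundary(data, i))
# Offsets 1 and 2 (mid-character) resynchronize at offset 3.
```

A byte-level model can, in principle, learn this same boundary rule from the prefix bits alone.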

The byte vocabulary for a language model is just tokens 0–255 plus a few special tokens (PAD, BOS, EOS). Compared to GPT-2’s 50,257-token vocabulary, this is nearly 200× smaller — the embedding matrix shrinks from 50,257 × dmodel to 259 × dmodel, a massive parameter saving that partially offsets the increased sequence length.

3. Naive Byte Transformers and ByT5

The simplest approach: take a standard transformer, replace the 50K-token vocabulary with 256 bytes, and train on raw byte sequences. This works — but the sequence length problem is brutal.

“The quick brown fox jumps over the lazy dog” is 9 subword tokens but 43 bytes — 4.8× longer. For transformer self-attention with O(n²) complexity, a 4.8× longer sequence means 23× more computation in the attention layers. A 2,048-token context window sees the equivalent of only 400–500 subword tokens — dramatically less context.
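The arithmetic behind those numbers, using this article's example ratio (the exact expansion factor varies by tokenizer and text):

```python
text = "The quick brown fox jumps over the lazy dog"
n_bytes = len(text.encode('utf-8'))    # 43 bytes
n_subword = 9                          # typical BPE token count for this sentence
expansion = n_bytes / n_subword        # ~4.8x longer byte sequence
attn_blowup = expansion ** 2           # O(n^2) attention -> ~23x more compute
effective_ctx = 2048 / expansion       # a 2,048-position window in subword terms
print(f"expansion {expansion:.1f}x, attention {attn_blowup:.0f}x, "
      f"effective context ~{effective_ctx:.0f} subword tokens")
# expansion 4.8x, attention 23x, effective context ~429 subword tokens
```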

ByT5 (Xue et al., 2022) and related character-level encoders such as CANINE and Charformer attack this with a few design choices:

  1. Heavier encoder, lighter decoder (ByT5): since most “understanding” happens in the encoder, make it 3× deeper than the decoder.
  2. Downsampling (CANINE, Charformer): after the initial encoder layers, pool groups of bytes to reduce sequence length before the expensive deep layers.
  3. Shorter outputs: the decoder operates on typically shorter output sequences, so byte-level decoding is less costly.

The results are striking: ByT5 matches the accuracy of mT5 (its subword sibling) on standard benchmarks, significantly outperforms it on noisy text and multilingual tasks, and is more robust to character-level perturbations — all without any tokenizer. The cost: roughly 1.5× slower per example, though competitive in accuracy at matched parameter counts.

import numpy as np

class ByteEncoder:
    """Simplified byte-level encoder with CANINE/Charformer-style downsampling."""

    def __init__(self, d_model=128, n_heads=4, pool_factor=4):
        self.d_model = d_model
        self.n_heads = n_heads
        self.pool_factor = pool_factor
        # Byte embedding: only 256 entries!
        self.embed = np.random.randn(256, d_model) * 0.02

    def encode(self, text):
        """Encode text as raw bytes, embed, and downsample."""
        byte_ids = list(text.encode('utf-8'))
        seq_len = len(byte_ids)

        # Step 1: Byte embedding lookup
        x = np.array([self.embed[b] for b in byte_ids])  # (seq_len, d_model)
        print(f"Input:  '{text}'")
        print(f"Bytes:  {seq_len} positions, shape {x.shape}")

        # Step 2: Add sinusoidal positional encoding
        pos = np.arange(seq_len)[:, None]
        dim = np.arange(0, self.d_model, 2)[None, :]
        angles = pos / (10000 ** (dim / self.d_model))
        x[:, 0::2] += np.sin(angles)
        x[:, 1::2] += np.cos(angles)

        # Step 3: Mean-pool downsampling (a CANINE/Charformer-style trick)
        k = self.pool_factor
        pad = (-seq_len) % k  # pad to a multiple of k so no bytes are dropped
        if pad:
            x = np.concatenate([x, np.zeros((pad, self.d_model))], axis=0)
        x_pooled = x.reshape(-1, k, self.d_model).mean(axis=1)  # (ceil(seq_len/k), d_model)
        print(f"Pooled: {x_pooled.shape[0]} positions (/{k}), shape {x_pooled.shape}")

        return x_pooled

enc = ByteEncoder(d_model=128, pool_factor=4)
enc.encode("Hello, world!")
print()
enc.encode("\u4f60\u597d\u4e16\u754c")  # Chinese: "Hello world"
print()
# Embedding table comparison
print(f"Byte embedding:    256 x 128 = {256*128:,} params")
print(f"Subword embedding: 50257 x 128 = {50257*128:,} params")
print(f"Reduction: {50257*128 // (256*128)}x fewer embedding params")

After downsampling by 4×, a 52-byte input becomes 13 positions for the expensive deep layers — comparable to the 9 subword tokens a standard tokenizer would produce. Mean-pooling tends to fold multi-byte UTF-8 sequences into roughly character-level representations (pool windows do not align exactly with character boundaries), and the tiny embedding table (256 rows instead of 50K) saves millions of parameters.

4. MegaByte: Patching Bytes for Efficiency

Pooling is a lossy compression that discards sub-character detail. MegaByte (Yu et al., 2023) takes a different approach: patching. The idea, borrowed from Vision Transformers, is deceptively simple: split the byte sequence into fixed-size patches and process them hierarchically.

The architecture has two components:

  1. A global model: a large transformer that runs over patch representations — each patch of P bytes is embedded into a single position, shrinking the sequence by P×.
  2. A local model: a small transformer that runs within each patch, autoregressively predicting its P bytes conditioned on the global model's output for that patch.

The attention cost drops from O(N²) for a flat byte transformer to O((N/P)² + N·P) for MegaByte. For P=8 and N=8,192, that is roughly 60× less attention computation. The global model captures long-range dependencies across the document, while the local model handles the fine-grained byte patterns within each patch.

def megabyte_costs(N, P):
    """Compute attention operations for MegaByte vs flat transformer."""
    flat_attn = N * N           # O(N^2) for flat byte transformer
    global_attn = (N // P) ** 2 # O((N/P)^2) for global model
    local_attn = N * P          # O(N*P) total for all local models
    mega_attn = global_attn + local_attn
    return flat_attn, mega_attn

print(f"{'Seq Len':>8} {'Flat':>12} {'MegaByte':>12} {'Speedup':>8}")
print("-" * 44)
for N in [512, 1024, 2048, 4096, 8192]:
    flat, mega = megabyte_costs(N, P=8)
    print(f"{N:>8} {flat:>12,} {mega:>12,} {flat/mega:>7.1f}x")

# Output:
# Seq Len         Flat     MegaByte  Speedup
# --------------------------------------------
#     512      262,144        8,192    32.0x
#    1024    1,048,576       24,576    42.7x
#    2048    4,194,304       81,920    51.2x
#    4096   16,777,216      294,912    56.9x
#    8192   67,108,864    1,114,112    60.2x

# MegaByte's advantage grows with sequence length!
# Patch embedding: embed each byte, concatenate the P byte embeddings,
# and project the concatenation to the global model dimension
P = 8
d_model = 512
byte_table = 256 * d_model                 # shared byte embedding table
projection = (P * d_model) * d_model       # concat-and-project linear layer
print(f"\nPatch embedding: {byte_table + projection:,} params")
print(f"Equivalent to a learned {P}-gram byte encoder")

On long-document benchmarks (PG-19, arXiv), MegaByte matches or outperforms subword transformers at the same compute budget while processing raw bytes. The local-global decomposition is not just efficient — it is architecturally natural. UTF-8 multi-byte sequences, word morphology, and common n-grams all happen at the local scale, while cross-sentence coherence and topic structure happen at the global scale.

5. MambaByte: Linear-Time Byte Processing

MegaByte still uses quadratic attention in its global model. MambaByte (Wang et al., 2024) asks: what if we skip attention entirely?

Mamba’s selective state-space model processes sequences in O(N) time through a scan operation — no attention matrix at all. For byte sequences, this is transformative: a 10K-byte sequence that would require 100 million attention operations in a transformer requires only 10K sequential scan steps in Mamba.
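A toy scalar recurrence makes the O(N) shape of this computation concrete. It is a drastic simplification (real Mamba uses input-dependent, discretized state updates and a hardware-efficient parallel scan); the point is only that there is one state update per byte and no attention matrix:

```python
import numpy as np

def ssm_scan(byte_ids, d_state=64, seed=0):
    """Toy linear-time scan: one hidden-state update per byte."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.8, 0.99, d_state)            # per-channel decay
    B = rng.standard_normal((256, d_state)) * 0.1  # input projection per byte value
    h = np.zeros(d_state)
    states = []
    for b in byte_ids:          # exactly N sequential steps: O(N * d_state)
        h = A * h + B[b]        # selective SSMs make A and B depend on the input
        states.append(h.copy())
    return np.array(states)

byte_ids = list("hello world".encode('utf-8'))
states = ssm_scan(byte_ids)
print(states.shape)  # (11, 64): one hidden state per byte
```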

The key insight: since byte-level models face a 3-5× sequence length increase over subword models, the choice of sequence model matters enormously. With quadratic attention, 4× longer means 16× more compute. With linear-complexity SSMs, 4× longer means just 4× more compute.

def compute_costs(N_values, P=8):
    """Compare operation counts: attention, MegaByte, and SSM."""
    results = []
    for N in N_values:
        flat = N * N                          # O(N^2) self-attention
        mega = (N // P) ** 2 + N * P          # MegaByte
        ssm = N * 64                          # O(N*d_state) for SSM scan
        results.append((N, flat, mega, ssm))
    return results

N_values = [256, 512, 1024, 2048, 4096, 8192, 16384]
results = compute_costs(N_values)

print(f"{'N':>6} {'Attention':>12} {'MegaByte':>12} {'SSM':>12}")
print("-" * 46)
for N, flat, mega, ssm in results:
    print(f"{N:>6} {flat:>12,} {mega:>12,} {ssm:>12,}")

# At N=8192:
#   Attention: 67,108,864
#   MegaByte:   1,114,112  (60x less)
#   SSM:          524,288  (128x less)
#
# SSM wins at long sequences because it's truly linear.
# The crossover point where SSM beats MegaByte depends on
# the constant factors (SSM has larger per-step cost).

MambaByte achieves competitive perplexity with subword transformers on language modeling benchmarks — without any patching, pooling, or architectural tricks. It simply processes raw bytes with a selective scan. The simplicity is the point: you do not need to engineer around the quadratic bottleneck when you can eliminate it.

The emerging frontier is hybrid architectures: use local attention within byte patches (where full expressiveness matters) and global SSMs across patches (where linear scaling matters). This combines the best of both worlds — the expressiveness of attention for short-range byte patterns and the efficiency of SSMs for long-range document structure.
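Under the same simplified operation-count model used in the snippets above, a hypothetical hybrid (full attention inside P-byte patches, a linear scan across the N/P patch summaries) would cost:

```python
def hybrid_cost(N, P=8, d_state=64):
    """Rough op count for a hypothetical local-attention + global-SSM hybrid."""
    local_attn = N * P               # exact attention inside each P-byte patch
    global_ssm = (N // P) * d_state  # linear SSM scan across patch summaries
    return local_attn + global_ssm

for N in [1024, 8192]:
    print(f"N={N}: {hybrid_cost(N):,} ops")
# N=1024: 16,384 ops
# N=8192: 131,072 ops
```

Both terms grow linearly in N, so the hybrid keeps SSM-like scaling while retaining exact attention at the byte scale.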


6. The Tradeoff Landscape

Having built the key architectures, let us compare them systematically:

Dimension        Subword          ByT5             MegaByte                    MambaByte
Vocabulary       32K–100K         256              256                         256
Seq Length       1× (baseline)    ~4×              ~4× (local), 0.5× (global)  ~4×
Attention Cost   O(T²)            O(N²/k²)         O((N/P)² + NP)              O(N) — no attention
Multilingual     Biased           Fair             Fair                        Fair
Robustness       Fragile          Robust           Robust                      Robust
Throughput       Fastest          ~1.5× slower     Competitive                 Competitive

The emerging consensus: byte-level modeling is not uniformly better or worse than subword modeling — it is a tradeoff. Byte-level wins on multilingual fairness, robustness, and simplicity. Subword wins on raw throughput for English-dominant tasks. The research frontier is closing the efficiency gap through architectural innovation, making the tokenizer increasingly optional.

# Robustness comparison: byte-level vs subword under perturbations
def char_edit_distance(a, b):
    """Simple Levenshtein distance."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    return dp[n][m]

original = "The transformer architecture"
perturbed = "Teh transfomer achitecture"  # 3 typos

# Byte-level: edit distance in byte space
orig_bytes = list(original.encode('utf-8'))
pert_bytes = list(perturbed.encode('utf-8'))
byte_dist = char_edit_distance(orig_bytes, pert_bytes)

# Subword-level: simulated tokens change drastically
# "transformer" -> 1 token; "transfomer" -> 2 tokens (split differently)
orig_tokens = ["The", " transform", "er", " architecture"]     # 4 tokens
pert_tokens = ["T", "eh", " trans", "f", "omer", " ach", "ite", "cture"]  # 8 tokens
token_dist = char_edit_distance(orig_tokens, pert_tokens)

print(f"Original:  '{original}'")
print(f"Perturbed: '{perturbed}'")
print(f"Byte edit distance:    {byte_dist} (small, proportional to typos)")
print(f"Token edit distance:   {token_dist} (large, tokenization shatters)")
print(f"Byte stability ratio:  {byte_dist / len(orig_bytes):.2%}")
print(f"Token stability ratio: {token_dist / len(orig_tokens):.2%}")

Three typos change 14% of bytes but restructure 200% of the token sequence (edit distance exceeds the original token count). This fragility is why byte-level models outperform subword models on noisy text — OCR output, social media, historical documents, and any domain where perfect spelling cannot be assumed.


7. The Future of Text Representation

The trajectory is clear. Early vision models used hand-crafted features (SIFT, HOG), then learned features from raw pixels. Early NLP relied on hand-crafted tokenizers — BPE merge rules are the SIFT features of language modeling. Byte-level models learn to extract structure from raw data, just as ConvNets learned to extract features from raw pixels.

Tokenization was always a compression step — trading information for computational convenience. As models and hardware grow more capable, the need for this lossy compression diminishes. We are not there yet: subword models still dominate production systems because they are faster. But the gap is closing — MegaByte and MambaByte already match subword accuracy at comparable compute, and the next generation of efficient architectures may close it entirely.

The question is not whether byte-level models will replace tokenizers. It is when. Every byte carries information. Every merge rule throws some away. The architectures that learn from all 256 bytes will ultimately outperform those that start by discarding structure.

References & Further Reading

  - Xue, L., et al. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL.
  - Clark, J. H., et al. (2022). CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. TACL.
  - Yu, L., et al. (2023). MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers.
  - Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
  - Wang, J., et al. (2024). MambaByte: Token-Free Selective State Space Model.