Text-to-Speech from Scratch
1. The TTS Pipeline
You type "Hello, how are you today?" into a text box and hit play. Within milliseconds, a natural-sounding human voice speaks your words — with proper intonation, natural pauses, and even emotion. How does a string of characters become air pressure waves that sound like a person?
If you've read our speech recognition post, you already know the forward direction: audio → features → text. Text-to-speech is the inverse: text → linguistic features → mel spectrogram → waveform. But "inverse" doesn't mean "reverse the same steps." Speech recognition is many-to-one (many different pronunciations of "hello" all map to the same word), while TTS is one-to-many (the word "hello" could be spoken in countless ways). This fundamental asymmetry makes TTS a harder generation problem.
The modern TTS pipeline has three stages. First, text analysis converts raw characters into a linguistic representation — expanding abbreviations, resolving ambiguities, and converting letters to phonemes. Second, an acoustic model transforms this linguistic sequence into a mel spectrogram (a time-frequency representation of sound, as we explored in our audio features post). Third, a vocoder converts the mel spectrogram into an actual audio waveform you can hear.
But before any of that, we need to tackle a surprisingly tricky problem: what do the letters in a word actually sound like?
Text Normalization and Grapheme-to-Phoneme Conversion
Raw text is messy. "Dr. Smith lives at 123 Main St." contains an abbreviation ("Dr."), a number ("123"), and another abbreviation ("St."). A TTS system needs to expand these: "Doctor Smith lives at one hundred twenty-three Main Street." Dates, currencies, acronyms — all need special handling before we can think about pronunciation.
Then comes grapheme-to-phoneme (G2P) conversion. English spelling is notoriously irregular: "through," "though," "thought," "tough," and "trough" share four letters but sound completely different. A G2P module maps written text to phonemes — the actual sound units of speech. The word "through" becomes [θ, r, uː], not [t, h, r, o, u, g, h].
import re

# Simple text normalizer
def normalize_text(text):
    """Expand common abbreviations to spoken form."""
    replacements = {
        r'\bDr\.': 'Doctor', r'\bMr\.': 'Mister',
        r'\bSt\.': 'Street', r'\bvs\.': 'versus',
    }
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text)
    return text

# Lookup-based G2P (real systems use neural seq2seq models)
PHONEME_MAP = {
    'hello': ['HH', 'AH', 'L', 'OW'],
    'world': ['W', 'ER', 'L', 'D'],
    'through': ['TH', 'R', 'UW'],
    'though': ['DH', 'OW'],
    'thought': ['TH', 'AO', 'T'],
    'tough': ['T', 'AH', 'F'],
    'speech': ['S', 'P', 'IY', 'CH'],
}

def grapheme_to_phoneme(word):
    """Convert a word to phonemes via dictionary lookup."""
    word_lower = word.lower().strip('.,!?')
    if word_lower in PHONEME_MAP:
        return PHONEME_MAP[word_lower]
    # Fallback: naive letter-to-sound (real systems use seq2seq)
    return [ch.upper() for ch in word_lower if ch.isalpha()]

text = "Dr. Smith said hello"
normalized = normalize_text(text)
for word in normalized.split():
    phonemes = grapheme_to_phoneme(word)
    print(f"{word:<10} -> {phonemes}")
# Doctor -> ['D', 'O', 'C', 'T', 'O', 'R'] (fallback)
# Smith -> ['S', 'M', 'I', 'T', 'H'] (fallback)
# said -> ['S', 'A', 'I', 'D'] (fallback)
# hello -> ['HH', 'AH', 'L', 'OW'] (dictionary)
Production TTS systems like Google's use neural G2P models trained on pronunciation dictionaries (CMU Pronouncing Dictionary has ~134,000 words). For out-of-vocabulary words, they fall back to learned letter-to-sound rules. Our lookup table captures the core idea: spelling doesn't equal sound, and we need an explicit mapping layer.
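Number expansion is the other half of normalization that the snippet above skips. Here's a minimal sketch for integers below one million; a real normalizer also handles ordinals, dates, currency, and decimals:

```python
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in [0, 999999] -- a toy normalizer rule."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ('-' + ONES[rest] if rest else '')
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        word = ONES[hundreds] + ' hundred'
        return word + (' ' + number_to_words(rest) if rest else '')
    thousands, rest = divmod(n, 1000)
    word = number_to_words(thousands) + ' thousand'
    return word + (' ' + number_to_words(rest) if rest else '')

print(number_to_words(123))   # one hundred twenty-three
print(number_to_words(1984))  # one thousand nine hundred eighty-four
```

Even this toy version shows why normalization is context-dependent: "1984" should become "nineteen eighty-four" when it's a year, which requires knowing what kind of number you're looking at.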
2. Duration Prediction — Controlling Speech Timing
Here's a fundamental mismatch: the word "hello" is 5 characters (or 4 phonemes), but when spoken at normal speed, it occupies roughly 50 mel spectrogram frames (~500 milliseconds). Each phoneme lasts a different amount of time — vowels are typically longer than consonants, stressed syllables longer than unstressed ones, and the final syllable of a sentence often gets stretched (a phenomenon called sentence-final lengthening).
How does a TTS model bridge this length gap? It needs a duration predictor — a small network that estimates how many mel frames each phoneme should span. FastSpeech (Ren et al., 2019) introduced the length regulator: given phoneme embeddings and their predicted durations, simply repeat each embedding the predicted number of times. If phoneme "HH" gets duration 8 and "AH" gets duration 12, we repeat the "HH" embedding 8 times and the "AH" embedding 12 times, producing a sequence that matches the target mel spectrogram length.
import numpy as np

VOWELS = {'AH', 'OW', 'IY', 'UW', 'AO', 'ER', 'EY', 'AY', 'AE'}

def predict_durations(phonemes, seed=42):
    """Predict duration (in mel frames) for each phoneme.
    Real systems use a small conv network; we use heuristics."""
    rng = np.random.RandomState(seed)
    durations = []
    for p in phonemes:
        if p in VOWELS:
            durations.append(rng.randint(10, 18))  # Vowels: longer
        else:
            durations.append(rng.randint(4, 10))   # Consonants: shorter
    return durations

def length_regulator(embeddings, durations):
    """Expand phoneme embeddings by repeating each one duration times.
    This is FastSpeech's key insight for non-autoregressive TTS."""
    expanded = []
    for emb, dur in zip(embeddings, durations):
        expanded.extend([emb] * dur)
    return np.array(expanded)

# "hello" = 4 phonemes -> expanded to ~50 mel frames
phonemes = ['HH', 'AH', 'L', 'OW']
embed_dim = 64
rng = np.random.RandomState(7)
embeddings = [rng.randn(embed_dim) for _ in phonemes]
durations = predict_durations(phonemes)
expanded = length_regulator(embeddings, durations)
print(f"Phonemes: {phonemes}")
print(f"Durations: {durations} (total: {sum(durations)} frames)")
print(f"Expanded: {expanded.shape}")
# Phonemes: ['HH', 'AH', 'L', 'OW']
# Durations: [7, 14, 6, 17] (total: 44 frames)
# Expanded: (44, 64)
This is elegantly simple. The duration predictor turns a discrete phoneme sequence into a continuous-time representation by controlling how many output frames each phoneme occupies. Change the durations and you change the rhythm of speech: speed it up, slow it down, add dramatic pauses, or create the natural ebb and flow of conversation.
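Speech-rate control falls out of this almost for free: scale the predicted durations and the same phonemes stretch or compress. A quick sketch (the frame counts below are illustrative, matching the earlier example):

```python
def adjust_rate(durations, speed=1.0):
    """Scale phoneme durations to change speech rate.
    speed > 1.0 -> faster (fewer frames); speed < 1.0 -> slower."""
    return [max(1, int(round(d / speed))) for d in durations]

durations = [7, 14, 6, 17]  # frames for 'HH', 'AH', 'L', 'OW'
for speed in [0.5, 1.0, 1.5]:
    scaled = adjust_rate(durations, speed)
    print(f"speed={speed}: {scaled} (total {sum(scaled)} frames)")
# speed=0.5: [14, 28, 12, 34] (total 88 frames)
# speed=1.0: [7, 14, 6, 17] (total 44 frames)
# speed=1.5: [5, 9, 4, 11] (total 29 frames)
```

Production systems do something similar but smarter, e.g. scaling vowels more than consonants so fast speech stays intelligible.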
3. Autoregressive Mel Generation — Tacotron
Now we arrive at the heart of neural TTS. Given a sequence of phoneme embeddings (expanded to the right length by the duration predictor), how do we generate the actual mel spectrogram — those 80 frequency bins at each time step that capture the spectral envelope of speech?
Tacotron 2 (Shen et al., 2018) solved this with an encoder-decoder architecture with attention. If you've read our attention post, this will feel familiar, but with a crucial twist.
The encoder processes the phoneme sequence through convolutional layers and a bidirectional LSTM, producing hidden states that encode each sound in context. The decoder is an autoregressive LSTM that generates mel frames one at a time. At each step it:
- Attends over the encoder outputs to determine "where am I in the text?"
- Combines the attention context with its own hidden state
- Predicts the next mel frame (80 values) plus a stop token ("am I done?")
- Feeds the predicted mel frame back as input to the next step
The attention mechanism uses location-sensitive attention — it considers not just content similarity but the previous attention weights, encouraging smooth left-to-right progression through the text. The resulting attention alignment matrix should look roughly diagonal: the model reads through the input in order.
During training, the decoder uses teacher forcing: it receives the ground-truth previous mel frame as input. At inference, it feeds back its own predictions — a mismatch called exposure bias that can cause error accumulation and attention failures.
import numpy as np

def location_sensitive_attention(query, keys, prev_weights, W_loc):
    """Attention that uses the previous alignment for monotonic progression."""
    scores = keys @ query            # Content-based: (enc_len,)
    loc_bias = W_loc @ prev_weights  # Location-based bias
    scores = scores + loc_bias
    weights = np.exp(scores - scores.max())  # Softmax
    weights = weights / weights.sum()
    return weights

def tacotron_decode(encoder_out, max_steps=60, mel_dim=80):
    """Simplified Tacotron 2 autoregressive decoder."""
    enc_len, enc_dim = encoder_out.shape
    rng = np.random.RandomState(42)
    hidden = np.zeros(enc_dim)
    prev_weights = np.zeros(enc_len)
    prev_weights[0] = 1.0  # Start at the first phoneme
    W_loc = rng.randn(enc_len, enc_len) * 0.1
    W_out = rng.randn(mel_dim, enc_dim * 2) * 0.05
    mel_frames, alignments = [], []
    for step in range(max_steps):
        # Where are we in the text?
        weights = location_sensitive_attention(
            hidden, encoder_out, prev_weights, W_loc
        )
        context = weights @ encoder_out  # Weighted sum of encoder states
        # Predict a mel frame from context + hidden state
        combined = np.concatenate([context, hidden])
        mel_frame = np.tanh(W_out @ combined)
        # Stop check: end when attention reaches the final phoneme
        if np.argmax(weights) == enc_len - 1 and step > enc_len * 3:
            break
        mel_frames.append(mel_frame)
        alignments.append(weights)
        # Autoregressive: update state with the current context
        hidden = 0.9 * hidden + 0.1 * context
        prev_weights = weights
    mel_spec = np.stack(mel_frames)
    alignment = np.stack(alignments)
    print(f"Generated {mel_spec.shape[0]} mel frames from {enc_len} phonemes")
    print(f"Mel shape: {mel_spec.shape}, Alignment shape: {alignment.shape}")
    return mel_spec, alignment

# 8 phonemes for "hello world" -> ~60 mel frames
encoder_out = np.random.RandomState(7).randn(8, 128) * 0.3
mel, align = tacotron_decode(encoder_out)
# Generated 60 mel frames from 8 phonemes
# Mel shape: (60, 80), Alignment shape: (60, 8)
The attention alignment is a powerful diagnostic. A clean diagonal means the model is reading in order. Blurry or jumbled alignments signal problems — repeated words, skipped phrases, or gibberish. Tacotron 2's location-sensitive attention was specifically designed to prevent these failures by biasing toward monotonic left-to-right progression.
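The exposure-bias mismatch described earlier can also be sketched in a few lines. The toy one-step "decoder" below is invented for illustration; the point is only the structural difference between the two regimes — teacher forcing always conditions on ground truth, while free-running generation feeds its own errors back into itself:

```python
import numpy as np

def decoder_step(prev_frame, W):
    """Toy one-step decoder: next frame is a fixed nonlinear map of the previous."""
    return np.tanh(W @ prev_frame)

rng = np.random.RandomState(0)
dim = 8
W = rng.randn(dim, dim) * 0.3
ground_truth = [rng.randn(dim) * 0.5 for _ in range(20)]

# Teacher forcing (training): input is always the TRUE previous frame
tf_err = np.mean([np.linalg.norm(decoder_step(ground_truth[t - 1], W) - ground_truth[t])
                  for t in range(1, 20)])

# Free-running (inference): input is the model's OWN previous output
frame = ground_truth[0]
fr_err = 0.0
for t in range(1, 20):
    frame = decoder_step(frame, W)  # prediction fed back; errors can compound
    fr_err += np.linalg.norm(frame - ground_truth[t])
fr_err /= 19

print(f"teacher-forced error: {tf_err:.3f}")
print(f"free-running error:   {fr_err:.3f}")
```

With an untrained random map neither error is meaningful in absolute terms; what matters is the feedback loop in the second regime, which is exactly the path by which attention failures snowball in a real Tacotron-style decoder.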
4. Spectrogram Inversion — Griffin-Lim
We have a mel spectrogram — 80 frequency bins at each time step, capturing what frequencies are present and how loud they are. But to produce actual audio, we need a waveform: amplitude values at 22,050+ samples per second. How do we get from a compressed spectral representation back to a time-domain signal?
The Short-Time Fourier Transform decomposes audio into magnitude and phase. Our mel spectrogram only captures magnitude information (and compressed at that). The phase — encoding the precise timing of each frequency component — is missing. This is the phase reconstruction problem, and it's why early TTS sounded robotic.
The Griffin-Lim algorithm (1984) is an elegant iterative solution. Start with random phase, then repeatedly enforce a consistency constraint:
- Combine the target magnitude with the current phase estimate
- Inverse STFT to get a time-domain signal
- Forward STFT to get a consistent magnitude + phase
- Replace the magnitude with our target (keeping the new phase)
- Repeat until convergence (~30–60 iterations)
import numpy as np

def griffin_lim(magnitude, n_iter=32, hop=256):
    """Reconstruct audio from a magnitude spectrogram via phase estimation."""
    n_fft = (magnitude.shape[0] - 1) * 2  # 257 bins -> 512-point FFT
    win_len = n_fft                        # window must match the FFT size
    sig_len = magnitude.shape[1] * hop
    window = np.hanning(win_len)
    rng = np.random.RandomState(0)
    phase = np.exp(2j * np.pi * rng.rand(*magnitude.shape))

    def istft(S):
        """Overlap-add inverse STFT."""
        signal = np.zeros(sig_len)
        for t in range(S.shape[1]):
            frame = np.real(np.fft.irfft(S[:, t], n=n_fft))
            start = t * hop
            end = min(start + win_len, sig_len)
            signal[start:end] += frame[:end - start] * window[:end - start]
        return signal

    for _ in range(n_iter):
        # Combine the target magnitude with the estimated phase -> time domain
        signal = istft(magnitude * phase)
        # Forward STFT -> extract the new (consistent) phase
        for t in range(magnitude.shape[1]):
            start = t * hop
            frame = signal[start:start + win_len]
            if len(frame) < win_len:
                frame = np.pad(frame, (0, win_len - len(frame)))
            spec = np.fft.rfft(frame * window)
            phase[:, t] = np.exp(1j * np.angle(spec))

    # Final reconstruction with the target magnitude and estimated phase
    return istft(magnitude * phase)

mag = np.abs(np.random.RandomState(42).randn(257, 50)) * 0.5
audio = griffin_lim(mag, n_iter=32)
print(f"Input: {mag.shape[0]} freq bins x {mag.shape[1]} frames")
print(f"Output: {len(audio)} samples ({len(audio)/22050:.2f}s at 22050 Hz)")
# Input: 257 freq bins x 50 frames
# Output: 12800 samples (0.58s at 22050 Hz)
Each iteration brings the phase estimate closer to a physically plausible solution. After 30–60 iterations, the result is intelligible — though it retains a characteristic metallic quality. This motivated the search for neural vocoders that learn to generate realistic waveforms directly.
5. Neural Vocoders — WaveNet to HiFi-GAN
In 2016, DeepMind's WaveNet shattered expectations for synthetic speech quality. The key insight: instead of estimating missing phase information, learn to generate raw audio samples directly, conditioned on the mel spectrogram.
WaveNet generates audio at the sample level — predicting each of the 24,000 samples needed for one second of audio. To make this tractable, it uses dilated causal convolutions. A standard convolution with kernel size 2 only looks at 2 previous samples. But by exponentially increasing the dilation — 1, 2, 4, 8, ..., 512 — a stack of 10 layers can "see" 1,024 previous samples (~43ms at 24 kHz) without needing 1,024 layers.
Each sample is quantized using mu-law encoding, compressing the 16-bit range (65,536 values) down to 256 categories. The model outputs a softmax over these 256 bins and samples from it — subtle variation that sounds natural rather than deterministic.
import numpy as np

def build_dilated_stack(n_layers=10, kernel_size=2):
    """Stack of dilated causal convolutions with exponential dilation."""
    layers = []
    receptive_field = 1
    for i in range(n_layers):
        dilation = 2 ** i
        reach = dilation * (kernel_size - 1)
        receptive_field += reach
        layers.append({
            'dilation': dilation,
            'reach': reach,
            'cumulative_rf': receptive_field,
        })
    return layers, receptive_field

def mu_law_encode(x, mu=255):
    """Compress audio amplitude via mu-law companding (before quantizing to mu+1 bins)."""
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255):
    """Expand mu-law values back to linear amplitude."""
    return np.sign(y) * (1.0 / mu) * ((1 + mu) ** np.abs(y) - 1)

# Compare receptive fields: dilated vs standard convolutions
dilated_layers, rf_dilated = build_dilated_stack(10)
rf_no_dilation = 1 + 10 * 1  # 10 layers, kernel 2, dilation always 1
print("Dilated causal convolution stack:")
for layer in dilated_layers:
    print(f"  dilation={layer['dilation']:<4} reach={layer['reach']:<5} "
          f"RF={layer['cumulative_rf']}")
print(f"\n10 dilated layers: RF = {rf_dilated} samples")
print(f"10 standard layers: RF = {rf_no_dilation} samples")
print(f"Ratio: {rf_dilated // rf_no_dilation}x wider, same depth")

# Mu-law in action: compress dynamic range for quantization
audio = np.array([0.01, 0.1, 0.5, 0.9])
encoded = mu_law_encode(audio)
decoded = mu_law_decode(encoded)
print(f"\nMu-law: {audio} -> {np.round(encoded, 3)} -> {np.round(decoded, 3)}")
# 10 dilated layers: RF = 1024 samples
# 10 standard layers: RF = 11 samples
# Ratio: 93x wider, same depth
# Mu-law: [0.01 0.1 0.5 0.9] -> [0.228 0.591 0.876 0.981] -> [0.01 0.1 0.5 0.9]
WaveNet produced stunning quality but at a crushing cost: one second of audio required 24,000 sequential forward passes. A 10-second clip took minutes on a GPU.
The race to speed up neural vocoders led to HiFi-GAN (Kong et al., 2020), which flipped the paradigm. Instead of sample-by-sample autoregression, HiFi-GAN uses a GAN-based architecture (see our GAN post) that generates the entire waveform in one forward pass. The generator uses transposed convolutions to upsample the mel spectrogram from ~86 frames/second to 22,050 samples/second. Two discriminators — a multi-period discriminator (MPD) that checks periodic patterns and a multi-scale discriminator (MSD) that evaluates at different resolutions — ensure the output sounds realistic. The result: near-WaveNet quality at real-time speeds on a CPU.
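The upsampling arithmetic is worth making concrete. The widely published HiFi-GAN V1 configuration uses four transposed-convolution stages with rates 8, 8, 2, 2, and the product of those rates must equal the mel hop length (256 here); treat the specific numbers as an assumption, since other configs exist. A sketch of the shape bookkeeping:

```python
def upsample_shapes(n_mel_frames, rates=(8, 8, 2, 2), hop=256):
    """Track sequence length through HiFi-GAN-style transposed-conv stages."""
    product = 1
    length = n_mel_frames
    shapes = [length]
    for r in rates:
        length *= r   # each transposed conv multiplies the time axis by its rate
        product *= r
        shapes.append(length)
    assert product == hop, "upsampling rates must multiply to the hop length"
    return shapes

shapes = upsample_shapes(86)  # ~1 second of mel frames at hop 256, 22050 Hz
print(" -> ".join(str(s) for s in shapes))
# 86 -> 688 -> 5504 -> 11008 -> 22016
```

This is why the whole waveform can come out in one forward pass: the generator is just a fixed stack of upsampling convolutions, with no per-sample recurrence to unroll.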
6. Modern TTS — VITS and Beyond
Every system we've discussed has a seam: one model generates the mel spectrogram, another converts it to audio. Each is trained separately, and errors compound at the boundary. What if we could train the entire pipeline end-to-end?
VITS (Kim et al., 2021) does exactly this by combining three powerful ideas. A variational autoencoder (see our autoencoders post) learns a latent speech representation. A normalizing flow (see our normalizing flows post) transforms a simple Gaussian into complex speech distributions. And an adversarial discriminator ensures the generated audio is indistinguishable from real recordings. The combined loss — reconstruction + KL divergence + adversarial — trains everything jointly.
But the real paradigm shift came when researchers realized TTS is fundamentally a sequence-to-sequence generation problem — just like language modeling. Bark (Suno AI, 2023) treats audio as discrete tokens via a neural codec (Encodec), then generates those tokens with a GPT-style transformer. The same architecture that generates text can generate speech, music, and environmental sounds. VALL-E (Wang et al., 2023) takes this further: given just 3 seconds of someone's voice, it can synthesize arbitrary text in that voice — zero-shot voice cloning via in-context learning.
import numpy as np

def vits_pipeline(phoneme_ids, embed_dim=64, latent_dim=16, mel_dim=80):
    """Simplified VITS: phonemes -> encoder -> VAE latent -> mel frames."""
    rng = np.random.RandomState(42)
    n_phones = len(phoneme_ids)
    # Text encoder: phoneme IDs -> hidden states
    embed_table = rng.randn(40, embed_dim) * 0.3
    enc_out = np.array([embed_table[pid] for pid in phoneme_ids])
    context = enc_out.mean(axis=0)  # Simplified: mean-pool
    # VAE: predict mean and log-variance of the latent distribution
    W_mu = rng.randn(latent_dim, embed_dim) * 0.1
    W_logvar = rng.randn(latent_dim, embed_dim) * 0.1
    mu = W_mu @ context
    logvar = W_logvar @ context
    std = np.exp(0.5 * logvar)
    # Sample different z values = different "voices" for the same text
    W_dec = rng.randn(mel_dim, latent_dim) * 0.2
    for seed in [0, 1, 2]:
        eps = np.random.RandomState(seed).randn(latent_dim)
        z = mu + std * eps  # Reparameterization trick
        mel = np.tanh(W_dec @ z)
        mel_full = np.tile(mel, (n_phones * 8, 1))
        print(f"Voice {seed}: z_norm={np.linalg.norm(z):.3f}, "
              f"mel_range=[{mel_full.min():.3f}, {mel_full.max():.3f}], "
              f"shape={mel_full.shape}")
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    print(f"\nKL divergence from prior: {kl:.3f}")
    print("Same phonemes + different z = different speech characteristics")

# "hello" as phoneme IDs: HH=7, AH=2, L=11, OW=14
vits_pipeline([7, 2, 11, 14])
# Voice 0: z_norm=4.079, mel_range=[-0.959, 0.943], shape=(32, 80)
# Voice 1: z_norm=4.590, mel_range=[-0.968, 0.991], shape=(32, 80)
# Voice 2: z_norm=4.478, mel_range=[-0.992, 0.925], shape=(32, 80)
#
# KL divergence from prior: 0.191
# Same phonemes + different z = different speech characteristics
The convergence is striking: modern TTS looks increasingly like language modeling, operating on audio tokens instead of text tokens. The same attention mechanisms, autoregressive generation, and scaling laws apply. The boundary between "understanding language" and "speaking language" is dissolving into a unified sequence prediction problem.
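To make the "audio as tokens" view concrete, here's a minimal sketch: quantize a waveform chunk by chunk against a codebook of reference vectors. The random codebook and 16-sample chunks are invented stand-ins for a learned neural codec like Encodec, but the output is exactly the kind of discrete sequence a GPT-style model would predict token by token:

```python
import numpy as np

def codec_tokenize(audio, codebook, chunk=16):
    """Map each audio chunk to the index of its nearest codebook vector."""
    n_chunks = len(audio) // chunk
    frames = audio[:n_chunks * chunk].reshape(n_chunks, chunk)
    # Nearest-neighbor lookup: (n_chunks, n_codes) squared-distance matrix
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.RandomState(0)
codebook = rng.randn(256, 16) * 0.1   # 256 codes, like one codec quantizer level
audio = rng.randn(1600) * 0.1         # ~0.07s of "audio" at 22050 Hz

tokens = codec_tokenize(audio, codebook)
print(f"{len(audio)} samples -> {len(tokens)} tokens, vocab size 256")
print("first tokens:", tokens[:8])
# A language model would now predict tokens[t] from tokens[:t],
# exactly as it predicts the next word in text.
```

Real codecs use several residual quantizer levels and a learned decoder that maps tokens back to audio, but the interface to the language model is the same: a vocabulary, a sequence, and next-token prediction.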
7. Putting It All Together
We've traced the complete journey from text to speech: normalize the raw text, convert letters to phonemes, predict how long each phoneme should last, generate a mel spectrogram frame by frame, and synthesize an audio waveform. Each stage adds information that wasn't in the original text — durations add timing, the acoustic model adds spectral detail, and the vocoder adds fine-grained waveform structure.
The evolution from Griffin-Lim's metallic reconstructions to VITS's natural output mirrors the arc of deep learning: classical signal processing gave way to neural networks, separate components merged into end-to-end systems, and the architecture converged on the transformer. Today's best TTS systems are, at their core, language models that output sound instead of text — closing the loop between understanding and generation that has fascinated AI researchers for decades.
References & Further Reading
- Shen et al. — Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (2018) — the Tacotron 2 paper establishing the modern encoder-decoder TTS paradigm
- van den Oord et al. — WaveNet: A Generative Model for Raw Audio (2016) — DeepMind's breakthrough neural vocoder using dilated causal convolutions
- Ren et al. — FastSpeech: Fast, Robust and Controllable Text to Speech (2019) — introduced the duration predictor and length regulator for non-autoregressive TTS
- Kong et al. — HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (2020) — GAN-based vocoder achieving real-time CPU synthesis
- Kim et al. — Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (2021) — VITS: the first fully end-to-end TTS model
- Wang et al. — Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023) — VALL-E: 3-second voice cloning via audio language modeling
- Griffin & Lim — Signal Estimation from Modified Short-Time Fourier Transform (1984) — the classic iterative phase reconstruction algorithm
- Suno AI — Bark (2023) — GPT-style text-to-audio generation for speech, music, and sound effects