
Speech Recognition from Scratch

1. The Speech Recognition Problem

You say "Hey Siri, set a timer for five minutes" — 0.8 seconds of air pressure waves hit a microphone, get digitized into 12,800 samples, and somehow become the text "set a timer for five minutes." How does a machine bridge the gap between vibrating air and written language?

The core challenge is an input-output mismatch. A speech waveform sampled at 16 kHz produces 16,000 numbers per second. Convert that to a mel spectrogram at 100 frames per second and a 3-second utterance still yields ~300 frames, yet the transcription "set a timer" has only 11 characters. The input sequence is well over an order of magnitude longer than the output.

Worse, we don't know which frames correspond to which characters. The word "timer" might occupy frames 120-220, but the exact boundaries are fuzzy. Vowels bleed into consonants. Speakers pause unpredictably. Unlike image classification where one input maps to one label, speech recognition is a sequence-to-sequence problem with unknown alignment.

Traditional systems solved this with a cascade of hand-engineered components: a Gaussian Mixture Model to score acoustic features, a Hidden Markov Model for temporal alignment, a pronunciation dictionary to map phonemes to words, and a language model to pick the most likely sentence. Each piece required specialized linguistic expertise. Modern end-to-end neural systems replace the entire pipeline with a single model trained on (audio, transcript) pairs. The question is: how do you train a neural network when the input has 300 time steps and the output has 12, and you don't know the alignment between them?

If you've read our audio features post, you already know how to build the front end — spectrograms, mel filterbanks, MFCCs. Now we build the brain.

import numpy as np

# Simulate mel spectrogram extraction from a waveform
np.random.seed(42)
sr = 16000              # 16 kHz sample rate
duration = 3.0          # 3 seconds of speech
n_samples = int(sr * duration)  # 48,000 raw samples

# Generate synthetic speech-like signal (sum of formant frequencies)
t = np.linspace(0, duration, n_samples)
waveform = 0.5 * np.sin(2 * np.pi * 250 * t)   # F1 ~250 Hz
waveform += 0.3 * np.sin(2 * np.pi * 2500 * t)  # F2 ~2500 Hz
waveform += 0.1 * np.random.randn(n_samples)     # noise

# STFT -> mel spectrogram (simplified)
hop = sr // 100         # 10ms hop -> 100 frames/sec
n_fft = 512
n_frames = (n_samples - n_fft) // hop + 1
n_mels = 80            # 80 mel frequency bins

# Simulated mel spectrogram (in practice: STFT + mel filterbank)
mel_spec = np.random.randn(n_frames, n_mels) * 0.5 + 2.0

transcript = "set a timer"
print(f"Waveform:       {n_samples:,} samples")
print(f"Mel spectrogram: {mel_spec.shape[0]} frames x {mel_spec.shape[1]} mels")
print(f"Transcript:      '{transcript}' ({len(transcript)} chars)")
print(f"Mismatch:        {mel_spec.shape[0]} frames -> {len(transcript)} characters")
# Waveform:       48,000 samples
# Mel spectrogram: 297 frames x 80 mels
# Transcript:      'set a timer' (11 chars)
# Mismatch:        297 frames -> 11 characters

297 frames in, 11 characters out. Something needs to bridge that gap. Enter CTC.

2. CTC — Connectionist Temporal Classification

In 2006, Alex Graves and colleagues introduced Connectionist Temporal Classification (CTC) — an elegant algorithm that eliminated the need for frame-level alignment labels. Before CTC, training a speech model required someone to annotate which frames corresponded to which phonemes, a painstaking and expensive process. CTC lets you train with just (audio, transcript) pairs.

The key insight is a special blank token, written as ε (epsilon). At each time frame, the model can either emit a character or emit blank, meaning "no output yet." A sequence of frame-level emissions is called a CTC path. The word "cat" can be produced by many different paths: ccccεaaaεtt, cεaεt, εccεεaεεtε, and so on.

All of these collapse to "cat" under the CTC collapse rule: (1) merge consecutive duplicate characters, then (2) remove all blanks. So ccccεaaaεtt → cat and cεaεt → cat. But note: cεcεaεt collapses to ccat, because the blank separates the two c's, preventing them from merging. This is how CTC handles repeated characters like the double-l in "hello."
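The collapse rule is mechanical enough to state in a few lines. A minimal helper (ε here is just an ordinary character standing in for the blank token):

```python
def ctc_collapse(path, blank='ε'):
    """CTC collapse rule: merge consecutive duplicates, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return ''.join(out)

for p in ['ccccεaaaεtt', 'cεaεt', 'cεcεaεt']:
    print(p, '->', ctc_collapse(p))
# ccccεaaaεtt -> cat
# cεaεt -> cat
# cεcεaεt -> ccat
```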

The CTC loss is defined as the negative log probability of the target transcript, marginalizing over all paths that collapse to it: P(Y | X) = Σ_{π ∈ B⁻¹(Y)} Π_{t=1}^{T} P(π_t | X), where B⁻¹(Y) is the set of all paths that collapse to Y. With T frames and a label of length L, there can be exponentially many valid paths. Summing over all of them seems impossible.

The solution is the forward algorithm — dynamic programming over a trellis. First, build an expanded label by inserting blanks between every character and at both ends. For "cat", this gives [ε, c, ε, a, ε, t, ε] — length 2L+1 = 7. Define α(t, s) as the total probability of all partial paths that (a) end at position s in the expanded label at time t, and (b) are consistent with the target up to that point.

The transitions follow three rules: from position s, you can (1) stay at s (repeat the same symbol), (2) move to s+1 (next symbol in the expanded label), or (3) skip from s to s+2 only if s+2 is not blank and the character at s+2 differs from the character at s. This skip allows going directly from one character to the next when there's no ambiguity. The final CTC probability is α(T, S) + α(T, S-1), summing the two valid ending positions (last character or trailing blank).

import numpy as np

def ctc_forward(log_probs, target, blank=0):
    """CTC forward algorithm.
    log_probs: (T, C) log probabilities per frame per character
    target: list of character indices (without blanks)
    Returns: (negative log-likelihood, alpha trellis of shape (T, S))
    """
    T = log_probs.shape[0]
    # Build expanded label: insert blanks between chars and at ends
    L = len(target)
    S = 2 * L + 1
    expanded = [blank] * S
    for i in range(L):
        expanded[2 * i + 1] = target[i]

    # Alpha matrix: log domain for numerical stability
    LOG_ZERO = -1e30
    alpha = np.full((T, S), LOG_ZERO)

    # Initialize: can start at blank or first char
    alpha[0, 0] = log_probs[0, expanded[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, expanded[1]]

    # Fill the trellis
    for t in range(1, T):
        for s in range(S):
            # Stay at same position
            a = alpha[t-1, s]
            # Move from previous position
            if s > 0:
                a = np.logaddexp(a, alpha[t-1, s-1])
            # Skip blank (char_i -> char_j, i != j)
            if s > 1 and expanded[s] != blank and expanded[s] != expanded[s-2]:
                a = np.logaddexp(a, alpha[t-1, s-2])
            alpha[t, s] = a + log_probs[t, expanded[s]]

    # Total probability: end at last blank or last char
    loss = np.logaddexp(alpha[T-1, S-1], alpha[T-1, S-2])
    return -loss, alpha

# Example: target "cat" with T=10 frames, C=5 (blank=0, c=1, a=2, t=3, x=4)
np.random.seed(7)
T, C = 10, 5
logits = np.random.randn(T, C) * 0.5
# Bias frames toward correct alignment
for t in range(0, 3): logits[t, 1] += 2.0   # c
for t in range(3, 7): logits[t, 2] += 2.0   # a
for t in range(7, 10): logits[t, 3] += 2.0  # t

log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
target = [1, 2, 3]  # c, a, t

loss, alpha = ctc_forward(log_probs, target)
print(f"CTC loss (neg log-likelihood): {loss:.4f}")
print(f"Alpha matrix shape: {alpha.shape}  (T={T} x S={2*len(target)+1})")
# CTC loss (neg log-likelihood): 3.9043
# Alpha matrix shape: (10, 7)  (T=10 x S=7)

That 35-line function is the mathematical core of CTC. The alpha matrix encodes every valid alignment simultaneously — a beautiful application of dynamic programming that made end-to-end speech recognition practical.


3. Decoding — From Probabilities to Text

A trained CTC model outputs a matrix of per-frame character probabilities. The training algorithm (CTC forward + backward) is settled. But at inference time, we need to turn those probabilities into an actual text string. This is the decoding problem.

Greedy Decoding

The simplest approach: at each time frame, pick the most probable character (argmax). Then apply the CTC collapse rule — merge duplicates and remove blanks. Greedy decoding runs in O(T) time and often works surprisingly well. But it's provably suboptimal: it considers only a single path through the probability lattice, ignoring the fact that many different paths can collapse to the same output string.

Beam Search

Beam search maintains the top B partial hypotheses at each time step. At frame t, each hypothesis is extended by every possible character, scored, and the top B survivors are kept. The critical detail is prefix merging: if two different CTC paths collapse to the same text prefix, their probabilities should be added, not kept separate. For example, the paths εcεa and ccεa both collapse to "ca" — beam search merges them into a single hypothesis with combined probability.

In practice, beam search with B=5-10 significantly outperforms greedy decoding. You can also incorporate a language model via shallow fusion: the combined score becomes α · log P_acoustic + β · log P_LM + γ · |Y|, where the word-count bonus γ · |Y| prevents the model from favoring short outputs.
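As a sketch, shallow fusion is just a weighted sum of scores at the hypothesis level. The log-probabilities below are made-up numbers, not outputs of a real acoustic model or LM:

```python
def shallow_fusion_score(acoustic_lp, lm_lp, n_words,
                         alpha=1.0, beta=0.5, gamma=0.8):
    """Combined score: alpha*log P_acoustic + beta*log P_LM + gamma*|Y|."""
    return alpha * acoustic_lp + beta * lm_lp + gamma * n_words

# Two hypotheses with identical acoustic scores: the LM term breaks the tie
h_good = shallow_fusion_score(-12.0, -4.0, 3)   # plausible under the LM
h_bad  = shallow_fusion_score(-12.0, -9.5, 3)   # implausible under the LM
print(h_good > h_bad)  # True
```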

import numpy as np

def greedy_decode(log_probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax at each frame, then collapse."""
    path = np.argmax(log_probs, axis=1)
    # Collapse: merge duplicates, remove blanks
    result = []
    prev = -1
    for idx in path:
        if idx != prev:
            if idx != blank:
                result.append(alphabet[idx])
            prev = idx
    return ''.join(result)

def beam_decode(log_probs, alphabet, blank=0, beam_width=5):
    """Beam search CTC decoding with prefix merging."""
    T, C = log_probs.shape
    # Each beam: (prefix_string, log_prob_blank_end, log_prob_nonblank_end)
    LOG_ZERO = -1e30
    beams = {'': (0.0, LOG_ZERO)}  # start with empty prefix

    for t in range(T):
        new_beams = {}
        for prefix, (pb, pnb) in beams.items():
            p_total = np.logaddexp(pb, pnb)
            for c in range(C):
                lp = log_probs[t, c]
                if c == blank:
                    # Blank extends without changing prefix
                    key = prefix
                    new_pb = p_total + lp
                    if key in new_beams:
                        new_beams[key] = (np.logaddexp(new_beams[key][0], new_pb),
                                          new_beams[key][1])
                    else:
                        new_beams[key] = (new_pb, LOG_ZERO)
                else:
                    ch = alphabet[c]
                    if prefix and ch == prefix[-1]:
                        # Repeating the last char with no blank in between
                        # collapses into it: the prefix itself stays the same
                        stay_pnb = pnb + lp
                        if prefix in new_beams:
                            new_beams[prefix] = (new_beams[prefix][0],
                                                 np.logaddexp(new_beams[prefix][1],
                                                              stay_pnb))
                        else:
                            new_beams[prefix] = (LOG_ZERO, stay_pnb)
                        # Extending the prefix requires coming via a blank
                        new_pnb = pb + lp
                    else:
                        new_pnb = p_total + lp
                    key = prefix + ch
                    if key in new_beams:
                        new_beams[key] = (new_beams[key][0],
                                          np.logaddexp(new_beams[key][1], new_pnb))
                    else:
                        new_beams[key] = (LOG_ZERO, new_pnb)
        # Prune to top beam_width
        scored = {k: np.logaddexp(v[0], v[1]) for k, v in new_beams.items()}
        top_keys = sorted(scored, key=scored.get, reverse=True)[:beam_width]
        beams = {k: new_beams[k] for k in top_keys}

    best = max(beams, key=lambda k: np.logaddexp(beams[k][0], beams[k][1]))
    return best

# Synthetic example where greedy fails but beam search succeeds
# Target: "go". At frames 3-6, blank barely beats 'o' (0.51 vs 0.48).
# Greedy always picks blank -> outputs "g". Beam search finds that
# emitting 'o' at frame 3, 4, 5, OR 6 all collapse to "go", and
# prefix merging sums their probabilities to beat the all-blank path.
T = 10
alphabet = ['_', 'g', 'o']
probs = np.array([
    [0.05, 0.90, 0.05],  # strong g
    [0.60, 0.30, 0.10],  # g fading
    [0.90, 0.03, 0.07],  # blank transition
    [0.51, 0.01, 0.48],  # blank BARELY beats o
    [0.51, 0.01, 0.48],  # blank BARELY beats o
    [0.51, 0.01, 0.48],  # blank BARELY beats o
    [0.51, 0.01, 0.48],  # blank BARELY beats o
    [0.95, 0.01, 0.04],  # strong blank
    [0.97, 0.01, 0.02],  # strong blank
    [0.99, 0.005, 0.005] # strong blank
])
log_probs = np.log(probs)
print(f"Greedy:     '{greedy_decode(log_probs, alphabet)}'")
print(f"Beam (B=5): '{beam_decode(log_probs, alphabet, beam_width=5)}'")
# Greedy:     'g'   (blank always wins -> 'o' never emitted)
# Beam (B=5): 'go'  (prefix merging recovers the correct answer)

Greedy decoding missed the "o" in "go" because blank barely beat it at every frame (0.51 vs 0.48). Taking the argmax at each frame always picks blank, so the "o" is never emitted. Beam search, by maintaining multiple hypotheses, discovers that the paths emitting "o" at any one of four frames all collapse to the same prefix "go" — and their merged probability beats the all-blank path.


4. Attention-Based Encoder-Decoder

CTC has an important limitation: it assumes that outputs at each frame are conditionally independent. The probability of emitting "l" at frame 5 doesn't depend on whether "e" was emitted at frame 3. This means CTC has no built-in language model — it can't learn that "th" is more likely than "tq" in English.

The Listen, Attend and Spell (LAS) model (Chan et al., 2016) takes a fundamentally different approach. It's a full encoder-decoder architecture with three components:

The Encoder ("Listen") processes mel spectrogram frames into a sequence of hidden states. Typically a stack of bidirectional LSTMs or Conformer blocks, often with subsampling layers (stride-2 convolutions) that reduce the sequence length by 4x before the attention bottleneck. If you started with 300 frames, the encoder might output 75 hidden states.
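The 4x length reduction can be sketched with two stride-2 stages; strided averaging stands in here for the learned convolutions of a real encoder:

```python
import numpy as np

def subsample_2x(x):
    """Halve the time axis by averaging adjacent frame pairs (toy stride-2)."""
    T = (x.shape[0] // 2) * 2               # drop an odd trailing frame
    return x[:T].reshape(-1, 2, x.shape[1]).mean(axis=1)

frames = np.random.randn(300, 80)           # 300 mel frames, 80 bins
hidden = subsample_2x(subsample_2x(frames)) # two stages -> 4x shorter
print(frames.shape, '->', hidden.shape)     # (300, 80) -> (75, 80)
```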

The Attention ("Attend") mechanism is computed at each decoder step. Given the decoder's current hidden state as a query and all encoder hidden states as keys/values, it computes a soft alignment — attention weights that indicate which acoustic frames are relevant for predicting the current character. The weighted sum of encoder states produces a context vector that summarizes the relevant audio.

The Decoder ("Spell") is an autoregressive model (LSTM or Transformer) that generates one character or subword at a time, conditioned on the attention context and all previously generated tokens. During training, it uses teacher forcing (feeding the ground-truth previous token). At inference, it feeds its own predictions back in.

The elegant part: the alignment emerges automatically. If you visualize the attention weights as a matrix (decoder steps vs. encoder frames), you see a roughly diagonal pattern — the model learns temporal correspondence without any alignment supervision. CTC vs. attention represents a fundamental trade-off: CTC is frame-independent (fast, parallelizable, monotonic), while attention is autoregressive (slower, but models inter-character dependencies and can theoretically handle reordering). Modern systems like ESPnet use hybrid CTC/Attention (Watanabe et al., 2017), combining both losses to get the best of both worlds.
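The hybrid objective itself is just an interpolation of the two losses. λ = 0.3 below is a common choice in the ESPnet recipes; the scalar loss values are placeholders, not outputs of a real model:

```python
def hybrid_ctc_attention_loss(ctc_loss, att_loss, lam=0.3):
    """Multitask loss: L = lam * L_CTC + (1 - lam) * L_attention."""
    return lam * ctc_loss + (1.0 - lam) * att_loss

loss = hybrid_ctc_attention_loss(ctc_loss=3.9, att_loss=2.1)
print(f"{loss:.2f}")  # 2.64
```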

import numpy as np

def attention_mechanism(encoder_states, decoder_query):
    """Compute attention weights and context vector.
    encoder_states: (T, D) hidden states from the encoder
    decoder_query:  (D,) current decoder hidden state
    Returns: attention weights (T,), context vector (D,)
    """
    # Dot-product attention scores
    scores = encoder_states @ decoder_query  # (T,)
    # Softmax to get attention weights
    scores -= np.max(scores)  # numerical stability
    weights = np.exp(scores) / np.sum(np.exp(scores))
    # Weighted sum of encoder states
    context = weights @ encoder_states  # (D,)
    return weights, context

# Simulate encoder output for "set a timer" (T=75 after subsampling, D=256)
np.random.seed(42)
T_enc, D = 75, 256
encoder_out = np.random.randn(T_enc, D) * 0.1

# Make encoder states cluster by rough word position
# "set" ~frames 0-15, "a" ~frames 20-25, "timer" ~frames 30-65
for t in range(0, 15):  encoder_out[t, :10] += 2.0
for t in range(20, 25): encoder_out[t, 10:20] += 2.0
for t in range(30, 65): encoder_out[t, 20:30] += 2.0

# Decoder generates one char at a time; simulate query for 't' in "timer"
# Query should attend to frames 30-65
decoder_query = np.zeros(D)
decoder_query[20:30] = 2.0  # aligns with "timer" frames

weights, context = attention_mechanism(encoder_out, decoder_query)

peak = np.argmax(weights)
print(f"Encoder states: {encoder_out.shape}")
print(f"Attention weights sum: {weights.sum():.4f}")
print(f"Peak attention at frame {peak} (expected: 30-65 range)")
print(f"Context vector norm: {np.linalg.norm(context):.4f}")

# Show alignment: which frames does decoder attend to?
top5 = np.argsort(weights)[-5:][::-1]
print(f"Top-5 attended frames: {top5}")
# Encoder states: (75, 256)
# Attention weights sum: 1.0000
# Peak attention at frame 57 (expected: 30-65 range)
# Context vector norm: 6.4046
# Top-5 attended frames: [57 48 58 54 35]

The decoder query "pulled" attention toward frames 30-65, exactly where the encoder had representations for "timer." In a real model, this mechanism runs at every decoding step, producing a diagonal-ish attention pattern that aligns audio frames with output characters.
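To show how this single attention step fits into the full "Spell" loop, here is a toy autoregressive decoder. Every weight is a random stand-in for trained parameters, so the emitted tokens are meaningless — but the control flow (feed the previous prediction back in, attend, predict) is the real inference procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
T_enc, D, V = 20, 16, 6                  # encoder frames, hidden dim, toy vocab
enc = rng.normal(size=(T_enc, D))        # stand-in for encoder hidden states
emb = rng.normal(size=(V, D))            # token embedding table (toy)
W_out = rng.normal(size=(D, V))          # output projection (toy)

def attend(states, query):
    """Dot-product attention: softmax over scores, weighted sum of states."""
    s = states @ query
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ states

tokens = [0]                             # start-of-sequence token
h = np.zeros(D)                          # decoder hidden state
for _ in range(5):
    h = np.tanh(h + emb[tokens[-1]])     # stand-in for an LSTM state update
    context = attend(enc, h)             # soft alignment over encoder frames
    logits = (h + context) @ W_out
    tokens.append(int(np.argmax(logits)))  # feed own prediction back in
print(tokens)                            # arbitrary token ids here
```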

5. Evaluation — Word Error Rate

How do you measure whether a speech recognition system is any good? The standard metric is Word Error Rate (WER), defined as:

WER = (S + D + I) / N

where S is the number of substitutions (wrong word), D is deletions (missing word), I is insertions (extra word), and N is the total number of words in the reference transcript. WER is computed using the Levenshtein edit distance algorithm at the word level.

Worked example: reference = "set a timer for five minutes", hypothesis = "set the timer for five minute". The edits are: "a" → "the" (substitution) and "minutes" → "minute" (substitution). That's S=2, D=0, I=0 out of N=6 reference words, giving WER = 2/6 = 33.3%.

Character Error Rate (CER) uses the same formula but at the character level. It's particularly useful for languages without clear word boundaries like Chinese or Japanese.
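CER is the same edit-distance computation run over characters; a compact rolling-array version:

```python
def char_error_rate(reference, hypothesis):
    """CER: character-level Levenshtein distance divided by reference length."""
    r, h = reference, hypothesis
    dp = list(range(len(h) + 1))          # distances vs. the empty prefix of r
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            if r[i-1] == h[j-1]:
                dp[j] = prev
            else:
                dp[j] = 1 + min(prev, dp[j], dp[j-1])  # sub, del, ins
            prev = cur
    return dp[-1] / len(r) if r else 0.0

print(f"{char_error_rate('set a timer', 'set a timex'):.3f}")  # 0.091
```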

State-of-the-art benchmarks on LibriSpeech: the best models achieve ~1.4% WER on test-clean (read audiobooks, quiet conditions) and ~2.6% on test-other (noisier, more diverse speakers). For comparison, human WER on conversational speech (Switchboard corpus) is estimated at ~5.5%. On clean read speech, machines have arguably surpassed human accuracy.

import numpy as np

def word_error_rate(reference, hypothesis):
    """Compute WER using dynamic programming (Levenshtein distance at word level).
    reference: string (ground truth transcript)
    hypothesis: string (model output)
    Returns: WER as float, and (substitutions, deletions, insertions)
    """
    ref = reference.split()
    hyp = hypothesis.split()
    N, M = len(ref), len(hyp)

    # DP table: dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = np.zeros((N + 1, M + 1), dtype=int)
    for i in range(N + 1): dp[i, 0] = i  # deletions
    for j in range(M + 1): dp[0, j] = j  # insertions

    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if ref[i-1] == hyp[j-1]:
                dp[i, j] = dp[i-1, j-1]
            else:
                dp[i, j] = 1 + min(dp[i-1, j-1],  # substitution
                                    dp[i-1, j],     # deletion
                                    dp[i, j-1])     # insertion

    # Backtrace to count S, D, I
    i, j, S, D, I = N, M, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i-1] == hyp[j-1]:
            i -= 1; j -= 1
        elif i > 0 and j > 0 and dp[i,j] == dp[i-1,j-1] + 1:
            S += 1; i -= 1; j -= 1
        elif i > 0 and dp[i,j] == dp[i-1,j] + 1:
            D += 1; i -= 1
        else:
            I += 1; j -= 1

    wer = (S + D + I) / N if N > 0 else 0.0
    return wer, (S, D, I)

ref = "set a timer for five minutes"
hyp = "set the timer for five minute"
wer, (s, d, i) = word_error_rate(ref, hyp)
print(f"Reference:  '{ref}'")
print(f"Hypothesis: '{hyp}'")
print(f"S={s}, D={d}, I={i}, N={len(ref.split())}")
print(f"WER = ({s}+{d}+{i})/{len(ref.split())} = {wer:.1%}")
# Reference:  'set a timer for five minutes'
# Hypothesis: 'set the timer for five minute'
# S=2, D=0, I=0, N=6
# WER = (2+0+0)/6 = 33.3%

6. Modern ASR — Whisper and Beyond

The past five years have seen an explosion in ASR capability, driven by three parallel trends: massive data, self-supervised pretraining, and architectural innovations.

Whisper

OpenAI Whisper (Radford et al., 2022) trained an encoder-decoder Transformer on 680,000 hours of multilingual web audio — orders of magnitude more data than previous systems. The result is a model that's remarkably robust to noise, accents, and domain shift. Whisper is also multitask: the same model handles transcription, translation, language identification, and timestamp prediction, all controlled by special text tokens in the decoder. It demonstrated that with enough diverse data, a simple architecture can match or beat specialist models.

Conformer

The Conformer (Gulati et al., 2020) combines convolution and self-attention in a single block. Convolution captures local acoustic patterns — formant transitions, fricative bursts, stop consonants — while attention captures global dependencies like coarticulation across syllables and prosodic contours. The Conformer block interleaves feed-forward, self-attention, convolution, and another feed-forward layer. It's the state-of-the-art encoder architecture for both streaming and non-streaming ASR.

Self-Supervised Pretraining

wav2vec 2.0 (Baevski et al., 2020) learns speech representations from 60,000 hours of unlabeled audio using contrastive learning on masked latent speech units. The approach is analogous to BERT for text: mask some input frames, learn to predict them from context. After pretraining, fine-tuning with just 10 minutes of labeled data produces usable ASR. This represents a massive reduction in labeling cost — especially important for low-resource languages where transcribed audio is scarce.
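The masking scheme itself is simple to sketch. The defaults below mirror the paper's (start probability p = 0.065, span length 10), applied here to dummy frame indices rather than real latent features:

```python
import numpy as np

def span_mask(n_frames, mask_prob=0.065, span=10, seed=0):
    """Mask contiguous spans of frames, wav2vec 2.0-style (toy version)."""
    rng = np.random.default_rng(seed)
    starts = np.nonzero(rng.random(n_frames) < mask_prob)[0]
    mask = np.zeros(n_frames, dtype=bool)
    for s in starts:
        mask[s:s + span] = True           # spans may overlap, as in the paper
    return mask

mask = span_mask(300)
print(f"masked {mask.sum()} of 300 frames ({mask.mean():.0%})")
```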

Streaming ASR

Production voice assistants can't wait for the user to finish speaking. Streaming ASR must transcribe in real time, which rules out bidirectional encoders (no future context). Solutions include chunked attention (process fixed-size audio chunks), causal convolutions, and the RNN-Transducer (RNN-T), which extends CTC with a prediction network that models output dependencies — giving it the benefits of an autoregressive decoder while maintaining online decoding capability.
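One of these ideas, chunked attention, reduces to a block-causal mask over the self-attention scores; a sketch:

```python
import numpy as np

def chunked_attention_mask(n_frames, chunk_size):
    """Block-causal mask: frame t attends to its own chunk and earlier chunks."""
    chunk_id = np.arange(n_frames) // chunk_size
    return chunk_id[None, :] <= chunk_id[:, None]   # (T, T) boolean

mask = chunked_attention_mask(8, chunk_size=4)
print(mask.astype(int))
# Rows 0-3 (first chunk) see only columns 0-3; rows 4-7 see everything.
```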

import numpy as np

def simulate_asr_pipeline(utterance, n_frames=15):
    """End-to-end ASR simulation: spectrogram -> CTC logits -> decode -> WER."""
    np.random.seed(42)
    chars = list("_abcdefghijklmnopqrstuvwxyz ")  # blank=0, space=27
    char2idx = {c: i for i, c in enumerate(chars)}
    C = len(chars)

    # Step 1: Simulate model output (frame-level logits)
    logits = np.random.randn(n_frames, C) * 0.3
    # Bias toward correct characters at appropriate frames
    target_chars = list(utterance)
    frames_per_char = n_frames / len(target_chars)
    for ci, ch in enumerate(target_chars):
        t_start = int(ci * frames_per_char)
        t_end = int((ci + 1) * frames_per_char)
        idx = char2idx.get(ch, 0)
        for t in range(t_start, min(t_end, n_frames)):
            logits[t, idx] += 2.5

    # Step 2: CTC greedy decode
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    path = np.argmax(log_probs, axis=1)
    result = []
    prev = -1
    for idx in path:
        if idx != prev:
            if idx != 0:  # skip blank
                result.append(chars[idx])
            prev = idx
    hypothesis = ''.join(result)

    # Step 3: Compute WER
    ref_words = utterance.split()
    hyp_words = hypothesis.split()
    N, M = len(ref_words), len(hyp_words)
    dp = np.zeros((N+1, M+1), dtype=int)
    for i in range(N+1): dp[i, 0] = i
    for j in range(M+1): dp[0, j] = j
    for i in range(1, N+1):
        for j in range(1, M+1):
            cost = 0 if ref_words[i-1] == hyp_words[j-1] else 1
            dp[i,j] = min(dp[i-1,j-1]+cost, dp[i-1,j]+1, dp[i,j-1]+1)
    wer = dp[N, M] / N if N > 0 else 0.0

    print(f"Utterance:  '{utterance}'")
    print(f"Frames:      {n_frames} (logits shape: {logits.shape})")
    print(f"Decoded:    '{hypothesis}'")
    print(f"WER:         {wer:.1%}")
    return hypothesis, wer

simulate_asr_pipeline("hi dad", n_frames=15)
# Utterance:  'hi dad'
# Frames:      15 (logits shape: (15, 28))
# Decoded:    'hi dad'
# WER:         0.0%

This short simulation ties together everything from this post: spectrogram features feed into a model that produces frame-level logits, CTC greedy decoding collapses them into text, and WER scores the result. In a real system, the "model" is a Conformer or Whisper encoder processing actual mel spectrograms — but the pipeline structure is identical.

Conclusion

The speech recognition pipeline has a clean conceptual arc: waveform → spectrogram → encoder → CTC or attention → decoded text → WER evaluation. At every stage, the central challenge is the same: alignment. How do you map a long, continuous acoustic signal to a short, discrete sequence of characters?

CTC solved this with a blank token and dynamic programming, elegantly marginalizing over all possible alignments. Attention-based models learned the alignment implicitly through soft attention weights. Modern hybrid systems combine both. And self-supervised pretraining (wav2vec 2.0) made it possible to learn powerful speech representations from raw audio alone, dramatically reducing the need for transcribed data.

From Graves's 2006 CTC paper to today's Whisper handling 100+ languages, the trajectory has been remarkable. The alignment problem that once required rooms full of linguists annotating phoneme boundaries is now solved by a 35-line forward algorithm and a few hundred thousand hours of audio. The machines are listening — and understanding — better than ever.

References & Further Reading