
Audio Features from Scratch

1. Sound as a Signal

When you speak the word "hello," your vocal cords vibrate between 85–255 Hz, creating pressure waves that travel through air at 343 m/s. A microphone converts these pressure changes into voltage, which a sound card samples 44,100 times per second into integers between −32,768 and 32,767. That's it — every song you've ever loved, every word you've ever heard on a phone, is just a long list of integers. The challenge of audio ML is turning that list into something meaningful.

This process of turning continuous sound into discrete numbers is called Pulse-Code Modulation (PCM). Two parameters control the quality: the sampling rate (how many times per second we measure the wave) and the bit depth (how precisely we record each measurement). At 16-bit depth, each sample can take one of 65,536 discrete values, giving about 96 dB of dynamic range — more than enough to capture a whisper and a shout in the same recording.
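Those numbers are easy to verify. Here's a minimal NumPy sketch (the 440 Hz tone and 16 kHz rate are just illustrative choices) that quantizes a waveform to 16-bit PCM and checks the ~96 dB figure:

```python
import numpy as np

# One second of a 440 Hz sine at 16 kHz, in floating point [-1, 1]
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

# Quantize to 16-bit PCM: 65,536 levels from -32,768 to 32,767
pcm = np.round(x * 32767).astype(np.int16)

# Dynamic range of 16-bit audio: 20 * log10(2^16) ~= 96.3 dB
dynamic_range_db = 20 * np.log10(2 ** 16)

# Round-to-nearest keeps the error within half a quantization step
error = np.max(np.abs(pcm / 32767 - x))
```

The quantization error is bounded by half a step (about 1.5e-5 here), which is inaudible for most purposes — the 96 dB figure is precisely this ratio between full scale and one step.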

Common sampling rates tell you what each format was designed for: 8 kHz for telephone (intelligible speech up to 4 kHz), 16 kHz for modern speech recognition, 44.1 kHz for CDs (capturing the full 20 Hz–20 kHz range of human hearing), and 48 kHz for video production. The Nyquist-Shannon theorem explains these choices: to faithfully capture a frequency, you must sample at least twice as fast. Sample below that threshold and higher frequencies "fold back" into lower ones — a distortion called aliasing that creates phantom tones that weren't in the original signal.

Here's the problem: a 3-second clip at 16 kHz is 48,000 numbers. Adjacent samples are highly correlated (the wave changes smoothly), and the raw waveform doesn't directly encode what we hear — pitch, timbre, loudness. From an information-theoretic perspective, raw audio is massively redundant. We need representations that compress this redundancy and align with human auditory perception. That journey starts with the Fourier transform.

import numpy as np

def generate_waveforms(sr=16000, duration=0.05):
    """Generate three synthetic audio signals."""
    t = np.arange(int(sr * duration)) / sr

    # Pure 440 Hz sine wave (concert A)
    sine = np.sin(2 * np.pi * 440 * t)

    # Three-tone chord: 330 Hz (E4) + 440 Hz (A4) + 554 Hz (C#5)
    chord = (np.sin(2 * np.pi * 330 * t) +
             np.sin(2 * np.pi * 440 * t) +
             np.sin(2 * np.pi * 554 * t)) / 3

    # Linear chirp: sweeps 200 Hz to 2000 Hz
    freq = 200 + (2000 - 200) * t / t[-1]
    chirp = np.sin(2 * np.pi * np.cumsum(freq) / sr)

    return t, sine, chord, chirp

# Aliasing demonstration: sample a 7 kHz tone at 10 kHz
# Nyquist frequency = 5 kHz, so 7 kHz aliases to 10 - 7 = 3 kHz
sr_low = 10000
t_low = np.arange(500) / sr_low
original_7k = np.sin(2 * np.pi * 7000 * t_low)
# This array looks identical to a 3 kHz sine wave!

2. The Fourier Transform — From Time to Frequency

Here's the fundamental insight that makes audio processing possible: any signal, no matter how complex, is just a sum of sine waves at different frequencies. A piano chord, a spoken sentence, a dog bark — they're all combinations of pure tones. The Discrete Fourier Transform (DFT) decomposes a signal into these frequency components, telling you exactly how much of each frequency is present.

The DFT takes N time-domain samples and produces N frequency-domain coefficients:

X[k] = Σ_{n=0}^{N−1} x[n] · e^(−j2πkn/N)    for k = 0, 1, ..., N−1

Each X[k] is a complex number. Its magnitude |X[k]| tells you how much of frequency k is present, and its phase tells you the timing offset of that frequency component. For ML, we almost always discard the phase and keep only the magnitude (or power) spectrum. Frequency bin k corresponds to a physical frequency of f_k = k · f_s / N. The bin X[0] is the DC component (the signal's mean value), and for real-valued input, bins beyond N/2 are redundant mirror images.
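These properties are quick to check numerically. A small sketch (the 1 kHz tone and DC offset are chosen so both effects are visible) using NumPy's FFT and the bin-to-frequency mapping above:

```python
import numpy as np

sr, N = 16000, 512
t = np.arange(N) / sr
x = np.sin(2 * np.pi * 1000 * t) + 0.5   # 1 kHz tone plus a DC offset

X = np.fft.fft(x)
freqs = np.arange(N) * sr / N            # f_k = k * f_s / N

# X[0] is N times the mean of the signal (the DC component)
dc = X[0].real / N                       # recovers the 0.5 offset

# The peak magnitude lands at the bin closest to 1000 Hz
k_peak = np.argmax(np.abs(X[1:N // 2])) + 1
peak_freq = freqs[k_peak]                # bin 32 = exactly 1000 Hz here

# Conjugate symmetry for real input: X[N-k] = conj(X[k])
symmetric = np.allclose(X[N - k_peak], np.conj(X[k_peak]))
```

Because 1000 Hz sits exactly on bin 32 (1000 · 512 / 16000 = 32), there is no leakage in this example — the next section's windowing discussion covers what happens when a frequency falls between bins.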

The direct computation requires O(N²) operations — each of N output bins sums over N input samples. The Fast Fourier Transform (Cooley & Tukey, 1965) exploits a divide-and-conquer trick: split the N-point DFT into two N/2-point DFTs over even and odd samples, then combine them with "butterfly" operations. This reduces complexity to O(N log N), making real-time audio analysis practical.
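The divide-and-conquer idea fits in a few lines. This is a minimal radix-2 sketch — it assumes the input length is a power of two, and it's meant to show the structure, not to compete with optimized FFT libraries:

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT. Assumes len(x) is a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    # Split into even- and odd-indexed halves, transform each recursively
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    # Butterfly: combine the two half-size DFTs with twiddle factors
    twiddles = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddles * odd,
                           even - twiddles * odd])

x = np.random.randn(256)
X = fft_recursive(x)   # agrees with np.fft.fft(x)
```

Each level of recursion does O(N) butterfly work, and there are log2(N) levels — hence O(N log N) overall.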

But there's a catch. If you apply the DFT to a windowed chunk of audio, the chunk's edges create artificial discontinuities that smear energy across all frequencies — an artifact called spectral leakage. The fix is to multiply each frame by a window function that tapers smoothly to zero at the edges. The Hann window (w[n] = 0.5 − 0.5·cos(2πn/(N−1))) is the most popular general-purpose choice. The Hamming window (w[n] = 0.54 − 0.46·cos(2πn/(N−1))) offers better suppression of the nearest sidelobe and is the traditional default for speech MFCC computation.
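A small experiment makes the leakage visible. The setup here is an assumption for illustration: a 431 Hz tone, chosen so it does not land on an FFT bin, analyzed with and without a Hann window, comparing the strongest spurious component outside the main lobe:

```python
import numpy as np

N, sr = 512, 16000
t = np.arange(N) / sr
x = np.sin(2 * np.pi * 431 * t)   # 431 Hz falls between FFT bins

def strongest_sidelobe_db(frame):
    """Strongest component outside the main lobe, in dB below the peak."""
    mag = np.abs(np.fft.rfft(frame))
    k = np.argmax(mag)
    sidelobes = mag.copy()
    sidelobes[max(0, k - 3):k + 4] = 0   # exclude the main lobe (+/- 3 bins)
    return 20 * np.log10(sidelobes.max() / mag[k])

rect_db = strongest_sidelobe_db(x)   # rectangular window (no taper)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
hann_db = strongest_sidelobe_db(x * hann)
# hann_db is substantially lower: the taper suppresses the leakage
```

With the rectangular window the leakage sits only a couple of tens of dB below the peak; the Hann taper pushes it much further down, at the cost of a slightly wider main lobe.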

import numpy as np

def dft_from_scratch(x):
    """Compute the Discrete Fourier Transform (direct O(N^2) method)."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N).reshape(-1, 1)
    # The DFT matrix: each entry is e^(-j * 2pi * k * n / N)
    W = np.exp(-2j * np.pi * k * n / N)
    return W @ x

def hann_window(N):
    """Hann window: tapers to zero at both ends."""
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))

# Generate a chord: 330 Hz + 440 Hz + 554 Hz
sr = 16000
t = np.arange(512) / sr
signal = (np.sin(2 * np.pi * 330 * t) +
          np.sin(2 * np.pi * 440 * t) +
          np.sin(2 * np.pi * 554 * t))

# Apply window, then DFT
windowed = signal * hann_window(512)
X = dft_from_scratch(windowed)

# Magnitude spectrum (only first N/2+1 bins are unique)
magnitudes = np.abs(X[:257])
freqs = np.arange(257) * sr / 512
# Peaks at bins near 330, 440, and 554 Hz

3. The Spectrogram — Frequency Over Time

A single FFT gives you the average frequency content of an entire signal. But audio is inherently time-varying: speech has different phonemes, music has different notes, and environmental sounds evolve constantly. We need to see how frequency content changes over time.

The Short-Time Fourier Transform (STFT) solves this by sliding a window across the signal and computing the FFT on each overlapping segment. The result is a 2D matrix — the spectrogram — where the x-axis is time (one column per window position), the y-axis is frequency, and the color/brightness represents magnitude or power.

Three parameters control the STFT:

  1. Window length — how much audio each FFT sees (25 ms = 400 samples at 16 kHz is typical for speech)
  2. Hop length — how far the window slides between frames (10 ms = 160 samples gives 100 frames per second)
  3. FFT size — the number of points in each transform; frames are zero-padded up to it (512 here, giving 257 frequency bins)

These parameters create a fundamental tradeoff rooted in a signal-processing analog of the Heisenberg uncertainty principle: you cannot have arbitrarily fine resolution in both time and frequency simultaneously. A longer window captures more cycles of each frequency, giving sharper frequency resolution — but each frame spans more time, blurring rapid temporal changes. A shorter window localizes events precisely in time but smears the frequency content. For speech, 25 ms is a sweet spot: long enough to resolve the harmonics of human voice (fundamental frequency 85–255 Hz), short enough to track the rapid transitions between phonemes.
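The tradeoff is easy to see numerically. A sketch under assumed conditions — two tones only 30 Hz apart (440 and 470 Hz), analyzed with a long and a short Hann window:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 470 * t)  # 30 Hz apart

def two_strongest_peaks(x, win_len, sr):
    """Return the two strongest spectral peaks and their magnitude ratio."""
    frame = x[:win_len] * np.hanning(win_len)
    mag = np.abs(np.fft.rfft(frame))
    k1 = np.argmax(mag)
    masked = mag.copy()
    masked[max(0, k1 - 4):k1 + 5] = 0     # suppress the first peak's lobe
    k2 = np.argmax(masked)
    return sorted([k1 * sr / win_len, k2 * sr / win_len]), mag[k2] / mag[k1]

# Long window (512 ms): bin spacing ~2 Hz, both tones clearly resolved
freqs_long, ratio_long = two_strongest_peaks(x, 8192, sr)

# Short window (32 ms): bin spacing ~31 Hz, the tones merge into one peak
freqs_short, ratio_short = two_strongest_peaks(x, 512, sr)
```

With the long window the two strongest peaks land near 440 and 470 Hz with comparable magnitudes; with the short window the tones fall into adjacent bins, merge into a single lobe, and the "second peak" is just leakage far below the first.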

In practice, we almost always take the logarithm of the power spectrogram. Human loudness perception follows a roughly logarithmic scale (that's why we use decibels), so log-compression makes the visual representation match what we actually hear.

import numpy as np

def stft(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Short-Time Fourier Transform from scratch."""
    win_len = int(sr * win_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)  # 160 samples at 16 kHz

    # Hann window
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(win_len) / (win_len - 1))

    # Number of frames
    n_frames = 1 + (len(signal) - win_len) // hop_len
    spectrogram = np.zeros((n_fft // 2 + 1, n_frames))

    for i in range(n_frames):
        start = i * hop_len
        frame = signal[start:start + win_len] * window
        # Zero-pad to n_fft and compute FFT
        padded = np.zeros(n_fft)
        padded[:win_len] = frame
        spectrum = np.fft.rfft(padded)
        spectrogram[:, i] = np.abs(spectrum) ** 2  # Power spectrum

    return spectrogram

# Compute log-power spectrogram
signal = np.random.randn(16000)  # 1 second of noise
S = stft(signal)
log_S = 10 * np.log10(S + 1e-10)  # Log-power in dB
# Shape: (257 frequency bins, 98 time frames)

Try It: Waveform to Spectrogram

Select a signal type and adjust the window size to see the time-frequency resolution tradeoff. Longer windows give sharper frequency peaks but blur time; shorter windows localize time events but smear frequencies.


4. The Mel Scale and Filterbanks

The spectrogram treats all frequencies equally — the bins from 0–100 Hz get the same resolution as 7,900–8,000 Hz. But human hearing doesn't work that way. The difference between 100 Hz and 200 Hz sounds like a huge leap (an entire octave), while 7,900 Hz to 8,000 Hz is barely noticeable. Our ears perceive pitch on a logarithmic scale: equal ratios of frequency sound like equal intervals.

The mel scale (named after "melody") formalizes this perceptual mapping:

mel(f) = 2595 · log10(1 + f/700)

This function is approximately linear below 1,000 Hz and logarithmic above it, matching the frequency resolution of the human cochlea. To exploit this, we build a mel filterbank: a set of overlapping triangular bandpass filters whose center frequencies are equally spaced on the mel scale (but not on the linear frequency scale).

The construction algorithm is elegant:

  1. Choose frequency bounds (e.g., 0 Hz to 8,000 Hz)
  2. Convert the bounds to mel: mel_low and mel_high
  3. Create M+2 equally spaced points on the mel axis (M filters + 2 boundary points)
  4. Convert back to Hz, then map to the nearest FFT bin indices
  5. For each filter m, build a triangle that rises from bin[m] to bin[m+1] and falls to bin[m+2]

The result is beautiful and biologically inspired: the low-frequency filters are narrow (high resolution where our hearing is most sensitive — we can distinguish a 100 Hz tone from a 110 Hz tone), while the high-frequency filters are wide (coarse resolution where our hearing is less discriminating). The overlapping triangles ensure no frequency energy falls through the cracks.

To compute a mel spectrogram, multiply each frame's power spectrum by the filterbank matrix, then take the logarithm. This compresses the 257 FFT bins down to typically 40 mel bands (classic speech recognition), 80 (OpenAI's Whisper), or 128 (music/audio classification). The CNN-based models that dominate modern audio classification treat these mel spectrograms as single-channel images.

import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000, f_low=0, f_high=None):
    """Build a mel-spaced triangular filterbank matrix."""
    f_high = f_high or sr / 2
    n_bins = n_fft // 2 + 1  # 257 for n_fft=512

    # M+2 equally spaced points on mel scale
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bin_indices = np.round(hz_points * n_fft / sr).astype(int)

    filters = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = bin_indices[m], bin_indices[m + 1], bin_indices[m + 2]
        # Rising slope: left to center
        for k in range(left, center):
            filters[m, k] = (k - left) / (center - left)
        # Falling slope: center to right
        for k in range(center, right):
            filters[m, k] = (right - k) / (right - center)

    return filters  # Shape: (n_mels, n_bins)

# Apply to a power spectrum
fb = mel_filterbank(n_mels=40)
power_spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
mel_energies = fb @ power_spectrum    # 40 mel-band energies
log_mel = np.log(mel_energies + 1e-10)  # Log compression

5. MFCCs — Mel-Frequency Cepstral Coefficients

For three decades (1980–2012), MFCCs were the undisputed king of audio features. The idea comes from a clever observation: the log mel-filterbank energies are correlated because the triangular filters overlap. If we apply the Discrete Cosine Transform (DCT) — a real-valued cousin of the DFT — to these log energies, it decorrelates them (much like PCA) and compacts the information into the first few coefficients. We can then discard the rest, achieving natural dimensionality reduction.

The name "cepstrum" (an anagram of "spectrum") reveals the deeper structure: the cepstrum is the inverse Fourier transform of the log spectrum. This operation separates the slowly-varying spectral envelope (which encodes the shape of the vocal tract — what phoneme is being spoken) from the rapidly-varying harmonic structure (which encodes pitch — how high or low the voice is). For speech recognition, we care about what sound, not what pitch, so MFCCs give us exactly the right information.
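Here is a sketch of that separation on a synthetic "voiced" signal. The setup is an assumption for illustration: 15 harmonics of a 200 Hz fundamental stand in for a voiced speech frame, and the pitch search is restricted to a plausible 150–300 Hz range, as practical pitch trackers do:

```python
import numpy as np

sr, f0, N = 16000, 200, 1024
t = np.arange(N) / sr
# Harmonic-rich tone: 15 harmonics of 200 Hz with 1/h amplitude decay,
# a crude stand-in for a voiced speech frame
x = sum(np.sin(2 * np.pi * f0 * h * t) / h for h in range(1, 16))

# Real cepstrum: inverse FFT of the log magnitude spectrum
frame = x * np.hanning(N)
log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-6)
cepstrum = np.fft.irfft(log_spectrum)

# Low quefrencies hold the spectral envelope; the harmonic structure
# shows up as a peak at quefrency sr / f0 = 80 samples (5 ms).
q_lo, q_hi = sr // 300, sr // 150     # search the 150-300 Hz pitch range
q_peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
estimated_pitch = sr / q_peak         # close to 200 Hz
```

The first few cepstral coefficients (low quefrency) describe the envelope — exactly the part MFCCs keep — while the pitch peak lives at a much higher quefrency, which the 13-coefficient truncation throws away.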

Each coefficient captures a different level of spectral detail:

  1. Coefficient 0 — the overall log energy of the frame (often replaced by a separate energy feature)
  2. Low-order coefficients (1–12) — the broad spectral envelope, which encodes vocal-tract shape
  3. High-order coefficients — fine spectral ripple, dominated by pitch harmonics, and usually discarded

The standard recipe extracts 13 coefficients per frame. To capture dynamics (how the spectrum evolves over time — critical for distinguishing consonants from vowels), we compute delta features (first derivative over time) and delta-delta features (second derivative). This gives a 39-dimensional feature vector per frame: 13 static + 13 delta + 13 delta-delta.
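The delta computation is a short regression over neighboring frames. A sketch using the standard formula with a window of N = 2 frames on each side (edge frames are padded by repetition, one common convention):

```python
import numpy as np

def delta(features, N=2):
    """First-order delta features via the standard regression formula."""
    n_frames = features.shape[1]
    denom = 2 * sum(n ** 2 for n in range(1, N + 1))   # = 10 for N = 2
    # Pad by repeating edge frames so every frame has N neighbors
    padded = np.pad(features, ((0, 0), (N, N)), mode='edge')
    deltas = np.zeros_like(features, dtype=float)
    for t in range(n_frames):
        for n in range(1, N + 1):
            deltas[:, t] += n * (padded[:, t + N + n] - padded[:, t + N - n])
    return deltas / denom

mfccs = np.random.randn(13, 100)           # 13 static coefficients per frame
d1 = delta(mfccs)                          # deltas
d2 = delta(d1)                             # delta-deltas
features_39 = np.vstack([mfccs, d1, d2])   # shape (39, 100)
```

Applying `delta` twice gives the delta-deltas, and stacking all three blocks yields the classic 39-dimensional vector per frame.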

One more practical detail: the pre-emphasis filter y[n] = x[n] − 0.97·x[n−1] is applied before everything else. Speech has a natural spectral tilt (the glottal source produces more energy at low frequencies), so this first-order high-pass filter flattens the spectrum, giving the FFT a more balanced signal to work with. Some modern deep learning pipelines skip this step, since the network can learn to compensate on its own. The complete MFCC pipeline is: pre-emphasis → frame → window → FFT → power spectrum → mel filterbank → log → DCT.

import numpy as np

def preemphasis(signal, coeff=0.97):
    """High-pass filter to flatten speech spectrum."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def dct_matrix(n_mfcc, n_mels):
    """Type-II DCT matrix for cepstral transform."""
    n = np.arange(n_mels)
    k = np.arange(n_mfcc).reshape(-1, 1)
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))

def extract_mfccs(signal, sr=16000, n_mfcc=13, n_mels=40):
    """Full MFCC pipeline from scratch."""
    # Step 1: Pre-emphasis
    emphasized = preemphasis(signal)

    # Step 2: Frame and window (25 ms window, 10 ms hop)
    win_len, hop_len, n_fft = 400, 160, 512
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(win_len) / (win_len - 1))
    n_frames = 1 + (len(emphasized) - win_len) // hop_len

    mfccs = np.zeros((n_mfcc, n_frames))
    fb = mel_filterbank(n_mels, n_fft, sr)  # From Section 4
    dct = dct_matrix(n_mfcc, n_mels)

    for i in range(n_frames):
        frame = emphasized[i * hop_len : i * hop_len + win_len] * window
        # Step 3: FFT and power spectrum
        padded = np.zeros(n_fft)
        padded[:win_len] = frame
        power = np.abs(np.fft.rfft(padded)) ** 2
        # Step 4: Mel filterbank, log, DCT
        mel_energy = fb @ power
        log_mel = np.log(mel_energy + 1e-10)
        mfccs[:, i] = dct @ log_mel

    return mfccs  # Shape: (13, n_frames)

# Extract MFCCs from 1 second of synthetic speech
signal = np.random.randn(16000)
mfccs = extract_mfccs(signal)
print(f"MFCC shape: {mfccs.shape}")  # (13, 98)

Try It: Mel Filterbank & MFCC Explorer

Explore the full MFCC extraction pipeline step by step. Each panel shows one stage of processing. Adjust the number of mel filters and MFCC coefficients to see how they compress the raw frequency information.


6. Beyond MFCCs — Spectral Features and Augmentation

Modern deep learning has largely replaced MFCCs with log-mel spectrograms as the input of choice. Why? The DCT step in MFCCs is a lossy compression — it discards information that a CNN might find useful. By feeding the full mel spectrogram (without DCT) to a convolutional network, we let the model decide which spectral patterns matter. This approach treats audio as a single-channel image, and all the tricks from computer vision transfer directly.

Beyond mel features, several other spectral descriptors are useful for building audio representations:

  1. Spectral centroid — the "center of mass" of the power spectrum, correlated with perceived brightness
  2. Spectral bandwidth — the spread of energy around the centroid
  3. Spectral rolloff — the frequency below which a fixed fraction (typically 85%) of the total energy lies
  4. Zero-crossing rate — how often the waveform changes sign, a cheap proxy for noisiness

For data augmentation, the audio domain has its own powerful technique: SpecAugment (Park et al., 2019). Instead of augmenting the raw waveform, SpecAugment operates directly on the spectrogram with two simple masking operations:

  1. Frequency masking — zero out a random band of consecutive mel channels
  2. Time masking — zero out a random span of consecutive frames

SpecAugment is beautifully simple — it's essentially Cutout applied to spectrograms — yet it achieved state-of-the-art results on LibriSpeech when introduced.

import numpy as np

def spectral_centroid(power_spectrum, freqs):
    """Center of mass of the power spectrum."""
    return np.sum(freqs * power_spectrum) / (np.sum(power_spectrum) + 1e-10)

def spectral_bandwidth(power_spectrum, freqs, centroid):
    """Spread of the spectrum around its centroid."""
    deviation = (freqs - centroid) ** 2
    return np.sqrt(np.sum(deviation * power_spectrum) / (np.sum(power_spectrum) + 1e-10))

def spec_augment(mel_spec, freq_mask_param=15, time_mask_param=25,
                 n_freq_masks=2, n_time_masks=2):
    """SpecAugment: frequency and time masking on mel spectrogram."""
    augmented = mel_spec.copy()
    n_mels, n_frames = augmented.shape

    for _ in range(n_freq_masks):
        f = np.random.randint(0, freq_mask_param + 1)
        f0 = np.random.randint(0, max(1, n_mels - f))
        augmented[f0:f0 + f, :] = 0  # Zero out frequency bands

    for _ in range(n_time_masks):
        t = np.random.randint(0, time_mask_param + 1)
        t0 = np.random.randint(0, max(1, n_frames - t))
        augmented[:, t0:t0 + t] = 0  # Zero out time steps

    return augmented

# Example: augment a mel spectrogram
mel = np.random.rand(40, 100)  # 40 mel bands, 100 frames
augmented = spec_augment(mel)
# Visible horizontal and vertical black bars in the spectrogram

7. From Features to Foundation Models

The history of audio feature extraction mirrors a broader story in machine learning: the march from hand-crafted representations to learned ones.

In the 1980s, Davis & Mermelstein showed that 13 MFCCs could achieve 96.5% word recognition accuracy — a triumph of domain knowledge baked into a feature pipeline. For thirty years, MFCCs + Gaussian Mixture Models + Hidden Markov Models was the speech recognition stack. Every component was designed by humans who understood acoustics.

Around 2012, deep learning started replacing each hand-crafted component. First, the GMM-HMM back-end was replaced by deep neural networks trained on the same MFCC features. Then researchers discovered that feeding log-mel spectrograms directly to CNNs worked better than MFCCs — the network could learn its own cepstral-like features, plus additional patterns that the fixed DCT discarded.

wav2vec 2.0 (Baevski et al. 2020) pushed this further by learning from raw waveforms using self-supervised learning. A 7-layer 1D CNN encodes raw 16 kHz audio into latent vectors at ~49 Hz. During training, the model masks random segments and must identify the correct quantized representation among 100 distractors — a contrastive learning objective. With just 10 minutes of labeled data (plus 53,000 hours of unlabeled audio for pre-training), wav2vec 2.0 achieves 4.8% word error rate on LibriSpeech — approaching human performance.

OpenAI's Whisper (Radford et al. 2023) took a different path: instead of self-supervision, it uses massive supervised training on 680,000 hours of weakly-labeled internet audio. Whisper feeds 80-bin log-mel spectrograms (25 ms window, 10 ms hop) through two 1D convolution layers into a Transformer encoder, then an autoregressive decoder generates the transcript. The sheer scale of training data makes Whisper remarkably robust to accents, background noise, and technical jargon — without any fine-tuning.

The Audio Spectrogram Transformer (Gong et al. 2021) brings the self-attention revolution full circle: it treats the mel spectrogram as an image, splits it into 16×16 patches (just like a Vision Transformer), and classifies audio events with purely attention-based processing — no convolutions at all. Initializing from an ImageNet-pretrained ViT provides a strong starting point, showing that visual features transfer surprisingly well to audio.

The trajectory is clear: MFCCs (1980) → log-mel + CNNs (2012) → raw waveforms + self-supervision (2020) → foundation models (2023). Each step moves the "feature engineering" boundary deeper into the model, replacing human acoustic knowledge with learned representations. But understanding the hand-crafted pipeline — Fourier transforms, mel filterbanks, cepstral coefficients — remains essential for debugging, designing architectures, and building intuition about what these models are actually learning inside their layers.

Conclusion

We've walked the entire path from pressure waves hitting a microphone to the feature representations that power modern speech recognition and audio classification. The journey has a satisfying structure: raw samples are too high-dimensional and poorly aligned with perception, so the Fourier transform maps them to frequency space. The spectrogram adds time resolution. The mel filterbank compresses frequencies to match human hearing. The DCT decorrelates and compacts into MFCCs. And modern deep learning is gradually learning to do all of this — and more — automatically.

Whether you're building a voice assistant, a music recommendation engine, or an environmental sound classifier, the principles in this post form the foundation. Even when using a pre-trained Whisper model or wav2vec 2.0, understanding the pipeline helps you debug unexpected failures (is the sampling rate wrong? is the mel spectrogram normalized correctly?) and make informed design choices (how many mel bands? what window size for your use case?).

The audio domain is also a beautiful illustration of a deeper truth in ML: the best features are the ones that capture the right invariances. MFCCs succeed because they're invariant to pitch while preserving phoneme identity. Mel filterbanks succeed because they match the frequency resolution of human hearing. And foundation models succeed because they learn invariances we never thought to encode by hand.
