
Video Understanding from Scratch

1. What Makes Video Hard

A video of someone throwing a ball and a video of someone catching a ball can contain identical frames — just in different order. Any frame-by-frame image classifier would score them identically. Temporal ordering isn't a luxury; it's the signal.

Every vision post in this series — convnets, ResNets, ViTs, object detection, segmentation — treats an image as a static 2D grid. But video adds a dimension that changes everything: time. And with time come three challenges that images never face.

Temporal redundancy. A 30fps video produces 30 images per second, but the scene changes slowly. Adjacent frames are nearly identical. Processing every frame independently is absurdly wasteful — most of the computation would be repeated.

Long-range dependency. An action like "making a sandwich" plays out over minutes. A 16-frame clip (roughly half a second) captures a hand reaching for bread but misses the spreading, the stacking, and the cutting. Understanding the full action requires reasoning across long temporal windows.

Motion as a feature. The difference between a raised arm and a lowered arm is spatial — you can see it in a single frame. But the difference between waving and holding still is purely temporal. Motion exists only in the gap between frames.

The fundamental data format is a 4D tensor: (T, H, W, C) — time, height, width, channels. Extending our 2D toolbox to handle this volume is the core engineering challenge. We can either process each frame separately (losing temporal information), compute differences between frames (cheap but crude), or extend our filters into the temporal dimension (3D convolutions). Every approach in this post explores a different tradeoff along these axes.

Let's start by seeing what raw frame differences look like:

import numpy as np

# Simulate a 10-frame video: 64x64 RGB frames
# Static background with a white square moving right
T, H, W = 10, 64, 64
video = np.zeros((T, H, W, 3), dtype=np.float32)

for t in range(T):
    video[t, :, :] = [0.2, 0.3, 0.4]        # blue-gray background
    x_start = 5 + t * 5                       # square moves right
    video[t, 20:35, x_start:x_start+15] = [1.0, 1.0, 1.0]  # white square

# Frame differences: where did pixels change?
diffs = np.abs(video[1:] - video[:-1])         # shape (9, 64, 64, 3)
motion_magnitude = diffs.mean(axis=-1)          # average across RGB

print(f"Video shape: {video.shape}")            # (10, 64, 64, 3)
print(f"Diff shape:  {diffs.shape}")            # (9, 64, 64, 3)
print(f"Max motion at frame 0->1: {motion_magnitude[0].max():.3f}")
print(f"Static region motion:     {motion_magnitude[0, 0, 0]:.3f}")
# Moving square has high difference; static background is zero

The frame difference tensor is the crudest form of temporal signal — but it already separates moving objects from static background. Every method we'll build improves on this basic idea.

2. Frame Sampling Strategies

Before any model touches a video, it must decide which frames to look at. A 10-second clip at 30fps has 300 frames — processing all of them is expensive and redundant. The sampling strategy determines what temporal context the model actually sees.

Dense sampling grabs a fixed-length clip of consecutive frames. C3D, the pioneering 3D convolution model, uses 16-frame clips. This is simple but myopic — 16 frames at 30fps covers only half a second. You might capture a punch but miss the windup.

Uniform stride selects T frames evenly spaced across the entire video. If you have a 300-frame video and want 16 frames, you take every 18th frame. This covers the full temporal extent but can alias fast motions — a hand clap between sampled frames is invisible.

Temporal Segment Networks (Wang et al., 2016) introduced a smarter approach: divide the video into K equal temporal segments and randomly sample one snippet from each. The K snippet predictions are aggregated by averaging. The key insight is that consecutive frames are so redundant that one representative per segment is enough — and by covering all K segments, you see the full action arc. TSN achieved 94.2% on UCF-101 and won the ActivityNet challenge in 2016.

import numpy as np

def dense_sample(n_frames, clip_len=16, start=0):
    """Consecutive frames starting at 'start'."""
    return list(range(start, min(start + clip_len, n_frames)))

def uniform_stride(n_frames, n_samples=16):
    """Evenly spaced frames across full video."""
    stride = max(1, n_frames // n_samples)
    return [i * stride for i in range(n_samples) if i * stride < n_frames]

def tsn_sample(n_frames, n_segments=8, rng=None):
    """One random frame per temporal segment (TSN-style)."""
    rng = rng or np.random.default_rng(42)
    seg_len = n_frames // n_segments
    indices = []
    for k in range(n_segments):
        start = k * seg_len
        end = min(start + seg_len, n_frames)
        indices.append(rng.integers(start, end))
    return indices

n_frames = 300  # 10 seconds at 30fps
print("Dense (16 frames):", dense_sample(n_frames))
# [0, 1, 2, ..., 15] — only first 0.5 seconds
print("Uniform stride:   ", uniform_stride(n_frames, 16))
# [0, 18, 36, ..., 270] — full video, rigid spacing
print("TSN (8 segments): ", tsn_sample(n_frames, 8))
# One random frame per ~37-frame segment — full coverage, variety

Modern architectures like SlowFast combine aspects of all three: a "slow" pathway samples sparse frames at large stride for semantics, while a "fast" pathway samples dense frames at small stride for motion. The right sampling strategy depends on whether your task cares more about what happens (appearance) or when it happens (temporal dynamics).
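
The dual-rate idea reduces to index arithmetic. A minimal sketch of SlowFast-style sampling (the stride values here are illustrative, not the paper's exact configuration):

```python
import numpy as np

def slowfast_sample(n_frames, alpha=8, fast_stride=2):
    """Sketch of dual-rate sampling: the fast pathway samples densely
    at a small stride; the slow pathway keeps every alpha-th fast frame."""
    fast_idx = np.arange(0, n_frames, fast_stride)  # dense, fine-grained
    slow_idx = fast_idx[::alpha]                    # sparse, alpha x coarser
    return slow_idx, fast_idx

slow_idx, fast_idx = slowfast_sample(64, alpha=8, fast_stride=2)
print("Fast pathway:", len(fast_idx), "frames")  # 32 frames, stride 2
print("Slow pathway:", list(slow_idx))           # [0, 16, 32, 48]
```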

3. Optical Flow — Classical Motion Estimation

Frame differences tell us that something moved. Optical flow tells us where it moved. It computes a 2D velocity vector (u, v) for every pixel, pointing from its position in frame t to its corresponding position in frame t+1. A ball moving right gets flow vectors pointing right. A person walking left gets flow vectors pointing left. The static background gets near-zero vectors.

The foundation is the brightness constancy assumption: a pixel's intensity doesn't change as it moves. If point (x, y) at time t moves to (x+dx, y+dy) at time t+dt, then:

I(x, y, t) = I(x + dx, y + dy, t + dt)

Taking a first-order Taylor expansion and dividing by dt gives the optical flow constraint equation:

Ix · u + Iy · v + It = 0

where Ix, Iy are spatial gradients and It is the temporal gradient. One equation, two unknowns — this is the aperture problem. Through a small window, you can measure motion perpendicular to an edge but not motion along it.
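
A toy calculation makes the aperture problem concrete. The gradient values below are invented: they describe a vertical edge moving right, so Iy = 0 and only the horizontal motion component is recoverable:

```python
# One pixel's constraint, Ix*u + Iy*v + It = 0, is a line in (u, v) space.
# Hypothetical measurements at a pixel on a vertical edge moving right:
Ix, Iy, It = 1.0, 0.0, -2.0

# Every (u, v) with u = 2 satisfies the constraint, regardless of v:
for v in [-1.0, 0.0, 3.0]:
    u = -(Iy * v + It) / Ix
    residual = Ix * u + Iy * v + It
    print(f"u={u:.1f}, v={v:.1f}, residual={residual:.1f}")
# The normal component (u, perpendicular to the edge) is pinned at 2,
# but the component along the edge (v) is completely unconstrained.
```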

Lucas-Kanade (1981)

The Lucas-Kanade method resolves the aperture problem by assuming that flow is constant within a small patch. A 5×5 window gives 25 pixels, each contributing one constraint equation, but all sharing the same unknown (u, v). This overdetermined system is solved by least squares:

[u, v]ᵀ = (AᵀA)⁻¹ Aᵀb,  where A = [[Ix₁, Iy₁], [Ix₂, Iy₂], ...],  b = [-It₁, -It₂, ...]

Since AᵀA is just a 2×2 matrix (the structure tensor), this is cheap to invert analytically. Lucas-Kanade is fast and sparse — it works well at corners and textured regions but fails on uniform areas (where AᵀA is singular) and at large displacements (where the local constancy assumption breaks).

Horn-Schunck (1981)

Horn-Schunck takes the opposite approach: compute dense flow everywhere by adding a global smoothness regularizer. The flow field should vary smoothly — nearby pixels should have similar velocities. This produces flow at every pixel but assumes globally smooth motion, which breaks at object boundaries where a foreground and background move differently.
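
A minimal sketch of the Horn-Schunck iteration, assuming the gradient images Ix, Iy, It are precomputed (and simplifying boundary handling with wraparound):

```python
import numpy as np

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Sketch of Horn-Schunck: dense flow via iterative updates that
    balance the brightness constraint against a smoothness prior."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        # 4-neighbor average approximates the local mean flow
        u_avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                        + np.roll(u, 1, 1) + np.roll(u, -1, 1))
        v_avg = 0.25 * (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                        + np.roll(v, 1, 1) + np.roll(v, -1, 1))
        # Pull the averaged flow back toward the constraint line
        t = (Ix * u_avg + Iy * v_avg + It) / denom
        u = u_avg - Ix * t
        v = v_avg - Iy * t
    return u, v

# Uniform rightward motion: Ix=1, Iy=0, It=-1 implies u=1 everywhere
Ix = np.ones((8, 8)); Iy = np.zeros((8, 8)); It = -np.ones((8, 8))
u, v = horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=200)
print(f"Recovered flow: u~{u.mean():.3f}, v~{v.mean():.3f}")  # u~1, v~0
```

The smoothness term is what propagates flow into uniform regions where Lucas-Kanade gives up.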

Here's a from-scratch Lucas-Kanade implementation:

import numpy as np

def lucas_kanade_flow(frame1, frame2, window=5):
    """Compute sparse optical flow between two grayscale frames."""
    # Spatial gradients via Sobel-like kernels
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8.0
    ky = kx.T
    Ix = convolve2d(frame1, kx)
    Iy = convolve2d(frame1, ky)
    It = frame2 - frame1  # temporal gradient

    h, w = frame1.shape
    half = window // 2
    u_flow = np.zeros_like(frame1)
    v_flow = np.zeros_like(frame1)

    for y in range(half, h - half):
        for x in range(half, w - half):
            # Extract local window
            ix = Ix[y-half:y+half+1, x-half:x+half+1].flatten()
            iy = Iy[y-half:y+half+1, x-half:x+half+1].flatten()
            it = It[y-half:y+half+1, x-half:x+half+1].flatten()

            # A^T A is a 2x2 matrix: [[sum(Ix*Ix), sum(Ix*Iy)],
            #                          [sum(Ix*Iy), sum(Iy*Iy)]]
            ATA = np.array([[np.sum(ix*ix), np.sum(ix*iy)],
                            [np.sum(ix*iy), np.sum(iy*iy)]])
            ATb = np.array([-np.sum(ix*it), -np.sum(iy*it)])

            det = ATA[0,0]*ATA[1,1] - ATA[0,1]*ATA[1,0]
            if abs(det) > 1e-6:  # only where structure tensor is invertible
                flow = np.linalg.solve(ATA, ATb)
                u_flow[y, x] = flow[0]
                v_flow[y, x] = flow[1]

    return u_flow, v_flow

def convolve2d(img, kernel):
    """Simple 2D filtering with zero padding (cross-correlation: no kernel flip)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode='constant')
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y+kh, x:x+kw] * kernel)
    return out

# Example: moving object produces nonzero flow
frame1 = np.zeros((32, 32), dtype=np.float64)
frame2 = np.zeros((32, 32), dtype=np.float64)
frame1[12:20, 10:18] = 1.0   # white square at position A
frame2[12:20, 13:21] = 1.0   # same square shifted right by 3px

u, v = lucas_kanade_flow(frame1, frame2, window=5)
print(f"Flow at object edge (15,17): u={u[15,17]:.2f}, v={v[15,17]:.2f}")
# u > 0 (rightward motion), v ~ 0 (no vertical motion)

Optical flow was the temporal backbone of video understanding for over a decade. Two-stream networks (Section 5) feed stacked flow fields to a dedicated CNN, achieving dramatic improvements over RGB-only models. Modern deep learning methods like RAFT (Teed & Deng, 2020) replace the handcrafted optimization with learned feature matching, but the fundamental idea — motion vectors as a feature channel — remains central.

Try It: Optical Flow Visualizer

A circle moves across a static background. The left panel shows two consecutive frames; the right panel shows the Lucas-Kanade optical flow field. Adjust velocity to see how flow vectors change with speed and direction.

4. 3D Convolutions — C3D, I3D, and R(2+1)D

Optical flow is a handcrafted temporal feature. What if we let the network learn its own temporal patterns? The natural extension is to add a time dimension to convolution filters. A 2D convolution filter with shape (kH, kW) slides over the spatial dimensions (H, W). A 3D convolution filter with shape (kT, kH, kW) slides over the spatiotemporal volume (T, H, W), capturing local patterns that span both space and time.

C3D (Tran et al., 2015) showed that 3×3×3 convolutions throughout a deep network learn effective spatiotemporal features. Trained on the Sports-1M dataset, C3D features transferred well to action recognition, scene classification, and video captioning — establishing 3D conv features as the "ImageNet features" of video. The catch: 78 million parameters and no way to leverage ImageNet pretrained weights.

I3D (Carreira & Zisserman, 2017) solved the pretraining problem with a beautiful idea: weight inflation. Take a pretrained 2D filter of shape (kH, kW), replicate it kT times along the time axis, and divide by kT. The inflated 3D filter produces the same output as the 2D filter on a static frame, but can learn temporal patterns through fine-tuning. I3D achieved 80.2% on HMDB-51 and introduced the Kinetics benchmark that became the standard for action recognition.
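
Inflation is simple enough to verify in a few lines; this sketch checks the static-frame equivalence directly:

```python
import numpy as np

def inflate_2d_filter(w2d, kT=3):
    """I3D-style inflation: replicate a 2D filter kT times along the
    time axis and rescale by 1/kT."""
    return np.repeat(w2d[np.newaxis], kT, axis=0) / kT  # (kT, kH, kW)

rng = np.random.default_rng(0)
w2d = rng.standard_normal((3, 3))
w3d = inflate_2d_filter(w2d, kT=3)

# On a static video (the same frame repeated), the inflated 3D filter
# produces exactly the 2D filter's response
frame = rng.standard_normal((8, 8))
static_video = np.stack([frame] * 3)

out_2d = np.sum(frame[:3, :3] * w2d)              # one 2D conv position
out_3d = np.sum(static_video[:, :3, :3] * w3d)    # same position, 3D conv
print(f"2D: {out_2d:.4f}, inflated 3D: {out_3d:.4f}")  # identical
```

This is why inflated networks start from ImageNet-level accuracy on frame-level appearance and only need to learn the temporal deviations.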

R(2+1)D (Tran et al., 2018) went further by factorizing 3D convolution: instead of a single (3, 3, 3) filter, use a spatial (1, 3, 3) filter followed by a temporal (3, 1, 1) filter. This decomposition adds an extra nonlinearity between the spatial and temporal operations, making optimization easier. R(2+1)D outperformed full 3D convolutions by 2.3%.

import numpy as np

def conv3d(volume, kernel):
    """Apply a single 3D convolution filter to a (T, H, W) volume."""
    kT, kH, kW = kernel.shape
    T, H, W = volume.shape
    oT = T - kT + 1
    oH = H - kH + 1
    oW = W - kW + 1
    output = np.zeros((oT, oH, oW))
    for t in range(oT):
        for y in range(oH):
            for x in range(oW):
                patch = volume[t:t+kT, y:y+kH, x:x+kW]
                output[t, y, x] = np.sum(patch * kernel)
    return output

def r2plus1d(volume, spatial_k, temporal_k):
    """Factorized (2+1)D: spatial conv then temporal conv."""
    T, H, W = volume.shape
    # Spatial: apply (1, 3, 3) kernel per frame
    kH, kW = spatial_k.shape
    oH, oW = H - kH + 1, W - kW + 1
    spatial_out = np.zeros((T, oH, oW))
    for t in range(T):
        for y in range(oH):
            for x in range(oW):
                spatial_out[t, y, x] = np.sum(volume[t, y:y+kH, x:x+kW] * spatial_k)
    spatial_out = np.maximum(spatial_out, 0)  # ReLU between spatial and temporal

    # Temporal: apply (3, 1, 1) kernel across time
    kT = temporal_k.shape[0]
    oT = T - kT + 1
    output = np.zeros((oT, oH, oW))
    for t in range(oT):
        for y in range(oH):
            for x in range(oW):
                output[t, y, x] = np.sum(spatial_out[t:t+kT, y, x] * temporal_k)
    return output

# Compare: full 3D vs factorized R(2+1)D
rng = np.random.default_rng(42)
volume = rng.standard_normal((8, 16, 16))  # 8 frames, 16x16 spatial
kernel_3d = rng.standard_normal((3, 3, 3))
spatial_k = rng.standard_normal((3, 3))
temporal_k = rng.standard_normal((3,))

out_3d = conv3d(volume, kernel_3d)
out_r21d = r2plus1d(volume, spatial_k, temporal_k)
print(f"3D conv output: {out_3d.shape}")     # (6, 14, 14)
print(f"R(2+1)D output: {out_r21d.shape}")   # (6, 14, 14)
# Same output shape, but R(2+1)D has an extra ReLU and is easier to optimize

The progression tells a clear story: C3D proved that learned 3D features work, I3D showed that ImageNet pretraining transfers via inflation, and R(2+1)D demonstrated that factorizing spatial and temporal dimensions is both cheaper and more effective than full 3D. This factorization idea resurfaces in video transformers (Section 6), where divided attention separates spatial and temporal self-attention.

5. Two-Stream Networks

In 2014, Simonyan and Zisserman proposed an architecture inspired by neuroscience. The primate visual cortex has two distinct processing streams: the ventral stream ("what pathway") handles object recognition and appearance, while the dorsal stream ("where/how fast pathway") processes motion. Two-stream networks implement this decomposition directly: a spatial stream, a CNN that classifies actions from single RGB frames, and a temporal stream, a second CNN that classifies them from stacks of optical flow fields computed between consecutive frames.

The key insight is that appearance and motion are complementary. The spatial stream learns that a person in a swimming pose is probably swimming. The temporal stream learns that rhythmic arm-rotation patterns indicate swimming regardless of what the person looks like. Together they achieve what neither can alone.

The results were striking: 88.0% on UCF-101. The temporal stream alone hit 83.7% — showing optical flow is an extraordinarily powerful feature for action recognition.

import numpy as np

def spatial_features(rgb_frame, filters):
    """Simulate spatial stream: extract appearance features from RGB."""
    # Apply learned filters to detect textures, edges, objects
    h, w, c = rgb_frame.shape
    n_filters = filters.shape[0]
    features = np.zeros((h - 2, w - 2, n_filters))
    for f in range(n_filters):
        for y in range(h - 2):
            for x in range(w - 2):
                patch = rgb_frame[y:y+3, x:x+3, :]
                features[y, x, f] = np.sum(patch * filters[f])
    return np.maximum(features, 0)  # ReLU

def temporal_features(flow_stack, filters):
    """Simulate temporal stream: extract motion features from flow."""
    # flow_stack shape: (L, H, W, 2) — L frames of (u, v) flow
    # Stack into (H, W, 2*L) input like the original paper
    h, w = flow_stack.shape[1], flow_stack.shape[2]
    stacked = flow_stack.transpose(1, 2, 0, 3).reshape(h, w, -1)  # (H, W, 2L)
    n_filters = filters.shape[0]
    features = np.zeros((h - 2, w - 2, n_filters))
    for f in range(n_filters):
        for y in range(h - 2):
            for x in range(w - 2):
                patch = stacked[y:y+3, x:x+3, :]
                features[y, x, f] = np.sum(patch * filters[f])
    return np.maximum(features, 0)

# Simulate: RGB detects appearance, flow detects motion
rng = np.random.default_rng(42)
rgb = rng.random((16, 16, 3))
flow = rng.standard_normal((10, 16, 16, 2)) * 0.5  # 10 flow frames

spatial_f = rng.standard_normal((4, 3, 3, 3))      # 4 filters for RGB
temporal_f = rng.standard_normal((4, 3, 3, 20))     # 4 filters for 20-ch flow

s_feat = spatial_features(rgb, spatial_f)
t_feat = temporal_features(flow, temporal_f)
print(f"Spatial features:  {s_feat.shape}")   # (14, 14, 4) — appearance
print(f"Temporal features: {t_feat.shape}")   # (14, 14, 4) — motion
# Late fusion: average softmax scores from both streams for final prediction
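
The late fusion mentioned in the final comment is just score averaging. A sketch with made-up class logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-class logits from each stream for one clip
spatial_logits = np.array([2.0, 0.5, 0.1])    # appearance favors class 0
temporal_logits = np.array([0.3, 2.5, 0.2])   # motion favors class 1

# Late fusion: average the softmax scores, then take the argmax
fused = (softmax(spatial_logits) + softmax(temporal_logits)) / 2
print("Fused scores:", np.round(fused, 3))
print("Predicted class:", fused.argmax())
```

Because motion evidence is often stronger for action classes, the fused prediction can differ from the spatial stream's alone, which is exactly where the complementarity pays off.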

The weakness of two-stream networks is practical: optical flow is expensive to compute (several seconds per frame on CPU), and running two full CNNs doubles inference cost. This motivated end-to-end approaches that learn temporal features without precomputed flow — 3D convolutions and, ultimately, video transformers.

6. Video Transformers — TimeSformer and ViViT

Once Vision Transformers showed that pure self-attention (no convolutions) could match CNNs on images, the question was obvious: can transformers replace 3D convolutions for video?

The naive approach tokenizes a video clip of (T, H, W) into T × N patch tokens (where N = (H/16) × (W/16) patches per frame) and applies full self-attention over all of them. But full space-time attention has complexity O(T²N²) — for 8 frames of 14×14 patches, that's 1,568 tokens attending to each other, roughly 2.5 million attention pairs per layer. Prohibitively expensive.

TimeSformer (Bertasius et al., 2021)

TimeSformer's solution is elegant: divided space-time attention. At each transformer block, apply temporal and spatial self-attention separately, in sequence: first temporal attention, where each patch attends to the patches at the same spatial location in the other frames, then spatial attention, where each patch attends to all patches within its own frame.

Complexity drops from O(T²N²) to O(T²N + TN²). For 8 frames of 196 patches, that's ~320K vs ~2.5M attention pairs — roughly an 8× reduction. The savings grow with longer videos: at T=32 frames, the ratio jumps to ~28×.
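
These counts are easy to check directly:

```python
def attention_pairs(T, N):
    full = (T * N) ** 2              # full space-time attention
    divided = T * T * N + T * N * N  # temporal pass + spatial pass
    return full, divided

for T in (8, 32):
    full, divided = attention_pairs(T, N=196)  # 14x14 = 196 patches
    print(f"T={T}: full={full:,}  divided={divided:,}  "
          f"ratio={full / divided:.1f}x")
# T=8:  2,458,624 vs 319,872 — about 8x
# T=32: 39,337,984 vs 1,430,016 — about 28x
```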

ViViT (Arnab et al., 2021)

ViViT explored four factorization variants and found the "factorized encoder" most effective: run spatial transformer blocks over all patches within each frame first, producing a per-frame class token, then run temporal transformer blocks across the T per-frame tokens. This sequential separation achieved 81.3% on Kinetics-400.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Standard scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores)
    return weights @ V, weights

def divided_spacetime_attention(tokens, T, N, Wq, Wk, Wv):
    """
    tokens: (T*N, D) — all patch tokens from T frames, N patches each
    Returns updated tokens after temporal then spatial attention.
    """
    D = tokens.shape[-1]
    tokens_3d = tokens.reshape(T, N, D)

    # --- Temporal attention: each spatial position attends across time ---
    # Rearrange to (N, T, D) so each spatial position is a "batch"
    temporal_in = tokens_3d.transpose(1, 0, 2)  # (N, T, D)
    Q = temporal_in @ Wq  # (N, T, D)
    K = temporal_in @ Wk
    V = temporal_in @ Wv
    temporal_out, t_weights = attention(Q, K, V)  # (N, T, D)
    tokens_3d = temporal_out.transpose(1, 0, 2)   # (T, N, D)

    # --- Spatial attention: each frame's patches attend to each other ---
    Q = tokens_3d @ Wq  # (T, N, D)
    K = tokens_3d @ Wk
    V = tokens_3d @ Wv
    spatial_out, s_weights = attention(Q, K, V)    # (T, N, D)

    return spatial_out.reshape(T * N, D), t_weights, s_weights

# Example: 4 frames, 9 patches each (3x3 grid), 16-dim embeddings
rng = np.random.default_rng(42)
T, N, D = 4, 9, 16
tokens = rng.standard_normal((T * N, D))
Wq = rng.standard_normal((D, D)) * 0.1
Wk = rng.standard_normal((D, D)) * 0.1
Wv = rng.standard_normal((D, D)) * 0.1

out, t_w, s_w = divided_spacetime_attention(tokens, T, N, Wq, Wk, Wv)
print(f"Input:  {tokens.shape}")     # (36, 16) — 36 tokens
print(f"Output: {out.shape}")        # (36, 16) — same shape
print(f"Temporal attn: {t_w.shape}") # (9, 4, 4) — each patch: 4x4 across time
print(f"Spatial attn:  {s_w.shape}") # (4, 9, 9) — each frame: 9x9 within frame
print(f"Divided pairs:  {T*T*N + T*N*N}")  # 4*4*9 + 4*9*9 = 468
print(f"Full ST pairs:  {(T*N)**2}")        # 36*36 = 1296
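
For comparison, ViViT's factorized encoder can be sketched in the same toy setting. This is a simplification: real ViViT uses full transformer blocks and class tokens rather than a bare attention call and mean pooling:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the second-to-last axis."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def factorized_encoder(tokens, T, N, Wq, Wk, Wv):
    """ViViT-style factorized encoder (simplified): spatial attention
    within each frame, pool to one token per frame, then temporal
    attention over the T frame-level tokens."""
    frames = tokens.reshape(T, N, -1)
    spatial_out = self_attention(frames, Wq, Wk, Wv)  # (T, N, D)
    frame_tokens = spatial_out.mean(axis=1)           # (T, D), one per frame
    return self_attention(frame_tokens, Wq, Wk, Wv)   # (T, D)

rng = np.random.default_rng(0)
T, N, D = 4, 9, 16
tokens = rng.standard_normal((T * N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = factorized_encoder(tokens, T, N, Wq, Wk, Wv)
print(f"Output: {out.shape}")  # (4, 16) — one token per frame
```

Where divided attention interleaves the two axes inside every block, the factorized encoder finishes all spatial processing before any temporal mixing happens.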

The leap from 3D CNNs to video transformers mirrors the image domain (CNNs → ViT): transformers have a global receptive field from layer 1, while 3D CNNs build up temporal reach gradually through stacked layers. And unlike two-stream networks, transformers need no precomputed optical flow — they learn to attend to motion implicitly by comparing patch embeddings across frames.

Try It: Temporal Attention Explorer

Click any patch in the 6-frame video strip to see what it attends to. Toggle between full space-time attention (all pairs) and divided attention (temporal + spatial separately) to see the dramatic complexity reduction.

7. Modern Approaches — SlowFast and VideoMAE

Two recent innovations reshaped the landscape of video understanding.

SlowFast Networks (Feichtenhofer et al., 2019)

SlowFast is a dual-pathway network inspired by the observation that not all temporal information needs the same spatial detail. The Slow pathway processes T/α frames (e.g., 8 from a 64-frame clip) at full spatial resolution — it captures what the scene looks like, what objects are present, the semantic context. The Fast pathway processes all T frames but at dramatically reduced channel width (β = 1/8 of Slow's channels) — it captures rapid motion with fine temporal resolution but minimal spatial computation.

Lateral connections feed Fast pathway features into the Slow pathway at each stage. The result: 79.8% top-1 on Kinetics-400, with less computation than running two separate full-resolution streams. The key insight is that slow-changing semantics require spatial resolution, while fast-changing motion can be encoded efficiently with narrow representations.
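
The pathway shapes can be sketched with toy tensors (channel counts are illustrative; the reshape below corresponds to the paper's time-to-channel fusion option):

```python
import numpy as np

# Slow: T/alpha frames, wide channels. Fast: all T frames, beta*C channels.
T, H, W = 64, 14, 14
alpha, C, beta = 8, 256, 1 / 8

slow = np.zeros((T // alpha, H, W, C))             # (8, 14, 14, 256)
fast = np.zeros((T, H, W, int(beta * C)))          # (64, 14, 14, 32)

# Lateral connection via time-to-channel: fold alpha consecutive fast
# frames into the channel dim so the time axes line up, then concatenate
fast_to_slow = fast.reshape(T // alpha, alpha, H, W, -1)
fast_to_slow = fast_to_slow.transpose(0, 2, 3, 1, 4).reshape(
    T // alpha, H, W, -1)                          # (8, 14, 14, 256)
fused = np.concatenate([slow, fast_to_slow], axis=-1)
print("Slow:", slow.shape, "Fast:", fast.shape, "Fused:", fused.shape)
```

Note how cheap the Fast pathway is: 8× more frames but 8× fewer channels, so the folded features add no more channels than the Slow pathway already has.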

VideoMAE (Tong et al., 2022)

VideoMAE adapts Masked Autoencoder pretraining to video. The core challenge is temporal redundancy: in images, masking 75% of patches forces the model to learn meaningful visual features. But in video, adjacent frames are nearly identical — a masked patch can be trivially reconstructed by copying from the previous or next frame.

The solution is tube masking: apply the same random spatial mask consistently across all T frames, creating "tubes" of masked patches through the temporal dimension. Now reconstruction requires genuine spatiotemporal understanding, not just temporal interpolation. With an aggressive 90-95% masking ratio, VideoMAE achieves state-of-the-art results on Kinetics-400, and its pretraining remains effective even on small datasets of only ~3,500 videos, without any extra data — far less than supervised pretraining typically requires.
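
Tube masking itself is a few lines of index manipulation. A sketch:

```python
import numpy as np

def tube_mask(T, N, mask_ratio=0.9, rng=None):
    """VideoMAE-style tube masking: draw one random spatial mask and
    repeat it across all T frames, forming temporal tubes."""
    rng = rng or np.random.default_rng(0)
    n_masked = int(N * mask_ratio)
    spatial = np.zeros(N, dtype=bool)
    spatial[rng.choice(N, n_masked, replace=False)] = True
    return np.tile(spatial, (T, 1))   # (T, N): same mask every frame

mask = tube_mask(T=8, N=196, mask_ratio=0.9)
print("Visible patches per frame:", int((~mask[0]).sum()))  # 20 of 196
print("Mask identical across frames:", bool((mask == mask[0]).all()))
# With frame-independent masks, a hidden patch is usually visible in a
# neighboring frame; tubes eliminate that copy-paste shortcut.
```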

The most recent milestone is InternVideo2 (Wang et al., 2024), a 6-billion parameter video encoder trained progressively: first VideoMAE-style masked reconstruction for spatiotemporal perception, then CLIP-style contrastive video-text alignment, then instruction tuning for open-ended video dialogue. It sets state-of-the-art on over 60 video and audio benchmarks, demonstrating that the same scaling recipe that transformed NLP and image understanding applies to video.

8. Conclusion

The temporal dimension transforms computer vision from static pattern recognition into dynamic event understanding. We've traced a clear progression:

  1. Frame differences reveal that motion exists
  2. Optical flow quantifies where and how fast things move
  3. Two-stream networks learn appearance and motion features separately
  4. 3D convolutions learn spatiotemporal patterns jointly
  5. Video transformers attend globally across space and time
  6. Masked pretraining learns from temporal redundancy rather than fighting it

The same three tensions that run through every deep learning topic — local vs. global representations, supervised vs. self-supervised learning, computation vs. expressiveness — reappear here, resolved in novel ways by the video domain's unique structure. Temporal redundancy, which makes video expensive to process, turns out to be the very property that makes self-supervised pretraining (VideoMAE) so effective.

With this post, the multimodal perception arc is complete: static images (convnets through ViT), audio (features, recognition, synthesis), and now temporal video understanding. The frontier — multimodal models that jointly reason over video, audio, and language — builds on all of these foundations.

References & Further Reading