# Support Vector Machines from Scratch

## The Maximum Margin Idea
Imagine you have a 2D scatter plot with red dots and blue dots, and you need to draw a line separating them. There are infinitely many lines that work. So which one should you choose?
A neural network's answer depends on its random initialization and learning rate — run it twice and you might get two different boundaries. A decision tree draws axis-aligned rectangles that feel unnatural. But there's an algorithm that gives a principled, unique answer: choose the line that maximizes the margin — the widest possible gap between the decision boundary and the nearest data points from each class.
This is the core idea behind Support Vector Machines (SVMs), and it leads to one of the most elegant algorithms in machine learning.
The points closest to the boundary are called support vectors, and here's the remarkable part: they're the only data points that matter. You can move, add, or remove any point that isn't a support vector and the boundary won't budge. This makes SVMs naturally robust to outliers lurking in the interior of each class.
Mathematically, the margin is 2/||w|| where w is the weight vector defining the boundary. So maximizing the margin means minimizing ||w||². This gives us a constrained optimization problem:
Minimize ½||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for all data points i.
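To see the inverse relationship concretely, here is a tiny sketch (the weight vectors are made up for illustration, not taken from a trained model) showing how scaling w up shrinks the margin:

```python
import numpy as np

# Hypothetical weight vectors: the boundary is w . x + b = 0, the margin
# half-planes are w . x + b = +/-1, and the gap between them is 2 / ||w||
for w in [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]:
    print(f"||w|| = {np.linalg.norm(w):.3f} -> margin width = {2 / np.linalg.norm(w):.3f}")
# ||w|| = 1.414 -> margin width = 1.414
# ||w|| = 2.828 -> margin width = 0.707
# ||w|| = 5.657 -> margin width = 0.354
```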
Notice something familiar? Minimizing ||w||² is exactly L2 regularization — the same technique we explored in Regularization from Scratch. But while neural networks add regularization as an afterthought (a penalty term tacked onto the loss), SVMs have it baked into their DNA. The algorithm is regularization.
## Hard-Margin SVM — The Gradient Descent Approach
Let's build our first SVM from scratch. We'll use the hinge loss formulation, which converts the constrained optimization into an unconstrained one that we can solve with gradient descent — the same tool we used in Backpropagation from Scratch.
The hinge loss for a single data point is max(0, 1 - yᵢ(w·xᵢ + b)). If the point is on the correct side of the margin (the product yᵢ(w·xᵢ + b) ≥ 1), the loss is zero. If it's inside the margin or on the wrong side, the loss grows linearly with the violation. This is one of the loss functions that shaped the history of machine learning.
The total objective combines hinge loss with the regularization term:
L = ½||w||² + C · Σ max(0, 1 - yᵢ(w·xᵢ + b))
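As a quick sanity check on the hinge term, here is a minimal sketch evaluating it at three illustrative points (the boundary w = [1, 1], b = 0 is invented for this example):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Per-point hinge loss max(0, 1 - y * (w . x + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 2.0],      # far on the correct side -> zero loss
              [0.3, 0.3],      # correct side but inside the margin
              [-1.0, -1.0]])   # wrong side -> loss grows with the violation
y = np.array([1, 1, 1])
print(hinge_loss(w, b, X, y))  # → [0.  0.4 3. ]
```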
For a hard-margin SVM, we set C very large so that any margin violation is heavily penalized. Here's the implementation:
```python
import numpy as np

class HardMarginSVM:
    """SVM trained via sub-gradient descent on the hinge loss."""

    def __init__(self, C=1000.0, lr=0.001, epochs=1000):
        self.C = C            # penalty for margin violations
        self.lr = lr          # learning rate
        self.epochs = epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.epochs):
            for i in range(n_samples):
                margin = y[i] * (X[i] @ self.w + self.b)
                if margin < 1:
                    # point violates margin: hinge loss gradient
                    self.w -= self.lr * (self.w - self.C * y[i] * X[i])
                    self.b -= self.lr * (-self.C * y[i])
                else:
                    # point is safely classified: only regularization gradient
                    self.w -= self.lr * self.w

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

    def support_vectors(self, X, y):
        margins = y * self.decision_function(X)
        # support vectors lie near the margin boundary (margin ≈ 1)
        return X[margins <= 1.01]

# Generate linearly separable data
np.random.seed(42)
X_pos = np.random.randn(30, 2) + np.array([2, 2])
X_neg = np.random.randn(30, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*30 + [-1]*30)

svm = HardMarginSVM(C=1000, lr=0.0005, epochs=500)
svm.fit(X, y)
print(f"Weight vector: [{svm.w[0]:.3f}, {svm.w[1]:.3f}]")
print(f"Bias: {svm.b:.3f}")
print(f"Support vectors found: {len(svm.support_vectors(X, y))}")
# Weight vector: [1.247, 1.198]
# Bias: -0.034
# Support vectors found: 4
```
The sub-gradient is the key to understanding this code. The hinge loss isn't differentiable at exactly margin = 1 (it has a "kink"), so we use a sub-gradient: when the point violates the margin, the gradient pulls w toward correctly classifying that point; when the point is safe, the gradient just shrinks w (regularization). Over many epochs, the algorithm converges to the maximum-margin boundary.
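The piecewise rule inside `fit` can be written out as an explicit function; this is a sketch of the same sub-gradient computation, not a new algorithm:

```python
import numpy as np

def hinge_subgradient(w, b, x, y, C):
    """Sub-gradient of 0.5*||w||^2 + C*max(0, 1 - y*(w.x + b)) at one point."""
    if y * (x @ w + b) < 1:        # margin violated: hinge term is active
        return w - C * y * x, -C * y
    return w, 0.0                  # safe: only the regularization term remains

w, b = np.array([0.5, 0.5]), 0.0
grad_w, grad_b = hinge_subgradient(w, b, np.array([1.0, 0.0]), 1, C=10.0)
print(grad_w, grad_b)  # → [-9.5  0.5] -10.0
```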
## Soft-Margin SVM — Handling Real Data
Real-world datasets are almost never perfectly linearly separable. A few noisy points from one class might land on the wrong side, and a hard-margin SVM would contort its boundary trying to classify every single point correctly — or fail entirely if perfect separation is impossible.
The soft-margin SVM introduces slack variables ξᵢ that allow some points to violate the margin or even be misclassified. The C parameter controls this tradeoff:
- Large C: "I really care about classifying every point correctly" → narrow margin, risks overfitting to noise
- Small C: "I'd rather have a wide margin, even if some points are misclassified" → wide margin, risks underfitting
- Moderate C: The sweet spot where the SVM generalizes well
The beautiful thing is: the code barely changes. We already have the C parameter — we just need to set it to a reasonable value instead of cranking it to 1000.
```python
class SoftMarginSVM:
    """SVM with tunable C for the bias-variance tradeoff."""

    def __init__(self, C=1.0, lr=0.001, epochs=1000):
        self.C = C
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.epochs):
            for i in range(n_samples):
                margin = y[i] * (X[i] @ self.w + self.b)
                if margin < 1:
                    self.w -= self.lr * (self.w - self.C * y[i] * X[i])
                    self.b -= self.lr * (-self.C * y[i])
                else:
                    self.w -= self.lr * self.w

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

# Add noise: move some points close to the boundary
np.random.seed(42)
X_pos = np.random.randn(50, 2) + np.array([1.5, 1.5])
X_neg = np.random.randn(50, 2) + np.array([-1.5, -1.5])
X_noisy = np.vstack([X_pos, X_neg])
y_noisy = np.array([1]*50 + [-1]*50)

for C_val in [100.0, 1.0, 0.01]:
    svm = SoftMarginSVM(C=C_val, lr=0.0005, epochs=500)
    svm.fit(X_noisy, y_noisy)
    preds = svm.predict(X_noisy)
    acc = np.mean(preds == y_noisy) * 100
    margin_width = 2.0 / (np.linalg.norm(svm.w) + 1e-8)
    print(f"C={C_val:>6}: accuracy={acc:.0f}%, margin width={margin_width:.2f}")
# C= 100.0: accuracy=95%, margin width=0.87
# C=   1.0: accuracy=93%, margin width=1.42
# C=  0.01: accuracy=85%, margin width=3.61
```
See the tradeoff in action: C=100 achieves 95% training accuracy but with a razor-thin margin of 0.87. C=0.01 only hits 85% accuracy but enjoys a wide margin of 3.61. In practice, you'd use cross-validation to find the C that maximizes test accuracy — usually somewhere in between.
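The cross-validation loop itself is short to sketch. `kfold_indices` below is a hypothetical helper written for this post; the model-fitting step is left as a comment because any classifier with fit/predict methods slots in:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

splits = list(kfold_indices(n=100, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 5 80 20

for train_idx, val_idx in splits:
    pass  # fit on train_idx, score accuracy on val_idx, average per C candidate
```

The C value with the highest average validation accuracy across the folds is the one you would keep.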
### Try It: Margin Explorer
Click to add points (left-click = blue class, right-click = red class). Drag existing points to move them. Watch how only support vectors affect the boundary.
## The Kernel Trick — Infinite Dimensions for Free
Everything so far has been about linear boundaries — straight lines in 2D, flat planes in 3D. But what about data that can't be separated by any line?
Picture two concentric circles: the inner circle is class A, the outer ring is class B. No straight line can separate them. But here's a beautiful insight: if you add a third dimension z = x² + y², the circles "lift" apart vertically. In this 3D space, a flat plane easily separates the two classes — and when you project that plane back to 2D, it becomes a circle.
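A few lines verify the lifting idea (a sketch with noise-free rings, so the vertical gap is exact):

```python
import numpy as np

# Two concentric rings: radius 1 (class A) vs radius 3 (class B)
theta = np.linspace(0, 2 * np.pi, 50)
inner = np.column_stack([np.cos(theta), np.sin(theta)]) * 1.0
outer = np.column_stack([np.cos(theta), np.sin(theta)]) * 3.0

# The lifted coordinate z = x^2 + y^2 separates them with a flat threshold
z_inner = (inner ** 2).sum(axis=1)   # all 1.0
z_outer = (outer ** 2).sum(axis=1)   # all 9.0
print(z_inner.max() < 5.0 < z_outer.min())  # → True
```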
The problem with explicit feature mapping is computation. If you want polynomial features of degree d in p dimensions, the feature space has O(p^d) dimensions. For degree 5 with 100 features, that's 10 billion dimensions. And the RBF kernel? Its implicit feature space is infinite-dimensional.
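To put numbers on that growth (a rough count: p^d is the loose figure quoted above, while the exact dimension of the space of monomials of degree at most d in p variables is C(p + d, d), still astronomically large):

```python
from math import comb

p, d = 100, 5
print(p ** d)          # loose p^d count → 10000000000
print(comb(p + d, d))  # exact monomial count → 96560646
```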
The kernel trick solves this by exploiting a key property of SVMs: the decision function only depends on dot products between data points. Instead of computing the high-dimensional features φ(x) and then taking dot products, we directly compute K(x, x') = φ(x) · φ(x') using a kernel function — which operates entirely in the original feature space.
Three kernels, three different implicit feature spaces:
- Linear: K(x, x') = x · x' — no transformation (standard SVM)
- Polynomial: K(x, x') = (x · x' + 1)^d — all polynomial terms up to degree d
- RBF (Gaussian): K(x, x') = exp(-γ||x - x'||²) — infinite-dimensional feature space, computed with a single exponential!
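For the degree-2 polynomial kernel in 2-D, the implicit feature map is still small enough to write out by hand, so the trick can be checked numerically (`phi` below is the standard explicit map for this kernel, reconstructed here as an illustration):

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product equals (x . x' + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(a) @ phi(b)           # dot product in the 6-D feature space
kernel = (a @ b + 1.0) ** 2          # computed entirely in the original 2-D
print(np.isclose(explicit, kernel))  # → True
```

The kernel evaluates one dot product and one square; the explicit route had to build two 6-D vectors first, and that gap explodes as the degree and dimension grow.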
```python
class KernelSVM:
    """SVM with kernel trick using simplified SMO-style coordinate descent."""

    def __init__(self, kernel='rbf', C=1.0, gamma=1.0, degree=3, epochs=200):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        self.degree = degree
        self.epochs = epochs

    def _kernel(self, x1, x2):
        if self.kernel == 'linear':
            return x1 @ x2.T
        elif self.kernel == 'poly':
            return (x1 @ x2.T + 1) ** self.degree
        elif self.kernel == 'rbf':
            # ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 * x1 . x2
            sq1 = np.sum(x1 ** 2, axis=1, keepdims=True)
            sq2 = np.sum(x2 ** 2, axis=1, keepdims=True)
            dist_sq = sq1 + sq2.T - 2 * (x1 @ x2.T)
            return np.exp(-self.gamma * dist_sq)

    def fit(self, X, y):
        n = len(y)
        self.X_train = X
        self.y_train = y
        self.alphas = np.zeros(n)
        self.b = 0.0
        K = self._kernel(X, X)
        for epoch in range(self.epochs):
            for i in range(n):
                # decision value for point i
                f_i = np.sum(self.alphas * y * K[i]) + self.b
                error_i = f_i - y[i]
                # KKT violation check
                if (y[i] * f_i < 1 and self.alphas[i] < self.C) or \
                   (y[i] * f_i > 1 and self.alphas[i] > 0):
                    # pick a random j != i
                    j = i
                    while j == i:
                        j = np.random.randint(n)
                    f_j = np.sum(self.alphas * y * K[j]) + self.b
                    error_j = f_j - y[j]
                    eta = K[i, i] + K[j, j] - 2 * K[i, j]
                    if eta <= 0:
                        continue
                    alpha_j_old = self.alphas[j]
                    alpha_i_old = self.alphas[i]
                    # compute bounds
                    if y[i] != y[j]:
                        L = max(0, self.alphas[j] - self.alphas[i])
                        H = min(self.C, self.C + self.alphas[j] - self.alphas[i])
                    else:
                        L = max(0, self.alphas[i] + self.alphas[j] - self.C)
                        H = min(self.C, self.alphas[i] + self.alphas[j])
                    if L == H:
                        continue
                    # update alpha_j
                    self.alphas[j] += y[j] * (error_i - error_j) / eta
                    self.alphas[j] = np.clip(self.alphas[j], L, H)
                    # update alpha_i
                    self.alphas[i] += y[i] * y[j] * (alpha_j_old - self.alphas[j])
                    # update bias
                    b1 = self.b - error_i - y[i] * (self.alphas[i] - alpha_i_old) * K[i, i] \
                         - y[j] * (self.alphas[j] - alpha_j_old) * K[i, j]
                    b2 = self.b - error_j - y[i] * (self.alphas[i] - alpha_i_old) * K[i, j] \
                         - y[j] * (self.alphas[j] - alpha_j_old) * K[j, j]
                    if 0 < self.alphas[i] < self.C:
                        self.b = b1
                    elif 0 < self.alphas[j] < self.C:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2

    def decision_function(self, X):
        K = self._kernel(X, self.X_train)
        return K @ (self.alphas * self.y_train) + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

# Concentric circles dataset
np.random.seed(42)
n_per_class = 60
angles = np.random.uniform(0, 2 * np.pi, n_per_class)
r_inner = 1.0 + np.random.randn(n_per_class) * 0.15
r_outer = 3.0 + np.random.randn(n_per_class) * 0.3
X_inner = np.column_stack([r_inner * np.cos(angles), r_inner * np.sin(angles)])
X_outer = np.column_stack([r_outer * np.cos(angles), r_outer * np.sin(angles)])
X_circles = np.vstack([X_inner, X_outer])
y_circles = np.array([1]*n_per_class + [-1]*n_per_class)

for kernel_name in ['linear', 'poly', 'rbf']:
    svm = KernelSVM(kernel=kernel_name, C=10.0, gamma=1.0, degree=2, epochs=100)
    svm.fit(X_circles, y_circles)
    preds = svm.predict(X_circles)
    acc = np.mean(preds == y_circles) * 100
    sv_count = np.sum(svm.alphas > 1e-5)
    print(f"{kernel_name:>6} kernel: accuracy={acc:.0f}%, support vectors={sv_count}")
# linear kernel: accuracy=53%, support vectors=107
#   poly kernel: accuracy=100%, support vectors=16
#    rbf kernel: accuracy=100%, support vectors=18
```
The results tell a vivid story. The linear kernel is hopeless — 53% accuracy is barely better than random chance because no line can separate concentric circles. The polynomial kernel (degree 2) maps each point to a space that includes x² and y² terms, where the circles become separable — 100% accuracy with just 16 support vectors. The RBF kernel also achieves perfect separation using 18 support vectors, and its implicit feature space is infinite-dimensional — yet we never computed a single infinite-dimensional vector.
This is the magic of the kernel trick: you get the power of infinite dimensions for the cost of computing a single exponential.
### The RBF γ Parameter
The RBF kernel has its own knob to turn: γ (gamma). It controls how much influence each training point has:
- Small γ (wide Gaussian): each point influences a large area → smooth, simple boundary. Risk: underfitting.
- Large γ (narrow Gaussian): each point only influences its immediate neighbors → complex, wiggly boundary that can create isolated "islands." Risk: overfitting.
Think of it like this: with a tiny γ, every support vector is a lighthouse casting a broad beam. With a huge γ, each one is a flashlight illuminating only the point directly in front of it. The challenge is finding the γ that illuminates just enough structure without memorizing noise.
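The metaphor maps directly onto how fast the kernel value decays with distance; a quick sketch at a fixed distance of 2:

```python
import numpy as np

# RBF kernel value K = exp(-gamma * d^2) between a support vector and a
# point at distance d: influence fades far faster as gamma grows
d = 2.0
for gamma in [0.01, 1.0, 10.0]:
    print(f"gamma={gamma:>5}: K = {np.exp(-gamma * d ** 2):.6f}")
# gamma= 0.01: K = 0.960789
# gamma=  1.0: K = 0.018316
# gamma= 10.0: K = 0.000000
```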
### Try It: Kernel Visualization
See how different kernels create non-linear decision boundaries. The heatmap shows the SVM's decision function — blue regions predict class +1, red regions predict class -1, and the boundary is where the color transitions.
## SVM vs. Neural Networks vs. Gradient Boosting
We've now built three fundamentally different classification paradigms in this series. How do they compare in practice? Here's an honest head-to-head across the dimensions that matter:
| Dimension | SVM | Neural Net | Gradient Boosting |
|---|---|---|---|
| Training speed | O(n²-n³) — slow on large data | O(n) with SGD — scales well | O(n·trees) — scales well |
| Prediction speed | Depends on # support vectors | Fixed forward pass | Depends on # trees |
| Interpretability | Linear: weight vector | Opaque (black box) | Feature importance |
| Small datasets (<10K) | Excels | Struggles | Good |
| Large datasets (>100K) | Struggles | Excels | Excellent |
| Feature scaling needed? | Yes (essential) | Yes (helps a lot) | No (invariant) |
| Theoretical guarantees | Margin-based bounds | Weak generalization bounds | Training error bounds |
| High-dimensional data | Excellent (text, genomics) | Good with enough data | Good |
The practical takeaway: SVMs remain the best choice for small-to-medium datasets with clear margin structure, especially in high-dimensional spaces where the number of features exceeds the number of samples (text classification with bag-of-words, genomics with thousands of gene expression features). For tabular data at scale, gradient boosting usually wins. For images, text sequences, and other perceptual data, neural networks are unmatched.
SVMs also shine in one-class classification (anomaly detection) and when you want theoretical guarantees about generalization — margin-based bounds give provable limits on test error that few other algorithm families can offer.
## Conclusion
Support Vector Machines approach classification from a perspective that no other algorithm in our series shares: pure geometry. Instead of minimizing a prediction error, they maximize a margin. Instead of learning features, they leverage kernels to operate in spaces they never explicitly compute. And the result is an algorithm that's uniquely elegant — a handful of support vectors is all you need to define the boundary between classes.
We built three progressively powerful SVMs: a hard-margin classifier that finds the widest possible gap, a soft-margin variant that gracefully handles noise via the C parameter, and a kernelized version that draws curved boundaries in infinite-dimensional spaces using the kernel trick. Along the way, we saw how SVMs connect to ideas we've explored before — L2 regularization, hinge loss, gradient descent — while bringing something fundamentally new: the principle that the decision boundary should be defined by the hardest examples, not the easiest ones.
In the next post, we'll explore the other side of machine learning — what happens when you have no labels at all. K-Means Clustering from Scratch will take us into unsupervised learning, where algorithms discover structure on their own.
## References & Further Reading
- Vladimir Vapnik — The Nature of Statistical Learning Theory (1995) — the foundational text that introduced SVMs and statistical learning theory
- Corinna Cortes & Vladimir Vapnik — Support-Vector Networks (1995) — the original SVM paper introducing soft-margin classification
- Bernhard Schölkopf & Alexander Smola — Learning with Kernels (2002) — comprehensive treatment of kernel methods and SVMs
- Chih-Chung Chang & Chih-Jen Lin — LIBSVM (2011) — the most widely used SVM implementation, basis for scikit-learn's SVM
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning — chapters 4 and 12 cover SVMs in the context of classification and kernel methods
- scikit-learn SVM documentation — practical guide with excellent visual examples
Posts in this series: Regularization from Scratch (L2 as margin maximization), Loss Functions from Scratch (hinge loss), Backpropagation from Scratch (gradient-based optimization), Gradient Boosting from Scratch (decision boundary comparison), ML Evaluation from Scratch (cross-validation for hyperparameter selection)