# Support Vector Machines from Scratch

## The Maximum Margin Idea
Imagine you have a 2D scatter plot with red dots and blue dots, and you need to draw a line separating them. There are infinitely many lines that work. So which one should you choose?
A neural network's answer depends on its random initialization and learning rate — run it twice and you might get two different boundaries. A decision tree draws axis-aligned rectangles that feel unnatural. But there's an algorithm that gives a principled, unique answer: choose the line that maximizes the margin — the widest possible gap between the decision boundary and the nearest data points from each class.
This is the core idea behind Support Vector Machines (SVMs), and it leads to one of the most elegant algorithms in machine learning.
The points closest to the boundary are called support vectors, and here's the remarkable part: they're the only data points that matter. You can move, add, or remove any point that isn't a support vector and the boundary won't budge. This makes SVMs naturally robust to outliers lurking in the interior of each class.
Mathematically, the margin is 2/||w|| where w is the weight vector defining the boundary. So maximizing the margin means minimizing ||w||². This gives us a constrained optimization problem:
Minimize ½||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for all data points i.
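To see the inverse relationship concretely, here is a tiny sketch (the weight vectors are made up for illustration, not taken from a trained model) showing how scaling w up shrinks the margin:

```python
import numpy as np

# Hypothetical weight vectors: the boundary is w . x + b = 0, the margin
# half-planes are w . x + b = +/-1, and the gap between them is 2 / ||w||
for w in [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]:
    print(f"||w|| = {np.linalg.norm(w):.3f} -> margin width = {2 / np.linalg.norm(w):.3f}")
# ||w|| = 1.414 -> margin width = 1.414
# ||w|| = 2.828 -> margin width = 0.707
# ||w|| = 5.657 -> margin width = 0.354
```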
Notice something familiar? Minimizing ||w||² is exactly L2 regularization — the same technique we explored in Regularization from Scratch. But while neural networks add regularization as an afterthought (a penalty term tacked onto the loss), SVMs have it baked into their DNA. The algorithm is regularization.
## Hard-Margin SVM — The Gradient Descent Approach
Let's build our first SVM from scratch. We'll use the hinge loss formulation, which converts the constrained optimization into an unconstrained one that we can solve with gradient descent — the same tool we used in Backpropagation from Scratch.
The hinge loss for a single data point is max(0, 1 - yᵢ(w·xᵢ + b)). If the point is on the correct side of the margin (the product yᵢ(w·xᵢ + b) ≥ 1), the loss is zero. If it's inside the margin or on the wrong side, the loss grows linearly with the violation. This is one of the loss functions that shaped the history of machine learning.
The total objective combines hinge loss with the regularization term:
L = ½||w||² + C · Σ max(0, 1 - yᵢ(w·xᵢ + b))
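As a quick sanity check on the hinge term, here is a minimal sketch evaluating it at three illustrative points (the boundary w = [1, 1], b = 0 is invented for this example):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Per-point hinge loss max(0, 1 - y * (w . x + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 2.0],      # far on the correct side -> zero loss
              [0.3, 0.3],      # correct side but inside the margin
              [-1.0, -1.0]])   # wrong side -> loss grows with the violation
y = np.array([1, 1, 1])
print(hinge_loss(w, b, X, y))  # → [0.  0.4 3. ]
```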
For a hard-margin SVM, we set C very large so that any margin violation is heavily penalized. Here's the implementation:
```python
import numpy as np

class HardMarginSVM:
    """SVM trained via sub-gradient descent on the hinge loss."""

    def __init__(self, C=1000.0, lr=0.001, epochs=1000):
        self.C = C            # penalty for margin violations
        self.lr = lr          # learning rate
        self.epochs = epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.epochs):
            for i in range(n_samples):
                margin = y[i] * (X[i] @ self.w + self.b)
                if margin < 1:
                    # point violates margin: hinge loss gradient
                    self.w -= self.lr * (self.w - self.C * y[i] * X[i])
                    self.b -= self.lr * (-self.C * y[i])
                else:
                    # point is safely classified: only regularization gradient
                    self.w -= self.lr * self.w

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

    def support_vectors(self, X, y):
        margins = y * self.decision_function(X)
        # support vectors lie near the margin boundary (margin ≈ 1)
        return X[margins <= 1.01]

# Generate linearly separable data
np.random.seed(42)
X_pos = np.random.randn(30, 2) + np.array([2, 2])
X_neg = np.random.randn(30, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1]*30 + [-1]*30)

svm = HardMarginSVM(C=1000, lr=0.0005, epochs=500)
svm.fit(X, y)
print(f"Weight vector: [{svm.w[0]:.3f}, {svm.w[1]:.3f}]")
print(f"Bias: {svm.b:.3f}")
print(f"Support vectors found: {len(svm.support_vectors(X, y))}")
# Weight vector: [1.247, 1.198]
# Bias: -0.034
# Support vectors found: 4
```
The sub-gradient is the key to understanding this code. The hinge loss isn't differentiable at exactly margin = 1 (it has a "kink"), so we use a sub-gradient: when the point violates the margin, the gradient pulls w toward correctly classifying that point; when the point is safe, the gradient just shrinks w (regularization). Over many epochs, the algorithm converges to the maximum-margin boundary.
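The piecewise rule inside `fit` can be written out as an explicit function; this is a sketch of the same sub-gradient computation, not a new algorithm:

```python
import numpy as np

def hinge_subgradient(w, b, x, y, C):
    """Sub-gradient of 0.5*||w||^2 + C*max(0, 1 - y*(w.x + b)) at one point."""
    if y * (x @ w + b) < 1:        # margin violated: hinge term is active
        return w - C * y * x, -C * y
    return w, 0.0                  # safe: only the regularization term remains

w, b = np.array([0.5, 0.5]), 0.0
grad_w, grad_b = hinge_subgradient(w, b, np.array([1.0, 0.0]), 1, C=10.0)
print(grad_w, grad_b)  # → [-9.5  0.5] -10.0
```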
## Soft-Margin SVM — Handling Real Data
Real-world datasets are almost never perfectly linearly separable. A few noisy points from one class might land on the wrong side, and a hard-margin SVM would contort its boundary trying to classify every single point correctly — or fail entirely if perfect separation is impossible.
The soft-margin SVM introduces slack variables ξᵢ that allow some points to violate the margin or even be misclassified. The C parameter controls this tradeoff:
- Large C: "I really care about classifying every point correctly" → narrow margin, risks overfitting to noise
- Small C: "I'd rather have a wide margin, even if some points are misclassified" → wide margin, risks underfitting
- Moderate C: The sweet spot where the SVM generalizes well
The beautiful thing is: the code barely changes. We already have the C parameter — we just need to set it to a reasonable value instead of cranking it to 1000.
```python
class SoftMarginSVM:
    """SVM with tunable C for the bias-variance tradeoff."""

    def __init__(self, C=1.0, lr=0.001, epochs=1000):
        self.C = C
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.epochs):
            for i in range(n_samples):
                margin = y[i] * (X[i] @ self.w + self.b)
                if margin < 1:
                    self.w -= self.lr * (self.w - self.C * y[i] * X[i])
                    self.b -= self.lr * (-self.C * y[i])
                else:
                    self.w -= self.lr * self.w

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

# Add noise: move some points close to the boundary
np.random.seed(42)
X_pos = np.random.randn(50, 2) + np.array([1.5, 1.5])
X_neg = np.random.randn(50, 2) + np.array([-1.5, -1.5])
X_noisy = np.vstack([X_pos, X_neg])
y_noisy = np.array([1]*50 + [-1]*50)

for C_val in [100.0, 1.0, 0.01]:
    svm = SoftMarginSVM(C=C_val, lr=0.0005, epochs=500)
    svm.fit(X_noisy, y_noisy)
    preds = svm.predict(X_noisy)
    acc = np.mean(preds == y_noisy) * 100
    margin_width = 2.0 / (np.linalg.norm(svm.w) + 1e-8)
    print(f"C={C_val:>6}: accuracy={acc:.0f}%, margin width={margin_width:.2f}")
# C= 100.0: accuracy=95%, margin width=0.87
# C=   1.0: accuracy=93%, margin width=1.42
# C=  0.01: accuracy=85%, margin width=3.61
```
See the tradeoff in action: C=100 achieves 95% training accuracy but with a razor-thin margin of 0.87. C=0.01 only hits 85% accuracy but enjoys a wide margin of 3.61. In practice, you'd use cross-validation to find the C that maximizes test accuracy — usually somewhere in between.
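The cross-validation loop itself is short to sketch. `kfold_indices` below is a hypothetical helper written for this post; the model-fitting step is left as a comment because any classifier with fit/predict methods slots in:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

splits = list(kfold_indices(n=100, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 5 80 20

for train_idx, val_idx in splits:
    pass  # fit on train_idx, score accuracy on val_idx, average per C candidate
```

The C value with the highest average validation accuracy across the folds is the one you would keep.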
### Try It: Margin Explorer
Click to add points (left-click = blue class, right-click = red class). Drag existing points to move them. Watch how only support vectors affect the boundary.
## The Kernel Trick — Infinite Dimensions for Free
Everything so far has been about linear boundaries — straight lines in 2D, flat planes in 3D. But what about data that can't be separated by any line?
Picture two concentric circles: the inner circle is class A, the outer ring is class B. No straight line can separate them. But here's a beautiful insight: if you add a third dimension z = x² + y², the circles "lift" apart vertically. In this 3D space, a flat plane easily separates the two classes — and when you project that plane back to 2D, it becomes a circle.
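A few lines verify the lifting idea (a sketch with noise-free rings, so the vertical gap is exact):

```python
import numpy as np

# Two concentric rings: radius 1 (class A) vs radius 3 (class B)
theta = np.linspace(0, 2 * np.pi, 50)
inner = np.column_stack([np.cos(theta), np.sin(theta)]) * 1.0
outer = np.column_stack([np.cos(theta), np.sin(theta)]) * 3.0

# The lifted coordinate z = x^2 + y^2 separates them with a flat threshold
z_inner = (inner ** 2).sum(axis=1)   # all 1.0
z_outer = (outer ** 2).sum(axis=1)   # all 9.0
print(z_inner.max() < 5.0 < z_outer.min())  # → True
```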
The problem with explicit feature mapping is computation. If you want polynomial features of degree d in p dimensions, the feature space has O(p^d) dimensions. For degree 5 with 100 features, that's 10 billion dimensions. And the RBF kernel? Its implicit feature space is infinite-dimensional.
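To put numbers on that growth (a rough count: p^d is the loose figure quoted above, while the exact dimension of the space of monomials of degree at most d in p variables is C(p + d, d), still astronomically large):

```python
from math import comb

p, d = 100, 5
print(p ** d)          # loose p^d count → 10000000000
print(comb(p + d, d))  # exact monomial count → 96560646
```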
The kernel trick solves this by exploiting a key property of SVMs: the decision function only depends on dot products between data points. Instead of computing the high-dimensional features φ(x) and then taking dot products, we directly compute K(x, x') = φ(x) · φ(x') using a kernel function — which operates entirely in the original feature space.
Three kernels, three different implicit feature spaces:
- Linear: K(x, x') = x · x' — no transformation (standard SVM)
- Polynomial: K(x, x') = (x · x' + 1)^d — all polynomial terms up to degree d
- RBF (Gaussian): K(x, x') = exp(-γ||x - x'||²) — infinite-dimensional feature space, computed with a single exponential!
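For the degree-2 polynomial kernel in 2-D, the implicit feature map is still small enough to write out by hand, so the trick can be checked numerically (`phi` below is the standard explicit map for this kernel, reconstructed here as an illustration):

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product equals (x . x' + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(a) @ phi(b)           # dot product in the 6-D feature space
kernel = (a @ b + 1.0) ** 2          # computed entirely in the original 2-D
print(np.isclose(explicit, kernel))  # → True
```

The kernel evaluates one dot product and one square; the explicit route had to build two 6-D vectors first, and that gap explodes as the degree and dimension grow.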
```python
class KernelSVM:
    """SVM with kernel trick using simplified SMO-style coordinate descent."""

    def __init__(self, kernel='rbf', C=1.0, gamma=1.0, degree=3, epochs=200):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        self.degree = degree
        self.epochs = epochs

    def _kernel(self, x1, x2):
        if self.kernel == 'linear':
            return x1 @ x2.T
        elif self.kernel == 'poly':
            return (x1 @ x2.T + 1) ** self.degree
        elif self.kernel == 'rbf':
            # ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 * x1 . x2
            sq1 = np.sum(x1 ** 2, axis=1, keepdims=True)
            sq2 = np.sum(x2 ** 2, axis=1, keepdims=True)
            dist_sq = sq1 + sq2.T - 2 * (x1 @ x2.T)
            return np.exp(-self.gamma * dist_sq)

    def fit(self, X, y):
        n = len(y)
        self.X_train = X
        self.y_train = y
        self.alphas = np.zeros(n)
        self.b = 0.0
        K = self._kernel(X, X)
        for epoch in range(self.epochs):
            for i in range(n):
                # decision value for point i
                f_i = np.sum(self.alphas * y * K[i]) + self.b
                error_i = f_i - y[i]
                # KKT violation check
                if (y[i] * f_i < 1 and self.alphas[i] < self.C) or \
                   (y[i] * f_i > 1 and self.alphas[i] > 0):
                    # pick a random j != i
                    j = i
                    while j == i:
                        j = np.random.randint(n)
                    f_j = np.sum(self.alphas * y * K[j]) + self.b
                    error_j = f_j - y[j]
                    eta = K[i, i] + K[j, j] - 2 * K[i, j]
                    if eta <= 0:
                        continue
                    alpha_j_old = self.alphas[j]
                    alpha_i_old = self.alphas[i]
                    # compute bounds
                    if y[i] != y[j]:
                        L = max(0, self.alphas[j] - self.alphas[i])
                        H = min(self.C, self.C + self.alphas[j] - self.alphas[i])
                    else:
                        L = max(0, self.alphas[i] + self.alphas[j] - self.C)
                        H = min(self.C, self.alphas[i] + self.alphas[j])
                    if L == H:
                        continue
                    # update alpha_j
                    self.alphas[j] += y[j] * (error_i - error_j) / eta
                    self.alphas[j] = np.clip(self.alphas[j], L, H)
                    # update alpha_i
                    self.alphas[i] += y[i] * y[j] * (alpha_j_old - self.alphas[j])
                    # update bias
                    b1 = self.b - error_i - y[i] * (self.alphas[i] - alpha_i_old) * K[i, i] \
                         - y[j] * (self.alphas[j] - alpha_j_old) * K[i, j]
                    b2 = self.b - error_j - y[i] * (self.alphas[i] - alpha_i_old) * K[i, j] \
                         - y[j] * (self.alphas[j] - alpha_j_old) * K[j, j]
                    if 0 < self.alphas[i] < self.C:
                        self.b = b1
                    elif 0 < self.alphas[j] < self.C:
                        self.b = b2
                    else:
                        self.b = (b1 + b2) / 2

    def decision_function(self, X):
        K = self._kernel(X, self.X_train)
        return K @ (self.alphas * self.y_train) + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))

# Concentric circles dataset
np.random.seed(42)
n_per_class = 60
angles = np.random.uniform(0, 2 * np.pi, n_per_class)
r_inner = 1.0 + np.random.randn(n_per_class) * 0.15
r_outer = 3.0 + np.random.randn(n_per_class) * 0.3
X_inner = np.column_stack([r_inner * np.cos(angles), r_inner * np.sin(angles)])
X_outer = np.column_stack([r_outer * np.cos(angles), r_outer * np.sin(angles)])
X_circles = np.vstack([X_inner, X_outer])
y_circles = np.array([1]*n_per_class + [-1]*n_per_class)

for kernel_name in ['linear', 'poly', 'rbf']:
    svm = KernelSVM(kernel=kernel_name, C=10.0, gamma=1.0, degree=2, epochs=100)
    svm.fit(X_circles, y_circles)
    preds = svm.predict(X_circles)
    acc = np.mean(preds == y_circles) * 100
    sv_count = np.sum(svm.alphas > 1e-5)
    print(f"{kernel_name:>6} kernel: accuracy={acc:.0f}%, support vectors={sv_count}")
# linear kernel: accuracy=53%, support vectors=107
#   poly kernel: accuracy=100%, support vectors=16
#    rbf kernel: accuracy=100%, support vectors=18
```
The results tell a vivid story. The linear kernel is hopeless — 53% accuracy is barely better than random chance because no line can separate concentric circles. The polynomial kernel (degree 2) maps each point to a space that includes x² and y² terms, where the circles become separable — 100% accuracy with just 16 support vectors. The RBF kernel also achieves perfect separation using 18 support vectors, and its implicit feature space is infinite-dimensional — yet we never computed a single infinite-dimensional vector.
This is the magic of the kernel trick: you get the power of infinite dimensions for the cost of computing a single exponential.
### The RBF γ Parameter
The RBF kernel has its own knob to turn: γ (gamma). It controls how much influence each training point has:
- Small γ (wide Gaussian): each point influences a large area → smooth, simple boundary. Risk: underfitting.
- Large γ (narrow Gaussian): each point only influences its immediate neighbors → complex, wiggly boundary that can create isolated "islands." Risk: overfitting.
Think of it like this: with a tiny γ, every support vector is a lighthouse casting a broad beam. With a huge γ, each one is a flashlight illuminating only the point directly in front of it. The challenge is finding the γ that illuminates just enough structure without memorizing noise.
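The metaphor maps directly onto how fast the kernel value decays with distance; a quick sketch at a fixed distance of 2:

```python
import numpy as np

# RBF kernel value K = exp(-gamma * d^2) between a support vector and a
# point at distance d: influence fades far faster as gamma grows
d = 2.0
for gamma in [0.01, 1.0, 10.0]:
    print(f"gamma={gamma:>5}: K = {np.exp(-gamma * d ** 2):.6f}")
# gamma= 0.01: K = 0.960789
# gamma=  1.0: K = 0.018316
# gamma= 10.0: K = 0.000000
```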
### Try It: Kernel Visualization
See how different kernels create non-linear decision boundaries. The heatmap shows the SVM's decision function — blue regions predict class +1, red regions predict class -1, and the boundary is where the color transitions.
## SVM vs. Neural Networks vs. Gradient Boosting
We've now built three fundamentally different classification paradigms in this series. How do they compare in practice? Here's an honest head-to-head across the dimensions that matter:
| Dimension | SVM | Neural Net | Gradient Boosting |
|---|---|---|---|
| Training speed | O(n²-n³) — slow on large data | O(n) with SGD — scales well | O(n·trees) — scales well |
| Prediction speed | Depends on # support vectors | Fixed forward pass | Depends on # trees |
| Interpretability | Linear: weight vector | Opaque (black box) | Feature importance |
| Small datasets (<10K) | Excels | Struggles | Good |
| Large datasets (>100K) | Struggles | Excels | Excellent |
| Feature scaling needed? | Yes (essential) | Yes (helps a lot) | No (invariant) |
| Theoretical guarantees | Margin-based bounds | Weak generalization bounds | Training error bounds |
| High-dimensional data | Excellent (text, genomics) | Good with enough data | Good |
The practical takeaway: SVMs remain the best choice for small-to-medium datasets with clear margin structure, especially in high-dimensional spaces where the number of features exceeds the number of samples (text classification with bag-of-words, genomics with thousands of gene expression features). For tabular data at scale, gradient boosting usually wins. For images, text sequences, and other perceptual data, neural networks are unmatched.
SVMs also shine in one-class classification (anomaly detection) and when you want theoretical guarantees about generalization — margin-based bounds give provable limits on test error that few other algorithm families can offer.
## Conclusion
Support Vector Machines approach classification from a perspective that no other algorithm in our series shares: pure geometry. Instead of minimizing a prediction error, they maximize a margin. Instead of learning features, they leverage kernels to operate in spaces they never explicitly compute. And the result is an algorithm that's uniquely elegant — a handful of support vectors is all you need to define the boundary between classes.
We built three progressively powerful SVMs: a hard-margin classifier that finds the widest possible gap, a soft-margin variant that gracefully handles noise via the C parameter, and a kernelized version that draws curved boundaries in infinite-dimensional spaces using the kernel trick. Along the way, we saw how SVMs connect to ideas we've explored before — L2 regularization, hinge loss, gradient descent — while bringing something fundamentally new: the principle that the decision boundary should be defined by the hardest examples, not the easiest ones.
In the next post, we'll explore the other side of machine learning — what happens when you have no labels at all. K-Means Clustering from Scratch will take us into unsupervised learning, where algorithms discover structure on their own.
## References & Further Reading
- Vladimir Vapnik — The Nature of Statistical Learning Theory (1995) — the foundational text that introduced SVMs and statistical learning theory
- Corinna Cortes & Vladimir Vapnik — Support-Vector Networks (1995) — the original SVM paper introducing soft-margin classification
- Bernhard Schölkopf & Alexander Smola — Learning with Kernels (2002) — comprehensive treatment of kernel methods and SVMs
- Chih-Chung Chang & Chih-Jen Lin — LIBSVM (2011) — the most widely used SVM implementation, basis for scikit-learn's SVM
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning — chapters 4 and 12 cover SVMs in the context of classification and kernel methods
- scikit-learn SVM documentation — practical guide with excellent visual examples
Posts in this series: Regularization from Scratch (L2 as margin maximization), Loss Functions from Scratch (hinge loss), Backpropagation from Scratch (gradient-based optimization), Gradient Boosting from Scratch (decision boundary comparison), ML Evaluation from Scratch (cross-validation for hyperparameter selection)