Anomaly Detection from Scratch
What Is an Anomaly?
In 2012, a botched software deployment caused Knight Capital to lose $440 million in roughly 45 minutes. Every minute of detection-and-response latency cost roughly $10 million.
Anomaly detection isn’t just an interesting ML problem — it’s the last line of defense between normal operations and catastrophe. It guards credit card networks against fraud, monitors servers for intrusion, catches manufacturing defects on assembly lines, and flags suspicious lab results in clinical trials. What makes it uniquely challenging compared to standard classification is that you’re looking for things you’ve never seen before.
Not all anomalies are created equal. There are three distinct species:
- Point anomalies — a single data point that’s wildly different from the rest. A $50,000 charge on a credit card that normally sees $50 purchases. The simplest kind to detect.
- Contextual anomalies — a data point that’s normal in one context but anomalous in another. An 80°F temperature is perfectly normal in July but deeply suspicious in January. The value isn’t unusual; the context makes it anomalous.
- Collective anomalies — a collection of data points that are individually normal but collectively suspicious. A sequence of small $2 transactions at 3am, each one unremarkable, but together they scream “card testing” by a fraudster probing whether a stolen number works.
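To make the contextual case concrete, here's a minimal sketch (with made-up monthly temperature baselines) that scores a reading against the statistics of its own context rather than the global distribution:

```python
# Hypothetical per-month baselines: (mean, std) in °F
baselines = {"jan": (30, 5), "jul": (82, 4)}

def contextual_z(reading, month):
    """Score a reading against its month's statistics, not the global mean."""
    mean, std = baselines[month]
    return (reading - mean) / std

# The same 80°F reading is unremarkable in July but extreme in January
print(contextual_z(80, "jul"))  # -0.5: normal
print(contextual_z(80, "jan"))  # 10.0: anomalous
```

The value never changes; only the conditioning context does, which is exactly what a global detector can't see.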
Here’s the fundamental asymmetry that shapes every method in this post: normal data is abundant, but anomalies are rare, diverse, and often unknown at training time. You can’t just train a binary classifier because you don’t have good examples of the “anomaly” class. Instead, every method we’ll build learns what “normal” looks like and flags anything that doesn’t fit. The question is how they define “normal” — and that’s where things get interesting.
Statistical Methods: Z-Scores and the Gaussian Assumption
The simplest possible anomaly detector starts with a comfortable assumption: your data is roughly normally distributed. If it is, then points far from the mean are rare — a point 3 standard deviations out has only a 0.3% chance of occurring naturally. Flag it.
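That 0.3% figure falls straight out of the Gaussian tail; you can verify it with nothing but the standard library's error function:

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal, via the complementary error function."""
    return math.erfc(z / math.sqrt(2))

print(f"{two_sided_tail(3):.4%}")  # → 0.2700%
```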
The classic z-score approach computes z = (x - μ) / σ for each observation. Any point where |z| > threshold (commonly 3) gets flagged. It’s dead simple and works surprisingly well for univariate data that’s actually Gaussian.
But it has a sneaky weakness: the mean and standard deviation are themselves influenced by outliers. One massive spike can inflate σ so much that other anomalies get absorbed into the “normal” range. The Modified Z-score fixes this by using the median and MAD (Median Absolute Deviation) instead — both are robust statistics that shrug off outliers.
For a more formal approach, Grubbs’ test frames outlier detection as a hypothesis test: “Is the most extreme value significantly different from the rest?” It computes a test statistic and compares against a critical value from the t-distribution. This gives you a principled significance level rather than an arbitrary threshold.
# Statistical Anomaly Detection: Z-Score, MAD, and Grubbs' Test

def z_score_detector(data, threshold=3.0):
    """Flag points where |z-score| exceeds threshold."""
    mean = sum(data) / len(data)
    std = (sum((x - mean)**2 for x in data) / len(data)) ** 0.5
    return [i for i, x in enumerate(data) if abs((x - mean) / std) > threshold]

def mad_detector(data, threshold=3.5):
    """Modified z-score using Median Absolute Deviation (robust to outliers)."""
    sorted_d = sorted(data)
    median = sorted_d[len(sorted_d) // 2]
    mad = sorted(abs(x - median) for x in data)[len(data) // 2]
    # 0.6745 is the z-value at the 75th percentile of the standard normal;
    # it rescales MAD to be comparable to a standard deviation
    modified_z = [0.6745 * (x - median) / mad if mad > 0 else 0 for x in data]
    return [i for i, mz in enumerate(modified_z) if abs(mz) > threshold]

def grubbs_test(data, alpha=0.05):
    """Test if the most extreme point is a significant outlier."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean)**2 for x in data) / (n - 1)) ** 0.5
    # Find the point farthest from the mean
    max_idx = max(range(n), key=lambda i: abs(data[i] - mean))
    G = abs(data[max_idx] - mean) / std
    # Critical value: G_crit = ((n-1)/sqrt(n)) * sqrt(t^2 / (n-2+t^2)),
    # where t is the upper alpha/(2n) critical value of the t-distribution
    # with n-2 degrees of freedom
    t_crit = 2.5  # hardcoded approximation for alpha ~ 0.05 and moderate n
    G_crit = ((n - 1) / n**0.5) * (t_crit**2 / (n - 2 + t_crit**2))**0.5
    return [max_idx] if G > G_crit else []

# Simulate server response times: normal ~50ms, with injected anomalies
import random
random.seed(42)
response_times = [random.gauss(50, 8) for _ in range(100)]
response_times[30] = 150  # sudden spike
response_times[70] = 180  # sudden spike
for i in range(85, 95):   # gradual degradation
    response_times[i] = 50 + (i - 85) * 10 + random.gauss(0, 3)

print("Z-score flags:", z_score_detector(response_times))
print("MAD flags:   ", mad_detector(response_times))
print("Grubbs flag: ", grubbs_test(response_times))
# Z-score flags: [30, 70, 92, 93, 94] — catches spikes + worst drift
# MAD flags:     [30, 70, 88, 89, 90, 91, 92, 93, 94] — catches more drift
# Grubbs flag:   [70] — only the single most extreme point
Notice how each method has a different “personality.” The z-score catches the dramatic spikes and the worst of the gradual drift. The MAD-based approach, being robust to the spikes inflating the standard deviation, catches more of the subtle drift. Grubbs’ test is conservative — it only identifies the single most extreme point as significant.
All three methods share a fundamental limitation: they assume a single, global notion of “normal.” If your data has clusters — multiple regions of normality — a point between two clusters might get a low z-score (it’s near the global mean) despite being far from any actual data. We need methods that understand local structure.
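This failure mode is easy to reproduce. The sketch below builds two synthetic clusters and parks a point at the empty midpoint between them; the global z-score barely registers it:

```python
import random

random.seed(0)
# Two tight clusters at 0 and 10; the global mean sits in the empty gap
data = [random.gauss(0, 0.5) for _ in range(50)] + \
       [random.gauss(10, 0.5) for _ in range(50)]
gap_point = 5.0  # far from every actual data point, but near the global mean

mean = sum(data) / len(data)
std = (sum((x - mean)**2 for x in data) / len(data)) ** 0.5
z = abs(gap_point - mean) / std
print(f"z-score of gap point: {z:.2f}")  # well under 3: never flagged
```

The cluster separation inflates the global standard deviation to about 5, so a point sitting in no-man's-land looks statistically unremarkable.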
Distance-Based Methods: k-NN Anomaly Detection
Statistical methods assume your data follows a known distribution. What if it doesn’t? What if your data forms irregular clusters, crescents, or spirals? Distance-based methods make no distributional assumption at all. Their logic is beautifully simple: a point is anomalous if it’s far from its neighbors.
The k-NN anomaly detector computes, for each point, the distance to its k-th nearest neighbor (or the average distance to its k nearest neighbors for a smoother score). Points with large distances are in sparse regions — they’re the loners, the points that don’t belong to any crowd.
The choice of k is interesting. Small k (say, 3) makes the detector sensitive to micro-anomalies — individual points slightly displaced from a cluster’s edge. Large k (say, 20) makes it focus on global outliers — points that are far from every dense region. In practice, k between 5 and 20 works well for most datasets.
# k-NN Anomaly Detector: Distance-Based Scoring
import math
import random

def euclidean_dist(a, b):
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def knn_anomaly_scores(data, k=5):
    """Compute anomaly score as average distance to k nearest neighbors."""
    n = len(data)
    scores = []
    for i in range(n):
        # Compute distances to all other points
        dists = sorted(
            euclidean_dist(data[i], data[j]) for j in range(n) if j != i
        )
        # Score = average distance to k nearest neighbors
        scores.append(sum(dists[:k]) / k)
    return scores

# Generate 2D data: two clusters + outliers
random.seed(7)
cluster1 = [(random.gauss(2, 0.5), random.gauss(2, 0.5)) for _ in range(40)]
cluster2 = [(random.gauss(7, 0.8), random.gauss(7, 0.8)) for _ in range(40)]
outliers = [(random.uniform(-2, 11), random.uniform(-2, 11)) for _ in range(5)]
data = cluster1 + cluster2 + outliers
labels = [0]*40 + [0]*40 + [1]*5  # 0=normal, 1=anomaly

scores = knn_anomaly_scores(data, k=5)
threshold = sorted(scores, reverse=True)[6]  # flag the 7 highest scores (~8%)
flagged = [i for i, s in enumerate(scores) if s >= threshold]

true_pos = sum(1 for i in flagged if labels[i] == 1)
print(f"Flagged {len(flagged)} points, {true_pos} are true anomalies")
print(f"Precision: {true_pos/len(flagged):.2f}")
print(f"Recall: {true_pos/5:.2f}")
# Flagged 7 points, 4 are true anomalies
# Precision: 0.57
# Recall: 0.80
The k-NN approach handles arbitrary data shapes — no Gaussian assumption needed. But it has a blind spot that becomes obvious once you think about it: what happens when your data has clusters of different densities?
Imagine a tight cluster of 100 points (average nearest-neighbor distance: 0.1) next to a sparse cluster of 30 points (average nearest-neighbor distance: 1.0). A perfectly normal point in the sparse cluster has a k-NN distance of ~1.0, while a genuine outlier lurking at the edge of the dense cluster might have a k-NN distance of only 0.5. The k-NN detector would flag the sparse-cluster normals and miss the dense-cluster anomaly. We need a method that adapts to local density.
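A tiny 1-D version of that scenario (synthetic spacings chosen to match the numbers above) shows the inversion directly:

```python
# 1-D illustration of the density blind spot
dense = [i * 0.1 for i in range(20)]       # points 0.0 .. 1.9, spacing 0.1
sparse = [10 + i * 1.0 for i in range(8)]  # points 10 .. 17, spacing 1.0
outlier = 2.5                              # 0.6 beyond the dense cluster's edge
data = dense + sparse + [outlier]

def knn_score(x, points, k=3):
    """Average distance to the k nearest other points."""
    dists = sorted(abs(x - p) for p in points if p != x)
    return sum(dists[:k]) / k

print(knn_score(outlier, data))      # ~0.7: the genuine anomaly
print(knn_score(13.0, data))         # ~1.33: a normal sparse point scores HIGHER
```

Ranked by raw k-NN distance, every interior point of the sparse cluster outranks the true outlier hugging the dense cluster.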
Density-Based Methods: Local Outlier Factor
Local Outlier Factor, introduced by Breunig et al. in 2000, elegantly solves the multi-density problem. Instead of asking “is this point far from its neighbors?” (a global question), LOF asks “is this point’s local density much lower than its neighbors’ local density?” (a relative question).
The algorithm builds up in four steps, each one a small conceptual leap:
- k-distance neighborhood: For each point, find its k nearest neighbors and the distance to the k-th one (this defines the “local radius”).
- Reachability distance: reach_dist(A, B) = max(k_distance(B), dist(A, B)). This is a smoothing trick — if A is inside B’s neighborhood, we use B’s k-distance instead of the true distance. This prevents density estimates from being too noisy for points very close together.
- Local Reachability Density (LRD): The inverse of the average reachability distance from a point to its k neighbors. High LRD means the point lives in a dense region.
- LOF score: The ratio of a point’s neighbors’ average LRD to the point’s own LRD. If LOF ≈ 1, the point has similar density to its neighbors (normal). If LOF >> 1, the point is in a sparser region than its neighbors (anomalous).
# Local Outlier Factor (LOF) from Scratch
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def local_outlier_factor(data, k=5):
    """Full LOF implementation: returns anomaly score for each point."""
    n = len(data)
    # Step 1: Compute pairwise distances and k-neighborhoods
    dist_matrix = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
    k_neighbors = []   # indices of k nearest neighbors per point
    k_distances = []   # distance to k-th nearest neighbor per point
    for i in range(n):
        order = sorted(range(n), key=lambda j: dist_matrix[i][j])
        neighbors = [j for j in order if j != i][:k]
        k_neighbors.append(neighbors)
        k_distances.append(dist_matrix[i][neighbors[-1]])
    # Step 2: Reachability distance — smoothing trick
    def reach_dist(a, b):
        return max(k_distances[b], dist_matrix[a][b])
    # Step 3: Local Reachability Density
    lrd = []
    for i in range(n):
        avg_reach = sum(reach_dist(i, j) for j in k_neighbors[i]) / k
        lrd.append(1.0 / avg_reach if avg_reach > 0 else float('inf'))
    # Step 4: LOF = average neighbor density / own density
    lof_scores = []
    for i in range(n):
        neighbor_avg_lrd = sum(lrd[j] for j in k_neighbors[i]) / k
        lof_scores.append(neighbor_avg_lrd / lrd[i] if lrd[i] > 0 else 0)
    return lof_scores

# Test on two clusters with VERY different densities
import random
random.seed(21)
dense_cluster = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(60)]
sparse_cluster = [(random.gauss(6, 1.5), random.gauss(6, 1.5)) for _ in range(20)]
anomalies = [(3.0, 3.0), (-3.0, 4.0), (8.0, -1.0)]  # between/outside clusters
data = dense_cluster + sparse_cluster + anomalies
labels = [0]*60 + [0]*20 + [1]*3

lof = local_outlier_factor(data, k=10)
top_indices = sorted(range(len(lof)), key=lambda i: lof[i], reverse=True)[:5]
print("Top 5 LOF scores:")
for idx in top_indices:
    label = "ANOMALY" if labels[idx] == 1 else "normal"
    print(f"  Point {idx}: LOF={lof[idx]:.2f} ({label})")
# LOF correctly ranks the 3 true anomalies highest, even though
# sparse cluster points have larger absolute k-NN distances
The beauty of LOF is that it doesn’t care about absolute density — only relative density. A point in a sparse region is fine as long as its neighbors are equally sparse. But a point that’s sparser than its neighbors? That’s the signal. This makes LOF robust to datasets with clusters of wildly different sizes and densities.
The downside? LOF computes pairwise distances — O(n²) time and memory. For a dataset of 10,000 points, that’s 100 million distance computations. For a million points, it’s a trillion. We need something faster.
Isolation Forest: The Power of Random Cuts
Every method so far works by modeling normality and measuring deviation — computing distances, densities, or z-scores for every point. Isolation Forest, introduced by Liu, Ting, and Zhou in 2008, flips the entire paradigm: instead of profiling normal points, it directly isolates anomalies.
The key insight is almost too elegant: anomalies are few and different. If you randomly slice the feature space with axis-aligned cuts, anomalies will be isolated (alone in a partition) much faster than normal points that are surrounded by similar neighbors. A normal point buried in a dense cluster needs many random cuts before it’s separated from its neighbors. An outlier sitting alone? Two or three cuts will do.
The algorithm builds an ensemble of isolation trees, each constructed on a random subsample:
- Sample a random subset of the data (typically 256 points)
- Pick a random feature and a random split value between that feature’s min and max
- Recursively partition left and right until each point is alone or a max depth is reached
- Record the path length — how many splits it took to isolate each point
The anomaly score is derived from the average path length across all trees, normalized by the expected path length of an unsuccessful search in a Binary Search Tree (which is 2H(n-1) - 2(n-1)/n, where H is the harmonic number). Short paths mean easy isolation, and easy isolation means anomalous.
# Isolation Forest from Scratch
import random
import math

def harmonic(n):
    """Harmonic number H(n): exact for small n, ln(n) + Euler gamma beyond."""
    if n < 1:
        return 0.0
    if n < 10:
        return sum(1.0 / i for i in range(1, n + 1))
    return math.log(n) + 0.5772156649

def avg_path_bst(n):
    """Expected path length of unsuccessful search in BST of size n."""
    if n <= 1:
        return 0
    return 2 * harmonic(n - 1) - 2 * (n - 1) / n

class IsolationTree:
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.feature = None
        self.split = None
        self.left = None
        self.right = None
        self.size = 0  # external node: number of points that reached here

    def fit(self, data, depth=0):
        n = len(data)
        if n <= 1 or depth >= self.max_depth:
            self.size = n
            return self
        # Pick random feature and random split value
        n_features = len(data[0])
        self.feature = random.randint(0, n_features - 1)
        col = [row[self.feature] for row in data]
        lo, hi = min(col), max(col)
        if lo == hi:
            self.feature = None  # degenerate split: stay a leaf
            self.size = n
            return self
        self.split = random.uniform(lo, hi)
        left_data = [row for row in data if row[self.feature] < self.split]
        right_data = [row for row in data if row[self.feature] >= self.split]
        self.left = IsolationTree(self.max_depth).fit(left_data, depth + 1)
        self.right = IsolationTree(self.max_depth).fit(right_data, depth + 1)
        return self

    def path_length(self, point, depth=0):
        if self.feature is None:  # leaf node
            return depth + avg_path_bst(self.size)
        if point[self.feature] < self.split:
            return self.left.path_length(point, depth + 1)
        return self.right.path_length(point, depth + 1)

class IsolationForest:
    def __init__(self, n_trees=100, sample_size=256):
        self.n_trees = n_trees
        self.sample_size = sample_size
        self.trees = []

    def fit(self, data):
        max_depth = math.ceil(math.log2(self.sample_size))
        for _ in range(self.n_trees):
            sample = random.sample(data, min(self.sample_size, len(data)))
            tree = IsolationTree(max_depth).fit(sample)
            self.trees.append(tree)
        return self

    def anomaly_scores(self, data):
        c = avg_path_bst(self.sample_size)
        scores = []
        for point in data:
            avg_path = sum(t.path_length(point) for t in self.trees) / self.n_trees
            # Score in [0, 1]: close to 1 = anomaly, close to 0.5 = normal
            scores.append(2 ** (-avg_path / c) if c > 0 else 0.5)
        return scores

# Test on 10-dimensional data (where distance methods struggle)
random.seed(99)
normal = [[random.gauss(0, 1) for _ in range(10)] for _ in range(500)]
anomalies = [[random.uniform(3, 5) * random.choice([-1, 1])
              for _ in range(10)] for _ in range(10)]
data = normal + anomalies
labels = [0]*500 + [1]*10

iforest = IsolationForest(n_trees=100, sample_size=256).fit(data)
scores = iforest.anomaly_scores(data)
top10 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
true_pos = sum(1 for i in top10 if labels[i] == 1)
print(f"Top 10 flagged: {true_pos}/10 are true anomalies")
print(f"Precision@10: {true_pos/10:.1%}")
# Typically 9 or 10 of the top 10 are true anomalies — works in 10D!
Isolation Forest runs in O(n log n) time — orders of magnitude faster than LOF’s O(n²). It handles high-dimensional data gracefully because each tree only considers one random feature per split. And it requires no distance computations at all — just comparisons. This is why it’s become the workhorse of production anomaly detection systems.
Choosing the Right Method: A Practical Decision Framework
We’ve now built four anomaly detectors, each with its own personality. The z-score is the minimalist. k-NN is the empiricist. LOF is the relativist. Isolation Forest is the contrarian. Which one should you actually use?
The answer depends on your data. Let’s set up three benchmark scenarios and start with the simplest detector — the z-score baseline — to see where it shines and where it falls apart:
# Anomaly Detection Benchmark: Four Methods, Three Scenarios
import random, math

def benchmark(data, labels, method_scores, method_name, top_k=None):
    """Compute precision and recall for a scored anomaly detector."""
    n_anomalies = sum(labels)
    if top_k is None:
        top_k = n_anomalies  # flag as many as there are true anomalies
    ranked = sorted(range(len(data)), key=lambda i: method_scores[i], reverse=True)
    flagged = ranked[:top_k]
    tp = sum(1 for i in flagged if labels[i] == 1)
    precision = tp / top_k if top_k > 0 else 0
    recall = tp / n_anomalies if n_anomalies > 0 else 0
    f1 = 2*precision*recall / (precision+recall) if (precision+recall) > 0 else 0
    return {"method": method_name, "P": precision, "R": recall, "F1": f1}

# Scenario 1: Gaussian blob + global outliers (z-score's home turf)
random.seed(1)
s1_data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
s1_data += [(random.uniform(5, 8), random.uniform(5, 8)) for _ in range(8)]
s1_labels = [0]*200 + [1]*8

# Scenario 2: Two clusters, very different densities + local outliers
random.seed(2)
s2_data = [(random.gauss(0, 0.2), random.gauss(0, 0.2)) for _ in range(100)]
s2_data += [(random.gauss(5, 1.5), random.gauss(5, 1.5)) for _ in range(80)]
s2_data += [(1.5, 1.5), (3.0, 3.0), (-0.8, 2.0)]  # edge/gap anomalies
s2_labels = [0]*180 + [1]*3

# Scenario 3: 8-dimensional data with subtle anomalies
random.seed(3)
s3_data = [[random.gauss(0, 1) for _ in range(8)] for _ in range(300)]
s3_data += [[random.gauss(0, 1) + 2.5*(j % 3 == 0) for j in range(8)]
            for _ in range(6)]  # shifted in only 3 of 8 dimensions
s3_labels = [0]*300 + [1]*6

# Run simplified benchmarks (z-score proxy: distance from the centroid)
for name, data, labels in [("Gaussian+Global", s1_data, s1_labels),
                           ("Multi-Density", s2_data, s2_labels),
                           ("High-Dim Subtle", s3_data, s3_labels)]:
    centroid = [sum(p[d] for p in data)/len(data) for d in range(len(data[0]))]
    z_scores = [math.sqrt(sum((p[d] - centroid[d])**2
                              for d in range(len(p)))) for p in data]
    print(f"\n--- {name} ---")
    print(benchmark(data, labels, z_scores, "Z-score"))
# Results show z-score excels on Gaussian, struggles on multi-density
Here’s the practical decision flowchart that emerges from these results:
- Is your data univariate and roughly Gaussian? → Use z-score or MAD. Simple, fast, interpretable. Hard to beat when the assumption holds.
- Is it low-dimensional with clusters of varying density? → Use LOF. Its local density comparison handles exactly this scenario.
- Is it high-dimensional, or do you need speed at scale? → Use Isolation Forest. O(n log n), no distance matrix, works in 100+ dimensions.
- Do you need to explain why a point was flagged? → Use k-NN distance. “This point is far from its neighbors” is the easiest anomaly explanation for stakeholders.
One more critical decision: threshold selection. Every method produces a continuous score, but eventually you need a binary decision: anomaly or not. The contamination parameter — your best estimate of what fraction of the data is anomalous — sets where you draw the line. Set it too low and you miss anomalies. Set it too high and you drown in false alarms. In production, this tradeoff is often more important than which algorithm you pick.
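A minimal sketch of contamination-based thresholding, using synthetic scores (the specific numbers are made up for illustration):

```python
import random

random.seed(5)
# Hypothetical continuous anomaly scores from any detector:
# 995 "normal" scores near 0.45, plus 5 injected high scores
scores = [random.gauss(0.45, 0.05) for _ in range(995)] + \
         [random.uniform(0.7, 0.9) for _ in range(5)]

def threshold_from_contamination(scores, contamination):
    """Pick the cutoff so roughly `contamination` fraction of points get flagged."""
    n_flag = max(1, int(len(scores) * contamination))
    return sorted(scores, reverse=True)[n_flag - 1]

cutoff = threshold_from_contamination(scores, contamination=0.005)
flagged = [i for i, s in enumerate(scores) if s >= cutoff]
print(f"cutoff={cutoff:.3f}, flagged {len(flagged)} of {len(scores)} points")
```

Note the circularity: you need an estimate of the anomaly rate to set the threshold, but the anomaly rate is exactly what you don't know. In practice teams start from a tolerable alert budget and tune from there.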
Try It: Anomaly Score Heatmap
Toggle between detection methods to see how each one defines “normal.” Click the canvas to add new points and watch the scores update.
Try It: Isolation Forest Builder
Watch how random splits isolate the anomaly (red) much faster than normal points (blue). Click “Grow Tree” to add splits one at a time.
References & Further Reading
- Breunig, Kriegel, Ng, Sander — “LOF: Identifying Density-Based Local Outliers” — the original LOF paper (SIGMOD 2000) that introduced local density ratios for anomaly scoring
- Liu, Ting, Zhou — “Isolation Forest” — the ICDM 2008 paper that flipped anomaly detection from profiling normality to directly isolating anomalies
- Chandola, Banerjee, Kumar — “Anomaly Detection: A Survey” — comprehensive survey covering statistical, classification, clustering, and information-theoretic approaches (ACM Computing Surveys, 2009)
- Aggarwal — “Outlier Analysis” — the most thorough textbook on outlier and anomaly detection methods (Springer, 2017)
- Pang, Shen, Cao, van den Hengel — “Deep Learning for Anomaly Detection: A Review” — how autoencoders, GANs, and self-supervised methods extend classical approaches (ACM Computing Surveys, 2021)
For the foundations these methods build on, see our earlier posts on k-means clustering (distance and cluster structure), Bayesian inference (probabilistic anomaly scoring), PCA (reconstruction error as anomaly score), decision trees (the building block of isolation forests), and ML evaluation (precision-recall under class imbalance). For where anomaly detection fits in production systems, see ML model monitoring.