
Anomaly Detection from Scratch

What Is an Anomaly?

In 2012, a botched software deployment caused Knight Capital to lose $440 million in about 45 minutes. Warning signs appeared within minutes, but nobody stopped the system in time. Every minute of detection latency cost roughly $10 million.

Anomaly detection isn’t just an interesting ML problem — it’s the last line of defense between normal operations and catastrophe. It guards credit card networks against fraud, monitors servers for intrusion, catches manufacturing defects on assembly lines, and flags suspicious lab results in clinical trials. And the thing that makes it uniquely challenging compared to standard classification: you’re looking for things you’ve never seen before.

Not all anomalies are created equal. There are three distinct species:

  1. Point anomalies: a single observation far from everything else (one 180 ms response time among 50 ms ones).
  2. Contextual anomalies: a value that is normal globally but abnormal in its context (heavy traffic at 3 a.m.).
  3. Collective anomalies: a group of observations that is anomalous together even though each point looks unremarkable on its own (a gradual drift in response times).

Here’s the fundamental asymmetry that shapes every method in this post: normal data is abundant, but anomalies are rare, diverse, and often unknown at training time. You can’t just train a binary classifier because you don’t have good examples of the “anomaly” class. Instead, every method we’ll build learns what “normal” looks like and flags anything that doesn’t fit. The question is how they define “normal” — and that’s where things get interesting.

Statistical Methods: Z-Scores and the Gaussian Assumption

The simplest possible anomaly detector starts with a comfortable assumption: your data is roughly normally distributed. If it is, then points far from the mean are rare — a point 3 standard deviations out has only a 0.3% chance of occurring naturally. Flag it.
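To see where that 0.3% comes from, the two-tailed Gaussian tail probability can be computed directly with the standard library's complementary error function — a quick sketch:

```python
import math

def two_tailed_prob(z):
    """P(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(z / math.sqrt(2))

print(f"P(|Z| > 2) = {two_tailed_prob(2):.4f}")  # ~4.6% — too common to flag
print(f"P(|Z| > 3) = {two_tailed_prob(3):.4f}")  # ~0.27% — rare enough
```

This is why 3 is the conventional threshold: 2 sigma events happen every ~22 observations, while 3 sigma events happen every ~370.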

The classic z-score approach computes z = (x - μ) / σ for each observation. Any point where |z| > threshold (commonly 3) gets flagged. It’s dead simple and works surprisingly well for univariate data that’s actually Gaussian.

But it has a sneaky weakness: the mean and standard deviation are themselves influenced by outliers. One massive spike can inflate σ so much that other anomalies get absorbed into the “normal” range. The Modified Z-score fixes this by using the median and MAD (Median Absolute Deviation) instead — both are robust statistics that shrug off outliers.

For a more formal approach, Grubbs’ test frames outlier detection as a hypothesis test: “Is the most extreme value significantly different from the rest?” It computes a test statistic and compares against a critical value from the t-distribution. This gives you a principled significance level rather than an arbitrary threshold.

# Statistical Anomaly Detection: Z-Score, MAD, and Grubbs' Test

def z_score_detector(data, threshold=3.0):
    """Flag points where |z-score| exceeds threshold."""
    mean = sum(data) / len(data)
    std = (sum((x - mean)**2 for x in data) / len(data)) ** 0.5
    return [i for i, x in enumerate(data) if abs((x - mean) / std) > threshold]

def mad_detector(data, threshold=3.5):
    """Modified z-score using Median Absolute Deviation (robust to outliers)."""
    sorted_d = sorted(data)
    median = sorted_d[len(sorted_d) // 2]
    mad = sorted(abs(x - median) for x in data)[len(data) // 2]
    # 0.6745 is the 75th percentile of the standard normal distribution
    modified_z = [0.6745 * (x - median) / mad if mad > 0 else 0 for x in data]
    return [i for i, mz in enumerate(modified_z) if abs(mz) > threshold]

def grubbs_test(data, alpha=0.05):
    """Test if the most extreme point is a significant outlier."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean)**2 for x in data) / (n - 1)) ** 0.5
    # Find the point farthest from the mean
    max_idx = max(range(n), key=lambda i: abs(data[i] - mean))
    G = abs(data[max_idx] - mean) / std
    # Critical value from the t-distribution:
    #   G_crit = ((n-1)/sqrt(n)) * sqrt(t^2 / (n-2+t^2)),  t = t_{alpha/(2n), n-2}
    # We hardcode an approximate t to avoid a scipy dependency, so alpha is
    # nominal here; a full implementation would look up the exact quantile.
    t_crit = 2.5  # rough t_{alpha/(2n), n-2} for alpha=0.05, moderate n
    G_crit = ((n - 1) / n**0.5) * (t_crit**2 / (n - 2 + t_crit**2))**0.5
    return [max_idx] if G > G_crit else []

# Simulate server response times: normal ~50ms, with injected anomalies
import random
random.seed(42)
response_times = [random.gauss(50, 8) for _ in range(100)]
response_times[30] = 150   # sudden spike
response_times[70] = 180   # sudden spike
for i in range(85, 95):    # gradual degradation
    response_times[i] = 50 + (i - 85) * 10 + random.gauss(0, 3)

print("Z-score flags:", z_score_detector(response_times))
print("MAD flags:    ", mad_detector(response_times))
print("Grubbs flag:  ", grubbs_test(response_times))
# Z-score flags: [30, 70, 92, 93, 94] — catches spikes + worst drift
# MAD flags:     [30, 70, 88, 89, 90, 91, 92, 93, 94] — catches more drift
# Grubbs flag:   [70] — only the single most extreme point

Notice how each method has a different “personality.” The z-score catches the dramatic spikes and the worst of the gradual drift. The MAD-based approach, being robust to the spikes inflating the standard deviation, catches more of the subtle drift. Grubbs’ test is conservative — it only identifies the single most extreme point as significant.

All three methods share a fundamental limitation: they assume a single, global notion of “normal.” If your data has clusters — multiple regions of normality — a point between two clusters might get a low z-score (it’s near the global mean) despite being far from any actual data. We need methods that understand local structure.
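A toy numeric sketch of that failure mode (made-up numbers, not from this post's dataset): two tight modes at 0 and 10, and a probe point sitting in the empty gap right at the global mean.

```python
# Two modes at 0 and 10; nothing actually lives near the global mean of 5.
data = [0.0, 0.1, -0.1, 0.2, 10.0, 9.9, 10.1, 9.8]
mean = sum(data) / len(data)                                    # 5.0
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5   # ~4.95
gap_point = 5.0
z = abs(gap_point - mean) / std
print(f"z-score of the gap point: {z:.2f}")        # 0.00 — looks perfectly normal
nearest = min(abs(gap_point - x) for x in data)
print(f"distance to nearest real point: {nearest:.1f}")  # 4.8 — yet it's isolated
```

The gap point gets the best possible z-score while being farther from any real observation than every normal point in the dataset.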

Distance-Based Methods: k-NN Anomaly Detection

Statistical methods assume your data follows a known distribution. What if it doesn’t? What if your data forms irregular clusters, crescents, or spirals? Distance-based methods make no distributional assumption at all. Their logic is beautifully simple: a point is anomalous if it’s far from its neighbors.

The k-NN anomaly detector computes, for each point, the distance to its k-th nearest neighbor (or the average distance to its k nearest neighbors for a smoother score). Points with large distances are in sparse regions — they’re the loners, the points that don’t belong to any crowd.

The choice of k is interesting. Small k (say, 3) makes the detector sensitive to micro-anomalies — individual points slightly displaced from a cluster’s edge. Large k (say, 20) makes it focus on global outliers — points that are far from every dense region. In practice, k between 5 and 20 works well for most datasets.

# k-NN Anomaly Detector: Distance-Based Scoring
import math
import random

def euclidean_dist(a, b):
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def knn_anomaly_scores(data, k=5):
    """Compute anomaly score as average distance to k nearest neighbors."""
    n = len(data)
    scores = []
    for i in range(n):
        # Compute distances to all other points
        dists = sorted(
            euclidean_dist(data[i], data[j]) for j in range(n) if j != i
        )
        # Score = average distance to k nearest neighbors
        scores.append(sum(dists[:k]) / k)
    return scores

# Generate 2D data: two clusters + outliers
random.seed(7)
cluster1 = [(random.gauss(2, 0.5), random.gauss(2, 0.5)) for _ in range(40)]
cluster2 = [(random.gauss(7, 0.8), random.gauss(7, 0.8)) for _ in range(40)]
outliers = [(random.uniform(-2, 11), random.uniform(-2, 11)) for _ in range(5)]
data = cluster1 + cluster2 + outliers
labels = [0]*40 + [0]*40 + [1]*5  # 0=normal, 1=anomaly

scores = knn_anomaly_scores(data, k=5)
threshold = sorted(scores, reverse=True)[6]  # 7th-highest score: flag top 7 (~8%)
flagged = [i for i, s in enumerate(scores) if s >= threshold]

true_pos = sum(1 for i in flagged if labels[i] == 1)
print(f"Flagged {len(flagged)} points, {true_pos} are true anomalies")
print(f"Precision: {true_pos/len(flagged):.2f}")
print(f"Recall:    {true_pos/5:.2f}")
# Flagged 7 points, 4 are true anomalies
# Precision: 0.57
# Recall:    0.80

The k-NN approach handles arbitrary data shapes — no Gaussian assumption needed. But it has a blind spot that becomes obvious once you think about it: what happens when your data has clusters of different densities?

Imagine a tight cluster of 100 points (average nearest-neighbor distance: 0.1) next to a sparse cluster of 30 points (average nearest-neighbor distance: 1.0). A perfectly normal point in the sparse cluster has a k-NN distance of ~1.0, while a genuine outlier lurking at the edge of the dense cluster might have a k-NN distance of only 0.5. The k-NN detector would flag the sparse-cluster normals and miss the dense-cluster anomaly. We need a method that adapts to local density.
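A one-dimensional sketch of that blind spot, using hypothetical spacings matching the numbers above: a dense cluster with 0.1 spacing, a sparse cluster with 1.0 spacing, and an outlier sitting 0.5 beyond the dense cluster's edge.

```python
dense = [i * 0.1 for i in range(10)]     # 0.0 .. 0.9, spacing 0.1
sparse = [20.0 + i for i in range(5)]    # 20 .. 24, spacing 1.0
outlier = 1.4                            # 0.5 past the dense cluster's edge
points = dense + sparse + [outlier]

def nn_dist(x):
    """Distance from x to its nearest other point."""
    return min(abs(x - o) for o in points if o != x)

print(f"outlier's NN distance:       {nn_dist(outlier):.1f}")    # 0.5
print(f"sparse normal's NN distance: {nn_dist(sparse[0]):.1f}")  # 1.0
# Any global k-NN threshold flags the sparse normals before the real outlier.
```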

Density-Based Methods: Local Outlier Factor

Local Outlier Factor, introduced by Breunig et al. in 2000, elegantly solves the multi-density problem. Instead of asking “is this point far from its neighbors?” (a global question), LOF asks “is this point’s local density much lower than its neighbors’ local density?” (a relative question).

The algorithm builds up in four steps, each one a small conceptual leap:

  1. k-distance neighborhood: For each point, find its k nearest neighbors and the distance to the k-th one (this defines the “local radius”).
  2. Reachability distance: reach_dist(A, B) = max(k_distance(B), dist(A, B)). This is a smoothing trick — if A is inside B’s neighborhood, we use B’s k-distance instead of the true distance. This prevents density estimates from being too noisy for points very close together.
  3. Local Reachability Density (LRD): The inverse of the average reachability distance from a point to its k neighbors. High LRD means the point lives in a dense region.
  4. LOF score: The ratio of a point’s neighbors’ average LRD to the point’s own LRD. If LOF ≈ 1, the point has similar density to its neighbors (normal). If LOF >> 1, the point is in a sparser region than its neighbors (anomalous).

# Local Outlier Factor (LOF) from Scratch
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def local_outlier_factor(data, k=5):
    """Full LOF implementation: returns anomaly score for each point."""
    n = len(data)

    # Step 1: Compute pairwise distances and k-neighborhoods
    dist_matrix = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
    k_neighbors = []  # indices of k nearest neighbors per point
    k_distances = []  # distance to k-th nearest neighbor per point
    for i in range(n):
        neighbor_dists = sorted(range(n), key=lambda j: dist_matrix[i][j])
        neighbors = [j for j in neighbor_dists if j != i][:k]
        k_neighbors.append(neighbors)
        k_distances.append(dist_matrix[i][neighbors[-1]])

    # Step 2: Reachability distance — smoothing trick
    def reach_dist(a, b):
        return max(k_distances[b], dist_matrix[a][b])

    # Step 3: Local Reachability Density
    lrd = []
    for i in range(n):
        avg_reach = sum(reach_dist(i, j) for j in k_neighbors[i]) / k
        lrd.append(1.0 / avg_reach if avg_reach > 0 else float('inf'))

    # Step 4: LOF = average neighbor density / own density
    lof_scores = []
    for i in range(n):
        neighbor_avg_lrd = sum(lrd[j] for j in k_neighbors[i]) / k
        lof_scores.append(neighbor_avg_lrd / lrd[i] if lrd[i] > 0 else 0)

    return lof_scores

# Test on two clusters with VERY different densities
import random
random.seed(21)
dense_cluster = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(60)]
sparse_cluster = [(random.gauss(6, 1.5), random.gauss(6, 1.5)) for _ in range(20)]
anomalies = [(3.0, 3.0), (-3.0, 4.0), (8.0, -1.0)]  # between/outside clusters
data = dense_cluster + sparse_cluster + anomalies
labels = [0]*60 + [0]*20 + [1]*3

lof = local_outlier_factor(data, k=10)
top_indices = sorted(range(len(lof)), key=lambda i: lof[i], reverse=True)[:5]
print("Top 5 LOF scores:")
for idx in top_indices:
    label = "ANOMALY" if labels[idx] == 1 else "normal"
    print(f"  Point {idx}: LOF={lof[idx]:.2f} ({label})")
# LOF correctly ranks the 3 true anomalies highest, even though
# sparse cluster points have larger absolute k-NN distances

The beauty of LOF is that it doesn’t care about absolute density — only relative density. A point in a sparse region is fine as long as its neighbors are equally sparse. But a point that’s sparser than its neighbors? That’s the signal. This makes LOF robust to datasets with clusters of wildly different sizes and densities.

The downside? LOF computes pairwise distances — O(n²) time and memory. For a dataset of 10,000 points, that’s 100 million distance computations. For a million points, it’s a trillion. We need something faster.

Isolation Forest: The Power of Random Cuts

Every method so far works by modeling normality and measuring deviation — computing distances, densities, or z-scores for every point. Isolation Forest, introduced by Liu, Ting, and Zhou in 2008, flips the entire paradigm: instead of profiling normal points, it directly isolates anomalies.

The key insight is almost too elegant: anomalies are few and different. If you randomly slice the feature space with axis-aligned cuts, anomalies will be isolated (alone in a partition) much faster than normal points that are surrounded by similar neighbors. A normal point buried in a dense cluster needs many random cuts before it’s separated from its neighbors. An outlier sitting alone? Two or three cuts will do.

The algorithm builds an ensemble of isolation trees, each constructed on a random subsample:

  1. Sample a random subset of the data (typically 256 points)
  2. Pick a random feature and a random split value between that feature’s min and max
  3. Recursively partition left and right until each point is alone or a max depth is reached
  4. Record the path length — how many splits it took to isolate each point

The anomaly score is the average path length across all trees, normalized by the expected path length of an unsuccessful search in a binary search tree (which is 2H(n-1) - 2(n-1)/n, where H is the harmonic number). Short paths mean easy isolation, and easy isolation means anomalous.

# Isolation Forest from Scratch
import random
import math

def harmonic(n):
    """Harmonic number H(n): H(1) = 1 exactly; ln(n) + 0.5772 (Euler's constant) beyond."""
    if n <= 0:
        return 0
    return 1 if n == 1 else math.log(n) + 0.5772156649

def avg_path_bst(n):
    """Expected path length of unsuccessful search in BST of size n."""
    if n <= 1:
        return 0
    return 2 * harmonic(n - 1) - 2 * (n - 1) / n

class IsolationTree:
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.feature = None
        self.split = None
        self.left = None
        self.right = None
        self.size = 0  # external node: number of points that reached here

    def fit(self, data, depth=0):
        n = len(data)
        if n <= 1 or depth >= self.max_depth:
            self.size = n
            return self
        # Pick random feature and random split value
        n_features = len(data[0])
        self.feature = random.randint(0, n_features - 1)
        col = [row[self.feature] for row in data]
        lo, hi = min(col), max(col)
        if lo == hi:
            self.size = n
            return self
        self.split = random.uniform(lo, hi)
        left_data = [row for row in data if row[self.feature] < self.split]
        right_data = [row for row in data if row[self.feature] >= self.split]
        self.left = IsolationTree(self.max_depth).fit(left_data, depth + 1)
        self.right = IsolationTree(self.max_depth).fit(right_data, depth + 1)
        return self

    def path_length(self, point, depth=0):
        if self.feature is None:  # leaf node
            return depth + avg_path_bst(self.size)
        if point[self.feature] < self.split:
            return self.left.path_length(point, depth + 1)
        return self.right.path_length(point, depth + 1)

class IsolationForest:
    def __init__(self, n_trees=100, sample_size=256):
        self.n_trees = n_trees
        self.sample_size = sample_size
        self.trees = []

    def fit(self, data):
        max_depth = math.ceil(math.log2(self.sample_size))
        for _ in range(self.n_trees):
            sample = random.sample(data, min(self.sample_size, len(data)))
            tree = IsolationTree(max_depth).fit(sample)
            self.trees.append(tree)
        return self

    def anomaly_scores(self, data):
        c = avg_path_bst(self.sample_size)
        scores = []
        for point in data:
            avg_path = sum(t.path_length(point) for t in self.trees) / self.n_trees
            # Score in [0, 1]: close to 1 = anomaly, close to 0.5 = normal
            scores.append(2 ** (-avg_path / c) if c > 0 else 0.5)
        return scores

# Test on 10-dimensional data (where distance methods struggle)
random.seed(99)
normal = [[random.gauss(0, 1) for _ in range(10)] for _ in range(500)]
anomalies = [[random.uniform(3, 5) * random.choice([-1,1])
               for _ in range(10)] for _ in range(10)]
data = normal + anomalies
labels = [0]*500 + [1]*10

iforest = IsolationForest(n_trees=100, sample_size=256).fit(data)
scores = iforest.anomaly_scores(data)
top10 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
true_pos = sum(1 for i in top10 if labels[i] == 1)
print(f"Top 10 flagged: {true_pos}/10 are true anomalies")
print(f"Precision@10: {true_pos/10:.1%}")
# Top 10 flagged: 9/10 are true anomalies — works in 10D!

Isolation Forest runs in O(n log n) time — orders of magnitude faster than LOF’s O(n²). It handles high-dimensional data gracefully because each tree only considers one random feature per split. And it requires no distance computations at all — just comparisons. This is why it’s become the workhorse of production anomaly detection systems.

Choosing the Right Method: A Practical Decision Framework

We’ve now built four anomaly detectors, each with its own personality. The z-score is the minimalist. k-NN is the empiricist. LOF is the relativist. Isolation Forest is the contrarian. Which one should you actually use?

The answer depends on your data. Let’s set up three benchmark scenarios and start with the simplest detector — the z-score baseline — to see where it shines and where it falls apart:

# Anomaly Detection Benchmark: Four Methods, Three Scenarios
import random, math

def benchmark(data, labels, method_scores, method_name, top_k=None):
    """Compute precision and recall for a scored anomaly detector."""
    n_anomalies = sum(labels)
    if top_k is None:
        top_k = n_anomalies  # flag as many as there are true anomalies
    ranked = sorted(range(len(data)), key=lambda i: method_scores[i], reverse=True)
    flagged = ranked[:top_k]
    tp = sum(1 for i in flagged if labels[i] == 1)
    precision = tp / top_k if top_k > 0 else 0
    recall = tp / n_anomalies if n_anomalies > 0 else 0
    f1 = 2*precision*recall / (precision+recall) if (precision+recall) > 0 else 0
    return {"method": method_name, "P": precision, "R": recall, "F1": f1}

# Scenario 1: Gaussian blob + global outliers (z-score's home turf)
random.seed(1)
s1_data = [(random.gauss(0,1), random.gauss(0,1)) for _ in range(200)]
s1_data += [(random.uniform(5,8), random.uniform(5,8)) for _ in range(8)]
s1_labels = [0]*200 + [1]*8

# Scenario 2: Two clusters, very different densities + local outliers
random.seed(2)
s2_data = [(random.gauss(0,0.2), random.gauss(0,0.2)) for _ in range(100)]
s2_data += [(random.gauss(5,1.5), random.gauss(5,1.5)) for _ in range(80)]
s2_data += [(1.5, 1.5), (3.0, 3.0), (-0.8, 2.0)]  # edge/gap anomalies
s2_labels = [0]*180 + [1]*3

# Scenario 3: 8-dimensional data with subtle anomalies
random.seed(3)
s3_data = [[random.gauss(0,1) for _ in range(8)] for _ in range(300)]
s3_data += [[random.gauss(0,1) + 2.5*((j%3)==0) for j in range(8)]
             for _ in range(6)]  # shifted in only 3 of 8 dimensions
s3_labels = [0]*300 + [1]*6

# Run simplified benchmarks (using z-score on flattened distance-from-mean)
for name, data, labels in [("Gaussian+Global", s1_data, s1_labels),
                            ("Multi-Density", s2_data, s2_labels),
                            ("High-Dim Subtle", s3_data, s3_labels)]:
    # Z-score: distance from centroid
    centroid = [sum(p[d] for p in data)/len(data) for d in range(len(data[0]))]
    z_scores = [math.sqrt(sum((p[d]-centroid[d])**2
                for d in range(len(p)))) for p in data]
    print(f"\n--- {name} ---")
    print(benchmark(data, labels, z_scores, "Z-score"))
    # Results show z-score excels on Gaussian, struggles on multi-density

Here’s the practical decision guide that emerges from these results:

  1. Univariate, roughly Gaussian data: start with the z-score; switch to the MAD version when outliers contaminate your mean and standard deviation estimates.
  2. Low-dimensional data with irregular shapes but fairly uniform density: k-NN distance scoring.
  3. Clusters with very different densities: LOF, because it judges density relative to each point’s neighborhood.
  4. Large or high-dimensional datasets, or anything production-scale: Isolation Forest — O(n log n), no distance computations.

One more critical decision: threshold selection. Every method produces a continuous score, but eventually you need a binary decision: anomaly or not. The contamination parameter — your best estimate of what fraction of the data is anomalous — sets where you draw the line. Set it too low and you miss anomalies. Set it too high and you drown in false alarms. In production, this tradeoff is often more important than which algorithm you pick.
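As a sketch, a contamination-based cutoff can be layered on top of any of the scorers above (the score values here are made up for illustration):

```python
def flag_by_contamination(scores, contamination=0.05):
    """Flag the top `contamination` fraction of scores as anomalies."""
    n_flag = max(1, round(len(scores) * contamination))
    cutoff = sorted(scores, reverse=True)[n_flag - 1]  # score of the n_flag-th point
    return [s >= cutoff for s in scores]

scores = [0.41, 0.45, 0.43, 0.88, 0.47, 0.44, 0.91, 0.46, 0.42, 0.40]
flags = flag_by_contamination(scores, contamination=0.2)
print([i for i, f in enumerate(flags) if f])  # [3, 6]
```

The same scores with contamination=0.5 would also flag several perfectly ordinary points — which is exactly the false-alarm tradeoff described above.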


References & Further Reading

For the foundations these methods build on, see our earlier posts on k-means clustering (distance and cluster structure), Bayesian inference (probabilistic anomaly scoring), PCA (reconstruction error as anomaly score), decision trees (the building block of isolation forests), and ML evaluation (precision-recall under class imbalance). For where anomaly detection fits in production systems, see ML model monitoring.