# ML Evaluation from Scratch
## The Evaluation Trap
Your model got 94% accuracy. Is that good? You don't actually know yet — and that's the problem.
Evaluation is the most underestimated skill in machine learning. We spend weeks tweaking architectures, tuning hyperparameters, and engineering features, then spend about thirty seconds checking whether any of it worked. A single `model.score(X_test, y_test)` and we call it a day. This is how models that look great on your laptop fail catastrophically in production.
There are three ways evaluation goes wrong, roughly in order of how often I see them:
- Evaluating on training data — the model memorized the answers, but you're grading it on the same exam it studied from. Every model looks brilliant when you let it cheat.
- A single random train/test split — you got lucky (or unlucky) with which examples ended up where. Run it again with a different random seed and accuracy shifts by 3%. Which number do you trust?
- Comparing models without statistical significance — Model B beats Model A by 0.8%. Is B actually better, or did it just get a more favorable split? Without a proper test, you cannot tell.
The consequences are real: deployed models that fail on new data, research papers with irreproducible results, and engineering teams that waste months chasing improvements that were never improvements at all. If you've read our posts on loss functions and regularization, you know how models learn and how to prevent overfitting. This post is about how to measure whether any of that actually worked.
We'll build everything from scratch: splitting strategies, cross-validation, a proper metric suite, and — the part almost everyone skips — statistical significance tests that tell you whether a difference between two models is real or just noise.
## Holdout, Stratification, and the Bias-Variance of Splits
The simplest form of evaluation is the holdout split: shuffle your data, set aside 20% as a test set, train on the remaining 80%, and evaluate. It's fast, it's easy, and it's dangerously unreliable — because how you split matters enormously.
Consider a dataset with 95% negative examples and 5% positive. A naive random split might, by pure chance, put zero positive examples in the test set. Your model would score 100% by always predicting negative, and you'd never notice. Stratified splitting fixes this by maintaining class proportions in both sets. And if your data has a time component — stock prices, user behavior, sensor readings — random splits cause data leakage because you're training on future data to predict the past.
Here are three splitting strategies, implemented from scratch:
```python
import numpy as np

def random_holdout(X, y, test_ratio=0.2, seed=42):
    """Split data randomly into train and test sets."""
    rng = np.random.RandomState(seed)
    n = len(X)
    indices = rng.permutation(n)
    split = int(n * (1 - test_ratio))
    return (X[indices[:split]], X[indices[split:]],
            y[indices[:split]], y[indices[split:]])

def stratified_holdout(X, y, test_ratio=0.2, seed=42):
    """Split preserving class proportions in both sets."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        label_idx = np.where(y == label)[0]
        rng.shuffle(label_idx)
        split = int(len(label_idx) * (1 - test_ratio))
        train_idx.extend(label_idx[:split])
        test_idx.extend(label_idx[split:])
    return (X[train_idx], X[test_idx],
            y[train_idx], y[test_idx])

def temporal_split(X, y, timestamps, cutoff):
    """Split by time: train on past, test on future."""
    train_mask = timestamps < cutoff
    test_mask = timestamps >= cutoff
    return (X[train_mask], X[test_mask],
            y[train_mask], y[test_mask])

# Demonstrate the variance problem: 20 random splits
accuracies = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = random_holdout(X, y, seed=seed)
    model.fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))
print(f"Mean: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
# Mean: 0.912 +/- 0.031 — that +/-3.1% is the problem!
```
The `random_holdout` function shuffles indices and slices them — simple but fragile. The `stratified_holdout` version processes each class separately, ensuring both sets reflect the true class distribution. The `temporal_split` function uses a date cutoff, which is the only correct approach for time-series data.
That final experiment is the key takeaway: twenty different random seeds produce a standard deviation of 3.1 percentage points. Your model didn't change — only the split changed. If you evaluate once and get 91.2%, another researcher with a different seed gets 88.5%, and neither of you is wrong. This variance motivates the next technique.
## Cross-Validation — Reducing the Luck Factor
The standard solution to split variance is to evaluate on every example by rotating which data is held out. In k-fold cross-validation, you partition the data into k equal folds, then run k experiments: each time, one fold serves as the test set and the remaining k−1 folds serve as training data. Your final score is the mean across all k evaluations.
This eliminates the question "did I get a lucky split?" because every example gets exactly one turn as test data. The resulting mean is far more stable than any single holdout score.
```python
import numpy as np

def kfold_split(n, k=5, seed=42):
    """Generate k train/test index pairs."""
    rng = np.random.RandomState(seed)
    indices = rng.permutation(n)
    fold_size = n // k
    for i in range(k):
        test_idx = indices[i * fold_size : (i + 1) * fold_size]
        train_idx = np.concatenate([
            indices[:i * fold_size],
            indices[(i + 1) * fold_size:]
        ])
        yield train_idx, test_idx

def stratified_kfold(y, k=5, seed=42):
    """K-fold splits preserving class proportions per fold."""
    rng = np.random.RandomState(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        label_idx = np.where(y == label)[0]
        rng.shuffle(label_idx)
        for i, idx in enumerate(label_idx):
            folds[i % k].append(idx)
    for i in range(k):
        test_idx = np.array(folds[i])
        train_idx = np.concatenate([
            np.array(folds[j]) for j in range(k) if j != i
        ])
        yield train_idx, test_idx

def cross_validate(X, y, model_fn, k=10, seed=42):
    """Run stratified k-fold CV and return per-fold scores."""
    scores = []
    for train_idx, test_idx in stratified_kfold(y, k, seed):
        model = model_fn()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)

scores = cross_validate(X, y, model_fn=LogisticRegression, k=10)
print(f"10-fold CV: {scores.mean():.3f} +/- {scores.std():.3f}")
# 10-fold CV: 0.914 +/- 0.012 — much tighter than random splits!
```
The `kfold_split` generator shuffles indices and yields non-overlapping slices. The `stratified_kfold` version deals out examples round-robin within each class, so every fold has roughly the same class ratio. The `cross_validate` wrapper ties it together: create a fresh model for each fold, train, score, collect.
The variance reduction is dramatic: from ±3.1% with random holdout down to ±1.2% with 10-fold CV. For most problems, k = 5 and k = 10 are the standard choices (Hastie, Tibshirani & Friedman recommend these in Elements of Statistical Learning). At the extreme, leave-one-out CV (k = n) evaluates every single example individually. It's nearly unbiased but high-variance and computationally expensive — training n models is usually overkill.
One subtlety: if your data has natural groups — multiple images per patient, several transactions per user — you must keep entire groups in the same fold. Otherwise the model sees "nearly identical" training examples for each test example, which is a form of data leakage. And when you're simultaneously tuning hyperparameters and evaluating, you need nested cross-validation: an inner loop for model selection and an outer loop for honest evaluation. The outer loop tells you how good the model is; the inner loop tells you which hyperparameters to use.
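Following the same from-scratch style, a group-aware k-fold is a small change to `stratified_kfold`: deal out whole groups instead of individual examples. The sketch below is illustrative; the `group_kfold` name and the round-robin group assignment are this post's own choices, and scikit-learn's GroupKFold is the production equivalent.

```python
import numpy as np

def group_kfold(groups, k=5, seed=42):
    """Yield train/test index pairs where no group straddles folds.

    Shuffles the unique group IDs, then deals whole groups to folds
    round-robin, so every example from one patient/user stays together.
    """
    rng = np.random.RandomState(seed)
    unique_groups = np.unique(groups)
    rng.shuffle(unique_groups)
    folds = [[] for _ in range(k)]
    for i, g in enumerate(unique_groups):
        folds[i % k].extend(np.where(groups == g)[0])
    for i in range(k):
        test_idx = np.array(folds[i])
        train_idx = np.concatenate(
            [np.array(folds[j]) for j in range(k) if j != i]
        )
        yield train_idx, test_idx

# Three images each from four patients: no patient straddles folds
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
for train_idx, test_idx in group_kfold(groups, k=4):
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

Nested CV composes the same way: run an inner model-selection loop inside each outer training fold, then score the winner on the outer test fold.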
One important theoretical result: Bengio & Grandvalet (2004) proved that no universal unbiased estimator of k-fold CV variance exists. In other words, the ± number we report is itself an approximation. This doesn't make CV useless — it makes it important to understand what we're actually measuring.
## Beyond Accuracy — Metrics That Actually Matter
Accuracy is the most popular evaluation metric and also the most misleading. On a dataset with 99% negative and 1% positive examples, a model that always predicts "negative" scores 99% accuracy. It's a useless model with a perfect-looking number.
The fundamental issue is that accuracy treats all errors equally. In fraud detection, missing a fraud (false negative) costs thousands of dollars, while flagging a legitimate transaction (false positive) costs a phone call. In medical screening, missing a cancer (false negative) can be fatal, while a false alarm just means an extra test. You need metrics that capture which kinds of errors your model makes.
```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN from binary predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    return prec, rec, f1

def matthews_corrcoef(y_true, y_pred):
    """MCC: balanced metric using all four confusion matrix cells."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den > 0 else 0

def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic."""
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]
    count = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                count += 1
            elif p == n:
                count += 0.5
    return count / (len(pos_scores) * len(neg_scores))

# Imbalanced fraud detection: 1% positive rate
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # always predict "no fraud"
p, r, f1 = precision_recall_f1(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"Accuracy: {np.mean(y_true == y_pred):.1%}")  # 99.0%
print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1: {f1:.3f}")  # all 0
print(f"MCC: {mcc:.3f}")  # 0.000 — correctly says model is useless
```
Let's unpack these metrics. Precision answers "of the items I flagged as positive, how many actually were?" Recall answers "of all the actual positives, how many did I find?" F1 is the harmonic mean of the two — it penalizes lopsided tradeoffs harder than the arithmetic mean would.
The Matthews Correlation Coefficient (MCC) deserves special attention. Its formula — (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) — uses all four cells of the confusion matrix. It produces a value between −1 and +1, where 0 means the model is no better than random guessing. Unlike accuracy and F1, MCC handles class imbalance naturally: our always-negative fraud model gets MCC = 0, correctly signaling that it learned nothing. Chicco & Jurman (2020) showed that MCC is the only metric that gives a high score exclusively when the model performs well on all four confusion matrix quadrants.
For ROC-AUC, we use the Mann-Whitney U statistic: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative example. An AUC of 0.5 means random; 1.0 means perfect ranking. This O(n²) implementation is deliberately naive — scikit-learn uses a sorted approach for efficiency — but it makes the probability interpretation crystal clear.
The precision-recall tradeoff depends entirely on context. In fraud detection you might accept 10% false positives to catch 95% of fraud. In spam filtering you might tolerate missing some spam to avoid ever filtering a legitimate email. The business context determines which tradeoff is acceptable — no metric can answer that for you.
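A threshold sweep makes the tradeoff visible: flag everything above a cutoff as positive, then watch precision and recall pull in opposite directions as the cutoff moves. The scores below are synthetic (a made-up model that assigns positives higher scores on average), so the exact numbers are illustrative:

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when flagging scores >= threshold as positive."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return prec, rec

# Synthetic imbalanced data: positives score ~0.8, negatives ~0.3
rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.1, size=2000)
scores = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=2000), 0, 1)

for t in [0.3, 0.5, 0.7]:
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision; where to stop on that curve is a business decision, not a modeling one.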
## Statistical Significance — Is the Difference Real?
You've trained two models. Model A scores 89.6% on 10-fold CV. Model B scores 91.6%. Should you deploy B? The 2-percentage-point gap looks convincing, but looks can deceive. The per-fold scores might bounce around enough that the gap is just noise.
We need a statistical test. The simplest approach is a paired t-test on the per-fold scores. It's "paired" because each fold produces a score for both models on the same test data. But there's a trap: the folds share training data. Fold 1 and Fold 2 have 80% of their training examples in common (for 10-fold CV). This overlap makes the scores correlated, and the naive t-test underestimates the true variance. The result? Inflated Type I error — it declares "significant!" far more often than 5% of the time.
The fix comes from Nadeau & Bengio (2003): inflate the variance estimate by a correction factor that accounts for the overlap. Instead of scaling the variance of the fold differences by 1/k, scale it by (1/k + n_test/n_train). For 10-fold CV that factor is 1/10 + 1/9 ≈ 0.211, which roughly doubles the variance estimate and dramatically reduces false positives.
An alternative is McNemar's test, which sidesteps the shared-training-data problem entirely. It works on the disagreement table: count how many examples A gets right but B gets wrong (b), and vice versa (c). The test statistic χ² = (b − c)² / (b + c) uses only these off-diagonal cells. It's more powerful for large test sets because it uses per-example information rather than per-fold aggregates.
```python
import numpy as np
from scipy import stats

def paired_ttest(scores_a, scores_b):
    """Naive paired t-test on CV fold scores.

    WARNING: underestimates variance due to overlapping training sets.
    """
    diffs = scores_a - scores_b
    t_stat = np.mean(diffs) / (np.std(diffs, ddof=1) / np.sqrt(len(diffs)))
    p_value = 2 * stats.t.sf(abs(t_stat), df=len(diffs) - 1)
    return t_stat, p_value

def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau & Bengio (2003) corrected test.

    Accounts for non-independence of CV folds.
    """
    k = len(scores_a)
    diffs = scores_a - scores_b
    mean_d = np.mean(diffs)
    var_d = np.var(diffs, ddof=1)
    # The correction: inflate variance by (1/k + n_test/n_train)
    corrected_var = (1 / k + n_test / n_train) * var_d
    t_stat = mean_d / np.sqrt(corrected_var) if corrected_var > 0 else 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

def mcnemar_test(y_true, preds_a, preds_b):
    """McNemar's test on per-example disagreements."""
    correct_a = (preds_a == y_true)
    correct_b = (preds_b == y_true)
    b = np.sum(correct_a & ~correct_b)  # A right, B wrong
    c = np.sum(~correct_a & correct_b)  # A wrong, B right
    if (b + c) == 0:
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    p_value = stats.chi2.sf(chi2, df=1)
    return chi2, p_value

# The shocking example
scores_a = np.array([0.88, 0.92, 0.89, 0.91, 0.87,
                     0.93, 0.90, 0.86, 0.91, 0.89])
scores_b = np.array([0.93, 0.91, 0.92, 0.91, 0.91,
                     0.91, 0.93, 0.91, 0.92, 0.91])
_, p_naive = paired_ttest(scores_a, scores_b)
_, p_corr = corrected_resampled_ttest(scores_a, scores_b, 900, 100)
print(f"2% gap — Naive p={p_naive:.3f}, Corrected p={p_corr:.3f}")
# 2% gap — Naive p=0.030, Corrected p=0.110
# Naive says significant, but the corrected test says NO!
```
The example above is the whole lesson in four numbers. A 2-percentage-point accuracy gap looks meaningful, and the naive paired t-test confirms it (p = 0.030). But once we account for the fact that CV folds share training data, the corrected test gives p = 0.110 — not significant at α = 0.05. The naive test was overconfident because it pretended the folds were independent.
One more trap: multiple comparisons. If you compare 10 models pairwise, that's 45 tests. At α = 0.05, you'd expect roughly 2–3 "significant" results purely by chance. The Bonferroni correction divides your significance threshold by the number of comparisons: with 45 tests, you'd require p < 0.05/45 ≈ 0.001 to declare significance. It's conservative, but it prevents you from fooling yourself.
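A quick simulation shows the trap in action. Ten hypothetical models with identical true accuracy give 45 pairwise comparisons; on simulated fold scores, a naive test at α = 0.05 should flag roughly two of them by chance alone, while the Bonferroni threshold suppresses nearly all such false alarms. Every parameter here is an illustrative assumption:

```python
import numpy as np
from scipy import stats

def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p-values that survive the Bonferroni-corrected threshold."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# 45 pairwise tests between models whose true accuracy is identical
rng = np.random.RandomState(0)
p_values = []
for _ in range(45):
    a = rng.normal(0.90, 0.02, size=10)  # fold scores, model A
    b = rng.normal(0.90, 0.02, size=10)  # fold scores, model B
    _, p = stats.ttest_rel(a, b)
    p_values.append(p)

naive_hits = sum(p < 0.05 for p in p_values)
corrected_hits = sum(bonferroni_significant(p_values))
print(f"Naive 'significant' results:  {naive_hits}")
print(f"Bonferroni-corrected results: {corrected_hits}")
```

If Bonferroni proves too conservative, Holm's step-down procedure is a uniformly more powerful drop-in replacement.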
## Evaluation Pitfalls — The Mistakes Everyone Makes
Beyond the core techniques, there's a catalog of evaluation errors that appear in practice over and over. I've made most of these myself, which is how I know they're easy to miss and hard to debug.
| Pitfall | Example | Fix |
|---|---|---|
| Data leakage | Standardizing features using the full dataset’s mean/std before splitting | All preprocessing inside the CV loop |
| Test set contamination | Tuning hyperparameters on the test set, even indirectly | Use nested CV for model selection |
| Temporal leakage | Random-splitting time series, training on future data | Always use temporal splits for time-stamped data |
| Metric mismatch | Optimizing accuracy when the business cares about precision@k | Choose metrics that match business impact |
| Ignoring calibration | Model says 90% confidence but is correct only 70% of the time | Use reliability diagrams, apply Platt scaling |
| Publication bias | Running many experiments but only reporting the best split | Report all results; use significance tests |
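The calibration row is worth making concrete. A reliability diagram needs a plot, but its scalar summary, expected calibration error (ECE), fits in a few lines. Here is a minimal binned sketch (the binning scheme and the deliberately overconfident toy model are assumptions for illustration):

```python
import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |confidence - accuracy| gap weighted by bin population."""
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.sum() == 0:
            continue
        avg_conf = probs[mask].mean()
        avg_acc = y_true[mask].mean()
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece

# A model that claims 90% confidence but is right only ~70% of the time
rng = np.random.RandomState(0)
probs = np.full(1000, 0.9)
y_true = rng.binomial(1, 0.7, size=1000)
print(f"ECE: {expected_calibration_error(y_true, probs):.3f}")  # near 0.2
```

A well-calibrated model would score near zero; Platt scaling or isotonic regression can shrink a gap like this after training.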
The single most important rule of ML evaluation: do all preprocessing inside the cross-validation loop. Standardization, feature selection, imputation, encoding — if the test set influences any of these steps, your evaluation is optimistically biased and your model will underperform in production.
Data leakage is the most insidious because it's invisible in your pipeline. You'll get great validation scores, ship the model, and wonder why production metrics are worse. The fix is simple in principle — every transformation that learns from data must be fitted on training data only and applied to test data — but it requires discipline to enforce in a complex pipeline with dozens of preprocessing steps.
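A small experiment shows how leaky preprocessing flatters your numbers. Both functions below standardize features, but one computes its statistics on the combined data. On a test set whose distribution has drifted, the leaky version makes test data look more like training data than it really is (the function names and the synthetic drift are assumptions for the demo):

```python
import numpy as np

def standardize_with_leakage(X_train, X_test):
    """WRONG: statistics computed on the full dataset leak test info."""
    X_all = np.vstack([X_train, X_test])
    mu, sigma = X_all.mean(axis=0), X_all.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def standardize_properly(X_train, X_test):
    """RIGHT: fit the scaler on training data only, apply to both."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(80, 3))
X_test = rng.normal(2, 1, size=(20, 3))  # test distribution has drifted

_, X_test_leaky = standardize_with_leakage(X_train, X_test)
_, X_test_clean = standardize_properly(X_train, X_test)
print("leaky test mean:", X_test_leaky.mean(axis=0).round(2))
print("clean test mean:", X_test_clean.mean(axis=0).round(2))
```

The leaky scaler shrinks the apparent drift, so validation scores overstate production performance; the same rule applies to feature selection, imputation, and any other fitted transform.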
## Putting It All Together
Evaluation isn't glamorous. Nobody writes blog posts bragging about their cross-validation setup. But it's the difference between a model that works and a model that merely looks like it works. The pipeline we've built — stratified splits, k-fold cross-validation, metrics that match your actual problem, and significance tests that account for the quirks of CV — gives you honest answers about model performance.
If you take one thing from this post, let it be this: a 2% accuracy improvement is not an improvement until a statistical test says so. And if you take two things, let the second be: do all preprocessing inside the cross-validation loop. These two habits alone will save you from the most common evaluation mistakes in machine learning.
## References & Further Reading
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning — chapters 7–8 cover model assessment and selection with mathematical rigor and clarity
- Nadeau & Bengio — Inference for the Generalization Error (2003) — the corrected resampled t-test that accounts for overlapping training sets in CV
- Dietterich — Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998) — the foundational comparison of five significance tests for ML
- Raschka — Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning (2018) — comprehensive modern survey of everything in this post and more
- Bengio & Grandvalet — No Unbiased Estimator of the Variance of K-Fold Cross-Validation (2004) — proves a fundamental limitation of CV variance estimation
- Bouckaert & Frank — Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms (2004) — empirical study of how significance tests perform on real ML benchmarks
- Chicco & Jurman — The Advantages of the Matthews Correlation Coefficient over F1 Score and Accuracy (2020) — the case for MCC as the default classification metric
- DadOps — Loss Functions from Scratch — the connection between training objectives and evaluation metrics
- DadOps — Regularization from Scratch — why models overfit and how to prevent it
- DadOps — Gradient Boosting from Scratch — out-of-bag evaluation for ensemble methods
- DadOps — Neural Scaling Laws — when more data makes the evaluation question moot
- DadOps — Backpropagation from Scratch — the training algorithm whose outputs we're evaluating