Gradient Boosting from Scratch

Introduction

Here's a fact that surprises most people entering the AI field: the algorithm that wins the most Kaggle competitions, powers the most production fraud detection systems, and dominates credit scoring, ad click prediction, and recommendation ranking is not a neural network. It's gradient boosting.

XGBoost, LightGBM, and CatBoost — three implementations of the same core idea — remain the default choice for any problem with tabular data (rows and columns, like a spreadsheet or SQL table). A 2022 NeurIPS benchmark by Grinsztajn et al. tested the best neural network architectures against gradient boosting across 45 tabular datasets and found that trees still win on medium-sized data. Neural networks only pull ahead with 500,000+ rows and complex feature interactions.

Understanding why requires going back to the fundamentals. In this post, we'll build gradient boosting from scratch — starting with a single decision tree, assembling forests, then building the boosting machinery that made XGBoost a household name in machine learning. Along the way, we'll discover why fitting one tree to another tree's mistakes is actually gradient descent in function space, and why this simple idea produces models that are remarkably hard to beat.

If you've been following the elementary series, you've seen us build optimizers that do gradient descent on neural network parameters. Gradient boosting does something wonderfully different: gradient descent where each "step" is an entire decision tree.

1. Decision Trees — Recursive Partitioning

A decision tree is the most intuitive machine learning model: it asks a series of yes/no questions about your features and arrives at a prediction. "Is the customer's income above $50K? Is their credit score above 700? Have they missed a payment in the last year?" Each question splits the data, and each leaf node holds a prediction.

Mathematically, a decision tree is a piecewise-constant function. It partitions feature space into axis-aligned rectangles and assigns a constant value to each region. This sounds limiting, but with enough rectangles, you can approximate any function — the question is whether you'll overfit in the process.

The key decision at each node is: which feature and which threshold give the best split? We measure "best" using impurity — how mixed the classes are in each child node. Two common measures:

  1. Gini impurity: 1 - ∑ p_k² — the probability of mislabeling a random sample from the node if you label it by drawing from the node's class distribution
  2. Entropy: -∑ p_k log₂ p_k — the information-theoretic measure of disorder

In practice, they almost always agree on the best split. We'll use Gini because it's slightly cheaper to compute (no logarithm) and it's the default criterion in scikit-learn's decision trees.

Here's a complete decision tree classifier built from scratch. Notice how compact it is — the entire algorithm is just recursive splitting:

import numpy as np

def gini_impurity(y):
    """Gini = 1 - sum(p_k^2) for each class k."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1.0 - np.sum(probs ** 2)

def best_split(X, y):
    """Find the feature and threshold that maximize information gain."""
    best_gain, best_feat, best_thresh = -1, None, None
    parent_impurity = gini_impurity(y)
    n = len(y)

    for feat in range(X.shape[1]):
        thresholds = np.unique(X[:, feat])
        for thresh in thresholds:
            left_mask = X[:, feat] <= thresh
            right_mask = ~left_mask
            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue

            # Weighted average impurity of children
            w_left = left_mask.sum() / n
            child_impurity = (w_left * gini_impurity(y[left_mask])
                            + (1 - w_left) * gini_impurity(y[right_mask]))
            gain = parent_impurity - child_impurity

            if gain > best_gain:
                best_gain = gain
                best_feat = feat
                best_thresh = thresh

    return best_feat, best_thresh, best_gain

def build_tree(X, y, depth=0, max_depth=5, min_samples=2):
    """Recursively build a decision tree."""
    # Leaf node: return majority class
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    feat, thresh, gain = best_split(X, y)
    if feat is None or gain <= 0:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    left_mask = X[:, feat] <= thresh
    return {
        "leaf": False, "feature": feat, "threshold": thresh,
        "left": build_tree(X[left_mask], y[left_mask], depth+1, max_depth, min_samples),
        "right": build_tree(X[~left_mask], y[~left_mask], depth+1, max_depth, min_samples),
    }

def predict_one(tree, x):
    if tree["leaf"]:
        return tree["class"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_one(tree["left"], x)
    return predict_one(tree["right"], x)

# Usage: tree = build_tree(X_train, y_train, max_depth=4)
# predictions = [predict_one(tree, x) for x in X_test]

The fatal flaw of a single tree is high variance. Grow it deep enough and it memorizes the training data perfectly — zero training error — but small changes in the training set produce wildly different trees. This is the classic bias-variance tradeoff: a single deep tree has low bias (it can fit anything) but high variance (it's unstable). We need a way to keep the low bias while taming the variance.

2. Random Forests — Wisdom of Crowds

The first ensemble idea is beautifully simple: if one tree is unstable, build many trees and let them vote. But there's a catch — if you train 100 trees on the same data, you get 100 identical trees. The trick is making them different while still keeping each one individually accurate.

Leo Breiman's Random Forest (2001) introduces two sources of randomness:

  1. Bagging (bootstrap aggregating): each tree trains on a bootstrap sample — n rows drawn from the training set with replacement — so every tree sees a slightly different dataset
  2. Feature subsampling: at each split, only a random subset of features (typically √P out of P) is considered, forcing different trees to exploit different structure

Why does averaging decorrelated trees reduce variance? If N trees each have variance σ² and pairwise correlation ρ, the ensemble's variance is:

ρσ² + (1-ρ)σ²/N

As N grows, the second term vanishes. What remains is ρσ² — variance proportional to the correlation between trees. Bagging and feature subsampling reduce ρ, so the ensemble variance shrinks dramatically. A single tree might achieve 75% accuracy on a noisy task; a random forest of 100 trees typically hits 88%+ with no hyperparameter tuning.
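That formula is easy to check numerically. The sketch below (an illustration, not part of the post's forest code) draws N correlated Gaussian "tree predictions" and compares the ensemble's empirical variance to ρσ² + (1-ρ)σ²/N:

```python
import numpy as np

# Simulate N tree predictions with variance s2 and pairwise correlation rho,
# then check the ensemble-variance formula: rho*s2 + (1 - rho)*s2/N.
rng = np.random.default_rng(42)
N, s2, rho = 50, 1.0, 0.3

cov = np.full((N, N), rho * s2)       # pairwise covariance rho*s2 ...
np.fill_diagonal(cov, s2)             # ... and variance s2 on the diagonal
draws = rng.multivariate_normal(np.zeros(N), cov, size=100_000)

empirical = draws.mean(axis=1).var()  # variance of the averaged "ensemble"
predicted = rho * s2 + (1 - rho) * s2 / N
print(empirical, predicted)           # both ≈ 0.314
```

With ρ = 0.3 the ensemble variance bottoms out near 0.3σ² no matter how many trees you add — which is exactly why decorrelating the trees matters more than adding them.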

def random_forest(X, y, n_trees=100, max_depth=8, max_features="sqrt"):
    """Build a random forest via bagging + feature subsampling."""
    n_samples, n_features = X.shape
    n_sub = int(np.sqrt(n_features)) if max_features == "sqrt" else n_features
    trees = []

    for _ in range(n_trees):
        # Bootstrap sample (draw n_samples with replacement)
        idx = np.random.randint(0, n_samples, size=n_samples)
        X_boot, y_boot = X[idx], y[idx]

        # Build tree with feature subsampling at each split
        tree = build_tree_rf(X_boot, y_boot, max_depth=max_depth,
                             n_sub_features=n_sub)
        trees.append(tree)

    return trees

def build_tree_rf(X, y, depth=0, max_depth=8, min_samples=2, n_sub_features=None):
    """Decision tree with random feature subsampling at each split."""
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    # Randomly select a subset of features to consider
    all_feats = np.arange(X.shape[1])
    chosen = np.random.choice(all_feats, size=n_sub_features, replace=False)

    best_gain, best_feat, best_thresh = -1, None, None
    parent_imp = gini_impurity(y)
    n = len(y)

    for feat in chosen:
        for thresh in np.unique(X[:, feat]):
            left = X[:, feat] <= thresh
            if left.sum() == 0 or (~left).sum() == 0:
                continue
            w = left.sum() / n
            imp = w * gini_impurity(y[left]) + (1-w) * gini_impurity(y[~left])
            gain = parent_imp - imp
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feat, thresh

    if best_feat is None:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": build_tree_rf(X[left_mask], y[left_mask], depth+1,
                              max_depth, min_samples, n_sub_features),
        "right": build_tree_rf(X[~left_mask], y[~left_mask], depth+1,
                               max_depth, min_samples, n_sub_features),
    }

def forest_predict(trees, X):
    """Majority vote across all trees."""
    all_preds = np.array([[predict_one(t, x) for x in X] for t in trees])
    # Most common prediction for each sample
    return np.array([np.bincount(all_preds[:, i].astype(int)).argmax()
                     for i in range(X.shape[0])])

Random forests are hard to beat for their simplicity. They need almost no tuning (more trees is almost always better), they provide free feature importance scores, and they parallelize trivially. But they have a ceiling — averaging reduces variance but can't reduce bias. If every individual tree misses some subtle pattern, the forest will too. This is where boosting changes the game.

3. Gradient Boosting — Learning from Mistakes

Here's the paradigm shift. Instead of building independent trees and averaging them (bagging), build trees sequentially, where each new tree specifically targets the errors the ensemble hasn't captured yet.

The algorithm for gradient boosting regression is almost suspiciously simple:

  1. Start with a constant prediction: F₀(x) = mean(y)
  2. Compute the residuals: r_i = y_i - F(x_i)
  3. Fit a small tree to the residuals (not the original targets)
  4. Update the ensemble: F(x) ← F(x) + η · tree(x)
  5. Repeat from step 2

The parameter η (learning rate, typically 0.01-0.3) scales each tree's contribution. Why not just use η=1? Because smaller steps with more trees produce smoother, more generalizable fits — just like in gradient descent for neural networks, a smaller learning rate with more steps usually reaches a better minimum.
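Steps 1-5 can be sketched end-to-end in a few lines. This is a deliberately minimal illustration (helper names like `fit_stump` and `boost` are invented here) that uses depth-1 trees — stumps — as the "small tree" of step 3; the full regression-tree version appears below:

```python
import numpy as np

def fit_stump(X, r):
    """Step 3: fit a depth-1 regression tree (a stump) to the residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat])[:-1]:
            left = X[:, feat] <= thresh
            sse = np.var(r[left]) * left.sum() + np.var(r[~left]) * (~left).sum()
            if sse < best[0]:
                best = (sse, feat, thresh, r[left].mean(), r[~left].mean())
    return best[1:]

def stump_predict(stump, X):
    feat, thresh, v_left, v_right = stump
    return np.where(X[:, feat] <= thresh, v_left, v_right)

def boost(X, y, n_rounds=100, lr=0.1):
    f = np.full(len(y), y.mean())          # step 1: constant prediction
    stumps = []
    for _ in range(n_rounds):
        r = y - f                          # step 2: residuals
        stump = fit_stump(X, r)            # step 3: fit a tree to the residuals
        f += lr * stump_predict(stump, X)  # step 4: scaled update
        stumps.append(stump)               # step 5: repeat
    return y.mean(), stumps

# Usage: f0, stumps = boost(X_train, y_train, n_rounds=50, lr=0.2)
```

Even with trees this weak, the residual-chasing loop drives training error steadily down — each stump only needs to be slightly better than nothing.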

But here's the deep insight: those residuals are actually the negative gradient of the squared error loss. For MSE loss L = ½(y - F(x))², the derivative with respect to F(x) is -(y - F(x)) = -residual. So fitting a tree to the residuals is equivalent to taking a gradient descent step in function space. Each tree is a "direction" of steepest descent, and the learning rate is the step size.

This is why it's called "gradient" boosting — the same gradient computation that drives neural network training, but instead of updating weight parameters, we're adding entire functions (trees) to our ensemble.

And because we're doing gradient descent, we can use any differentiable loss function — not just MSE. For classification, we use log-loss (cross-entropy from our loss functions post) and fit trees to the pseudo-residuals: the negative gradient of the log-loss.

def gradient_boost_regressor(X, y, n_rounds=100, lr=0.1, max_depth=3):
    """Gradient boosting for regression (MSE loss)."""
    # Initialize with the mean prediction
    init_pred = np.mean(y)
    f_hat = np.full(len(y), init_pred)
    trees = []

    for _ in range(n_rounds):
        # Negative gradient of MSE loss = residuals
        residuals = y - f_hat

        # Fit a shallow tree to the residuals
        tree = build_regression_tree(X, residuals, max_depth=max_depth)
        trees.append(tree)

        # Update predictions with the learning rate (vectorized over samples)
        f_hat += lr * np.array([predict_reg(tree, x) for x in X])

    return init_pred, trees, lr

def build_regression_tree(X, y, depth=0, max_depth=3, min_samples=5):
    """Regression tree: leaf value = mean of targets in that region."""
    if depth >= max_depth or len(y) < min_samples:
        return {"leaf": True, "value": np.mean(y)}

    best_mse, best_feat, best_thresh = float("inf"), None, None
    n = len(y)

    for feat in range(X.shape[1]):
        thresholds = np.unique(X[:, feat])
        # Sample thresholds for speed on large datasets
        if len(thresholds) > 50:
            thresholds = np.quantile(X[:, feat], np.linspace(0, 1, 50))
        for thresh in thresholds:
            left = X[:, feat] <= thresh
            if left.sum() < min_samples or (~left).sum() < min_samples:
                continue
            mse = (np.var(y[left]) * left.sum()
                 + np.var(y[~left]) * (~left).sum()) / n
            if mse < best_mse:
                best_mse, best_feat, best_thresh = mse, feat, thresh

    if best_feat is None:
        return {"leaf": True, "value": np.mean(y)}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": build_regression_tree(X[left_mask], y[left_mask],
                                      depth+1, max_depth, min_samples),
        "right": build_regression_tree(X[~left_mask], y[~left_mask],
                                       depth+1, max_depth, min_samples),
    }

def predict_reg(tree, x):
    if tree["leaf"]:
        return tree["value"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_reg(tree["left"], x)
    return predict_reg(tree["right"], x)

# For classification, replace residuals with pseudo-residuals:
# p = sigmoid(f_hat)  →  pseudo_residual = y - p
# This is the negative gradient of log-loss: -d/dF [-y*log(p) - (1-y)*log(1-p)]

The bias-variance story is the mirror image of random forests. Forests reduce variance by averaging independent trees but can't reduce bias. Boosting reduces bias by sequentially targeting what the ensemble gets wrong — each new tree chips away at the remaining error. But because trees are dependent (each one is built to correct the previous ensemble), there's less variance reduction. In practice, boosting typically achieves better accuracy than random forests, but requires more careful tuning to avoid overfitting.

4. XGBoost — The Engineering That Changed Everything

Gradient boosting as described above had existed since Jerome Friedman's 2001 paper. It worked well, but it was slow and prone to overfitting. Then in 2016, Tianqi Chen and Carlos Guestrin published the XGBoost paper (the library itself had already been winning Kaggle competitions since 2014), and it proceeded to dominate nearly every tabular competition it entered. What did they change?

Three innovations, each building on the last:

Innovation 1: Second-Order Gradients

Standard gradient boosting uses only the first derivative (gradient) of the loss. XGBoost uses both the first derivative and the second derivative (Hessian). This is the difference between plain gradient descent and Newton's method — it converges faster because the second derivative tells you about the curvature of the loss landscape. For each sample i:

g_i = ∂L(y_i, F(x_i)) / ∂F(x_i)        h_i = ∂²L(y_i, F(x_i)) / ∂F(x_i)²

For MSE: g = -(y - F(x)), h = 1. For log-loss: g = p - y, h = p(1 - p), where p = sigmoid(F(x)).
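These derivatives are easy to sanity-check with finite differences — a quick numerical sketch for the log-loss case (helper names invented here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, f):
    """Log-loss of a raw score f against label y in {0, 1}."""
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, f, eps = 1.0, 0.3, 1e-5
p = sigmoid(f)

# Central finite differences for the first and second derivative w.r.t. f
g_num = (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
h_num = (log_loss(y, f + eps) - 2 * log_loss(y, f) + log_loss(y, f - eps)) / eps**2

print(g_num, p - y)        # first derivative  ≈ p - y
print(h_num, p * (1 - p))  # second derivative ≈ p(1 - p)
```

The numerical gradient matches p - y and the numerical Hessian matches p(1 - p), confirming the formulas XGBoost plugs into its split gain.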

Innovation 2: Regularized Objective

Instead of just minimizing the loss, XGBoost minimizes loss + a regularization penalty that directly controls tree complexity:

Objective = Loss + ½ λ · ∑w_j² + γ · T

Where w_j are the leaf weights, T is the number of leaves, λ penalizes large leaf values (L2 regularization), and γ penalizes having too many leaves. This bakes complexity control directly into the optimization — not as an afterthought, but as part of what "optimal" means.

Innovation 3: The Split Gain Formula

With second-order gradients and the regularized objective, XGBoost derives a single formula that determines whether a split is worth making:

Gain = ½ · [G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)] - γ

Where G_L and H_L are the sums of gradients and Hessians in the left child, and similarly for the right. This formula simultaneously optimizes for accuracy (the G²/H terms represent how much the loss decreases) and penalizes complexity (λ and γ keep the tree small). A split only happens if Gain > 0 — built-in pruning.

And the optimal leaf weight is simply: w* = -G / (H + λ). This falls out of the second-order expansion of the loss: a leaf with weight w contributes G·w + ½(H + λ)w² to the objective, which is minimized at w* = -G/(H + λ); substituting back gives a contribution of -½·G²/(H + λ), which is where the G²/(H + λ) terms in the gain formula come from.

This is the heart of XGBoost — one formula for splitting, one for leaf values, and everything else is engineering around these two equations:

def xgboost_tree(X, grads, hessians, depth=0, max_depth=4,
                  lam=1.0, gamma=0.0, min_child_weight=1.0):
    """Build one XGBoost-style tree using second-order gradients."""
    G, H = np.sum(grads), np.sum(hessians)

    # Optimal leaf weight: w* = -G / (H + lambda)
    if depth >= max_depth or len(grads) < 2 or H < min_child_weight:
        return {"leaf": True, "weight": -G / (H + lam)}

    best_gain, best_feat, best_thresh = -float("inf"), None, None

    for feat in range(X.shape[1]):
        # Sort by feature value for efficient threshold search
        order = np.argsort(X[:, feat])
        g_left, h_left = 0.0, 0.0

        for i in range(len(order) - 1):
            idx = order[i]
            g_left += grads[idx]
            h_left += hessians[idx]
            g_right = G - g_left
            h_right = H - h_left

            # Skip if child is too small
            if h_left < min_child_weight or h_right < min_child_weight:
                continue
            # Skip duplicate thresholds
            if X[order[i], feat] == X[order[i+1], feat]:
                continue

            # The XGBoost split gain formula
            gain = 0.5 * (g_left**2 / (h_left + lam)
                        + g_right**2 / (h_right + lam)
                        - G**2 / (H + lam)) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feat = feat
                best_thresh = (X[order[i], feat] + X[order[i+1], feat]) / 2

    # No worthwhile split found (gain includes regularization penalty)
    if best_feat is None or best_gain <= 0:
        return {"leaf": True, "weight": -G / (H + lam)}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": xgboost_tree(X[left_mask], grads[left_mask], hessians[left_mask],
                             depth+1, max_depth, lam, gamma, min_child_weight),
        "right": xgboost_tree(X[~left_mask], grads[~left_mask], hessians[~left_mask],
                              depth+1, max_depth, lam, gamma, min_child_weight),
    }

def predict_xgb(tree, x):
    if tree["leaf"]:
        return tree["weight"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_xgb(tree["left"], x)
    return predict_xgb(tree["right"], x)

# Full XGBoost training loop:
# f_hat = np.zeros(n)
# for round in range(n_rounds):
#     grads = compute_gradient(y, f_hat)    # g_i = p_i - y_i for log-loss
#     hessians = compute_hessian(y, f_hat)  # h_i = p_i * (1 - p_i)
#     tree = xgboost_tree(X, grads, hessians, lam=1.0, gamma=0.1)
#     for i in range(n):
#         f_hat[i] += lr * predict_xgb(tree, X[i])

Compare this to our earlier gradient boosting: instead of "fit tree to residuals," XGBoost says "fit tree to gradients, using Hessians to optimize leaf values and a regularized gain formula to decide splits." Same idea, much better execution.

5. Histogram-Based Splitting — Speed at Scale

The XGBoost implementation above has a bottleneck: at each node, it sorts every feature's values to find the best threshold. For a dataset with N rows and P features, that's O(NP log N) per tree level. On a million-row dataset, this gets slow.

LightGBM (Ke et al., 2017) popularized histogram-based splitting: bin continuous features into at most 256 discrete buckets before training starts. At each node, instead of sorting and scanning every unique value, you build a 256-bin histogram of gradient sums (one O(N) pass, no sorting) and sweep its boundaries. The split search drops from O(NP log N) to O(NP + 256P) per node, and the binned data is far more cache-friendly.

There's an additional trick that halves the work: the histogram subtraction optimization. If you know the parent node's histogram and the left child's histogram, the right child's histogram is just the difference. So you only need to build histograms for the smaller child — the other is free.

def histogram_split(X_binned, grads, hessians, n_bins=256, lam=1.0, gamma=0.0):
    """Find best split using histogram-based approach (LightGBM-style).

    X_binned: features pre-binned to integers in [0, n_bins).
    Returns: (best_feature, best_bin, best_gain)
    """
    n_samples, n_features = X_binned.shape
    G_total = np.sum(grads)
    H_total = np.sum(hessians)
    best_gain, best_feat, best_bin = -float("inf"), -1, -1

    for feat in range(n_features):
        # Build histogram: sum gradients and hessians per bin (vectorized)
        g_hist = np.bincount(X_binned[:, feat], weights=grads, minlength=n_bins)
        h_hist = np.bincount(X_binned[:, feat], weights=hessians, minlength=n_bins)

        # Scan bins left-to-right to find best split
        g_left, h_left = 0.0, 0.0
        for b in range(n_bins - 1):
            g_left += g_hist[b]
            h_left += h_hist[b]
            g_right = G_total - g_left
            h_right = H_total - h_left

            if h_left < 1e-3 or h_right < 1e-3:
                continue

            gain = 0.5 * (g_left**2 / (h_left + lam)
                        + g_right**2 / (h_right + lam)
                        - G_total**2 / (H_total + lam)) - gamma

            if gain > best_gain:
                best_gain, best_feat, best_bin = gain, feat, b

    return best_feat, best_bin, best_gain

def preprocess_bins(X, n_bins=256):
    """Bin continuous features into n_bins buckets using quantiles."""
    X_binned = np.zeros_like(X, dtype=np.int32)
    bin_edges = []
    for feat in range(X.shape[1]):
        edges = np.quantile(X[:, feat],
                           np.linspace(0, 1, n_bins + 1)[1:-1])
        edges = np.unique(edges)  # remove duplicate edges
        X_binned[:, feat] = np.searchsorted(edges, X[:, feat])
        bin_edges.append(edges)
    return X_binned, bin_edges

# Benchmark: on 100K rows, histogram splitting is 5-10x faster
# than exact splitting, with <0.1% accuracy difference
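The subtraction trick itself is one line of NumPy. A quick sketch on synthetic bins (names invented here):

```python
import numpy as np

# Parent histogram minus the left child's histogram equals the right child's
# histogram — so only the smaller child needs a pass over the data.
rng = np.random.default_rng(0)
bins = rng.integers(0, 256, size=10_000)   # one pre-binned feature
grads = rng.normal(size=10_000)            # per-sample gradients

parent = np.bincount(bins, weights=grads, minlength=256)
left_mask = rng.random(10_000) < 0.5       # stand-in for a split assignment
left = np.bincount(bins[left_mask], weights=grads[left_mask], minlength=256)
right = parent - left                      # the "free" histogram

right_direct = np.bincount(bins[~left_mask], weights=grads[~left_mask], minlength=256)
assert np.allclose(right, right_direct)
```

Since histogram construction dominates the cost of split finding, getting one of the two children for free cuts tree-building time nearly in half.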

LightGBM adds one more innovation: leaf-wise tree growth. XGBoost grows trees level-by-level (all nodes at depth d before any at depth d+1). LightGBM instead always splits whichever leaf offers the highest gain, regardless of depth. This produces deeper, more asymmetric trees that often generalize better — the tree invests its complexity budget where it matters most. Combined with histogram splitting, LightGBM typically trains 5-10x faster than the original XGBoost with comparable accuracy.
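Leaf-wise growth can be sketched with a priority queue over candidate leaves. This is a simplified regression version using squared-error gain on raw feature values — an illustration of the growth strategy, not LightGBM's histogram-based implementation (function names are invented here):

```python
import heapq
import numpy as np

def best_sse_split(X, y, idx, min_samples=5):
    """Best split of the samples in idx by sum-of-squared-error reduction."""
    best_gain, best_feat, best_thresh = 0.0, None, None
    parent_sse = np.var(y[idx]) * len(idx)
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[idx, feat])[:-1]:
            left = idx[X[idx, feat] <= thresh]
            right = idx[X[idx, feat] > thresh]
            if len(left) < min_samples or len(right) < min_samples:
                continue
            gain = parent_sse - (np.var(y[left]) * len(left)
                               + np.var(y[right]) * len(right))
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feat, thresh
    return best_gain, best_feat, best_thresh

def leafwise_tree(X, y, max_leaves=8, min_samples=5):
    """Grow a regression tree best-first: always split the highest-gain leaf."""
    root = {"leaf": True, "value": float(np.mean(y))}
    idx = np.arange(len(y))
    gain, feat, thresh = best_sse_split(X, y, idx, min_samples)
    # Heap entries: (-gain, tiebreak counter, leaf node, its samples, split)
    heap = [(-gain, 0, root, idx, feat, thresh)] if feat is not None else []
    counter, n_leaves = 0, 1
    while heap and n_leaves < max_leaves:
        _, _, node, node_idx, feat, thresh = heapq.heappop(heap)
        left_idx = node_idx[X[node_idx, feat] <= thresh]
        right_idx = node_idx[X[node_idx, feat] > thresh]
        children = [{"leaf": True, "value": float(np.mean(y[i]))}
                    for i in (left_idx, right_idx)]
        node.update({"leaf": False, "feature": feat, "threshold": thresh,
                     "left": children[0], "right": children[1]})
        n_leaves += 1
        for child, child_idx in zip(children, (left_idx, right_idx)):
            g, f, t = best_sse_split(X, y, child_idx, min_samples)
            if f is not None:
                counter += 1
                heapq.heappush(heap, (-g, counter, child, child_idx, f, t))
    return root

def leafwise_predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```

Note the budget here is a leaf count rather than a depth: one informative feature can absorb many splits while uninformative regions stay as single leaves.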

6. Trees vs. Neural Nets — The Tabular Data Debate

So when should you reach for gradient boosting versus a neural network? This is arguably the most practically important question in applied machine learning. Here's the honest comparison:

Dimension             | Gradient Boosting                             | Neural Networks
----------------------|-----------------------------------------------|--------------------------------------------------------
Tabular data accuracy | Wins on medium data (<500K rows)              | Competitive only with 500K+ rows
Images / text / audio | Not applicable                                | Wins by a mile
Training speed        | Minutes (CPU-only, no GPU needed)             | Hours to days (often requires a GPU)
Interpretability      | SHAP values, feature importance, tree plots   | Black box (attention maps help for some architectures)
Missing values        | Handled natively (learns a default direction) | Needs imputation preprocessing
Data efficiency       | Works well with <10K samples                  | Typically needs 100K+ rows to pull ahead

The Grinsztajn et al. (2022) NeurIPS benchmark tested the best neural architectures — FT-Transformer, ResNet, MLP — against gradient boosting across 45 diverse tabular datasets. The results were clear: gradient boosting wins on medium-sized datasets, and the neural approaches only start to pull ahead on very large datasets with many interacting features.

Why do trees win on tabular data? Several reasons compound (these are the explanations Grinsztajn et al. identify):

  1. Tabular target functions are often irregular. Axis-aligned splits fit jumpy, non-smooth patterns naturally, while neural networks are biased toward overly smooth solutions.
  2. Tabular data is full of uninformative features, which trees simply never split on but which measurably degrade MLP-style architectures.
  3. Tabular features are not rotation-invariant: each column carries its own meaning and units, and axis-aligned splits exploit that orientation, whereas rotationally invariant architectures must spend capacity rediscovering it.

The practical takeaway: start with gradient boosting for tabular data. Specifically, start with LightGBM (fastest, best defaults), try XGBoost if you need more regularization control, and only move to neural networks if you have a specific reason: very large data, multimodal inputs (mixing tabular with images or text), or a need for end-to-end differentiable learning. This is one case where scaling laws favor the simpler model — trees scale better per unit of compute on structured data.

7. References & Further Reading

  - Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5).
  - Breiman, L. (2001). "Random Forests." Machine Learning, 45(1).
  - Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD '16.
  - Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." NeurIPS 2017.
  - Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on typical tabular data?" NeurIPS 2022 Datasets and Benchmarks Track.

Related posts in this series:

Try It: Tree Builder Playground

Watch a decision tree recursively partition 2D space. Increase depth to see overfitting — training accuracy rises but test accuracy peaks and drops.


Try It: Boosting Rounds Visualizer

Watch gradient boosting fit a noisy sine wave round by round. Low learning rate + many rounds = smooth fit. High learning rate = jerky and prone to overfitting.