Gradient Boosting from Scratch

Introduction

Here's a fact that surprises most people entering the AI field: the algorithm that wins the most Kaggle competitions, powers the most production fraud detection systems, and dominates credit scoring, ad click prediction, and recommendation ranking is not a neural network. It's gradient boosting.

XGBoost, LightGBM, and CatBoost — three implementations of the same core idea — remain the default choice for any problem with tabular data (rows and columns, like a spreadsheet or SQL table). A 2022 NeurIPS benchmark by Grinsztajn et al. tested the best neural network architectures against gradient boosting across 45 tabular datasets and found that trees still win on medium-sized data. Neural networks only pull ahead with 500,000+ rows and complex feature interactions.

Understanding why requires going back to the fundamentals. In this post, we'll build gradient boosting from scratch — starting with a single decision tree, assembling forests, then building the boosting machinery that made XGBoost a household name in machine learning. Along the way, we'll discover why fitting one tree to another tree's mistakes is actually gradient descent in function space, and why this simple idea produces models that are remarkably hard to beat.

If you've been following the elementary series, you've seen us build optimizers that do gradient descent on neural network parameters. Gradient boosting does something wonderfully different: gradient descent where each "step" is an entire decision tree.

1. Decision Trees — Recursive Partitioning

A decision tree is the most intuitive machine learning model: it asks a series of yes/no questions about your features and arrives at a prediction. "Is the customer's income above $50K? Is their credit score above 700? Have they missed a payment in the last year?" Each question splits the data, and each leaf node holds a prediction.

Mathematically, a decision tree is a piecewise-constant function. It partitions feature space into axis-aligned rectangles and assigns a constant value to each region. This sounds limiting, but with enough rectangles, you can approximate any function — the question is whether you'll overfit in the process.

The key decision at each node is: which feature and which threshold give the best split? We measure "best" using impurity — how mixed the classes are in each child node. Two common measures:

  1. Gini impurity: 1 - ∑ p_k² — the probability of mislabeling a random sample from the node if you label it by drawing from the node's class distribution
  2. Entropy: -∑ p_k log₂ p_k — the information-theoretic measure of disorder

In practice, they almost always agree on the best split. We'll use Gini because it's slightly cheaper to compute (no logarithm) and it's the default criterion in scikit-learn's decision trees.

Here's a complete decision tree classifier built from scratch. Notice how compact it is — the entire algorithm is just recursive splitting:

import numpy as np

def gini_impurity(y):
    """Gini = 1 - sum(p_k^2) for each class k."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1.0 - np.sum(probs ** 2)

def best_split(X, y):
    """Find the feature and threshold that maximize information gain."""
    best_gain, best_feat, best_thresh = -1, None, None
    parent_impurity = gini_impurity(y)
    n = len(y)

    for feat in range(X.shape[1]):
        thresholds = np.unique(X[:, feat])
        for thresh in thresholds:
            left_mask = X[:, feat] <= thresh
            right_mask = ~left_mask
            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue

            # Weighted average impurity of children
            w_left = left_mask.sum() / n
            child_impurity = (w_left * gini_impurity(y[left_mask])
                            + (1 - w_left) * gini_impurity(y[right_mask]))
            gain = parent_impurity - child_impurity

            if gain > best_gain:
                best_gain = gain
                best_feat = feat
                best_thresh = thresh

    return best_feat, best_thresh, best_gain

def build_tree(X, y, depth=0, max_depth=5, min_samples=2):
    """Recursively build a decision tree."""
    # Leaf node: return majority class
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    feat, thresh, gain = best_split(X, y)
    if feat is None or gain <= 0:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    left_mask = X[:, feat] <= thresh
    return {
        "leaf": False, "feature": feat, "threshold": thresh,
        "left": build_tree(X[left_mask], y[left_mask], depth+1, max_depth, min_samples),
        "right": build_tree(X[~left_mask], y[~left_mask], depth+1, max_depth, min_samples),
    }

def predict_one(tree, x):
    if tree["leaf"]:
        return tree["class"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_one(tree["left"], x)
    return predict_one(tree["right"], x)

# Usage: tree = build_tree(X_train, y_train, max_depth=4)
# predictions = [predict_one(tree, x) for x in X_test]

The fatal flaw of a single tree is high variance. Grow it deep enough and it memorizes the training data perfectly — zero training error — but small changes in the training set produce wildly different trees. This is the classic bias-variance tradeoff: a single deep tree has low bias (it can fit anything) but high variance (it's unstable). We need a way to keep the low bias while taming the variance.

2. Random Forests — Wisdom of Crowds

The first ensemble idea is beautifully simple: if one tree is unstable, build many trees and let them vote. But there's a catch — if you train 100 trees on the same data, you get 100 identical trees. The trick is making them different while still keeping each one individually accurate.

Leo Breiman's Random Forest (2001) introduces two sources of randomness:

  1. Bagging (bootstrap aggregating): each tree trains on a bootstrap sample — n rows drawn from the training set with replacement — so every tree sees a slightly different dataset
  2. Feature subsampling: at each split, only a random subset of features (typically √P out of P) is considered, forcing different trees to exploit different structure

Why does averaging decorrelated trees reduce variance? If N trees each have variance σ² and pairwise correlation ρ, the ensemble's variance is:

ρσ² + (1-ρ)σ²/N

As N grows, the second term vanishes. What remains is ρσ² — variance proportional to the correlation between trees. Bagging and feature subsampling reduce ρ, so the ensemble variance shrinks dramatically. A single tree might achieve 75% accuracy on a noisy task; a random forest of 100 trees typically hits 88%+ with no hyperparameter tuning.
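That formula is easy to check numerically. The sketch below (an illustration, not part of the post's forest code) draws N correlated Gaussian "tree predictions" and compares the ensemble's empirical variance to ρσ² + (1-ρ)σ²/N:

```python
import numpy as np

# Simulate N tree predictions with variance s2 and pairwise correlation rho,
# then check the ensemble-variance formula: rho*s2 + (1 - rho)*s2/N.
rng = np.random.default_rng(42)
N, s2, rho = 50, 1.0, 0.3

cov = np.full((N, N), rho * s2)       # pairwise covariance rho*s2 ...
np.fill_diagonal(cov, s2)             # ... and variance s2 on the diagonal
draws = rng.multivariate_normal(np.zeros(N), cov, size=100_000)

empirical = draws.mean(axis=1).var()  # variance of the averaged "ensemble"
predicted = rho * s2 + (1 - rho) * s2 / N
print(empirical, predicted)           # both ≈ 0.314
```

With ρ = 0.3 the ensemble variance bottoms out near 0.3σ² no matter how many trees you add — which is exactly why decorrelating the trees matters more than adding them.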

def random_forest(X, y, n_trees=100, max_depth=8, max_features="sqrt"):
    """Build a random forest via bagging + feature subsampling."""
    n_samples, n_features = X.shape
    n_sub = int(np.sqrt(n_features)) if max_features == "sqrt" else n_features
    trees = []

    for _ in range(n_trees):
        # Bootstrap sample (draw n_samples with replacement)
        idx = np.random.randint(0, n_samples, size=n_samples)
        X_boot, y_boot = X[idx], y[idx]

        # Build tree with feature subsampling at each split
        tree = build_tree_rf(X_boot, y_boot, max_depth=max_depth,
                             n_sub_features=n_sub)
        trees.append(tree)

    return trees

def build_tree_rf(X, y, depth=0, max_depth=8, min_samples=2, n_sub_features=None):
    """Decision tree with random feature subsampling at each split."""
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    # Randomly select a subset of features to consider
    all_feats = np.arange(X.shape[1])
    chosen = np.random.choice(all_feats, size=n_sub_features, replace=False)

    best_gain, best_feat, best_thresh = -1, None, None
    parent_imp = gini_impurity(y)
    n = len(y)

    for feat in chosen:
        for thresh in np.unique(X[:, feat]):
            left = X[:, feat] <= thresh
            if left.sum() == 0 or (~left).sum() == 0:
                continue
            w = left.sum() / n
            imp = w * gini_impurity(y[left]) + (1-w) * gini_impurity(y[~left])
            gain = parent_imp - imp
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feat, thresh

    if best_feat is None:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "class": classes[np.argmax(counts)]}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": build_tree_rf(X[left_mask], y[left_mask], depth+1,
                              max_depth, min_samples, n_sub_features),
        "right": build_tree_rf(X[~left_mask], y[~left_mask], depth+1,
                               max_depth, min_samples, n_sub_features),
    }

def forest_predict(trees, X):
    """Majority vote across all trees."""
    all_preds = np.array([[predict_one(t, x) for x in X] for t in trees])
    # Most common prediction for each sample
    return np.array([np.bincount(all_preds[:, i].astype(int)).argmax()
                     for i in range(X.shape[0])])

Random forests are hard to beat for their simplicity. They need almost no tuning (more trees is almost always better), they provide free feature importance scores, and they parallelize trivially. But they have a ceiling — averaging reduces variance but can't reduce bias. If every individual tree misses some subtle pattern, the forest will too. This is where boosting changes the game.

3. Gradient Boosting — Learning from Mistakes

Here's the paradigm shift. Instead of building independent trees and averaging them (bagging), build trees sequentially, where each new tree specifically targets the errors the ensemble hasn't captured yet.

The algorithm for gradient boosting regression is almost suspiciously simple:

  1. Start with a constant prediction: F₀(x) = mean(y)
  2. Compute the residuals: r_i = y_i - F(x_i)
  3. Fit a small tree to the residuals (not the original targets)
  4. Update the ensemble: F(x) ← F(x) + η · tree(x)
  5. Repeat from step 2

The parameter η (learning rate, typically 0.01-0.3) scales each tree's contribution. Why not just use η=1? Because smaller steps with more trees produce smoother, more generalizable fits — just like in gradient descent for neural networks, a smaller learning rate with more steps usually reaches a better minimum.
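Steps 1-5 can be sketched end-to-end in a few lines. This is a deliberately minimal illustration (helper names like `fit_stump` and `boost` are invented here) that uses depth-1 trees — stumps — as the "small tree" of step 3; the full regression-tree version appears below:

```python
import numpy as np

def fit_stump(X, r):
    """Step 3: fit a depth-1 regression tree (a stump) to the residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat])[:-1]:
            left = X[:, feat] <= thresh
            sse = np.var(r[left]) * left.sum() + np.var(r[~left]) * (~left).sum()
            if sse < best[0]:
                best = (sse, feat, thresh, r[left].mean(), r[~left].mean())
    return best[1:]

def stump_predict(stump, X):
    feat, thresh, v_left, v_right = stump
    return np.where(X[:, feat] <= thresh, v_left, v_right)

def boost(X, y, n_rounds=100, lr=0.1):
    f = np.full(len(y), y.mean())          # step 1: constant prediction
    stumps = []
    for _ in range(n_rounds):
        r = y - f                          # step 2: residuals
        stump = fit_stump(X, r)            # step 3: fit a tree to the residuals
        f += lr * stump_predict(stump, X)  # step 4: scaled update
        stumps.append(stump)               # step 5: repeat
    return y.mean(), stumps

# Usage: f0, stumps = boost(X_train, y_train, n_rounds=50, lr=0.2)
```

Even with trees this weak, the residual-chasing loop drives training error steadily down — each stump only needs to be slightly better than nothing.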

But here's the deep insight: those residuals are actually the negative gradient of the squared error loss. For MSE loss L = ½(y - F(x))², the derivative with respect to F(x) is -(y - F(x)) = -residual. So fitting a tree to the residuals is equivalent to taking a gradient descent step in function space. Each tree is a "direction" of steepest descent, and the learning rate is the step size.

This is why it's called "gradient" boosting — the same gradient computation that drives neural network training, but instead of updating weight parameters, we're adding entire functions (trees) to our ensemble.

And because we're doing gradient descent, we can use any differentiable loss function — not just MSE. For classification, we use log-loss (cross-entropy from our loss functions post) and fit trees to the pseudo-residuals: the negative gradient of the log-loss.

def gradient_boost_regressor(X, y, n_rounds=100, lr=0.1, max_depth=3):
    """Gradient boosting for regression (MSE loss)."""
    # Initialize with the mean prediction
    init_pred = np.mean(y)
    f_hat = np.full(len(y), init_pred)
    trees = []

    for _ in range(n_rounds):
        # Negative gradient of MSE loss = residuals
        residuals = y - f_hat

        # Fit a shallow tree to the residuals
        tree = build_regression_tree(X, residuals, max_depth=max_depth)
        trees.append(tree)

        # Update predictions with the learning rate (vectorized over samples)
        f_hat += lr * np.array([predict_reg(tree, x) for x in X])

    return init_pred, trees, lr

def build_regression_tree(X, y, depth=0, max_depth=3, min_samples=5):
    """Regression tree: leaf value = mean of targets in that region."""
    if depth >= max_depth or len(y) < min_samples:
        return {"leaf": True, "value": np.mean(y)}

    best_mse, best_feat, best_thresh = float("inf"), None, None
    n = len(y)

    for feat in range(X.shape[1]):
        thresholds = np.unique(X[:, feat])
        # Sample thresholds for speed on large datasets
        if len(thresholds) > 50:
            thresholds = np.quantile(X[:, feat], np.linspace(0, 1, 50))
        for thresh in thresholds:
            left = X[:, feat] <= thresh
            if left.sum() < min_samples or (~left).sum() < min_samples:
                continue
            mse = (np.var(y[left]) * left.sum()
                 + np.var(y[~left]) * (~left).sum()) / n
            if mse < best_mse:
                best_mse, best_feat, best_thresh = mse, feat, thresh

    if best_feat is None:
        return {"leaf": True, "value": np.mean(y)}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": build_regression_tree(X[left_mask], y[left_mask],
                                      depth+1, max_depth, min_samples),
        "right": build_regression_tree(X[~left_mask], y[~left_mask],
                                       depth+1, max_depth, min_samples),
    }

def predict_reg(tree, x):
    if tree["leaf"]:
        return tree["value"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_reg(tree["left"], x)
    return predict_reg(tree["right"], x)

# For classification, replace residuals with pseudo-residuals:
# p = sigmoid(f_hat)  →  pseudo_residual = y - p
# This is the negative gradient of log-loss: -d/dF [-y*log(p) - (1-y)*log(1-p)]

The bias-variance story is the mirror image of random forests. Forests reduce variance by averaging independent trees but can't reduce bias. Boosting reduces bias by sequentially targeting what the ensemble gets wrong — each new tree chips away at the remaining error. But because trees are dependent (each one is built to correct the previous ensemble), there's less variance reduction. In practice, boosting typically achieves better accuracy than random forests, but requires more careful tuning to avoid overfitting.

4. XGBoost — The Engineering That Changed Everything

Gradient boosting as described above had existed since Jerome Friedman's 2001 paper. It worked well, but it was slow and prone to overfitting. Then in 2016, Tianqi Chen and Carlos Guestrin published the XGBoost paper (the library itself had already been winning Kaggle competitions since 2014), and it proceeded to dominate nearly every tabular competition it entered. What did they change?

Three innovations, each building on the last:

Innovation 1: Second-Order Gradients

Standard gradient boosting uses only the first derivative (gradient) of the loss. XGBoost uses both the first derivative and the second derivative (Hessian). This is the difference between plain gradient descent and Newton's method — it converges faster because the second derivative tells you about the curvature of the loss landscape. For each sample i:

g_i = ∂L(y_i, F(x_i)) / ∂F(x_i)        h_i = ∂²L(y_i, F(x_i)) / ∂F(x_i)²

For MSE: g = -(y - F(x)), h = 1. For log-loss: g = p - y, h = p(1 - p), where p = sigmoid(F(x)).
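These derivatives are easy to sanity-check with finite differences — a quick numerical sketch for the log-loss case (helper names invented here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, f):
    """Log-loss of a raw score f against label y in {0, 1}."""
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, f, eps = 1.0, 0.3, 1e-5
p = sigmoid(f)

# Central finite differences for the first and second derivative w.r.t. f
g_num = (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
h_num = (log_loss(y, f + eps) - 2 * log_loss(y, f) + log_loss(y, f - eps)) / eps**2

print(g_num, p - y)        # first derivative  ≈ p - y
print(h_num, p * (1 - p))  # second derivative ≈ p(1 - p)
```

The numerical gradient matches p - y and the numerical Hessian matches p(1 - p), confirming the formulas XGBoost plugs into its split gain.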

Innovation 2: Regularized Objective

Instead of just minimizing the loss, XGBoost minimizes loss + a regularization penalty that directly controls tree complexity:

Objective = Loss + ½ λ · ∑w_j² + γ · T

Where w_j are the leaf weights, T is the number of leaves, λ penalizes large leaf values (L2 regularization), and γ penalizes having too many leaves. This bakes complexity control directly into the optimization — not as an afterthought, but as part of what "optimal" means.

Innovation 3: The Split Gain Formula

With second-order gradients and the regularized objective, XGBoost derives a single formula that determines whether a split is worth making:

Gain = ½ · [G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)] - γ

Where G_L and H_L are the sums of gradients and Hessians in the left child, and similarly for the right. This formula simultaneously optimizes for accuracy (the G²/H terms represent how much the loss decreases) and penalizes complexity (λ and γ keep the tree small). A split only happens if Gain > 0 — built-in pruning.

And the optimal leaf weight is simply: w* = -G / (H + λ). This falls out of the second-order expansion of the loss: a leaf with weight w contributes G·w + ½(H + λ)w² to the objective, which is minimized at w* = -G/(H + λ); substituting back gives a contribution of -½·G²/(H + λ), which is where the G²/(H + λ) terms in the gain formula come from.

This is the heart of XGBoost — one formula for splitting, one for leaf values, and everything else is engineering around these two equations:

def xgboost_tree(X, grads, hessians, depth=0, max_depth=4,
                  lam=1.0, gamma=0.0, min_child_weight=1.0):
    """Build one XGBoost-style tree using second-order gradients."""
    G, H = np.sum(grads), np.sum(hessians)

    # Optimal leaf weight: w* = -G / (H + lambda)
    if depth >= max_depth or len(grads) < 2 or H < min_child_weight:
        return {"leaf": True, "weight": -G / (H + lam)}

    best_gain, best_feat, best_thresh = -float("inf"), None, None

    for feat in range(X.shape[1]):
        # Sort by feature value for efficient threshold search
        order = np.argsort(X[:, feat])
        g_left, h_left = 0.0, 0.0

        for i in range(len(order) - 1):
            idx = order[i]
            g_left += grads[idx]
            h_left += hessians[idx]
            g_right = G - g_left
            h_right = H - h_left

            # Skip if child is too small
            if h_left < min_child_weight or h_right < min_child_weight:
                continue
            # Skip duplicate thresholds
            if X[order[i], feat] == X[order[i+1], feat]:
                continue

            # The XGBoost split gain formula
            gain = 0.5 * (g_left**2 / (h_left + lam)
                        + g_right**2 / (h_right + lam)
                        - G**2 / (H + lam)) - gamma

            if gain > best_gain:
                best_gain = gain
                best_feat = feat
                best_thresh = (X[order[i], feat] + X[order[i+1], feat]) / 2

    # No worthwhile split found (gain includes regularization penalty)
    if best_feat is None or best_gain <= 0:
        return {"leaf": True, "weight": -G / (H + lam)}

    left_mask = X[:, best_feat] <= best_thresh
    return {
        "leaf": False, "feature": best_feat, "threshold": best_thresh,
        "left": xgboost_tree(X[left_mask], grads[left_mask], hessians[left_mask],
                             depth+1, max_depth, lam, gamma, min_child_weight),
        "right": xgboost_tree(X[~left_mask], grads[~left_mask], hessians[~left_mask],
                              depth+1, max_depth, lam, gamma, min_child_weight),
    }

def predict_xgb(tree, x):
    if tree["leaf"]:
        return tree["weight"]
    if x[tree["feature"]] <= tree["threshold"]:
        return predict_xgb(tree["left"], x)
    return predict_xgb(tree["right"], x)

# Full XGBoost training loop:
# f_hat = np.zeros(n)
# for round in range(n_rounds):
#     grads = compute_gradient(y, f_hat)    # g_i = p_i - y_i for log-loss
#     hessians = compute_hessian(y, f_hat)  # h_i = p_i * (1 - p_i)
#     tree = xgboost_tree(X, grads, hessians, lam=1.0, gamma=0.1)
#     for i in range(n):
#         f_hat[i] += lr * predict_xgb(tree, X[i])

Compare this to our earlier gradient boosting: instead of "fit tree to residuals," XGBoost says "fit tree to gradients, using Hessians to optimize leaf values and a regularized gain formula to decide splits." Same idea, much better execution.

5. Histogram-Based Splitting — Speed at Scale

The XGBoost implementation above has a bottleneck: at each node, it sorts every feature's values to find the best threshold. For a dataset with N rows and P features, that's O(NP log N) per tree level. On a million-row dataset, this gets slow.

LightGBM (Ke et al., 2017) popularized histogram-based splitting: bin continuous features into at most 256 discrete buckets before training starts. At each node, instead of sorting and scanning every unique value, you build a 256-bin histogram of gradient sums (one O(N) pass, no sorting) and sweep its boundaries. The split search drops from O(NP log N) to O(NP + 256P) per node, and the binned data is far more cache-friendly.

There's an additional trick that halves the work: the histogram subtraction optimization. If you know the parent node's histogram and the left child's histogram, the right child's histogram is just the difference. So you only need to build histograms for the smaller child — the other is free.

def histogram_split(X_binned, grads, hessians, n_bins=256, lam=1.0, gamma=0.0):
    """Find best split using histogram-based approach (LightGBM-style).

    X_binned: features pre-binned to integers in [0, n_bins).
    Returns: (best_feature, best_bin, best_gain)
    """
    n_samples, n_features = X_binned.shape
    G_total = np.sum(grads)
    H_total = np.sum(hessians)
    best_gain, best_feat, best_bin = -float("inf"), -1, -1

    for feat in range(n_features):
        # Build histogram: sum gradients and hessians per bin (vectorized)
        g_hist = np.bincount(X_binned[:, feat], weights=grads, minlength=n_bins)
        h_hist = np.bincount(X_binned[:, feat], weights=hessians, minlength=n_bins)

        # Scan bins left-to-right to find best split
        g_left, h_left = 0.0, 0.0
        for b in range(n_bins - 1):
            g_left += g_hist[b]
            h_left += h_hist[b]
            g_right = G_total - g_left
            h_right = H_total - h_left

            if h_left < 1e-3 or h_right < 1e-3:
                continue

            gain = 0.5 * (g_left**2 / (h_left + lam)
                        + g_right**2 / (h_right + lam)
                        - G_total**2 / (H_total + lam)) - gamma

            if gain > best_gain:
                best_gain, best_feat, best_bin = gain, feat, b

    return best_feat, best_bin, best_gain

def preprocess_bins(X, n_bins=256):
    """Bin continuous features into n_bins buckets using quantiles."""
    X_binned = np.zeros_like(X, dtype=np.int32)
    bin_edges = []
    for feat in range(X.shape[1]):
        edges = np.quantile(X[:, feat],
                           np.linspace(0, 1, n_bins + 1)[1:-1])
        edges = np.unique(edges)  # remove duplicate edges
        X_binned[:, feat] = np.searchsorted(edges, X[:, feat])
        bin_edges.append(edges)
    return X_binned, bin_edges

# Benchmark: on 100K rows, histogram splitting is 5-10x faster
# than exact splitting, with <0.1% accuracy difference
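The subtraction trick itself is one line of NumPy. A quick sketch on synthetic bins (names invented here):

```python
import numpy as np

# Parent histogram minus the left child's histogram equals the right child's
# histogram — so only the smaller child needs a pass over the data.
rng = np.random.default_rng(0)
bins = rng.integers(0, 256, size=10_000)   # one pre-binned feature
grads = rng.normal(size=10_000)            # per-sample gradients

parent = np.bincount(bins, weights=grads, minlength=256)
left_mask = rng.random(10_000) < 0.5       # stand-in for a split assignment
left = np.bincount(bins[left_mask], weights=grads[left_mask], minlength=256)
right = parent - left                      # the "free" histogram

right_direct = np.bincount(bins[~left_mask], weights=grads[~left_mask], minlength=256)
assert np.allclose(right, right_direct)
```

Since histogram construction dominates the cost of split finding, getting one of the two children for free cuts tree-building time nearly in half.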

LightGBM adds one more innovation: leaf-wise tree growth. XGBoost grows trees level-by-level (all nodes at depth d before any at depth d+1). LightGBM instead always splits whichever leaf offers the highest gain, regardless of depth. This produces deeper, more asymmetric trees that often generalize better — the tree invests its complexity budget where it matters most. Combined with histogram splitting, LightGBM typically trains 5-10x faster than the original XGBoost with comparable accuracy.
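Leaf-wise growth can be sketched with a priority queue over candidate leaves. This is a simplified regression version using squared-error gain on raw feature values — an illustration of the growth strategy, not LightGBM's histogram-based implementation (function names are invented here):

```python
import heapq
import numpy as np

def best_sse_split(X, y, idx, min_samples=5):
    """Best split of the samples in idx by sum-of-squared-error reduction."""
    best_gain, best_feat, best_thresh = 0.0, None, None
    parent_sse = np.var(y[idx]) * len(idx)
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[idx, feat])[:-1]:
            left = idx[X[idx, feat] <= thresh]
            right = idx[X[idx, feat] > thresh]
            if len(left) < min_samples or len(right) < min_samples:
                continue
            gain = parent_sse - (np.var(y[left]) * len(left)
                               + np.var(y[right]) * len(right))
            if gain > best_gain:
                best_gain, best_feat, best_thresh = gain, feat, thresh
    return best_gain, best_feat, best_thresh

def leafwise_tree(X, y, max_leaves=8, min_samples=5):
    """Grow a regression tree best-first: always split the highest-gain leaf."""
    root = {"leaf": True, "value": float(np.mean(y))}
    idx = np.arange(len(y))
    gain, feat, thresh = best_sse_split(X, y, idx, min_samples)
    # Heap entries: (-gain, tiebreak counter, leaf node, its samples, split)
    heap = [(-gain, 0, root, idx, feat, thresh)] if feat is not None else []
    counter, n_leaves = 0, 1
    while heap and n_leaves < max_leaves:
        _, _, node, node_idx, feat, thresh = heapq.heappop(heap)
        left_idx = node_idx[X[node_idx, feat] <= thresh]
        right_idx = node_idx[X[node_idx, feat] > thresh]
        children = [{"leaf": True, "value": float(np.mean(y[i]))}
                    for i in (left_idx, right_idx)]
        node.update({"leaf": False, "feature": feat, "threshold": thresh,
                     "left": children[0], "right": children[1]})
        n_leaves += 1
        for child, child_idx in zip(children, (left_idx, right_idx)):
            g, f, t = best_sse_split(X, y, child_idx, min_samples)
            if f is not None:
                counter += 1
                heapq.heappush(heap, (-g, counter, child, child_idx, f, t))
    return root

def leafwise_predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```

Note the budget here is a leaf count rather than a depth: one informative feature can absorb many splits while uninformative regions stay as single leaves.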

6. Trees vs. Neural Nets — The Tabular Data Debate

So when should you reach for gradient boosting versus a neural network? This is arguably the most practically important question in applied machine learning. Here's the honest comparison:

Dimension             | Gradient Boosting                             | Neural Networks
----------------------|-----------------------------------------------|--------------------------------------------------------
Tabular data accuracy | Wins on medium data (<500K rows)              | Competitive only with 500K+ rows
Images / text / audio | Not applicable                                | Wins by a mile
Training speed        | Minutes (CPU-only, no GPU needed)             | Hours to days (often requires a GPU)
Interpretability      | SHAP values, feature importance, tree plots   | Black box (attention maps help for some architectures)
Missing values        | Handled natively (learns a default direction) | Needs imputation preprocessing
Data efficiency       | Works well with <10K samples                  | Typically needs 100K+ rows to pull ahead

The Grinsztajn et al. (2022) NeurIPS benchmark tested the best neural architectures — FT-Transformer, ResNet, MLP — against gradient boosting across 45 diverse tabular datasets. The results were clear: gradient boosting wins on medium-sized datasets, and the neural approaches only start to pull ahead on very large datasets with many interacting features.

Why do trees win on tabular data? Several reasons compound (these are the explanations Grinsztajn et al. identify):

  1. Tabular target functions are often irregular. Axis-aligned splits fit jumpy, non-smooth patterns naturally, while neural networks are biased toward overly smooth solutions.
  2. Tabular data is full of uninformative features, which trees simply never split on but which measurably degrade MLP-style architectures.
  3. Tabular features are not rotation-invariant: each column carries its own meaning and units, and axis-aligned splits exploit that orientation, whereas rotationally invariant architectures must spend capacity rediscovering it.

The practical takeaway: start with gradient boosting for tabular data. Specifically, start with LightGBM (fastest, best defaults), try XGBoost if you need more regularization control, and only move to neural networks if you have a specific reason: very large data, multimodal inputs (mixing tabular with images or text), or a need for end-to-end differentiable learning. This is one case where scaling laws favor the simpler model — trees scale better per unit of compute on structured data.

7. References & Further Reading

  - Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5).
  - Breiman, L. (2001). "Random Forests." Machine Learning, 45(1).
  - Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD '16.
  - Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." NeurIPS 2017.
  - Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on typical tabular data?" NeurIPS 2022 Datasets and Benchmarks Track.

Related posts in this series:

Try It: Tree Builder Playground

Watch a decision tree recursively partition 2D space. Increase depth to see overfitting — training accuracy rises but test accuracy peaks and drops.


Try It: Boosting Rounds Visualizer

Watch gradient boosting fit a noisy sine wave round by round. Low learning rate + many rounds = smooth fit. High learning rate = jerky and prone to overfitting.