
Decision Trees & Random Forests from Scratch

Making Decisions Like a Flowchart

When a doctor diagnoses a patient, they follow a mental flowchart: "Fever above 38°C? Yes. Cough present? Yes. Duration more than two weeks? No. Likely flu, not TB." Each question narrows the possibilities until a diagnosis emerges. A decision tree automates exactly this process — it learns which questions to ask, in what order, and where to draw the thresholds, all from data.

Decision trees are one of the few mainstream ML models you can literally read. No weight matrices to squint at, no kernel tricks to wave your hands about. You can trace any prediction from root to leaf and explain it to a non-technical stakeholder in plain English. That interpretability comes at a cost — a single tree is unstable, prone to overfitting, and high-variance — but when you combine hundreds of them into a Random Forest, those weaknesses become strengths.

If you've read our post on Gradient Boosting from Scratch, you've already seen trees used as base learners. But there, the tree was a supporting character. Today it's the protagonist. We'll build the CART algorithm from scratch, learn to prune it with cost-complexity analysis (a topic not covered elsewhere on this site), grow a Random Forest with out-of-bag evaluation, and compare two methods for measuring feature importance. Let's dig in.

Splitting Criteria — Gini Impurity and Information Gain

At every internal node, a decision tree asks: "Which feature and which threshold give me the purest children?" To answer this, we need a way to measure impurity — how mixed a group of labels is. Two classic measures dominate practice:

Gini impurity is the probability that a randomly chosen sample would be misclassified if you labeled it randomly according to the class distribution at that node: Gini = 1 − Σ p_i². It ranges from 0 (perfectly pure) to 0.5 (an even binary split). Entropy is the information-theoretic uncertainty: H = −Σ p_i log₂ p_i, ranging from 0 to 1 for binary classification. Both reward splits that create homogeneous children, and in practice they agree on the best split in over 98% of cases.

The metric we actually optimize is information gain — the reduction in impurity from a parent to its weighted children. The split with the highest information gain wins.

import numpy as np
from collections import Counter

def gini_impurity(y):
    """Gini = 1 - sum(p_i^2) for each class."""
    counts = Counter(y)
    n = len(y)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(y):
    """H = -sum(p_i * log2(p_i)) for each class."""
    counts = Counter(y)
    n = len(y)
    if n == 0:
        return 0.0
    return sum(-(c / n) * np.log2(c / n)
               for c in counts.values() if c > 0)

def information_gain(y_parent, y_left, y_right, criterion=gini_impurity):
    """Weighted impurity reduction from a split."""
    n = len(y_parent)
    n_l, n_r = len(y_left), len(y_right)
    return criterion(y_parent) - (
        n_l / n * criterion(y_left) + n_r / n * criterion(y_right)
    )

# A node with 30 cats and 10 dogs
y = [0]*30 + [1]*10
print(f"Gini:    {gini_impurity(y):.4f}")   # 0.3750
print(f"Entropy: {entropy(y):.4f}")          # 0.8113

# Split into [28 cats, 2 dogs] | [2 cats, 8 dogs]
y_left  = [0]*28 + [1]*2
y_right = [0]*2  + [1]*8
print(f"IG (Gini):    {information_gain(y, y_left, y_right):.4f}")
print(f"IG (Entropy): {information_gain(y, y_left, y_right, entropy):.4f}")

Both criteria give high information gain for this split — it cleanly separates cats from dogs. The absolute values differ because entropy spans a wider range, but Gini and entropy almost always rank candidate splits the same way. Scikit-learn defaults to Gini because it avoids the logarithm, making it marginally faster.

Building the CART Algorithm

CART (Classification and Regression Trees), introduced by Breiman, Friedman, Olshen, and Stone in 1984, builds a binary tree by exhaustive search. At each node, it tries every feature and every midpoint threshold between sorted unique values. The split with the highest information gain wins. Then it recurses on the left and right children until a stopping condition is met: the node is pure, the tree has reached maximum depth, or there aren't enough samples to justify another split.

The tree is stored as nested dictionaries: internal nodes hold a feature index, threshold, and two children; leaf nodes hold the predicted class (majority vote for classification, mean for regression). Prediction simply walks from root to leaf, going left when the feature value is at or below the threshold.

The greedy nature of CART is fundamental: at each node it makes the locally optimal split, which doesn't guarantee the globally optimal tree. Constructing a globally optimal decision tree is NP-complete (Hyafil and Rivest, 1976), so the pragmatic choice is greedy search, which works well enough in practice.

import numpy as np

class DecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples = min_samples_split

    def _gini(self, y):
        counts = np.bincount(y)
        probs = counts / len(y)
        return 1.0 - np.sum(probs ** 2)

    def _best_split(self, X, y):
        best_gain, best_feat, best_thresh = -1, None, None
        parent_imp = self._gini(y)
        n = len(y)
        for feat in range(X.shape[1]):
            vals = np.unique(X[:, feat])
            for i in range(len(vals) - 1):
                t = (vals[i] + vals[i + 1]) / 2
                left = X[:, feat] <= t
                if left.sum() == 0 or (~left).sum() == 0:
                    continue
                gain = parent_imp - (
                    left.sum() / n * self._gini(y[left]) +
                    (~left).sum() / n * self._gini(y[~left])
                )
                if gain > best_gain:
                    best_gain, best_feat, best_thresh = gain, feat, t
        return best_feat, best_thresh, best_gain

    def _build(self, X, y, depth):
        if (len(np.unique(y)) == 1 or depth >= self.max_depth
                or len(y) < self.min_samples):
            return {'value': int(np.bincount(y).argmax())}
        feat, thresh, gain = self._best_split(X, y)
        if feat is None or gain <= 0:
            return {'value': int(np.bincount(y).argmax())}
        left = X[:, feat] <= thresh
        return {
            'feature': feat, 'threshold': thresh,
            'left':  self._build(X[left], y[left], depth + 1),
            'right': self._build(X[~left], y[~left], depth + 1),
        }

    def fit(self, X, y):
        self.tree_ = self._build(X, y, depth=0)
        return self

    def _predict_one(self, node, x):
        if 'value' in node:
            return node['value']
        if x[node['feature']] <= node['threshold']:
            return self._predict_one(node['left'], x)
        return self._predict_one(node['right'], x)

    def predict(self, X):
        return np.array([self._predict_one(self.tree_, x) for x in X])

# Train on 2D data
np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + [1, 1],
               np.random.randn(50, 2) + [-1, -1]])
y = np.array([0]*50 + [1]*50)

tree = DecisionTree(max_depth=4).fit(X, y)
print(f"Training accuracy: {(tree.predict(X) == y).mean():.1%}")
print(f"Root split: feature {tree.tree_['feature']}, "
      f"threshold {tree.tree_['threshold']:.4f}")

With this naive implementation, a node holding n samples costs O(m · n²): for each of m features we try up to n candidate thresholds, and evaluating each threshold takes an O(n) pass to compute the child impurities. Summed over a balanced tree of depth O(log n), construction is O(m · n² · log n). In practice trees build faster than this bound suggests because node sizes shrink quickly as you descend.
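The quadratic factor comes from recomputing class counts from scratch at every candidate threshold. The standard fix is to sort each feature once and sweep the threshold left to right, maintaining running class counts so each candidate costs O(1) instead of O(n), bringing the per-feature cost down to O(n log n). Here is a standalone sketch of that idea (the function name and interface are illustrative, separate from the DecisionTree class above):

```python
import numpy as np

def best_split_sorted(x, y, n_classes=2):
    """Best Gini split on one feature via a single sorted sweep (O(n log n))."""
    order = np.argsort(x)
    x_s, y_s = x[order], y[order]
    n = len(y)
    left = np.zeros(n_classes)                    # class counts left of the split
    right = np.bincount(y_s, minlength=n_classes).astype(float)
    parent = 1.0 - np.sum((right / n) ** 2)
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        left[y_s[i]] += 1                         # move sample i to the left side
        right[y_s[i]] -= 1
        if x_s[i] == x_s[i + 1]:                  # can't split between equal values
            continue
        nl, nr = i + 1, n - i - 1
        gl = 1.0 - np.sum((left / nl) ** 2)
        gr = 1.0 - np.sum((right / nr) ** 2)
        gain = parent - (nl * gl + nr * gr) / n
        if gain > best_gain:
            best_gain, best_t = gain, (x_s[i] + x_s[i + 1]) / 2
    return best_t, best_gain

rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y_demo = np.array([0] * 50 + [1] * 50)
t, g = best_split_sorted(x_demo, y_demo)
print(f"best threshold ≈ {t:.2f}, gain = {g:.3f}")
```

This is how production implementations (including scikit-learn's) avoid the quadratic scan; the exhaustive version above is kept for clarity.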

Overfitting and Pruning — Taming the Tree

An unpruned decision tree will happily memorize your training data. Give it no depth limit and it'll create a leaf for every single sample, achieving 100% training accuracy and terrible generalization. This is overfitting in its purest form — the tree has learned the noise, not the signal.

Pre-pruning (early stopping) is the simple fix: set max_depth, min_samples_split, or min_impurity_decrease to halt tree growth before it goes too deep. Simple, but you have to guess the right hyperparameters.

Cost-complexity pruning (post-pruning) is more elegant. Grow the full tree, then work backwards. For each internal node, compute the "effective alpha" — the complexity penalty at which replacing that subtree with a single leaf becomes worthwhile. The formula is α = (R_leaf − R_subtree) / (|leaves| − 1), where R is the total weighted impurity. Nodes with low α are contributing little impurity reduction relative to their complexity, so we prune them first. The optimal α is found via cross-validation. This is regularization for trees.

import copy

def count_leaves(node):
    if 'value' in node:
        return 1
    return count_leaves(node['left']) + count_leaves(node['right'])

def subtree_impurity(node, X, y):
    """Total weighted Gini impurity across all leaves."""
    if 'value' in node:
        n = len(y)
        if n == 0:
            return 0
        return n * (1 - np.sum((np.bincount(y, minlength=2) / n) ** 2))
    mask = X[:, node['feature']] <= node['threshold']
    return (subtree_impurity(node['left'], X[mask], y[mask]) +
            subtree_impurity(node['right'], X[~mask], y[~mask]))

def effective_alpha(node, X, y):
    """Alpha at which pruning this subtree becomes optimal."""
    if 'value' in node:
        return float('inf')
    n = len(y)
    counts = np.bincount(y, minlength=2)
    r_leaf = n * (1 - np.sum((counts / max(n, 1)) ** 2))
    r_sub = subtree_impurity(node, X, y)
    n_leaves = count_leaves(node)
    return (r_leaf - r_sub) / max(n_leaves - 1, 1)

def cost_complexity_prune(node, X, y, alpha):
    """Remove subtrees where complexity cost exceeds impurity gain."""
    if 'value' in node:
        return node
    mask = X[:, node['feature']] <= node['threshold']
    node['left'] = cost_complexity_prune(
        node['left'], X[mask], y[mask], alpha)
    node['right'] = cost_complexity_prune(
        node['right'], X[~mask], y[~mask], alpha)
    if effective_alpha(node, X, y) <= alpha:
        return {'value': int(np.bincount(y).argmax())}
    return node

# Deep tree (depth 20, effectively unpruned here) vs pruned
full = DecisionTree(max_depth=20, min_samples_split=1).fit(X, y)
print(f"Full tree:  {count_leaves(full.tree_)} leaves, "
      f"train acc = {(full.predict(X) == y).mean():.1%}")

pruned = copy.deepcopy(full)
pruned.tree_ = cost_complexity_prune(pruned.tree_, X, y, alpha=0.02)
print(f"Pruned:     {count_leaves(pruned.tree_)} leaves, "
      f"train acc = {(pruned.predict(X) == y).mean():.1%}")

The pruned tree trades a tiny amount of training accuracy for dramatically fewer leaves. Fewer leaves means a simpler model, which generalizes better. This is the bias-variance tradeoff in action: shallow trees have high bias but low variance; deep trees have low bias but high variance. Pruning finds the sweet spot.

Try It: Tree Builder

Click "Step" to grow the tree one split at a time. Watch the scatter plot partition into regions and the tree structure grow as a flowchart.


Random Forests — Wisdom of the Crowd

A single decision tree is a high-variance learner: small changes in the training data can produce completely different trees. Leo Breiman's breakthrough insight in 2001 was to harness that instability. If each tree is different, their errors are different. Average enough of them, and the errors cancel out.
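You can see that instability directly by refitting just the root split on bootstrap resamples of one dataset. The sketch below (a standalone exhaustive search mirroring the CART code earlier, with illustrative variable names) shows the chosen feature and threshold jumping around from resample to resample:

```python
import numpy as np

def root_split(X, y):
    """Exhaustive Gini search for the single best root split."""
    n = len(y)
    parent = 1 - np.sum((np.bincount(y, minlength=2) / n) ** 2)
    best = (-1.0, None, None)                 # (gain, feature, threshold)
    for f in range(X.shape[1]):
        vals = np.unique(X[:, f])
        for i in range(len(vals) - 1):
            t = (vals[i] + vals[i + 1]) / 2
            in_left = X[:, f] <= t
            nl, nr = in_left.sum(), n - in_left.sum()
            gl = 1 - np.sum((np.bincount(y[in_left], minlength=2) / nl) ** 2)
            gr = 1 - np.sum((np.bincount(y[~in_left], minlength=2) / nr) ** 2)
            gain = parent - (nl * gl + nr * gr) / n
            if gain > best[0]:
                best = (gain, f, t)
    return best[1], best[2]

rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)) + [1, 1],
                    rng.normal(0, 1, (50, 2)) + [-1, -1]])
y_demo = np.array([0] * 50 + [1] * 50)

# Refit only the root on ten bootstrap resamples of the same data
for _ in range(10):
    idx = rng.integers(0, len(y_demo), len(y_demo))
    f, t = root_split(X_demo[idx], y_demo[idx])
    print(f"root: feature {f}, threshold {t:+.2f}")
```

Both features are equally informative here, so resampling noise alone decides which one the root splits on, and everything beneath the root changes with it.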

Random Forests inject randomness in two ways: bagging (bootstrap aggregating) trains each tree on a random sample with replacement from the training data, so each tree sees a slightly different dataset. Feature subsampling restricts each split to a random subset of √m features, forcing trees to explore different parts of the feature space. The ensemble prediction is the majority vote of all trees.

Feature subsampling is the key insight that makes forests powerful. Without it, every tree would split on the same dominant features and make the same errors. By decorrelating the trees, feature subsampling lets the ensemble average out individual tree quirks instead of amplifying them.

A bonus: since each tree is trained on a bootstrap sample, roughly 37% of the data is left out of each tree (the "out-of-bag" samples). We can use those for free validation — each sample is scored only by trees that never trained on it.
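The 37% comes from the bootstrap itself: the chance that a given sample is never drawn in n tries is (1 − 1/n)ⁿ, which approaches 1/e ≈ 0.368 as n grows. A quick simulation, independent of any forest code, confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
oob_fracs = []
for _ in range(200):
    # One bootstrap sample: n draws with replacement
    bag = rng.integers(0, n_samples, n_samples)
    oob_fracs.append(1 - len(np.unique(bag)) / n_samples)

print(f"mean OOB fraction: {np.mean(oob_fracs):.3f}")   # ≈ 0.368
print(f"1/e:               {np.exp(-1):.3f}")
```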

import numpy as np

def build_tree(X, y, depth, max_depth, max_features=None):
    """Recursive CART with optional random feature subsets."""
    if depth >= max_depth or len(np.unique(y)) == 1 or len(y) < 2:
        return int(np.bincount(y).argmax())
    m = X.shape[1]
    feats = np.random.choice(m, max_features or m, replace=False)
    best_gain, best_f, best_t = -1, None, None
    parent_g = 1 - np.sum((np.bincount(y, minlength=2) / len(y)) ** 2)
    for f in feats:
        vals = np.unique(X[:, f])
        for i in range(len(vals) - 1):
            t = (vals[i] + vals[i+1]) / 2
            left = X[:, f] <= t
            nl, nr = left.sum(), (~left).sum()
            if nl == 0 or nr == 0:
                continue
            gl = 1 - np.sum((np.bincount(y[left], minlength=2) / nl) ** 2)
            gr = 1 - np.sum((np.bincount(y[~left], minlength=2) / nr) ** 2)
            gain = parent_g - (nl * gl + nr * gr) / len(y)
            if gain > best_gain:
                best_gain, best_f, best_t = gain, f, t
    if best_gain <= 0:
        return int(np.bincount(y).argmax())
    mask = X[:, best_f] <= best_t
    return {
        'f': best_f, 't': best_t,
        'L': build_tree(X[mask], y[mask], depth+1, max_depth, max_features),
        'R': build_tree(X[~mask], y[~mask], depth+1, max_depth, max_features),
    }

def walk(node, x):
    if isinstance(node, int):
        return node
    return walk(node['L'] if x[node['f']] <= node['t'] else node['R'], x)

class RandomForest:
    def __init__(self, n_trees=100, max_depth=5):
        self.n_trees, self.max_depth = n_trees, max_depth

    def fit(self, X, y):
        n, m = X.shape
        mf = int(np.sqrt(m))
        self.trees_, self.oob_ = [], []
        for _ in range(self.n_trees):
            bag = np.random.choice(n, n, replace=True)
            oob = list(set(range(n)) - set(bag))
            self.trees_.append(
                build_tree(X[bag], y[bag], 0, self.max_depth, mf))
            self.oob_.append(oob)
        return self

    def predict(self, X):
        preds = np.array([[walk(t, x) for x in X] for t in self.trees_])
        return np.array([np.bincount(preds[:,i]).argmax()
                         for i in range(len(X))])

    def oob_score(self, X, y):
        """Each sample scored only by trees that didn't train on it."""
        votes = {i: [] for i in range(len(y))}
        for tree, oob in zip(self.trees_, self.oob_):
            for idx in oob:
                votes[idx].append(walk(tree, X[idx]))
        scored = {k: v for k, v in votes.items() if v}
        return np.mean([np.bincount(v).argmax() == y[k]
                        for k, v in scored.items()])

# Single tree vs Random Forest
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2) + [1, 1],
               np.random.randn(100, 2) + [-1, -1]])
y = np.array([0]*100 + [1]*100)

single = build_tree(X, y, 0, max_depth=10)
single_acc = np.mean([walk(single, x) == t for x, t in zip(X, y)])
rf = RandomForest(n_trees=100, max_depth=5).fit(X, y)

print(f"Single tree: {single_acc:.1%}")       # ~100%
print(f"Forest:      {(rf.predict(X) == y).mean():.1%}")  # ~100%
print(f"OOB score:   {rf.oob_score(X, y):.1%}")           # ~96%

The OOB score is the honest metric here. The single tree achieves 100% on training data because it memorized everything. The forest's training accuracy is similarly inflated, but its OOB score reveals realistic generalization performance — no held-out test set needed.

Try It: Forest vs Single Tree

Compare the jagged boundary of a single tree (left) to the smooth forest boundary (center). The right panel shows how individual trees overlap — darker color means more trees agree.


Feature Importance — What the Forest Learned

One of the biggest practical benefits of tree ensembles: they tell you which features matter. Two methods dominate:

Impurity-based importance sums the Gini reduction from every split that uses a given feature, averaged across all trees. It's fast (computed during training) but biased — it systematically overestimates the importance of high-cardinality features (those with many unique values) because they offer more potential split points.

Permutation importance takes a different approach: shuffle a feature column, destroying its relationship with the target, and measure how much accuracy drops. Bigger drops mean the feature was more important. This method avoids the cardinality bias and is model-agnostic, but it's slower because it requires re-predicting on the shuffled data for each feature, and strongly correlated features can mask one another's importance.

def impurity_importance(forest, X, y):
    """Average Gini reduction per feature across all trees."""
    n_feat = X.shape[1]
    imp = np.zeros(n_feat)
    def traverse(node, X_sub, y_sub):
        if isinstance(node, int) or len(y_sub) == 0:
            return
        f, t = node['f'], node['t']
        mask = X_sub[:, f] <= t
        n = len(y_sub)
        pg = 1 - np.sum((np.bincount(y_sub, minlength=2) / n) ** 2)
        nl, nr = mask.sum(), (~mask).sum()
        gl = 1 - np.sum((np.bincount(y_sub[mask], minlength=2)
                         / max(nl, 1)) ** 2) if nl else 0
        gr = 1 - np.sum((np.bincount(y_sub[~mask], minlength=2)
                         / max(nr, 1)) ** 2) if nr else 0
        imp[f] += n * (pg - nl/n*gl - nr/n*gr)
        traverse(node['L'], X_sub[mask], y_sub[mask])
        traverse(node['R'], X_sub[~mask], y_sub[~mask])
    for tree in forest.trees_:
        traverse(tree, X, y)
    total = imp.sum()
    return imp / total if total > 0 else imp

def permutation_importance(forest, X, y, n_repeats=10):
    """Accuracy drop when each feature is shuffled."""
    base_acc = (forest.predict(X) == y).mean()
    result = np.zeros(X.shape[1])
    for f in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuf = X.copy()
            np.random.shuffle(X_shuf[:, f])
            drops.append(base_acc - (forest.predict(X_shuf) == y).mean())
        result[f] = np.mean(drops)
    return result

# 10 features: first 3 relevant, last 7 noise
np.random.seed(42)
X_t = np.random.randn(200, 10)
y_t = ((X_t[:,0] > 0) ^ (X_t[:,1] > 0) ^ (X_t[:,2] > 0)).astype(int)

rf_t = RandomForest(n_trees=50, max_depth=6).fit(X_t, y_t)
imp_i = impurity_importance(rf_t, X_t, y_t)
perm_i = permutation_importance(rf_t, X_t, y_t)

print("Feature | Impurity | Permutation | Type")
print("--------|----------|-------------|-------")
for i in range(10):
    tag = "SIGNAL" if i < 3 else "noise"
    print(f"   {i}    |  {imp_i[i]:.3f}  |    {perm_i[i]:.3f}    | {tag}")

The signal features (0, 1, 2) should show high importance under both methods, while the noise features typically pick up small but non-zero impurity scores from spurious splits deep in the trees; permutation importance correctly assigns them near-zero. (The cardinality bias doesn't surface in this demo because every feature is continuous with similar cardinality — on mixed-type data it's far more pronounced.) This is why scikit-learn's default feature_importances_ (impurity-based) can be misleading. When the ranking matters, prefer permutation importance.

Conclusion

We've traced the full arc of tree-based learning: from the simple idea of recursive binary splits, through the CART algorithm, to pruning for generalization, to the power of Random Forests, and finally to understanding what the model learned through feature importance. Decision trees are one of the oldest ideas in machine learning, yet they remain at the foundation of the most competitive algorithms in the field — gradient boosting, XGBoost, LightGBM, and CatBoost are all tree ensembles at heart.

If you want to go deeper, our post on Gradient Boosting from Scratch shows how to combine trees sequentially rather than in parallel. For the math behind preventing overfitting, see Regularization from Scratch. And for properly measuring how well your trees generalize, check out ML Evaluation from Scratch.

References & Further Reading