
Feature Selection from Scratch

Why More Features Can Mean Worse Models

You add 50 new features to your model, expecting better predictions. Instead, test accuracy drops. Welcome to the curse of dimensionality — where more information actually makes your model stupider.

Here's the intuition: in high dimensions, data becomes sparse. A dataset of 1,000 samples feels dense in 2D but is hopelessly scattered in 500D. Your nearest-neighbor classifier, which worked beautifully with 5 features, now finds that every point is roughly equidistant from every other point. Distances lose meaning, and the model drowns in noise.
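You can see distance concentration directly with a few lines of NumPy (a quick sketch; the sample size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """(d_max - d_min) / d_min for distances from one point to all others.
    Near zero means every neighbor is roughly equidistant."""
    pts = rng.standard_normal((n, dim))
    d = np.linalg.norm(pts[1:] - pts[0], axis=1)
    return (d.max() - d.min()) / d.min()

for dim in [2, 50, 500]:
    print(f"{dim:>4}D contrast: {distance_contrast(dim):.2f}")
```

In 2D the nearest and farthest neighbors differ by a huge factor; by 500D the contrast collapses toward zero and "nearest" barely means anything.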

Irrelevant features inject random variation that the model mistakes for signal. Redundant features — columns that are near-copies of each other — inflate the effective dimensionality without adding new information. The result: overfitting, slower training, and models that crumble in production. If you've worked through our PCA post, you've seen dimensionality reduction tackle this by projecting data to lower dimensions. Feature selection takes a different approach: instead of creating new combined features, it identifies which original features actually matter and discards the rest.

There are three families of feature selection methods — filter, wrapper, and embedded — and we'll build all three from scratch, plus two techniques that complement them.

Filter Methods — Statistical Tests as Feature Scores

Filter methods score each feature independently by measuring its relationship to the target variable. They're fast, model-agnostic, and make a great first pass at trimming a bloated feature set.

Variance threshold is the simplest filter: if a feature barely varies, it can't distinguish anything. A column that's 99% zeros carries almost no information. We remove any feature whose variance falls below a threshold.
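To make the 99%-zeros claim concrete (a tiny sketch with a made-up sparsity level): a binary column with probability p of being nonzero has variance p(1 − p), which for p = 0.01 is about 0.0099 — far below any reasonable threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
sparse_col = (rng.random(10_000) < 0.01).astype(float)  # ~99% zeros
dense_col = rng.standard_normal(10_000)                 # ordinary feature

print(f"sparse variance: {sparse_col.var():.4f}")  # near p(1-p) = 0.0099
print(f"dense variance:  {dense_col.var():.4f}")   # near 1.0
```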

Pearson correlation measures linear dependence between a feature and the target. It's fast and interpretable, but only catches straight-line relationships. A feature that's perfectly predictive through a nonlinear pattern (say, a parabola) will show near-zero correlation.

Mutual information (MI) is the heavy hitter. Unlike correlation, MI captures any statistical dependence — linear, nonlinear, whatever. It measures how much knowing a feature reduces uncertainty about the target, in bits. If you've read our information theory post, MI is the KL divergence between the joint distribution and the product of marginals. For feature selection, MI is the gold standard filter metric because it makes no assumptions about the relationship's shape.
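That KL identity is easy to verify by hand on a discrete joint distribution (the numbers below are made up for illustration):

```python
import numpy as np

# Made-up joint distribution of two binary variables (rows: X, cols: Y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y

# MI = KL( p(x,y) || p(x)p(y) ), in bits
mi = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))
print(f"MI = {mi:.3f} bits")  # → MI = 0.278 bits
```

If X and Y were independent, p_xy would equal the outer product of the marginals and the MI would be exactly zero.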

Let's build all three on a synthetic dataset with known structure: 3 truly informative features, 3 redundant copies (linear combinations of the informative ones plus noise), and 4 pure Gaussian noise features.

import numpy as np

np.random.seed(42)
n = 200

# 3 informative features
X_info = np.random.randn(n, 3)
# Binary target: positive when weighted sum exceeds threshold
w_true = np.array([1.5, -1.0, 0.8])
y = (X_info @ w_true + 0.3 * np.random.randn(n) > 0).astype(int)

# 3 redundant features (linear combos of informative + noise)
X_redun = np.column_stack([
    0.9 * X_info[:, 0] + 0.1 * np.random.randn(n),
    0.8 * X_info[:, 1] + 0.2 * np.random.randn(n),
    0.7 * X_info[:, 2] + 0.3 * np.random.randn(n),
])
# 4 pure noise features
X_noise = np.random.randn(n, 4)
X = np.column_stack([X_info, X_redun, X_noise])
names = [f"info_{i}" for i in range(3)] + \
        [f"redun_{i}" for i in range(3)] + \
        [f"noise_{i}" for i in range(4)]

# --- Variance threshold ---
variances = X.var(axis=0)
print("Variances:", [f"{names[i]}={variances[i]:.3f}" for i in range(10)])

# --- Pearson correlation with target ---
cors = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(10)])
print("\n|Correlation| ranking:")
for idx in np.argsort(-np.abs(cors)):
    print(f"  {names[idx]:<10s} |r| = {abs(cors[idx]):.3f}")

# --- Mutual information (binned estimator) ---
def estimate_mi(x, y, bins=10):
    """MI between continuous x and binary y via histogram."""
    n_samples = len(x)
    c, edges = np.histogram(x, bins=bins)
    mi = 0.0
    for label in [0, 1]:
        mask = y == label
        p_y = mask.mean()
        if p_y == 0:
            continue
        c_xy, _ = np.histogram(x[mask], bins=edges)
        for b in range(bins):
            p_x = c[b] / n_samples
            p_xy = c_xy[b] / n_samples
            if p_xy > 0 and p_x > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

mis = np.array([estimate_mi(X[:, j], y) for j in range(10)])
print("\nMutual Information ranking:")
for idx in np.argsort(-mis):
    print(f"  {names[idx]:<10s} MI = {mis[idx]:.3f} bits")

Output:

Variances: ['info_0=0.942', 'info_1=1.028', 'info_2=0.968', 'redun_0=0.773',
            'redun_1=0.667', 'redun_2=0.610', 'noise_0=0.967', 'noise_1=0.966',
            'noise_2=1.032', 'noise_3=1.052']

|Correlation| ranking:
  info_0     |r| = 0.625
  redun_0    |r| = 0.577
  info_1     |r| = 0.427
  redun_1    |r| = 0.365
  info_2     |r| = 0.325
  redun_2    |r| = 0.237
  noise_2    |r| = 0.087
  noise_0    |r| = 0.045
  noise_3    |r| = 0.042
  noise_1    |r| = 0.019

Mutual Information ranking:
  info_0     MI = 0.255 bits
  redun_0    MI = 0.212 bits
  info_1     MI = 0.131 bits
  redun_1    MI = 0.108 bits
  info_2     MI = 0.076 bits
  redun_2    MI = 0.041 bits
  noise_2    MI = 0.016 bits
  noise_0    MI = 0.013 bits
  noise_3    MI = 0.011 bits
  noise_1    MI = 0.008 bits

Variance is useless here — all features have similar spread since they're all drawn from Gaussians. But both correlation and MI correctly rank informative features above noise. The catch: they also rank the redundant copies highly. redun_0 (a noisy copy of info_0) scores almost as well as the original. A naive "take the top 6" approach would select both the originals and their copies — wasting model capacity on duplicate information.

The Redundancy Problem — mRMR

Filter methods score features independently, so correlated features all look equally good. If features A and B carry the same information (correlation 0.95), a filter happily selects both. But the second adds almost nothing a model couldn't already learn from the first. We need a method that values diversity.
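A tiny sketch of the problem (synthetic numbers): b is a near-copy of a, and a filter scores both almost identically against the target even though the second one adds essentially nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = a + 0.1 * rng.standard_normal(500)  # near-copy of a
y = a + 0.2 * rng.standard_normal(500)  # target driven by a alone

corr = lambda u, v: abs(np.corrcoef(u, v)[0, 1])
print(f"|corr(a, y)| = {corr(a, y):.3f}")
print(f"|corr(b, y)| = {corr(b, y):.3f}")  # nearly as high: a filter picks both
print(f"|corr(a, b)| = {corr(a, b):.3f}")  # yet b is redundant with a
```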

Minimum Redundancy Maximum Relevance (mRMR), introduced by Peng, Long & Ding in 2005, solves this with a greedy strategy. At each step, it selects the feature that maximizes relevance (MI with the target) minus average redundancy (mean MI with already-selected features):

score(f) = MI(f, y) − (1/|S|) ∑ MI(f, s) for s in S

The first term pulls toward informative features. The second term penalizes features that are redundant with the existing selection. The result: mRMR prefers features that each contribute something new.
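To see the trade-off, here's the score computed by hand with made-up MI values: B is nearly as relevant as the already-selected A, but highly redundant with it, so the less relevant but independent C wins the next slot.

```python
import numpy as np

# Made-up MI values (bits); S = {"A"} is already selected
mi_with_y = {"A": 0.30, "B": 0.28, "C": 0.10}  # relevance MI(f, y)
mi_with_selected = {"B": [0.25], "C": [0.01]}  # MI(f, s) for each s in S

for f in ["B", "C"]:
    score = mi_with_y[f] - np.mean(mi_with_selected[f])
    print(f"score({f}) = {score:+.2f}")
# score(B) = +0.03, score(C) = +0.09 → mRMR picks C next
```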

def mrmr_select(X, y, n_features=5, bins=10):
    """Greedy mRMR: max relevance - mean redundancy, using MI."""
    d = X.shape[1]
    relevance = np.array([estimate_mi(X[:, j], y, bins) for j in range(d)])
    selected = []
    remaining = list(range(d))

    for k in range(n_features):
        best_score, best_f = -np.inf, None
        for f in remaining:
            if len(selected) == 0:
                redundancy = 0.0
            else:
                redundancy = np.mean([
                    estimate_mi(X[:, f], X[:, s], bins)
                    for s in selected
                ])
            score = relevance[f] - redundancy
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

def estimate_mi(x, y_or_x2, bins=10):
    """Generalized MI estimator (supersedes the binary-only version above):
    handles continuous-vs-discrete and continuous-vs-continuous pairs."""
    unique_vals = np.unique(y_or_x2)
    if len(unique_vals) <= 10:  # treat as discrete
        mi = 0.0
        c, edges = np.histogram(x, bins=bins)
        for label in unique_vals:
            mask = y_or_x2 == label
            p_y = mask.mean()
            if p_y == 0:
                continue
            c_xy, _ = np.histogram(x[mask], bins=edges)
            for b in range(bins):
                p_x = c[b] / len(x)
                p_xy = c_xy[b] / len(x)
                if p_xy > 0 and p_x > 0:
                    mi += p_xy * np.log2(p_xy / (p_x * p_y))
        return mi
    else:  # continuous-continuous: 2D histogram
        h_xy, _, _ = np.histogram2d(x, y_or_x2, bins=bins)
        h_xy = h_xy / h_xy.sum()
        h_x = h_xy.sum(axis=1)
        h_y = h_xy.sum(axis=0)
        mi = 0.0
        for i in range(bins):
            for j in range(bins):
                if h_xy[i, j] > 0 and h_x[i] > 0 and h_y[j] > 0:
                    mi += h_xy[i, j] * np.log2(h_xy[i, j] / (h_x[i] * h_y[j]))
        return mi

naive_top5 = np.argsort(-np.array([estimate_mi(X[:, j], y) for j in range(10)]))[:5]
mrmr_top5 = mrmr_select(X, y, n_features=5)

print("Naive MI top-5: ", [names[i] for i in naive_top5])
print("mRMR top-5:     ", [names[i] for i in mrmr_top5])

Output:

Naive MI top-5:  ['info_0', 'redun_0', 'info_1', 'redun_1', 'info_2']
mRMR top-5:      ['info_0', 'info_1', 'info_2', 'redun_2', 'redun_1']

Naive MI selection grabs redun_0 and redun_1 in its top 5 — essentially wasting two slots on copies. mRMR picks all three originals first (info_0, info_1, info_2) because after selecting an informative feature, its redundant copy gets penalized. Only after exhausting the informative features does it consider redundant ones.

Wrapper Methods — Letting the Model Decide

Filter methods never actually train a model. Wrapper methods take the opposite approach: they evaluate feature subsets by measuring how well a model performs when trained on that subset. It's the "gold standard" for finding good features — but it's expensive.

Forward selection starts with an empty feature set and greedily adds the feature that most improves validation accuracy at each step, costing O(d²) model fits in total. Recursive Feature Elimination (RFE) works backwards: train on all features, remove the least important one (the one with the smallest absolute coefficient), repeat; it needs only O(d) fits, one per elimination. Both are far slower than filters, but they capture feature interactions that filters miss.

def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_accuracy(X_train, y_train, X_val, y_val, lr=0.1, epochs=200):
    """Train logistic regression, return validation accuracy."""
    w = np.zeros(X_train.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X_train @ w + b)
        grad_w = X_train.T @ (p - y_train) / len(y_train)
        grad_b = (p - y_train).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    preds = (sigmoid(X_val @ w + b) > 0.5).astype(int)
    return (preds == y_val).mean(), np.abs(w)

# Train/validation split
idx = np.arange(n)
np.random.shuffle(idx)
Xtr, ytr = X[idx[:140]], y[idx[:140]]
Xval, yval = X[idx[140:]], y[idx[140:]]

# --- Forward Selection ---
selected_fwd = []
remaining_fwd = list(range(10))
acc_curve_fwd = []

for step in range(6):
    best_acc, best_f = -1, None
    for f in remaining_fwd:
        cols = selected_fwd + [f]
        acc, _ = logistic_accuracy(Xtr[:, cols], ytr, Xval[:, cols], yval)
        if acc > best_acc:
            best_acc, best_f = acc, f
    selected_fwd.append(best_f)
    remaining_fwd.remove(best_f)
    acc_curve_fwd.append(best_acc)

print("Forward Selection path:")
for i, f in enumerate(selected_fwd):
    print(f"  Step {i+1}: +{names[f]:<10s} acc = {acc_curve_fwd[i]:.3f}")

# --- Recursive Feature Elimination ---
active_rfe = list(range(10))
elim_order = []

while len(active_rfe) > 3:
    _, importances = logistic_accuracy(Xtr[:, active_rfe], ytr,
                                       Xval[:, active_rfe], yval)
    least = np.argmin(importances)
    elim_order.append(names[active_rfe[least]])
    active_rfe.pop(least)

print("\nRFE elimination order:", elim_order)
print("RFE surviving features:", [names[i] for i in active_rfe])

Output:

Forward Selection path:
  Step 1: +info_0     acc = 0.800
  Step 2: +info_1     acc = 0.867
  Step 3: +info_2     acc = 0.883
  Step 4: +redun_0    acc = 0.883
  Step 5: +redun_2    acc = 0.883
  Step 6: +noise_1    acc = 0.883

RFE elimination order: ['noise_3', 'noise_1', 'noise_0', 'noise_2', 'redun_2', 'redun_1', 'redun_0']
RFE surviving features: ['info_0', 'info_1', 'info_2']

Forward selection finds the three informative features first, then accuracy plateaus — additional features add nothing. RFE correctly eliminates all noise features first, then the redundant copies, leaving exactly the three informative features. Both methods agree on what matters, approaching the answer from opposite directions. The wrapper approach captures something filters can't: how features work together inside a model. But each step required training a fresh model, making this significantly slower.

Embedded Methods — Feature Selection During Training

What if the model could select features as a side effect of training? That's the embedded approach, and the most famous example is L1 regularization (Lasso).

Lasso adds a penalty proportional to the absolute value of each coefficient: the loss becomes MSE + λ ∑ |w_j|. If you've read our regularization post, you know the geometric intuition: L1's diamond-shaped constraint region has sharp corners at the axes, so the optimal solution tends to land exactly on a corner where some coefficients are zero. L2 (Ridge) has a circular constraint — it shrinks everything toward zero but never actually reaches it.
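The one-dimensional version makes the corner effect concrete. For a single coefficient, the L1 update is soft thresholding (shrink by λ, then clip to exactly zero), while the L2 update only rescales (a minimal sketch with a unit penalty λ = 1):

```python
import numpy as np

def soft_threshold(rho, lam):
    """L1 update: shrink toward zero, exactly 0 whenever |rho| <= lam."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def ridge_shrink(rho, lam):
    """L2 update: rescale toward zero, but never exactly reach it."""
    return rho / (1.0 + lam)

for rho in [2.0, 0.8, 0.3]:
    print(f"rho={rho:+.1f}  L1 -> {soft_threshold(rho, 1.0):+.2f}"
          f"  L2 -> {ridge_shrink(rho, 1.0):+.2f}")
```

This is exactly the soft-thresholding branch inside the coordinate descent solver: any coefficient whose correlation with the residual falls inside the penalty band is set to exactly zero.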

This makes Lasso a natural feature selector. As you increase λ, coefficients get pushed to zero one by one. The features that survive at high regularization are the ones the model considers most important. Let's implement coordinate descent — the standard Lasso solver — and watch the regularization path.

def lasso_coordinate_descent(X, y, lam, epochs=500):
    """Lasso via coordinate descent. Returns coefficients."""
    n, d = X.shape
    w = np.zeros(d)
    # Precompute for efficiency
    X_col_sq = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            # Soft thresholding
            if rho > lam * n:
                w[j] = (rho - lam * n) / X_col_sq[j]
            elif rho < -lam * n:
                w[j] = (rho + lam * n) / X_col_sq[j]
            else:
                w[j] = 0.0
    return w

# Standardize features for fair penalization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()

# Regularization path: sweep lambda from high to low
lambdas = np.logspace(0, -2, 30)
paths = np.zeros((30, 10))

for i, lam in enumerate(lambdas):
    paths[i] = lasso_coordinate_descent(X_std, y_centered, lam)

print("Lambda    Nonzero features (selected)")
print("-" * 50)
for i in [0, 5, 10, 15, 20, 29]:
    nonzero = [names[j] for j in range(10) if abs(paths[i, j]) > 1e-8]
    print(f"  {lambdas[i]:.4f}   [{len(nonzero)}] {nonzero}")

Output:

Lambda    Nonzero features (selected)
--------------------------------------------------
  1.0000   [0] []
  0.4217   [1] ['info_0']
  0.1778   [3] ['info_0', 'info_1', 'info_2']
  0.0750   [4] ['info_0', 'info_1', 'info_2', 'redun_0']
  0.0316   [6] ['info_0', 'info_1', 'info_2', 'redun_0', 'redun_1', 'redun_2']
  0.0100   [8] ['info_0', 'info_1', 'info_2', 'redun_0', 'redun_1', 'redun_2', 'noise_2', 'noise_3']

The regularization path tells a clear story. At high λ (strong penalty), everything is zeroed out. As we relax the penalty, the truly informative features appear first: info_0 breaks through first (it has the largest true coefficient), followed by info_1 and info_2. The redundant copies only appear at moderate λ, and noise features only sneak in at very low regularization. At λ = 0.1778, Lasso has automatically selected exactly the three informative features — no redundancy, no noise.

Permutation Importance — A Model-Agnostic Alternative

What if you've already trained a model — any model, from a random forest to a neural network — and want to know which features it actually relies on? Permutation importance answers this with an elegantly simple idea: randomly shuffle one feature's values, then measure how much the model's performance degrades.

If shuffling feature X destroys accuracy, the model depends heavily on X. If accuracy barely changes, X is either irrelevant or redundant with other features. This method works with any trained model, requires no retraining, and captures interaction effects that filter methods miss. The main pitfall: correlated features split the importance. If A and B carry the same information, shuffling one barely hurts because the model can still rely on the other.

def permutation_importance(X_val, y_val, predict_fn, n_repeats=20):
    """Permutation importance: accuracy drop when each feature is shuffled."""
    base_acc = (predict_fn(X_val) == y_val).mean()
    importances = np.zeros(X_val.shape[1])

    rng = np.random.RandomState(0)
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            perm_acc = (predict_fn(X_perm) == y_val).mean()
            drops.append(base_acc - perm_acc)
        importances[j] = np.mean(drops)
    return importances

# Train a logistic regression on all features
w_full = np.zeros(10)
b_full = 0.0
for _ in range(500):
    p = sigmoid(Xtr @ w_full + b_full)
    w_full -= 0.1 * (Xtr.T @ (p - ytr) / len(ytr))
    b_full -= 0.1 * (p - ytr).mean()

predict_fn = lambda Xv: (sigmoid(Xv @ w_full + b_full) > 0.5).astype(int)

perm_imp = permutation_importance(Xval, yval, predict_fn)

print("Permutation importance ranking:")
for idx in np.argsort(-perm_imp):
    marker = " *" if "info" in names[idx] else ""
    print(f"  {names[idx]:<10s} drop = {perm_imp[idx]:+.3f}{marker}")

Output:

Permutation importance ranking:
  info_0     drop = +0.117 *
  info_1     drop = +0.092 *
  info_2     drop = +0.048 *
  redun_0    drop = +0.027
  redun_1    drop = +0.010
  redun_2    drop = +0.007
  noise_3    drop = +0.003
  noise_0    drop = +0.002
  noise_1    drop = -0.002
  noise_2    drop = -0.003

Permutation importance correctly identifies the three informative features as most important. Notice how the redundant features show low importance despite being correlated with the target — because the model already has the originals, shuffling the copies barely matters. Noise features cluster near zero (some even show tiny negative importance, which just means shuffling them accidentally helped by adding regularizing noise).

Stability Selection — Robust Feature Sets via Bootstrapping

Run Lasso on your dataset and it selects features A, B, and C. Remove 10 data points and rerun — now it selects A, B, and D. Feature selection is notoriously unstable: small changes in data can flip which features are chosen, especially when several features carry similar information.

Stability selection, introduced by Meinshausen & Bühlmann in 2010, makes feature selection robust through a clever bootstrap trick: run L1 on many random subsamples of your data, then count how often each feature gets selected. Truly important features show up consistently across subsamples (selection frequency near 1.0). Noise features that got "lucky" on the full dataset appear only occasionally (frequency near 0.0). Thresholds in the 0.6–0.9 range come with theoretical bounds on the expected number of false selections, under the assumptions of the original paper.

def stability_selection(X, y, n_bootstrap=50, lam=0.05, threshold=0.6):
    """Stability selection: bootstrap Lasso, count selection frequencies."""
    n, d = X.shape
    selection_counts = np.zeros(d)
    rng = np.random.RandomState(42)

    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_c = y - y.mean()

    for b in range(n_bootstrap):
        # Subsample 50% of data
        idx = rng.choice(n, size=n // 2, replace=False)
        w = lasso_coordinate_descent(X_std[idx], y_c[idx], lam, epochs=300)
        selection_counts += (np.abs(w) > 1e-8).astype(float)

    frequencies = selection_counts / n_bootstrap
    stable_set = [j for j in range(d) if frequencies[j] >= threshold]
    return frequencies, stable_set

freqs, stable = stability_selection(X, y)

print("Selection frequencies (50 bootstrap runs):")
for idx in np.argsort(-freqs):
    tag = " STABLE" if freqs[idx] >= 0.6 else ""
    print(f"  {names[idx]:<10s} freq = {freqs[idx]:.2f}{tag}")
print(f"\nStable feature set: {[names[i] for i in stable]}")

Output:

Selection frequencies (50 bootstrap runs):
  info_0     freq = 1.00 STABLE
  info_1     freq = 1.00 STABLE
  info_2     freq = 0.98 STABLE
  redun_0    freq = 0.78 STABLE
  redun_1    freq = 0.42
  redun_2    freq = 0.34
  noise_2    freq = 0.14
  noise_3    freq = 0.12
  noise_0    freq = 0.08
  noise_1    freq = 0.06

Stable feature set: ['info_0', 'info_1', 'info_2', 'redun_0']

The three informative features appear in virtually every bootstrap run (98-100%). redun_0 is borderline — it sometimes gets picked up because it's highly correlated with info_0. The other redundant features and all noise features fall well below the 0.6 threshold. Compare this to a single Lasso run at the same λ, which might select 6-8 features including noise: stability selection correctly identifies the core informative set and separates genuinely important features from those that just got lucky.

Try It: Feature Importance Arena

Compare three feature selection methods on the same synthetic dataset. Each dataset has 3 informative features (green), 3 redundant copies (orange), and 4 noise features (gray). Click "New Dataset" to generate fresh data.


Try It: Curse of Dimensionality

Watch test accuracy crash as irrelevant noise features are added. The model has 3 truly informative features. The slider adds 0–100 pure noise features. KNN accuracy degrades as the noise drowns out the signal.


Putting It All Together

We've built six distinct approaches to feature selection, each with different strengths:

- Filter methods (variance, correlation, MI): fast and model-agnostic, but score each feature in isolation
- mRMR: filter speed plus a redundancy penalty, so selected features complement each other
- Wrapper methods (forward selection, RFE): model-aware and interaction-aware, but expensive
- Embedded methods (Lasso): selection happens during training, controlled by a single knob (λ)
- Permutation importance: post-hoc, works with any trained model, no retraining required
- Stability selection: wraps any selector to quantify how robust its choices are

In practice, these methods complement each other. A common workflow: use filter methods to cut 10,000 features down to 500 (fast, cheap), then use Lasso or forward selection to narrow to 20 (model-aware), then validate with permutation importance (check what your final model actually uses). Stability selection wraps around any of these to quantify confidence in your choices.

The key insight is that feature selection isn't just a preprocessing trick — it's a form of regularization. By reducing the input space, you're constraining what the model can learn, which directly combats overfitting. A model trained on 5 carefully selected features will usually beat the same model trained on 500, especially when data is limited. Knowing which features matter is often more valuable than knowing which algorithm to use.

References & Further Reading