
Linear Regression from Scratch: The Algorithm That Launched a Thousand Models

The Simplest Useful Model

Every machine learning algorithm you’ve ever used — neural networks, gradient boosting, SVMs — descends from one ancestor: linear regression. It’s the fruit fly of machine learning: simple enough to dissect completely, powerful enough to solve real problems, and foundational enough that understanding it deeply makes everything else click.

Most tutorials rush through linear regression to get to the “interesting” stuff. Let’s not. Let’s build it from scratch, derive every equation, and understand exactly why it works.

Linear regression is the gateway drug of ML. In this single algorithm, you’ll encounter loss functions, gradient descent, closed-form solutions, regularization, and overfitting — all in their purest, most transparent forms. Master these ideas here and you’ll recognize them everywhere: in deep learning losses, in optimizers, in logistic regression.

The model itself is almost laughably simple:

y = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Or in compact matrix form: y = Xw + b. That’s it. A weighted sum of features plus a bias term. The entire challenge is finding the right weights — and there turn out to be surprisingly many ways to do it.

One critical distinction before we start: “linear” in linear regression means linear in the parameters, not necessarily linear in the features. The model y = w₁x + w₂x² + w₃x³ is still linear regression — a polynomial in x, but linear in the weights w. This distinction will matter when we get to polynomial regression.
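To make the distinction concrete, here’s a minimal sketch (with made-up data) showing that a quadratic model is still solvable by ordinary least squares, precisely because it’s linear in the weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
# Nonlinear in x, but linear in the weights (w1, w2)
y = 1.5 * x + 0.5 * x**2 + rng.normal(size=200) * 0.1

# Design matrix with columns [x, x^2] — plain least squares applies
A = np.column_stack([x, x**2])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # close to [1.5, 0.5]
```

The solver never knows the features came from powers of the same variable; it just sees a linear system in w.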

Simple Linear Regression — One Feature, One Line

Let’s start with the absolute simplest case: one input feature x, one output y, and we want to find the best line y = wx + b.

But what does “best” mean? We need a way to measure how wrong our line is. The standard choice is the sum of squared residuals:

L = Σ(yᵢ − (wxᵢ + b))²

Why squared errors? Four good reasons: (1) they penalize large errors more than small ones, (2) the function is differentiable everywhere, (3) it has a clean closed-form solution, and (4) it corresponds to maximum likelihood estimation under the assumption that noise is Gaussian. If you want the full story on loss functions, see our deep dive on MSE and friends.
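Reason (4) deserves one more line. Under the model yᵢ = wxᵢ + b + εᵢ with Gaussian noise εᵢ ~ N(0, σ²), maximizing the likelihood is exactly minimizing the squared loss (the log turns the product into a sum, and everything but the squared term is constant):

```latex
\hat{w}, \hat{b}
= \arg\max_{w,b} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - wx_i - b)^2}{2\sigma^2}\right)
= \arg\min_{w,b} \sum_{i=1}^{n} (y_i - wx_i - b)^2
```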

To find the minimum, we take partial derivatives with respect to w and b, set them to zero, and solve. After some algebra (a satisfying exercise to work through on paper), we get:

w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b = ȳ − w·x̄

Look at that formula for w. The numerator is the covariance of x and y. The denominator is the variance of x. So the optimal slope is literally Cov(x, y) / Var(x). The best linear fit is determined by how much x and y co-vary, normalized by how much x varies on its own.
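You can verify the Cov/Var identity numerically — a quick sketch; note the explicit ddof so both estimators use the same n − 1 normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.5 * x + 1.0 + rng.normal(size=50) * 0.1

# Closed-form slope from the derivation above
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Same number via covariance / variance (the normalization cancels in the ratio)
w_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(w, w_cov)  # identical up to floating point
```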

Let’s implement this from scratch:

import numpy as np

# Generate noisy linear data: y = 3x + 7 + noise
np.random.seed(42)
X = 2 * np.random.rand(100)
y = 3 * X + 7 + np.random.randn(100) * 0.8

# Closed-form solution — no libraries needed
x_mean, y_mean = X.mean(), y.mean()
w = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - w * x_mean

print(f"Slope:     {w:.4f}  (true: 3.0)")
print(f"Intercept: {b:.4f}  (true: 7.0)")

# R-squared: proportion of variance explained
y_pred = w * X + b
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R²:        {r_squared:.4f}")

# Sanity check against sklearn
from sklearn.linear_model import LinearRegression
sk_model = LinearRegression().fit(X.reshape(-1, 1), y)
print(f"\nsklearn slope: {sk_model.coef_[0]:.4f}, intercept: {sk_model.intercept_:.4f}")
print("Match: ✓" if abs(w - sk_model.coef_[0]) < 1e-10 else "Mismatch!")

The output confirms our scratch implementation matches sklearn exactly. The recovered slope (around 3.07) and intercept (around 6.88) are close to the true values of 3 and 7 — the small error is just noise doing its thing.

Try It: Fit a Line

Click anywhere on the canvas to place data points. The regression line updates in real time. Watch how the slope, intercept, and R² change as you add points. Use Clear to start over or Add Noise to perturb existing points.


The Normal Equation — Multiple Features in One Shot

Real-world problems rarely have a single feature. Houses have square footage and bedrooms and location. Patients have age and blood pressure and cholesterol. We need to extend our approach to d features.

In matrix form, our model is y = Xw, where X is an n×d matrix (n samples, d features — we absorb the bias into w by prepending a column of ones to X). The loss becomes:

L = (y − Xw)ᵀ(y − Xw)

Taking the matrix derivative and setting it to zero gives us the famous normal equation:

w = (XᵀX)⁻¹ Xᵀy

This is the multivariate generalization of the Cov(x, y) / Var(x) formula from the single-feature case.

The normal equation is beautiful: one line of code, exact solution, no iteration. But it has limits: inverting XᵀX costs O(d³) in the number of features, and the inverse doesn’t exist at all when XᵀX is singular — which happens whenever features are linearly dependent (collinear) or d > n.

The pseudoinverse (np.linalg.pinv) handles the singular case gracefully by using SVD internally. Let’s see both approaches:

import numpy as np

# Generate multi-feature data with known weights
np.random.seed(42)
n, d = 200, 5
true_w = np.array([3.0, -1.5, 0.0, 2.0, 0.0])  # 3 active, 2 zero
true_b = 4.0

X = np.random.randn(n, d)
y = X @ true_w + true_b + np.random.randn(n) * 0.5

# Prepend column of ones for bias
X_aug = np.column_stack([np.ones(n), X])

# Method 1: Normal equation
w_normal = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y

# Method 2: Pseudoinverse (more stable)
w_pinv = np.linalg.pinv(X_aug) @ y

print("True weights:    ", np.round([true_b] + list(true_w), 3))
print("Normal equation: ", np.round(w_normal, 3))
print("Pseudoinverse:   ", np.round(w_pinv, 3))

# Now break it: add a collinear feature (x6 = 2 * x1)
X_bad = np.column_stack([X, 2 * X[:, 0]])
X_bad_aug = np.column_stack([np.ones(n), X_bad])

cond_number = np.linalg.cond(X_bad_aug.T @ X_bad_aug)
print(f"\nCondition number with collinear feature: {cond_number:.2e}")
print("(Anything above ~1e12 means numerical trouble)")

# Pseudoinverse still works
w_pinv_bad = np.linalg.pinv(X_bad_aug) @ y
print("Pseudoinverse still finds a solution: ✓")

The normal equation recovers our true weights almost perfectly. But when we add a collinear feature (x₆ = 2·x₁), the condition number explodes and the normal equation becomes numerically unstable. The pseudoinverse, backed by SVD, handles it gracefully.

Gradient Descent — When the Closed Form Won’t Scale

The normal equation is O(d³). For a dataset with 1,000 features, that means roughly a billion operations for the matrix inverse. For 1,000,000 features (common in NLP and computer vision), it’s intractable. We need an iterative approach.

Enter gradient descent: compute the gradient of the loss, take a small step in the opposite direction, repeat. For MSE, the gradient has a clean form:

∇L = −(2/n) Xᵀ(y − Xw)

Read that term (y − Xw) — those are the residuals. Each component of the gradient measures how strongly the corresponding feature co-varies with the residuals. If a feature still correlates with the error, the gradient says “adjust that weight.” Beautifully intuitive.
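If you don’t trust the calculus, check the gradient against central finite differences — a quick sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def mse(w):
    return np.mean((y - X @ w) ** 2)

# Analytic gradient: -(2/n) X^T (y - Xw)
grad = -(2 / n) * X.T @ (y - X @ w)

# Numeric gradient: perturb one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (mse(w + eps * np.eye(d)[j]) - mse(w - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])
print(np.max(np.abs(grad - num_grad)))  # essentially zero
```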

The update rule is simple: w ← w − α · ∇L, where α is the learning rate.

One crucial practical detail: feature scaling matters enormously. Without normalization, the loss surface is an elongated ellipsoid — gradient descent zigzags back and forth along the narrow direction. With standardization (zero mean, unit variance), the loss surface becomes spherical and GD takes a direct path to the minimum. For a deep exploration of gradient descent variants, see our optimizers post.

Here’s the beautiful thing about linear regression’s loss surface: it’s a convex paraboloid. There’s exactly one minimum, and gradient descent is guaranteed to find it (with a small enough learning rate). No local minima, no saddle points — unlike neural networks.
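You can spot-check convexity numerically: for any two weight vectors, the loss at their midpoint never exceeds the average of their losses. A sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

def loss(w):
    return np.mean((y - X @ w) ** 2)

w1 = rng.normal(size=4)
w2 = rng.normal(size=4)

# Midpoint convexity: f((w1 + w2) / 2) <= (f(w1) + f(w2)) / 2
mid = loss(0.5 * (w1 + w2))
avg = 0.5 * (loss(w1) + loss(w2))
print(mid <= avg)  # True for a convex loss
```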

import numpy as np

def gradient_descent_lr(X, y, lr=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []

    for epoch in range(epochs):
        y_pred = X @ w + b
        residual = y_pred - y

        grad_w = (2 / n) * (X.T @ residual)
        grad_b = (2 / n) * np.sum(residual)

        w -= lr * grad_w
        b -= lr * grad_b

        loss = np.mean(residual ** 2)
        losses.append(loss)
    return w, b, losses

# Generate data
np.random.seed(42)
n, d = 500, 5
true_w = np.array([3.0, -1.5, 0.0, 2.0, 0.0])
X_raw = np.random.randn(n, d) * np.array([1, 10, 0.1, 5, 100])  # different scales
y = X_raw @ true_w + 4.0 + np.random.randn(n) * 0.5

# Without scaling — slow, zigzaggy convergence
w_raw, b_raw, loss_raw = gradient_descent_lr(X_raw, y, lr=0.00001, epochs=2000)

# With scaling — fast, direct convergence
X_mean, X_std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_scaled = (X_raw - X_mean) / X_std
w_scaled, b_scaled, loss_scaled = gradient_descent_lr(X_scaled, y, lr=0.1, epochs=200)

# Convert scaled weights back to original space
w_original = w_scaled / X_std
b_original = b_scaled - np.sum(w_scaled * X_mean / X_std)

print("True weights:       ", true_w)
print("GD (no scaling):    ", np.round(w_raw, 3))
print("GD (with scaling):  ", np.round(w_original, 3))
print(f"\nLoss after 2000 epochs (no scaling):  {loss_raw[-1]:.4f}")
print(f"Loss after 200 epochs (with scaling): {loss_scaled[-1]:.4f}")
print("Scaling wins: 10x fewer epochs, 10,000x larger learning rate")

The difference is dramatic. Without feature scaling, gradient descent crawls with a tiny learning rate (0.00001) and still hasn’t converged after 2,000 epochs. With standardized features, we use a learning rate 10,000× larger and converge in just 200 epochs. This is why every ML pipeline starts with feature scaling.

Polynomial Regression — Nonlinearity from a Linear Model

Here’s one of the most elegant tricks in machine learning: you can model nonlinear relationships using linear regression. The secret? Feature engineering.

Take a single feature x and create new features: x², x³, x⁴, and so on. Now the model y = w₁x + w₂x² + w₃x³ + b is a polynomial in x — but it’s still a linear function of the parameters (w₁, w₂, w₃, b). We can use the exact same normal equation to solve for the weights.

This is the simplest example of a broader idea: transform your features into a richer space, then fit a linear model in that space. Polynomial kernels in SVMs do this implicitly. Neural network hidden layers learn the transformation from data. But the polynomial basis expansion is where the concept is clearest.

Polynomial regression is also the perfect playground for understanding the bias-variance tradeoff:

import numpy as np

# Nonlinear ground truth: y = sin(x) + noise
np.random.seed(42)
X_train = np.sort(np.random.uniform(0, 2 * np.pi, 20))
y_train = np.sin(X_train) + np.random.randn(20) * 0.3
X_test = np.linspace(0, 2 * np.pi, 200)
y_test_true = np.sin(X_test)

def poly_features(x, degree):
    return np.column_stack([x ** i for i in range(degree + 1)])

print(f"{'Degree':>6} {'Train MSE':>12} {'Test MSE':>12} {'Max |coeff|':>14}")
print("-" * 48)

for degree in [1, 3, 5, 10, 15]:
    X_poly = poly_features(X_train, degree)
    w = np.linalg.pinv(X_poly) @ y_train

    train_pred = X_poly @ w
    test_pred = poly_features(X_test, degree) @ w

    train_mse = np.mean((y_train - train_pred) ** 2)
    test_mse = np.mean((y_test_true - test_pred) ** 2)

    print(f"{degree:>6} {train_mse:>12.6f} {test_mse:>12.6f} {np.max(np.abs(w)):>14.1f}")

# Degree 15: enormous coefficients = overfitting fingerprint

Watch the pattern: training MSE drops monotonically as degree increases — more flexibility always fits training data better. But test MSE forms a U-curve: it drops at first (reducing underfitting), then spikes (overfitting kicks in). The max coefficient magnitude tells the story — degree 15 has coefficients in the thousands, a telltale sign of overfitting. This motivates our next topic: regularization.

Ridge and Lasso — Taming the Coefficients

Polynomial regression showed us the problem: unconstrained models grow enormous coefficients to overfit training data. The fix is elegantly simple — add a penalty on the coefficient magnitudes to the loss function.

Ridge Regression (L2 Penalty)

Ridge regression adds the squared magnitude of the weights to the loss:

L = ||y − Xw||² + λ||w||²

This has a clean closed-form solution: w = (XᵀX + λI)⁻¹ Xᵀy. Notice that λI term — it adds λ to every diagonal element of XᵀX, guaranteeing the matrix is always invertible. Ridge regression simultaneously solves the overfitting problem and the collinearity problem we hit with the normal equation.

Ridge shrinks all coefficients toward zero, but never to exactly zero. From a Bayesian perspective, L2 regularization is equivalent to placing a Gaussian prior on the weights — see our Bayesian inference post for the full story.
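The invertibility claim is easy to demonstrate with deliberately collinear data — a sketch; the choice λ = 1 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1])  # perfectly collinear: rank 1
y = 3 * x1 + rng.normal(size=100) * 0.1

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 1 — the plain inverse doesn't exist

# Adding lam * I makes the matrix full rank, so ridge always has a solution
lam = 1.0
w_ridge = np.linalg.inv(XtX + lam * np.eye(2)) @ X.T @ y
print(w_ridge)
```

Ridge splits the weight between the duplicated features, but their combined effect (w₁ + 2w₂) still recovers the true slope of 3.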

Lasso Regression (L1 Penalty)

Lasso uses the absolute values instead:

L = ||y − Xw||² + λ||w||₁

The L1 penalty has no closed-form solution (absolute value isn’t differentiable at zero), so we use coordinate descent: optimize one weight at a time while holding the others fixed. The key magic of Lasso is that it drives some coefficients to exactly zero — performing automatic feature selection. The geometry explains why: the L1 constraint region is a diamond with corners on the axes, and the loss contours tend to hit those corners.
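For the unnormalized loss above, the one-dimensional coordinate update works out to the soft-thresholding operator — the source of those exact zeros:

```latex
w_j \leftarrow \frac{S\!\left(\rho_j,\ \lambda/2\right)}{\sum_i x_{ij}^2},
\qquad
S(\rho, t) = \operatorname{sign}(\rho)\,\max(|\rho| - t,\ 0),
\qquad
\rho_j = \sum_i x_{ij}\,\Big(y_i - \sum_{k \ne j} x_{ik} w_k\Big)
```

Whenever the partial correlation ρⱼ falls inside the band [−λ/2, λ/2], the weight snaps to exactly zero.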

This makes Lasso especially powerful when you suspect only a few features truly matter. For the full treatment of L1, L2, dropout, and other regularization techniques, see our regularization deep dive.

import numpy as np

# Data with 3 real features, 3 irrelevant ones
np.random.seed(42)
n = 100
X = np.random.randn(n, 6)
true_w = np.array([4.0, -2.0, 3.0, 0.0, 0.0, 0.0])  # only first 3 matter
y = X @ true_w + 1.0 + np.random.randn(n) * 0.5

# Prepend ones column for bias
X_aug = np.column_stack([np.ones(n), X])

def ridge(X, y, lam):
    d = X.shape[1]
    I = np.eye(d)
    I[0, 0] = 0  # don't regularize bias
    return np.linalg.inv(X.T @ X + lam * I) @ X.T @ y

def lasso_cd(X, y, lam, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            xj_sq = np.sum(X[:, j] ** 2) / n
            if j == 0:  # don't regularize bias
                w[j] = rho / xj_sq
            else:
                w[j] = np.sign(rho) * max(abs(rho) - lam / (2 * n), 0) / xj_sq
    return w

print("True weights: [bias=1.0, 4.0, -2.0, 3.0, 0.0, 0.0, 0.0]\n")

for lam in [0.01, 0.1, 1.0, 10.0]:
    w_r = ridge(X_aug, y, lam)
    w_l = lasso_cd(X_aug, y, lam)
    print(f"λ = {lam:>5.2f}")
    print(f"  Ridge: {np.round(w_r, 2)}")
    print(f"  Lasso: {np.round(w_l, 2)}")
    lasso_zeros = np.sum(np.abs(w_l[1:]) < 0.01)
    print(f"  Lasso zeros: {lasso_zeros}/6 features\n")

As λ increases, Ridge smoothly shrinks all coefficients toward zero but none reach it. Lasso, on the other hand, starts snapping coefficients to exactly zero — and it correctly identifies that features 4, 5, and 6 are irrelevant. This is automatic feature selection, and it’s one of the most useful properties in practical ML.

Try It: Regularization Path — Ridge vs Lasso

Drag the λ slider to see how Ridge and Lasso shrink coefficients differently. The data has 6 features but only 3 have non-zero true weights (shown as dashed lines). Watch how Lasso snaps irrelevant features to exactly zero while Ridge just shrinks everything.


Evaluation and Diagnostics — Is Your Model Any Good?

You’ve fit a model. How do you know it’s actually useful? The go-to metric is R² (the coefficient of determination):

R² = 1 − SSres / SStot

R² tells you the proportion of variance in y that your model explains. R² = 1 means a perfect fit, R² = 0 means your model is no better than predicting the mean, and R² < 0 means you’re actually worse than the mean (this happens more often than you’d think with overfit models evaluated on test data).
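The R² < 0 case surprises people, so here’s a sketch with deliberately bad predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
y_pred_bad = -2 * y  # anti-correlated with the truth

ss_res = np.sum((y - y_pred_bad) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)  # strongly negative — worse than just predicting the mean
```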

But R² alone isn’t enough. Your residuals — the differences between predicted and actual values — are the secret diagnostic weapon. A good model’s residuals should look like random noise: no patterns, no trends, constant spread.

The classical assumptions for linear regression (known by the mnemonic LINE): Linearity, Independence, Normality of residuals, and Equal variance (homoscedasticity). Violations don’t invalidate the model, but they affect confidence intervals and hypothesis tests. For comprehensive evaluation metrics beyond R², see our evaluation metrics post.

import numpy as np

np.random.seed(42)
n = 200

# Scenario 1: Linear data (model is appropriate)
X1 = np.random.randn(n, 3)
y1 = X1 @ [2.0, -1.0, 0.5] + 3.0 + np.random.randn(n) * 0.8

# Scenario 2: Nonlinear data (model is misspecified)
X2 = np.random.randn(n, 1)
y2 = 2 * np.sin(X2[:, 0]) + X2[:, 0] ** 2 + np.random.randn(n) * 0.3

def evaluate(X, y, label):
    X_aug = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.pinv(X_aug) @ y
    y_pred = X_aug @ w
    residuals = y - y_pred

    n, d = X.shape
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - d - 1)
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))

    print(f"\n--- {label} ---")
    print(f"R²:          {r2:.4f}")
    print(f"Adjusted R²: {adj_r2:.4f}")
    print(f"RMSE:        {rmse:.4f}")
    print(f"MAE:         {mae:.4f}")

    # Residual pattern check
    # Split predictions into 4 bins, check if residual means differ
    bins = np.percentile(y_pred, [25, 50, 75])
    bin_idx = np.digitize(y_pred, bins)
    bin_means = [residuals[bin_idx == i].mean() for i in range(4)]
    pattern = max(bin_means) - min(bin_means)
    if pattern > 0.5:
        print(f"⚠ Residual pattern detected (spread={pattern:.2f}) — model may be misspecified")
    else:
        print(f"✓ Residuals look random (spread={pattern:.2f})")

evaluate(X1, y1, "Linear Data (Good Fit)")
evaluate(X2, y2, "Nonlinear Data (Misspecified Model)")

The linear scenario shows a high R², matching adjusted R², and random residuals — a healthy model. The nonlinear scenario has decent R² but a clear residual pattern, signaling that a linear model is missing the underlying curve. This is exactly the kind of diagnostic that saves you from trusting a misleading metric.

References & Further Reading

Related DadOps posts: