Logistic Regression from Scratch: The Single Neuron That Launched Deep Learning
From Linear Regression to Classification
Here's a simple problem: given how many hours a student studied, predict whether they pass or fail an exam. You might reach for linear regression — fit a line to the data and use it to predict. But linear regression gives you a continuous number. It might predict 1.3 for a diligent student, or −0.5 for someone who didn't study at all. What does a probability of −0.5 even mean?
The fix is elegant. Take your linear function z = w·x + b and squash it through the logistic function (also called the sigmoid): σ(z) = 1 / (1 + e^(−z)). This S-shaped curve maps any real number to the range (0, 1) — a valid probability. Large positive inputs get pushed toward 1 ("almost certainly passes"), large negative inputs get pushed toward 0 ("almost certainly fails"), and z = 0 maps to exactly 0.5 ("coin flip").
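To make the squashing concrete, here's a minimal NumPy sketch (the `sigmoid` helper is our own, defined inline) evaluating σ at a few points:

```python
import numpy as np

def sigmoid(z):
    # The logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

for z in [-4.0, 0.0, 4.0]:
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.3f}")
```

By z = ±4 the curve has essentially saturated, which is why large logits translate to near-certain predictions.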
The model is now: P(pass | hours) = σ(w · hours + b). This is logistic regression — and despite its name, it's a classification algorithm, not regression. The decision boundary is the set of inputs where w·x + b = 0, giving σ = 0.5. On one side, the model predicts class 1; on the other, class 0. The boundary is always linear (a line in 2D, a plane in 3D, a hyperplane in higher dimensions), but the probabilities smoothly transition across it.
The sigmoid has a beautiful derivative: σ'(z) = σ(z) · (1 − σ(z)). The derivative is expressed entirely in terms of the function itself — no need to recompute z. This isn't just elegant; it's computationally convenient and will produce a remarkably clean gradient when we derive the loss function. If you've read the activation functions post, you've already met the sigmoid. Now you know where it came from: not from neural networks, but from statistics. And if you've read the softmax post, here's the punchline: softmax is the multi-class generalization of sigmoid. Two-class logistic regression IS softmax with K=2.
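The identity σ'(z) = σ(z)(1 − σ(z)) is easy to verify numerically — a quick sketch comparing the closed form against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central finite-difference estimate of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# The closed form: sigma(z) * (1 - sigma(z))
analytic = sigmoid(z) * (1 - sigmoid(z))

print(np.max(np.abs(numeric - analytic)))  # effectively zero
```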
Maximum Likelihood and Cross-Entropy Loss
We have a model. Now we need to train it — find the weights w and bias b that make the best predictions. The principled approach is maximum likelihood estimation: find the parameters that make the observed data most probable under the model.
For a single data point (x, y) where y ∈ {0, 1}, the model predicts ŷ = σ(w·x + b). The probability of observing y given the model is:
P(y | x) = ŷ^y · (1 − ŷ)^(1−y)
When y = 1 this reduces to ŷ (we want ŷ large). When y = 0 it reduces to (1 − ŷ) (we want ŷ small). Taking the log of the likelihood for all N data points and negating gives us the binary cross-entropy loss:
L = −(1/N) ∑ᵢ [yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ)]
This isn't an arbitrary choice. Cross-entropy is THE natural loss function for probabilistic classification because it's directly derived from the likelihood of the data. The loss functions post introduced cross-entropy; now you see where it comes from. And crucially: for logistic regression, cross-entropy is convex in the model parameters, so gradient descent cannot get trapped in a spurious local minimum. MSE (mean squared error) composed with the sigmoid is NOT convex — it creates a bumpy landscape with plateaus and local minima that gradient descent can get stuck in.
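A quick worked example shows the shape of this loss: confident correct predictions cost almost nothing, while confident wrong ones are punished severely (a small sketch with a local `bce` helper):

```python
import numpy as np

def bce(y, p):
    # Binary cross-entropy for a single (label, prediction) pair
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"{bce(1, 0.9):.3f}")   # confident and correct: small loss (~0.105)
print(f"{bce(1, 0.5):.3f}")   # coin flip: log(2) (~0.693)
print(f"{bce(1, 0.01):.3f}")  # confident and wrong: large loss (~4.605)
```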
Now for the gradient. We need ∂L/∂w to do gradient descent. The chain rule gives us ∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w. Each piece:
- ∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)
- ∂ŷ/∂z = σ(z)(1−σ(z)) = ŷ(1−ŷ)
- ∂z/∂w = x
Multiply them together and something magical happens. The ŷ(1−ŷ) terms cancel with the denominators, and the entire gradient collapses to:
∂L/∂w = (1/N) ∑ᵢ (ŷᵢ − yᵢ) · xᵢ
That's it. Prediction minus truth, scaled by the input. No complicated expressions — just the residual error times the feature vector. This cancellation between the sigmoid derivative and the cross-entropy derivative is not a coincidence; it's why this particular combination of activation and loss became the foundation of deep learning. If you've read backpropagation from scratch, you'll recognize this as backprop through a single layer — the simplest possible case.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

class LogisticRegression:
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.b = 0.0

    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)

    def loss(self, X, y):
        p = self.predict_proba(X)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y, lr=0.1, epochs=200):
        for epoch in range(epochs):
            p = self.predict_proba(X)
            residual = p - y  # the beautiful gradient
            grad_w = X.T @ residual / len(y)
            grad_b = np.mean(residual)
            self.w -= lr * grad_w
            self.b -= lr * grad_b

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
# Generate a 2D dataset
np.random.seed(42)
X_pos = np.random.randn(50, 2) + [2, 2]
X_neg = np.random.randn(50, 2) + [-1, -1]
X = np.vstack([X_pos, X_neg])
y = np.array([1]*50 + [0]*50, dtype=float)
model = LogisticRegression(n_features=2)
print(f"Before training: loss = {model.loss(X, y):.4f}")
model.fit(X, y, lr=0.5, epochs=300)
print(f"After training: loss = {model.loss(X, y):.4f}")
print(f"Accuracy: {np.mean(model.predict(X) == y):.1%}")
print(f"Weights: {model.w}, Bias: {model.b:.3f}")
Notice how clean the training loop is. The gradient is p - y (residual) times the features — just one line of linear algebra. No complicated backpropagation through multiple layers, no chain rule overhead. This is gradient descent at its most transparent.
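If you want to convince yourself the collapsed formula is right, a gradient check compares it against finite differences of the loss. This sketch is self-contained (it redefines `sigmoid` and the loss locally, on random data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w, b = rng.normal(size=3), 0.3

# Analytic gradient: (prediction - truth) times the features
p = sigmoid(X @ w + b)
grad_w = X.T @ (p - y) / len(y)

# Finite-difference estimate, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    grad_fd[i] = (bce_loss(w + e, b, X, y) - bce_loss(w - e, b, X, y)) / (2 * eps)

print(np.max(np.abs(grad_w - grad_fd)))  # agreement to many decimal places
```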
Regularization and the Bias-Variance Tradeoff
Logistic regression can overfit, especially when you have more features than samples. Imagine classifying emails as spam with 10,000 word features but only 200 training emails. The model can find a combination of weights that perfectly separates the training set by memorizing specific word patterns that don't generalize.
The fix is regularization — adding a penalty on the weight magnitudes to the loss. L2 regularization adds λ‖w‖² to the loss, which adds 2λw to the gradient. This shrinks all weights toward zero, producing smoother decision boundaries. L1 regularization adds λ‖w‖₁ (the sum of absolute weights), which pushes individual weights exactly to zero, giving automatic feature selection — the model learns which features are irrelevant and eliminates them. If you've read the regularization post, you've seen these techniques applied to neural networks. Here's the simpler setting, where their effects are most visible.
class RegularizedLogisticRegression:
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.b = 0.0

    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)

    def fit(self, X, y, lr=0.1, epochs=300, lam=0.1, penalty="l2"):
        for epoch in range(epochs):
            p = self.predict_proba(X)
            grad_w = X.T @ (p - y) / len(y)
            grad_b = np.mean(p - y)
            # Add regularization gradient (not applied to bias)
            if penalty == "l2":
                grad_w += 2 * lam * self.w
            elif penalty == "l1":
                grad_w += lam * np.sign(self.w)
            self.w -= lr * grad_w
            self.b -= lr * grad_b
# High-dimensional example: 100 features, only 5 are relevant
np.random.seed(7)
n_samples, n_features, n_relevant = 80, 100, 5
X_train = np.random.randn(n_samples, n_features)
true_w = np.zeros(n_features)
true_w[:n_relevant] = np.random.randn(n_relevant) * 3
y_train = (sigmoid(X_train @ true_w) > 0.5).astype(float)
# L1 finds the sparse solution
model_l1 = RegularizedLogisticRegression(n_features)
model_l1.fit(X_train, y_train, lr=0.05, epochs=500, lam=0.05, penalty="l1")
nonzero = np.sum(np.abs(model_l1.w) > 0.01)
print(f"L1: {nonzero} non-zero weights (true: {n_relevant})")
print(f"Top features: {np.argsort(np.abs(model_l1.w))[-5:]}")
With L1 regularization, most of the 100 weights collapse to zero, leaving only the handful that actually matter. This is remarkably similar to what SVMs achieve through a different mechanism: SVM's margin maximization is mathematically equivalent to L2-regularized hinge loss. Both algorithms fight overfitting, just with different weapons.
Multi-Class Logistic Regression (Softmax)
Binary classification is useful, but the real world has more than two categories. Extending logistic regression to K classes requires learning K weight vectors (one per class) and replacing the sigmoid with the softmax function:
P(y = k | x) = exp(wₖ · x) / ∑ⱼ exp(wⱼ · x)
This is the softmax from the softmax post, now revealed as multi-class logistic regression. For K = 2, you can verify this reduces to the sigmoid: the log-ratio log(P(y=1)/P(y=0)) = (w₁ − w₀) · x, which is a single linear function — exactly the logistic regression model.
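You can check the K = 2 reduction numerically: softmax over the logit pair [z, 0] assigns the first class exactly the probability σ(z) (a small sketch with local helpers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

for z in [-3.0, 0.0, 2.5]:
    p_two_class = softmax(np.array([z, 0.0]))[0]  # probability of the class with logit z
    print(f"z={z:+.1f}: softmax -> {p_two_class:.6f}, sigmoid -> {sigmoid(z):.6f}")
```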
The loss becomes categorical cross-entropy: L = −(1/N) ∑ᵢ ∑ₖ yᵢₖ · log(ŷᵢₖ), where y is one-hot encoded. And the gradient retains the same beautiful form: ∂L/∂wₖ = (1/N) ∑ᵢ (ŷᵢₖ − yᵢₖ) · xᵢ. Prediction minus truth, for every class simultaneously.
def softmax(Z):
    Z_shifted = Z - Z.max(axis=1, keepdims=True)  # numerical stability
    exp_Z = np.exp(Z_shifted)
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

class SoftmaxRegression:
    def __init__(self, n_features, n_classes):
        self.W = np.zeros((n_features, n_classes))
        self.b = np.zeros(n_classes)

    def predict_proba(self, X):
        return softmax(X @ self.W + self.b)

    def fit(self, X, y_onehot, lr=0.1, epochs=300):
        n = len(X)
        for epoch in range(epochs):
            probs = self.predict_proba(X)
            residual = probs - y_onehot  # same elegant gradient
            self.W -= lr * (X.T @ residual) / n
            self.b -= lr * residual.mean(axis=0)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
# Three-class dataset
np.random.seed(99)
centers = [[-2, 0], [2, 0], [0, 3]]
X_parts, y_parts = [], []
for i, c in enumerate(centers):
    X_parts.append(np.random.randn(40, 2) * 0.8 + c)
    y_parts.append(np.full(40, i))
X_multi = np.vstack(X_parts)
y_multi = np.concatenate(y_parts)
y_onehot = np.eye(3)[y_multi]
model_mc = SoftmaxRegression(n_features=2, n_classes=3)
model_mc.fit(X_multi, y_onehot, lr=0.3, epochs=500)
accuracy = np.mean(model_mc.predict(X_multi) == y_multi)
print(f"3-class accuracy: {accuracy:.1%}")
print(f"Class probabilities for [0, 1.5]:")
print(f" {model_mc.predict_proba(np.array([[0, 1.5]]))[0].round(3)}")
Here's the insight that connects everything: the output layer of every classification neural network — every transformer, every CNN, every ViT — is softmax regression applied to learned features. The complete transformer takes token embeddings, processes them through attention and feed-forward layers, and then the final linear layer + softmax is exactly multi-class logistic regression on the transformer's learned representations. The deep network learns features; the last layer classifies them.
From Logistic Regression to Neural Networks
This is the section that ties the entire elementary series together. A neural network with zero hidden layers — a single linear layer feeding a sigmoid — is logistic regression. They're the same algorithm. So what happens when you add layers?
Consider the XOR problem: two features where the output is 1 when exactly one feature is "on" (points at [0,1] and [1,0]) and 0 when both agree (points at [0,0] and [1,1]). No single line can separate these classes. Logistic regression fails: trained on the four points, gradient descent drives the weights toward zero and the model settles on predicting 0.5 for everything — no better than a coin flip.
Now add one hidden layer with 2 neurons: h = σ(W₁x + b₁), ŷ = σ(w₂ · h + b₂). The hidden layer learns a nonlinear feature transformation that makes the problem linearly separable in the new space. Then the output layer — which is logistic regression — classifies effortlessly. The hidden neurons learned to create features that logistic regression can use.
class TinyNeuralNetwork:
    """A 1-hidden-layer network = logistic regression on learned features."""
    def __init__(self, n_input, n_hidden):
        scale = np.sqrt(2 / n_input)
        self.W1 = np.random.randn(n_input, n_hidden) * scale
        self.b1 = np.zeros(n_hidden)
        self.w2 = np.random.randn(n_hidden) * scale
        self.b2 = 0.0

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1       # linear transform
        self.h = sigmoid(self.z1)             # learned features
        self.z2 = self.h @ self.w2 + self.b2  # logistic regression
        return sigmoid(self.z2)               # on those features

    def fit(self, X, y, lr=1.0, epochs=2000):
        for epoch in range(epochs):
            p = self.forward(X)
            # Output layer gradient (same as logistic regression)
            d2 = p - y
            grad_w2 = self.h.T @ d2 / len(y)
            grad_b2 = np.mean(d2)
            # Hidden layer gradient (backpropagation)
            d1 = np.outer(d2, self.w2) * self.h * (1 - self.h)
            grad_W1 = X.T @ d1 / len(y)
            grad_b1 = d1.mean(axis=0)
            self.w2 -= lr * grad_w2
            self.b2 -= lr * grad_b2
            self.W1 -= lr * grad_W1
            self.b1 -= lr * grad_b1
# The XOR problem: logistic regression fails, neural network succeeds
X_xor = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)
# Logistic regression: stuck at 50%
lr_model = LogisticRegression(2)
lr_model.fit(X_xor, y_xor, lr=1.0, epochs=1000)
lr_preds = lr_model.predict_proba(X_xor)
print("Logistic Regression on XOR:")
for x, yt, yp in zip(X_xor, y_xor, lr_preds):
    print(f" {x} -> true={yt:.0f}, pred={yp:.2f}")
# Neural network: solves it
np.random.seed(5)
nn_model = TinyNeuralNetwork(2, 4)
nn_model.fit(X_xor, y_xor, lr=2.0, epochs=3000)
nn_preds = nn_model.forward(X_xor)
print("\nNeural Network on XOR:")
for x, yt, yp in zip(X_xor, y_xor, nn_preds):
    print(f" {x} -> true={yt:.0f}, pred={yp:.2f}")
print(f"\nLearned features for [1,0]: {nn_model.h[2].round(3)}")
print(f"Learned features for [1,1]: {nn_model.h[3].round(3)}")
This is the single most important insight in deep learning: neural networks are logistic regression on learned representations. Every layer extracts increasingly abstract features. The final layer classifies in the transformed space. When the micrograd post built a tiny autograd engine, it was building the machinery to learn these feature transformations automatically. When the feed-forward network post described the FFN in transformers, it described a two-layer feature extractor whose output gets classified by — you guessed it — logistic regression at the end.
Logistic Regression in Practice
When should you use logistic regression? The honest answer: always try it first. It's fast, interpretable, well-calibrated (its outputs are genuine probabilities, unlike most other classifiers), and sets a strong baseline. If logistic regression achieves 92% accuracy on your problem, you know the data is mostly linearly separable and a neural network might squeeze out a few more percent but won't transform the results. If it achieves 55%, the problem is fundamentally nonlinear and you need more powerful tools.
# Comparison: Logistic Regression vs SVM vs Naive Bayes vs Decision Tree
# (using our from-scratch implementations where possible)
results = {
    "Logistic Regression": {"train_speed": "fast (SGD)", "calibration": "excellent",
                            "interpretability": "high (weights = feature importance)",
                            "online": True, "best_for": "first baseline, probabilities needed"},
    "SVM (RBF)": {"train_speed": "slow (O(n^2))", "calibration": "poor",
                  "interpretability": "low (kernel space)",
                  "online": False, "best_for": "small data, clear margins"},
    "Naive Bayes": {"train_speed": "fastest (single pass)", "calibration": "poor",
                    "interpretability": "moderate (per-class feature probs)",
                    "online": True, "best_for": "text, very small data, multi-class"},
    "Decision Tree": {"train_speed": "moderate", "calibration": "poor",
                      "interpretability": "highest (readable rules)",
                      "online": False, "best_for": "mixed feature types, interpretability"},
}
print(f"{'Model':<25} {'Calibration':<14} {'Online?':<10} {'Speed'}")
print("-" * 70)
for name, props in results.items():
    online = "Yes" if props["online"] else "No"
    print(f"{name:<25} {props['calibration']:<14} {online:<10} {props['train_speed']}")
# Key insight: LR coefficients directly show feature importance
# Positive weight = feature pushes toward class 1
# Negative weight = feature pushes toward class 0
# |weight| = strength of influence
print(f"\nLR weights reveal WHY: w = {lr_model.w.round(3)}")
print("Each coefficient tells you the direction AND magnitude of influence")
The comparison reveals logistic regression's unique position. It's the only classifier here that gives calibrated probabilities out of the box — when it says 70% chance, it's right about 70% of the time. Naive Bayes outputs probabilities too, but they're notoriously poorly calibrated (too extreme). SVMs don't output probabilities at all without additional calibration (e.g., Platt scaling). Decision trees give per-leaf class fractions, but these are coarse and unstable. For applications where you need to trust the confidence scores (medical diagnosis, credit scoring, A/B test analysis), logistic regression is often the right choice even today.
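The calibration claim is easy to test on simulated data. The sketch below (assumptions: labels genuinely generated by a logistic model, plain full-batch gradient descent) bins the fitted model's predictions and compares each bin's mean prediction against the observed positive rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=(n, 2))
true_w = np.array([1.5, -1.0])
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)

# Fit logistic regression with full-batch gradient descent
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * X.T @ (p - y) / n
    b -= 0.5 * np.mean(p - y)

# Reliability check: within each bin of predicted probability,
# the mean prediction should match the observed positive rate
p = sigmoid(X @ w + b)
for lo in [0.1, 0.3, 0.5, 0.7]:
    mask = (p >= lo) & (p < lo + 0.2)
    print(f"predicted ~{p[mask].mean():.2f}, observed {y[mask].mean():.2f}")
```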
Try It: Decision Boundary Builder
Click to place red (class 0) or blue (class 1) points. The decision boundary updates in real time. Color intensity shows prediction confidence.
The Bridge: Why This Matters
Logistic regression isn't just another algorithm in the catalog. It's the origin point of modern deep learning. A single artificial neuron with a sigmoid activation IS logistic regression. Adding more neurons in a layer creates a wider feature extractor. Stacking layers creates depth. Replace sigmoid with ReLU, add attention, scale to billions of parameters — and you've built GPT. But the output layer? Still logistic regression. Still prediction minus truth. Still the same gradient that falls out of maximum likelihood estimation and the sigmoid derivative cancellation.
The next time you see a neural network architecture diagram, look at the final layer. It's logistic regression, faithfully classifying in whatever feature space the preceding layers learned to construct. Every layer below it exists to make logistic regression's job easier.
Try It: From 1 Neuron to a Network
The same non-separable dataset (concentric circles), three models. Watch logistic regression fail, then see how adding hidden layers creates the feature transformations that make classification possible.
References & Further Reading
- David Cox — The Regression Analysis of Binary Sequences (1958) — the original paper introducing logistic regression for binary outcomes
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning, Chapter 4 — rigorous treatment of linear methods for classification
- Christopher Bishop — Pattern Recognition and Machine Learning, Chapter 4 — Bayesian perspective on logistic regression
- Andrew Ng — CS229 Lecture Notes — clear derivation of logistic regression within the exponential family framework
- scikit-learn — Logistic Regression Documentation — practical implementation with solver options and regularization
- DadOps — Loss Functions from Scratch — cross-entropy in the broader context of all loss functions