
Naive Bayes from Scratch

Bayes' Theorem — Updating Beliefs with Evidence

Every spam filter you've ever used is powered by a theorem from 1763. Thomas Bayes, an English Presbyterian minister, worked out how to update your beliefs when you see new evidence. Two and a half centuries later, his formula still powers one of the fastest, simplest, and most surprisingly effective classifiers in all of machine learning.

Here's the setup. A rare disease affects 1% of the population. You take a test that's 95% accurate for people who have the disease (sensitivity) and 90% accurate for people who don't (specificity). You test positive. What's the probability you actually have the disease?

Most people guess around 95%. The actual answer is about 8.8%. Your gut is wrong because it ignores the base rate — the disease is so rare that the 10% false positive rate on the 99% healthy population generates far more false alarms than the 95% true positive rate on the 1% sick population. This is the base rate fallacy, and Bayes' theorem corrects it:

P(disease | positive) = P(positive | disease) × P(disease) / P(positive)

= 0.95 × 0.01 / (0.95 × 0.01 + 0.10 × 0.99)

= 0.0095 / 0.1085 ≈ 0.088
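The same arithmetic as a quick sanity check in Python (not part of any classifier, just the formula above):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                 # P(positive | disease) * P(disease)
    false_pos = (1 - specificity) * (1 - prior)    # false alarms from the healthy 99%
    return true_pos / (true_pos + false_pos)

print(round(posterior(0.01, 0.95, 0.90), 4))  # 0.0876
```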

Now replace "disease" with "spam" and "positive test" with "email contains the word 'free'." Classification is just Bayesian inference: given the features we observe, what's the most probable class? The formula becomes P(class | features) ∝ P(features | class) × P(class).

There's one problem: computing P(features | class) is intractable for high-dimensional data. If you have 10,000 vocabulary words, each either present or absent, there are 2^10,000 possible feature combinations per class — more than atoms in the universe. The "naive" fix? Assume all features are conditionally independent given the class:

P(features | class) = P(f₁ | class) × P(f₂ | class) × ... × P(f_d | class)

This assumption is almost always wrong — the words "Nigerian" and "prince" are obviously not independent in spam emails. Yet Naive Bayes works remarkably well anyway. We'll spend an entire section later explaining why. First, let's build it.
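The blow-up is easy to verify by counting parameters: a full joint distribution over d binary features needs 2^d − 1 probabilities per class, while the naive factorization needs only d.

```python
d = 10_000                   # vocabulary size
full_joint = 2 ** d - 1      # parameters for the unrestricted joint, per class
naive = d                    # one parameter per word, per class

print(len(str(full_joint)))  # 3011 -- the full joint's parameter count has over 3,000 digits
print(naive)                 # 10000
```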

Gaussian Naive Bayes — Continuous Features

When your features are continuous numbers (height, weight, temperature), the simplest Naive Bayes variant assumes each feature follows a Gaussian (normal) distribution within each class. Training is embarrassingly simple: compute the mean and variance of each feature for each class. That's it. No gradient descent, no iterations, no learning rate to tune — just counting and averaging.

import numpy as np

class GaussianNB:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}  # {class: (means, variances, prior)}
        for c in self.classes:
            X_c = X[y == c]
            self.params[c] = (
                X_c.mean(axis=0),            # mean of each feature
                X_c.var(axis=0) + 1e-9,      # variance (+ epsilon for stability)
                len(X_c) / len(X)            # class prior P(class)
            )
        return self

    def _log_gaussian(self, x, mean, var):
        """Log of Gaussian PDF: avoid underflow by staying in log-space."""
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def predict(self, X):
        predictions = []
        for x in X:
            log_posteriors = []
            for c in self.classes:
                mean, var, prior = self.params[c]
                # log P(class) + sum of log P(feature_i | class)
                log_post = np.log(prior) + np.sum(self._log_gaussian(x, mean, var))
                log_posteriors.append(log_post)
            predictions.append(self.classes[np.argmax(log_posteriors)])
        return np.array(predictions)

# Example: classify 2D data
np.random.seed(42)
X_train = np.vstack([
    np.random.randn(50, 2) + [2, 2],    # class 0: centered at (2,2)
    np.random.randn(50, 2) + [-1, -1],  # class 1: centered at (-1,-1)
])
y_train = np.array([0]*50 + [1]*50)

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict(np.array([[0, 0], [3, 3], [-2, -2]])))
# Output: [1 0 1]

Let's trace through the math for the test point [0, 0]. For class 0, whose features have means around (2, 2): the point [0, 0] is about 2 standard deviations away on each feature, so the Gaussian PDF gives a relatively small density. For class 1, whose features have means around (-1, -1): the point [0, 0] is only about 1 standard deviation away — higher density. Multiply by the priors (both 0.5 here), and class 1 wins. The entire computation is just evaluating two Gaussian bells and picking the taller one at the query point.
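Plugging idealized parameters into the same log-density (means exactly (2, 2) and (−1, −1), unit variances; the fitted values will be close but not identical):

```python
import numpy as np

def log_gaussian(x, mean, var=1.0):
    # Log of the Gaussian PDF, same formula as _log_gaussian above
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([0.0, 0.0])
log_post_0 = np.log(0.5) + log_gaussian(x, np.array([2.0, 2.0])).sum()
log_post_1 = np.log(0.5) + log_gaussian(x, np.array([-1.0, -1.0])).sum()
print(round(log_post_0, 2), round(log_post_1, 2))  # -6.53 -3.53 -> class 1 wins
```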

Training is O(N × d) — one pass through the data — and prediction is O(K × d) per sample, where K is the number of classes and d the number of features. You can classify a million samples in the time it takes a neural network to load its weights.
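Part of that speed is that the predict loop collapses into a single broadcasted expression. A vectorized sketch of the same math (`predict_fast` is a hypothetical helper, not part of the class above):

```python
import numpy as np

def predict_fast(X, means, variances, priors):
    """X: (N, d); means, variances: (K, d); priors: (K,). Returns (N,) class indices."""
    diff = X[:, None, :] - means[None, :, :]                    # (N, K, d)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None]
                      + diff ** 2 / variances[None])
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)    # (N, K)
    return np.argmax(log_post, axis=1)

means = np.array([[2.0, 2.0], [-1.0, -1.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
print(predict_fast(np.array([[0.0, 0.0], [3.0, 3.0]]), means, variances, priors))
# [1 0]
```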

Multinomial Naive Bayes — The Text Classifier

Text classification is where Naive Bayes really shines. The idea: represent each document as a bag of words — a vector of word counts, ignoring order. "The cat sat on the mat" becomes {"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1}. Then model P(word | class) using a multinomial distribution: the probability of seeing each word, given the class.
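The bag-of-words step itself is one line with the standard library (using `collections.Counter` here; the class below builds its own vocabulary index instead):

```python
from collections import Counter

bag = Counter("The cat sat on the mat".lower().split())
print(bag)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```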

import numpy as np

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter

    def fit(self, documents, labels):
        self.classes = list(set(labels))
        self.vocab = set()
        for doc in documents:
            self.vocab.update(doc.lower().split())
        self.vocab = sorted(self.vocab)
        self.vocab_idx = {w: i for i, w in enumerate(self.vocab)}
        V = len(self.vocab)

        self.log_prior = {}
        self.log_likelihood = {}  # {class: array of log P(word|class)}

        for c in self.classes:
            class_docs = [d for d, l in zip(documents, labels) if l == c]
            self.log_prior[c] = np.log(len(class_docs) / len(documents))

            # Count all words in this class
            word_counts = np.zeros(V) + self.alpha  # start with smoothing
            for doc in class_docs:
                for word in doc.lower().split():
                    if word in self.vocab_idx:
                        word_counts[self.vocab_idx[word]] += 1

            # Normalize to log-probabilities
            total = word_counts.sum()
            self.log_likelihood[c] = np.log(word_counts / total)
        return self

    def predict(self, document):
        scores = {}
        words = document.lower().split()
        for c in self.classes:
            score = self.log_prior[c]
            for word in words:
                if word in self.vocab_idx:
                    score += self.log_likelihood[c][self.vocab_idx[word]]
            scores[c] = score
        return max(scores, key=scores.get), scores

# Build a tiny spam classifier
spam_emails = [
    "free money click now", "winner selected claim prize",
    "earn cash fast free", "discount offer limited time",
    "free gift card winner", "click here claim reward now",
    "urgent money transfer free", "lottery winner congratulations",
    "free trial offer click", "earn money from home fast",
]
ham_emails = [
    "meeting tomorrow at noon", "project update attached review",
    "code review needed please", "lunch plans for friday team",
    "quarterly report summary data", "bug fix deployed to staging",
    "design review scheduled wednesday", "sprint planning next week",
    "test results look good", "documentation update merged today",
]

docs = spam_emails + ham_emails
labels = ["spam"]*10 + ["ham"]*10

clf = MultinomialNB(alpha=1.0).fit(docs, labels)
prediction, scores = clf.predict("free money winner")
print(f"Prediction: {prediction}")
print(f"Log P(spam|doc) = {scores['spam']:.2f}")
print(f"Log P(ham|doc)  = {scores['ham']:.2f}")
# Output: Prediction: spam
#         Log P(spam|doc) = -7.51
#         Log P(ham|doc)  = -11.94

Let's trace the math for "free money winner." The word "free" appears in 5 of 10 spam emails but 0 of 10 ham emails. With Laplace smoothing (α=1), P("free" | spam) = (5+1)/(total_spam_words + V) — a big number relative to P("free" | ham) = (0+1)/(total_ham_words + V). Same story for "money" (3 spam vs 0 ham) and "winner" (3 spam vs 0 ham). Each word independently screams "spam," and multiplying their probabilities (or adding their logs) makes the case overwhelming.

That Laplace smoothing parameter α is crucial. Without it, any word that appears in spam but never in ham gets P(word | ham) = 0, which makes P(ham | doc) = 0 regardless of every other word. One unseen word vetoes the entire classification. Smoothing adds a pseudocount to every word, ensuring nothing has zero probability — it's the Bayesian equivalent of "don't be too sure about anything."
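A quick numeric check of the veto effect, with made-up counts (a vocabulary of 30 words, 40 total ham words, and "free" never seen in ham):

```python
V = 30                 # hypothetical vocabulary size
total_ham_words = 40   # hypothetical total word count across all ham emails
count_free = 0         # "free" never appears in ham training data

p_unsmoothed = count_free / total_ham_words            # 0.0: vetoes ham outright
p_laplace = (count_free + 1) / (total_ham_words + V)   # 1/70: small but nonzero
print(p_unsmoothed, round(p_laplace, 4))  # 0.0 0.0143
```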

Bernoulli Naive Bayes and Smoothing

Multinomial NB cares about how many times a word appears. Bernoulli NB only cares about whether a word appears — features are binary: present or absent. This seemingly minor difference has a subtle but important consequence: Bernoulli NB explicitly models the absence of words.

import numpy as np

class BernoulliNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.classes = list(set(labels))
        self.vocab = set()
        for doc in documents:
            self.vocab.update(doc.lower().split())
        self.vocab = sorted(self.vocab)
        V = len(self.vocab)

        self.log_prior = {}
        self.log_prob = {}      # log P(word present | class)
        self.log_prob_neg = {}  # log P(word absent | class)

        for c in self.classes:
            class_docs = [d for d, l in zip(documents, labels) if l == c]
            N_c = len(class_docs)
            self.log_prior[c] = np.log(N_c / len(documents))

            # Count documents containing each word
            word_present = np.zeros(V)
            for doc in class_docs:
                words_in_doc = set(doc.lower().split())
                for i, w in enumerate(self.vocab):
                    if w in words_in_doc:
                        word_present[i] += 1

            # Smoothed probabilities
            p = (word_present + self.alpha) / (N_c + 2 * self.alpha)
            self.log_prob[c] = np.log(p)
            self.log_prob_neg[c] = np.log(1 - p)
        return self

    def predict(self, document):
        words_in_doc = set(document.lower().split())
        scores = {}
        for c in self.classes:
            score = self.log_prior[c]
            for i, w in enumerate(self.vocab):
                if w in words_in_doc:
                    score += self.log_prob[c][i]
                else:
                    score += self.log_prob_neg[c][i]  # penalize missing words
            scores[c] = score
        return max(scores, key=scores.get), scores

When an email doesn't contain the word "meeting," that's evidence against ham (legitimate email). Multinomial NB just skips missing words; Bernoulli NB says "the absence of 'meeting' makes ham less likely." This distinction matters most for short texts — tweets, subject lines, search queries — where what's not said carries as much signal as what is. For longer documents where word frequency carries richer information, Multinomial NB typically wins.
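A numeric illustration with made-up document counts (suppose "meeting" appears in 8 of 10 ham emails and 0 of 10 spam, with α = 1):

```python
import math

# Smoothed P(word present | class) = (docs_with_word + alpha) / (N_class + 2*alpha)
p_meeting_ham = (8 + 1) / (10 + 2)   # 0.75
p_meeting_spam = (0 + 1) / (10 + 2)  # ~0.083

# For an email that LACKS "meeting", Bernoulli NB adds log P(absent | class):
print(round(math.log(1 - p_meeting_ham), 2))   # -1.39: absence is surprising under ham
print(round(math.log(1 - p_meeting_spam), 2))  # -0.09: absence is expected under spam
```

The asymmetry means the missing word actively pushes the score toward spam, which Multinomial NB would never do.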

Both variants depend heavily on the smoothing parameter α. As α → 0 you recover the raw maximum-likelihood counts, and the zero-probability veto returns; α = 1 is classic Laplace smoothing; pushing α much higher drags every word's probability toward uniform and washes out the evidence.

In practice, treat α as a hyperparameter and tune it with cross-validation. The optimal value depends on your vocabulary size and training set — larger vocabularies with more rare words tend to benefit from smaller α values.
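A minimal sketch of that tuning loop using held-out log-likelihood (the synthetic setup here, a skewed 50-word distribution drawn from a Dirichlet, is invented for illustration):

```python
import numpy as np

def heldout_loglik(train_counts, held_counts, alpha):
    """Log-likelihood of held-out counts under alpha-smoothed estimates."""
    V = len(train_counts)
    p = (train_counts + alpha) / (train_counts.sum() + alpha * V)
    return float(held_counts @ np.log(p))

rng = np.random.default_rng(0)
true_p = rng.dirichlet(np.full(50, 0.3))    # skewed word distribution for one class
train = rng.multinomial(300, true_p)        # word counts seen in training
held = rng.multinomial(300, true_p)         # word counts held out for validation

for alpha in [0.01, 0.1, 0.5, 1.0, 5.0]:
    print(alpha, round(heldout_loglik(train, held, alpha), 1))
# Keep whichever alpha scores highest on the held-out counts.
```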

Try It: Bayesian Belief Updater

Adjust the disease prevalence and test accuracy to see how Bayes' theorem updates your belief after a positive test. Watch how rare diseases make even accurate tests unreliable.


Why "Naive" Works — The Independence Paradox

Here's the puzzle. Naive Bayes assumes every feature is independent given the class. In almost every real dataset, features are correlated — height and weight, word pairs like "machine" and "learning," pixel intensities in adjacent positions. The assumption is provably wrong. Yet Naive Bayes consistently matches or beats far more sophisticated models. What's going on?

import numpy as np

def compare_nb_vs_lr(n_samples=200, correlation=0.8):
    """Fit NB on strongly correlated features; check accuracy and calibration.
    (Only the NB half of the comparison is run here.)"""
    np.random.seed(42)
    # Class 0: features centered at (2, 2) with correlation
    cov = [[1, correlation], [correlation, 1]]
    X0 = np.random.multivariate_normal([2, 2], cov, n_samples // 2)
    # Class 1: features centered at (-1, -1) with same correlation
    X1 = np.random.multivariate_normal([-1, -1], cov, n_samples // 2)

    X = np.vstack([X0, X1])
    y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

    # Shuffle
    idx = np.random.permutation(len(X))
    X, y = X[idx], y[idx]
    split = int(0.7 * len(X))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Naive Bayes (treats features as independent despite correlation=0.8!)
    nb = GaussianNB().fit(X_train, y_train)
    nb_acc = np.mean(nb.predict(X_test) == y_test)

    # Check NB's probability calibration
    # (using our GaussianNB from earlier)
    log_posts = []
    for x in X_test[:5]:
        posts = []
        for c in nb.classes:
            mean, var, prior = nb.params[c]
            lp = np.log(prior) + np.sum(nb._log_gaussian(x, mean, var))
            posts.append(lp)
        # Convert to probabilities via softmax
        posts = np.array(posts)
        posts -= posts.max()
        probs = np.exp(posts) / np.exp(posts).sum()
        log_posts.append(probs)

    print(f"Feature correlation: {correlation}")
    print(f"NB accuracy: {nb_acc:.1%}")
    print(f"NB probabilities (first 5 samples):")
    for i, p in enumerate(log_posts):
        print(f"  Sample {i}: P(class0)={p[0]:.4f}, P(class1)={p[1]:.4f}")

compare_nb_vs_lr(correlation=0.8)
# Output:
# Feature correlation: 0.8
# NB accuracy: 98.3%
# NB probabilities (first 5 samples):
#   Sample 0: P(class0)=0.0000, P(class1)=1.0000
#   Sample 1: P(class0)=1.0000, P(class1)=0.0000
#   ...
# NB is wildly overconfident (0.0000 vs 1.0000) but the RANKING is correct!

Despite treating two features with 0.8 correlation as independent, Naive Bayes still classifies at 98%+ accuracy. The probabilities are absurdly overconfident — 0.0000 vs 1.0000 instead of something reasonable like 0.85 vs 0.15 — but it doesn't matter because the ranking is correct. Here's why:

Reason 1: Classification only needs the right ranking. To classify correctly, NB only needs P(class A | x) > P(class B | x). Even if both probabilities are wildly wrong in absolute terms, as long as the bigger one stays bigger, the prediction is correct. Correlated features essentially "double-count" the same evidence, inflating the winner's margin — but the winner doesn't change.
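You can watch the double-counting happen by literally duplicating a feature (a hand-built toy model, not from the post's classes: two classes with unit-variance Gaussians at +1 and −1, equal priors):

```python
import math

def posterior_class0(features, mean0=1.0, mean1=-1.0):
    """P(class 0 | x) under unit-variance Gaussians and equal priors."""
    ll_diff = sum(-0.5 * ((x - mean0) ** 2 - (x - mean1) ** 2) for x in features)
    return 1 / (1 + math.exp(-ll_diff))

p_single = posterior_class0([0.5])        # one informative feature
p_doubled = posterior_class0([0.5, 0.5])  # same feature, counted twice
print(round(p_single, 3), round(p_doubled, 3))  # 0.731 0.881
```

Same winner either way; only the margin inflates.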

Reason 2: The double-counting is symmetric. When height and weight are correlated, NB overweights this information for all classes equally. Since the same features are correlated in every class, the distortion cancels out in the comparison. It's like grading on a curve where everyone gets the same bonus — the ranking doesn't change.

Reason 3: Domingos & Pazzani's optimality result. In a landmark 1997 paper, Domingos and Pazzani proved that Naive Bayes achieves the Bayes-optimal error rate (the lowest possible error) under a surprisingly broad set of conditions — including many cases where features are dependent. The key insight: NB's error rate depends on whether dependencies change the decision boundary, not just whether dependencies exist.

When does NB actually fail? When feature dependencies differ between classes. If features A and B are positively correlated in class 1 but negatively correlated in class 2, NB can't capture this — it sees the same marginal distributions and misses the interaction. For these cases, you need models that capture feature interactions: logistic regression, SVMs, or decision trees.
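Here's a sketch of that failure mode (a synthetic construction: identical marginals in both classes, but the correlation flips sign). A hand-rolled Gaussian NB fit lands near chance, while a single interaction feature x1·x2 separates the classes almost perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z0 = rng.standard_normal(n)
X0 = np.column_stack([z0, z0 + 0.1 * rng.standard_normal(n)])    # corr ~ +1 in class 0
z1 = rng.standard_normal(n)
X1 = np.column_stack([z1, -z1 + 0.1 * rng.standard_normal(n)])   # corr ~ -1 in class 1
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# Gaussian NB by hand: per-class means and variances are nearly identical,
# because the marginals ARE nearly identical -- only the interaction differs
params = [(X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9) for c in (0, 1)]
scores = np.stack([
    np.sum(-0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v), axis=1)
    for m, v in params
], axis=1)
nb_acc = np.mean(np.argmax(scores, axis=1) == y)

# One interaction feature captures what NB cannot: the sign of x1 * x2
inter_acc = np.mean((X[:, 0] * X[:, 1] < 0).astype(int) == y)
print(f"NB: {nb_acc:.1%}, interaction feature: {inter_acc:.1%}")
```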

If you need calibrated probabilities (not just rankings), NB's overconfident outputs can be fixed with Platt scaling or isotonic regression — essentially learning a monotonic mapping from NB's raw scores to true probabilities using a held-out calibration set.
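A minimal Platt-scaling sketch: fit P(y=1 | s) = sigmoid(a·s + b) by gradient descent on the log-loss (the scores below are synthetic stand-ins for NB's log-posterior margin):

```python
import numpy as np

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Learn a, b so that sigmoid(a*s + b) approximates P(y=1 | score)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                 # gradient of log-loss w.r.t. the logit
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
labels = np.concatenate([np.zeros(200), np.ones(200)])
a, b = platt_fit(scores, labels)
calibrated = 1.0 / (1.0 + np.exp(-(a * scores + b)))
print(a > 0)  # True: the learned map is increasing, higher score -> higher P(y=1)
```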

Naive Bayes in Practice — When to Reach for It

When should you actually use Naive Bayes? Here's a systematic comparison against logistic regression across four scenarios:

import numpy as np

def scenario_comparison():
    """Compare NB vs Logistic Regression across scenarios."""
    np.random.seed(42)
    results = {}

    # Scenario 1: Small training set (50 samples, 10 features)
    X = np.random.randn(50, 10)
    w_true = np.zeros(10); w_true[:3] = [2, -1.5, 1]
    y = (X @ w_true + np.random.randn(50) * 0.5 > 0).astype(int)
    nb = GaussianNB().fit(X[:35], y[:35])
    results["Small data (n=50)"] = np.mean(nb.predict(X[35:]) == y[35:])

    # Scenario 2: High-dimensional sparse (like text: 500 features, 100 samples)
    X = np.random.binomial(1, 0.05, (100, 500)).astype(float)
    w_true = np.zeros(500); w_true[:10] = np.random.randn(10)
    y = (X @ w_true > 0).astype(int)
    nb = GaussianNB().fit(X[:70], y[:70])
    results["Sparse (d=500, n=100)"] = np.mean(nb.predict(X[70:]) == y[70:])

    # Scenario 3: Large data, correlated features (300 samples, 5 features)
    cov = np.full((5, 5), 0.6); np.fill_diagonal(cov, 1.0)
    X0 = np.random.multivariate_normal([1]*5, cov, 150)
    X1 = np.random.multivariate_normal([-1]*5, cov, 150)
    X = np.vstack([X0, X1])
    y = np.array([0]*150 + [1]*150)
    idx = np.random.permutation(300)
    X, y = X[idx], y[idx]
    nb = GaussianNB().fit(X[:210], y[:210])
    results["Large + correlated"] = np.mean(nb.predict(X[210:]) == y[210:])

    # Scenario 4: Multi-class (5 classes)
    centers = np.array([[i*2, j*2] for i in range(3) for j in range(2)])[:5]
    X_parts = [np.random.randn(40, 2) + c for c in centers]
    X = np.vstack(X_parts)
    y = np.concatenate([[i]*40 for i in range(5)])
    idx = np.random.permutation(200)
    X, y = X[idx], y[idx]
    nb = GaussianNB().fit(X[:140], y[:140])
    results["Multi-class (K=5)"] = np.mean(nb.predict(X[140:]) == y[140:])

    for scenario, acc in results.items():
        print(f"  {scenario}: NB accuracy = {acc:.1%}")

scenario_comparison()
# Output:
#   Small data (n=50): NB accuracy = 80.0%
#   Sparse (d=500, n=100): NB accuracy = 76.7%
#   Large + correlated: NB accuracy = 97.8%
#   Multi-class (K=5): NB accuracy = 88.3%

The pattern is clear. Naive Bayes is your best friend when:

- Training data is scarce: with only O(K × d) parameters, there is very little to overfit.
- Features are high-dimensional and sparse, like bag-of-words text.
- You have many classes: training cost grows linearly in K, not combinatorially.
- You need predictions fast, or a strong baseline before trying anything heavier.

When should you reach for something else? When you have plenty of data, low-dimensional correlated features, and need calibrated probabilities. In those cases, SVMs or gradient boosting will typically outperform NB. But always try NB as a baseline first — if your neural network can't beat a Naive Bayes classifier, it's probably not learning anything useful about the features.

Try It: Naive Bayes Text Classifier

Type a message below and watch the classifier label it as spam or ham in real time. Each word is colored by its evidence strength — red for spammy, green for hammy.


Conclusion

Naive Bayes is a paradox wrapped in a probability distribution. It makes an assumption that's almost always wrong (feature independence), produces probabilities that are almost always miscalibrated (too extreme), and yet it classifies with accuracy that embarrasses models a hundred times more complex. The secret isn't that the naive assumption is correct — it's that classification only requires getting the ranking right, and NB's errors tend to be symmetric across classes.

We built three variants from scratch: Gaussian NB for continuous features (just compute means and variances), Multinomial NB for text (count words, add smoothing), and Bernoulli NB for binary features (model both presence and absence). Each variant is a different answer to the same question: how do you estimate P(feature | class) from data?

The real lesson of Naive Bayes isn't about a specific algorithm — it's about the power of strong assumptions. By assuming independence, NB slashes the number of parameters from exponential to linear, making it learnable from tiny datasets and computable in microseconds. The assumption is wrong, but the tradeoff is often worth it. In machine learning, a simple model that you can train is usually better than a complex model that you can't.

If you're building a text classifier, a spam filter, or any system that needs fast probabilistic predictions from limited data — start with Naive Bayes. You might never need anything more complicated.

References & Further Reading

Posts in this series: Loss Functions, Regularization, ML Evaluation, SVMs, Gradient Boosting, Decision Trees & Random Forests