Quantization from Scratch: How LLMs Shrink to Fit Your GPU

The Memory Wall

You've built a 7-billion-parameter transformer. You can tokenize input, embed it, add positional encoding, run attention with a KV cache, apply softmax, compute loss, optimize, decode, and even fine-tune with LoRA. One problem: the model needs 14 GB of memory just to hold the weights in float16. Your GPU has 8 GB.

The arithmetic is brutal. Each parameter stored in 16-bit floating point takes 2 bytes. Multiply by 7 billion and you get 14 GB — that's the minimum just to load the model, before any activations, KV cache, or optimizer states. Scale up to a 70B model and you need 140 GB — that's two A100 GPUs worth of memory.

But what if you could store each weight in just 4 bits instead of 16? That same 7B model drops to 3.5 GB. The 70B model fits in 35 GB — a single GPU. And the kicker? With the right technique, you can barely measure the quality loss.
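That back-of-envelope arithmetic is worth a few lines of Python. A quick sketch that counts weight storage only:

```python
def model_memory_gb(n_params, bits_per_weight):
    """Weight storage only -- activations, the KV cache, and
    optimizer states all come on top of this."""
    return n_params * bits_per_weight / 8 / 1e9

print(model_memory_gb(7e9, 16))   # 14.0
print(model_memory_gb(7e9, 4))    # 3.5
print(model_memory_gb(70e9, 4))   # 35.0
```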

That's quantization: the art of representing neural network weights with fewer bits. In the LoRA post, we teased that "quantization deserves its own post." Here it is. We'll build every major quantization scheme from pure math and Python, from the simplest rounding to the information-theoretically optimal NF4 encoding that powers QLoRA.

Let's start at the very bottom — with the bits themselves.

How Floating-Point Numbers Actually Work

Before we can shrink numbers, we need to understand how they're stored. Every floating-point number in a computer is encoded as three fields packed into a fixed-width bit string: a sign bit, an exponent, and a mantissa (the fractional part).

The value is: (−1)^sign × 2^(exponent − bias) × (1 + mantissa). That implicit leading 1 is free precision — a clever trick that squeezes one extra bit of accuracy from the encoding.

Let's write a function that peels a float apart into its constituent bits:

import struct

def float_to_bits(value, fmt="float32"):
    """Decompose a float into sign, exponent, and mantissa bits."""
    if fmt == "float32":
        packed = struct.pack('>f', value)
        bits = format(struct.unpack('>I', packed)[0], '032b')
        sign = bits[0]
        exponent = bits[1:9]
        mantissa = bits[9:]
        bias = 127
    elif fmt == "float16":
        # Use numpy for float16 bit manipulation
        import numpy as np
        f16 = np.float16(value)
        raw = f16.view(np.uint16)
        bits = format(raw, '016b')
        sign = bits[0]
        exponent = bits[1:6]
        mantissa = bits[6:]
        bias = 15
    else:
        raise ValueError(f"Unsupported format: {fmt}")

    exp_val = int(exponent, 2) - bias
    mant_val = 1.0 + sum(int(b) * 2**(-(i+1))
                         for i, b in enumerate(mantissa))

    print(f"Value:    {value}")
    print(f"Format:   {fmt}")
    print(f"Sign:     {sign} ({'−' if sign == '1' else '+'})")
    print(f"Exponent: {exponent} (2^{exp_val})")
    print(f"Mantissa: {mantissa} ({mant_val:.6f})")
    print(f"Decoded:  {'-' if sign == '1' else ''}2^{exp_val} × {mant_val:.6f}"
          f" = {(-1)**int(sign) * 2**exp_val * mant_val:.6f}")

float_to_bits(13.5)
# Value:    13.5
# Format:   float32
# Sign:     0 (+)
# Exponent: 10000010 (2^3)
# Mantissa: 10110000000000000000000 (1.687500)
# Decoded:  2^3 × 1.687500 = 13.500000

The key insight for quantization: precision isn't uniform. The spacing between representable floats grows with magnitude. Near zero, float32 can distinguish numbers as little as about 10^−45 apart. Near 1000, the gap is about 10^−4. This uneven spacing matters because neural network weights are typically concentrated near zero — a fact we'll exploit later.
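You can see this non-uniform spacing directly: NumPy's np.spacing returns the gap between a float and the next representable value.

```python
import numpy as np

# Gap to the next representable float32, at increasing magnitudes
for x in [1e-30, 1.0, 1000.0, 1e6]:
    print(f"spacing near {x:g}: {np.spacing(np.float32(x)):.3e}")

# Spacing near 1.0 is 2^-23 (~1.2e-07); near 1000 it is 2^-14 (~6.1e-05)
```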

Here's how the common formats compare:

Format  Bits  Exponent  Mantissa  Range          Use Case
FP32    32    8         23        ±3.4 × 10^38   Training (gold standard)
FP16    16    5         10        ±65,504        Mixed-precision training
BF16    16    8         7         ±3.4 × 10^38   Training (same range as FP32)
INT8    8     -         -         −128 to 127    Inference quantization
INT4    4     -         -         −8 to 7        Aggressive quantization

BF16 won the deep learning format war because it has the same exponent width as FP32 — meaning the same numerical range and no overflow issues — at the cost of reduced precision. For noisy gradient updates, that precision barely matters.
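NumPy has no native bfloat16 type, but since BF16 is just the top 16 bits of an FP32, we can simulate it by masking. A truncation sketch (real hardware rounds, but truncation is close enough to show the trade-off):

```python
import numpy as np

def to_bf16(x):
    """Simulate bfloat16 by zeroing the low 16 bits of a float32.
    Keeps the sign, the full 8-bit exponent, and 7 mantissa bits."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(np.float16(1e5))     # inf -- overflows FP16's ±65,504 range
print(to_bf16(1e5))        # 99840.0 -- finite, off by only ~0.2%
print(to_bf16(1.2345678))  # 1.234375 -- just ~7 mantissa bits survive
```

Same 16 bits, opposite trade-off: FP16 keeps more precision but overflows early, BF16 keeps FP32's range and shrugs off the lost precision.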

But integers? INT8 gives you just 256 possible values. INT4 gives you 16. To make that work, we need a way to map a continuous range of float weights onto these coarse grids. That's quantization.

The Quantization Idea: Map Floats to Integers

The core idea is simple: take a tensor of floating-point weights, find a scale factor that maps them onto integer values, and store only the integers plus the scale. At inference time, multiply back by the scale to approximate the original float.

Symmetric Quantization

The simplest approach centers the quantization grid at zero. If the weights range from -0.5 to +0.5 and we're quantizing to INT8 (-128 to 127), we compute a single scale factor that maps the largest absolute weight to the largest integer:

α = max(|W|)
scale = α / (2^(b−1) − 1)
q = clamp(round(w / scale), −(2^(b−1) − 1), 2^(b−1) − 1)
ŵ = scale × q

import numpy as np

def symmetric_quantize(weights, bits=8):
    """Quantize weights symmetrically around zero."""
    q_max = 2**(bits - 1) - 1  # e.g., 127 for INT8, 7 for INT4

    # Scale maps the largest absolute weight to q_max
    alpha = np.max(np.abs(weights))
    scale = alpha / q_max

    # Quantize: divide by scale, round, clamp to integer range
    quantized = np.clip(np.round(weights / scale), -q_max, q_max).astype(int)

    # Dequantize: multiply back by scale
    dequantized = scale * quantized.astype(float)

    return quantized, scale, dequantized

# Try it on a sample weight tensor
np.random.seed(42)
weights = np.random.randn(1000).astype(np.float32) * 0.5

for bits in [8, 4, 2]:
    q, scale, deq = symmetric_quantize(weights, bits)
    mse = np.mean((weights - deq) ** 2)
    max_err = np.max(np.abs(weights - deq))
    unique_levels = 2 * (2**(bits - 1) - 1) + 1  # symmetric grid: -q_max..+q_max
    print(f"INT{bits}: MSE={mse:.6f}, Max Error={max_err:.4f}, "
          f"Levels={unique_levels}, Compression={32//bits}x")

# INT8: MSE=0.000019, Max Error=0.0076, Levels=255, Compression=4x
# INT4: MSE=0.006299, Max Error=0.1376, Levels=15, Compression=8x
# INT2: MSE=0.204871, Max Error=0.9594, Levels=3, Compression=16x

INT8 is practically lossless — the MSE is tiny compared to the weight variance. INT4 introduces noticeable error but the compression is compelling: 8x smaller. INT2 is too coarse for most uses — a symmetric 2-bit grid leaves just three levels (−scale, 0, +scale) for the entire weight distribution, and the max error nearly reaches the full weight range.

Asymmetric Quantization

Symmetric quantization has a blind spot: it wastes half its range on values that might not exist. Consider the output of a ReLU activation — all values are non-negative. A symmetric grid from -127 to +127 spends half its levels on negative values that never occur, effectively giving you 7-bit precision when you paid for 8.

Asymmetric quantization fixes this with a zero point — an offset that shifts the integer grid to cover the actual data range:

scale = (max(W) − min(W)) / (2^b − 1)
zero_point = round(−min(W) / scale)
q = clamp(round(w / scale) + zero_point, 0, 2^b − 1)
ŵ = scale × (q − zero_point)

def asymmetric_quantize(weights, bits=8):
    """Quantize weights with a zero-point offset."""
    q_max = 2**bits - 1  # e.g., 255 for INT8, 15 for INT4

    w_min, w_max = np.min(weights), np.max(weights)
    scale = (w_max - w_min) / q_max
    zero_point = int(np.round(-w_min / scale))

    quantized = np.clip(np.round(weights / scale) + zero_point,
                        0, q_max).astype(int)
    dequantized = scale * (quantized.astype(float) - zero_point)

    return quantized, scale, zero_point, dequantized

# Compare symmetric vs asymmetric on a skewed distribution
# (post-ReLU activations are always non-negative)
np.random.seed(42)
relu_activations = np.abs(np.random.randn(1000).astype(np.float32) * 0.5)

# Symmetric wastes the negative range
_, _, deq_sym = symmetric_quantize(relu_activations, bits=4)
mse_sym = np.mean((relu_activations - deq_sym) ** 2)

# Asymmetric uses the full range for positive values
_, _, _, deq_asym = asymmetric_quantize(relu_activations, bits=4)
mse_asym = np.mean((relu_activations - deq_asym) ** 2)

print(f"INT4 Symmetric  MSE: {mse_sym:.6f}")
print(f"INT4 Asymmetric MSE: {mse_asym:.6f}")
print(f"Improvement: {(1 - mse_asym/mse_sym)*100:.1f}%")

# INT4 Symmetric  MSE: 0.006299
# INT4 Asymmetric MSE: 0.001297
# Improvement: 79.4%

For skewed distributions, asymmetric quantization can cut error by ~80% at the same bit width. The cost is one extra integer (the zero point) per scale factor — negligible overhead.

For neural network weights, which are typically centered near zero, symmetric quantization works well. For activations, which are often non-negative, asymmetric is the better choice. Modern quantization frameworks use symmetric for weights and asymmetric for activations.

Granularity: Per-Tensor, Per-Channel, Per-Group

We've been computing a single scale factor for the entire weight tensor. That's fine when weights are well-behaved, but in practice, one rogue outlier can ruin everything. If 99.9% of weights live in [-0.5, 0.5] but one weight hits 5.0, the scale factor stretches to cover that outlier — and all the small weights near zero get crushed into too few integer levels.

The fix: compute separate scale factors for smaller chunks of the tensor.

def quantize_per_tensor(W, bits=4):
    """One scale for the entire matrix."""
    _, scale, deq = symmetric_quantize(W.flatten(), bits)
    return deq.reshape(W.shape), 1  # 1 scale factor total

def quantize_per_channel(W, bits=4):
    """One scale per row (output channel)."""
    deq = np.zeros_like(W)
    for i in range(W.shape[0]):
        _, scale, deq[i] = symmetric_quantize(W[i], bits)
    return deq, W.shape[0]  # one scale per row

def quantize_per_group(W, bits=4, group_size=128):
    """One scale per group of `group_size` elements in each row."""
    deq = np.zeros_like(W)
    num_scales = 0
    for i in range(W.shape[0]):
        for j in range(0, W.shape[1], group_size):
            chunk = W[i, j:j+group_size]
            _, scale, deq[i, j:j+group_size] = symmetric_quantize(chunk, bits)
            num_scales += 1
    return deq, num_scales

# Create a weight matrix with one outlier channel
np.random.seed(42)
W = np.random.randn(64, 512).astype(np.float32) * 0.02
W[13, :] *= 50  # Channel 13 has abnormally large weights

print("Quantization granularity comparison (INT4):")
print(f"{'Method':<16} {'MSE':>12} {'Scale factors':>14}")
print("-" * 44)

for name, fn in [("Per-tensor", quantize_per_tensor),
                 ("Per-channel", quantize_per_channel),
                 ("Per-group", lambda W, b: quantize_per_group(W, b, 128))]:
    deq, n_scales = fn(W, 4)
    mse = np.mean((W - deq) ** 2)
    print(f"{name:<16} {mse:>12.8f} {n_scales:>14}")

# Quantization granularity comparison (INT4):
# Method                  MSE  Scale factors
# --------------------------------------------
# Per-tensor       0.00071684              1
# Per-channel      0.00033108             64
# Per-group        0.00024366            256

Per-group quantization with groups of 128 elements produces ~3x lower error than per-tensor — and the improvement is even more dramatic for the non-outlier channels (over 50x), because the outlier in channel 13 only affects its own group's scale, not the entire matrix. The overhead is modest: each scale factor is a float16 (2 bytes) shared across 128 four-bit weights (64 bytes), adding just 0.125 bits per weight. That's why GPTQ and QLoRA both use per-group quantization.
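The overhead arithmetic from this paragraph generalizes to any group size. A quick sanity check, assuming a 16-bit scale per group as above:

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16):
    """Average storage per weight: the weight itself plus its
    share of the per-group scale factor."""
    return weight_bits + scale_bits / group_size

print(effective_bits())               # 4.125
print(effective_bits(group_size=64))  # 4.25
print(effective_bits(group_size=32))  # 4.5 -- finer groups, more overhead
```

Smaller groups isolate outliers better but cost more bits per weight; group sizes of 64-128 are the usual compromise.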

Post-Training Quantization: The Quick and Dirty Way

The simplest approach to quantizing a model is Round-to-Nearest (RTN): take a trained model, apply our quantization formula to every weight tensor, and hope for the best. No retraining, no calibration data, no fuss.

def simple_linear(x, W, b):
    """A plain linear layer: y = xW^T + b."""
    return x @ W.T + b

def quantized_linear(x, W, b, bits=8):
    """Linear layer with weight quantization."""
    q, scale, W_deq = symmetric_quantize(W.flatten(), bits)
    W_deq = W_deq.reshape(W.shape)
    return x @ W_deq.T + b

# Build a tiny trained network (2 layers, classification)
np.random.seed(0)
W1 = np.random.randn(32, 16).astype(np.float32) * 0.3
b1 = np.zeros(32)
W2 = np.random.randn(4, 32).astype(np.float32) * 0.3
b2 = np.zeros(4)

# Generate some test data
X_test = np.random.randn(200, 16).astype(np.float32)

def forward(x, bits=32):
    """Forward pass — quantize weights if bits < 32."""
    if bits < 32:
        _, _, W1_q = symmetric_quantize(W1.flatten(), bits)
        _, _, W2_q = symmetric_quantize(W2.flatten(), bits)
        h = np.maximum(0, x @ W1_q.reshape(W1.shape).T + b1)
        return h @ W2_q.reshape(W2.shape).T + b2
    else:
        h = np.maximum(0, x @ W1.T + b1)
        return h @ W2.T + b2

# Compare outputs at different bit widths
ref_output = forward(X_test, bits=32)
for bits in [8, 4, 3, 2]:
    q_output = forward(X_test, bits=bits)
    mse = np.mean((ref_output - q_output) ** 2)
    max_err = np.max(np.abs(ref_output - q_output))
    print(f"INT{bits}: Output MSE={mse:.6f}, Max deviation={max_err:.4f}")

# INT8: Output MSE=0.000178, Max deviation=0.0543
# INT4: Output MSE=0.054342, Max deviation=0.9069
# INT3: Output MSE=0.273438, Max deviation=2.0356
# INT2: Output MSE=1.565282, Max deviation=4.4790

RTN works beautifully at 8-bit — the output barely changes. At 4-bit, errors start accumulating across layers. At 2-bit, the output is garbage. This is the fundamental challenge: rounding errors compound through the network. Each layer's output error becomes the next layer's input error, and with dozens of layers in a real model, those small per-weight errors snowball.

The crucial observation: minimizing weight error isn't the same as minimizing output error. A weight that's large in magnitude but rarely activated by the data matters less than a small weight that sits on a high-activation path. The next technique exploits this insight.

GPTQ: Clever Error Compensation

GPTQ, introduced by Frantar and colleagues in 2022, changed the game for post-training quantization. Instead of rounding each weight independently and hoping for the best, GPTQ asks: when I round one weight (introducing error), can I adjust the remaining unquantized weights to compensate?

The objective is not to minimize weight error, but output error:

minimize ||WX − ŴX||²

where X is a small calibration dataset (typically 128 samples). The Hessian H = X Xᵀ captures how weights interact through the data — if two weights always activate together, adjusting one can compensate for rounding the other.

Think of it like a Rubik's cube: each twist (rounding a weight) disrupts some faces, but a skilled solver compensates with every subsequent move. Here's a simplified implementation on a small matrix:

def gptq_quantize(W, X_cal, bits=4):
    """Simplified GPTQ: column-by-column quantization with
    Hessian-based error compensation."""
    W = W.copy().astype(np.float64)
    n_rows, n_cols = W.shape
    q_max = 2**(bits - 1) - 1

    # Compute Hessian: H = X @ X^T (how columns interact through data)
    H = X_cal @ X_cal.T
    # Regularize for numerical stability
    H += 1e-4 * np.eye(n_cols) * np.mean(np.diag(H))
    # Plain inverse for clarity (the real GPTQ uses a Cholesky
    # decomposition here for speed and numerical stability)
    H_inv = np.linalg.inv(H)

    quantized = np.zeros_like(W, dtype=int)
    scales = np.zeros(n_cols)
    errors = []

    # Process column by column
    for col in range(n_cols):
        w_col = W[:, col]

        # Quantize this column (symmetric)
        alpha = np.max(np.abs(w_col)) + 1e-10
        scale = alpha / q_max
        scales[col] = scale
        q_col = np.clip(np.round(w_col / scale), -q_max, q_max)
        quantized[:, col] = q_col.astype(int)

        # Compute the quantization error for this column
        w_hat = scale * q_col
        delta = w_col - w_hat
        errors.append(np.mean(delta ** 2))

        # Compensate: adjust remaining columns using the Hessian.
        # Key formula: W[:, rest] += delta * H_inv[col, rest] / H_inv[col, col]
        # (delta = w - w_hat, so the update cancels the rounding error's
        # effect on the layer output)
        if col < n_cols - 1:
            h_diag = H_inv[col, col] + 1e-10
            compensation = np.outer(delta, H_inv[col, col+1:] / h_diag)
            W[:, col+1:] += compensation

    return quantized, errors, scales

# Demo: compare RTN vs GPTQ on a small weight matrix
np.random.seed(42)
W_demo = np.random.randn(8, 32).astype(np.float64) * 0.5
X_cal = np.random.randn(32, 64).astype(np.float64) * 0.3

# RTN: just round each weight independently
_, _, W_rtn = symmetric_quantize(W_demo.flatten(), bits=4)
W_rtn = W_rtn.reshape(W_demo.shape)

# GPTQ: round with error compensation
q_gptq, _, gptq_scales = gptq_quantize(W_demo, X_cal, bits=4)
# Dequantize using the scales from GPTQ (computed on compensated weights)
W_gptq = np.zeros_like(W_demo)
for col in range(W_demo.shape[1]):
    W_gptq[:, col] = gptq_scales[col] * q_gptq[:, col]

# Compare OUTPUT error (what actually matters)
Y_ref = W_demo @ X_cal
Y_rtn = W_rtn @ X_cal
Y_gptq = W_gptq @ X_cal

mse_rtn = np.mean((Y_ref - Y_rtn) ** 2)
mse_gptq = np.mean((Y_ref - Y_gptq) ** 2)

print(f"RTN  output MSE: {mse_rtn:.6f}")
print(f"GPTQ output MSE: {mse_gptq:.6f}")
print(f"GPTQ reduction:  {(1 - mse_gptq/mse_rtn)*100:.1f}%")

# RTN  output MSE: 0.018856
# GPTQ output MSE: 0.006265
# GPTQ reduction:  66.8%

GPTQ reduces output error by over 65% compared to naive rounding. On real LLMs with billions of parameters and deeper correlations between weights, the improvement is even more dramatic — it's what makes 4-bit quantization usable in practice.

The real GPTQ implementation has three additional optimizations: processing columns in batches of 128 (lazy batch updates for GPU efficiency), using Cholesky decomposition for the Hessian inverse, and a fixed column ordering. But the core idea is exactly what we've built: quantize one column, compensate the rest.
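The Cholesky step is easy to sketch with SciPy. This is a minimal stand-in for the np.linalg.inv call in our simplified version, not the full batched GPTQ kernel:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

np.random.seed(0)
X = np.random.randn(32, 64)
H = X @ X.T + 1e-4 * np.eye(32)  # regularized SPD Hessian, as in gptq_quantize

# Factor once, then solve H @ H_inv = I. For symmetric positive-definite
# matrices this is faster and numerically stabler than a general inverse.
H_inv = cho_solve(cho_factor(H), np.eye(32))

print(np.allclose(H_inv @ H, np.eye(32)))  # True
```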

NormalFloat: Information-Theoretically Optimal Quantization

Here's a question that most quantization approaches don't ask: where should we place our quantization levels?

Standard INT4 spaces its 16 levels uniformly across the representable range. But neural network weights aren't uniform — they're approximately normally distributed, clustered densely near zero with thin tails. A uniform grid wastes levels in the sparse tails where almost no weights live, and doesn't have enough resolution near zero where most weights are.

Tim Dettmers' NormalFloat (NF4), introduced in the QLoRA paper, flips this on its head. Instead of uniformly-spaced levels, place them at the quantiles of the normal distribution, so that each quantization bin captures an equal fraction of the probability mass:

from scipy.stats import norm

def compute_nf4_levels():
    """Compute NormalFloat4 quantization levels.

    Place 16 levels at quantiles of N(0,1) so each bin has
    equal probability mass. Then normalize to [-1, 1] and
    ensure there's an exact zero."""
    n_levels = 16
    # 8 negative levels, 1 zero, 7 positive = 16 total
    # Negative side: 8 quantiles of the negative half
    neg_levels = [norm.ppf((i + 0.5) / (2 * 8)) for i in range(8)]
    # Positive side: 7 quantiles of the positive half (zero handled separately)
    pos_levels = [norm.ppf(0.5 + (i + 0.5) / (2 * 8)) for i in range(1, 8)]

    levels = neg_levels + [0.0] + pos_levels
    levels = sorted(levels)

    # Normalize to [-1, 1]
    max_abs = max(abs(l) for l in levels)
    levels = [l / max_abs for l in levels]

    return np.array(levels)

def compute_int4_levels():
    """Standard INT4 levels, normalized to [-1, 1]."""
    levels = np.linspace(-1, 1, 16)
    return levels

nf4 = compute_nf4_levels()
int4 = compute_int4_levels()

print("NF4 levels (normalized):")
print("  ", [f"{l:+.4f}" for l in nf4])
print("\nINT4 levels (uniform):")
print("  ", [f"{l:+.4f}" for l in int4])

# Compare quantization error on normally-distributed weights
np.random.seed(42)
weights = np.random.randn(10000).astype(np.float32)

def quantize_with_levels(w, levels):
    """Map each weight to the nearest level."""
    scale = np.max(np.abs(w))
    normalized = w / scale
    # Find nearest level for each weight
    indices = np.argmin(np.abs(normalized[:, None] - levels[None, :]), axis=1)
    dequantized = levels[indices] * scale
    return dequantized

deq_nf4 = quantize_with_levels(weights, nf4)
deq_int4 = quantize_with_levels(weights, int4)

mse_nf4 = np.mean((weights - deq_nf4) ** 2)
mse_int4 = np.mean((weights - deq_int4) ** 2)

print(f"\nMSE on N(0,1) weights:")
print(f"  INT4 (uniform):      {mse_int4:.6f}")
print(f"  NF4  (normal-aware): {mse_nf4:.6f}")
print(f"  NF4 improvement:     {(1 - mse_nf4/mse_int4)*100:.1f}%")

# MSE on N(0,1) weights:
#   INT4 (uniform):      0.023122
#   NF4  (normal-aware): 0.013969
#   NF4 improvement:     39.6%

NF4 cuts quantization error by ~40% compared to uniform INT4 — and that's a free improvement. Same 4 bits, same memory, same speed. The only difference is where you place the grid lines.

The information-theoretic argument is clean: if your data follows a known distribution, the optimal quantizer places levels so that each bin captures an equal share of the probability mass. This minimizes the expected quantization error because you're spending your precious few bits where the data actually is, not where it isn't.
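We can check the equal-mass property numerically. With nearest-level assignment, the effective bin edges sit at the midpoints between adjacent levels, so each bin should capture roughly 1/16 of the N(0,1) mass (roughly, because midpoints aren't exactly the equal-mass boundaries):

```python
import numpy as np
from scipy.stats import norm

# 16 levels at equal-mass quantiles of N(0,1)
levels = norm.ppf((np.arange(16) + 0.5) / 16)
# Nearest-level assignment puts the bin edges halfway between levels
edges = (levels[:-1] + levels[1:]) / 2
masses = np.diff(norm.cdf(np.concatenate(([-np.inf], edges, [np.inf]))))

print(masses.round(3))  # each entry close to 1/16 = 0.0625
```

Repeat the same exercise with uniformly spaced levels and the central bins balloon to several times the tail bins' mass — that imbalance is exactly the error NF4 removes.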

Double Quantization and QLoRA

With per-group quantization (group_size=64), each group needs its own scale factor. QLoRA stores these first-level scales in float32: 4 bytes per 64 four-bit weights, or 0.5 extra bits per weight. Dettmers' second trick: quantize the scale factors themselves to 8 bits, keeping one float32 second-level constant per block of 256 scales. That cuts the overhead to about 0.127 bits per weight. This "double quantization" is minor in isolation but saves gigabytes at 70B scale.
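Using the block sizes reported in the QLoRA paper (one scale per 64 weights, then one float32 second-level constant per 256 quantized scales), the savings are easy to check:

```python
group = 64
naive = 32 / group                       # fp32 scale per group of 64 weights
double = 8 / group + 32 / (group * 256)  # 8-bit scales + fp32 scale-of-scales
print(f"{naive:.3f} -> {double:.3f} bits of overhead per weight")
# 0.500 -> 0.127 bits of overhead per weight
```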

QLoRA combines everything: 4-bit NF4 quantization for the base model, double-quantized scale factors, and float16 LoRA adapters for fine-tuning. The base model is frozen at 4 bits (saving memory), while the small LoRA matrices train in full precision (preserving gradient quality). This is how you fine-tune a 65B model on a single 48GB GPU — the payoff we promised in the LoRA post.

Quantization-Aware Training: Teaching the Model to Cope

Everything so far has been post-training quantization (PTQ): take a trained model and quantize it after the fact, treating quantization as damage to be minimized. But what if the model could learn to be quantized?

Quantization-Aware Training (QAT) inserts fake quantization into the training forward pass: quantize the weights, then immediately dequantize them back to float. The network sees the "damaged" values during training and learns to work around the rounding errors.

There's one catch: round() has zero gradient almost everywhere (it's a step function). If we can't backpropagate through rounding, we can't train. The fix is the Straight-Through Estimator (STE): during the backward pass, pretend round() is the identity function. It's a brazen lie, but it works remarkably well.

class FakeQuantize:
    """Fake quantization for QAT: quantize in forward, STE in backward."""

    def __init__(self, bits=4):
        self.bits = bits
        self.q_max = 2**(bits - 1) - 1

    def forward(self, w):
        """Quantize → dequantize (simulates quantization error)."""
        alpha = np.max(np.abs(w)) + 1e-10
        scale = alpha / self.q_max
        q = np.clip(np.round(w / scale), -self.q_max, self.q_max)
        return scale * q  # float output with quantization noise baked in

    def backward(self, grad):
        """Straight-Through Estimator: pass gradient unchanged.
        Equivalent to pretending round() is the identity."""
        return grad  # That's it. Just pass it through.

# Train a network WITH and WITHOUT QAT
# Needs enough parameters that INT4 quantization causes real damage
np.random.seed(42)
n_in, n_hidden, n_out = 16, 64, 4
lr = 0.02

# Generate a classification dataset with nonlinear boundaries
X = np.random.randn(500, n_in).astype(np.float32)
targets = ((X[:, 0] * X[:, 1] > 0).astype(int) +
           (X[:, 2] + X[:, 3] > 0.5).astype(int) +
           (X[:, 4] > 0).astype(int))
targets = np.clip(targets, 0, n_out - 1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def train_network(use_qat=False, bits=4, epochs=500):
    """Train a 2-layer network, optionally with fake quantization."""
    np.random.seed(123)  # Same init for fair comparison
    W1 = np.random.randn(n_in, n_hidden) * 0.5
    W2 = np.random.randn(n_hidden, n_out) * 0.5
    fq = FakeQuantize(bits) if use_qat else None

    for epoch in range(epochs):
        # Forward pass (with fake quantization if QAT)
        W1_eff = fq.forward(W1) if fq else W1
        W2_eff = fq.forward(W2) if fq else W2

        h = np.maximum(0, X @ W1_eff)
        logits = h @ W2_eff
        probs = softmax(logits)

        # Cross-entropy loss
        loss = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-10))

        # Backward pass (simplified, STE means grads flow through normally)
        grad_logits = probs.copy()
        grad_logits[np.arange(len(targets)), targets] -= 1
        grad_logits /= len(targets)

        grad_W2 = h.T @ grad_logits
        grad_h = grad_logits @ W2_eff.T
        grad_h[X @ W1_eff <= 0] = 0
        grad_W1 = X.T @ grad_h

        # STE: gradients pass through fake quantization unchanged
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    return W1, W2

# Train both versions
W1_std, W2_std = train_network(use_qat=False)
W1_qat, W2_qat = train_network(use_qat=True, bits=4)

# Now quantize both to 4-bit and evaluate
def evaluate(W1, W2, bits=4):
    _, _, W1_q = symmetric_quantize(W1.flatten(), bits)
    _, _, W2_q = symmetric_quantize(W2.flatten(), bits)
    h = np.maximum(0, X @ W1_q.reshape(W1.shape))
    logits = h @ W2_q.reshape(W2.shape)
    preds = np.argmax(logits, axis=-1)
    return np.mean(preds == targets)

acc_full = evaluate(W1_std, W2_std, bits=32)  # No quantization
acc_ptq  = evaluate(W1_std, W2_std, bits=4)   # PTQ: train normal, quantize after
acc_qat  = evaluate(W1_qat, W2_qat, bits=4)   # QAT: train with fake quantization

print(f"Full precision (FP32): {acc_full:.1%}")
print(f"PTQ (INT4):            {acc_ptq:.1%}")
print(f"QAT (INT4):            {acc_qat:.1%}")

# Full precision (FP32): 68.6%
# PTQ (INT4):            59.0%
# QAT (INT4):            65.2%

QAT recovers most of the accuracy lost to quantization because the network learns weights that round well. It adjusts its parameters so they land near quantization grid lines, rather than straddling the boundary between two levels.

The practical trade-off: PTQ takes minutes (just round the weights), while QAT requires retraining. For 8-bit quantization, PTQ is almost always sufficient. At 4-bit, GPTQ-style PTQ usually works. Below 4 bits, QAT becomes essential.

The Practical Impact: Memory, Speed, Quality

Let's put concrete numbers on the table. Here's what quantization means for a real 7B-parameter model:

Format       Bits/Weight  Model Size  Perplexity Δ  Fits On
FP32         32           28 GB       baseline      2× RTX 4090
FP16 / BF16  16           14 GB       ~0.0          1× RTX 4090
INT8         8            7 GB        ~0.1          RTX 4070
INT4 / NF4   4            3.5 GB      ~0.3          RTX 3060 / M1 Mac
INT3         3            2.6 GB      ~1.0          Older 4GB GPUs
INT2         2            1.75 GB     ~5.0+         Any GPU (but quality suffers)

The sweet spot is blindingly obvious: 4-bit quantization. You get 8x compression over FP32 (4x over FP16) with barely measurable quality loss. Below 4 bits, quality degrades sharply. Above 4 bits, you're paying for precision the model doesn't need.

There's a beautiful scaling law at work here, too: larger models are more robust to quantization. A 70B model at 4-bit often outperforms a 13B model at 16-bit, even though the 70B is using fewer bits per weight. The extra parameters provide redundancy that absorbs quantization noise. This is why the 4-bit revolution mattered — it didn't just make existing models smaller, it made much bigger models accessible.

Speed also improves, because LLM inference is memory-bandwidth bound. A 4-bit model reads 4x less data from memory per token, which means 3-4x faster generation on typical hardware. You're not just saving space; you're saving time.
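A rough roofline sketch makes the speedup concrete. The bandwidth figure is hypothetical, and real decoding also reads the KV cache, so treat these as ceilings rather than predictions:

```python
bandwidth_gb_s = 1000  # hypothetical ~1 TB/s GPU memory bandwidth

# Memory-bound decoding: every generated token streams all weights once,
# so tokens/sec is capped at bandwidth / model size.
for name, size_gb in [("FP16", 14.0), ("INT4", 3.5)]:
    print(f"{name}: {size_gb} GB -> ~{bandwidth_gb_s / size_gb:.0f} tokens/s ceiling")

# FP16: 14.0 GB -> ~71 tokens/s ceiling
# INT4: 3.5 GB -> ~286 tokens/s ceiling
```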

Decision tree: For 8-bit, use simple RTN — it just works. For 4-bit, use GPTQ or AWQ with calibration data — the error compensation is essential at this bit width. For 3-bit or below, you need QAT. For fine-tuning at 4-bit, use QLoRA (NF4 base + LoRA adapters).

Try It: Quantization Playground

Watch weights snap to quantization levels. Toggle between uniform (INT) and NF4 to see why smarter level placement reduces error.

Conclusion

We've now completed the full pipeline that this series has been building:

tokenize → embed → position → attend → softmax → loss → optimize → decode → fine-tune → quantize

From raw text to compressed model, every piece built from scratch. And quantization is arguably the most consequential piece in terms of real-world impact. Without it, running LLMs requires datacenter GPUs. With 4-bit NF4, a 7B model fits in your laptop's GPU memory. A 70B model fits on a single A100 instead of two.

The key ideas we covered:

- Symmetric and asymmetric quantization: map floats onto a coarse integer grid with a scale factor (and, for skewed data, a zero point).
- Granularity: per-group scales contain outliers, at a cost of only ~0.125 bits per weight.
- GPTQ: quantize column by column, using the Hessian to let the remaining weights compensate for each rounding error.
- NF4: place levels at normal-distribution quantiles so each bin holds equal probability mass.
- QAT: train with fake quantization and the straight-through estimator so the weights learn to round well.

The deeper lesson is that quantization isn't just engineering — it's applied information theory. The best quantization schemes don't just compress; they understand the data. NF4 works because it asks "where are the weights?" before deciding where to put the grid lines. That principle — match your representation to your distribution — echoes throughout machine learning.

References & Further Reading