Quantization from Scratch: How LLMs Shrink to Fit Your GPU
The Memory Wall
You've built a 7-billion-parameter transformer. You can tokenize input, embed it, add positional encoding, run attention with a KV cache, apply softmax, compute loss, optimize, decode, and even fine-tune with LoRA. One problem: the model needs 14 GB of memory just to hold the weights in float16. Your GPU has 8 GB.
The arithmetic is brutal. Each parameter stored in 16-bit floating point takes 2 bytes. Multiply by 7 billion and you get 14 GB — that's the minimum just to load the model, before any activations, KV cache, or optimizer states. Scale up to a 70B model and you need 140 GB — two A100 GPUs' worth of memory.
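That arithmetic is easy to script. A quick sketch of the weights-only footprint (the helper name is mine, not from any library):

```python
def model_memory_gb(params_billion, bits_per_weight):
    """Weights-only footprint in GB (ignores activations, KV cache, optimizer)."""
    # params_billion × 1e9 params × bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8

for bits in [32, 16, 8, 4]:
    print(f"7B model at {bits:>2}-bit: {model_memory_gb(7, bits):5.1f} GB")
# 7B model at 32-bit:  28.0 GB
# 7B model at 16-bit:  14.0 GB
# 7B model at  8-bit:   7.0 GB
# 7B model at  4-bit:   3.5 GB
```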
But what if you could store each weight in just 4 bits instead of 16? That same 7B model drops to 3.5 GB. The 70B model fits in 35 GB — a single GPU. And the kicker? With the right technique, you can barely measure the quality loss.
That's quantization: the art of representing neural network weights with fewer bits. In the LoRA post, we teased that "quantization deserves its own post." Here it is. We'll build every major quantization scheme from pure math and Python, from the simplest rounding to the information-theoretically optimal NF4 encoding that powers QLoRA.
Let's start at the very bottom — with the bits themselves.
How Floating-Point Numbers Actually Work
Before we can shrink numbers, we need to understand how they're stored. Every floating-point number in a computer is encoded as three fields packed into a fixed-width bit string:
- Sign (1 bit): 0 for positive, 1 for negative
- Exponent (several bits): controls the magnitude — how big or small the number is
- Mantissa / significand (remaining bits): controls the precision — the exact value within that magnitude range
The value is: (−1)^sign × 2^(exponent − bias) × (1 + mantissa). That implicit leading 1 is free precision — a clever trick that squeezes one extra bit of accuracy from the encoding.
Let's write a function that peels a float apart into its constituent bits:
```python
import struct

import numpy as np


def float_to_bits(value, fmt="float32"):
    """Decompose a float into sign, exponent, and mantissa bits."""
    if fmt == "float32":
        packed = struct.pack('>f', value)
        bits = format(struct.unpack('>I', packed)[0], '032b')
        sign = bits[0]
        exponent = bits[1:9]
        mantissa = bits[9:]
        bias = 127
    elif fmt == "float16":
        # Use numpy for float16 bit manipulation
        f16 = np.float16(value)
        raw = f16.view(np.uint16)
        bits = format(raw, '016b')
        sign = bits[0]
        exponent = bits[1:6]
        mantissa = bits[6:]
        bias = 15

    exp_val = int(exponent, 2) - bias
    mant_val = 1.0 + sum(int(b) * 2**(-(i + 1))
                         for i, b in enumerate(mantissa))

    print(f"Value: {value}")
    print(f"Format: {fmt}")
    print(f"Sign: {sign} ({'−' if sign == '1' else '+'})")
    print(f"Exponent: {exponent} (2^{exp_val})")
    print(f"Mantissa: {mantissa} ({mant_val:.6f})")
    print(f"Decoded: {'-' if sign == '1' else ''}2^{exp_val} × {mant_val:.6f}"
          f" = {(-1)**int(sign) * 2**exp_val * mant_val:.6f}")


float_to_bits(13.5)
# Value: 13.5
# Format: float32
# Sign: 0 (+)
# Exponent: 10000010 (2^3)
# Mantissa: 10110000000000000000000 (1.687500)
# Decoded: 2^3 × 1.687500 = 13.500000
```
The key insight for quantization: precision isn't uniform. The spacing between representable floats grows with magnitude. Near zero, float32 can distinguish numbers about 10⁻⁴⁵ apart. Near 1000, the gap is about 10⁻⁴. This uneven spacing matters because neural network weights are typically concentrated near zero — a fact we'll exploit later.
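You can observe this spacing directly with numpy's `np.spacing`, which returns the gap between a value and the next representable float:

```python
import numpy as np

# ULP: the gap between a float32 value and the next representable one.
# It grows in proportion to the magnitude.
for x in [1.0, 1000.0, 1e6, 1e9]:
    print(x, np.spacing(np.float32(x)))
# 1.0 1.1920929e-07
# 1000.0 6.1035156e-05
# 1000000.0 0.0625
# 1000000000.0 64.0
```

Near a billion, adjacent float32 values are 64 apart — integers in that range can't even be represented exactly.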
Here's how the common formats compare:
| Format | Bits | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ±3.4 × 10³⁸ | Training (gold standard) |
| FP16 | 16 | 5 | 10 | ±65,504 | Mixed-precision training |
| BF16 | 16 | 8 | 7 | ±3.4 × 10³⁸ | Training (same range as FP32) |
| INT8 | 8 | — | — | -128 to 127 | Inference quantization |
| INT4 | 4 | — | — | -8 to 7 | Aggressive quantization |
BF16 won the deep learning format war because it has the same exponent width as FP32 — meaning the same numerical range and no overflow issues — at the cost of reduced precision. For noisy gradient updates, that precision barely matters.
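You can emulate BF16 with plain `struct` by keeping only the top 16 bits of a float32's bit pattern (a sketch using truncation; real hardware typically rounds, and the helper name is mine):

```python
import struct

def to_bf16(x):
    """Truncate a float32 to bfloat16 precision: keep the sign bit,
    the full 8-bit exponent, and only the top 7 mantissa bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bf16(3.14159))  # 3.140625 — only ~2-3 decimal digits survive
print(to_bf16(3e38) > 1e38)  # True — huge values stay finite (FP16 overflows to inf)
```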
But integers? INT8 gives you just 256 possible values. INT4 gives you 16. To make that work, we need a way to map a continuous range of float weights onto these coarse grids. That's quantization.
The Quantization Idea: Map Floats to Integers
The core idea is simple: take a tensor of floating-point weights, find a scale factor that maps them onto integer values, and store only the integers plus the scale. At inference time, multiply back by the scale to approximate the original float.
Symmetric Quantization
The simplest approach centers the quantization grid at zero. If the weights range from -0.5 to +0.5 and we're quantizing to INT8 (-128 to 127), we compute a single scale factor that maps the largest absolute weight to the largest integer:
α = max |w|
scale = α / (2^(b−1) − 1)
q = clamp(round(w / scale), −(2^(b−1) − 1), 2^(b−1) − 1)
ŵ = scale × q
```python
import numpy as np


def symmetric_quantize(weights, bits=8):
    """Quantize weights symmetrically around zero."""
    q_max = 2**(bits - 1) - 1  # e.g., 127 for INT8, 7 for INT4
    # Scale maps the largest absolute weight to q_max
    alpha = np.max(np.abs(weights))
    scale = alpha / q_max
    # Quantize: divide by scale, round, clamp to integer range
    quantized = np.clip(np.round(weights / scale), -q_max, q_max).astype(int)
    # Dequantize: multiply back by scale
    dequantized = scale * quantized.astype(float)
    return quantized, scale, dequantized


# Try it on a sample weight tensor
np.random.seed(42)
weights = np.random.randn(1000).astype(np.float32) * 0.5

for bits in [8, 4, 2]:
    q, scale, deq = symmetric_quantize(weights, bits)
    mse = np.mean((weights - deq) ** 2)
    max_err = np.max(np.abs(weights - deq))
    unique_levels = 2**bits
    print(f"INT{bits}: MSE={mse:.6f}, Max Error={max_err:.4f}, "
          f"Levels={unique_levels}, Compression={32//bits}x")

# INT8: MSE=0.000019, Max Error=0.0076, Levels=256, Compression=4x
# INT4: MSE=0.006299, Max Error=0.1376, Levels=16, Compression=8x
# INT2: MSE=0.204871, Max Error=0.9594, Levels=4, Compression=16x
```
INT8 is practically lossless — the MSE is tiny compared to the weight variance. INT4 introduces noticeable error but the compression is compelling: 8x smaller. INT2 is too coarse for most uses — only 4 possible values to represent the entire weight distribution, and the max error nearly reaches the full weight range.
Asymmetric Quantization
Symmetric quantization has a blind spot: it wastes half its range on values that might not exist. Consider the output of a ReLU activation — all values are non-negative. A symmetric grid from -127 to +127 leaves the entire negative half unused, giving you effectively 7-bit precision when you paid for 8.
Asymmetric quantization fixes this with a zero point — an offset that shifts the integer grid to cover the actual data range:
scale = (max(w) − min(w)) / (2^b − 1)
zero_point = round(−min(w) / scale)
q = clamp(round(w / scale) + zero_point, 0, 2^b − 1)
ŵ = scale × (q − zero_point)
```python
def asymmetric_quantize(weights, bits=8):
    """Quantize weights with a zero-point offset."""
    q_max = 2**bits - 1  # e.g., 255 for INT8, 15 for INT4
    w_min, w_max = np.min(weights), np.max(weights)
    scale = (w_max - w_min) / q_max
    zero_point = int(np.round(-w_min / scale))
    quantized = np.clip(np.round(weights / scale) + zero_point,
                        0, q_max).astype(int)
    dequantized = scale * (quantized.astype(float) - zero_point)
    return quantized, scale, zero_point, dequantized


# Compare symmetric vs asymmetric on a skewed distribution
# (post-ReLU activations are always non-negative)
np.random.seed(42)
relu_activations = np.abs(np.random.randn(1000).astype(np.float32) * 0.5)

# Symmetric wastes the negative range
_, _, deq_sym = symmetric_quantize(relu_activations, bits=4)
mse_sym = np.mean((relu_activations - deq_sym) ** 2)

# Asymmetric uses the full range for positive values
_, _, _, deq_asym = asymmetric_quantize(relu_activations, bits=4)
mse_asym = np.mean((relu_activations - deq_asym) ** 2)

print(f"INT4 Symmetric MSE: {mse_sym:.6f}")
print(f"INT4 Asymmetric MSE: {mse_asym:.6f}")
print(f"Improvement: {(1 - mse_asym/mse_sym)*100:.1f}%")
# INT4 Symmetric MSE: 0.006299
# INT4 Asymmetric MSE: 0.001297
# Improvement: 79.4%
```
For skewed distributions, asymmetric quantization can cut error by ~80% at the same bit width. The cost is one extra integer (the zero point) per scale factor — negligible overhead.
For neural network weights, which are typically centered near zero, symmetric quantization works well. For activations, which are often non-negative, asymmetric is the better choice. Modern quantization frameworks use symmetric for weights and asymmetric for activations.
Granularity: Per-Tensor, Per-Channel, Per-Group
We've been computing a single scale factor for the entire weight tensor. That's fine when weights are well-behaved, but in practice, one rogue outlier can ruin everything. If 99.9% of weights live in [-0.5, 0.5] but one weight hits 5.0, the scale factor stretches to cover that outlier — and all the small weights near zero get crushed into too few integer levels.
The fix: compute separate scale factors for smaller chunks of the tensor.
```python
def quantize_per_tensor(W, bits=4):
    """One scale for the entire matrix."""
    _, _, deq = symmetric_quantize(W.flatten(), bits)
    return deq.reshape(W.shape), 1  # 1 scale factor total


def quantize_per_channel(W, bits=4):
    """One scale per row (output channel)."""
    deq = np.zeros_like(W)
    for i in range(W.shape[0]):
        _, _, deq[i] = symmetric_quantize(W[i], bits)
    return deq, W.shape[0]  # one scale per row


def quantize_per_group(W, bits=4, group_size=128):
    """One scale per group of `group_size` elements in each row."""
    deq = np.zeros_like(W)
    num_scales = 0
    for i in range(W.shape[0]):
        for j in range(0, W.shape[1], group_size):
            chunk = W[i, j:j+group_size]
            _, _, deq[i, j:j+group_size] = symmetric_quantize(chunk, bits)
            num_scales += 1
    return deq, num_scales


# Create a weight matrix with one outlier channel
np.random.seed(42)
W = np.random.randn(64, 512).astype(np.float32) * 0.02
W[13, :] *= 50  # Channel 13 has abnormally large weights

print("Quantization granularity comparison (INT4):")
print(f"{'Method':<16} {'MSE':>12} {'Scale factors':>14}")
print("-" * 44)
for name, fn in [("Per-tensor", quantize_per_tensor),
                 ("Per-channel", quantize_per_channel),
                 ("Per-group", lambda W, b: quantize_per_group(W, b, 128))]:
    deq, n_scales = fn(W, 4)
    mse = np.mean((W - deq) ** 2)
    print(f"{name:<16} {mse:>12.8f} {n_scales:>14}")

# Quantization granularity comparison (INT4):
# Method                    MSE  Scale factors
# --------------------------------------------
# Per-tensor         0.00071684              1
# Per-channel        0.00033108             64
# Per-group          0.00024366            256
```
Per-group quantization with groups of 128 elements produces ~3x lower error than per-tensor — and the improvement is even more dramatic for the non-outlier channels (over 50x), because the outlier in channel 13 only affects its own group's scale, not the entire matrix. The overhead is modest: each scale factor is a float16 (2 bytes) shared across 128 four-bit weights (64 bytes), adding just 0.125 bits per weight. That's why GPTQ and QLoRA both use per-group quantization.
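The overhead arithmetic generalizes; a throwaway helper (the name is mine, not from any library) makes the trade-off explicit:

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16):
    """Total stored bits per weight: payload plus amortized scale metadata."""
    return weight_bits + scale_bits / group_size

for g in [32, 64, 128, 256]:
    print(f"group_size={g:>3}: {effective_bits(group_size=g):.3f} bits/weight")
# group_size= 32: 4.500 bits/weight
# group_size= 64: 4.250 bits/weight
# group_size=128: 4.125 bits/weight
# group_size=256: 4.062 bits/weight
```

Smaller groups cost more metadata but isolate outliers better — picking `group_size` is exactly this trade-off.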
Post-Training Quantization: The Quick and Dirty Way
The simplest approach to quantizing a model is Round-to-Nearest (RTN): take a trained model, apply our quantization formula to every weight tensor, and hope for the best. No retraining, no calibration data, no fuss.
```python
def simple_linear(x, W, b):
    """A plain linear layer: y = xW^T + b."""
    return x @ W.T + b


def quantized_linear(x, W, b, bits=8):
    """Linear layer with weight quantization."""
    q, scale, W_deq = symmetric_quantize(W.flatten(), bits)
    W_deq = W_deq.reshape(W.shape)
    return x @ W_deq.T + b


# Build a tiny trained network (2 layers, classification)
np.random.seed(0)
W1 = np.random.randn(32, 16).astype(np.float32) * 0.3
b1 = np.zeros(32)
W2 = np.random.randn(4, 32).astype(np.float32) * 0.3
b2 = np.zeros(4)

# Generate some test data
X_test = np.random.randn(200, 16).astype(np.float32)


def forward(x, bits=32):
    """Forward pass — quantize weights if bits < 32."""
    if bits < 32:
        _, _, W1_q = symmetric_quantize(W1.flatten(), bits)
        _, _, W2_q = symmetric_quantize(W2.flatten(), bits)
        h = np.maximum(0, x @ W1_q.reshape(W1.shape).T + b1)
        return h @ W2_q.reshape(W2.shape).T + b2
    else:
        h = np.maximum(0, x @ W1.T + b1)
        return h @ W2.T + b2


# Compare outputs at different bit widths
ref_output = forward(X_test, bits=32)
for bits in [8, 4, 3, 2]:
    q_output = forward(X_test, bits=bits)
    mse = np.mean((ref_output - q_output) ** 2)
    max_err = np.max(np.abs(ref_output - q_output))
    print(f"INT{bits}: Output MSE={mse:.6f}, Max deviation={max_err:.4f}")

# INT8: Output MSE=0.000178, Max deviation=0.0543
# INT4: Output MSE=0.054342, Max deviation=0.9069
# INT3: Output MSE=0.273438, Max deviation=2.0356
# INT2: Output MSE=1.565282, Max deviation=4.4790
```
RTN works beautifully at 8-bit — the output barely changes. At 4-bit, errors start accumulating across layers. At 2-bit, the output is garbage. This is the fundamental challenge: rounding errors compound through the network. Each layer's output error becomes the next layer's input error, and with dozens of layers in a real model, those small per-weight errors snowball.
The crucial observation: minimizing weight error isn't the same as minimizing output error. A weight that's large in magnitude but rarely activated by the data matters less than a small weight that sits on a high-activation path. The next technique exploits this insight.
GPTQ: Clever Error Compensation
GPTQ, introduced by Frantar and colleagues in 2022, changed the game for post-training quantization. Instead of rounding each weight independently and hoping for the best, GPTQ asks: when I round one weight (introducing error), can I adjust the remaining unquantized weights to compensate?
The objective is not to minimize weight error, but output error:

Ŵ* = argmin_Ŵ ‖WX − ŴX‖²

where X is a small calibration dataset (typically 128 samples). The Hessian H = XXᵀ captures how weights interact through the data — if two weights always activate together, adjusting one can compensate for rounding the other.
Think of it like a Rubik's cube: each twist (rounding a weight) disrupts some faces, but a skilled solver compensates with every subsequent move. Here's a simplified implementation on a small matrix:
```python
def gptq_quantize(W, X_cal, bits=4):
    """Simplified GPTQ: column-by-column quantization with
    Hessian-based error compensation."""
    W = W.copy().astype(np.float64)
    n_rows, n_cols = W.shape
    q_max = 2**(bits - 1) - 1

    # Compute Hessian: H = X @ X^T (how columns interact through data)
    H = X_cal @ X_cal.T
    # Regularize for numerical stability
    H += 1e-4 * np.eye(n_cols) * np.mean(np.diag(H))
    # Direct inverse here; the real GPTQ uses a Cholesky factorization
    # for speed and stability
    H_inv = np.linalg.inv(H)

    quantized = np.zeros_like(W, dtype=int)
    scales = np.zeros(n_cols)
    errors = []

    # Process column by column
    for col in range(n_cols):
        w_col = W[:, col]

        # Quantize this column (symmetric)
        alpha = np.max(np.abs(w_col)) + 1e-10
        scale = alpha / q_max
        scales[col] = scale
        q_col = np.clip(np.round(w_col / scale), -q_max, q_max)
        quantized[:, col] = q_col.astype(int)

        # Compute the quantization error for this column
        w_hat = scale * q_col
        delta = w_col - w_hat
        errors.append(np.mean(delta ** 2))

        # Compensate: adjust remaining columns using the Hessian
        # Key formula: W[:, remaining] -= delta * H_inv[col, remaining] / H_inv[col, col]
        if col < n_cols - 1:
            h_diag = H_inv[col, col] + 1e-10
            compensation = np.outer(delta, H_inv[col, col+1:] / h_diag)
            W[:, col+1:] -= compensation

    return quantized, errors, scales
```
```python
# Demo: compare RTN vs GPTQ on a small weight matrix
np.random.seed(42)
W_demo = np.random.randn(8, 32).astype(np.float64) * 0.5
X_cal = np.random.randn(32, 64).astype(np.float64) * 0.3

# RTN: just round each weight independently
_, _, W_rtn = symmetric_quantize(W_demo.flatten(), bits=4)
W_rtn = W_rtn.reshape(W_demo.shape)

# GPTQ: round with error compensation
q_gptq, _, gptq_scales = gptq_quantize(W_demo, X_cal, bits=4)

# Dequantize using the scales from GPTQ (computed on compensated weights)
W_gptq = np.zeros_like(W_demo)
for col in range(W_demo.shape[1]):
    W_gptq[:, col] = gptq_scales[col] * q_gptq[:, col]

# Compare OUTPUT error (what actually matters)
Y_ref = W_demo @ X_cal
Y_rtn = W_rtn @ X_cal
Y_gptq = W_gptq @ X_cal
mse_rtn = np.mean((Y_ref - Y_rtn) ** 2)
mse_gptq = np.mean((Y_ref - Y_gptq) ** 2)

print(f"RTN output MSE: {mse_rtn:.6f}")
print(f"GPTQ output MSE: {mse_gptq:.6f}")
print(f"GPTQ reduction: {(1 - mse_gptq/mse_rtn)*100:.1f}%")
# RTN output MSE: 0.018856
# GPTQ output MSE: 0.006265
# GPTQ reduction: 66.8%
```
GPTQ reduces output error by over 65% compared to naive rounding. On real LLMs with billions of parameters and deeper correlations between weights, the improvement is even more dramatic — it's what makes 4-bit quantization usable in practice.
The real GPTQ implementation has three additional optimizations: processing columns in batches of 128 (lazy batch updates for GPU efficiency), using Cholesky decomposition for the Hessian inverse, and a fixed column ordering. But the core idea is exactly what we've built: quantize one column, compensate the rest.
NormalFloat: Information-Theoretically Optimal Quantization
Here's a question that most quantization approaches don't ask: where should we place our quantization levels?
Standard INT4 spaces its 16 levels uniformly: -7, -6, -5, ..., 0, ..., 5, 6, 7. But neural network weights aren't uniform — they're approximately normally distributed, clustered densely near zero with thin tails. A uniform grid wastes levels in the sparse tails where almost no weights live, and doesn't have enough resolution near zero where most weights are.
Tim Dettmers' NormalFloat (NF4), introduced in the QLoRA paper, flips this on its head. Instead of uniformly-spaced levels, place them at the quantiles of the normal distribution, so that each quantization bin captures an equal fraction of the probability mass:
```python
from scipy.stats import norm


def compute_nf4_levels():
    """Compute NormalFloat4 quantization levels.

    Place 16 levels at quantiles of N(0,1) so each bin has
    equal probability mass. Then normalize to [-1, 1] and
    ensure there's an exact zero."""
    # 8 negative levels, 1 zero, 7 positive = 16 total
    # Negative side: 8 quantiles of the negative half
    neg_levels = [norm.ppf((i + 0.5) / (2 * 8)) for i in range(8)]
    # Positive side: 7 quantiles of the positive half (zero handled separately)
    pos_levels = [norm.ppf(0.5 + (i + 0.5) / (2 * 8)) for i in range(1, 8)]
    levels = sorted(neg_levels + [0.0] + pos_levels)
    # Normalize to [-1, 1]
    max_abs = max(abs(l) for l in levels)
    levels = [l / max_abs for l in levels]
    return np.array(levels)


def compute_int4_levels():
    """Standard INT4 levels, normalized to [-1, 1]."""
    return np.linspace(-1, 1, 16)


nf4 = compute_nf4_levels()
int4 = compute_int4_levels()
print("NF4 levels (normalized):")
print(" ", [f"{l:+.4f}" for l in nf4])
print("\nINT4 levels (uniform):")
print(" ", [f"{l:+.4f}" for l in int4])

# Compare quantization error on normally-distributed weights
np.random.seed(42)
weights = np.random.randn(10000).astype(np.float32)


def quantize_with_levels(w, levels):
    """Map each weight to the nearest level."""
    scale = np.max(np.abs(w))
    normalized = w / scale
    # Find nearest level for each weight
    indices = np.argmin(np.abs(normalized[:, None] - levels[None, :]), axis=1)
    return levels[indices] * scale


deq_nf4 = quantize_with_levels(weights, nf4)
deq_int4 = quantize_with_levels(weights, int4)
mse_nf4 = np.mean((weights - deq_nf4) ** 2)
mse_int4 = np.mean((weights - deq_int4) ** 2)

print(f"\nMSE on N(0,1) weights:")
print(f" INT4 (uniform): {mse_int4:.6f}")
print(f" NF4 (normal-aware): {mse_nf4:.6f}")
print(f" NF4 improvement: {(1 - mse_nf4/mse_int4)*100:.1f}%")
# MSE on N(0,1) weights:
# INT4 (uniform): 0.023122
# NF4 (normal-aware): 0.013969
# NF4 improvement: 39.6%
```
NF4 cuts quantization error by ~40% compared to uniform INT4 — and that's a free improvement. Same 4 bits, same memory, same speed. The only difference is where you place the grid lines.
The information-theoretic argument is clean: if your data follows distribution P, the optimal quantizer places levels so that each bin has equal probability mass P(bin). This minimizes the expected quantization error because you're spending your precious few bits where the data actually is, not where it isn't.
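The equal-mass property is easy to verify numerically. Here's a sketch using the standard library's `statistics.NormalDist` (same construction as the NF4 levels above, just without scipy):

```python
import math
from statistics import NormalDist

nd = NormalDist()  # standard normal N(0,1)

# 16 equal-mass bins: boundaries where the CDF hits k/16,
# one representative level at each bin's probability midpoint
edges = [nd.inv_cdf(k / 16) for k in range(1, 16)]
levels = [nd.inv_cdf((k + 0.5) / 16) for k in range(16)]

# Probability mass between consecutive boundaries is 1/16 by construction
bounds = [-math.inf] + edges + [math.inf]
masses = [nd.cdf(bounds[i + 1]) - nd.cdf(bounds[i]) for i in range(16)]
print(all(abs(m - 1 / 16) < 1e-7 for m in masses))  # True
```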
Double Quantization and QLoRA
With per-group quantization (group_size=64), each group needs a float16 scale factor: 2 bytes per 64 four-bit weights = 0.25 extra bits per weight. Dettmers' second trick: quantize the scale factors themselves to 8-bit, reducing the overhead to 0.127 bits per weight. This "double quantization" is minor in isolation but saves gigabytes at 70B scale.
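The overhead figure follows from the QLoRA configuration (8-bit scales per 64-weight group, one 32-bit second-level scale per 256 groups). A quick sketch, with the constants taken from that setup and the function name my own:

```python
def scale_overhead_bits(scale_bits=8, group_size=64, outer_bits=32, outer_group=256):
    """Scale-factor metadata, in extra bits per weight, after double quantization."""
    # first-level scales, amortized per weight, plus second-level scales
    # amortized over group_size * outer_group weights
    return scale_bits / group_size + outer_bits / (group_size * outer_group)

print(f"{scale_overhead_bits():.3f} extra bits/weight")  # 0.127
```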
QLoRA combines everything: 4-bit NF4 quantization for the base model, double-quantized scale factors, and float16 LoRA adapters for fine-tuning. The base model is frozen at 4 bits (saving memory), while the small LoRA matrices train in full precision (preserving gradient quality). This is how you fine-tune a 65B model on a single 48GB GPU — the payoff we promised in the LoRA post.
Quantization-Aware Training: Teaching the Model to Cope
Everything so far has been post-training quantization (PTQ): take a trained model and quantize it after the fact, treating quantization as damage to be minimized. But what if the model could learn to be quantized?
Quantization-Aware Training (QAT) inserts fake quantization into the training forward pass: quantize the weights, then immediately dequantize them back to float. The network sees the "damaged" values during training and learns to work around the rounding errors.
There's one catch: round() has zero gradient almost everywhere (it's a step function). If we can't backpropagate through rounding, we can't train. The fix is the Straight-Through Estimator (STE): during the backward pass, pretend round() is the identity function. It's a brazen lie, but it works remarkably well.
```python
class FakeQuantize:
    """Fake quantization for QAT: quantize in forward, STE in backward."""

    def __init__(self, bits=4):
        self.bits = bits
        self.q_max = 2**(bits - 1) - 1

    def forward(self, w):
        """Quantize → dequantize (simulates quantization error)."""
        alpha = np.max(np.abs(w)) + 1e-10
        scale = alpha / self.q_max
        q = np.clip(np.round(w / scale), -self.q_max, self.q_max)
        return scale * q  # float output with quantization noise baked in

    def backward(self, grad):
        """Straight-Through Estimator: pass gradient unchanged.

        Equivalent to pretending round() is the identity."""
        return grad  # That's it. Just pass it through.


# Train a network WITH and WITHOUT QAT
# Needs enough parameters that INT4 quantization causes real damage
np.random.seed(42)
n_in, n_hidden, n_out = 16, 64, 4
lr = 0.02

# Generate a classification dataset with nonlinear boundaries
X = np.random.randn(500, n_in).astype(np.float32)
targets = ((X[:, 0] * X[:, 1] > 0).astype(int) +
           (X[:, 2] + X[:, 3] > 0.5).astype(int) +
           (X[:, 4] > 0).astype(int))
targets = np.clip(targets, 0, n_out - 1)


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def train_network(use_qat=False, bits=4, epochs=500):
    """Train a 2-layer network, optionally with fake quantization."""
    np.random.seed(123)  # Same init for fair comparison
    W1 = np.random.randn(n_in, n_hidden) * 0.5
    W2 = np.random.randn(n_hidden, n_out) * 0.5
    fq = FakeQuantize(bits) if use_qat else None

    for epoch in range(epochs):
        # Forward pass (with fake quantization if QAT)
        W1_eff = fq.forward(W1) if fq else W1
        W2_eff = fq.forward(W2) if fq else W2
        h = np.maximum(0, X @ W1_eff)
        logits = h @ W2_eff
        probs = softmax(logits)

        # Cross-entropy loss
        loss = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-10))

        # Backward pass (simplified; STE means grads flow through normally)
        grad_logits = probs.copy()
        grad_logits[np.arange(len(targets)), targets] -= 1
        grad_logits /= len(targets)
        grad_W2 = h.T @ grad_logits
        grad_h = grad_logits @ W2_eff.T
        grad_h[X @ W1_eff <= 0] = 0
        grad_W1 = X.T @ grad_h

        # STE: gradients pass through fake quantization unchanged
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    return W1, W2


# Train both versions
W1_std, W2_std = train_network(use_qat=False)
W1_qat, W2_qat = train_network(use_qat=True, bits=4)


# Now quantize both to 4-bit and evaluate
def evaluate(W1, W2, bits=4):
    _, _, W1_q = symmetric_quantize(W1.flatten(), bits)
    _, _, W2_q = symmetric_quantize(W2.flatten(), bits)
    h = np.maximum(0, X @ W1_q.reshape(W1.shape))
    logits = h @ W2_q.reshape(W2.shape)
    preds = np.argmax(logits, axis=-1)
    return np.mean(preds == targets)


acc_full = evaluate(W1_std, W2_std, bits=32)  # No quantization
acc_ptq = evaluate(W1_std, W2_std, bits=4)    # PTQ: train normal, quantize after
acc_qat = evaluate(W1_qat, W2_qat, bits=4)    # QAT: train with fake quantization
print(f"Full precision (FP32): {acc_full:.1%}")
print(f"PTQ (INT4): {acc_ptq:.1%}")
print(f"QAT (INT4): {acc_qat:.1%}")
# Full precision (FP32): 68.6%
# PTQ (INT4): 59.0%
# QAT (INT4): 65.2%
```
QAT recovers most of the accuracy lost to quantization because the network learns weights that round well. It adjusts its parameters so they land near quantization grid lines, rather than straddling the boundary between two levels.
The practical trade-off: PTQ takes minutes (just round the weights), while QAT requires retraining. For 8-bit quantization, PTQ is almost always sufficient. At 4-bit, GPTQ-style PTQ usually works. Below 4 bits, QAT becomes essential.
The Practical Impact: Memory, Speed, Quality
Let's put concrete numbers on the table. Here's what quantization means for a real 7B-parameter model:
| Format | Bits/Weight | Model Size | Perplexity Δ | Fits On |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | baseline | 2× RTX 4090 |
| FP16 / BF16 | 16 | 14 GB | ~0.0 | 1× RTX 4090 |
| INT8 | 8 | 7 GB | ~0.1 | RTX 4070 |
| INT4 / NF4 | 4 | 3.5 GB | ~0.3 | RTX 3060 / M1 Mac |
| INT3 | 3 | 2.6 GB | ~1.0 | Older 4GB GPUs |
| INT2 | 2 | 1.75 GB | ~5.0+ | Any GPU (but quality suffers) |
The sweet spot is blindingly obvious: 4-bit quantization. You get 8x compression over FP32 (4x over FP16) with barely measurable quality loss. Below 4 bits, quality degrades sharply. Above 4 bits, you're paying for precision the model doesn't need.
There's a beautiful scaling law at work here, too: larger models are more robust to quantization. A 70B model at 4-bit often outperforms a 13B model at 16-bit, even though the 70B is using fewer bits per weight. The extra parameters provide redundancy that absorbs quantization noise. This is why the 4-bit revolution mattered — it didn't just make existing models smaller, it made much bigger models accessible.
Speed also improves, because LLM inference is memory-bandwidth bound. A 4-bit model reads 4x less data from memory per token, which means 3-4x faster generation on typical hardware. You're not just saving space; you're saving time.
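A rough first-order model shows why (my own back-of-envelope, ignoring KV-cache reads, compute, and batching): every generated token streams every weight through memory once.

```python
def decode_tokens_per_sec(params_billion, bits, bandwidth_gb_s):
    """Upper bound on single-stream decode speed if weight reads
    were the only cost: bandwidth divided by model size."""
    model_gb = params_billion * bits / 8
    return bandwidth_gb_s / model_gb

# 7B model on a GPU with ~1000 GB/s of memory bandwidth:
print(f"FP16: ~{decode_tokens_per_sec(7, 16, 1000):.0f} tok/s")  # ~71
print(f"INT4: ~{decode_tokens_per_sec(7, 4, 1000):.0f} tok/s")   # ~286
```

Real systems fall short of this ceiling, but the 4x ratio between the two precisions is what you feel in practice.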
Decision tree: For 8-bit, use simple RTN — it just works. For 4-bit, use GPTQ or AWQ with calibration data — the error compensation is essential at this bit width. For 3-bit or below, you need QAT. For fine-tuning at 4-bit, use QLoRA (NF4 base + LoRA adapters).
Try It: Quantization Playground
Watch weights snap to quantization levels. Toggle between uniform (INT) and NF4 to see why smarter level placement reduces error.
Conclusion
We've now completed the full pipeline that this series has been building:
tokenize → embed → position → attend → softmax → loss → optimize → decode → fine-tune → quantize
From raw text to compressed model, every piece built from scratch. And quantization is arguably the most consequential piece in terms of real-world impact. Without it, running LLMs requires datacenter GPUs. With 4-bit NF4, a 7B model fits in your laptop's GPU memory. A 70B model fits on a single A100 instead of two.
The key ideas we covered:
- Symmetric quantization maps weights to a uniform integer grid centered at zero — simple and effective for bell-curve weight distributions
- Per-group granularity prevents outliers from ruining quantization for an entire tensor
- GPTQ minimizes output error instead of weight error, using Hessian-based compensation to adjust surviving weights when one is rounded
- NF4 places quantization levels at normal distribution quantiles, giving a free ~40% error reduction for the same 4 bits
- QAT lets the model learn to work with quantization noise, recovering accuracy at very low bit widths
The deeper lesson is that quantization isn't just engineering — it's applied information theory. The best quantization schemes don't just compress; they understand the data. NF4 works because it asks "where are the weights?" before deciding where to put the grid lines. That principle — match your representation to your distribution — echoes throughout machine learning.
References & Further Reading
- Jacob et al. — "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (2018) — Google's foundational work on quantization-aware training
- Dettmers et al. — "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022) — introduced mixed-precision decomposition for handling emergent outlier features
- Frantar et al. — "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (2022) — the Hessian-based error compensation method we implemented
- Dettmers et al. — "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) — NormalFloat4, double quantization, and the LoRA combination that democratized fine-tuning
- Lin et al. — "AWQ: Activation-Aware Weight Quantization" (2023) — an alternative approach that protects salient weights based on activation magnitudes
- Nagel et al. — "A White Paper on Neural Network Quantization" (2021) — Qualcomm's comprehensive survey of quantization techniques
- Previous DadOps elementary posts: LoRA from Scratch (fine-tuning with QLoRA connection), Attention from Scratch (the weight matrices we're quantizing), Softmax & Temperature (numerical precision matters here too), Loss Functions (the training objective for QAT)