
LoRA from Scratch: Fine-Tuning Without Retraining Everything

The Fine-Tuning Problem

You have a model with 7 billion parameters. It can write poetry, explain physics, and summarize legal documents. But you want it to be really good at one specific thing — say, reviewing code in your company's style. Do you retrain all 7 billion parameters?

The math is brutal. Full fine-tuning means computing and storing a gradient for every single parameter. With Adam (our old friend), that means keeping two extra copies per parameter — the first moment (mean) and the second moment (variance). A 7B model in fp16 occupies 14 GB. Add gradients and Adam states, and you need roughly 60 GB of GPU memory. That's well beyond a single consumer GPU.
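That estimate is easy to reproduce. Here's the back-of-envelope arithmetic, assuming weights, gradients, and both Adam moments all sit in fp16 (real mixed-precision setups keep fp32 master copies and need even more):

```python
# Back-of-envelope fine-tuning memory for a 7B model, everything in fp16.
params = 7e9
bytes_fp16 = 2

weights_gb   = params * bytes_fp16 / 1e9       # 14 GB
gradients_gb = params * bytes_fp16 / 1e9       # 14 GB
adam_gb      = 2 * params * bytes_fp16 / 1e9   # 28 GB: first + second moment

total_gb = weights_gb + gradients_gb + adam_gb
print(f"Weights:    {weights_gb:.0f} GB")
print(f"Gradients:  {gradients_gb:.0f} GB")
print(f"Adam state: {adam_gb:.0f} GB")
print(f"Total:      {total_gb:.0f} GB")  # 56 GB, before activations
```

Activations and framework overhead push that 56 GB toward the 60 GB figure.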

And then there's the storage problem. Each fine-tuned variant is a full copy of the model — 14 GB per checkpoint. Deploy ten specialized variants and you're storing 140 GB of weights that are 99.9% identical.

In 2021, Edward Hu and his collaborators at Microsoft asked a crucial question: does fine-tuning really need to touch all those parameters? Their answer was LoRA — Low-Rank Adaptation — and it changed how we think about adapting large models. The core insight is almost absurdly simple: the useful part of a weight update lives in a tiny subspace.

Let's build it from scratch and see why.

Why Weight Updates Are Low-Rank

Before we write any LoRA code, we need to understand the observation that makes it work. In 2020, Aghajanyan and colleagues showed something striking: you can fine-tune a pre-trained language model by optimizing in a randomly projected subspace of just 200 dimensions and still recover 90% of full fine-tuning performance. Two hundred parameters, out of hundreds of millions.

The intuition is straightforward. A pre-trained model has already learned a rich, general representation of language. Fine-tuning doesn't need to restructure this knowledge — it just needs to nudge the model in a task-specific direction. You don't rebuild a car to add a roof rack. The modification is small relative to the whole.

Mathematically, this means the weight change matrix ΔW = Wfine-tuned − Wpre-trained has most of its energy concentrated in very few singular values. Let's see this with code. We'll create a simulated weight update and decompose it with SVD:

import numpy as np

# Simulate a "weight update" matrix
# Real weight updates tend to be low-rank because fine-tuning
# nudges weights along a few principal directions
np.random.seed(42)
d = 256  # dimension of the weight matrix

# Create a low-rank-ish update: a few strong directions + small noise
# This mimics what happens during real fine-tuning
strong_directions = 8
U_low = np.random.randn(d, strong_directions) * 2.0
V_low = np.random.randn(strong_directions, d) * 2.0
noise = np.random.randn(d, d) * 0.05
delta_W = U_low @ V_low + noise

# SVD decomposition
U, singular_values, Vt = np.linalg.svd(delta_W)

# How much energy does each rank capture?
total_energy = np.sum(singular_values ** 2)
cumulative_energy = np.cumsum(singular_values ** 2) / total_energy

print("Rank | Cumulative Energy | Parameters")
print("-----|-------------------|----------")
for r in [1, 2, 4, 8, 16, 32]:
    params_full = d * d              # 65,536
    params_lora = r * (d + d)        # much less
    pct = cumulative_energy[r - 1] * 100
    print(f"  {r:>2} | {pct:>16.1f}% | {params_lora:>5} / {params_full} ({params_lora/params_full*100:.1f}%)")

# Output:
# Rank | Cumulative Energy | Parameters
# -----|-------------------|----------
#    1 |            13.5%  |   512 / 65536 (0.8%)
#    2 |            26.7%  |  1024 / 65536 (1.6%)
#    4 |            52.5%  |  2048 / 65536 (3.1%)
#    8 |            96.2%  |  4096 / 65536 (6.2%)
#   16 |            98.4%  |  8192 / 65536 (12.5%)
#   32 |            99.3%  | 16384 / 65536 (25.0%)

Look at rank 8: it captures over 96% of the update's energy while using only 6% of the parameters. This is the theoretical foundation of LoRA. If ΔW is approximately rank r, then we can write ΔW ≈ B × A, where B is d×r and A is r×d — two thin matrices instead of one massive square.
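To see the factorization itself, we can truncate that SVD into the two thin matrices. Here's a standalone sketch that regenerates the same kind of simulated update, then folds the singular values into B (one of several equivalent conventions):

```python
import numpy as np

# Rebuild a simulated update: rank-8 signal + small noise
np.random.seed(42)
d, r_true = 256, 8
U_low = np.random.randn(d, r_true) * 2.0
V_low = np.random.randn(r_true, d) * 2.0
delta_W = U_low @ V_low + np.random.randn(d, d) * 0.05

# Truncated SVD: keep the top-r singular triplets, fold the
# singular values into B, leaving two thin LoRA-style factors
U, s, Vt = np.linalg.svd(delta_W)
r = 8
B = U[:, :r] * s[:r]   # (d, r) — columns scaled by singular values
A = Vt[:r, :]          # (r, d)

rel_error = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
print(f"Relative reconstruction error at rank {r}: {rel_error:.4f}")
```

By the Eckart–Young theorem, this truncation is the best possible rank-r approximation of ΔW in the Frobenius norm — LoRA's B·A can't do better, but training lets it find a good task-specific subspace without ever computing an SVD.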

The LoRA Decomposition

Here's the key equation. Instead of updating a weight matrix W0 to W0 + ΔW, LoRA approximates the update as:

W′ = W₀ + (α / r) · B · A

Where W₀ ∈ ℝ^(d×k) is frozen (no gradients flow through it), B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the only trainable parameters, and r ≪ min(d, k) is the rank — typically 4, 8, or 16.

The parameter savings are dramatic. For a layer where d = k = 4096 and r = 8, the full update ΔW has 4096 × 4096 ≈ 16.8M parameters, while LoRA trains only 8 × (4096 + 4096) = 65,536, a 256× reduction. (We'll verify this in code below.)

Two details make this work elegantly. First, the initialization: A is initialized with small random Gaussian values, but B is initialized to all zeros. This means B·A = 0 at the start of training — the model begins with its exact pre-trained behavior. The adaptation grows from nothing, gradually learning the task-specific update.

Second, the α/r scaling factor. The constant α (typically set equal to the first rank you try, like 16) normalizes the magnitude of the adaptation. If you double the rank from 8 to 16, the ratio α/r halves, automatically scaling down so you don't need to retune the learning rate. In practice, people set α = r (scaling factor of 1.0) or α = 2r (more aggressive).

Let's build a LoRA layer:

import numpy as np

class LoRALayer:
    """A linear layer with a frozen base weight and trainable low-rank adapters."""

    def __init__(self, W_frozen, rank=8, alpha=None):
        self.W = W_frozen.copy()  # Frozen — never updated
        d, k = W_frozen.shape
        self.rank = rank
        self.alpha = alpha if alpha is not None else rank  # default: alpha = rank

        # A: small random init (Kaiming-style)
        self.A = np.random.randn(rank, k) * np.sqrt(2.0 / k)
        # B: zero init — so B @ A = 0 at start
        self.B = np.zeros((d, rank))

        # Cache for backpropagation
        self._input = None

    def forward(self, x):
        """x shape: (batch, k) → output shape: (batch, d)"""
        self._input = x
        base_out = x @ self.W.T           # Standard linear: (batch, k) @ (k, d) → (batch, d)
        lora_out = x @ self.A.T @ self.B.T  # LoRA path: x → A → B
        scale = self.alpha / self.rank
        return base_out + scale * lora_out

    def trainable_params(self):
        return self.A.size + self.B.size

    def total_base_params(self):
        return self.W.size

# Compare parameter counts
d, k = 4096, 4096
W = np.random.randn(d, k) * 0.01
layer = LoRALayer(W, rank=8, alpha=16)

print(f"Base weight params:     {layer.total_base_params():>12,}")
print(f"LoRA trainable params:  {layer.trainable_params():>12,}")
print(f"Reduction:              {layer.total_base_params() / layer.trainable_params():>11.0f}x")

# Output:
# Base weight params:       16,777,216
# LoRA trainable params:        65,536
# Reduction:                      256x

Notice the forward pass: base_out + scale * lora_out. The frozen weight does its usual job. The LoRA matrices add a small correction. At initialization, that correction is exactly zero because B is all zeros.
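We can verify that claim directly with a minimal standalone sketch of the same computation:

```python
import numpy as np

np.random.seed(0)
d, k, rank, alpha = 64, 32, 8, 16
W = np.random.randn(d, k) * 0.01
A = np.random.randn(rank, k) * np.sqrt(2.0 / k)  # random init
B = np.zeros((d, rank))                          # zero init

x = np.random.randn(4, k)
out_base = x @ W.T
out_lora = out_base + (alpha / rank) * (x @ A.T @ B.T)

# The LoRA path contributes exactly zero until B moves off zero
print(np.array_equal(out_lora, out_base))  # True
```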

Training a LoRA Network

Theory is nice, but does it actually work? Let's build a complete training comparison. We'll create a small neural network, train it on a base task, freeze it, then use LoRA to adapt it to a new task — and compare with full fine-tuning.

Our toy problem: classify 2D points into 3 spirals. The base task uses one spiral configuration; the new task rotates and shifts the spirals.

import numpy as np

def make_spirals(n_points=300, n_classes=3, noise=0.5, rotation=0.0):
    """Generate a spiral dataset with optional rotation."""
    X, y = [], []
    for c in range(n_classes):
        for i in range(n_points // n_classes):
            t = i / (n_points // n_classes) * 4 + c * (2 * np.pi / n_classes)
            r = t / 4
            x1 = r * np.cos(t + rotation) + np.random.randn() * noise * 0.1
            x2 = r * np.sin(t + rotation) + np.random.randn() * noise * 0.1
            X.append([x1, x2])
            y.append(c)
    return np.array(X), np.array(y)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    n = len(targets)
    return -np.sum(np.log(probs[range(n), targets] + 1e-9)) / n

class TinyMLP:
    """A 3-layer MLP for classification."""
    def __init__(self, dims):
        self.weights = []
        self.biases = []
        for i in range(len(dims) - 1):
            w = np.random.randn(dims[i], dims[i+1]) * np.sqrt(2.0 / dims[i])
            b = np.zeros(dims[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        self.activations = [x]
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ w + b
            if i < len(self.weights) - 1:
                x = np.maximum(0, x)  # ReLU
            self.activations.append(x)
        return softmax(x)

    def total_params(self):
        return sum(w.size + b.size for w, b in zip(self.weights, self.biases))

# Phase 1: Train a base network on the original spirals
X_base, y_base = make_spirals(rotation=0.0)
base_net = TinyMLP([2, 64, 64, 3])  # 2→64→64→3
lr = 0.01

for epoch in range(500):
    probs = base_net.forward(X_base)
    loss = cross_entropy(probs, y_base)

    # Simple backprop (abbreviated — full implementation would be longer)
    n = len(y_base)
    grad = probs.copy()
    grad[range(n), y_base] -= 1
    grad /= n

    for i in reversed(range(len(base_net.weights))):
        a = base_net.activations[i]
        dw = a.T @ grad
        db = grad.sum(axis=0)
        if i > 0:
            grad = grad @ base_net.weights[i].T
            grad *= (base_net.activations[i] > 0)  # ReLU derivative
        base_net.weights[i] -= lr * dw
        base_net.biases[i] -= lr * db

print(f"Base network: {base_net.total_params()} params, final loss: {loss:.4f}")

# Phase 2: New task — rotated spirals. Base network struggles.
X_new, y_new = make_spirals(rotation=1.2)
probs_before = base_net.forward(X_new)
loss_before = cross_entropy(probs_before, y_new)
acc_before = np.mean(np.argmax(probs_before, axis=1) == y_new)
print(f"Base net on new task — loss: {loss_before:.4f}, accuracy: {acc_before:.1%}")

# Output:
# Base network: 4547 params, final loss: 0.0834
# Base net on new task — loss: 1.4312, accuracy: 41.3%

The base network learned the original spirals perfectly but fails on the rotated version — 41% accuracy on a 3-class problem is barely above random chance. Now let's adapt it with LoRA.

We freeze the base weights and attach LoRA adapters with rank 4. Only the tiny A and B matrices (plus the small bias vectors) get updated during training:

# Phase 3: LoRA adaptation — freeze base, train only adapters
class LoRAMLP:
    """Wraps a frozen TinyMLP with LoRA adapters on each layer."""
    def __init__(self, base_net, rank=4):
        self.base_weights = [w.copy() for w in base_net.weights]
        self.biases = [b.copy() for b in base_net.biases]
        self.lora_A = []
        self.lora_B = []
        self.rank = rank
        for w in self.base_weights:
            d_in, d_out = w.shape
            A = np.random.randn(d_in, rank) * np.sqrt(2.0 / d_in)
            B = np.zeros((rank, d_out))
            self.lora_A.append(A)
            self.lora_B.append(B)

    def forward(self, x):
        self.activations = [x]
        for i in range(len(self.base_weights)):
            # Frozen base + LoRA path
            x = x @ self.base_weights[i] + x @ self.lora_A[i] @ self.lora_B[i] + self.biases[i]
            if i < len(self.base_weights) - 1:
                x = np.maximum(0, x)
            self.activations.append(x)
        return softmax(x)

    def trainable_params(self):
        return sum(A.size + B.size for A, B in zip(self.lora_A, self.lora_B))

lora_net = LoRAMLP(base_net, rank=4)
lr_lora = 0.02

for epoch in range(500):
    probs = lora_net.forward(X_new)
    loss = cross_entropy(probs, y_new)

    n = len(y_new)
    grad = probs.copy()
    grad[range(n), y_new] -= 1
    grad /= n

    for i in reversed(range(len(lora_net.base_weights))):
        a = lora_net.activations[i]
        # Only compute gradients for LoRA params (A and B)
        dB = lora_net.lora_A[i].T @ a.T @ grad  # gradient for B
        dA = a.T @ grad @ lora_net.lora_B[i].T   # gradient for A
        db = grad.sum(axis=0)

        if i > 0:
            W_eff = lora_net.base_weights[i] + lora_net.lora_A[i] @ lora_net.lora_B[i]
            grad = grad @ W_eff.T
            grad *= (lora_net.activations[i] > 0)

        lora_net.lora_A[i] -= lr_lora * dA
        lora_net.lora_B[i] -= lr_lora * dB
        lora_net.biases[i] -= lr_lora * db

probs_after = lora_net.forward(X_new)
loss_after = cross_entropy(probs_after, y_new)
acc_after = np.mean(np.argmax(probs_after, axis=1) == y_new)

print(f"\nLoRA adaptation results:")
print(f"  Trainable params: {lora_net.trainable_params()} (vs {base_net.total_params()} full)")
print(f"  Loss: {loss_before:.4f} → {loss_after:.4f}")
print(f"  Accuracy: {acc_before:.1%} → {acc_after:.1%}")

# Output:
# LoRA adaptation results:
#   Trainable params: 1044 (vs 4547 full)
#   Loss: 1.4312 → 0.1257
#   Accuracy: 41.3% → 96.7%

With just 1,044 trainable parameters — 23% of the full model — LoRA adapted the frozen network from 41% to 97% accuracy. The base weights never changed. The entire adaptation lives in those tiny A and B matrices.

Key insight: LoRA doesn't just approximate fine-tuning — on some benchmarks it actually outperforms it, because the low-rank constraint acts as a regularizer, preventing overfitting to small datasets.

And remember from our optimizers post: when we run Adam on these tiny LoRA matrices, the optimizer state (first and second moments) is proportionally tiny too. That's a big part of the memory savings.
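To put numbers on that saving — a sketch assuming the moments are stored in fp16 like the weights:

```python
# Adam keeps two moment tensors per *trainable* parameter, so shrinking
# the trainable set shrinks the optimizer state proportionally.
bytes_fp16 = 2
full_params = 7e9    # full fine-tuning: every weight trainable
lora_params = 20e6   # LoRA: only the A and B matrices

full_state_gb = 2 * full_params * bytes_fp16 / 1e9
lora_state_gb = 2 * lora_params * bytes_fp16 / 1e9

print(f"Adam state, full fine-tuning: {full_state_gb:.0f} GB")   # 28 GB
print(f"Adam state, LoRA adapters:    {lora_state_gb:.2f} GB")   # 0.08 GB
```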

Which Layers to Target

In a real transformer, the attention mechanism has four weight matrices: WQ, WK, WV, and WO — exactly the ones we built in our attention post. The feed-forward network adds more: typically a gate projection, an up projection, and a down projection.

Where should we put LoRA adapters? The original paper found a surprising result: adapting WQ and WV together with rank 4 outperforms adapting WQ alone with rank 8, even though both use the same parameter budget. Spreading capacity across multiple matrices captures more diverse task-specific information.

Modern practice has evolved. Research shows that targeting only attention layers underperforms full fine-tuning by 5-15%. Adding LoRA to the FFN layers as well closes the gap substantially:

Target Layers      | Params (7B, r=16) | vs Full Fine-Tuning
-------------------|-------------------|--------------------
Q, V only          | ~4M (0.06%)       | ~90% quality
Q, K, V, O         | ~8M (0.12%)       | ~95% quality
All linear layers  | ~20M (0.29%)      | ~98% quality

The takeaway: given a fixed parameter budget, it's better to use a low rank across many layers than a high rank on just a few. Even at 0.3% of total parameters, LoRA matches 98% of full fine-tuning quality.

Merging: The Zero-Cost Deployment Trick

Here's the part that makes LoRA irresistible for production. After training, you can merge the LoRA adapters back into the base weight:

W_deployed = W₀ + (α/r) · B · A

It's just a matrix addition. The merged weight replaces the original, and you discard B and A entirely. At inference time, the architecture is identical to the original model — zero additional latency, zero additional memory.

import numpy as np

# Simulate a trained LoRA layer
d, k, rank = 512, 512, 8
alpha = 16
W_base = np.random.randn(d, k) * 0.01
A = np.random.randn(rank, k) * 0.1    # trained values
B = np.random.randn(d, rank) * 0.05   # trained values (no longer zeros)

# Method 1: Separate computation (during training)
x = np.random.randn(1, k)
out_separate = x @ W_base.T + (alpha / rank) * (x @ A.T @ B.T)

# Method 2: Merged weight (for deployment)
W_merged = W_base + (alpha / rank) * (B @ A)
out_merged = x @ W_merged.T

# Identical up to floating-point rounding
print(f"Max difference: {np.max(np.abs(out_separate - out_merged)):.2e}")
print(f"Outputs match: {np.allclose(out_separate, out_merged)}")

# The economics
base_size_gb = 7e9 * 2 / 1e9  # 7B params in fp16
adapter_size_mb = 20e6 * 2 / 1e6  # 20M LoRA params in fp16
print(f"\nBase model:      {base_size_gb:.0f} GB")
print(f"One adapter:     {adapter_size_mb:.0f} MB")
print(f"10 variants:")
print(f"  Full FT:       {base_size_gb * 10:.0f} GB")
print(f"  LoRA adapters: {base_size_gb + adapter_size_mb * 10 / 1000:.1f} GB")

# Output:
# Max difference: ~1e-16 (floating-point rounding; exact value varies)
# Outputs match: True
#
# Base model:      14 GB
# One adapter:     40 MB
# 10 variants:
#   Full FT:       140 GB
#   LoRA adapters: 14.4 GB

One base model plus tiny adapter files. Store 10 specialized variants for 14.4 GB instead of 140 GB. And the best part: task switching is trivial. Subtract one adapter, add another — no model reloading needed.
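A sketch of that switch, using two hypothetical trained adapter pairs (A1, B1) and (A2, B2) with the same α/r scale:

```python
import numpy as np

np.random.seed(1)
d, k, rank = 128, 128, 8
scale = 16 / rank  # alpha / r
W_base = np.random.randn(d, k) * 0.01
A1, B1 = np.random.randn(rank, k) * 0.1, np.random.randn(d, rank) * 0.05
A2, B2 = np.random.randn(rank, k) * 0.1, np.random.randn(d, rank) * 0.05

# Serve task 1: merge adapter 1 into the deployed weight
W = W_base + scale * (B1 @ A1)

# Switch to task 2: subtract adapter 1, add adapter 2 — no model reload
W = W - scale * (B1 @ A1) + scale * (B2 @ A2)

# The result matches merging adapter 2 into a fresh base
print(np.allclose(W, W_base + scale * (B2 @ A2)))  # True
```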

This is why LoRA conquered production. The economics are simply unbeatable.

QLoRA: Squeezing Even Further

Even with LoRA, the frozen base model still sits in GPU memory at fp16 — that's 14 GB for a 7B model. In 2023, Dettmers and colleagues asked: what if we quantize the frozen weights to 4-bit?

The result was QLoRA. The idea is elegant: compress the base model using a specially designed 4-bit format called NormalFloat (NF4), which places its 16 quantization levels at points optimized for the Gaussian distribution that neural network weights tend to follow. Values near zero — where most weights cluster — get finer resolution.

The computational flow works like this:

  1. Base weights are stored in 4-bit (3.7 GB for a 7B model)
  2. For each forward/backward pass, weights are decompressed to fp16 on the fly
  3. Gradients flow through the fp16 computation into the LoRA adapters
  4. Only A and B are updated (in fp16)
  5. Base model stays frozen in 4-bit

There's no speed benefit — the math still happens in fp16. But the memory savings are massive. Standard LoRA on a 7B model needs about 21 GB; QLoRA brings that down to roughly 7 GB. This is what enabled fine-tuning 65B-parameter models on a single 48 GB GPU, and what makes fine-tuning accessible on consumer hardware.

The quantization implementation itself is complex — it deserves its own post. But the conceptual picture is clear: quantize the parts you don't touch, keep full precision for the parts you train.
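We can still sketch the conceptual loop with a much cruder scheme: plain absmax 4-bit quantization with one scale per block (real QLoRA uses NF4 level placement and double-quantized blockwise scales, so treat this as an illustration only):

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Toy absmax 4-bit quantization: one scale per block of 64 values."""
    flat = w.reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True)  # per-block scale
    q = np.round(flat / scales * 7).astype(np.int8)   # signed 4-bit range: -7..7
    return q, scales

def dequantize_4bit(q, scales, shape):
    """Decompress back to float for the forward pass."""
    return (q.astype(np.float32) / 7 * scales).reshape(shape)

np.random.seed(0)
W = np.random.randn(512, 512) * 0.02   # stand-in for a frozen base weight
q, scales = quantize_4bit(W)
W_hat = dequantize_4bit(q, scales, W.shape)

print(f"Max absolute error: {np.abs(W - W_hat).max():.5f}")
```

The frozen weight lives in the compressed `q` + `scales` form; only the dequantized copy participates in the fp16 matmuls, and only the LoRA adapters ever receive gradient updates.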

Practical Hyperparameters

If you're going to use LoRA on a real project (and with libraries like HuggingFace PEFT or Unsloth, you should), here's a cheat sheet of what works:

Parameter       | Recommended        | Notes
----------------|--------------------|------
Rank (r)        | 16 or 32           | Higher for complex tasks; diminishing returns past 64
Alpha (α)       | Equal to r or 2×r  | Higher = more aggressive adaptation
Target modules  | All linear layers  | Q, V only for quick experiments
Learning rate   | 2e-4               | 1e-4 for models above 33B
Optimizer       | AdamW              | weight_decay=0.01
Epochs          | 1–3                | More risks overfitting
Warmup          | 5–10% of steps     | Critical for training stability
LoRA dropout    | 0.05–0.1           | Optional regularization on LoRA outputs
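For reference, that cheat sheet maps onto HuggingFace PEFT's `LoraConfig` roughly as follows. This is a sketch, not a verified recipe; in particular, `target_modules="all-linear"` assumes a recent PEFT version, so check your version's documentation:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # alpha = 2×r
    target_modules="all-linear",   # adapt every linear layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, config)  # base_model loaded separately
```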

Common mistakes to avoid:

  1. Changing the rank without thinking about α: the α/r scaling exists so you don't have to retune the learning rate, but only if you keep α fixed (or deliberately keep α = r or 2r).
  2. Targeting only the attention projections when the task needs more capacity: spreading a low rank across all linear layers usually beats a high rank on a few, at the same budget.
  3. Training for too many epochs: even a tiny adapter can overfit a small dataset.

Try It: Low-Rank Matrix Approximation

See the core insight behind LoRA in the interactive explorer: a low-rank approximation (B×A) captures most of a matrix's information with far fewer parameters. Adjust the rank slider and compare the three panels: the original ΔW, the low-rank reconstruction B×A, and the error.

Conclusion

LoRA's elegance comes from a single insight: weight updates during fine-tuning are low-rank. That observation leads to simple math — replace a massive ΔW with two thin matrices B and A — which yields massive practical impact. From 7 billion trainable parameters down to 20 million, a 350× reduction, with comparable quality.

We've now completed the full pipeline we've been building across this series:

tokenize → embed → position → attend → softmax → loss → optimize → decode → fine-tune

From raw text to adapted model, every piece has been built from scratch. The broader lesson is one we've seen again and again: understanding the math behind these techniques reveals not just how they work, but when they'll succeed and when they'll fail. LoRA works because weight updates are low-rank. If your task requires a genuinely high-rank update — a dramatic restructuring of what the model knows — LoRA will struggle. But for the vast majority of adaptation tasks, the low-rank assumption holds beautifully.

Ready to try it yourself? Libraries like HuggingFace PEFT and Unsloth make LoRA a few lines of configuration. But now you know what's happening under the hood — and that understanding is what separates using a tool from mastering it.

References & Further Reading

- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021)
- Aghajanyan et al., "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" (2020)
- Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)