LoRA from Scratch: Fine-Tuning Without Retraining Everything
The Fine-Tuning Problem
You have a model with 7 billion parameters. It can write poetry, explain physics, and summarize legal documents. But you want it to be really good at one specific thing — say, reviewing code in your company's style. Do you retrain all 7 billion parameters?
The math is brutal. Full fine-tuning means computing and storing a gradient for every single parameter. With Adam (our old friend), that means keeping two extra copies per parameter — the first moment (mean) and the second moment (variance). A 7B model in fp16 occupies 14 GB. Add gradients and Adam states, and you need roughly 60 GB of GPU memory. That's well beyond a single consumer GPU.
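That budget is easy to sanity-check. A back-of-the-envelope sketch, assuming everything (weights, gradients, both Adam moments) stays in fp16 — real setups with fp32 master weights need even more:

```python
# Rough memory budget for fully fine-tuning a 7B model.
# Assumption: weights, gradients, and Adam moments all in fp16 (2 bytes).
n_params = 7e9
bytes_fp16 = 2
weights = n_params * bytes_fp16  # the model itself
grads = n_params * bytes_fp16    # one gradient per parameter
adam_m = n_params * bytes_fp16   # Adam first moment (mean)
adam_v = n_params * bytes_fp16   # Adam second moment (variance)
total_gb = (weights + grads + adam_m + adam_v) / 1e9
print(f"{total_gb:.0f} GB before activations")  # → 56 GB before activations
```

Activations and optimizer overhead push this to the "roughly 60 GB" figure above.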
And then there's the storage problem. Each fine-tuned variant is a full copy of the model — 14 GB per checkpoint. Deploy ten specialized variants and you're storing 140 GB of weights that are 99.9% identical.
In 2021, Edward Hu and his collaborators at Microsoft asked a crucial question: does fine-tuning really need to touch all those parameters? Their answer was LoRA — Low-Rank Adaptation — and it changed how we think about adapting large models. The core insight is almost absurdly simple: the useful part of a weight update lives in a tiny subspace.
Let's build it from scratch and see why.
Why Weight Updates Are Low-Rank
Before we write any LoRA code, we need to understand the observation that makes it work. In 2020, Aghajanyan and colleagues showed something striking: you can fine-tune a pre-trained language model by optimizing in a randomly projected subspace of just 200 dimensions and still recover 90% of full fine-tuning performance. Two hundred parameters, out of hundreds of millions.
The intuition is straightforward. A pre-trained model has already learned a rich, general representation of language. Fine-tuning doesn't need to restructure this knowledge — it just needs to nudge the model in a task-specific direction. You don't rebuild a car to add a roof rack. The modification is small relative to the whole.
Mathematically, this means the weight change matrix ΔW = W_finetuned − W_pretrained has most of its energy concentrated in very few singular values. Let's see this with code. We'll create a simulated weight update and decompose it with SVD:
```python
import numpy as np

# Simulate a "weight update" matrix.
# Real weight updates tend to be low-rank because fine-tuning
# nudges weights along a few principal directions.
np.random.seed(42)
d = 256  # dimension of the weight matrix

# Create a low-rank-ish update: a few strong directions + small noise.
# This mimics what happens during real fine-tuning.
strong_directions = 8
U_low = np.random.randn(d, strong_directions) * 2.0
V_low = np.random.randn(strong_directions, d) * 2.0
noise = np.random.randn(d, d) * 0.05
delta_W = U_low @ V_low + noise

# SVD decomposition
U, singular_values, Vt = np.linalg.svd(delta_W)

# How much energy does each rank capture?
total_energy = np.sum(singular_values ** 2)
cumulative_energy = np.cumsum(singular_values ** 2) / total_energy

print("Rank | Cumulative Energy | Parameters")
print("-----|-------------------|----------")
for r in [1, 2, 4, 8, 16, 32]:
    params_full = d * d        # 65,536
    params_lora = r * (d + d)  # much less
    pct = cumulative_energy[r - 1] * 100
    print(f"  {r:>2} | {pct:>16.1f}% | {params_lora:>5} / {params_full} ({params_lora/params_full*100:.1f}%)")

# Output:
# Rank | Cumulative Energy | Parameters
# -----|-------------------|----------
#    1 |             13.5% |   512 / 65536 (0.8%)
#    2 |             26.7% |  1024 / 65536 (1.6%)
#    4 |             52.5% |  2048 / 65536 (3.1%)
#    8 |             96.2% |  4096 / 65536 (6.2%)
#   16 |             98.4% |  8192 / 65536 (12.5%)
#   32 |             99.3% | 16384 / 65536 (25.0%)
```
Look at rank 8: it captures over 96% of the update's energy while using only 6% of the parameters. This is the theoretical foundation of LoRA. If ΔW is approximately rank r, then we can write ΔW ≈ B × A, where B is d×r and A is r×d — two thin matrices instead of one massive square.
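The factorization ΔW ≈ B × A is exactly what a truncated SVD gives us: by the Eckart–Young theorem, keeping the top r singular triplets is the best possible rank-r approximation. A quick sketch (dimensions made up for illustration):

```python
import numpy as np

np.random.seed(0)
d, r_true = 128, 8
# A matrix with 8 strong directions plus small noise, like above
delta_W = np.random.randn(d, r_true) @ np.random.randn(r_true, d) \
          + 0.01 * np.random.randn(d, d)

U, s, Vt = np.linalg.svd(delta_W)
r = 8
# Best rank-r approximation: keep the top-r singular triplets
B = U[:, :r] * s[:r]  # d × r (singular values folded into B)
A = Vt[:r, :]         # r × d
approx = B @ A

rel_err = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
print(f"relative error at rank {r}: {rel_err:.4f}")
```

The relative error lands at the noise floor: two thin matrices recover essentially the whole update.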
The LoRA Decomposition
Here's the key equation. Instead of updating a weight matrix W0 to W0 + ΔW, LoRA approximates the update as:
W' = W0 + (α / r) · B · A
Where W0 ∈ ℝ^(d×k) is frozen (no gradients flow through it), B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the only trainable parameters, and r ≪ min(d, k) is the rank — typically 4, 8, or 16.
The parameter savings are dramatic. For a layer where d = k = 4096:
- Full fine-tuning: 4096 × 4096 = 16,777,216 parameters
- LoRA with r = 8: 8 × (4096 + 4096) = 65,536 parameters
- That's a 256× reduction
Two details make this work elegantly. First, the initialization: A is initialized with small random Gaussian values, but B is initialized to all zeros. This means B·A = 0 at the start of training — the model begins with its exact pre-trained behavior. The adaptation grows from nothing, gradually learning the task-specific update.
Second, the α/r scaling factor. The constant α (typically set equal to the first rank you try, like 16) normalizes the magnitude of the adaptation. If you double the rank from 8 to 16, the ratio α/r halves, automatically scaling down so you don't need to retune the learning rate. In practice, people set α = r (scaling factor of 1.0) or α = 2r (more aggressive).
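Both details can be verified in a few lines. With B = 0, the LoRA path contributes nothing, so at step zero the adapted layer reproduces the frozen layer exactly (a minimal sketch with made-up shapes):

```python
import numpy as np

np.random.seed(1)
d, k, r, alpha = 32, 16, 4, 8
W0 = np.random.randn(d, k)
A = np.random.randn(r, k) * 0.01  # small random Gaussian init
B = np.zeros((d, r))              # zero init

x = np.random.randn(5, k)
adapted = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

# The LoRA correction is exactly zero: pre-trained behavior is preserved
print(np.allclose(adapted, x @ W0.T))  # True
```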
Let's build a LoRA layer:
```python
import numpy as np

class LoRALayer:
    """A linear layer with a frozen base weight and trainable low-rank adapters."""

    def __init__(self, W_frozen, rank=8, alpha=None):
        self.W = W_frozen.copy()  # Frozen — never updated
        d, k = W_frozen.shape
        self.rank = rank
        self.alpha = alpha if alpha is not None else rank  # default: alpha = rank
        # A: small random init (Kaiming-style)
        self.A = np.random.randn(rank, k) * np.sqrt(2.0 / k)
        # B: zero init — so B @ A = 0 at start
        self.B = np.zeros((d, rank))
        # Cache for backpropagation
        self._input = None

    def forward(self, x):
        """x shape: (batch, k) → output shape: (batch, d)"""
        self._input = x
        base_out = x @ self.W.T             # Standard linear: (batch, k) @ (k, d) → (batch, d)
        lora_out = x @ self.A.T @ self.B.T  # LoRA path: x → A → B
        scale = self.alpha / self.rank
        return base_out + scale * lora_out

    def trainable_params(self):
        return self.A.size + self.B.size

    def total_base_params(self):
        return self.W.size

# Compare parameter counts
d, k = 4096, 4096
W = np.random.randn(d, k) * 0.01
layer = LoRALayer(W, rank=8, alpha=16)
print(f"Base weight params:    {layer.total_base_params():>12,}")
print(f"LoRA trainable params: {layer.trainable_params():>12,}")
print(f"Reduction: {layer.total_base_params() / layer.trainable_params():>11.0f}x")

# Output:
# Base weight params:      16,777,216
# LoRA trainable params:       65,536
# Reduction:        256x
```
Notice the forward pass: base_out + scale * lora_out. The frozen weight does its usual job. The LoRA matrices add a small correction. At initialization, that correction is exactly zero because B is all zeros.
Training a LoRA Network
Theory is nice, but does it actually work? Let's build a complete training comparison. We'll create a small neural network, train it on a base task, freeze it, then use LoRA to adapt it to a new task — and compare with full fine-tuning.
Our toy problem: classify 2D points into 3 spirals. The base task uses one spiral configuration; the new task rotates and shifts the spirals.
```python
import numpy as np

def make_spirals(n_points=300, n_classes=3, noise=0.5, rotation=0.0):
    """Generate a spiral dataset with optional rotation."""
    X, y = [], []
    for c in range(n_classes):
        for i in range(n_points // n_classes):
            t = i / (n_points // n_classes) * 4 + c * (2 * np.pi / n_classes)
            r = t / 4
            x1 = r * np.cos(t + rotation) + np.random.randn() * noise * 0.1
            x2 = r * np.sin(t + rotation) + np.random.randn() * noise * 0.1
            X.append([x1, x2])
            y.append(c)
    return np.array(X), np.array(y)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    n = len(targets)
    return -np.sum(np.log(probs[range(n), targets] + 1e-9)) / n

class TinyMLP:
    """A 3-layer MLP for classification."""

    def __init__(self, dims):
        self.weights = []
        self.biases = []
        for i in range(len(dims) - 1):
            w = np.random.randn(dims[i], dims[i+1]) * np.sqrt(2.0 / dims[i])
            b = np.zeros(dims[i+1])
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, x):
        self.activations = [x]
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ w + b
            if i < len(self.weights) - 1:
                x = np.maximum(0, x)  # ReLU
            self.activations.append(x)
        return softmax(x)

    def total_params(self):
        return sum(w.size + b.size for w, b in zip(self.weights, self.biases))

# Phase 1: Train a base network on the original spirals
X_base, y_base = make_spirals(rotation=0.0)
base_net = TinyMLP([2, 64, 64, 3])  # 2→64→64→3
lr = 0.01
for epoch in range(500):
    probs = base_net.forward(X_base)
    loss = cross_entropy(probs, y_base)
    # Backprop: cross-entropy gradient at the logits, then chain backward
    n = len(y_base)
    grad = probs.copy()
    grad[range(n), y_base] -= 1
    grad /= n
    for i in reversed(range(len(base_net.weights))):
        a = base_net.activations[i]
        dw = a.T @ grad
        db = grad.sum(axis=0)
        if i > 0:
            grad = grad @ base_net.weights[i].T
            grad *= (base_net.activations[i] > 0)  # ReLU derivative
        base_net.weights[i] -= lr * dw
        base_net.biases[i] -= lr * db

print(f"Base network: {base_net.total_params()} params, final loss: {loss:.4f}")

# Phase 2: New task — rotated spirals. Base network struggles.
X_new, y_new = make_spirals(rotation=1.2)
probs_before = base_net.forward(X_new)
loss_before = cross_entropy(probs_before, y_new)
acc_before = np.mean(np.argmax(probs_before, axis=1) == y_new)
print(f"Base net on new task — loss: {loss_before:.4f}, accuracy: {acc_before:.1%}")

# Output:
# Base network: 4547 params, final loss: 0.0834
# Base net on new task — loss: 1.4312, accuracy: 41.3%
```
The base network learned the original spirals well but fails on the rotated version — 41% accuracy on a 3-class problem, not far above the 33% random baseline. Now let's adapt it with LoRA.
We freeze the base weights and attach LoRA adapters with rank 4. Only the small A and B matrices (plus, for simplicity in this demo, the tiny bias vectors) are updated during training:
```python
# Phase 3: LoRA adaptation — freeze base, train only the adapters
class LoRAMLP:
    """Wraps a frozen TinyMLP with LoRA adapters on each layer."""

    def __init__(self, base_net, rank=4):
        self.base_weights = [w.copy() for w in base_net.weights]
        self.biases = [b.copy() for b in base_net.biases]
        self.lora_A = []
        self.lora_B = []
        self.rank = rank
        for w in self.base_weights:
            d_in, d_out = w.shape
            A = np.random.randn(d_in, rank) * np.sqrt(2.0 / d_in)
            B = np.zeros((rank, d_out))
            self.lora_A.append(A)
            self.lora_B.append(B)

    def forward(self, x):
        self.activations = [x]
        for i in range(len(self.base_weights)):
            # Frozen base + LoRA path (alpha = rank here, so the scale is 1.0)
            x = x @ self.base_weights[i] + x @ self.lora_A[i] @ self.lora_B[i] + self.biases[i]
            if i < len(self.base_weights) - 1:
                x = np.maximum(0, x)
            self.activations.append(x)
        return softmax(x)

    def trainable_params(self):
        return sum(A.size + B.size for A, B in zip(self.lora_A, self.lora_B))

lora_net = LoRAMLP(base_net, rank=4)
lr_lora = 0.02
for epoch in range(500):
    probs = lora_net.forward(X_new)
    loss = cross_entropy(probs, y_new)
    n = len(y_new)
    grad = probs.copy()
    grad[range(n), y_new] -= 1
    grad /= n
    for i in reversed(range(len(lora_net.base_weights))):
        a = lora_net.activations[i]
        # Gradients only for the adapters (and the tiny biases);
        # the base weights never receive an update
        dB = lora_net.lora_A[i].T @ a.T @ grad  # gradient for B
        dA = a.T @ grad @ lora_net.lora_B[i].T  # gradient for A
        db = grad.sum(axis=0)
        if i > 0:
            W_eff = lora_net.base_weights[i] + lora_net.lora_A[i] @ lora_net.lora_B[i]
            grad = grad @ W_eff.T
            grad *= (lora_net.activations[i] > 0)
        lora_net.lora_A[i] -= lr_lora * dA
        lora_net.lora_B[i] -= lr_lora * dB
        lora_net.biases[i] -= lr_lora * db

probs_after = lora_net.forward(X_new)
loss_after = cross_entropy(probs_after, y_new)
acc_after = np.mean(np.argmax(probs_after, axis=1) == y_new)
print(f"\nLoRA adaptation results:")
print(f"  Trainable params: {lora_net.trainable_params()} (vs {base_net.total_params()} full)")
print(f"  Loss:     {loss_before:.4f} → {loss_after:.4f}")
print(f"  Accuracy: {acc_before:.1%} → {acc_after:.1%}")

# Output:
# LoRA adaptation results:
#   Trainable params: 1044 (vs 4547 full)
#   Loss:     1.4312 → 0.1257
#   Accuracy: 41.3% → 96.7%
```
With just 1,044 trainable parameters — 23% of the full model — LoRA adapted the frozen network from 41% to 97% accuracy. The base weights never changed. The entire adaptation lives in those tiny A and B matrices.
Key insight: LoRA doesn't just approximate fine-tuning — on some benchmarks it actually outperforms it, because the low-rank constraint acts as a regularizer, preventing overfitting to small datasets.
And remember from our optimizers post: when we run Adam on these tiny LoRA matrices, the optimizer state (first and second moments) is proportionally tiny too. That's a big part of the memory savings.
Which Layers to Target
In a real transformer, the attention mechanism has four weight matrices: WQ, WK, WV, and WO — exactly the ones we built in our attention post. The feed-forward network adds more: typically a gate projection, an up projection, and a down projection.
Where should we put LoRA adapters? The original paper found a surprising result: adapting WQ and WV together with rank 4 outperforms adapting WQ alone with rank 8, even though both use the same parameter budget. Spreading capacity across multiple matrices captures more diverse task-specific information.
Modern practice has evolved. Research shows that targeting only attention layers underperforms full fine-tuning by 5-15%. Adding LoRA to the FFN layers as well closes the gap substantially:
| Target Layers | Params (7B, r=16) | vs Full Fine-Tuning |
|---|---|---|
| Q, V only | ~4M (0.06%) | ~90% quality |
| Q, K, V, O | ~8M (0.12%) | ~95% quality |
| All linear layers | ~20M (0.29%) | ~98% quality |
The takeaway: given a fixed parameter budget, it's better to use a low rank across many layers than a high rank on just a few. Even at 0.3% of total parameters, LoRA matches 98% of full fine-tuning quality.
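The "same parameter budget" claim from the original paper is simple arithmetic to verify: for square d×d projections, rank-4 adapters on two matrices cost exactly as much as a rank-8 adapter on one. A sketch (hidden size and depth are assumed, loosely 7B-class):

```python
d = 4096       # hidden size (assumed)
n_layers = 32  # transformer depth (assumed)

def lora_params(n_matrices, rank, d=d, layers=n_layers):
    """Trainable params for LoRA on n square d×d matrices per layer."""
    return layers * n_matrices * rank * (d + d)

qv_r4 = lora_params(n_matrices=2, rank=4)  # Q and V, rank 4
q_r8 = lora_params(n_matrices=1, rank=8)   # Q alone, rank 8
print(qv_r4, q_r8, qv_r4 == q_r8)  # → 2097152 2097152 True
```

Same budget, but the paper found the spread-out version wins — capacity across matrices beats capacity within one.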
Merging: The Zero-Cost Deployment Trick
Here's the part that makes LoRA irresistible for production. After training, you can merge the LoRA adapters back into the base weight:
W_deployed = W0 + (α/r) · B · A
It's just a matrix addition. The merged weight replaces the original, and you discard B and A entirely. At inference time, the architecture is identical to the original model — zero additional latency, zero additional memory.
```python
import numpy as np

# Simulate a trained LoRA layer
d, k, rank = 512, 512, 8
alpha = 16
W_base = np.random.randn(d, k) * 0.01
A = np.random.randn(rank, k) * 0.1   # trained values
B = np.random.randn(d, rank) * 0.05  # trained values (no longer zeros)

# Method 1: Separate computation (during training)
x = np.random.randn(1, k)
out_separate = x @ W_base.T + (alpha / rank) * (x @ A.T @ B.T)

# Method 2: Merged weight (for deployment)
W_merged = W_base + (alpha / rank) * (B @ A)
out_merged = x @ W_merged.T

# Identical up to floating-point rounding
print(f"Max difference: {np.max(np.abs(out_separate - out_merged)):.2e}")
print(f"Outputs match: {np.allclose(out_separate, out_merged)}")

# The economics
base_size_gb = 7e9 * 2 / 1e9      # 7B params in fp16
adapter_size_mb = 20e6 * 2 / 1e6  # 20M LoRA params in fp16
print(f"\nBase model: {base_size_gb:.0f} GB")
print(f"One adapter: {adapter_size_mb:.0f} MB")
print(f"10 variants:")
print(f"  Full FT:       {base_size_gb * 10:.0f} GB")
print(f"  LoRA adapters: {base_size_gb + adapter_size_mb * 10 / 1000:.1f} GB")

# Output (the max difference is floating-point noise, on the order of 1e-16):
# Outputs match: True
#
# Base model: 14 GB
# One adapter: 40 MB
# 10 variants:
#   Full FT:       140 GB
#   LoRA adapters: 14.4 GB
```
One base model plus tiny adapter files. Store 10 specialized variants for 14.4 GB instead of 140 GB. And the best part: task switching is trivial. Subtract one adapter, add another — no model reloading needed.
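Task switching really is just arithmetic on the merged weight. A sketch with two hypothetical trained adapters: unmerge task 1, merge task 2, and confirm the result matches merging task 2 directly:

```python
import numpy as np

np.random.seed(3)
d, k, r, alpha = 64, 64, 4, 8
scale = alpha / r
W0 = np.random.randn(d, k)
# Two hypothetical trained adapters for two tasks
B1, A1 = np.random.randn(d, r), np.random.randn(r, k)
B2, A2 = np.random.randn(d, r), np.random.randn(r, k)

W_task1 = W0 + scale * (B1 @ A1)
# Switch tasks in place: subtract adapter 1, add adapter 2
W_task2 = W_task1 - scale * (B1 @ A1) + scale * (B2 @ A2)

print(np.allclose(W_task2, W0 + scale * (B2 @ A2)))  # True
```

No reload of the 14 GB base model — just two cheap rank-r matrix products on each adapted layer.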
This is why LoRA conquered production. The economics are simply unbeatable.
QLoRA: Squeezing Even Further
Even with LoRA, the frozen base model still sits in GPU memory at fp16 — that's 14 GB for a 7B model. In 2023, Dettmers and colleagues asked: what if we quantize the frozen weights to 4-bit?
The result was QLoRA. The idea is elegant: compress the base model using a specially designed 4-bit format called NormalFloat (NF4), which places its 16 quantization levels at points optimized for the Gaussian distribution that neural network weights tend to follow. Values near zero — where most weights cluster — get finer resolution.
The computational flow works like this:
- Base weights are stored in 4-bit (3.7 GB for a 7B model)
- For each forward/backward pass, weights are decompressed to fp16 on the fly
- Gradients flow through the fp16 computation into the LoRA adapters
- Only A and B are updated (in fp16)
- Base model stays frozen in 4-bit
There's no speed benefit — the math still happens in fp16. But the memory savings are massive. Standard LoRA on a 7B model needs about 21 GB; QLoRA brings that down to roughly 7 GB. This is what enabled fine-tuning 65B-parameter models on a single 48 GB GPU, and what makes fine-tuning accessible on consumer hardware.
The quantization implementation itself is complex — it deserves its own post. But the conceptual picture is clear: quantize the parts you don't touch, keep full precision for the parts you train.
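The shape of the idea still fits in a few lines. A simplified sketch — not the paper's exact NF4 codebook, which fixes its levels at quantiles of a standard normal and adds blockwise scaling plus double quantization — that places 16 levels at quantiles of the weight distribution, stores each weight as a 4-bit index, and dequantizes by table lookup:

```python
import numpy as np

np.random.seed(4)
w = np.random.randn(4096) * 0.02  # a frozen weight tensor (roughly Gaussian)

# 16 levels at quantiles of the distribution: values near zero,
# where most weights cluster, get finer resolution
probs = (np.arange(16) + 0.5) / 16
levels = np.quantile(w, probs)

# Quantize: each weight becomes the 4-bit index of its nearest level
idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

# Dequantize: pure lookup, done on the fly during forward/backward
w_hat = levels[idx]

err = np.abs(w - w_hat).max()
print(f"max abs error: {err:.4f} (weights span ±{np.abs(w).max():.3f})")
```

Storage drops from 16 bits to 4 bits per weight (plus a tiny lookup table), at the cost of a small reconstruction error concentrated in the rare tail values.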
Practical Hyperparameters
If you're going to use LoRA on a real project (and with libraries like HuggingFace PEFT or Unsloth, you should), here's a cheat sheet of what works:
| Parameter | Recommended | Notes |
|---|---|---|
| Rank (r) | 16 or 32 | Higher for complex tasks; diminishing returns past 64 |
| Alpha (α) | Equal to r or 2×r | Higher = more aggressive adaptation |
| Target modules | All linear layers | Q,V only for quick experiments |
| Learning rate | 2e-4 | 1e-4 for models above 33B |
| Optimizer | AdamW | weight_decay=0.01 |
| Epochs | 1–3 | More risks overfitting |
| Warmup | 5–10% of steps | Critical for training stability |
| LoRA dropout | 0.05–0.1 | Optional regularization on LoRA outputs |
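The table above maps roughly onto a HuggingFace PEFT configuration like the following sketch. The keyword names follow PEFT's `LoraConfig`; the module names (`q_proj`, etc.) are Llama-style and vary by architecture, so check your model's layer names:

```python
# Hedged sketch of a PEFT LoRA setup — adjust target_modules per model.
from peft import LoraConfig

config = LoraConfig(
    r=16,                # rank
    lora_alpha=32,       # alpha = 2×r
    lora_dropout=0.05,   # optional regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    bias="none",
    task_type="CAUSAL_LM",
)
# Then: model = get_peft_model(base_model, config)
```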
Common mistakes to avoid:
- Rank too high (>256) — can cause convergence instability and loses the regularization benefit
- Too many epochs — LoRA overfits faster than full fine-tuning because there are fewer parameters absorbing the data
- Skipping warmup — the LoRA path starts at zero and needs time to ramp up gracefully
- Wrong learning rate — LoRA typically needs a higher LR than full fine-tuning (2e-4 vs 2e-5) because fewer parameters must capture the same update
Try It: Low-Rank Matrix Approximation
Interactive: Low-Rank Approximation Explorer
See the core insight behind LoRA: a low-rank approximation (B×A) captures most of a matrix's information with far fewer parameters. Adjust the rank slider and watch the reconstruction quality.
Conclusion
LoRA's elegance comes from a single insight: weight updates during fine-tuning are low-rank. That observation leads to simple math — replace a massive ΔW with two thin matrices B and A — which yields massive practical impact. From 7 billion trainable parameters down to 20 million, a 350× reduction, with comparable quality.
We've now completed the full pipeline we've been building across this series:
tokenize → embed → position → attend → softmax → loss → optimize → decode → fine-tune
From raw text to adapted model, every piece has been built from scratch. The broader lesson is one we've seen again and again: understanding the math behind these techniques reveals not just how they work, but when they'll succeed and when they'll fail. LoRA works because weight updates are low-rank. If your task requires a genuinely high-rank update — a dramatic restructuring of what the model knows — LoRA will struggle. But for the vast majority of adaptation tasks, the low-rank assumption holds beautifully.
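That failure mode is visible with the same SVD tool from the start of the post: a genuinely high-rank update spreads its energy evenly across directions, so a rank-8 approximation captures almost none of it:

```python
import numpy as np

np.random.seed(5)
d = 256
# A "high-rank" update: pure i.i.d. noise, no dominant directions
delta_W = np.random.randn(d, d)

s = np.linalg.svd(delta_W, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print(f"rank 8 captures {energy[7]:.1%} of a full-rank update")
```

Compare that with the 96% we saw for the low-rank-ish update earlier — the low-rank assumption is doing all the work.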
Ready to try it yourself? Libraries like HuggingFace PEFT and Unsloth make LoRA a few lines of configuration. But now you know what's happening under the hood — and that understanding is what separates using a tool from mastering it.
References & Further Reading
- Hu et al. — "LoRA: Low-Rank Adaptation of Large Language Models" (2021) — the original paper that introduced the technique, published at ICLR 2022
- Aghajanyan et al. — "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" (2020) — the theoretical foundation showing that fine-tuning operates in a low-dimensional subspace
- Dettmers et al. — "QLoRA: Efficient Finetuning of Quantized LLMs" (2023) — the quantization extension that made fine-tuning accessible on consumer GPUs, published at NeurIPS 2023
- HuggingFace PEFT Documentation — the most widely used library for applying LoRA in practice
- Previous DadOps elementary posts: Attention from Scratch (Q/K/V projections where LoRA adapters attach), Optimizers from Scratch (Adam running on LoRA parameters), Loss Functions from Scratch (the same cross-entropy objective), Micrograd from Scratch (backprop that flows through the LoRA path)