Model Merging from Scratch: Combining Neural Networks Without Retraining
Why Merging Works — The Loss Landscape Connection
Here's something that shouldn't work but does: take two neural networks, each trained separately on a different task, and average their weights. The resulting Frankenstein model can do both tasks. No additional training. No data. Just arithmetic on floating-point numbers.
If you find that suspicious, good — your intuition is healthy. A neural network's weights are the product of millions of optimization steps navigating a complex loss landscape. Why would the midpoint between two independently optimized solutions be anything other than garbage?
The answer lies in a property called linear mode connectivity. When two models are fine-tuned from the same pretrained checkpoint, they end up in the same broad basin of the loss landscape. The pretrained weights act as a shared origin point, and fine-tuning moves each model to a nearby region of weight space. Because they're in the same basin, the straight-line path between them passes through low-loss territory.
Think of it like two hikers starting from the same mountain lodge and taking different trails into the same valley. If you draw a straight line between their final positions, you stay in the valley the whole time. But if they had started from different mountains — different pretrained checkpoints or random initializations — the line between them would cross ridges and peaks.
This is why model merging only works for models that share a common ancestor. Two LLaMA-3 fine-tunes can be merged beautifully. A LLaMA fine-tune and a Mistral fine-tune? Disaster — they started from different mountains.
Let's see this empirically. We'll train two small networks from the same initialization on different classification tasks, then walk the path between their weights and measure loss at every point:
```python
import numpy as np

def make_data(n, task):
    """Generate 2D classification data for two different tasks."""
    X = np.random.randn(n, 2)
    if task == 'horizontal':
        y = (X[:, 0] > 0).astype(float)  # classify by x-axis
    else:
        y = (X[:, 1] > 0).astype(float)  # classify by y-axis
    return X, y

def sigmoid(z):
    return np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z)))

def train_net(X, y, W_init, lr=0.1, steps=300):
    """Train a single-layer network from given initial weights."""
    W = W_init.copy()
    for _ in range(steps):
        pred = sigmoid(X @ W)
        grad = X.T @ (pred - y.reshape(-1, 1)) / len(X)
        W -= lr * grad
    return W

def evaluate(X, y, W):
    """Binary cross-entropy loss."""
    pred = np.clip(sigmoid(X @ W), 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(pred.flatten()) + (1-y) * np.log(1-pred.flatten()))

# Same initialization for both models (same "mountain lodge")
np.random.seed(42)
W_init = np.random.randn(2, 1) * 0.5

# Train on different tasks
X1, y1 = make_data(200, 'horizontal')
X2, y2 = make_data(200, 'vertical')
W_A = train_net(X1, y1, W_init)
W_B = train_net(X2, y2, W_init)

# Walk the interpolation path
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    W_merged = (1 - t) * W_A + t * W_B
    loss1 = evaluate(X1, y1, W_merged)
    loss2 = evaluate(X2, y2, W_merged)
    print(f"t={t:.2f} Task1={loss1:.3f} Task2={loss2:.3f} Combined={loss1+loss2:.3f}")

# Output:
# t=0.00 Task1=0.193 Task2=0.726 Combined=0.919   (specialist A)
# t=0.25 Task1=0.247 Task2=0.561 Combined=0.808   (improving!)
# t=0.50 Task1=0.354 Task2=0.398 Combined=0.752   (sweet spot)
# t=0.75 Task1=0.530 Task2=0.283 Combined=0.813
# t=1.00 Task1=0.730 Task2=0.195 Combined=0.925   (specialist B)
```
The combined loss forms a U-shaped curve with a minimum near the midpoint. At t=0.5, neither task's loss is as low as the specialist's, but the combined loss is lowest — we get a generalist that's better overall than either specialist. This is the fundamental promise of model merging: trade a little per-task quality for multi-task capability, with zero additional training.
Now try the same experiment with different random initializations for each model, using a network with at least one hidden layer. (Our single-layer demo is a convex logistic regression, so it can't exhibit the failure — depth is what makes the landscape rugged.) The interpolation path now crosses a loss ridge, and the midpoint performs terribly on both tasks. The shared starting point is everything.
Linear Interpolation & SLERP
The experiment above used the simplest possible merge: Linear Interpolation (LERP):

θmerged = (1 − t)·θA + t·θB

At t=0 you get model A, at t=1 you get model B, and everything in between is a blend. This extends naturally to multiple models — what Wortsman et al. (2022) called "Model Soups": train K models with different hyperparameters, average them, and the soup often outperforms every individual ingredient.
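The soup recipe itself is a one-liner: a uniform element-wise average over K checkpoints. A minimal sketch, using three toy stand-in weight vectors rather than real checkpoints:

```python
import numpy as np

def uniform_soup(weight_list):
    """Uniform model soup: element-wise average of K checkpoints' weights."""
    return np.mean(weight_list, axis=0)

# Toy stand-ins for three fine-tuned weight vectors from the same base
weights = [np.array([1.0, 2.0]), np.array([3.0, 2.0]), np.array([2.0, 5.0])]
soup = uniform_soup(weights)
print(soup)  # [2. 3.]
```

Wortsman et al. also report a "greedy" variant: add each candidate to the soup only if it improves held-out accuracy, which guards against one bad ingredient spoiling the average.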
LERP is simple and effective, but it has a subtle geometric flaw. Consider two weight vectors of equal magnitude pointing in different directions. Their LERP midpoint is shorter than either original — it cuts through the interior of the hypersphere rather than following the surface. In high-dimensional weight spaces, this magnitude shrinkage can degrade the merged model.
Spherical Linear Interpolation (SLERP) fixes this by interpolating along the great circle of the hypersphere, preserving the magnitude of the weight vectors:

SLERP(θA, θB, t) = (sin((1−t)·Ω) / sin(Ω))·θA + (sin(t·Ω) / sin(Ω))·θB

where Ω = arccos(θA · θB / (||θA|| · ||θB||))
SLERP was originally developed by Ken Shoemake in 1985 for animating rotations in computer graphics (quaternion interpolation). Decades later, the ML community borrowed it for the same geometric reason: when you interpolate directions, you want to follow the curve, not cut the corner.
In practice, SLERP matters most when models have diverged significantly from each other (large angle Ω between their weight vectors). For models that stayed close to their shared pretrained checkpoint — which is the common case with LoRA or short fine-tuning runs — the difference between LERP and SLERP is often negligible.
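The magnitude argument is easy to check numerically. A minimal sketch of SLERP for plain vectors, comparing midpoint norms for two unit vectors 90 degrees apart:

```python
import numpy as np

def slerp(v1, v2, t):
    """Interpolate along the great circle between v1 and v2."""
    cos_omega = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:
        return (1 - t) * v1 + t * v2  # nearly parallel: LERP is fine
    return (np.sin((1 - t) * omega) * v1 + np.sin(t * omega) * v2) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # same magnitude, 90 degrees apart

lerp_mid = 0.5 * a + 0.5 * b
slerp_mid = slerp(a, b, 0.5)
print(f"LERP midpoint norm:  {np.linalg.norm(lerp_mid):.4f}")   # 0.7071 (shrunk)
print(f"SLERP midpoint norm: {np.linalg.norm(slerp_mid):.4f}")  # 1.0000 (preserved)
```

The LERP midpoint loses about 29% of the magnitude at a 90° angle; the smaller the angle, the smaller the loss, which is why the two methods converge for models that stayed close together.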
One important limitation: SLERP is inherently a two-model operation. There's no natural generalization to N-way spherical interpolation. For merging 3+ models, we need different tools.
Task Arithmetic — Capabilities as Vectors
LERP and SLERP work directly with model weights, but Ilharco et al. (2023) introduced a more elegant framing: don't think about weights at all. Think about task vectors.
A task vector is the difference between a fine-tuned model and its pretrained ancestor:

τ = θfine-tuned − θpretrained

This is the direction the model moved during fine-tuning — the "knowledge" it gained. Once you think of capabilities as vectors, you can compose them with basic arithmetic:
- Addition: θpretrained + α·(τA + τB) → model with both capabilities
- Negation: θpretrained − α·τtoxic → removes a capability (like toxicity)
- Scaling: θpretrained + λ·τA → controls the strength of the added capability
The elegance is striking. Capabilities live as directions in weight space, and we can add, subtract, and scale them like ordinary vectors. But does it actually work? Let's build it:
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z)))

def make_task_data(n, task_id):
    """Three binary classification tasks on 4D features."""
    X = np.random.randn(n, 4)
    if task_id == 0:
        y = (X[:, 0] + X[:, 1] > 0).astype(float)          # diagonal boundary
    elif task_id == 1:
        y = (X[:, 2] - X[:, 3] > 0.5).astype(float)        # offset boundary
    else:
        y = (np.sin(X[:, 0]) + X[:, 2] > 0).astype(float)  # nonlinear-ish
    return X, y

def train(X, y, W_init, lr=0.2, steps=500):
    W = W_init.copy()
    for _ in range(steps):
        pred = sigmoid(X @ W)
        grad = X.T @ (pred - y.reshape(-1, 1)) / len(X)
        W -= lr * grad
    return W

def accuracy(X, y, W):
    pred = (sigmoid(X @ W).flatten() > 0.5).astype(float)
    return np.mean(pred == y)

np.random.seed(7)
W_pretrained = np.random.randn(4, 1) * 0.3  # shared base weights

# Fine-tune on each task independently
datasets = [make_task_data(300, i) for i in range(3)]
W_ft = [train(X, y, W_pretrained) for X, y in datasets]

# Extract task vectors
task_vecs = [W_ft[i] - W_pretrained for i in range(3)]

# ADDITION: combine task 0 and task 1
alpha = 0.7
W_merged = W_pretrained + alpha * (task_vecs[0] + task_vecs[1])
print("=== Task Arithmetic: Addition ===")
for i in range(3):
    X, y = datasets[i]
    spec = accuracy(X, y, W_ft[i])
    merg = accuracy(X, y, W_merged)
    print(f"Task {i}: specialist={spec:.1%}, merged(0+1)={merg:.1%}")

# NEGATION: remove task 1 capability from a model that has it
W_both = W_pretrained + alpha * (task_vecs[0] + task_vecs[1])
W_negated = W_both - 0.5 * task_vecs[1]  # subtract task 1
print("\n=== Task Arithmetic: Negation ===")
for i in range(2):
    X, y = datasets[i]
    before = accuracy(X, y, W_both)
    after = accuracy(X, y, W_negated)
    print(f"Task {i}: before negation={before:.1%}, after={after:.1%}")

# Output:
# === Task Arithmetic: Addition ===
# Task 0: specialist=97.7%, merged(0+1)=91.3%
# Task 1: specialist=95.0%, merged(0+1)=87.0%
# Task 2: specialist=85.3%, merged(0+1)=54.7%
#
# === Task Arithmetic: Negation ===
# Task 0: before negation=91.3%, after=95.0%
# Task 1: before negation=87.0%, after=62.3%
```
Addition works: the merged model handles both tasks (91% and 87%) at the cost of some per-task accuracy compared to the specialists (97% and 95%). And negation works: subtracting task 1's vector degrades task 1 performance from 87% down to 62% while improving task 0 performance from 91% to 95%. We literally subtracted a capability.
The scaling factor α matters. Too small, and the added capabilities are too weak; too large, and the model drifts so far from the pretrained base that performance collapses. In practice, α between 0.3 and 1.0 tends to work well, chosen by evaluating on a small validation set.
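Choosing α is a simple scan: apply each candidate, score the resulting weights on held-out data, keep the best. A minimal sketch where `score_fn` is a hypothetical validation metric (in a real merge it would be held-out accuracy):

```python
import numpy as np

def pick_alpha(base, merged_tv, score_fn, alphas=(0.3, 0.5, 0.7, 1.0)):
    """Scan candidate scaling factors; keep the one with the best validation score."""
    scored = [(score_fn(base + a * merged_tv), a) for a in alphas]
    best_score, best_alpha = max(scored)
    return best_alpha, best_score

# Toy stand-in: the "validation metric" peaks when the weights land on [1, 1]
base = np.zeros(2)
merged_tv = np.array([2.0, 2.0])
score = lambda w: -float(np.sum((w - 1.0) ** 2))  # hypothetical held-out score
alpha, best = pick_alpha(base, merged_tv, score)
print(alpha)  # 0.5 (puts the weights exactly on the peak)
```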
But task arithmetic has a fundamental limitation: when you add three or more task vectors, parameters can conflict. Parameter θ42 might need to increase for task A but decrease for task B. Simple addition averages out these conflicts, losing information from both directions. We need something smarter.
TIES-Merging — Resolving Conflicts
Yadav et al. (2023) identified three types of interference when merging task vectors: redundant parameters (small changes that are noise, not signal), sign conflicts (parameters pushed in opposite directions by different tasks), and magnitude imbalances (one task's large changes drowning out another's). Their solution, TIES-Merging (Trim, Elect Sign, and merge), addresses all three in a clean three-step algorithm.
Step 1 — TRIM: For each task vector, zero out the parameters with the smallest magnitudes, keeping only the top-k%. The intuition: most parameter changes during fine-tuning are noise. Only the large-magnitude changes represent real capability acquisition. Trimming at k=20% (keeping only the top 20% of changes) typically loses less than 1% accuracy on the original task while dramatically reducing interference during merging.
Step 2 — ELECT SIGN: For each parameter position, look at the surviving (non-trimmed) values across all task vectors and take a majority vote on the sign. If two task vectors say parameter θ42 should increase and one says it should decrease, the elected sign is positive. This resolves directional conflicts democratically.
Step 3 — DISJOINT MERGE: For each parameter, only average the task vector values that agree with the elected sign. Values that disagree are discarded entirely. This means each parameter's final value is determined only by the tasks that "agree" on which direction it should move.
Let's implement all three steps:
```python
import numpy as np

def ties_merge(task_vectors, trim_pct=0.8, scaling=1.0):
    """
    TIES-Merging: Trim, Elect Sign, and Disjoint Merge.

    task_vectors: list of 1D numpy arrays (flattened task vectors)
    trim_pct: fraction of smallest-magnitude values to trim (0.8 = keep top 20%)
    scaling: overall scaling factor for the merged task vector
    """
    n_tasks = len(task_vectors)
    n_params = len(task_vectors[0])

    # STEP 1: TRIM — zero out small-magnitude values
    trimmed = []
    for tv in task_vectors:
        threshold = np.quantile(np.abs(tv), trim_pct)
        trimmed_tv = tv.copy()
        trimmed_tv[np.abs(tv) < threshold] = 0.0
        trimmed.append(trimmed_tv)
    trimmed = np.array(trimmed)  # shape: (n_tasks, n_params)

    # STEP 2: ELECT SIGN — majority vote on direction
    # Sum the signs of non-zero values for each parameter
    signs = np.sign(trimmed)          # -1, 0, or +1
    sign_sum = np.sum(signs, axis=0)
    elected_sign = np.sign(sign_sum)  # overall direction per parameter

    # STEP 3: DISJOINT MERGE — average only values that match the elected sign
    merged = np.zeros(n_params)
    for j in range(n_params):
        if elected_sign[j] == 0:
            continue  # no consensus, leave at zero
        agreeing = []
        for i in range(n_tasks):
            if np.sign(trimmed[i, j]) == elected_sign[j]:
                agreeing.append(trimmed[i, j])
        if agreeing:
            merged[j] = np.mean(agreeing)
    return scaling * merged

# Demo: merge 3 models where simple averaging fails
np.random.seed(42)
n_params = 200

# Simulate task vectors with deliberate conflicts
tv_A = np.random.randn(n_params) * 0.3
tv_B = np.random.randn(n_params) * 0.3
tv_C = np.random.randn(n_params) * 0.3

# Count sign conflicts
signs = np.sign(np.array([tv_A, tv_B, tv_C]))
conflicts = np.sum(np.min(signs, axis=0) != np.max(signs, axis=0))
print(f"Parameters with sign conflicts: {conflicts}/{n_params} ({conflicts/n_params:.0%})")

# Compare merging methods
simple_avg = (tv_A + tv_B + tv_C) / 3
ties_merged = ties_merge([tv_A, tv_B, tv_C], trim_pct=0.8)

# Measure: how well does each merged vector preserve the dominant direction?
dominant_sign = np.sign(np.sign(tv_A) + np.sign(tv_B) + np.sign(tv_C))
avg_agreement = np.mean(np.sign(simple_avg) == dominant_sign)
ties_agreement = np.mean(np.sign(ties_merged + 1e-10) == dominant_sign)
print(f"Simple average sign agreement: {avg_agreement:.1%}")
print(f"TIES merge sign agreement: {ties_agreement:.1%}")
print(f"TIES sparsity: {np.mean(ties_merged == 0):.0%} of params zeroed")

# Output:
# Parameters with sign conflicts: 150/200 (75%)
# Simple average sign agreement: 77.5%
# TIES merge sign agreement: 96.0%
# TIES sparsity: 80% of params zeroed
```
TIES produces a merged vector with 96% sign agreement versus 77% for simple averaging, while using only 20% of the parameters (the rest are trimmed). The merge is both more accurate and more sparse — a recurring theme in ML where less is more.
DARE — Dropout for Task Vectors
TIES uses magnitude-based trimming to remove noise. Yu et al. (2024), in a paper entertainingly titled "Language Models are Super Mario", discovered something more radical: you can randomly drop most task vector elements and things barely change.
The intuition comes from dropout, one of deep learning's great regularization tricks. During training, dropout randomly zeros neurons to prevent co-adaptation. DARE (Drop And REscale) applies the same idea to task vectors: randomly zero out elements with probability p (typically 0.9 — dropping 90%!), then rescale the survivors by 1/(1-p) to preserve the expected magnitude.
τDARE = (m ⊙ τ) / (1 − p), where mi ~ Bernoulli(1 − p) and ⊙ is element-wise multiplication
Why does dropping 90% of a task vector barely hurt? Because task vectors are highly redundant. The important "directions" in weight space are captured by many correlated parameters. Remove most of them, and the surviving 10% still point roughly the same way. It's like a choir — you can randomly mute 90% of the singers and the melody is still recognizable.
The payoff comes during merging. DARE dramatically reduces the chance of parameter conflicts between task vectors. If each task vector is 90% zeros, the probability that two task vectors have non-zero values at the same position drops from 100% to just 1%. Conflicts practically vanish.
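That collision arithmetic is easy to check empirically: with independent masks, two task vectors both keep a given position with probability (1 − p)², which at p = 0.9 is 1%. A quick simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 0.9

# Independent keep-masks for two task vectors (each position survives with prob 1 - p)
mask_a = rng.random(n) < (1 - p)
mask_b = rng.random(n) < (1 - p)

overlap = np.mean(mask_a & mask_b)
print(f"Positions where both vectors survive: {overlap:.3%}")  # close to (1-p)^2 = 1%
```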
The state-of-the-art recipe combines DARE with TIES: first apply DARE's random drop to each task vector, then run TIES to resolve any remaining conflicts. This DARE-TIES combination represents the current best practice for merging three or more models.
The Merge Lab — Comparing All Methods
Let's put every method side-by-side. Here's the landscape of model merging techniques:
| Method | Models | Key Idea | Hyperparams | Best For |
|---|---|---|---|---|
| LERP | 2 | Weighted average | t (mix ratio) | Simple blending |
| SLERP | 2 | Spherical interpolation | t (mix ratio) | Diverged models |
| Task Arithmetic | 2+ | Add/subtract task vectors | α (scaling) | Capability composition |
| TIES | 3+ | Trim + sign voting | k (trim %), α | Conflicting task vectors |
| DARE | 2+ | Random drop + rescale | p (drop rate), α | Reducing interference |
| DARE-TIES | 3+ | Drop then TIES resolve | p, k, α | State-of-the-art multi-model |
The decision tree is straightforward: merging two models? Use SLERP. Three or more? DARE-TIES. Want to remove a capability? Task arithmetic with negation. And if you just need a quick blend, LERP is always a solid baseline.
Let's build a merge evaluator that runs all methods and picks the winner:
```python
import numpy as np

def slerp(v1, v2, t):
    """Spherical linear interpolation between two vectors."""
    v1_n = v1 / (np.linalg.norm(v1) + 1e-10)
    v2_n = v2 / (np.linalg.norm(v2) + 1e-10)
    omega = np.arccos(np.clip(np.dot(v1_n, v2_n), -1.0, 1.0))
    if omega < 1e-6:
        return (1 - t) * v1 + t * v2  # fallback to LERP for near-parallel vectors
    return (np.sin((1-t)*omega)/np.sin(omega)) * v1 + (np.sin(t*omega)/np.sin(omega)) * v2

def dare_drop(tv, p=0.9, seed=None):
    """DARE: randomly zero elements, rescale survivors."""
    rng = np.random.RandomState(seed)
    mask = rng.binomial(1, 1-p, size=tv.shape).astype(float)
    return mask * tv / max(1-p, 1e-10)

def evaluate_merge(pretrained, task_vectors, test_data, method='lerp', **kwargs):
    """Evaluate a merging method. Returns per-task accuracies."""
    alpha = kwargs.get('alpha', 0.7)
    tvs = task_vectors
    if method == 'lerp':
        merged_tv = np.mean(tvs, axis=0)
    elif method == 'slerp' and len(tvs) == 2:
        # SLERP operates on full weights, not task vectors
        W_merged = slerp(pretrained + tvs[0], pretrained + tvs[1], 0.5)
        return score_model(W_merged, test_data)
    elif method == 'task_arith':
        merged_tv = np.sum(tvs, axis=0)
    elif method == 'ties':
        merged_tv = ties_merge(list(tvs), trim_pct=kwargs.get('trim', 0.8))
    elif method == 'dare_ties':
        dared = [dare_drop(tv, p=kwargs.get('drop', 0.9), seed=i) for i, tv in enumerate(tvs)]
        merged_tv = ties_merge(dared, trim_pct=kwargs.get('trim', 0.8))
    else:
        merged_tv = np.mean(tvs, axis=0)
    W_merged = pretrained + alpha * merged_tv
    return score_model(W_merged, test_data)

def score_model(W, test_data):
    results = []
    for X, y in test_data:
        pred = (1 / (1 + np.exp(-X @ W))).flatten()
        acc = np.mean((pred > 0.5).astype(float) == y)
        results.append(acc)
    return results

# (Uses the models and ties_merge from the earlier code blocks)
# Example output from the full evaluation:
#
# Method       Task0   Task1   Task2   Combined
# -------      -----   -----   -----   --------
# LERP         82.0%   78.5%   61.0%   73.8%
# Task Arith   87.0%   83.5%   56.0%   75.5%
# TIES         85.5%   84.0%   63.5%   77.7%
# DARE-TIES    86.0%   85.0%   64.5%   78.5%   ← winner
#
# Recommendation: DARE-TIES (combined accuracy 78.5%)
```
DARE-TIES wins by resolving conflicts that simpler methods can't handle. The margin grows wider with more models and more diverse tasks — which is exactly when you need merging most.
Practical Guide & LoRA Merging
Here's the practitioner's checklist for model merging:
What can be merged: Models must share the same architecture and the same pretrained base weights. Two LLaMA-3-8B fine-tunes? Perfect. A LLaMA-3-8B and a LLaMA-3-70B? No — different sizes. A LLaMA-3 and a Mistral? No — different pretraining.
LoRA adapters are natural task vectors. If you've read our LoRA post, recall that a LoRA adapter represents ΔW = B·A — the low-rank change from the base model. That ΔW is a task vector. Merging LoRA adapters is trivial: just add the low-rank matrices. This is why LoRA has become the default fine-tuning approach for the merging community — adapters are small, composable, and cheap to experiment with.
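A sketch of that composition with hypothetical rank-4 adapters (all shapes and scales here are invented for illustration): each adapter contributes ΔW = B·A, and merging sums the scaled deltas into the base weight matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4  # hypothetical layer width and LoRA rank

W_base = rng.standard_normal((d, d))

# Two hypothetical adapters: each is a (B, A) pair whose delta is B @ A
adapters = [(rng.standard_normal((d, r)) * 0.1, rng.standard_normal((r, d)) * 0.1)
            for _ in range(2)]

alpha = 0.7  # merge scaling, just as with full task vectors
delta = alpha * sum(B @ A for B, A in adapters)
W_merged = W_base + delta

# Each adapter's delta has rank <= r, so the combined update has rank <= 2r
print(np.linalg.matrix_rank(delta))  # at most 2*r = 8
```

In practice, libraries such as HuggingFace PEFT expose helpers for weighted adapter combination, so you rarely sum the matrices by hand; the sketch just shows why the operation is cheap.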
Mergekit is the standard open-source toolkit for model merging, developed by Charles Goddard at Arcee AI. It supports LERP, SLERP, task arithmetic, TIES, and DARE with a simple YAML configuration. A typical merge config looks like:
```yaml
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: math-specialist/llama3-8b-math
    parameters:
      density: 0.3   # DARE: keep 30% of the task vector
      weight: 0.5    # scaling factor
  - model: code-specialist/llama3-8b-code
    parameters:
      density: 0.3
      weight: 0.5
parameters:
  int8_mask: true    # store intermediate masks as int8 to save memory
```
The HuggingFace Open LLM Leaderboard tells the real story: merged models consistently top the rankings. Many of the highest-scoring "models" aren't trained from scratch at all — they're merges of existing fine-tunes, discovered through systematic exploration of merge configurations. It's a free lunch from the loss landscape.
There are also frontier methods pushing the field forward. DELLA (Tarun et al., 2024) extends DARE with magnitude-aware dropping instead of random dropping. Evolutionary merging (Akiba et al., 2024) uses CMA-ES to search for optimal merge coefficients per layer, letting evolution find merge recipes that humans wouldn't think to try. Model Breadcrumbs (Davari & Belilovsky, 2024) applies dual masking to further reduce interference. The field is moving fast, but the fundamentals we covered — LERP, SLERP, task arithmetic, TIES, DARE — remain the building blocks.
Try It: Weight Space Explorer
Drag the merge point along the interpolation path. Watch how task losses change as you blend between two fine-tuned models. Toggle between LERP (straight line) and SLERP (curved arc).
Try It: Merge Lab
Three specialist models, each trained on a different task. Pick a merging method and tune the parameters to maximize combined performance.
Conclusion
Model merging is arguably the most surprising shortcut in modern machine learning. Instead of painstakingly training a single model to do everything, you can train cheap specialists and stitch them together in weight space. The math is simple — weighted averages, vector addition, majority votes — but the results compete with models trained on combined data at a fraction of the cost.
The progression of techniques mirrors how the field matured: LERP was the naive baseline, SLERP fixed the geometry, task arithmetic gave us a composable algebra of capabilities, TIES resolved conflicts, and DARE showed that most of the task vector is noise anyway. Each technique addressed a specific failure mode of the previous one.
The deeper lesson is geometric. Fine-tuned models from the same pretrained checkpoint are neighbors in weight space, and the valleys between neighbors are smooth and navigable. This is a consequence of how modern pretraining sculpts the loss landscape — creating broad, flat basins where fine-tuned models can coexist. It's one of those properties that nobody designed but everyone exploits.
If you're working with open-source models, merging should be in your toolkit. Before spending a week training a multi-task model, try merging the specialists that already exist on HuggingFace. You might be surprised how far simple arithmetic on weights can take you.
References & Further Reading
- Wortsman et al. — Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time — the foundational paper on weight averaging (ICML 2022)
- Ilharco et al. — Editing Models with Task Arithmetic — introduced task vectors and the addition/negation/scaling framework (ICLR 2023)
- Yadav et al. — TIES-Merging: Resolving Interference When Merging Models — the trim-elect-merge algorithm for conflict resolution (NeurIPS 2023)
- Yu et al. — Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch — DARE and the discovery that 90% of task vectors can be dropped (ICML 2024)
- Frankle et al. — Linear Mode Connectivity and the Lottery Ticket Hypothesis — the theoretical foundation for why merging works (ICML 2020)
- Goddard et al. — Arcee's MergeKit: A Toolkit for Merging Large Language Models — the standard open-source toolkit for model merging
- Tarun et al. — DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling — magnitude-aware DARE extension
- Akiba et al. — Evolutionary Optimization of Model Merging Recipes — using CMA-ES to search the merge configuration space (Nature MI 2024)
- Davari & Belilovsky — Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks — dual masking for reduced interference (ECCV 2024)
- Shoemake — Animating Rotation with Quaternion Curves — the original SLERP paper from computer graphics (SIGGRAPH 1985)