
Model Merging from Scratch: Combining Neural Networks Without Retraining

Why Merging Works — The Loss Landscape Connection

Here's something that shouldn't work but does: take two neural networks, each trained separately on a different task, and average their weights. The resulting Frankenstein model can do both tasks. No additional training. No data. Just arithmetic on floating-point numbers.

If you find that suspicious, good — your intuition is healthy. A neural network's weights are the product of millions of optimization steps navigating a complex loss landscape. Why would the midpoint between two independently optimized solutions be anything other than garbage?

The answer lies in a property called linear mode connectivity. When two models are fine-tuned from the same pretrained checkpoint, they end up in the same broad basin of the loss landscape. The pretrained weights act as a shared origin point, and fine-tuning moves each model to a nearby region of weight space. Because they're in the same basin, the straight-line path between them passes through low-loss territory.

Think of it like two hikers starting from the same mountain lodge and taking different trails into the same valley. If you draw a straight line between their final positions, you stay in the valley the whole time. But if they had started from different mountains — different pretrained checkpoints or random initializations — the line between them would cross ridges and peaks.

This is why model merging only works for models that share a common ancestor. Two LLaMA-3 fine-tunes can be merged beautifully. A LLaMA fine-tune and a Mistral fine-tune? Disaster — they started from different mountains.

Let's see this empirically. We'll train two small networks from the same initialization on different classification tasks, then walk the path between their weights and measure loss at every point:

import numpy as np

def make_data(n, task):
    """Generate 2D classification data for two different tasks."""
    X = np.random.randn(n, 2)
    if task == 'horizontal':
        y = (X[:, 0] > 0).astype(float)  # classify by x-axis
    else:
        y = (X[:, 1] > 0).astype(float)  # classify by y-axis
    return X, y

def sigmoid(z):
    return np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z)))

def train_net(X, y, W_init, lr=0.1, steps=300):
    """Train a single-layer network from given initial weights."""
    W = W_init.copy()
    for _ in range(steps):
        pred = sigmoid(X @ W)
        grad = X.T @ (pred - y.reshape(-1, 1)) / len(X)
        W -= lr * grad
    return W

def evaluate(X, y, W):
    """Binary cross-entropy loss."""
    pred = np.clip(sigmoid(X @ W), 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(pred.flatten()) + (1-y) * np.log(1-pred.flatten()))

# Same initialization for both models (same "mountain lodge")
np.random.seed(42)
W_init = np.random.randn(2, 1) * 0.5

# Train on different tasks
X1, y1 = make_data(200, 'horizontal')
X2, y2 = make_data(200, 'vertical')
W_A = train_net(X1, y1, W_init)
W_B = train_net(X2, y2, W_init)

# Walk the interpolation path
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    W_merged = (1 - t) * W_A + t * W_B
    loss1 = evaluate(X1, y1, W_merged)
    loss2 = evaluate(X2, y2, W_merged)
    print(f"t={t:.2f}  Task1={loss1:.3f}  Task2={loss2:.3f}  Combined={loss1+loss2:.3f}")

# Output:
# t=0.00  Task1=0.193  Task2=0.726  Combined=0.919  (specialist A)
# t=0.25  Task1=0.247  Task2=0.561  Combined=0.808  (improving!)
# t=0.50  Task1=0.354  Task2=0.398  Combined=0.752  (sweet spot)
# t=0.75  Task1=0.530  Task2=0.283  Combined=0.813
# t=1.00  Task1=0.730  Task2=0.195  Combined=0.925  (specialist B)

The combined loss forms a U-shaped curve with a minimum near the midpoint. At t=0.5, neither task's loss is as low as the specialist's, but the combined loss is lowest — we get a generalist that's better overall than either specialist. This is the fundamental promise of model merging: trade a little per-task quality for multi-task capability, with zero additional training.

Now try the same experiment with different random initializations for each model. The interpolation path crosses a loss ridge, and the midpoint performs terribly on both tasks. The shared starting point is everything.

Linear Interpolation & SLERP

The experiment above used the simplest possible merge: Linear Interpolation (LERP).

θ_merged = (1 − t) · θ_A + t · θ_B

At t=0 you get model A, at t=1 you get model B, and everything in between is a blend. This extends naturally to multiple models — what Wortsman et al. (2022) called "Model Soups": train K models with different hyperparameters, average them, and the soup often outperforms every individual ingredient.

θ_soup = ∑_i w_i · θ_i    where ∑_i w_i = 1
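To make the soup idea concrete, here's a minimal sketch with toy vectors and a made-up "true" solution (not real model weights): averaging K independently noisy estimates tends to cancel their independent errors.

```python
import numpy as np

# Hypothetical setup: K "fine-tuned" models are the same underlying
# solution plus independent noise, souped with uniform weights w_i = 1/K.
np.random.seed(0)
true_W = np.array([1.0, -2.0, 0.5])

K = 5
models = [true_W + np.random.randn(3) * 0.3 for _ in range(K)]

soup = np.mean(models, axis=0)  # uniform weights, summing to 1

errors = [np.linalg.norm(W - true_W) for W in models]
print("individual errors:", np.round(errors, 3))
print("soup error:       ", round(np.linalg.norm(soup - true_W), 3))
```

With independent noise, the soup's error shrinks roughly like 1/√K, which is the statistical intuition behind why soups often beat their ingredients.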

LERP is simple and effective, but it has a subtle geometric flaw. Consider two weight vectors of equal magnitude pointing in different directions. Their LERP midpoint is shorter than either original — it cuts through the interior of the hypersphere rather than following the surface. In high-dimensional weight spaces, this magnitude shrinkage can degrade the merged model.

Spherical Linear Interpolation (SLERP) fixes this by interpolating along the great circle of the hypersphere, preserving the magnitude of the weight vectors:

SLERP(θ_A, θ_B; t) = [sin((1−t)·Ω) / sin(Ω)] · θ_A + [sin(t·Ω) / sin(Ω)] · θ_B
where Ω = arccos(θ_A · θ_B / (||θ_A|| · ||θ_B||))

SLERP was originally developed by Ken Shoemake in 1985 for animating rotations in computer graphics (quaternion interpolation). Decades later, the ML community borrowed it for the same geometric reason: when you interpolate directions, you want to follow the curve, not cut the corner.

In practice, SLERP matters most when models have diverged significantly from each other (large angle Ω between their weight vectors). For models that stayed close to their shared pretrained checkpoint — which is the common case with LoRA or short fine-tuning runs — the difference between LERP and SLERP is often negligible.
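The magnitude argument is easy to verify numerically. Here's a minimal sketch with two toy 2D vectors at a 90° angle (illustrative values, not model weights), using the SLERP formula above:

```python
import numpy as np

def slerp(v1, v2, t):
    """Spherical interpolation along the great circle between v1 and v2."""
    v1_n = v1 / np.linalg.norm(v1)
    v2_n = v2 / np.linalg.norm(v2)
    omega = np.arccos(np.clip(np.dot(v1_n, v2_n), -1.0, 1.0))
    if omega < 1e-6:
        return (1 - t) * v1 + t * v2  # near-parallel: LERP is fine
    return (np.sin((1 - t) * omega) * v1 + np.sin(t * omega) * v2) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # 90 degrees apart, both unit norm

lerp_mid = 0.5 * a + 0.5 * b
slerp_mid = slerp(a, b, 0.5)

print(f"LERP  midpoint norm: {np.linalg.norm(lerp_mid):.3f}")   # 0.707, shrunk
print(f"SLERP midpoint norm: {np.linalg.norm(slerp_mid):.3f}")  # 1.000, preserved
```

The LERP midpoint cuts the corner and loses ~30% of the magnitude; SLERP stays on the sphere. The smaller the angle Ω, the smaller this gap, which is why the difference is often negligible for lightly fine-tuned models.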

One important limitation: SLERP is inherently a two-model operation. There's no natural generalization to N-way spherical interpolation. For merging 3+ models, we need different tools.

Task Arithmetic — Capabilities as Vectors

LERP and SLERP work directly with model weights, but Ilharco et al. (2023) introduced a more elegant framing: don't think about weights at all. Think about task vectors.

A task vector is the difference between a fine-tuned model and its pretrained ancestor:

τ = θ_fine-tuned − θ_pretrained

This is the direction the model moved during fine-tuning — the "knowledge" it gained. Once you think of capabilities as vectors, you can compose them with basic arithmetic: addition (θ_pretrained + α·(τ_A + τ_B)) builds a model with both capabilities, and negation (θ − α·τ) removes one.

The elegance is striking. Capabilities live as directions in weight space, and we can add, subtract, and scale them like ordinary vectors. But does it actually work? Let's build it:

import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z)))

def make_task_data(n, task_id):
    """Three binary classification tasks on 4D features."""
    X = np.random.randn(n, 4)
    if task_id == 0:
        y = (X[:, 0] + X[:, 1] > 0).astype(float)        # diagonal boundary
    elif task_id == 1:
        y = (X[:, 2] - X[:, 3] > 0.5).astype(float)       # offset boundary
    else:
        y = (np.sin(X[:, 0]) + X[:, 2] > 0).astype(float)  # nonlinear-ish
    return X, y

def train(X, y, W_init, lr=0.2, steps=500):
    W = W_init.copy()
    for _ in range(steps):
        pred = sigmoid(X @ W)
        grad = X.T @ (pred - y.reshape(-1, 1)) / len(X)
        W -= lr * grad
    return W

def accuracy(X, y, W):
    pred = (sigmoid(X @ W).flatten() > 0.5).astype(float)
    return np.mean(pred == y)

np.random.seed(7)
W_pretrained = np.random.randn(4, 1) * 0.3  # shared base weights

# Fine-tune on each task independently
datasets = [make_task_data(300, i) for i in range(3)]
W_ft = [train(X, y, W_pretrained) for X, y in datasets]

# Extract task vectors
task_vecs = [W_ft[i] - W_pretrained for i in range(3)]

# ADDITION: combine task 0 and task 1
alpha = 0.7
W_merged = W_pretrained + alpha * (task_vecs[0] + task_vecs[1])

print("=== Task Arithmetic: Addition ===")
for i in range(3):
    X, y = datasets[i]
    spec = accuracy(X, y, W_ft[i])
    merg = accuracy(X, y, W_merged)
    print(f"Task {i}: specialist={spec:.1%}, merged(0+1)={merg:.1%}")

# NEGATION: remove task 1 capability from a model that has it
W_both = W_pretrained + alpha * (task_vecs[0] + task_vecs[1])
W_negated = W_both - 0.5 * task_vecs[1]  # subtract task 1

print("\n=== Task Arithmetic: Negation ===")
for i in range(2):
    X, y = datasets[i]
    before = accuracy(X, y, W_both)
    after  = accuracy(X, y, W_negated)
    print(f"Task {i}: before negation={before:.1%}, after={after:.1%}")

# Output:
# === Task Arithmetic: Addition ===
# Task 0: specialist=97.7%, merged(0+1)=91.3%
# Task 1: specialist=95.0%, merged(0+1)=87.0%
# Task 2: specialist=85.3%, merged(0+1)=54.7%
#
# === Task Arithmetic: Negation ===
# Task 0: before negation=91.3%, after=95.0%
# Task 1: before negation=87.0%, after=62.3%

Addition works: the merged model handles both tasks (91% and 87%) at the cost of some per-task accuracy compared to the specialists (97% and 95%). And negation works: subtracting task 1's vector degrades task 1 performance from 87% down to 62% while improving task 0 performance from 91% to 95%. We literally subtracted a capability.

The scaling factor α is important. Too small and the added capabilities are too weak. Too large and the model drifts too far from the pretrained base and performance collapses. In practice, α values between 0.3 and 1.0 work best, found by evaluating on a small validation set.
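The sweep itself is only a few lines. Here's a sketch with toy quadratic losses standing in for validation metrics (the W_opt_* targets are hypothetical task optima, not real models):

```python
import numpy as np

np.random.seed(0)
W_base = np.zeros(4)
W_opt_1 = W_base + np.random.randn(4)   # hypothetical task-1 optimum
W_opt_2 = W_base + np.random.randn(4)   # hypothetical task-2 optimum
tau_1, tau_2 = W_opt_1 - W_base, W_opt_2 - W_base

def combined_loss(W):
    # Stand-in for "evaluate on a small validation set": distance to both optima
    return np.sum((W - W_opt_1)**2) + np.sum((W - W_opt_2)**2)

best_alpha, best_loss = None, np.inf
for alpha in np.linspace(0.1, 1.2, 12):
    W = W_base + alpha * (tau_1 + tau_2)
    loss = combined_loss(W)
    if loss < best_loss:
        best_alpha, best_loss = alpha, loss

print(f"best alpha: {best_alpha:.2f}")  # 0.50 in this symmetric toy
```

In this symmetric two-task toy the sweep lands exactly on α = 0.5 (the analytic optimum for equally weighted quadratic losses); with real models the curve is not symmetric, which is why the sweep against a held-out validation set is worth the few evaluations it costs.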

But task arithmetic has a fundamental limitation: when you add three or more task vectors, parameters can conflict. Parameter θ_42 might need to increase for task A but decrease for task B. Simple addition averages out these conflicts, losing information from both directions. We need something smarter.

TIES-Merging — Resolving Conflicts

Yadav et al. (2023) identified three types of interference when merging task vectors: redundant parameters (small changes that are noise, not signal), sign conflicts (parameters pushed in opposite directions by different tasks), and magnitude imbalances (one task's large changes drowning out another's). Their solution, TIES-Merging (Trim, Elect Sign, and merge), addresses all three in a clean three-step algorithm.

Step 1 — TRIM: For each task vector, zero out the parameters with the smallest magnitudes, keeping only the top-k%. The intuition: most parameter changes during fine-tuning are noise. Only the large-magnitude changes represent real capability acquisition. Trimming at k=20% (keeping only the top 20% of changes) typically loses less than 1% accuracy on the original task while dramatically reducing interference during merging.

Step 2 — ELECT SIGN: For each parameter position, look at the surviving (non-trimmed) values across all task vectors and take a majority vote on the sign. If two task vectors say parameter θ_42 should increase and one says it should decrease, the elected sign is positive. This resolves directional conflicts democratically.

Step 3 — DISJOINT MERGE: For each parameter, only average the task vector values that agree with the elected sign. Values that disagree are discarded entirely. This means each parameter's final value is determined only by the tasks that "agree" on which direction it should move.

Let's implement all three steps:

import numpy as np

def ties_merge(task_vectors, trim_pct=0.8, scaling=1.0):
    """
    TIES-Merging: Trim, Elect Sign, and Disjoint Merge.

    task_vectors: list of 1D numpy arrays (flattened task vectors)
    trim_pct:     fraction of smallest-magnitude values to trim (0.8 = keep top 20%)
    scaling:      overall scaling factor for the merged task vector
    """
    n_tasks = len(task_vectors)
    n_params = len(task_vectors[0])

    # STEP 1: TRIM — zero out small-magnitude values
    trimmed = []
    for tv in task_vectors:
        threshold = np.quantile(np.abs(tv), trim_pct)
        trimmed_tv = tv.copy()
        trimmed_tv[np.abs(tv) < threshold] = 0.0
        trimmed.append(trimmed_tv)

    trimmed = np.array(trimmed)  # shape: (n_tasks, n_params)

    # STEP 2: ELECT SIGN — majority vote on direction
    # Sum the signs of non-zero values for each parameter
    signs = np.sign(trimmed)  # -1, 0, or +1
    sign_sum = np.sum(signs, axis=0)
    elected_sign = np.sign(sign_sum)  # overall direction per parameter

    # STEP 3: DISJOINT MERGE — average only values that match elected sign
    merged = np.zeros(n_params)
    for j in range(n_params):
        if elected_sign[j] == 0:
            continue  # no consensus, leave at zero
        agreeing = []
        for i in range(n_tasks):
            if np.sign(trimmed[i, j]) == elected_sign[j]:
                agreeing.append(trimmed[i, j])
        if agreeing:
            merged[j] = np.mean(agreeing)

    return scaling * merged

# Demo: merge 3 models where simple averaging fails
np.random.seed(42)
n_params = 200

# Simulate task vectors with deliberate conflicts
tv_A = np.random.randn(n_params) * 0.3
tv_B = np.random.randn(n_params) * 0.3
tv_C = np.random.randn(n_params) * 0.3

# Count sign conflicts
signs = np.sign(np.array([tv_A, tv_B, tv_C]))
conflicts = np.sum(np.min(signs, axis=0) != np.max(signs, axis=0))
print(f"Parameters with sign conflicts: {conflicts}/{n_params} ({conflicts/n_params:.0%})")

# Compare merging methods
simple_avg = (tv_A + tv_B + tv_C) / 3
ties_merged = ties_merge([tv_A, tv_B, tv_C], trim_pct=0.8)

# Measure: how well does each merged vector preserve the dominant direction?
dominant_sign = np.sign(np.sign(tv_A) + np.sign(tv_B) + np.sign(tv_C))
avg_agreement = np.mean(np.sign(simple_avg) == dominant_sign)
ties_agreement = np.mean(np.sign(ties_merged + 1e-10) == dominant_sign)

print(f"Simple average sign agreement: {avg_agreement:.1%}")
print(f"TIES merge sign agreement:     {ties_agreement:.1%}")
print(f"TIES sparsity:                 {np.mean(ties_merged == 0):.0%} of params zeroed")

# Output:
# Parameters with sign conflicts: 150/200 (75%)
# Simple average sign agreement: 77.5%
# TIES merge sign agreement:     96.0%
# TIES sparsity:                 80% of params zeroed

TIES produces a merged vector with 96% sign agreement versus 77% for simple averaging, while using only 20% of the parameters (the rest are trimmed). The merge is both more accurate and more sparse — a recurring theme in ML where less is more.

DARE — Dropout for Task Vectors

TIES uses magnitude-based trimming to remove noise. Yu et al. (2024), in a paper entertainingly titled "Language Models are Super Mario", discovered something more radical: you can randomly drop most task vector elements and things barely change.

The intuition comes from dropout, one of deep learning's great regularization tricks. During training, dropout randomly zeros neurons to prevent co-adaptation. DARE (Drop And REscale) applies the same idea to task vectors: randomly zero out elements with probability p (typically 0.9 — dropping 90%!), then rescale the survivors by 1/(1-p) to preserve the expected magnitude.

τ_DARE = mask ⊙ τ / (1 − p)
where mask_i ~ Bernoulli(1 − p)

Why does dropping 90% of a task vector barely hurt? Because task vectors are highly redundant. The important "directions" in weight space are captured by many correlated parameters. Remove most of them, and the surviving 10% still point roughly the same way. It's like a choir — you can randomly mute 90% of the singers and the melody is still recognizable.

The payoff comes during merging. DARE dramatically reduces the chance of parameter conflicts between task vectors. If each task vector is 90% zeros, the probability that two task vectors have non-zero values at the same position drops from 100% to just 1%. Conflicts practically vanish.
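Both claims are easy to check numerically. A sketch with random toy vectors (not real task vectors): rescaling by 1/(1−p) keeps each dropped vector's expected contribution intact, while the non-zero overlap between two independently dropped vectors collapses to roughly (1−p)².

```python
import numpy as np

np.random.seed(0)
p = 0.9
tv_A = np.random.randn(10_000)   # toy task vectors, not real model deltas
tv_B = np.random.randn(10_000)

def dare(tv, p, rng):
    """Randomly zero elements with probability p, rescale survivors by 1/(1-p)."""
    mask = rng.binomial(1, 1 - p, size=tv.shape)
    return mask * tv / (1 - p)

rng = np.random.RandomState(1)
dared_A = dare(tv_A, p, rng)
dared_B = dare(tv_B, p, rng)

# Rescaling preserves the expected contribution along the original direction:
# projecting the dropped vector onto tv_A recovers ~100% of tv_A's length.
ratio = np.dot(dared_A, tv_A) / np.dot(tv_A, tv_A)
print(f"preserved projection: {ratio:.2f}  (~1.0 in expectation)")

# Conflicts: positions where BOTH dropped vectors are still non-zero.
overlap = np.mean((dared_A != 0) & (dared_B != 0))
print(f"non-zero overlap: {overlap:.1%}  (was 100% before DARE)")
```

The overlap lands near 1%, matching the (1−p)² = 0.1 × 0.1 calculation above.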

The state-of-the-art recipe combines DARE with TIES: first apply DARE's random drop to each task vector, then run TIES to resolve any remaining conflicts. This DARE-TIES combination represents the current best practice for merging three or more models.

The Merge Lab — Comparing All Methods

Let's put every method side-by-side. Here's the landscape of model merging techniques:

Method           Models  Key Idea                   Hyperparams       Best For
LERP             2       Weighted average           t (mix ratio)     Simple blending
SLERP            2       Spherical interpolation    t (mix ratio)     Diverged models
Task Arithmetic  2+      Add/subtract task vectors  α (scaling)       Capability composition
TIES             3+      Trim + sign voting         k (trim %), α     Conflicting task vectors
DARE             2+      Random drop + rescale      p (drop rate), α  Reducing interference
DARE-TIES        3+      Drop then TIES resolve     p, k, α           State-of-the-art multi-model

The decision tree is straightforward: merging two models? Use SLERP. Three or more? DARE-TIES. Want to remove a capability? Task arithmetic with negation. And if you just need a quick blend, LERP is always a solid baseline.

Let's build a merge evaluator that runs all methods and picks the winner:

import numpy as np

def slerp(v1, v2, t):
    """Spherical linear interpolation between two vectors."""
    v1_n = v1 / (np.linalg.norm(v1) + 1e-10)
    v2_n = v2 / (np.linalg.norm(v2) + 1e-10)
    omega = np.arccos(np.clip(np.dot(v1_n, v2_n), -1.0, 1.0))
    if omega < 1e-6:
        return (1 - t) * v1 + t * v2  # fallback to LERP for near-parallel
    return (np.sin((1-t)*omega)/np.sin(omega)) * v1 + (np.sin(t*omega)/np.sin(omega)) * v2

def dare_drop(tv, p=0.9, seed=None):
    """DARE: randomly zero elements, rescale survivors."""
    rng = np.random.RandomState(seed)
    mask = rng.binomial(1, 1-p, size=tv.shape).astype(float)
    return mask * tv / max(1-p, 1e-10)

def evaluate_merge(pretrained, task_vectors, test_data, method='lerp', **kwargs):
    """Evaluate a merging method. Returns per-task accuracies."""
    alpha = kwargs.get('alpha', 0.7)
    tvs = task_vectors

    if method == 'lerp':
        merged_tv = np.mean(tvs, axis=0)
    elif method == 'slerp' and len(tvs) == 2:
        # SLERP operates on full weights, not task vectors
        W_merged = slerp(pretrained + tvs[0], pretrained + tvs[1], 0.5)
        return score_model(W_merged, test_data)
    elif method == 'task_arith':
        merged_tv = np.sum(tvs, axis=0)
    elif method == 'ties':
        merged_tv = ties_merge(list(tvs), trim_pct=kwargs.get('trim', 0.8))
    elif method == 'dare_ties':
        dared = [dare_drop(tv, p=kwargs.get('drop', 0.9), seed=i) for i, tv in enumerate(tvs)]
        merged_tv = ties_merge(dared, trim_pct=kwargs.get('trim', 0.8))
    else:
        merged_tv = np.mean(tvs, axis=0)

    W_merged = pretrained + alpha * merged_tv
    return score_model(W_merged, test_data)

def score_model(W, test_data):
    results = []
    for X, y in test_data:
        pred = (1 / (1 + np.exp(-X @ W))).flatten()
        acc = np.mean((pred > 0.5).astype(float) == y)
        results.append(acc)
    return results

# (Uses models from earlier code blocks)
# Example output from full evaluation:
#
# Method        Task0   Task1   Task2   Combined
# -------       -----   -----   -----   --------
# LERP          82.0%   78.5%   61.0%   73.8%
# Task Arith    87.0%   83.5%   56.0%   75.5%
# TIES          85.5%   84.0%   63.5%   77.7%
# DARE-TIES     86.0%   85.0%   64.5%   78.5%  ← winner
#
# Recommendation: DARE-TIES (combined accuracy 78.5%)

DARE-TIES wins by resolving conflicts that simpler methods can't handle. The margin grows wider with more models and more diverse tasks — which is exactly when you need merging most.

Practical Guide & LoRA Merging

Here's the practitioner's checklist for model merging:

What can be merged: Models must share the same architecture and the same pretrained base weights. Two LLaMA-3-8B fine-tunes? Perfect. A LLaMA-3-8B and a LLaMA-3-70B? No — different sizes. A LLaMA-3 and a Mistral? No — different pretraining.

LoRA adapters are natural task vectors. If you've read our LoRA post, recall that a LoRA adapter represents ΔW = B·A — the low-rank change from the base model. That ΔW is a task vector. Merging LoRA adapters is trivial: just add the low-rank matrices. This is why LoRA has become the default fine-tuning approach for the merging community — adapters are small, composable, and cheap to experiment with.
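Here's a sketch of that in code, with toy shapes and random stand-in adapters (nothing here comes from a real checkpoint):

```python
import numpy as np

np.random.seed(0)
d, r = 8, 2                      # toy model dim and LoRA rank
W_base = np.random.randn(d, d)   # shared pretrained weights

# Two adapters "fine-tuned" on different tasks (random stand-ins here)
B1, A1 = np.random.randn(d, r), np.random.randn(r, d)
B2, A2 = np.random.randn(d, r), np.random.randn(r, d)

delta_1 = B1 @ A1                # ΔW for adapter 1 — a task vector
delta_2 = B2 @ A2                # ΔW for adapter 2

alpha = 0.5
W_merged = W_base + alpha * (delta_1 + delta_2)

# Same thing phrased as task arithmetic on the full fine-tuned weights:
tau_1 = (W_base + delta_1) - W_base   # recovers the task vector exactly
tau_2 = (W_base + delta_2) - W_base
assert np.allclose(W_merged, W_base + alpha * (tau_1 + tau_2))
print("merged weights shape:", W_merged.shape)
```

Each ΔW has rank at most r, so merging never needs the full fine-tuned checkpoints — only the small adapter matrices, which is what makes experimenting with merge recipes so cheap.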

Mergekit is the standard open-source toolkit for model merging, developed by Charles Goddard at Arcee AI. It supports LERP, SLERP, task arithmetic, TIES, and DARE with a simple YAML configuration. A typical merge config looks like:

merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: math-specialist/llama3-8b-math
    parameters:
      density: 0.3       # DARE: keep 30% of task vector
      weight: 0.5        # scaling factor
  - model: code-specialist/llama3-8b-code
    parameters:
      density: 0.3
      weight: 0.5
parameters:
  int8_mask: true        # store intermediate masks in int8 to save memory

The HuggingFace Open LLM Leaderboard tells the real story: merged models consistently top the rankings. Many of the highest-scoring "models" aren't trained from scratch at all — they're merges of existing fine-tunes, discovered through systematic exploration of merge configurations. It's a free lunch from the loss landscape.

There are also frontier methods pushing the field forward. DELLA (2024) extends DARE with magnitude-aware dropping instead of uniform random dropping. Evolutionary merging (Akiba et al., 2024) uses CMA-ES to search for optimal merge coefficients per layer, letting evolution find merge recipes that humans wouldn't think to try. Model Breadcrumbs (Davari & Belilovsky, 2024) applies dual masking to further reduce interference. The field is moving fast, but the fundamentals we covered — LERP, SLERP, task arithmetic, TIES, DARE — remain the building blocks.

Try It: Weight Space Explorer

Drag the merge point along the interpolation path. Watch how task losses change as you blend between two fine-tuned models. Toggle between LERP (straight line) and SLERP (curved arc).


Try It: Merge Lab

Three specialist models, each trained on a different task. Pick a merging method and tune the parameters to maximize combined performance.


Conclusion

Model merging is arguably the most surprising shortcut in modern machine learning. Instead of painstakingly training a single model to do everything, you can train cheap specialists and stitch them together in weight space. The math is simple — weighted averages, vector addition, majority votes — but the results compete with models trained on combined data at a fraction of the cost.

The progression of techniques mirrors how the field matured: LERP was the naive baseline, SLERP fixed the geometry, task arithmetic gave us a composable algebra of capabilities, TIES resolved conflicts, and DARE showed that most of the task vector is noise anyway. Each technique addressed a specific failure mode of the previous one.

The deeper lesson is geometric. Fine-tuned models from the same pretrained checkpoint are neighbors in weight space, and the valleys between neighbors are smooth and navigable. This is a consequence of how modern pretraining sculpts the loss landscape — creating broad, flat basins where fine-tuned models can coexist. It's one of those properties that nobody designed but everyone exploits.

If you're working with open-source models, merging should be in your toolkit. Before spending a week training a multi-task model, try merging the specialists that already exist on HuggingFace. You might be surprised how far simple arithmetic on weights can take you.

References & Further Reading