ML Experiment Tracking
The Experiment Tracking Problem
It's 11 PM. You've run 47 experiments today. Your best model is saved as model_final_v2_FINAL_actually_final.pkl. You can't remember which hyperparameters produced it, which version of the preprocessing pipeline you used, or whether you normalized the features before or after splitting. Welcome to the experiment tracking problem.
Every ML practitioner hits this wall. You try a learning rate of 0.001, then 0.01, then 0.003 with dropout 0.3, then 0.003 with dropout 0.5 and batch size 64 instead of 32, and somewhere in that sequence you got a great result that you now can't reproduce. The model file exists, but the recipe is lost.
There are three categories of information you need to capture for every experiment:
- Configuration — hyperparameters, data version, preprocessing steps, model architecture, random seeds, code version (git commit)
- Metrics — training loss per epoch, validation accuracy, F1 score, inference latency, memory usage — anything you'd use to judge whether this run was better than the last one
- Artifacts — model checkpoints, confusion matrices, learning curve plots, predictions on a held-out set, the actual config file used
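As a concrete sketch, a single tracked run might bundle all three categories into one record. The field names here are illustrative, not a fixed schema:

```python
# One run's record — illustrative field names, not a fixed schema
run_record = {
    "config": {
        "lr": 0.003, "batch_size": 64, "dropout": 0.3,
        "data_version": "v2.1", "seed": 42,
        "git_commit": "a1b2c3d",
    },
    "metrics": {
        "train_loss": [0.92, 0.61, 0.48],  # one value per epoch
        "val_accuracy": 0.94,
    },
    "artifacts": [
        "checkpoints/epoch_3.pt",
        "plots/learning_curve.png",
        "config_used.json",
    ],
}

print(sorted(run_record))  # → ['artifacts', 'config', 'metrics']
```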
The goal is simple: given any past result, you should be able to reproduce it exactly and understand how it was produced. Spreadsheets don't cut it — they get out of sync with your code. Jupyter notebooks don't cut it — rerunning cells in different orders gives different results. You need a system. Let's build one.
Building a Minimal Experiment Tracker
Before reaching for MLflow or Weights & Biases, let's build an experiment tracker from scratch using nothing but the Python standard library. It's under 60 lines, it's transparent (everything is human-readable JSON and CSV), and it teaches you exactly what production tools do under the hood.
import os
import json
import csv
import shutil
import subprocess
from datetime import datetime

class ExperimentTracker:
    def __init__(self, base_dir="runs"):
        timestamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
        short_id = os.urandom(3).hex()
        self.run_dir = os.path.join(base_dir, f"{timestamp}_{short_id}")
        os.makedirs(self.run_dir, exist_ok=True)
        self.metrics_file = os.path.join(self.run_dir, "metrics.csv")
        self.start_time = datetime.now()
        self._save_git_info()

    def log_params(self, params):
        """Save hyperparameters as human-readable JSON."""
        path = os.path.join(self.run_dir, "params.json")
        with open(path, "w") as f:
            json.dump(params, f, indent=2)

    def log_metric(self, name, value, step=None):
        """Append a metric to the CSV log (supports time-series)."""
        file_exists = os.path.exists(self.metrics_file)
        with open(self.metrics_file, "a", newline="") as f:
            writer = csv.writer(f)
            if not file_exists:
                writer.writerow(["step", "name", "value"])
            writer.writerow([step, name, value])

    def log_artifact(self, filepath):
        """Copy a file into the run directory."""
        dest = os.path.join(self.run_dir, os.path.basename(filepath))
        shutil.copy2(filepath, dest)

    def _save_git_info(self):
        """Record the exact code state."""
        try:
            commit = subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True
            ).strip()
            diff = subprocess.check_output(
                ["git", "diff", "--stat"], text=True
            ).strip()
            info = {"commit": commit, "uncommitted_changes": diff or "none"}
            path = os.path.join(self.run_dir, "git_info.json")
            with open(path, "w") as f:
                json.dump(info, f, indent=2)
        except (subprocess.CalledProcessError, FileNotFoundError):
            pass  # not in a git repo — skip

    def finish(self, final_metrics=None):
        """Write a summary with runtime and final metrics."""
        summary = {
            "run_dir": self.run_dir,
            "start_time": self.start_time.isoformat(),
            "end_time": datetime.now().isoformat(),
            "duration_seconds": (datetime.now() - self.start_time).total_seconds(),
            "final_metrics": final_metrics or {},
        }
        path = os.path.join(self.run_dir, "summary.json")
        with open(path, "w") as f:
            json.dump(summary, f, indent=2)
        print(f"Run saved to {self.run_dir}")
Here's how you'd instrument a training loop with it:
# Usage example — instrument any training loop
tracker = ExperimentTracker()
tracker.log_params({
    "model": "resnet18",
    "lr": 0.003,
    "batch_size": 64,
    "epochs": 20,
    "dropout": 0.3,
    "optimizer": "adam",
    "data_version": "v2.1",
})

for epoch in range(20):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
    tracker.log_metric("train_loss", train_loss, step=epoch)
    tracker.log_metric("val_loss", val_loss, step=epoch)
    tracker.log_metric("val_accuracy", val_acc, step=epoch)

tracker.finish({"val_accuracy": val_acc, "val_loss": val_loss})
After training, your run directory looks like this:
runs/2026-02-27_143022_a1b2c3/
    params.json      ← hyperparameters
    metrics.csv      ← loss & accuracy per epoch
    git_info.json    ← exact code version
    summary.json     ← runtime & final metrics
Everything is a plain text file. You can cat params.json from the command line, grep across all runs to find the one with the best accuracy, or diff two parameter files to see what changed. No special tools required. This transparency is the single most important design decision — when you're debugging at 2 AM, you want to ls your way to the answer, not fight a database.
Comparing Experiments — The Dashboard
Logging experiments is only half the problem. The other half is finding the good ones. After 50 runs, you need to quickly answer: which run had the best validation accuracy? What was different about it? Let's build a comparison tool.
import os
import json

def load_all_runs(base_dir="runs"):
    """Load params and final metrics from every run directory."""
    runs = []
    for name in sorted(os.listdir(base_dir)):
        run_dir = os.path.join(base_dir, name)
        if not os.path.isdir(run_dir):
            continue
        run = {"name": name}
        params_path = os.path.join(run_dir, "params.json")
        summary_path = os.path.join(run_dir, "summary.json")
        if os.path.exists(params_path):
            with open(params_path) as f:
                run["params"] = json.load(f)
        if os.path.exists(summary_path):
            with open(summary_path) as f:
                run["summary"] = json.load(f)
        runs.append(run)
    return runs

def compare_runs(runs, sort_by="val_accuracy", top_n=10):
    """Print a sorted leaderboard of runs."""
    scored = []
    for r in runs:
        metrics = r.get("summary", {}).get("final_metrics", {})
        score = metrics.get(sort_by, 0)
        params = r.get("params", {})
        scored.append((score, r["name"], params.get("lr"), params.get("batch_size")))
    scored.sort(key=lambda r: r[0], reverse=True)
    print(f"{'Rank':<6}{'Run':<30}{'Score':<10}{'LR':<10}{'Batch':<8}")
    print("-" * 64)
    for i, (score, name, lr, bs) in enumerate(scored[:top_n]):
        print(f"{i+1:<6}{name:<30}{score:<10.4f}{lr!s:<10}{bs!s:<8}")

def diff_runs(run_a, run_b):
    """Show which hyperparameters differ between two runs."""
    params_a = run_a.get("params", {})
    params_b = run_b.get("params", {})
    all_keys = set(params_a) | set(params_b)
    diffs = []
    for key in sorted(all_keys):
        va, vb = params_a.get(key), params_b.get(key)
        if va != vb:
            diffs.append((key, va, vb))
    if diffs:
        print(f"{'Parameter':<20}{'Run A':<20}{'Run B':<20}")
        print("-" * 60)
        for key, va, vb in diffs:
            print(f"{key:<20}{str(va):<20}{str(vb):<20}")
    else:
        print("Runs have identical parameters.")
Now the workflow becomes natural: run compare_runs(load_all_runs(), sort_by="val_accuracy") to see your leaderboard. Spot the winner. Then diff_runs(winner, runner_up) to see exactly what made it better — maybe it was a larger learning rate, maybe it was an extra epoch, maybe it was a different random seed. The diff turns hindsight into insight.
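For instance, diffing two hypothetical runs (parameter values invented for illustration) pinpoints exactly which knobs changed. This condenses the same logic as diff_runs above into a few lines:

```python
# Two hypothetical runs — the only differences are lr and epochs
winner = {"params": {"model": "resnet18", "lr": 0.003, "batch_size": 64, "epochs": 25}}
runner_up = {"params": {"model": "resnet18", "lr": 0.001, "batch_size": 64, "epochs": 20}}

# Same diff logic as diff_runs above, condensed to a dict comprehension
pa, pb = winner["params"], runner_up["params"]
diffs = {k: (pa.get(k), pb.get(k))
         for k in sorted(set(pa) | set(pb))
         if pa.get(k) != pb.get(k)}

print(diffs)  # → {'epochs': (25, 20), 'lr': (0.003, 0.001)}
```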
Reproducibility — From Config to Identical Results
Tracking experiments is useless if you can't reproduce them. "I got 94% accuracy with these hyperparameters" means nothing if rerunning the same code gives 91%. Reproducibility requires controlling every source of randomness — and there are more than you'd think.
import os
import json
import random
import hashlib
import numpy as np

def seed_everything(seed=42):
    """Fix all sources of randomness for exact reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    # Affects subprocesses; for this process it must be set before startup
    os.environ["PYTHONHASHSEED"] = str(seed)
    # For PyTorch (if available):
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

class ReproduciblePipeline:
    def __init__(self, config_path):
        self.config_path = config_path
        with open(config_path) as f:
            self.config = json.load(f)
        seed_everything(self.config.get("seed", 42))

    def run(self):
        """Every step is determined by the config — nothing is implicit."""
        cfg = self.config
        tracker = ExperimentTracker()
        tracker.log_params(cfg)
        tracker.log_artifact(self.config_path)  # save the config itself

        # Data loading — version controlled
        X_train, y_train = load_data(cfg["data_path"], cfg["data_version"])

        # Preprocessing — parameterized, not hardcoded
        if cfg.get("normalize", True):
            mean, std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
            X_train = (X_train - mean) / std

        # Model — architecture from config
        model = build_model(cfg["model"], cfg["hidden_size"], cfg["dropout"])

        # Training loop
        for epoch in range(cfg["epochs"]):
            loss = train_epoch(model, X_train, y_train, lr=cfg["lr"])
            tracker.log_metric("train_loss", loss, step=epoch)

        tracker.finish({"final_loss": loss})
        return model

    def fingerprint(self):
        """Hash the config to uniquely identify this experiment."""
        config_str = json.dumps(self.config, sort_keys=True)
        return hashlib.sha256(config_str.encode()).hexdigest()[:12]

# Run from command line: python train.py --config experiment_42.json
# Two runs with the same config should produce identical results
# (exact bit-identity can still be broken by GPU non-determinism).
The key principle: nothing is implicit. Every decision — the random seed, the normalization choice, the model architecture, the learning rate — lives in the config file. The code is a deterministic function from config to results. Change nothing in the config, get the same bits out. This is the gold standard of reproducibility.
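A quick sanity check of the seeding idea, using only the standard library and NumPy (the helper name here is mine, not part of the tracker):

```python
import random
import numpy as np

def seeded_draws(seed):
    """Seed both RNGs the pipeline uses, then draw once from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())

# Same seed → bit-identical draws; different seed → different draws
assert seeded_draws(42) == seeded_draws(42)
assert seeded_draws(42) != seeded_draws(7)
print("seeding is deterministic")
```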
There's a spectrum of reproducibility, and you should know where you are on it:
- Exact reproducibility — same hardware, same seeds, same library versions → bit-identical results. Achievable but fragile (CUDA non-determinism, floating-point ordering)
- Statistical reproducibility — different random seeds → results within the same confidence interval. This is what most research needs
- Conceptual reproducibility — completely different implementation → same conclusions. The highest bar, and what makes a finding trustworthy
For most production work, exact reproducibility is worth the effort. It's not just academic — when your model suddenly performs worse, you need to diff the current run against the last known good run and find exactly what changed. If runs aren't deterministic, you can't distinguish "the data distribution shifted" from "the random seed was different."
Production Patterns — MLflow and W&B Concepts
Our from-scratch tracker handles single runs well. But in a real ML team, you need higher-level organization: grouping related runs into experiments, managing model versions, and promoting the best model to production. Here's how production tools like MLflow and Weights & Biases structure this.
import os
import json
from datetime import datetime

class Experiment:
    """A named collection of related runs (one hypothesis being tested)."""

    def __init__(self, name, base_dir="experiments"):
        self.name = name
        self.exp_dir = os.path.join(base_dir, name)
        os.makedirs(self.exp_dir, exist_ok=True)

    def new_run(self, tags=None):
        """Create a tracked run within this experiment."""
        tracker = ExperimentTracker(base_dir=self.exp_dir)
        if tags:
            tag_path = os.path.join(tracker.run_dir, "tags.json")
            with open(tag_path, "w") as f:
                json.dump(tags, f, indent=2)
        return tracker

class ModelRegistry:
    """Track model versions and their deployment status."""

    def __init__(self, registry_dir="model_registry"):
        self.registry_dir = registry_dir
        os.makedirs(registry_dir, exist_ok=True)
        self.registry_file = os.path.join(registry_dir, "registry.json")
        self.registry = self._load()

    def _load(self):
        if os.path.exists(self.registry_file):
            with open(self.registry_file) as f:
                return json.load(f)
        return {"models": []}

    def _save(self):
        with open(self.registry_file, "w") as f:
            json.dump(self.registry, f, indent=2)

    def register(self, name, run_dir, metrics):
        """Register a model from a completed run."""
        version = len([m for m in self.registry["models"] if m["name"] == name]) + 1
        entry = {
            "name": name, "version": version, "run_dir": run_dir,
            "metrics": metrics, "stage": "development",
            "registered_at": datetime.now().isoformat(),
        }
        self.registry["models"].append(entry)
        self._save()
        return version

    def promote(self, name, version, stage):
        """Move a model version to staging or production."""
        for m in self.registry["models"]:
            # Archive whichever model of this name currently holds the stage
            if m["name"] == name and m["stage"] == stage:
                m["stage"] = "archived"
            if m["name"] == name and m["version"] == version:
                m["stage"] = stage
        self._save()

    def get_production_model(self, name):
        """Get the current production model, if any."""
        for m in reversed(self.registry["models"]):
            if m["name"] == name and m["stage"] == "production":
                return m
        return None

# Workflow:
# exp = Experiment("lr_sweep_resnet18")
# run = exp.new_run(tags={"hypothesis": "higher lr helps"})
# ... train ...
# registry = ModelRegistry()
# registry.register("resnet18", run.run_dir, {"accuracy": 0.94})
# registry.promote("resnet18", version=3, stage="production")
The key concepts that every production tool implements:
- Experiments — group runs that test the same hypothesis ("does dropout help?" gets its own experiment with 10 runs at different dropout rates)
- Tags and notes — human-readable annotations that make runs searchable. "baseline", "trying-larger-model", "bug-fix-v2" are worth more than timestamps
- Model registry — promotes the best run's model through stages: development → staging → production. Provides version numbers and rollback capability
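The promote-and-archive transition is what gives you rollback: promoting version 2 archives the previous production model instead of deleting it. A condensed, in-memory sketch of that state transition (mirroring the loop in ModelRegistry.promote above, minus file I/O):

```python
# Minimal registry state — two versions of one model
models = [
    {"name": "resnet18", "version": 1, "stage": "production"},
    {"name": "resnet18", "version": 2, "stage": "development"},
]

def promote(models, name, version, stage):
    """Same transition as ModelRegistry.promote, on an in-memory list."""
    for m in models:
        if m["name"] == name and m["stage"] == stage:
            m["stage"] = "archived"   # old holder of the stage is kept, archived
        if m["name"] == name and m["version"] == version:
            m["stage"] = stage

promote(models, "resnet18", version=2, stage="production")
print([(m["version"], m["stage"]) for m in models])
# → [(1, 'archived'), (2, 'production')]
```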
The tradeoff between self-hosted and managed tools is real. MLflow gives you full control — your data stays on your infrastructure, you own the database — but you're responsible for ops. Weights & Biases gives you a polished UI, team collaboration, and GPU monitoring, but your experiment data lives in their cloud. For small teams, W&B's free tier is hard to beat. For regulated industries (healthcare, finance), self-hosted MLflow is often the only option.
Hyperparameter Sweeps — Systematic Search
Once you can track experiments reliably, the next step is running them systematically. Instead of manually trying learning rates, let the computer explore the space. But how you explore matters enormously — the wrong strategy wastes 10x the compute for the same result.
import itertools
import math
import random

class SearchSpace:
    """Define the space of hyperparameters to explore."""

    @staticmethod
    def log_uniform(low, high):
        return lambda: math.exp(random.uniform(math.log(low), math.log(high)))

    @staticmethod
    def choice(options):
        return lambda: random.choice(options)

    @staticmethod
    def uniform(low, high):
        return lambda: random.uniform(low, high)

def grid_search(param_grid):
    """Exhaustive: try every combination. O(k^d) evaluations."""
    keys = list(param_grid.keys())
    values = list(param_grid.values())
    for combo in itertools.product(*values):
        yield dict(zip(keys, combo))

def random_search(search_space, n_trials):
    """Sample randomly: Bergstra & Bengio (2012) showed this
    beats grid search because it explores each dimension independently."""
    for _ in range(n_trials):
        yield {name: sampler() for name, sampler in search_space.items()}

def successive_halving(search_space, n_configs, min_budget, max_budget, evaluate_fn):
    """Start many configs with a small budget, prune the worst half,
    double the budget for survivors. Total compute is O(n log n) in
    units of min_budget, versus O(n * max_budget) for full training."""
    configs = list(random_search(search_space, n_configs))
    budget = min_budget
    while len(configs) > 1 and budget <= max_budget:
        # Evaluate all surviving configs with the current budget
        results = [(evaluate_fn(cfg, budget), cfg) for cfg in configs]
        # Sort by score only — comparing the config dicts on ties would raise TypeError
        results.sort(key=lambda r: r[0], reverse=True)
        # Keep the top half
        configs = [cfg for _, cfg in results[:max(len(results) // 2, 1)]]
        budget *= 2
    return configs[0] if configs else None

# Example usage:
# space = {
#     "lr": SearchSpace.log_uniform(1e-5, 1e-1),
#     "batch_size": SearchSpace.choice([16, 32, 64, 128]),
#     "dropout": SearchSpace.uniform(0.0, 0.5),
# }
# best = successive_halving(space, n_configs=27, min_budget=1, max_budget=81, evaluate_fn=train_and_eval)
Why does random search beat grid search? It's not because random is magic — it's because grid search wastes samples on unimportant dimensions. If your model's accuracy depends almost entirely on the learning rate but barely on the batch size, a 10×10 grid allocates 100 evaluations but only explores 10 distinct learning rates. Random search with 100 evaluations explores 100 distinct learning rates. Bergstra and Bengio showed this both empirically and analytically: when some hyperparameters matter more than others (which is almost always the case), random search finds better configurations in fewer trials.
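You can see the coverage argument directly by counting distinct values per dimension. This is a toy illustration of sampling coverage, not a benchmark:

```python
import itertools
import random

random.seed(0)

# Grid: 10 learning rates x 10 batch sizes = 100 evaluations,
# but only 10 distinct learning rates are ever tried
grid_lrs = [10 ** (-5 + 0.4 * i) for i in range(10)]
batch_sizes = list(range(16, 176, 16))  # 10 options
grid = list(itertools.product(grid_lrs, batch_sizes))
distinct_grid_lrs = len({lr for lr, _ in grid})

# Random: 100 evaluations, 100 distinct learning rates
rand = [(10 ** random.uniform(-5, -1), random.choice(batch_sizes))
        for _ in range(100)]
distinct_rand_lrs = len({lr for lr, _ in rand})

print(len(grid), distinct_grid_lrs, distinct_rand_lrs)  # → 100 10 100
```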
Successive halving is even smarter. Instead of running every configuration to completion, it gives each one a tiny budget (say, 1 epoch), evaluates them all, throws away the bottom half, and doubles the budget for the survivors. After log₂(n) rounds, only the best configuration remains — and it's been trained to the full budget. The total compute is O(n × log n) instead of O(n × max_budget) for trying everything to completion.
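To make the budget arithmetic concrete, here is the halving schedule for 16 configurations starting at a budget of 1 epoch (toy numbers, and simplified in that each round is costed as training every survivor at the new budget):

```python
# Round-by-round schedule: halve the survivors, double the budget
n, budget = 16, 1
total_cost = 0
schedule = []
while n >= 1:
    schedule.append((n, budget))
    total_cost += n * budget   # every survivor trains at this budget
    if n == 1:
        break
    n //= 2
    budget *= 2

print(schedule)    # → [(16, 1), (8, 2), (4, 4), (2, 8), (1, 16)]
print(total_cost)  # → 80, vs 16 * 16 = 256 for training everything fully
```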
Conclusion
Experiment tracking isn't glamorous. Nobody writes blog posts about their logging infrastructure (well, except this one). But it's the difference between ML as alchemy — "I tweaked some knobs and something good happened" — and ML as engineering — "I changed the learning rate from 0.001 to 0.003, which improved validation accuracy by 2.1 percentage points across 5 seeds, and here's the run to prove it."
We built a complete tracking system from scratch: a minimal tracker that writes human-readable files, a comparison tool for finding winning configurations, a reproducibility framework for deterministic results, MLflow-style experiments and model registry patterns, and a hyperparameter sweep manager with three search strategies. The total code is under 300 lines of Python with zero external dependencies.
Start simple. Add the tracker to your next training script — it takes five minutes and saves hours. When you outgrow flat files, you'll already understand the concepts that production tools implement, and the migration will be painless. The best experiment you'll ever run is the one you can actually find again six months later.
References & Further Reading
- Bergstra & Bengio — "Random Search for Hyper-Parameter Optimization" (2012) — proved random search beats grid search when some hyperparameters matter more than others
- Lipton & Steinhardt — "Troubling Trends in Machine Learning Scholarship" (2018) — on the reproducibility crisis in ML research
- MLflow Documentation — the most popular open-source experiment tracking platform
- Weights & Biases Documentation — managed experiment tracking with team collaboration features
- Vartak et al. — "ModelDB: A System for Machine Learning Model Management" (2016) — the first purpose-built experiment management system
- Li et al. — "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization" (2017) — the paper that formalized successive halving for ML
- Bouthillier et al. — "Accounting for Variance in Machine Learning Benchmarks" (2021) — why single-run comparisons are unreliable and how to fix it