Second-Order Optimization from Scratch
Why Gradient Descent Struggles with Curvature
Picture a long, narrow valley — steep walls on the sides, gentle slope toward the minimum. Gradient descent sees the steepest direction and charges sideways into the wall, bounces off, charges into the opposite wall, bounces again. It zigzags endlessly, barely making progress along the valley floor. A smarter optimizer would recognize the shape of the valley and step directly toward the bottom. That’s what second-order methods do — they see the shape, not just the slope.
The trouble comes down to the condition number κ = λmax/λmin, the ratio of the largest to smallest eigenvalue of the Hessian matrix. For a 2D quadratic, κ is the squared ratio of the longest to shortest axis of the elliptical contour lines (the axes scale as 1/√λ). When κ = 1, the contours are circles and gradient descent walks straight to the minimum. When κ = 100, the contours are long thin ellipses — ten times longer than they are wide — and gradient descent needs roughly O(κ) iterations, a hundred times slower. In practice, neural network loss surfaces routinely have condition numbers of 10³ to 10⁶.
The key insight is that the gradient points in the direction of steepest descent, but that’s not the direction toward the minimum unless all curvatures are equal. Momentum helps (reducing the dependence to O(√κ)), but doesn’t eliminate the fundamental problem. To truly fix it, we need curvature information — the Hessian.
Let’s see this in action. We’ll minimize a 2D quadratic f(x) = ½xᵀAx where A has eigenvalues [1, 100], giving κ = 100:
```python
import numpy as np

# Ill-conditioned 2D quadratic: f(x) = 0.5 * x^T A x
# Eigenvalues 1 and 100 give condition number kappa = 100
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

# Gradient descent with fixed step size
x = np.array([1.0, 1.0])
lr = 0.019  # near-optimal for this problem: 2/(lambda_max + lambda_min) ≈ 0.0198
trajectory = [x.copy()]
for i in range(200):
    x = x - lr * grad_f(x)
    trajectory.append(x.copy())
    if np.linalg.norm(grad_f(x)) < 1e-6:
        break

print(f"GD stopped after {len(trajectory)-1} iterations")
print(f"Final point: ({x[0]:.6f}, {x[1]:.6f})")
# Output: GD stopped after 200 iterations -- it never reached the tolerance
# The zigzag pattern in x[0] shows the problem
```
With κ = 100, gradient descent crawls. It oscillates along the high-curvature direction (the x-axis with eigenvalue 100) while making tiny progress along the low-curvature direction. This is the fundamental limitation that second-order methods solve.
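To see momentum’s O(√κ) improvement concretely, here’s a minimal heavy-ball sketch on the same quadratic. It assumes the classical optimal parameter choices for a quadratic, lr = 4/(√λmax + √λmin)² and β = ((√κ−1)/(√κ+1))²:

```python
import numpy as np

# Same ill-conditioned quadratic as above (eigenvalues 1 and 100, kappa = 100)
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def grad_f(x):
    return A @ x

# Classical heavy-ball parameters for a quadratic:
# lr = 4/(sqrt(L) + sqrt(mu))^2, beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2
kappa = 100.0
lr = 4.0 / (np.sqrt(100.0) + np.sqrt(1.0))**2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))**2

x = np.array([1.0, 1.0])
velocity = np.zeros(2)
steps = 0
for i in range(300):
    velocity = beta * velocity - lr * grad_f(x)
    x = x + velocity
    steps = i + 1
    if np.linalg.norm(grad_f(x)) < 1e-6:
        break

print(f"Momentum reached |grad| < 1e-6 in {steps} steps")
# Plain GD above was still far from the minimum after 200 steps
```

Momentum trades the O(κ) dependence for O(√κ), but the zigzag is tamed, not eliminated; only curvature information removes it.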
Newton’s Method — Using the Full Hessian
The idea behind Newton’s method is elegant: at any point, approximate the loss function with a local quadratic model, then jump directly to that model’s minimum. The second-order Taylor expansion around the current weights w is:
f(w + δ) ≈ f(w) + ∇fᵀδ + ½δᵀ∇²f δ
Setting the gradient of this quadratic model to zero gives the Newton update: δ* = −(∇²f)⁻¹∇f. In words: multiply the gradient by the inverse Hessian to rescale each direction by its curvature. High-curvature directions get smaller steps; low-curvature directions get larger steps.
For a pure quadratic, the local model is the function, so Newton converges in exactly one step regardless of the condition number. For general smooth functions, Newton achieves quadratic convergence near the minimum — the error squares at each iteration. Compare that to gradient descent’s linear convergence, where only a constant fraction of the error is removed per step.
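To watch the error square, here’s a one-dimensional sketch: Newton’s method minimizing the toy function f(x) = x⁴/4 − x (chosen here for illustration; its minimizer is x* = 1 with f″(1) = 3 > 0, so quadratic convergence applies):

```python
# Newton's method in 1D on f(x) = x^4/4 - x, minimizer x* = 1.
# f'(x) = x^3 - 1, f''(x) = 3x^2. Watch the error shrink.
x = 2.0
for i in range(7):
    x = x - (x**3 - 1.0) / (3.0 * x**2)
    print(f"iter {i+1}: error = {abs(x - 1.0):.2e}")
# The error goes roughly 4e-1, 1e-1, 1e-2, 1e-4, 1e-8, ... -- once the
# iterate is near the minimum, each step squares the previous error.
```

Gradient descent on the same function would shave off only a constant fraction of the error per step.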
There are three catches. First, computing the full Hessian costs O(n²) storage and O(n³) to invert for n parameters. For a neural net with 10⁸ parameters, that’s 10¹⁶ entries — roughly 80 petabytes in float64. Second, far from the minimum, the Hessian may not be positive definite, which means Newton can walk uphill or toward saddle points. Third, the quadratic model may be inaccurate far from the minimum. The fix for the last two problems: damped Newton (add a step size α) or trust regions (constrain ‖δ‖ ≤ Δ).
```python
import numpy as np

A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def grad_f(x):
    return A @ x

def hessian_f(x):
    return A  # constant for quadratics

# Newton's method on the same quadratic
x_newton = np.array([1.0, 1.0])
H = hessian_f(x_newton)
x_newton = x_newton - np.linalg.solve(H, grad_f(x_newton))
print(f"Newton converged in 1 step: ({x_newton[0]:.6f}, {x_newton[1]:.6f})")
# Output: Newton converged in 1 step: (0.000000, 0.000000)

# Newton on the Rosenbrock function: f(x,y) = (1-x)^2 + 100(y-x^2)^2
def rosenbrock_grad(xy):
    x, y = xy
    dx = -2*(1-x) + 200*(y - x**2)*(-2*x)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

def rosenbrock_hessian(xy):
    x, y = xy
    dxx = 2 - 400*(y - x**2) + 800*x**2
    dxy = -400*x
    dyy = 200.0
    return np.array([[dxx, dxy], [dxy, dyy]])

x_r = np.array([-1.0, 1.0])
for i in range(20):
    g = rosenbrock_grad(x_r)
    H = rosenbrock_hessian(x_r)
    x_r = x_r - np.linalg.solve(H, g)
    if np.linalg.norm(g) < 1e-10:
        print(f"Newton on Rosenbrock: {i+1} iterations to ({x_r[0]:.4f}, {x_r[1]:.4f})")
        break
# Output: Newton on Rosenbrock: 8 iterations to (1.0000, 1.0000)
# GD would need thousands of iterations on the same problem
```
One step for the quadratic, eight for the notoriously difficult Rosenbrock function (where gradient descent would need thousands). The power of curvature information is clear. But the cost of computing and inverting that Hessian matrix is the reason we can’t just use Newton on every deep learning problem.
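The damping and trust-region safeguards mentioned above are easy to sketch. Here’s one version — a Levenberg-style damped Newton step with simple backtracking, one of several reasonable safeguard combinations — run on Rosenbrock from a starting point of my choosing:

```python
import numpy as np

def rosenbrock(xy):
    x, y = xy
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(xy):
    x, y = xy
    return np.array([-2*(1-x) - 400*x*(y - x**2), 200*(y - x**2)])

def rosenbrock_hessian(xy):
    x, y = xy
    return np.array([[2 - 400*(y - x**2) + 800*x**2, -400*x],
                     [-400*x, 200.0]])

def damped_newton_step(xy, lam=1e-3):
    """Levenberg-style damping: solve (H + lam*I) delta = -grad,
    increasing lam until the damped Hessian is positive definite."""
    g = rosenbrock_grad(xy)
    H = rosenbrock_hessian(xy)
    while np.min(np.linalg.eigvalsh(H + lam * np.eye(2))) <= 0:
        lam *= 10.0
    delta = np.linalg.solve(H + lam * np.eye(2), -g)
    # Backtracking line search: halve the step until the loss decreases
    alpha = 1.0
    while rosenbrock(xy + alpha * delta) > rosenbrock(xy) and alpha > 1e-8:
        alpha *= 0.5
    return xy + alpha * delta

xy = np.array([-1.5, 2.0])  # a start where pure Newton can misbehave
for i in range(50):
    xy = damped_newton_step(xy)
    if np.linalg.norm(rosenbrock_grad(xy)) < 1e-8:
        break
print(f"Damped Newton: ({xy[0]:.4f}, {xy[1]:.4f}) after {i+1} steps")
```

The damping shifts every Hessian eigenvalue up by λ, guaranteeing a descent direction, and the backtracking rejects steps where the quadratic model was too optimistic.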
Quasi-Newton Methods — Approximating the Hessian Cheaply
What if we could build up an approximation of the Hessian over time, using only gradient information? That’s the quasi-Newton idea. The key constraint is the secant condition: our Hessian approximation B must satisfy Bₖ₊₁sₖ = yₖ, where sₖ = wₖ₊₁ − wₖ is the step we took and yₖ = ∇fₖ₊₁ − ∇fₖ is the gradient change. This is essentially saying: “the Hessian times a step should equal the gradient change along that step.”
The BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno) does this with a rank-2 update that maintains both symmetry and positive definiteness — the two properties that make the Hessian useful. BFGS is the gold standard for medium-sized optimization problems (up to ~10,000 parameters).
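The update itself is compact enough to show directly. This sketch uses the standard inverse-Hessian form, H⁺ = (I − ρsyᵀ)H(I − ρysᵀ) + ρssᵀ with ρ = 1/(yᵀs), on the 2D quadratic from earlier; the exact line search is only available because the objective is quadratic:

```python
import numpy as np

# BFGS rank-2 update on the *inverse*-Hessian approximation H:
#   H+ = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T,  rho = 1/(y^T s)
# It preserves symmetry and positive definiteness, and enforces the
# secant condition H+ @ y = s exactly.
def bfgs_update(H, s, y):
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Minimize the ill-conditioned quadratic f(x) = 0.5 x^T A x
A = np.array([[100.0, 0.0], [0.0, 1.0]])
x = np.array([1.0, 1.0])
H = np.eye(2)  # initial inverse-Hessian guess
for i in range(25):
    g = A @ x
    if np.linalg.norm(g) < 1e-10:
        break
    d = -H @ g
    alpha = -(g @ d) / (d @ A @ d)  # exact line search, valid for quadratics
    s = alpha * d
    y = A @ (x + s) - g
    H = bfgs_update(H, s, y)
    x = x + s

print(f"BFGS: x = ({x[0]:.2e}, {x[1]:.2e})")
print("secant residual:", np.linalg.norm(H @ y - s))
```

With exact line searches, BFGS terminates on an n-dimensional quadratic in at most n steps, and the final H satisfies the secant condition to machine precision.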
For larger problems, even storing the n × n approximation matrix is too expensive. L-BFGS (limited-memory BFGS) solves this by storing only the last m gradient pairs (typically m = 5–20) and computing the matrix-vector product Hₖ∇f on the fly using the elegant two-loop recursion. Memory drops from O(n²) to O(mn), while convergence stays superlinear.
L-BFGS is everywhere: it’s scikit-learn’s default solver for logistic regression, SciPy’s go-to optimizer, and the standard for any convex problem where you can compute gradients. Here’s the two-loop recursion at its heart:
```python
import numpy as np

def lbfgs_two_loop(grad, s_history, y_history):
    """Compute H_k @ grad using the L-BFGS two-loop recursion.

    s_history: list of step vectors (w_{k+1} - w_k)
    y_history: list of gradient differences (grad_{k+1} - grad_k)
    Returns the search direction (approximate Newton step).
    """
    m = len(s_history)
    q = grad.copy()
    alphas = np.zeros(m)
    rhos = np.zeros(m)
    # Backward loop: peel off each correction, newest to oldest
    for i in range(m - 1, -1, -1):
        rhos[i] = 1.0 / (y_history[i] @ s_history[i])
        alphas[i] = rhos[i] * (s_history[i] @ q)
        q = q - alphas[i] * y_history[i]
    # Initial Hessian estimate: scale by the most recent curvature
    if m > 0:
        gamma = (s_history[-1] @ y_history[-1]) / (y_history[-1] @ y_history[-1])
        r = gamma * q
    else:
        r = q
    # Forward loop: apply corrections in order, oldest to newest
    for i in range(m):
        beta = rhos[i] * (y_history[i] @ r)
        r = r + (alphas[i] - beta) * s_history[i]
    return r

# Demo: L-BFGS on a 10D ill-conditioned quadratic
n = 10
np.random.seed(42)
eigvals = np.logspace(0, 3, n)  # condition number = 1000
Q, _ = np.linalg.qr(np.random.randn(n, n))
A = Q @ np.diag(eigvals) @ Q.T

x = np.ones(n)
s_hist, y_hist = [], []
m_keep = 10  # number of pairs to store
for step in range(50):
    g = A @ x
    if np.linalg.norm(g) < 1e-10:
        print(f"L-BFGS converged in {step} steps (kappa=1000, n={n})")
        break
    direction = lbfgs_two_loop(g, s_hist, y_hist)
    x_new = x - direction  # step size = 1 for quadratics
    s_hist.append(x_new - x)
    y_hist.append(A @ x_new - g)
    if len(s_hist) > m_keep:
        s_hist.pop(0)
        y_hist.pop(0)
    x = x_new
# Output: L-BFGS converged in 11 steps (kappa=1000, n=10)
```
Eleven steps on a 10-dimensional problem with condition number 1,000 — gradient descent would need thousands. The two-loop recursion is a masterpiece of algorithm design: it extracts near-Newton quality search directions from just a handful of stored gradient differences, all in O(mn) time.
The Fisher Information Matrix and Natural Gradient
So far we’ve talked about curvature of the loss function in parameter space. But when we’re training probabilistic models, there’s a subtler problem: the geometry of the parameter space itself is warped. Small changes to parameters don’t always mean small changes to the output distribution.
Consider fitting a Gaussian N(μ, σ²). Changing μ by 0.1 matters a lot when σ = 0.01 (shifts the distribution by 10 standard deviations) but barely matters when σ = 100 (shifts it by 0.001 standard deviations). Euclidean distance in (μ, σ) space doesn’t reflect distributional distance. Gradient descent in parameter space takes the same-sized step regardless — clearly suboptimal.
The Fisher information matrix F = E[∇log p(x|θ) · ∇log p(x|θ)ᵀ] measures how sensitive the output distribution is to each parameter direction. It gives the local curvature of the KL divergence: KL(pθ ‖ pθ₊δ) ≈ ½δᵀFδ. The natural gradient uses this: wₜ₊₁ = wₜ − η·F⁻¹∇L, giving steepest descent in distribution space rather than parameter space.
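The KL approximation is easy to check numerically. This sketch uses the closed-form KL between two Gaussians and the analytic Fisher F = diag(1/σ², 2) for the (μ, log σ) parameterization (the base point and step size here are arbitrary choices):

```python
import numpy as np

# Check KL(p_theta || p_theta+delta) ≈ 0.5 * delta^T F delta for a 1D
# Gaussian with theta = (mu, log sigma), where F = diag(1/sigma^2, 2).
def kl_gaussian(mu1, sigma1, mu2, sigma2):
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

mu, log_sigma = 0.5, np.log(2.0)
sigma = np.exp(log_sigma)
F = np.diag([1.0 / sigma**2, 2.0])

delta = np.array([0.01, 0.01])  # small step in (mu, log sigma)
kl_exact = kl_gaussian(mu, sigma, mu + delta[0], np.exp(log_sigma + delta[1]))
kl_quad = 0.5 * delta @ F @ delta

print(f"exact KL:    {kl_exact:.3e}")
print(f"0.5 d^T F d: {kl_quad:.3e}")
# The two agree to within a few percent at this step size, and ever more
# closely as delta shrinks.
```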
This has a beautiful connection to Newton’s method: for maximum likelihood estimation, the Fisher information equals the expected Hessian of the negative log-likelihood. So natural gradient is Newton’s method in disguise for probabilistic models. And just like Newton on quadratics, natural gradient converges in one step for exponential family distributions. It’s also reparameterization invariant — the optimization path is the same regardless of how you parameterize the model. For more on KL divergence and Fisher information, see the information theory post.
```python
import numpy as np

# Fit a 1D Gaussian N(mu, sigma^2) to data: natural gradient vs vanilla GD
np.random.seed(42)
data = np.random.normal(loc=3.0, scale=2.0, size=200)

def neg_log_likelihood(mu, log_sigma, data):
    sigma = np.exp(log_sigma)
    return 0.5 * np.mean(((data - mu) / sigma)**2) + log_sigma

def grad_nll(mu, log_sigma, data):
    sigma = np.exp(log_sigma)
    d_mu = -np.mean((data - mu) / sigma**2)
    d_log_sigma = 1.0 - np.mean(((data - mu) / sigma)**2)
    return np.array([d_mu, d_log_sigma])

def fisher_matrix(mu, log_sigma):
    """Analytical Fisher for the Gaussian in (mu, log sigma): diag(1/sigma^2, 2)."""
    sigma2 = np.exp(2 * log_sigma)
    return np.array([[1.0 / sigma2, 0.0],
                     [0.0, 2.0]])

# Vanilla gradient descent
params_gd = np.array([0.0, 0.0])  # mu = 0, log_sigma = 0 (sigma = 1)
lr = 0.05
gd_path = [params_gd.copy()]
for _ in range(100):
    g = grad_nll(params_gd[0], params_gd[1], data)
    params_gd = params_gd - lr * g
    gd_path.append(params_gd.copy())

# Natural gradient descent
params_ng = np.array([0.0, 0.0])
ng_path = [params_ng.copy()]
for _ in range(100):
    g = grad_nll(params_ng[0], params_ng[1], data)
    F = fisher_matrix(params_ng[0], params_ng[1])
    nat_grad = np.linalg.solve(F, g)
    params_ng = params_ng - lr * nat_grad
    ng_path.append(params_ng.copy())

mu_gd, sig_gd = params_gd[0], np.exp(params_gd[1])
mu_ng, sig_ng = params_ng[0], np.exp(params_ng[1])
print("True: mu=3.00, sigma=2.00")
print(f"GD after 100 steps: mu={mu_gd:.2f}, sigma={sig_gd:.2f}")
print(f"NatGrad after 100: mu={mu_ng:.2f}, sigma={sig_ng:.2f}")
# Natural gradient reaches the true parameters much faster
```
The natural gradient’s F⁻¹ rescaling means it takes big steps in directions where the distribution is insensitive (low Fisher information) and small steps where it’s sensitive. This is why natural gradient appears in reinforcement learning (TRPO enforces a Fisher-based trust region, which PPO approximates with a simpler clipped objective) and variational inference — anywhere we optimize over distributions.
Practical Second-Order Methods for Deep Learning
Here’s the scale problem: a model with 10⁸ parameters would need a Hessian with 10¹⁶ entries. Even L-BFGS, at O(mn) memory, needs 10⁹ floats for m = 10. So what can we actually use?
Diagonal approximations are the simplest answer — and you’re probably already using them. Adam’s second-moment estimate vₜ approximates the diagonal of E[ggᵀ], which is a rough diagonal Hessian approximation. This is why adaptive methods work so well: they’re cheap second-order methods that give each parameter its own effective learning rate based on local curvature. RMSProp and AdaGrad do the same thing.
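A toy illustration of why this works: when the Hessian really is diagonal, preconditioning each coordinate by its own curvature — the idealized version of what Adam’s √vₜ estimates — removes the ill-conditioning entirely:

```python
import numpy as np

# On the diagonal quadratic from earlier, giving each coordinate a
# learning rate of 1/H_ii is an exact diagonal Newton step.
A = np.array([[100.0, 0.0], [0.0, 1.0]])
diag_H = np.diag(A)  # per-parameter curvature

x = np.array([1.0, 1.0])
for step in range(100):
    g = A @ x
    if np.linalg.norm(g) < 1e-8:
        break
    x = x - g / diag_H  # per-parameter step size 1/H_ii
print(f"Diagonally preconditioned GD converged after {step} step(s)")
# Because this Hessian is exactly diagonal, one step suffices; Adam's
# running sqrt(E[g^2]) is a noisy stand-in for diag_H.
```

When the Hessian has strong off-diagonal structure (correlated parameters), a diagonal preconditioner can’t capture it — which is exactly the gap K-FAC targets.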
K-FAC (Kronecker-Factored Approximate Curvature) is the most important structured approximation. For a fully-connected layer with weight matrix W mapping input a to pre-activation s, the Fisher information block approximately factors as F_W ≈ E[aaᵀ] ⊗ E[∇sℓ · ∇sℓᵀ]. The Kronecker product ⊗ is key: if A is m×m and B is n×n, then (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹. This drops the inversion cost from O((mn)³) to O(m³ + n³) — exploiting the layer-wise structure of neural networks.
K-FAC takes fewer iterations to converge but more computation per iteration. In practice, it shines on medium-scale problems and when you can amortize the Fisher factor updates over multiple gradient steps. Here’s K-FAC for a simple 2-layer MLP:
```python
import numpy as np

np.random.seed(42)
# Synthetic binary classification: 100 points, 5 features
X = np.random.randn(100, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# 2-layer MLP: 5 -> 8 -> 1
W1 = np.random.randn(5, 8) * 0.3
b1 = np.zeros(8)
W2 = np.random.randn(8, 1) * 0.3
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def forward_and_loss(X, y, W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)  # hidden activations
    logits = (h @ W2 + b2).ravel()
    probs = sigmoid(logits)
    loss = -np.mean(y * np.log(probs + 1e-8) + (1-y) * np.log(1-probs + 1e-8))
    return h, probs, loss

# K-FAC training loop
damping = 0.01
for step in range(60):
    h, probs, loss = forward_and_loss(X, y, W1, b1, W2, b2)
    if step % 20 == 0:
        print(f"Step {step}: loss = {loss:.4f}")
    # Backprop: output gradient
    dl_dlogits = (probs - y) / len(y)  # (100,)
    dl_dW2 = h.T @ dl_dlogits.reshape(-1, 1)
    dl_db2 = dl_dlogits.sum(keepdims=True)
    dl_dh = dl_dlogits.reshape(-1, 1) @ W2.T  # (100, 8)
    dl_dpre = dl_dh * (1 - h**2)  # tanh derivative
    dl_dW1 = X.T @ dl_dpre
    dl_db1 = dl_dpre.sum(axis=0)
    # K-FAC: Kronecker factors for layer 1 (damped to keep them invertible)
    A1 = (X.T @ X) / len(y) + damping * np.eye(5)  # input covariance
    G1 = (dl_dpre.T @ dl_dpre) / len(y) + damping * np.eye(8)  # gradient covariance
    # Natural gradient for layer 1: nat_dW1 = A1^{-1} @ dl_dW1 @ G1^{-1}
    A1_inv = np.linalg.inv(A1)
    G1_inv = np.linalg.inv(G1)
    nat_dW1 = A1_inv @ dl_dW1 @ G1_inv
    # K-FAC: Kronecker factors for layer 2 (scalar output, so G2 is 1x1)
    A2 = (h.T @ h) / len(y) + damping * np.eye(8)
    G2_scalar = np.mean(dl_dlogits**2) + damping
    nat_dW2 = np.linalg.inv(A2) @ dl_dW2 / G2_scalar
    W1 -= 0.1 * nat_dW1
    b1 -= 0.1 * dl_db1
    W2 -= 0.1 * nat_dW2
    b2 -= 0.1 * dl_db2

print(f"Final loss: {loss:.4f}")
# K-FAC typically converges in fewer steps than SGD or Adam
```
The Kronecker factorization is what makes this tractable. Instead of inverting a 40×40 Fisher block for layer 1 (vectorized W1 has 40 entries), we invert a 5×5 and an 8×8 matrix separately. For larger networks, this savings is enormous.
Hessian-Vector Products — Second-Order Information for Free
What if you don’t need the full Hessian, just its effect on a vector? The Hessian-vector product Hv can be computed with a simple finite-difference trick: Hv ≈ [∇f(θ + εv) − ∇f(θ − εv)] / (2ε). That’s two gradient evaluations — no Hessian storage, no matrix inversion. In automatic differentiation frameworks, you can compute Hv = ∇(∇fᵀv) exactly with one forward pass and one backward pass.
This unlocks two powerful tools. Power iteration finds the top eigenvalue of the Hessian by repeatedly multiplying a random vector by H — useful for understanding loss landscape curvature (sharp vs flat minima). Conjugate gradient (CG) iteratively solves the Newton system Hδ = −∇f using only Hessian-vector products, giving an approximate Newton step in k iterations (typically k = 10–50) without ever forming H. This is Hessian-free optimization (Martens 2010), and it made Newton-like methods feasible for neural networks.
```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Compute H @ v using central finite differences."""
    g_plus = grad_fn(theta + eps * v)
    g_minus = grad_fn(theta - eps * v)
    return (g_plus - g_minus) / (2 * eps)

def power_iteration(grad_fn, theta, num_iters=50):
    """Find the top eigenvalue of the Hessian via power iteration."""
    v = np.random.randn(len(theta))
    v = v / np.linalg.norm(v)
    for _ in range(num_iters):
        Hv = hessian_vector_product(grad_fn, theta, v)
        eigenvalue = v @ Hv  # Rayleigh quotient with the current unit vector
        v = Hv / np.linalg.norm(Hv)
    return eigenvalue, v

def conjugate_gradient(grad_fn, theta, grad, max_iters=20, tol=1e-8):
    """Solve H @ delta = -grad approximately using CG + Hv products."""
    delta = np.zeros_like(grad)
    r = -grad.copy()  # residual
    p = r.copy()      # search direction
    rs_old = r @ r
    for i in range(max_iters):
        Hp = hessian_vector_product(grad_fn, theta, p)
        alpha = rs_old / (p @ Hp + 1e-10)
        delta = delta + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return delta

# Demo: 5D quadratic with known Hessian
np.random.seed(42)
eigvals = np.array([1.0, 5.0, 20.0, 50.0, 200.0])
Q, _ = np.linalg.qr(np.random.randn(5, 5))
A = Q @ np.diag(eigvals) @ Q.T
grad_fn = lambda x: A @ x
theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Power iteration: find the top eigenvalue
top_eigval, top_eigvec = power_iteration(grad_fn, theta)
print(f"Top eigenvalue (power iter): {top_eigval:.1f} (true: 200.0)")

# CG: approximate Newton step
grad = grad_fn(theta)
cg_step = conjugate_gradient(grad_fn, theta, grad, max_iters=10)
newton_step = -np.linalg.solve(A, grad)  # exact Newton
print(f"CG step error vs exact Newton: {np.linalg.norm(cg_step - newton_step):.2e}")
# Output: CG matches exact Newton to high precision in <= n iterations
```
Conjugate gradient on a 5-dimensional problem needs at most 5 iterations to match the exact Newton step perfectly. In practice, even 10–20 CG iterations on a million-parameter network produce a search direction that’s dramatically better than the raw gradient, at a cost of only 20 backward passes — expensive, but orders of magnitude cheaper than forming and inverting the full Hessian.
The Second-Order Landscape
Let’s step back and see the full picture. The progression from first-order to second-order methods follows a clear trade-off curve:
- Gradient descent: O(n) per step, O(κ) steps. Simple but slow on ill-conditioned problems.
- Momentum / Adam: O(n) per step, O(√κ) steps. Diagonal Hessian approximation in disguise.
- L-BFGS: O(mn) per step, superlinear convergence. Gold standard for medium-scale convex problems.
- Natural gradient: O(n²) for Fisher, invariant to parameterization. Best for probabilistic models.
- K-FAC: O(m³ + n³) per layer, exploits neural network structure. Practical deep learning second-order.
- CG + Hessian-vector products: O(kn) per step, no Hessian storage. Newton quality at gradient cost.
The dirty secret of deep learning optimization is that Adam already is a second-order method — just a very crude one. It approximates the diagonal of the Hessian using running averages of squared gradients. K-FAC is a better approximation (block-diagonal with Kronecker structure), and full Newton is the best but most expensive. Every practical optimizer lives somewhere on this spectrum, trading off curvature information against computational cost.
Conclusion
Gradient descent is blind to shape. It sees the slope and follows it downhill, oblivious to the fact that the valley is a hundred times longer than it is wide. Second-order methods open a new eye: by using curvature information — the Hessian, the Fisher, or approximations thereof — they transform the landscape so that every direction looks equally scaled. The zigzag becomes a straight line.
You don’t need to implement K-FAC or Hessian-free optimization to benefit from this understanding. Knowing that Adam is a diagonal Hessian approximation explains why it works and when it fails (highly correlated parameters where off-diagonal curvature matters). Knowing about condition numbers explains why learning rate schedules matter and why some architectures train faster than others. And knowing about L-BFGS explains why classical machine learning (logistic regression, SVMs) can use exact solvers while deep learning can’t.
The frontier keeps pushing: Shampoo (Google, 2018) generalizes K-FAC to tensor-valued preconditioners, and recent work on sketched second-order methods uses randomized linear algebra to compress the Hessian. But the fundamental insight is the same one behind Newton’s original root-finding method from the 17th century: if you know the shape of the curve, you can find the root in one step.
References & Further Reading
- Boyd & Vandenberghe — Convex Optimization (2004) — The standard textbook; chapters 9–10 cover Newton’s method and quasi-Newton
- Nocedal & Wright — Numerical Optimization (2006) — The practitioner’s reference for quasi-Newton, conjugate gradient, and trust regions
- Amari (1998) — Natural Gradient Works Efficiently in Learning — The foundational paper for natural gradient descent
- Martens (2010) — Deep Learning via Hessian-Free Optimization — Showed CG + Hessian-vector products can train deep networks
- Martens & Grosse (2015) — Optimizing Neural Networks with K-FAC — Kronecker-factored approximate curvature for practical deep learning
- Duchi, Hazan & Singer (2011) — Adaptive Subgradient Methods (AdaGrad) — The connection between adaptive methods and diagonal second-order
- Gupta, Koren & Singer (2018) — Shampoo — Generalized K-FAC with matrix-valued preconditioners