
Time Series Forecasting from Scratch

Introduction

Every second of every day, the world generates an unending river of time-ordered data. Your heart beats 100,000 times a day — that's a time series. Your server logs millions of requests — each timestamped, each telling a story. Stock markets, weather stations, IoT sensors, web analytics, baby sleep schedules — all time series. And the question everyone asks is always the same: what happens next?

This post builds five progressively sophisticated forecasting methods from scratch, starting with classical statistical approaches that predate computers and ending with transformer-based temporal models. Along the way, we'll discover something surprising: unlike natural language processing where transformers dominate, time series forecasting has a richer landscape where simple methods regularly beat complex ones. The M-competition series — the Olympics of forecasting — has consistently shown that statistical methods compete with or beat deep learning on standard benchmarks. Understanding why teaches deep lessons about inductive biases, overfitting, and the difference between interpolation and extrapolation.

We'll build: exponential smoothing, ARIMA, LSTMs, temporal convolutions, and temporal transformers. Each one teaches a different lesson about how to reason about time. Let's start by understanding what we're working with.

1. The Anatomy of a Time Series

Before building any model, we need to understand the structure hiding inside time-ordered data. Most real-world time series are not random walks — they're composed of distinct, predictable components layered on top of each other.

Additive decomposition splits a time series into three components:

The equation is deceptively simple: observed = trend + seasonal + residual. Every forecasting method either models these components explicitly (like exponential smoothing, which has separate equations for each) or learns them implicitly (like neural networks, which discover the patterns from data). Understanding decomposition tells you what kind of signal your model needs to capture.

Let's build a decomposer. The core idea: use a centered moving average to extract the trend (it smooths away seasonality), then average each calendar position (e.g., all Mondays) to extract the seasonal pattern, and whatever's left is residual.

import numpy as np

def decompose_additive(series, period):
    """Decompose time series into trend, seasonal, and residual."""
    n = len(series)

    # Step 1: Extract trend with centered moving average
    # (assumes an odd period; even periods conventionally use a 2xm weighted average)
    trend = np.full(n, np.nan)
    half = period // 2
    for i in range(half, n - half):
        trend[i] = np.mean(series[i - half:i + half + 1])

    # Step 2: Detrend to isolate seasonal + residual
    detrended = series - trend

    # Step 3: Average each seasonal position across all cycles
    seasonal = np.zeros(n)
    for pos in range(period):
        indices = range(pos, n, period)
        vals = [detrended[j] for j in indices if not np.isnan(detrended[j])]
        avg = np.mean(vals) if vals else 0.0
        for j in indices:
            seasonal[j] = avg

    # Normalize seasonal component to sum to zero per cycle
    seasonal -= np.mean(seasonal[:period])

    # Step 4: Residual = observed - trend - seasonal
    residual = series - trend - seasonal
    return trend, seasonal, residual

# --- Generate synthetic daily website traffic ---
np.random.seed(42)
days = 365
t = np.arange(days, dtype=float)
trend_true = 1000 + 2 * t                        # +2 visitors/day growth
seasonal_true = 200 * np.sin(2 * np.pi * t / 7)  # Weekly cycle
noise = np.random.normal(0, 50, days)
traffic = trend_true + seasonal_true + noise

trend_est, seasonal_est, residual_est = decompose_additive(traffic, period=7)
print(f"Trend range:  {np.nanmin(trend_est):.0f} to {np.nanmax(trend_est):.0f}")
print(f"Seasonal amp: {seasonal_est.max() - seasonal_est.min():.0f}")
print(f"Residual std: {np.nanstd(residual_est):.1f}")

The decomposer recovers the trend (1,000 to 1,730 daily visitors), the weekly pattern (~400 peak-to-trough), and the residuals (standard deviation near our injected noise of 50). This is the foundation — every method we build next is essentially trying to model these components, just with different assumptions about how they behave.

One important concept: stationarity. Most forecasting methods assume the statistical properties of a series don't change over time. A series with an upward trend is not stationary, because its mean keeps increasing. The fix is differencing: replace each value with the change from its predecessor, yₜ − yₜ₋₁. This removes the trend and produces a series that fluctuates around a stable mean. We'll use differencing explicitly in ARIMA (Section 3); neural networks learn to handle non-stationarity implicitly.
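A quick numerical sketch of the idea, on synthetic data (the slope and noise values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
series = 10 + 0.5 * t + rng.normal(0, 1, 200)   # Trending: mean keeps rising

diffed = np.diff(series)   # y[t] - y[t-1]: removes the linear trend

# The raw halves have very different means; the differenced halves agree
print(f"raw:    {series[:100].mean():.1f} vs {series[100:].mean():.1f}")
print(f"diffed: {diffed[:100].mean():.2f} vs {diffed[100:].mean():.2f}")
```

The differenced series hovers around 0.5 (the slope) in both halves, which is exactly what "stable mean" looks like in practice.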

Try It: Decomposition Explorer

Build a time series from its components and watch the decomposition in real time. What looks like random noise often has hidden structure.

2. Exponential Smoothing — The Unbeatable Baseline

If you could only use one forecasting method for the rest of your life, this would be a strong choice. Exponential smoothing has won more forecasting competitions than any neural network. It's three recurrence relations, runs in O(n), needs no gradient-based training, and yet it consistently outperforms many deep learning models on univariate forecasting benchmarks.

The method maintains three estimates that update with each new observation:

Level:   ℓₜ = α · (yₜ − sₜ₋ₘ) + (1 − α) · (ℓₜ₋₁ + bₜ₋₁)
Trend:   bₜ = β · (ℓₜ − ℓₜ₋₁) + (1 − β) · bₜ₋₁
Season:  sₜ = γ · (yₜ − ℓₜ) + (1 − γ) · sₜ₋ₘ

where m is the seasonal period, e.g. 7 for daily data with a weekly cycle.

Each equation is a weighted average between what we just observed and what we previously believed. The smoothing parameters (α, β, γ) control the memory-versus-responsiveness tradeoff: high α trusts recent data more (reacts quickly but gets noisy); low α trusts history more (smooth but slow to adapt). This is the same explore-exploit tension that shows up everywhere in machine learning.

The forecast simply extrapolates: level plus projected trend plus the expected seasonal factor. It's the "inertia assumption" — the future is like the recent past, with the same trend and the same seasonal rhythm.

import numpy as np

class HoltWinters:
    """Triple exponential smoothing with additive seasonality."""
    def __init__(self, alpha=0.3, beta=0.1, gamma=0.2, period=7):
        self.alpha = alpha   # Level smoothing
        self.beta = beta     # Trend smoothing
        self.gamma = gamma   # Seasonal smoothing
        self.period = period

    def fit(self, series):
        p = self.period
        # Initialize from the first two cycles
        self.level = np.mean(series[:p])
        self.trend = (np.mean(series[p:2*p]) - np.mean(series[:p])) / p
        self.seasonal = np.zeros(p)
        for i in range(p):
            self.seasonal[i] = series[i] - self.level

        fitted = np.zeros(len(series))
        for t in range(len(series)):
            s = self.seasonal[t % p]
            fitted[t] = self.level + self.trend + s

            # The three update equations
            new_level = self.alpha * (series[t] - s) \
                      + (1 - self.alpha) * (self.level + self.trend)
            new_trend = self.beta * (new_level - self.level) \
                      + (1 - self.beta) * self.trend
            new_seasonal = self.gamma * (series[t] - new_level) \
                         + (1 - self.gamma) * s

            self.level = new_level
            self.trend = new_trend
            self.seasonal[t % p] = new_seasonal
        self.fitted = fitted
        return fitted

    def forecast(self, steps):
        """Generate multi-step forecast with trend + seasonality."""
        preds = np.zeros(steps)
        for h in range(steps):
            s = self.seasonal[(len(self.fitted) + h) % self.period]
            preds[h] = self.level + (h + 1) * self.trend + s
        return preds

# --- Fit to monthly retail data with annual seasonality ---
# (assumes `sales` is a 1-D array of monthly observations)
model = HoltWinters(alpha=0.3, beta=0.05, gamma=0.3, period=12)
fitted = model.fit(sales)     # monthly data with annual pattern
future = model.forecast(12)   # 12-month ahead forecast

mape = np.mean(np.abs((sales - fitted) / sales)) * 100
print(f"Training MAPE: {mape:.1f}%")
print(f"Next 3 months: {future[:3].round(0)}")

Three hyperparameters, three equations, and you've got a forecaster that captures trend and seasonality. The M-competition results (Makridakis et al., 2018) showed that this approach outperforms neural networks on a majority of univariate time series benchmarks. The lesson: never skip the simple baseline.
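In that spirit, the simplest baseline of all, seasonal naive (repeat the last observed cycle), takes a few lines and is worth running before anything else; a minimal sketch:

```python
import numpy as np

def seasonal_naive(series, period, steps):
    """Forecast by repeating the last full seasonal cycle forward."""
    last_cycle = np.asarray(series)[-period:]
    return np.array([last_cycle[h % period] for h in range(steps)])

# Four weeks of daily data with a pure weekly pattern
weekly = np.tile([100, 90, 95, 110, 130, 150, 120], 4)
preds = seasonal_naive(weekly, period=7, steps=14)
print(preds[:7])   # The last observed week, repeated forward
```

If a method can't beat this lookup table on your data, its complexity isn't paying for itself.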

3. Autoregressive Models — Learning from Your Own Past

Exponential smoothing is powerful but its structure is fixed — we hand-wrote the three equations. What if we could learn the forecasting rules from data instead? This is the bridge from classical statistics to machine learning.

An autoregressive (AR) model makes a simple bet: the next value is a linear combination of the previous p values. If you squint, it's just linear regression where the features are lagged versions of the target variable. The connection to neural networks is direct: an AR(p) model is a single linear layer with p inputs.

ARIMA adds two more ideas. The I (Integrated) component differences the series to make it stationary: subtract the previous value from each value to remove trend. The MA (Moving Average of errors) component learns from previous forecast mistakes, correcting systematic biases. Together, ARIMA(p,d,q) is a principled framework for fitting linear time series models.

In practice, choosing the right p and d comes from two diagnostic plots: the ACF (autocorrelation function) shows correlation at each lag, and the PACF (partial autocorrelation) isolates the direct effect of each lag. A sharp PACF cutoff at lag p suggests AR(p).
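The ACF is easy to compute by hand: the autocorrelation at lag k is just the correlation between the series and a copy of itself shifted by k. A minimal sketch on a synthetic weekly signal:

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# A weekly sinusoid: the ACF spikes back toward 1 at lag 7
t = np.arange(140, dtype=float)
corr = acf(np.sin(2 * np.pi * t / 7), max_lag=10)
print(f"lag 1: {corr[0]:.2f}, lag 7: {corr[6]:.2f}")   # lag 7 ~ 0.95
```

A sharp spike at the seasonal lag like this is the signature that tells you which period to difference or model.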

import numpy as np

def difference(series, d=1):
    """Make series stationary by differencing d times."""
    result = series.copy()
    for _ in range(d):
        result = result[1:] - result[:-1]
    return result

def undifference(diffs, last_original):
    """Reverse differencing (d=1): cumulative sum from the last original value."""
    return last_original + np.cumsum(diffs)

def fit_ar(series, p):
    """Fit AR(p) coefficients using ordinary least squares.

    Builds a matrix of lagged values and solves the normal equations:
    coeffs = (X^T X)^{-1} X^T y
    """
    n = len(series)
    X = np.column_stack([series[p - i - 1:n - i - 1] for i in range(p)])
    y = series[p:]
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    return coeffs

def arima_forecast(series, p=3, d=1, steps=10):
    """ARIMA(p, d, 0) forecast from scratch (undifferencing assumes d=1)."""
    # Step 1: Difference to remove trend
    stationary = difference(series, d)

    # Step 2: Fit AR on the differenced series
    coeffs = fit_ar(stationary, p)

    # Step 3: Forecast on the differenced scale
    buffer = list(stationary[-p:])
    forecasts_diff = []
    for _ in range(steps):
        pred = np.dot(coeffs, buffer[-p:][::-1])
        forecasts_diff.append(pred)
        buffer.append(pred)

    # Step 4: Undo differencing to recover original scale
    forecasts = undifference(np.array(forecasts_diff), series[-1])
    return forecasts

preds = arima_forecast(traffic, p=7, d=1, steps=30)
print(f"Next 7 days: {preds[:7].round(0)}")

ARIMA's key limitation is the word linear. It assumes the relationship between past and future can be captured by a weighted sum. When the real dynamics are nonlinear — when Monday's traffic depends on Tuesday's in a way that changes with the season — ARIMA's forecasts degrade. This is exactly the gap that neural networks fill.

4. Recurrent Neural Networks — Learning Temporal Features

ARIMA makes us hand-pick lags and assumes linearity. What if we could let a neural network discover its own temporal features, capturing nonlinear patterns without manual engineering? This is what recurrent neural networks do.

The LSTM (Long Short-Term Memory) cell is the workhorse architecture. It maintains a cell state — a memory conveyor belt — and three gates that control information flow: the forget gate decides what to discard from memory, the input gate decides what new information to store, and the output gate decides what part of memory to expose as the hidden state. These gates are learned, not hand-designed.

The key advantage over ARIMA: an LSTM can learn nonlinear temporal dependencies automatically. It doesn't need stationarity. It doesn't assume linearity. It discovers patterns from data. But there's a cost: LSTMs need much more data (hundreds of points minimum, versus ARIMA's dozens), they're prone to overfitting on short series, and their forecasts can diverge beyond the training horizon.

A critical training detail: teacher forcing feeds ground-truth values during training, but at inference time the model must feed its own predictions back as input. Models trained only with teacher forcing can collapse when generating long forecasts because they never learned to cope with their own errors compounding. As we explored in the backpropagation post, gradients must flow through time — and in RNNs, that path can be very long.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -15, 15)))

class LSTMCell:
    """A single LSTM cell with forget, input, and output gates."""
    def __init__(self, input_dim, hidden_dim):
        scale = np.sqrt(2 / (input_dim + hidden_dim))
        d = input_dim + hidden_dim
        # Concatenated weight matrices: [W_x | W_h] for each gate
        self.W_f = np.random.randn(hidden_dim, d) * scale  # Forget
        self.W_i = np.random.randn(hidden_dim, d) * scale  # Input
        self.W_c = np.random.randn(hidden_dim, d) * scale  # Candidate
        self.W_o = np.random.randn(hidden_dim, d) * scale  # Output
        self.b_f = np.ones(hidden_dim)   # Bias toward remembering
        self.b_i = np.zeros(hidden_dim)
        self.b_c = np.zeros(hidden_dim)
        self.b_o = np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x_t, h_prev, c_prev):
        """One timestep: (input, hidden, cell) -> (new_hidden, new_cell)"""
        combined = np.concatenate([x_t, h_prev])      # (input+hidden,)

        f_t = sigmoid(self.W_f @ combined + self.b_f)  # What to discard
        i_t = sigmoid(self.W_i @ combined + self.b_i)  # What to store
        c_hat = np.tanh(self.W_c @ combined + self.b_c) # Candidate memory
        o_t = sigmoid(self.W_o @ combined + self.b_o)  # What to reveal

        c_t = f_t * c_prev + i_t * c_hat               # New cell state
        h_t = o_t * np.tanh(c_t)                       # New hidden state
        return h_t, c_t

# --- Sequence-to-one forecasting ---
cell = LSTMCell(input_dim=1, hidden_dim=16)
h = np.zeros(16)
c = np.zeros(16)

# Forward pass through training data (teacher forcing: feed ground-truth
# values; backprop-through-time weight updates are omitted for brevity)
for t in range(len(train_series) - 1):
    x_t = np.array([train_series[t]])          # Scalar input -> (1,)
    h, c = cell.forward(x_t, h, c)             # h: (16,), c: (16,)

# Autoregressive forecast: feed predictions back as input
W_out = np.random.randn(1, 16) * 0.1           # Readout layer (untrained here)
predictions = []
last_val = train_series[-1]
for _ in range(forecast_horizon):
    x_t = np.array([last_val])
    h, c = cell.forward(x_t, h, c)
    pred = (W_out @ h)[0]                       # Linear readout
    predictions.append(pred)
    last_val = pred                             # Feed own prediction

Notice the autoregressive loop at the end: the model feeds its own predictions back as input. Any error in step h propagates into step h+1, and so on. This is why long-horizon neural forecasts can diverge — a problem that classical methods like Holt-Winters sidestep entirely because their forecast equations are closed-form.
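A toy illustration of how a small per-step error compounds over the horizon (the 0.95 and 0.99 coefficients here are invented for the sketch, not fitted to anything):

```python
# True dynamics: y[t] = 0.95 * y[t-1]; the model mis-estimates the
# coefficient as 0.99 -- a 4% per-step error
true_coef, est_coef = 0.95, 0.99
y = pred = 100.0
for _ in range(30):
    y *= true_coef        # What actually happens
    pred *= est_coef      # What the model rolls out
print(f"After 30 steps: truth {y:.1f}, forecast {pred:.1f}")
# After 30 steps: truth 21.5, forecast 74.0
```

One step ahead the two are nearly indistinguishable; thirty steps ahead the forecast is more than triple the truth.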

5. Temporal Convolutional Networks — Parallelism Meets Sequences

LSTMs process sequences one timestep at a time — the hidden state at step t depends on step t−1. This sequential bottleneck makes training slow. What if we could process the entire sequence in parallel? That's exactly what temporal convolutional networks (TCNs) do.

Two ideas make TCNs work for time series. First, causal convolutions: each output only depends on current and past inputs, never the future. This prevents information leakage. Second, dilated convolutions: instead of sliding the kernel over consecutive positions, we skip positions with exponentially increasing gaps (dilation 1, 2, 4, 8...). This gives an exponentially growing receptive field with only logarithmically many layers. As we explored in the convolutions post, a kernel extracts local patterns — dilation lets those patterns span long time ranges without a huge kernel.

import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """Apply a causal dilated 1D convolution.

    x:        (seq_len,) input signal
    weights:  (kernel_size,) filter weights
    dilation: spacing between kernel elements
    Returns:  (seq_len,) output, same length via causal padding
    """
    k = len(weights)
    pad_size = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad_size), x])

    out = np.zeros(len(x))
    for t in range(len(x)):
        val = 0.0
        for j in range(k):
            idx = t + pad_size - j * dilation
            val += weights[j] * padded[idx]
        out[t] = val
    return out

class TCNBlock:
    """One TCN block: dilated causal conv -> ReLU -> residual."""
    def __init__(self, kernel_size=3, dilation=1):
        scale = np.sqrt(2 / kernel_size)
        self.weights = np.random.randn(kernel_size) * scale
        self.dilation = dilation

    def forward(self, x):
        conv_out = causal_dilated_conv(x, self.weights, self.dilation)
        activated = np.maximum(0, conv_out)     # ReLU
        return activated + x                     # Residual connection

# Stack blocks with exponentially increasing dilation
blocks = [TCNBlock(kernel_size=3, dilation=2**i) for i in range(4)]
# Dilations: 1, 2, 4, 8

signal = np.random.randn(100)
out = signal.copy()
for block in blocks:
    out = block.forward(out)

receptive_field = 1 + (3 - 1) * sum(2**i for i in range(4))
print(f"Receptive field: {receptive_field} timesteps with only 4 layers")
print(f"Input shape:  {signal.shape}")
print(f"Output shape: {out.shape}")   # Same length: causal padding preserves it

With 4 layers and kernel size 3, our TCN sees 31 timesteps of history. Double the layers to 8 and the receptive field jumps to 511: over a year of daily data, with just 8 lightweight layers. Compare this to an LSTM, which must process all 511 steps sequentially. TCNs train 5–10x faster because every timestep computes in parallel, and they largely sidestep vanishing gradients: the path from any input to the output is O(log n) layers rather than O(n) sequential steps, and the residual connections keep those gradients healthy.
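The receptive-field arithmetic behind these numbers: each layer with kernel size k and dilation d reaches (k − 1) · d extra steps back, so stacking dilations 1, 2, 4, ... gives:

```python
# Receptive field of a dilated causal stack: 1 + (k - 1) * (2^L - 1)
kernel = 3
for layers in (4, 8):
    rf = 1 + (kernel - 1) * sum(2**i for i in range(layers))
    print(f"{layers} layers -> sees {rf} timesteps")
# 4 layers -> sees 31 timesteps
# 8 layers -> sees 511 timesteps
```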

6. Transformers for Time Series — Attention Over Time

The final architecture brings the most powerful idea from NLP to temporal data: attention. Instead of a fixed receptive field (TCN) or a compressed hidden state (LSTM), a transformer lets each timestep directly attend to every past timestep, weighted by learned relevance.

Applying transformers to time series requires three key adaptations. First, time embeddings: instead of discrete position indices (word 1, word 2, ...), we use continuous sinusoidal encodings that naturally represent time — the same idea from positional encoding, but applied to actual timestamps. This handles irregularly-sampled data gracefully. Second, input projection: each timestep is a scalar value (or scalar plus covariates like day-of-week), which we project into a high-dimensional embedding. Third, causal masking: the attention matrix is masked so each position can only attend to earlier positions, preventing future information leakage.

import numpy as np

def time_embedding(positions, d_model):
    """Continuous sinusoidal encoding for time positions."""
    pe = np.zeros((len(positions), d_model))
    for i in range(0, d_model, 2):
        freq = 1 / (10000 ** (i / d_model))
        pe[:, i] = np.sin(positions * freq)
        if i + 1 < d_model:
            pe[:, i + 1] = np.cos(positions * freq)
    return pe

def temporal_self_attention(X, W_q, W_k, W_v, causal=True):
    """Scaled dot-product attention with optional causal mask.

    X: (seq_len, d_model) input embeddings
    Returns: (seq_len, d_model) attended output, (seq, seq) weights
    """
    Q = X @ W_q                                    # (seq, d_k)
    K = X @ W_k                                    # (seq, d_k)
    V = X @ W_v                                    # (seq, d_v)
    d_k = Q.shape[1]

    scores = (Q @ K.T) / np.sqrt(d_k)             # (seq, seq)

    if causal:
        mask = np.triu(np.ones_like(scores), k=1)
        scores[mask > 0] = -1e9                    # Block future positions

    # Softmax each row
    exp_s = np.exp(scores - scores.max(axis=1, keepdims=True))
    attention = exp_s / exp_s.sum(axis=1, keepdims=True)

    return attention @ V, attention                # Output + weights

# --- Build a temporal transformer for weekly data ---
seq_len, d_model = 30, 16
np.random.seed(42)

values = np.sin(np.arange(seq_len) * 2 * np.pi / 7)  # Weekly pattern
W_proj = np.random.randn(1, d_model) * 0.1

X = values.reshape(-1, 1) @ W_proj                     # (30, 16)
X += time_embedding(np.arange(seq_len, dtype=float), d_model)

W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

output, weights = temporal_self_attention(X, W_q, W_k, W_v)
print(f"Input shape:     {X.shape}")          # (30, 16)
print(f"Output shape:    {output.shape}")     # (30, 16)
print(f"Attention shape: {weights.shape}")    # (30, 30)

The attention weight matrix weights is the interpretability payoff. On a trained model with weekly data, you'd see each timestep attending strongly to the same day of the previous week (weekly seasonality) and to the immediately preceding timesteps (short-term trend). This matches what our decomposition in Section 1 identified, but the transformer discovered it from data, without being told about weekly cycles.

Transformers shine on multivariate time series: when you have dozens of correlated variables (multiple server metrics, stock prices across a portfolio, weather stations in a region), attention can learn cross-variable dependencies that univariate methods miss entirely.

Try It: Forecasting Method Playground

Generate a time series, pick a method, and tune its parameters. The dashed red line is the forecast — watch how different methods handle the same data.

 

7. Evaluation — The Forecasting Pitfalls

Playing with the demo above reveals something important: more complex models don't always win. But how do we measure "winning" properly? Time series evaluation is fundamentally different from standard machine learning, and getting it wrong is the number-one source of misleading results.

Four common mistakes that make forecasting results look better than they are:

  1. Random train/test split. In standard ML, you randomly split data. In time series, this causes future data leakage — the model trains on Tuesday and predicts Monday. Always split temporally.
  2. Using only RMSE. Root mean squared error penalizes large errors heavily: a model that's perfect 90% of the time but occasionally misses by 20% scores worse than one that's consistently off by 5%. And because errors are squared, no single RMSE number reveals systematic bias. Report several metrics, not one.
  3. Evaluating only one step ahead. A model that nails the next-day forecast (h=1) can fail catastrophically at two-week forecasts (h=14). Always evaluate across multiple horizons.
  4. Not comparing against naive baselines. A model with 5% MAPE sounds great until you learn the seasonal naive baseline (repeat last week's values) achieves 6%. Your complex model barely beats a lookup table.

The right approach: expanding window cross-validation. Start with a minimum training set, forecast h steps ahead, record the errors, then expand the training window by k steps and repeat. This simulates real deployment where you always train on the past and predict the future.

import numpy as np

def expanding_window_cv(series, model_fn, min_train=50, step=10, horizon=7):
    """Time series cross-validation with expanding training window."""
    errors = {h: [] for h in range(1, horizon + 1)}
    t = min_train

    while t + horizon <= len(series):
        train = series[:t]
        model = model_fn(train)
        preds = model.forecast(horizon)

        for h in range(horizon):
            actual = series[t + h]
            errors[h + 1].append(abs(preds[h] - actual))
        t += step

    return errors

def compute_metrics(actual, predicted):
    """Calculate four key forecasting metrics."""
    errors = actual - predicted
    abs_errors = np.abs(errors)

    mae  = np.mean(abs_errors)
    rmse = np.sqrt(np.mean(errors ** 2))
    mape = np.mean(abs_errors / np.abs(actual)) * 100

    # MASE: scaled by naive (random walk) baseline error
    naive_errors = np.abs(actual[1:] - actual[:-1])
    mase = mae / np.mean(naive_errors) if len(naive_errors) > 0 else float('inf')

    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MASE": mase}

# --- Evaluate across horizons ---
metrics = compute_metrics(test_actual, test_predicted)
for name, val in metrics.items():
    print(f"{name:<6}: {val:.3f}")
# MAE   : 45.230
# RMSE  : 62.117
# MAPE  : 3.842
# MASE  : 0.876   <-- below 1.0 means better than naive!

MASE (Mean Absolute Scaled Error) is the most honest metric: it divides your model's error by the naive baseline's error. MASE above 1.0 means you'd be better off not using a model at all. It's the "are we actually helping?" check, and as discussed in the loss functions post, the choice of metric shapes what your model optimizes for.
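To internalize the scale, score two trivial forecasters on a synthetic random walk: the naive forecast lands at essentially 1.0 by construction, and anything worse lands well above it. A minimal sketch:

```python
import numpy as np

def mase(actual, predicted):
    """MAE scaled by the one-step naive forecast's MAE on the same span."""
    mae = np.mean(np.abs(actual - predicted))
    return mae / np.mean(np.abs(actual[1:] - actual[:-1]))

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(0, 1, 500))               # A random walk

naive = np.concatenate([[y[0]], y[:-1]])           # Predict the previous value
print(f"naive MASE:    {mase(y, naive):.3f}")      # ~1.0 by construction
print(f"constant MASE: {mase(y, np.full_like(y, y.mean())):.1f}")  # >> 1.0
```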

8. The Complete Picture — Where Every Method Fits

We've built five forecasters, each with different strengths. The practical takeaway: start simple, add complexity only when it improves your specific metric on your specific data.

Method       | Best for                                   | Min Data     | Nonlinear? | Multivariate? | Speed
-------------|--------------------------------------------|--------------|------------|---------------|--------
Holt-Winters | Clean univariate with trend + season       | 2 cycles     | No         | No            | Instant
ARIMA        | Stationary or near-stationary series       | 50+ points   | No         | No            | Fast
LSTM         | Long sequences with complex dynamics       | 500+ points  | Yes        | Yes           | Slow
TCN          | Long sequences, faster alternative to LSTM | 300+ points  | Yes        | Yes           | Medium
Transformer  | Multivariate with cross-variable patterns  | 1000+ points | Yes        | Yes           | Slow

Rules of thumb: if your series is univariate and under a few hundred points, try Holt-Winters or ARIMA first. You may be surprised how hard they are to beat. If you have thousands of data points with clear nonlinear dynamics, reach for a TCN or LSTM. If you have multivariate data with complex cross-variable dependencies, the temporal transformer earns its complexity.

Connections to the Series

Time series forecasting connects to nearly every topic in the series: gradients flowing through time (the backpropagation post), dilated kernels (the convolutions post), sinusoidal time embeddings (the positional encoding post), and metric choice (the loss functions post).

References & Further Reading