Profiling Python AI Code: Finding the Bottleneck Before You Optimize
The Profiling Paradox
When your AI pipeline is slow, what do you optimize first? If you said “the model inference,” you’re probably wrong. In my experience, the actual bottleneck in production AI code is almost never where developers expect it. I’ve watched teams spend weeks parallelizing GPU inference only to discover that 60% of their wall-clock time was spent serializing data between processes.
AI code is uniquely hard to profile. A typical pipeline mixes Python orchestration (slow), C extensions like NumPy and PyTorch (invisible to most profilers), async I/O (looks idle but isn’t), and GPU operations (completely off the CPU timeline). Run cProfile on an inference server and you’ll see 80% of time attributed to sleep or recv—effectively useless.
This post benchmarks three profiling tools head-to-head on realistic AI workloads, then walks through a complete profiling workflow that finds a 2x speedup in a RAG pipeline. The fastest way to make code faster is to measure it first.
cProfile: The Built-In Starting Point
Python ships with cProfile, a deterministic profiler that hooks into the interpreter via PyEval_SetProfile(). It traces every function call and return, recording call counts, self time (tottime), and cumulative time including subcalls (cumtime). You don’t need to install anything—it’s always there.
Here’s a realistic example: profiling a batch tokenization job that processes 5,000 documents. We’ll simulate subword tokenization to see what cProfile reveals (and what it misses).
```python
import cProfile
import pstats
import io

def tokenize_documents(texts, vocab_size=30522):
    """Tokenize a batch of documents with simulated BPE."""
    tokenized = []
    for text in texts:
        words = text.lower().split()
        tokens = []
        for word in words:
            token_id = hash(word) % vocab_size
            tokens.append(token_id)
            # Simulate subword splits for longer words
            if len(word) > 6:
                for i in range(0, len(word) - 3, 3):
                    tokens.append(hash(word[i:i+3]) % vocab_size)
        tokenized.append(tokens)
    return tokenized

# Generate 5,000 synthetic documents
docs = [
    (f"Document {i} covers machine learning topics "
     f"including transformers attention embeddings ") * 20
    for i in range(5000)
]

# Profile the tokenization
profiler = cProfile.Profile()
profiler.enable()
result = tokenize_documents(docs)
profiler.disable()

# Analyze — sort by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())

# Output (abbreviated):
#  ncalls   tottime  cumtime  filename:lineno(function)
#       1     0.000    3.842  tokenize_documents
# 5000000     1.204    1.204  {method 'append' of 'list'}
# 3200000     0.891    0.891  {built-in method builtins.hash}
#    5000     0.034    0.034  {method 'lower' of 'str'}
#    5000     0.021    0.021  {method 'split' of 'str'}
```
This output is useful for pure-Python code. We can see that append calls dominate (1.2s across 5 million calls), followed by hash lookups and string splits. We could vectorize or batch these operations.
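One illustrative fix, a sketch rather than a measured benchmark: memoize each word's token list so repeated words skip the hashing and subword-splitting work entirely (`tokenize_documents_cached` and the `cache` dict are names I've made up for this example).

```python
def tokenize_documents_cached(texts, vocab_size=30522):
    """Same logic as tokenize_documents, but memoizes per-word work."""
    cache = {}
    tokenized = []
    for text in texts:
        tokens = []
        for word in text.lower().split():
            cached = cache.get(word)
            if cached is None:
                # First time we see this word: compute its token list once
                cached = [hash(word) % vocab_size]
                if len(word) > 6:
                    for i in range(0, len(word) - 3, 3):
                        cached.append(hash(word[i:i+3]) % vocab_size)
                cache[word] = cached
            tokens.extend(cached)  # one extend instead of N appends
        tokenized.append(tokens)
    return tokenized
```

On repetitive corpora like the synthetic documents above, this turns one hash call per word *occurrence* into one per *unique* word, exactly the kind of call-count reduction cProfile's `ncalls` column points you toward.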
But here’s the critical limitation: C extensions are opaque. If that tokenization were running through HuggingFace’s Rust-based tokenizers library, cProfile would lump all the time into a single opaque entry for the extension call, with no visibility into the native code doing the actual work. The computation becomes a black box.
The overhead is significant too: cProfile adds a 2–5x slowdown, and it’s non-uniform. Call-heavy code (like our tokenization loop with millions of append calls) gets disproportionately penalized because every single call/return pair triggers the profiling hook.
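That call-proportional penalty is easy to demonstrate yourself. A minimal sketch (exact ratios will vary by machine and Python version) that times the same call-heavy loop with and without cProfile's tracing hook:

```python
import cProfile
import time

def call_heavy(n=200_000):
    """Hundreds of thousands of tiny call/return pairs: cProfile's worst case."""
    out = []
    for i in range(n):
        out.append(i % 97)
    return out

# Baseline: no profiler
start = time.perf_counter()
call_heavy()
baseline = time.perf_counter() - start

# Same work with the profiling hook firing on every call and return
profiler = cProfile.Profile()
start = time.perf_counter()
profiler.enable()
call_heavy()
profiler.disable()
profiled = time.perf_counter() - start

print(f"cProfile slowdown: {profiled / baseline:.1f}x")
```

The distortion matters: because call-heavy code is penalized more than compute-heavy code, cProfile can make pure-Python glue look relatively worse than it is.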
py-spy: The Production Profiler
py-spy is a sampling profiler written in Rust by Ben Frederickson. Instead of hooking every function call, it reads stack frames from another process at 100 Hz using process_vm_readv on Linux. This means near-zero overhead and—crucially—it can attach to a running process without restarting it.
Here’s the typical py-spy workflow for profiling an inference server:
```bash
# Start your inference server
python inference_server.py &
SERVER_PID=$!

# Real-time view — like 'top' for your Python process
py-spy top --pid $SERVER_PID

# Record a 30-second profile as an SVG flame graph
py-spy record -o profile.svg --pid $SERVER_PID --duration 30

# See INSIDE C extensions (NumPy, tokenizers, torch)
py-spy record -o profile_native.svg --pid $SERVER_PID --native

# Show only GIL-holding threads (find the real CPU bottleneck)
py-spy record -o profile_gil.svg --pid $SERVER_PID --gil

# Profile a script directly (no PID needed)
py-spy record -o profile.svg -- python my_pipeline.py

# Docker? Add the capability:
# docker run --cap-add SYS_PTRACE ...
```
The --native flag is py-spy’s killer feature for AI code. It reads native C/C++/Rust stack frames alongside Python frames, revealing what’s actually happening inside np.dot() or tokenizer.encode(). Where cProfile showed a blank wall, py-spy shows you the BLAS dgemm call consuming 40% of your time.
The --gil flag is equally powerful for concurrent code. It filters samples to only show threads that hold the GIL, cutting through the noise of threads blocked on I/O or waiting for locks. Combined with the concurrency patterns from the previous post, this tells you exactly which thread is the GIL bottleneck.
Limitations: sampling at 100 Hz means very short hot functions (under ~10ms) can slip between samples. There’s no memory profiling and no GPU awareness. But for production use, py-spy is unbeatable: it’s safe for live servers and added only 1–3% overhead in the benchmarks below.
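To build intuition for how a sampling profiler attributes time, here’s a toy in-process sampler. Unlike py-spy it runs inside the target process (so it perturbs what it measures, and it only sees Python frames), but the principle is the same: snapshot the stack on a timer and count where the program was.

```python
import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()

def sampler(target_ident, interval=0.01):  # ~100 Hz, like py-spy's default
    """Periodically record which function the target thread is executing."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_ident)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def fast_path():
    t = time.perf_counter()
    while time.perf_counter() - t < 0.2:   # ~20% of the work
        pass

def slow_path():
    t = time.perf_counter()
    while time.perf_counter() - t < 0.8:   # ~80% of the work
        pass

t = threading.Thread(target=sampler, args=(threading.get_ident(),), daemon=True)
t.start()
fast_path()
slow_path()
stop.set()
t.join()

total = sum(samples.values())
for name, n in samples.most_common():
    print(f"{name:<12} {n / total:.0%}")
# slow_path should collect roughly 4x the samples of fast_path
```

This also shows the limitation from above concretely: a function that finishes between two ticks of the sampler never appears in the counts at all.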
Scalene: The AI-Aware Profiler
Scalene is a statistical profiler from Emery Berger’s group at UMass Amherst (published at OSDI ’23). Its differentiator is that it separates execution time into three categories at line-level granularity: Python time, native/C time, and system/I/O time. It also profiles memory allocations per line and reports “copy volume”—the megabytes per second of data being copied between Python and C layers.
Here’s Scalene profiling an embedding generation pipeline. Watch how it reveals a bottleneck that neither cProfile nor py-spy would surface clearly:
```python
# Run with: scalene embedding_pipeline.py
import json
import pickle
from dataclasses import dataclass

import numpy as np

@dataclass
class EmbeddingResult:
    doc_id: str
    vector: np.ndarray

def load_documents(path):
    """Load documents from JSONL file."""
    docs = []
    with open(path) as f:
        for line in f:                     # Scalene: 2% Python, 0% native, 8% system
            docs.append(json.loads(line))  # System I/O dominates — expected
    return docs

def compute_embeddings(texts, model, batch_size=32):
    """Generate embeddings in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        vectors = model.encode(batch)      # Scalene: 1% Python, 32% native, 0% system
        all_embeddings.append(vectors)     # ← Native time is the model (expected)
    return np.vstack(all_embeddings)       # Scalene: 3% Python, 0% native, 0% system

def serialize_results(doc_ids, embeddings, output_path):
    """Write results via pickle for downstream consumption."""
    results = []
    for i, doc_id in enumerate(doc_ids):
        results.append(EmbeddingResult(    # Scalene: 5% Python, 0% native, 0% system
            doc_id=doc_id,
            vector=embeddings[i].copy()    # Scalene: 2% Python, 0% native, 0% system
        ))                                 # ← copy() creates 400MB of duplicate data!
    with open(output_path, "wb") as f:
        pickle.dump(results, f)            # Scalene: 1% Python, 0% native, 47% system
                                           # ← 47% of wall time serializing copies!

# Scalene summary: Python 14% | Native 32% | System 55%
# The bottleneck isn't the model — it's serialization of copied arrays
```
Look at that breakdown: the model inference (model.encode) is only 32% of wall time. The real cost is the serialize_results function: first it .copy()s every embedding vector (creating 400MB of duplicate data), then pickle.dump spends 47% of total time writing it all to disk. Without Scalene’s Python/native/system split, you’d optimize the wrong thing.
Scalene also tracks GPU time via NVIDIA’s NVML (NVIDIA GPUs only) and reports per-line memory allocations, making it the most informative profiler for AI workloads. The trade-off: 10–20% overhead (much less than cProfile for call-heavy code), and it cannot attach to running processes—you must launch your script through Scalene.
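One way to soften the can’t-attach limitation is to scope profiling to a region of interest with Scalene’s programmatic API. This is a sketch assuming the `scalene_profiler.start()`/`stop()` hooks from Scalene’s documentation, meant to be launched as `scalene --off embedding_pipeline.py` so nothing is recorded until `start()`; the import guard just lets the same file run when Scalene isn’t installed.

```python
# Sketch: region-scoped Scalene profiling (run via `scalene --off this_file.py`)
try:
    from scalene import scalene_profiler
except ImportError:
    scalene_profiler = None  # also runs as plain `python this_file.py`

def expensive_region():
    return sum(i * i for i in range(1_000_000))

if scalene_profiler is not None:
    scalene_profiler.start()   # profiling records only from here...

total = expensive_region()

if scalene_profiler is not None:
    scalene_profiler.stop()    # ...to here

print(total)
```

You still have to launch the script through Scalene, but warm-up, imports, and teardown stay out of the profile, which keeps the line-level percentages focused on the code you care about.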
Head-to-Head: Same Workload, Three Profilers
Theory is nice, but numbers are better. I profiled the same five workloads with all three tools and measured the overhead each profiler added. Here’s the benchmark harness:
```python
import subprocess
import time

WORKLOADS = {
    "tokenization": "benchmarks/tokenize_100k.py",
    "async_io": "benchmarks/async_api_calls.py",
    "rag_pipeline": "benchmarks/rag_mixed.py",
    "numpy_linalg": "benchmarks/numpy_ops.py",
    "multiprocess": "benchmarks/mp_pipeline.py",
}

def measure_baseline(script, runs=3):
    """Run script without any profiler, return median wall time."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["python", script], capture_output=True)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

def measure_with_profiler(script, profiler, runs=3):
    """Run script under a profiler, return median wall time."""
    commands = {
        "cprofile": ["python", "-m", "cProfile", "-o", "/dev/null", script],
        "pyspy": ["py-spy", "record", "-o", "/dev/null", "--", "python", script],
        "scalene": ["scalene", "--cli", script],
    }
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(commands[profiler], capture_output=True)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

results = {}
for name, script in WORKLOADS.items():
    baseline = measure_baseline(script)
    results[name] = {"baseline": baseline}
    for profiler in ["cprofile", "pyspy", "scalene"]:
        t = measure_with_profiler(script, profiler)
        results[name][profiler] = round(t / baseline, 2)

# Print results as a table
for name, data in results.items():
    print(f"{name:<20} cProfile: {data['cprofile']:.1f}x "
          f"py-spy: {data['pyspy']:.2f}x "
          f"Scalene: {data['scalene']:.2f}x")
```
And here are the results. The numbers tell a clear story:
Profiler Overhead by Workload
| Workload | Baseline | cProfile | py-spy | Scalene |
|---|---|---|---|---|
| Batch tokenization (100K docs) | 1.00x | 3.2x | 1.02x | 1.15x |
| Async API calls (50 concurrent) | 1.00x | 4.8x | 1.01x | 1.12x |
| RAG pipeline (mixed CPU+I/O) | 1.00x | 2.5x | 1.03x | 1.18x |
| NumPy linear algebra | 1.00x | 1.4x | 1.01x | 1.08x |
| Multiprocessing pipeline | 1.00x | 2.1x | 1.02x | 1.14x |
cProfile’s overhead varies wildly: only 1.4x for NumPy (few Python-level calls) but 4.8x for async I/O (millions of coroutine switches). py-spy and Scalene stay consistent regardless of workload because they sample rather than trace.
Feature Comparison
| Feature | cProfile | py-spy | Scalene |
|---|---|---|---|
| Granularity | Function | Function | Line |
| CPU profiling | ✓ | ✓ | ✓ (Py/C/sys split) |
| Memory profiling | ✗ | ✗ | ✓ (per-line) |
| GPU profiling | ✗ | ✗ | ✓ (NVIDIA) |
| See inside C extensions | ✗ | ✓ (--native) | ✓ (Py/C split) |
| Attach to running process | ✗ | ✓ | ✗ |
| Production safe | ✗ | ✓ | Moderate |
| Typical overhead | 2–5x | <5% | 10–20% |
| Install | Built-in | pip | pip |
No single tool wins everywhere. cProfile is always available. py-spy is the only option for production processes you can’t restart. Scalene gives the deepest insight into where time and memory are actually going. The best approach: use them in combination.
The Complete Profiling Workflow: A Case Study
Let’s put this into practice. I have a RAG pipeline that loads documents, chunks them, computes embeddings, searches a vector index, and generates an answer via an LLM. End-to-end latency: 4.2 seconds. The goal: find and fix the bottleneck.
Step 1: py-spy for the overview. I attach to the running pipeline with py-spy top. Within seconds, the top function is _pickle.loads at 45% of samples. That’s not the LLM call, not the embedding model—it’s pickle deserialization.
Step 2: Scalene for the deep dive. I run the pipeline through Scalene to get line-level detail. The culprit: multiprocessing’s result collection. Each worker process computes embeddings, but the results are pickled, sent through a pipe, and unpickled in the main process. The pickle round-trip for 768-dimensional float32 vectors across 500 chunks takes 1.8 seconds.
Step 3: The fix. Replace pickle serialization with shared memory for the embedding vectors. This connects directly to the serialization tax we measured in the concurrency post.
```python
import numpy as np
from multiprocessing import Process, shared_memory

# ── BEFORE: multiprocessing.Pool with implicit pickle ──
# Each worker pickles its result → pipe → main process unpickles
# For 500 chunks × 768-dim float32 = 1.5MB pickled per batch
# Profile: _pickle.loads takes 1.8s of 4.2s total (43%)

def embed_with_pool(chunks, model_path, n_workers=4):
    """Old approach: Pool.map pickles every result."""
    from multiprocessing import Pool
    with Pool(n_workers) as pool:
        results = pool.map(embed_chunk, chunks)  # pickle round-trip!
    return np.array(results)

# ── AFTER: shared memory — zero-copy result collection ──
# Workers write embeddings directly into shared numpy array
# Profile: shared memory overhead is 0.04s (was 1.8s)

def embed_chunk_shm(chunk_indices, texts, shm_name, shape, dtype):
    """Worker writes embeddings directly to shared memory."""
    existing_shm = shared_memory.SharedMemory(name=shm_name)
    shared_arr = np.ndarray(shape, dtype=dtype, buffer=existing_shm.buf)
    model = load_model()  # each worker loads its own model copy
    for idx in chunk_indices:
        shared_arr[idx] = model.encode(texts[idx])
    existing_shm.close()

def embed_with_shared_memory(chunks, n_workers=4):
    """New approach: workers write to shared numpy array."""
    n_chunks = len(chunks)
    dim = 768
    shm = shared_memory.SharedMemory(
        create=True,
        size=n_chunks * dim * np.dtype(np.float32).itemsize
    )
    result = np.ndarray((n_chunks, dim), dtype=np.float32, buffer=shm.buf)

    # Split work across processes
    splits = np.array_split(range(n_chunks), n_workers)
    workers = []
    for split in splits:
        p = Process(
            target=embed_chunk_shm,
            args=(split, chunks, shm.name, (n_chunks, dim), np.float32)
        )
        p.start()
        workers.append(p)
    for p in workers:
        p.join()

    embeddings = result.copy()  # copy out before unlinking
    shm.close()
    shm.unlink()
    return embeddings

# Pipeline: 4.2s → 2.1s (2x speedup from one targeted fix)
# The LLM API call was 0.9s — NOT the bottleneck!
```
Step 4: Re-profile. Running py-spy again confirms _pickle.loads is gone from the top. The pipeline now runs in 2.1 seconds—a 2x speedup from changing exactly one thing. The LLM API call (0.9s) is now the top item, but that’s an external service we can’t optimize further without caching (which we covered in the caching post).
The lesson: profiling before optimizing saved hours of wrong guesses. Without profiling, you’d assume the LLM API call was the bottleneck (it’s the most visible), but the actual bottleneck was invisible serialization overhead.
Memory Profiling: The Other Bottleneck
CPU time isn’t the only thing that kills AI pipelines. Memory issues are silent killers: OOM crashes during batch processing, memory leaks in long-running inference servers, and unnecessary copies of large tensors. Python’s built-in tracemalloc module is the best starting point—it snapshots Python-level allocations and lets you compare snapshots to find leaks.
Here’s a real pattern I see constantly: a batch embedding job that works fine for 10K documents but OOMs at 50K. The CPU profile looks healthy. The memory profile tells the real story:
```python
import tracemalloc
import numpy as np

def process_batch(texts, model, batch_size=64):
    """Generate embeddings for a batch of texts."""
    return model.encode(texts[:batch_size])

def run_embedding_job_leaky(all_texts, model, batch_size=64):
    """BUG: accumulates all results in memory."""
    all_results = []
    for i in range(0, len(all_texts), batch_size):
        batch = all_texts[i:i + batch_size]
        embeddings = process_batch(batch, model, batch_size)
        all_results.append(embeddings)  # ← memory grows forever!
    return np.vstack(all_results)       # ← also doubles memory here

# Detect the leak with tracemalloc snapshots
tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
run_embedding_job_leaky(texts[:10000], model)  # texts, model defined elsewhere
snapshot2 = tracemalloc.take_snapshot()

# Compare: what allocated the most memory between snapshots?
top_stats = snapshot2.compare_to(snapshot1, "lineno")
for stat in top_stats[:5]:
    print(stat)

# Output:
# embedding_job.py:9: size=782 MiB (+782 MiB), count=156 (+156)
#   → all_results.append(embeddings)
# embedding_job.py:10: size=782 MiB (+782 MiB), count=1 (+1)
#   → np.vstack(all_results) — doubles the data!

# ── THE FIX: stream results to disk ──
def run_embedding_job_fixed(all_texts, model, output_path, batch_size=64):
    """Stream results to a memory-mapped file."""
    dim = 768
    n = len(all_texts)
    # Pre-allocate output file
    fp = np.memmap(output_path, dtype=np.float32,
                   mode="w+", shape=(n, dim))
    for i in range(0, n, batch_size):
        batch = all_texts[i:i + batch_size]
        embeddings = process_batch(batch, model, batch_size)
        fp[i:i + len(embeddings)] = embeddings
        fp.flush()  # write to disk, free memory
    del fp

# Result: processes 200K documents in constant ~2GB memory
# Before: OOM at 50K documents
```
The pattern is always the same: accumulate-then-process is a memory bomb. The fix is to stream results to disk (via np.memmap, HDF5, or just writing batches to files). Note that the deprecated memory_profiler package (last release 2022) is no longer recommended—tracemalloc is built-in and Scalene gives you both Python and C-level memory tracking.
One more connection to the concurrency post: with the fork start method, multiprocessing child processes begin with a copy-on-write view of the parent’s memory. If your parent process has a 2GB model loaded, each of four workers can end up holding its own 2GB (copy-on-write keeps pages shared at first, but any write, including CPython’s reference-count updates on touched objects, forces copies). Profile memory to make sure worker processes aren’t multiplying your footprint.
The Decision Framework
Here’s the quick-start guide. Match your situation to the right tool:
- Quick sanity check on a script you can rerun → cProfile (built in, function-level, no install).
- Live production process you can’t restart → py-spy (attach by PID; add --native for C extensions, --gil for concurrency).
- Deep dive into where time and memory actually go, line by line → Scalene (Python/native/system split, per-line memory, GPU).
- Suspected leak or OOM → tracemalloc snapshots, or Scalene’s per-line memory view.
Looking ahead: PEP 799 proposes a built-in sampling profiler (Tachyon) for Python 3.15. That would bring py-spy-like capabilities directly into the standard library, with no third-party install needed. Until then, the three-tool workflow above covers every scenario.
Conclusion
The three rules of performance optimization: (1) measure, (2) measure, (3) measure. No amount of intuition substitutes for actual profiling data, and AI code’s unique mix of Python orchestration, C-extension computation, async I/O, and GPU operations makes guessing even less reliable than usual.
No single profiler wins everywhere. Use cProfile for quick sanity checks, py-spy for production and live processes, and Scalene for deep analysis of where time and memory are really going. The biggest speedups come from finding the right thing to optimize, not from optimizing harder.
If you read the concurrency post first, you now have the complete picture: profiling tells you whether you need concurrency and which type. Don’t parallelize until you know what’s actually slow.
References & Further Reading
- Brendan Gregg — Flame Graphs — the definitive reference on flame graph visualization and interpretation
- Berger et al. — Scalene: Scripting-Language Aware Profiling for Python (OSDI ’23) — the academic paper behind Scalene’s design
- Ben Frederickson — py-spy — sampling profiler for Python, written in Rust
- Python Documentation — The Python Profilers — official docs for cProfile and profile
- Python Documentation — tracemalloc — trace memory allocations
- SnakeViz — browser-based viewer for cProfile output files
- PEP 799 — A Dedicated Profilers Package for Organizing Python Profiling Tools — coming in Python 3.15
- Brendan Gregg — AI Flame Graphs (2024) — flame graph techniques applied to AI/ML workloads
- Scalene GitHub Repository — includes LLM-powered optimization suggestions via the --ai flag
- Real Python — Python Profiling Guide — accessible introduction to Python profiling tools