Running LLMs on Your Own Machine
Your Models Are One Command Away
Every applied post on this site assumes the same thing: you have an API key and you're sending HTTP requests to someone else's server. That works great — until it doesn't. Maybe you're iterating on a prompt and burning through credits. Maybe your data can't leave your machine. Maybe you want to experiment at 2 AM without worrying about rate limits. Or maybe you just want to feel a 7-billion-parameter model running on your own hardware.
Local inference is a different world from API calls. You're managing GPU memory, choosing quantization formats, and making tradeoffs between speed, quality, and memory that API users never see. But the tools have gotten remarkably good. In 2024, running a capable LLM locally meant compiling C++ from source and praying. In 2026, it's three commands.
This post is a practical guide. We'll go from zero to a locally-served LLM with an OpenAI-compatible API, understand the memory math that governs what fits on your hardware, compare three inference engines head-to-head, and build a calculator that answers the question everyone googles: "will this model fit on my GPU?"
When to Run Locally vs. Call an API
This isn't a religious debate — it's an engineering decision with five axes:
- Cost: API pricing is per-token. Local is a fixed hardware cost. If you're running thousands of inference calls per day, local wins on cost within weeks. For occasional use, the API is cheaper than buying a GPU.
- Latency: Local eliminates the network round-trip (typically 50-200 ms). But a cloud A100 generates tokens faster than your RTX 4070. The sweet spot: local inference often feels faster for interactive use because time to first token (TTFT) is near-zero.
- Privacy: Your data never leaves your machine. This isn't a feature — it's a requirement for healthcare, legal, finance, and any workflow involving proprietary code.
- Availability: No rate limits, no outages, works offline. Your model is as reliable as your power supply.
- Quality: API providers offer frontier models (GPT-4o, Claude Opus, Gemini Ultra) that vastly outperform anything that fits on consumer hardware. For tasks where the best model matters, APIs win.
The practical answer for most developers: local for development and iteration, API for production and frontier quality. Prototype locally where inference is free, then switch to an API when you need the best model or horizontal scaling.
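The cost axis lends itself to a quick back-of-envelope check. The sketch below is illustrative only: the GPU price, token volume, and API rate are placeholder assumptions, not real quotes.

```python
# Back-of-envelope break-even: fixed hardware cost vs. per-token API pricing.
# All numbers here are illustrative assumptions — plug in your own.

def break_even_days(
    gpu_cost_usd: float,      # one-time hardware cost
    tokens_per_day: float,    # your daily inference volume
    api_usd_per_mtok: float,  # blended API price per million tokens
) -> float:
    """Days until the GPU pays for itself vs. paying per token."""
    daily_api_cost = tokens_per_day / 1e6 * api_usd_per_mtok
    return gpu_cost_usd / daily_api_cost

# Example: $600 used GPU, 5M tokens/day, $1.00 per 1M tokens blended
days = break_even_days(600, 5_000_000, 1.00)
print(f"Break-even after {days:.0f} days")  # 120 days
```

At heavy daily volumes the break-even lands within a few months; at occasional-use volumes it effectively never arrives, which is the whole tradeoff in two lines of arithmetic.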
The Local Inference Stack
Local inference has three layers, and understanding them prevents a lot of confusion:
Layer 1: Model Format — how the weights are stored on disk. The format determines which engines can load the model and what quantization is applied:
- GGUF — llama.cpp's native format. Supports CPU + GPU inference and dozens of quantization levels (Q4_K_M, Q5_K_M, Q6_K, Q8_0). The most portable format — runs on everything from a Raspberry Pi to an A100.
- SafeTensors — HuggingFace's standard format. Used by vLLM, TGI, and transformers. Typically full precision (FP16/BF16) or quantized via GPTQ/AWQ.
- AWQ — Activation-Aware Weight Quantization. GPU-only, better quality than GPTQ at the same bit width. The new default for GPU-quantized models.
If you've read our quantization from scratch post, you know the theory behind INT4, NF4, and GPTQ. Here we care about the practical question: which format does your engine need?
Layer 2: Inference Engine — the software that loads the model and runs forward passes:
- Ollama — think "Docker for LLMs." Pull a model, run it, serve it. Wraps llama.cpp under the hood. Best for getting started.
- llama.cpp — the C++ engine that started the local LLM revolution. Maximum control, CPU + GPU, GGUF format. When Ollama isn't enough.
- vLLM — production-grade Python engine from UC Berkeley. Continuous batching, PagedAttention (see our KV cache post), GPU-only. Best for serving multiple users.
Layer 3: Serving Interface — how your code talks to the engine. All three engines support the OpenAI-compatible API format. This means every code example in every applied post on this site works with local models — you just change the base_url.
Your First Local LLM: Ollama Quickstart
Three commands. That's it.
# Install Ollama (Linux/macOS — also available on Windows)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (Llama 3.2 3B — small, fast, fits anywhere)
ollama pull llama3.2
# Run it — you're now chatting with a local LLM
ollama run llama3.2
That third command drops you into an interactive chat. But the real power is the API server that Ollama starts automatically. It listens on localhost:11434 and speaks the OpenAI API format. Here's how to call it from Python using the same openai library you'd use for GPT-4:
from openai import OpenAI

# Same library, same interface — just point to localhost
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't require auth
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV caching in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
That's the same code from our streaming, structured output, and agents posts — with one line changed. Every pattern you've learned with APIs transfers directly to local inference.
Under the hood, Ollama stores models in ~/.ollama/models, auto-detects your GPU (CUDA on Linux/Windows, Metal on macOS), and handles memory management for you. When you need customization, a Modelfile gives you control over system prompts, parameters, and model configuration:
# Modelfile — like a Dockerfile, but for LLMs
FROM llama3.2
# Bake in a system prompt
SYSTEM """You are a senior Python developer.
Always include type hints. Prefer functional patterns.
Keep explanations concise."""
# Set inference parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
Build and run your custom model with ollama create mydev -f Modelfile and then ollama run mydev. This is particularly useful for development workflows: bake your coding conventions into the system prompt and have a local assistant that already knows your preferences.
Which model to start with? llama3.2 (3B) fits on anything and is great for testing. mistral (7B) is the sweet spot for quality vs. speed. codellama or deepseek-coder for code-specific tasks. phi-3 (3.8B) punches above its weight on reasoning.
Memory Math: Will Your Model Fit?
This is the question that governs local LLM deployment. The formula is straightforward:
VRAM ≈ Model Weights + KV Cache + Overhead
Model Weights are the big number. A 7B-parameter model at 4-bit quantization uses roughly 7 × 10⁹ × 4 bits ÷ 8 = 3.5 GB for the raw weights, plus about 25% more for embedding tables, layer norms, and file metadata — bringing it to around 4.4 GB. The scaling is linear: double the parameters, double the memory.
KV Cache grows with context length. Every token in the context window stores key and value vectors for every attention layer and head. For a 7B model at 4K context, that's roughly 0.5 GB. At 32K context, it's 4 GB. At 128K context, it's 16 GB — suddenly your "fits on an 8 GB GPU" model doesn't fit. See our KV cache from scratch post for why this grows so fast.
Overhead covers CUDA context, computation buffers, and engine internals. A safe estimate is 500 MB–1 GB.
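The KV cache figures above come from a blunt heuristic (about 0.5 GB per 7B parameters per 4K of context). When you know a model's architecture you can compute the cache exactly: each token stores one key and one value vector per layer per KV head. As a sketch, using the published Llama 3.1 8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16 cache); note that grouped-query attention is exactly why modern 8B models cache far less than older ones:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Exact KV cache size: 2 (one K and one V vector) x layers x
    kv_heads x head_dim bytes per token, times context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
gb = kv_cache_bytes(32, 8, 128, 8192) / 1e9
print(f"{gb:.2f} GB at 8K context")  # ~1.07 GB
```

An older multi-head-attention 7B model with 32 KV heads would cache 4x more per token, which is why the blanket heuristic should only ever be a starting point.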
Here's the Python function that does the math:
def estimate_vram_gb(
    params_billions: float,
    bits_per_weight: float,
    context_length: int,
    overhead_gb: float = 0.8
) -> dict:
    """Estimate VRAM needed for local LLM inference."""
    # Model weights: params * bits / 8, converted to GB (decimal)
    weight_gb = (params_billions * 1e9 * bits_per_weight) / (8 * 1e9)
    # Add ~25% for embedding tables, layer norms, and GGUF metadata
    weight_gb *= 1.25
    # KV cache heuristic: ~0.5 GB per 7B params per 4K context
    kv_gb = (params_billions / 7) * (context_length / 4096) * 0.5
    total = weight_gb + kv_gb + overhead_gb
    return {
        "weights_gb": round(weight_gb, 1),
        "kv_cache_gb": round(kv_gb, 1),
        "overhead_gb": overhead_gb,
        "total_gb": round(total, 1)
    }

# Examples
configs = [
    ("3B Q4", 3, 4.0, 4096),
    ("7B Q4", 7, 4.0, 4096),
    ("7B Q8", 7, 8.0, 4096),
    ("13B Q4", 13, 4.0, 4096),
    ("7B Q4 32K", 7, 4.0, 32768),
    ("70B Q4", 70, 4.0, 4096),
]

print(f"{'Config':<14} {'Weights':>8} {'KV Cache':>9} {'Total':>7}")
print("-" * 42)
for name, params, bits, ctx in configs:
    r = estimate_vram_gb(params, bits, ctx)
    print(f"{name:<14} {r['weights_gb']:>7.1f}G {r['kv_cache_gb']:>8.1f}G {r['total_gb']:>6.1f}G")
Running this produces the cheat sheet for local deployment:
| Config | Weights | KV Cache | Total | Fits On |
|---|---|---|---|---|
| 3B Q4_K_M | 1.9 GB | 0.2 GB | 2.9 GB | Any GPU, or CPU-only |
| 7B Q4_K_M | 4.4 GB | 0.5 GB | 5.7 GB | 6 GB GPU (RTX 2060, M1 8GB) |
| 7B Q8_0 | 8.8 GB | 0.5 GB | 10.1 GB | 12 GB GPU (RTX 3060/4070, M1 Pro 16GB) |
| 13B Q4_K_M | 8.1 GB | 0.9 GB | 9.8 GB | 12 GB GPU (RTX 3060/4070) |
| 7B Q4 @ 32K ctx | 4.4 GB | 4.0 GB | 9.2 GB | 10+ GB — context eats memory! |
| 70B Q4_K_M | 43.8 GB | 5.0 GB | 49.6 GB | 64+ GB (M2 Ultra, 2× A6000) |
The row that surprises people: a 7B model at Q4 with 32K context uses almost the same memory as a 13B model at 4K context. Context length is a hidden memory tax. When you see "128K context" in a model's spec sheet, ask yourself whether your hardware can actually afford it.
The rule of thumb: leave 10% VRAM headroom. If your GPU has 12 GB, plan for 10.8 GB usable. CUDA context, memory fragmentation, and the unexpected will eat the rest.
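That rule of thumb is one line of code. A trivial sketch:

```python
def fits(total_needed_gb: float, gpu_vram_gb: float,
         headroom_frac: float = 0.10) -> bool:
    """Apply the 10% headroom rule: only plan on 90% of VRAM."""
    usable = gpu_vram_gb * (1 - headroom_frac)
    return total_needed_gb <= usable

print(fits(9.8, 12))   # 13B Q4 on a 12 GB card: True (10.8 GB usable)
print(fits(10.1, 12))  # 7B Q8 at 10.1 GB: True, but barely
print(fits(9.2, 10))   # 7B Q4 @ 32K on a 10 GB card: False (9.0 usable)
```

The last line is the headroom rule biting: 9.2 GB nominally fits in 10 GB, but not once you reserve room for fragmentation and CUDA context.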
llama.cpp: Maximum Control
Ollama is llama.cpp in a tuxedo. When you need control over exactly how inference runs — which layers go on GPU vs. CPU, how many threads to use, what sampling parameters to apply — go to the engine directly.
# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Download a GGUF model (e.g., from HuggingFace)
# Then launch the server
./build/bin/llama-server \
-m models/llama-3.1-8b-Q4_K_M.gguf \
-ngl 35 \
-c 8192 \
--host 0.0.0.0 \
--port 8080 \
-t 8 \
--mlock
# -ngl 35: offload up to 35 layers to GPU (more than an 8B model has, so all of them)
# -c 8192: context window size
# -t 8: CPU threads (for any layers not on GPU)
# --mlock: lock model in RAM — prevent swapping
The key parameters you'll tune:
- -ngl (num GPU layers): the CPU/GPU offloading knob. Set it to the total layer count to put everything on GPU. Set it lower to split the model — some layers on GPU (fast), the rest on CPU (slower but uses system RAM instead of VRAM).
- -c (context size): bigger context = more KV cache memory. Start with 4096, increase if you need it.
- --mlock: keeps the model pinned in RAM so the OS doesn't swap it to disk. Essential for consistent performance.
- -t (threads): CPU thread count. Only matters for layers not offloaded to GPU. Match it to your physical cores, not logical threads.
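Picking -ngl for a partial offload is usually trial and error, but the memory math gives a decent starting point. The sketch below assumes weight memory is spread evenly across layers (it isn't exactly — embeddings and the output layer are lumpy) and reserves a guessed amount of VRAM for KV cache and CUDA overhead; tune the reserve for your setup:

```python
def gpu_layers(vram_gb: float, n_layers: int, weights_gb: float,
               reserved_gb: float = 1.5) -> int:
    """Rough -ngl starting point: divide the VRAM left after a
    KV-cache/overhead reserve by the average per-layer weight size."""
    per_layer_gb = weights_gb / n_layers
    budget = vram_gb - reserved_gb
    return max(0, min(n_layers, int(budget / per_layer_gb)))

# 8B model at Q4 (~4.4 GB weights, 32 layers) on a 6 GB card
print(gpu_layers(6.0, 32, 4.4))   # 32 — everything fits on GPU
# Same model on a 4 GB card
print(gpu_layers(4.0, 32, 4.4))   # 18 — try -ngl 18 and adjust
```

Start there, watch nvidia-smi, and nudge -ngl up until you run out of headroom.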
The llama.cpp server exposes an OpenAI-compatible API at /v1/chat/completions, so the same Python code from the Ollama section works here too — just change the port to 8080.
vLLM: Production-Grade Serving
Ollama and llama.cpp handle one request at a time (or batch naively). When you need to serve multiple users from one GPU — a shared dev server, a team tool, a production service — you need an engine designed for throughput. That's vLLM.
What makes vLLM different: continuous batching (new requests start immediately instead of waiting for the current batch to finish) and PagedAttention (manages KV cache like virtual memory pages instead of pre-allocating a contiguous block). If you've read our KV cache post, you know why PagedAttention matters — it eliminates the memory fragmentation that wastes 60-80% of KV cache space in naive implementations.
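To make the PagedAttention idea concrete, here is a toy accounting model — not vLLM's actual implementation: sequences claim fixed-size KV blocks on demand, while a naive contiguous allocator reserves the full maximum context for every request up front. The 16-token block size matches vLLM's default; the sequence lengths are illustrative:

```python
BLOCK_TOKENS = 16  # tokens per KV block (vLLM's default block size)

def blocks_needed(tokens: int) -> int:
    """Ceiling division: how many fixed-size blocks cover the tokens."""
    return -(-tokens // BLOCK_TOKENS)

def paged_vs_contiguous(seq_lens: list[int], max_context: int) -> tuple[int, int]:
    """Blocks actually used vs. blocks a contiguous allocator reserves."""
    used = sum(blocks_needed(n) for n in seq_lens)
    reserved = len(seq_lens) * blocks_needed(max_context)
    return used, reserved

# 10 requests with short real outputs, against a 4K max context
used, reserved = paged_vs_contiguous([120, 300, 80, 45, 200] * 2, 4096)
print(f"paged: {used} blocks, contiguous: {reserved} blocks")
```

With short outputs the contrast is extreme (96 blocks in use vs. 2,560 reserved in this contrived example); real workloads land closer to the 60-80% waste figure, but the mechanism is the same.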
# Install and launch vLLM (GPU required, SafeTensors/AWQ/GPTQ format)
# pip install vllm
# vllm serve meta-llama/Llama-3.1-8B --max-model-len 8192
# Benchmark concurrent throughput with asyncio
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def single_request(prompt: str) -> dict:
    """Send one request and measure timing."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0.7
    )
    elapsed = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    return {"tokens": tokens, "time": elapsed, "tok_per_sec": tokens / elapsed}

async def benchmark_concurrent(num_requests: int):
    """Fire N requests simultaneously and measure aggregate throughput."""
    prompts = [f"Write a haiku about the number {i}" for i in range(num_requests)]
    start = time.perf_counter()
    results = await asyncio.gather(*[single_request(p) for p in prompts])
    wall_time = time.perf_counter() - start
    total_tokens = sum(r["tokens"] for r in results)
    print(f"Concurrent requests: {num_requests}")
    print(f"  Wall time: {wall_time:.2f}s")
    print(f"  Total tokens: {total_tokens}")
    print(f"  Aggregate throughput: {total_tokens / wall_time:.1f} tok/s")
    print(f"  Avg per-request: {sum(r['tok_per_sec'] for r in results) / len(results):.1f} tok/s")
    print()

async def main():
    for n in [1, 5, 10, 20]:
        await benchmark_concurrent(n)

asyncio.run(main())
The results are dramatic. On a single A100 with an 8B model, representative numbers look like this (your hardware will vary):
| Engine | 1 Request | 5 Concurrent | 10 Concurrent | 20 Concurrent |
|---|---|---|---|---|
| Ollama | 45 tok/s | ~9 tok/s each | ~4.5 tok/s each | ~2.2 tok/s each |
| llama.cpp | 48 tok/s | ~10 tok/s each | ~5 tok/s each | ~2.5 tok/s each |
| vLLM | 52 tok/s | ~42 tok/s each | ~35 tok/s each | ~25 tok/s each |
Single-request performance is similar. Under concurrent load, vLLM's continuous batching keeps per-request throughput near peak, while Ollama and llama.cpp degrade in proportion to the queue depth because they serialize requests. The aggregate throughput tells the story: vLLM with 20 concurrent requests produces ~500 tok/s total; Ollama produces ~44 tok/s total. That's a 10x difference. If you're serving a team, vLLM pays for itself immediately.
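The per-request numbers in the serialized columns follow directly from a simple sharing model: if an engine handles one request at a time, N concurrent clients each see roughly 1/N of its single-stream rate. A quick sanity check against the Ollama column above:

```python
def serialized_per_request(single_rate: float, n_concurrent: int) -> float:
    """With serialized serving, each client sees 1/N of the engine's rate."""
    return single_rate / n_concurrent

for n in (5, 10, 20):
    print(f"{n} concurrent: ~{serialized_per_request(45, n):.1f} tok/s each")
# 5 → ~9.0, 10 → ~4.5, 20 → ~2.2 — matching the Ollama row
```

Continuous batching breaks this 1/N curve because decoding multiple sequences per forward pass costs only marginally more than decoding one.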
Key vLLM parameters to know:
- --max-model-len: maximum context length. Lower this to save VRAM.
- --gpu-memory-utilization: fraction of VRAM to use (default: 0.9). Lower it if you're running other GPU workloads.
- --tensor-parallel-size: split the model across N GPUs. Set to 2 for 2-GPU setups.
Important caveat: vLLM is GPU-only and works with SafeTensors/AWQ/GPTQ formats, not GGUF. If you need CPU inference or GGUF models, stick with Ollama or llama.cpp.
The Benchmark: Three Engines, One Model, Real Numbers
Let's make the comparison rigorous. Here's a benchmark harness that tests all three engines on the same prompts with the same model (adapted for each engine's format):
import time
import requests

def benchmark_engine(base_url: str, model: str, prompts: list[str],
                     max_tokens: int = 128) -> dict:
    """Benchmark a single engine with sequential requests."""
    results = []
    for prompt in prompts:
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "stream": False
        }
        start = time.perf_counter()
        resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
        elapsed = time.perf_counter() - start
        data = resp.json()
        tokens = data["usage"]["completion_tokens"]
        results.append({
            "tokens": tokens,
            "total_time": elapsed,
            "tokens_per_sec": tokens / elapsed
        })
    avg_tps = sum(r["tokens_per_sec"] for r in results) / len(results)
    avg_time = sum(r["total_time"] for r in results) / len(results)
    return {"avg_tok_per_sec": round(avg_tps, 1), "avg_latency": round(avg_time, 2)}

# Test prompts — mix of short and medium generation tasks
prompts = [
    "What is the capital of France? Answer in one sentence.",
    "Explain the difference between a list and a tuple in Python.",
    "Write a function that checks if a string is a palindrome.",
    "Summarize the key ideas behind MapReduce in 3 bullet points.",
    "What are the tradeoffs between SQL and NoSQL databases?",
]

engines = {
    "Ollama": ("http://localhost:11434", "llama3.1:8b"),
    "llama.cpp": ("http://localhost:8080", "llama3.1:8b"),
    "vLLM": ("http://localhost:8000", "meta-llama/Llama-3.1-8B"),
}

print(f"{'Engine':<12} {'Avg tok/s':>10} {'Avg latency':>12}")
print("-" * 36)
for name, (url, model) in engines.items():
    try:
        result = benchmark_engine(url, model, prompts)
        print(f"{name:<12} {result['avg_tok_per_sec']:>9.1f} {result['avg_latency']:>10.2f}s")
    except Exception as e:
        print(f"{name:<12} {'error':>10} — {e}")
Representative results on mid-range hardware (RTX 4070 12GB, 8B model at Q4):
| Metric | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Tokens/sec (single) | 38 | 40 | 44 |
| TTFT | ~85 ms | ~80 ms | ~65 ms |
| Peak VRAM | 5.8 GB | 5.6 GB | 5.2 GB |
| 5 concurrent (agg.) | 42 tok/s | 45 tok/s | 185 tok/s |
| Setup complexity | 1 command | Build from source | pip install |
| CPU inference | Yes | Yes | No |
The takeaway is clear: Ollama for development (simplest setup, model management built in), llama.cpp for customization (build flags, CPU/GPU splitting, bleeding-edge features), vLLM for serving (continuous batching makes it the only serious choice for multi-user workloads). For details on how we measure TTFT and throughput, see our LLM latency benchmarks post.
Choosing Your Setup
Here's the decision tree:
- Just experimenting? → Ollama. Install, pull, run. Done.
- Building a prototype? → Ollama + Modelfile. Bake in your system prompt and parameters.
- Serving multiple users? → vLLM. Continuous batching is non-negotiable for shared workloads.
- Need CPU inference? → Ollama or llama.cpp. vLLM is GPU-only.
- Maximum control? → llama.cpp from source. Custom CUDA kernels, bleeding-edge quant formats, granular GPU layer offloading.
- Want a GUI? → LM Studio. It's llama.cpp with a nice interface.
- Apple Silicon? → Ollama (Metal support built in). The M-series unified memory means your "VRAM" is your total RAM — an M2 Ultra with 192 GB can run a 70B model at full precision.
The models you've studied throughout our from-scratch series — the attention heads, the feed-forward layers, the KV caches, the quantized weights — they're no longer abstractions. They're files on your disk, running on your hardware, answering your questions. That's the payoff of understanding how things work from the ground up: when you run ollama run llama3.2, you know exactly what's happening inside.
Try It: Will It Fit? Memory Calculator
Select a model configuration and see if it fits on your GPU.
References & Further Reading
- Georgi Gerganov — llama.cpp — the C++ inference engine that started the local LLM revolution
- Ollama — the easiest way to run LLMs locally, built on llama.cpp
- vLLM Documentation — production-grade LLM serving from UC Berkeley
- Kwon et al. 2023 — Efficient Memory Management for Large Language Model Serving with PagedAttention — the vLLM paper that introduced paged KV cache management
- GGUF Format Specification — HuggingFace documentation on the GGUF model format
- Quantization from Scratch (DadOps) — the theory behind INT4, NF4, GPTQ, and why fewer bits can still work
- KV Cache from Scratch (DadOps) — why PagedAttention matters and how KV cache memory scales
- LLM API Latency Benchmarks (DadOps) — benchmarking methodology and API vs. local comparison