
Running LLMs on Your Own Machine

Your Models Are One Command Away

Every applied post on this site assumes the same thing: you have an API key and you're sending HTTP requests to someone else's server. That works great — until it doesn't. Maybe you're iterating on a prompt and burning through credits. Maybe your data can't leave your machine. Maybe you want to experiment at 2 AM without worrying about rate limits. Or maybe you just want to feel a 7-billion-parameter model running on your own hardware.

Local inference is a different world from API calls. You're managing GPU memory, choosing quantization formats, and making tradeoffs between speed, quality, and memory that API users never see. But the tools have gotten remarkably good. In 2024, running a capable LLM locally meant compiling C++ from source and praying. In 2026, it's three commands.

This post is a practical guide. We'll go from zero to a locally-served LLM with an OpenAI-compatible API, understand the memory math that governs what fits on your hardware, compare three inference engines head-to-head, and build a calculator that answers the question everyone googles: "will this model fit on my GPU?"

When to Run Locally vs. Call an API

This isn't a religious debate — it's an engineering decision with five axes: cost (local inference is free once you own the hardware), privacy (your data never leaves your machine), availability (no rate limits, no provider outages), model quality (the frontier models are API-only), and scale (providers handle concurrency and uptime for you).

The practical answer for most developers: local for development and iteration, API for production and frontier quality. Prototype locally where inference is free, then switch to an API when you need the best model or horizontal scaling.

The Local Inference Stack

Local inference has three layers, and understanding them prevents a lot of confusion:

Layer 1: Model Format — how the weights are stored on disk. The format determines which engines can load the model and what quantization is applied:

- GGUF — quantized, single-file, runs on CPU or GPU; the format Ollama and llama.cpp load.
- SafeTensors — the standard Hugging Face serialization; the format vLLM serves.
- AWQ / GPTQ — pre-quantized checkpoints (distributed as SafeTensors); also served by vLLM.

If you've read our quantization from scratch post, you know the theory behind INT4, NF4, and GPTQ. Here we care about the practical question: which format does your engine need?

Layer 2: Inference Engine — the software that loads the model and runs forward passes:

- Ollama — the easiest on-ramp: one command to install, pull, and run.
- llama.cpp — the engine underneath Ollama; use it directly for maximum control.
- vLLM — production-grade serving with continuous batching for concurrent users.

Layer 3: Serving Interface — how your code talks to the engine. All three engines support the OpenAI-compatible API format. This means every code example in every applied post on this site works with local models — you just change the base_url.

Your First Local LLM: Ollama Quickstart

Three commands. That's it.

# Install Ollama (Linux/macOS — also available on Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Llama 3.2 3B — small, fast, fits anywhere)
ollama pull llama3.2

# Run it — you're now chatting with a local LLM
ollama run llama3.2

That third command drops you into an interactive chat. But the real power is the API server that Ollama starts automatically. It listens on localhost:11434 and speaks the OpenAI API format. Here's how to call it from Python using the same openai library you'd use for GPT-4:

from openai import OpenAI

# Same library, same interface — just point to localhost
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't require auth
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV caching in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)

That's the same code from our streaming, structured output, and agents posts — with one line changed. Every pattern you've learned with APIs transfers directly to local inference.

Under the hood, Ollama stores models in ~/.ollama/models, auto-detects your GPU (CUDA on Linux/Windows, Metal on macOS), and handles memory management for you. When you need customization, a Modelfile gives you control over system prompts, parameters, and model configuration:

# Modelfile — like a Dockerfile, but for LLMs
FROM llama3.2

# Bake in a system prompt
SYSTEM """You are a senior Python developer.
Always include type hints. Prefer functional patterns.
Keep explanations concise."""

# Set inference parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

Build and run your custom model with ollama create mydev -f Modelfile and then ollama run mydev. This is particularly useful for development workflows: bake your coding conventions into the system prompt and have a local assistant that already knows your preferences.

Which model to start with? llama3.2 (3B) fits on anything and is great for testing. mistral (7B) is the sweet spot for quality vs. speed. codellama or deepseek-coder for code-specific tasks. phi-3 (3.8B) punches above its weight on reasoning.
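To make the choice concrete, here's a tiny helper that encodes those recommendations (the VRAM thresholds and the Ollama tags like phi3 are illustrative assumptions, not official guidance):

```python
def recommend_starter_model(vram_gb: float, task: str = "general") -> str:
    """Suggest a starter Ollama model tag from available VRAM and task.

    Thresholds are rough guidelines, not hard requirements.
    """
    if task == "code":
        # Code-specific models; fall back to a smaller variant on tiny GPUs
        return "deepseek-coder" if vram_gb >= 6 else "deepseek-coder:1.3b"
    if vram_gb < 6:
        return "llama3.2"   # 3B Q4 fits almost anywhere, even CPU-only
    if task == "reasoning":
        return "phi3"       # 3.8B, punches above its weight on reasoning
    return "mistral"        # 7B sweet spot for quality vs. speed

print(recommend_starter_model(4.0))          # llama3.2
print(recommend_starter_model(8.0, "code"))  # deepseek-coder
```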

Memory Math: Will Your Model Fit?

This is the question that governs local LLM deployment. The formula is straightforward:

VRAM ≈ Model Weights + KV Cache + Overhead

Model Weights are the big number. A 7B-parameter model at 4-bit quantization uses roughly 7 × 10⁹ × 4 bits ÷ 8 = 3.5 GB for the raw weights, plus about 25% more for embedding tables, layer norms, and file metadata — bringing it to around 4.4 GB. The scaling is linear: double the parameters, double the memory.

KV Cache grows with context length. Every token in the context window stores key and value vectors for every attention layer and head. For a 7B model at 4K context, that's roughly 0.5 GB. At 32K context, it's 4 GB — suddenly your "fits on an 8 GB GPU" model doesn't fit. At 128K, the cache alone is 16 GB. See our KV cache from scratch post for why this grows so fast.
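The heuristic behind these numbers can be computed exactly: each token stores one key and one value vector per layer, so the cache is 2 × layers × KV heads × head dim × context length × bytes per element. A minimal sketch, assuming a Llama-2-7B-style model with full multi-head attention (32 layers, 32 KV heads, head dim 128 — GQA models store far fewer KV heads):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: float) -> float:
    """Exact KV cache size: 2 vectors (K and V) per token, per layer."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30

# Llama-2-7B-style architecture: 32 layers, 32 KV heads, head_dim 128
print(kv_cache_gib(32, 32, 128, 4096, 2))    # 2.0  — FP16 cache at 4K context
print(kv_cache_gib(32, 32, 128, 4096, 0.5))  # 0.5  — 4-bit cache, matching the heuristic
```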

Overhead covers CUDA context, computation buffers, and engine internals. A safe estimate is 500 MB–1 GB.

Here's the Python function that does the math:

def estimate_vram_gb(
    params_billions: float,
    bits_per_weight: float,
    context_length: int,
    overhead_gb: float = 0.8
) -> dict:
    """Estimate VRAM needed for local LLM inference."""
    # Model weights: params * bits / 8, converted to GB (decimal)
    weight_gb = (params_billions * 1e9 * bits_per_weight) / (8 * 1e9)
    # Add ~25% for embedding tables, layer norms, and GGUF metadata
    weight_gb *= 1.25

    # KV cache heuristic: ~0.5 GB per 7B params per 4K context
    kv_gb = (params_billions / 7) * (context_length / 4096) * 0.5

    total = weight_gb + kv_gb + overhead_gb
    return {
        "weights_gb": round(weight_gb, 1),
        "kv_cache_gb": round(kv_gb, 1),
        "overhead_gb": overhead_gb,
        "total_gb": round(total, 1)
    }

# Examples
configs = [
    ("3B Q4",    3,  4.0,  4096),
    ("7B Q4",    7,  4.0,  4096),
    ("7B Q8",    7,  8.0,  4096),
    ("13B Q4",  13,  4.0,  4096),
    ("7B Q4 32K", 7, 4.0, 32768),
    ("70B Q4",  70,  4.0,  4096),
]

print(f"{'Config':<14} {'Weights':>8} {'KV Cache':>9} {'Total':>7}")
print("-" * 42)
for name, params, bits, ctx in configs:
    r = estimate_vram_gb(params, bits, ctx)
    print(f"{name:<14} {r['weights_gb']:>7.1f}G {r['kv_cache_gb']:>8.1f}G {r['total_gb']:>6.1f}G")

Running this produces the cheat sheet for local deployment:

| Config | Weights | KV Cache | Total | Fits On |
|---|---|---|---|---|
| 3B Q4_K_M | 1.9 GB | 0.2 GB | 2.9 GB | Any GPU, or CPU-only |
| 7B Q4_K_M | 4.4 GB | 0.5 GB | 5.7 GB | 6 GB GPU (RTX 2060, M1 8GB) |
| 7B Q8_0 | 8.8 GB | 0.5 GB | 10.1 GB | 12 GB GPU (RTX 3060/4070, M1 Pro 16GB) |
| 13B Q4_K_M | 8.1 GB | 0.9 GB | 9.8 GB | 12 GB GPU (RTX 3060/4070) |
| 7B Q4 @ 32K ctx | 4.4 GB | 4.0 GB | 9.2 GB | 10+ GB — context eats memory! |
| 70B Q4_K_M | 43.8 GB | 5.0 GB | 49.6 GB | 64+ GB (M2 Ultra, 2× A6000) |

The row that surprises people: a 7B model at Q4 with 32K context uses almost the same memory as a 13B model at 4K context. Context length is a hidden memory tax. When you see "128K context" in a model's spec sheet, ask yourself whether your hardware can actually afford it.

The rule of thumb: leave 10% VRAM headroom. If your GPU has 12 GB, plan for 10.8 GB usable. CUDA context, memory fragmentation, and the unexpected will eat the rest.
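That rule is easy to encode. A small sketch (the 10% headroom is the rule of thumb above, not a hard limit):

```python
def fits(vram_gb: float, required_gb: float, headroom: float = 0.10) -> bool:
    """Check whether an estimated VRAM requirement fits, leaving headroom."""
    usable = vram_gb * (1 - headroom)
    return required_gb <= usable

print(fits(12.0, 9.8))   # True  — 13B Q4 on a 12 GB card
print(fits(12.0, 11.5))  # False — only ~10.8 GB is safely usable
```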

llama.cpp: Maximum Control

Ollama is llama.cpp in a tuxedo. When you need control over exactly how inference runs — which layers go on GPU vs. CPU, how many threads to use, what sampling parameters to apply — go to the engine directly.

# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (e.g., from HuggingFace)
# Then launch the server
./build/bin/llama-server \
    -m models/llama-3.1-8b-Q4_K_M.gguf \
    -ngl 35 \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080 \
    -t 8 \
    --mlock
# -ngl 35:  offload 35 layers to GPU (all of them for 7B)
# -c 8192:  context window size
# -t 8:     CPU threads (for any layers not on GPU)
# --mlock:  lock model in RAM — prevent swapping

The key parameters you'll tune:

- -ngl — how many layers to offload to GPU. The single biggest performance lever; set it as high as your VRAM allows (for a 7B model, -ngl 35 covers all of them).
- -c — context window size. Remember the KV cache memory tax before cranking it up.
- -t — CPU threads used for any layers that stay on the CPU.
- --mlock — lock the model in RAM so the OS never swaps it out.

The llama.cpp server exposes an OpenAI-compatible API at /v1/chat/completions, so the same Python code from the Ollama section works here too — just change the port to 8080.
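When a model doesn't fully fit in VRAM, -ngl is how you split it between GPU and CPU. A rough sizing sketch, assuming weights are spread evenly across transformer layers (real layers vary slightly in size, so treat the result as a starting point):

```python
def estimate_ngl(vram_gb: float, weight_gb: float, n_layers: int,
                 kv_gb: float, overhead_gb: float = 0.8) -> int:
    """Estimate how many layers to offload to GPU (-ngl) given VRAM."""
    per_layer_gb = weight_gb / n_layers      # weights spread evenly across layers
    budget = vram_gb - kv_gb - overhead_gb   # VRAM left over for weights
    return max(0, min(n_layers, int(budget / per_layer_gb)))

# 13B Q4 (8.1 GB weights, 40 layers) on an 8 GB card with a 0.9 GB KV cache
print(estimate_ngl(8.0, 8.1, 40, 0.9))   # 31 — the remaining 9 layers run on CPU
print(estimate_ngl(24.0, 8.1, 40, 0.9))  # 40 — everything fits on GPU
```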

vLLM: Production-Grade Serving

Ollama and llama.cpp handle one request at a time (or batch naively). When you need to serve multiple users from one GPU — a shared dev server, a team tool, a production service — you need an engine designed for throughput. That's vLLM.

What makes vLLM different: continuous batching (new requests start immediately instead of waiting for the current batch to finish) and PagedAttention (manages KV cache like virtual memory pages instead of pre-allocating a contiguous block). If you've read our KV cache post, you know why PagedAttention matters — it eliminates the memory fragmentation that wastes 60-80% of KV cache space in naive implementations.
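The fragmentation argument is easy to see with a toy allocator comparison (the 16-token block size matches vLLM's default; the request lengths are made up):

```python
import math

def naive_reserved(seq_lens: list[int], max_len: int) -> int:
    """Pre-allocate max_len KV slots per sequence, regardless of actual length."""
    return len(seq_lens) * max_len

def paged_reserved(seq_lens: list[int], block_size: int = 16) -> int:
    """Allocate KV cache in fixed blocks: at most one partial block per sequence."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [130, 512, 47, 2048, 300]            # actual tokens per request
naive = naive_reserved(seq_lens, max_len=4096)  # 20480 slots reserved
paged = paged_reserved(seq_lens)                # 3056 slots reserved
print(f"waste avoided: {1 - paged / naive:.0%}")
```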

# Install and launch vLLM (GPU required, SafeTensors/AWQ/GPTQ format)
# pip install vllm
# vllm serve meta-llama/Llama-3.1-8B --max-model-len 8192

# Benchmark concurrent throughput with asyncio
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def single_request(prompt: str) -> dict:
    """Send one request and measure timing."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0.7
    )
    elapsed = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    return {"tokens": tokens, "time": elapsed, "tok_per_sec": tokens / elapsed}

async def benchmark_concurrent(num_requests: int):
    """Fire N requests simultaneously and measure aggregate throughput."""
    prompts = [f"Write a haiku about the number {i}" for i in range(num_requests)]
    start = time.perf_counter()
    results = await asyncio.gather(*[single_request(p) for p in prompts])
    wall_time = time.perf_counter() - start

    total_tokens = sum(r["tokens"] for r in results)
    print(f"Concurrent requests: {num_requests}")
    print(f"  Wall time: {wall_time:.2f}s")
    print(f"  Total tokens: {total_tokens}")
    print(f"  Aggregate throughput: {total_tokens / wall_time:.1f} tok/s")
    print(f"  Avg per-request: {sum(r['tok_per_sec'] for r in results) / len(results):.1f} tok/s")
    print()

async def main():
    for n in [1, 5, 10, 20]:
        await benchmark_concurrent(n)

asyncio.run(main())

The results are dramatic. On a single A100 with an 8B model, representative numbers look like this (your hardware will vary):

| Engine | 1 Request | 5 Concurrent | 10 Concurrent | 20 Concurrent |
|---|---|---|---|---|
| Ollama | 45 tok/s | ~9 tok/s each | ~4.5 tok/s each | ~2.2 tok/s each |
| llama.cpp | 48 tok/s | ~10 tok/s each | ~5 tok/s each | ~2.5 tok/s each |
| vLLM | 52 tok/s | ~42 tok/s each | ~35 tok/s each | ~25 tok/s each |

Single-request performance is similar. Under concurrent load, vLLM's continuous batching maintains near-peak throughput per request while Ollama and llama.cpp degrade linearly because they serialize requests. The aggregate throughput tells the story: vLLM with 20 concurrent requests produces ~500 tok/s total; Ollama produces ~44 tok/s total. That's a 10x difference. If you're serving a team, vLLM pays for itself immediately.

Key vLLM parameters to know:

- --max-model-len — the context window; directly controls how much KV cache vLLM reserves.
- --gpu-memory-utilization — the fraction of VRAM vLLM claims up front (default 0.9).
- --tensor-parallel-size — split one model across multiple GPUs.
- --quantization — serve pre-quantized AWQ or GPTQ checkpoints.

Important caveat: vLLM is GPU-only and works with SafeTensors/AWQ/GPTQ formats, not GGUF. If you need CPU inference or GGUF models, stick with Ollama or llama.cpp.

The Benchmark: Three Engines, One Model, Real Numbers

Let's make the comparison rigorous. Here's a benchmark harness that tests all three engines on the same prompts with the same model (adapted for each engine's format):

import time
import requests

def benchmark_engine(base_url: str, model: str, prompts: list[str],
                     max_tokens: int = 128) -> dict:
    """Benchmark a single engine with sequential requests."""
    results = []
    for prompt in prompts:
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.7,
            "stream": False
        }
        start = time.perf_counter()
        resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
        elapsed = time.perf_counter() - start
        data = resp.json()

        tokens = data["usage"]["completion_tokens"]
        results.append({
            "tokens": tokens,
            "total_time": elapsed,
            "tokens_per_sec": tokens / elapsed
        })

    avg_tps = sum(r["tokens_per_sec"] for r in results) / len(results)
    avg_time = sum(r["total_time"] for r in results) / len(results)
    return {"avg_tok_per_sec": round(avg_tps, 1), "avg_latency": round(avg_time, 2)}

# Test prompts — mix of short and medium generation tasks
prompts = [
    "What is the capital of France? Answer in one sentence.",
    "Explain the difference between a list and a tuple in Python.",
    "Write a function that checks if a string is a palindrome.",
    "Summarize the key ideas behind MapReduce in 3 bullet points.",
    "What are the tradeoffs between SQL and NoSQL databases?",
]

engines = {
    "Ollama":    ("http://localhost:11434", "llama3.1:8b"),
    "llama.cpp": ("http://localhost:8080",  "llama3.1:8b"),
    "vLLM":      ("http://localhost:8000",  "meta-llama/Llama-3.1-8B"),
}

print(f"{'Engine':<12} {'Avg tok/s':>10} {'Avg latency':>12}")
print("-" * 36)
for name, (url, model) in engines.items():
    try:
        result = benchmark_engine(url, model, prompts)
        print(f"{name:<12} {result['avg_tok_per_sec']:>9.1f} {result['avg_latency']:>10.2f}s")
    except Exception as e:
        print(f"{name:<12} {'error':>10} — {e}")

Representative results on mid-range hardware (RTX 4070 12GB, 8B model at Q4):

| Metric | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Tokens/sec (single) | 38 | 40 | 44 |
| TTFT | ~85 ms | ~80 ms | ~65 ms |
| Peak VRAM | 5.8 GB | 5.6 GB | 5.2 GB |
| 5 concurrent (agg.) | 42 tok/s | 45 tok/s | 185 tok/s |
| Setup complexity | 1 command | Build from source | pip install |
| CPU inference | Yes | Yes | No |

The takeaway is clear: Ollama for development (simplest setup, model management built in), llama.cpp for customization (build flags, CPU/GPU splitting, bleeding-edge features), vLLM for serving (continuous batching makes it the only serious choice for multi-user workloads). For details on how we measure TTFT and throughput, see our LLM latency benchmarks post.

Choosing Your Setup

Here's the decision tree:

- Prototyping, personal use, or the simplest possible setup? Ollama.
- Need CPU inference, fine-grained control over GPU offloading, or bleeding-edge features? llama.cpp.
- Serving multiple concurrent users from one GPU? vLLM.
- Need frontier-model quality or horizontal scale? Go back to an API.
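The same tree as a sketch in code (the criteria simply mirror the benchmark takeaways above):

```python
def choose_engine(concurrent_users: bool, need_customization: bool) -> str:
    """Pick a local inference engine following the decision tree above."""
    if concurrent_users:
        return "vLLM"        # continuous batching; GPU required
    if need_customization:
        return "llama.cpp"   # build flags, CPU/GPU layer splitting
    return "Ollama"          # simplest setup, built-in model management

print(choose_engine(concurrent_users=True, need_customization=False))   # vLLM
print(choose_engine(concurrent_users=False, need_customization=True))   # llama.cpp
print(choose_engine(concurrent_users=False, need_customization=False))  # Ollama
```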

The models you've studied throughout our from-scratch series — the attention heads, the feed-forward layers, the KV caches, the quantized weights — they're no longer abstractions. They're files on your disk, running on your hardware, answering your questions. That's the payoff of understanding how things work from the ground up: when you run ollama run llama3.2, you know exactly what's happening inside.

Try It: Will It Fit? Memory Calculator

Select a model configuration and see if it fits on your GPU.

