Building Multimodal AI Apps: Vision, Documents, and Beyond with Modern VLMs
The Three Generations of Vision AI
Two years ago, extracting data from a restaurant receipt required an OCR pipeline, text cleaning, regex parsing, and a prayer. Today it's one API call with an image attached. That shift — from multi-step brittle pipelines to single-prompt understanding — is the story of Vision-Language Models (VLMs), and it changes how we build every application that touches images.
We've gone through three generations of vision AI:
- Traditional CV (2010s) — OpenCV, hand-crafted features, Haar cascades. Each task required its own algorithm and weeks of tuning.
- Specialized deep learning (2018-2023) — YOLO for detection, ResNet for classification, Tesseract for OCR. Better accuracy, but still one model per task. Want to extract text AND understand the layout? That's two models and glue code.
- Foundation VLMs (2024-present) — GPT-4o, Claude, Gemini. One model that sees an image and understands it the way you do — reading text, interpreting charts, recognizing objects, and answering questions in natural language.
The old pipeline for document extraction was a chain of failure points: image → OCR engine → text cleanup → regex parsing → structured data, with every stage hand-tuned to one layout. The new pipeline is a single step: image plus prompt → VLM → validated JSON.
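To make the contrast concrete, here is a minimal sketch. The regex and receipt layout are hypothetical, and the VLM call is abbreviated; a full working version appears in the next section:

```python
import re

# Old pipeline (sketch): OCR -> cleanup -> regex, each stage a failure point.
def old_pipeline(ocr_text: str) -> dict:
    # Hypothetical regex tuned to one receipt layout; breaks when the layout changes.
    match = re.search(r"TOTAL\s*\$?(\d+\.\d{2})", ocr_text)
    return {"total": float(match.group(1)) if match else None}

print(old_pipeline("Thanks for visiting!\nTOTAL $12.34"))  # {'total': 12.34}

# New pipeline (sketch): one call, the model reads the image directly.
# client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": [
#         {"type": "text", "text": "Extract the total as JSON."},
#         {"type": "image_url", "image_url": {"url": data_url}},
#     ]}],
# )
```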
As of February 2026, the landscape is rich with options. GPT-4o and GPT-4.1 lead on general vision tasks at $2.50/M input tokens. Claude Sonnet 4.5 excels at structured document extraction at $3/M. Gemini 2.5 Pro offers the best value at $1.25/M with a massive 1M+ token context window. And open-source models like Qwen2.5-VL-72B and InternVL3-78B now perform within 5-10% of proprietary models, making self-hosting viable for high-volume workloads.
This post builds four application patterns — document intelligence, visual QA, batch image processing, and production orchestration — with working Python code for each. If you want the theoretical foundations, check out our posts on CLIP (how models learn to connect images and text) and Vision Transformers (the architecture that makes this possible). Here, we focus on the practical: what you can build today, what breaks in production, and what it costs.
Document Intelligence: OCR-Free Extraction
Document processing is the highest-value enterprise VLM use case. Invoices, contracts, receipts, forms — businesses drown in documents that need to be read, parsed, and turned into structured data. Traditional OCR pipelines (Tesseract, AWS Textract) work well for clean, predictable layouts but crumble when they encounter handwriting, complex tables, or documents that don't match the template you built for.
VLMs sidestep this entirely. They don't extract text and then parse it — they understand the document as a whole, the same way you do when you glance at an invoice and immediately know the total is in the bottom-right corner.
Here's the core pattern: encode the image as base64, send it to a VLM with a structured extraction prompt, and validate the output with Pydantic (following the patterns from our structured output post):
import base64
import json
from pathlib import Path

from openai import OpenAI
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float


class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    vendor_name: str
    total_amount: float
    line_items: list[LineItem]


def extract_document(image_path: str, schema: type[BaseModel]) -> BaseModel:
    """Extract structured data from a document image using a VLM."""
    image_bytes = Path(image_path).read_bytes()
    b64_image = base64.b64encode(image_bytes).decode("utf-8")
    suffix = Path(image_path).suffix.lstrip(".")
    media_type = f"image/{'jpeg' if suffix == 'jpg' else suffix}"

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Extract the following fields from this document image. "
                    f"Return valid JSON matching this schema:\n"
                    f"{json.dumps(schema.model_json_schema(), indent=2)}"
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:{media_type};base64,{b64_image}",
                    "detail": "high"
                }}
            ]
        }],
        max_tokens=2000
    )
    raw = json.loads(response.choices[0].message.content)
    return schema.model_validate(raw)


# Usage
invoice = extract_document("invoice.png", InvoiceData)
print(f"Invoice #{invoice.invoice_number}: ${invoice.total_amount:.2f}")
for item in invoice.line_items:
    print(f"  {item.description}: {item.quantity} x ${item.unit_price:.2f}")
The detail: "high" parameter matters — it tells GPT-4o to use high-resolution mode, which tiles the image into 512×512 patches for fine-grained text reading. Low detail mode is roughly an order of magnitude cheaper for typical screenshots but misses small text. For documents, always use high detail.
For multi-page documents like contracts or multi-page invoices, you need a pipeline that processes each page and merges the results:
import fitz  # PyMuPDF
from dataclasses import dataclass, field


@dataclass
class PageResult:
    page_number: int
    extracted: dict
    confidence: float


@dataclass
class DocumentResult:
    pages: list[PageResult] = field(default_factory=list)
    merged: dict = field(default_factory=dict)


def process_multipage_document(pdf_path: str, prompt: str) -> DocumentResult:
    """Process a multi-page PDF by converting each page to an image."""
    doc = fitz.open(pdf_path)
    result = DocumentResult()
    client = OpenAI()

    for page_num in range(len(doc)):
        # Render page at 200 DPI — good balance of quality vs token cost
        pix = doc[page_num].get_pixmap(dpi=200)
        img_bytes = pix.tobytes("png")
        b64_image = base64.b64encode(img_bytes).decode("utf-8")

        response = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": (
                        f"{prompt}\n\nThis is page {page_num + 1} of {len(doc)}. "
                        f"Also include a 'confidence' field (0-1) rating your "
                        f"extraction confidence."
                    )},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/png;base64,{b64_image}",
                        "detail": "high"
                    }}
                ]
            }],
            max_tokens=2000
        )
        page_data = json.loads(response.choices[0].message.content)
        confidence = page_data.pop("confidence", 0.8)
        result.pages.append(PageResult(page_num + 1, page_data, confidence))
        print(f"  Page {page_num + 1}/{len(doc)}: confidence {confidence:.0%}")

    doc.close()
    # Merge: combine line items, take the latest totals, flag low-confidence pages
    result.merged = merge_page_results(result.pages)
    return result


def merge_page_results(pages: list[PageResult]) -> dict:
    """Merge extracted data across pages, combining lists and flagging conflicts."""
    merged = {}
    for page in pages:
        for key, value in page.extracted.items():
            if isinstance(value, list):
                merged.setdefault(key, []).extend(value)
            else:
                merged[key] = value  # Later pages override earlier ones
    merged["low_confidence_pages"] = [
        p.page_number for p in pages if p.confidence < 0.7
    ]
    return merged
A few hard-won lessons from production document extraction:
- Accuracy on clean printed docs: 95-99%. VLMs nail standard invoices, receipts, and typed forms.
- Accuracy on handwritten/degraded docs: 70-85%. Still useful, but you need human review for the bottom 15-30%.
- Cost: $0.01-0.05 per page at high detail, depending on the model. Tesseract is free but fragile. The VLM cost is insurance against layout changes breaking your pipeline.
- The confidence trick is essential. Ask the VLM to rate its own confidence. Fields below 0.7 get routed to human review. This turns 85% accuracy into a 98% accurate system with a 15% human review rate.
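That routing step can be sketched as a small helper. The per-field confidence map is an assumption about how you prompt the model (field name mapped to a 0-1 score):

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # fields below this go to a human


@dataclass
class RoutedExtraction:
    auto_accepted: dict
    needs_review: dict


def route_by_confidence(extracted: dict, confidences: dict) -> RoutedExtraction:
    """Split extracted fields into auto-accepted vs human-review buckets.

    Sketch only: assumes the VLM was prompted to return a per-field
    confidence map alongside the extraction (field name -> 0.0-1.0).
    """
    auto, review = {}, {}
    for field_name, value in extracted.items():
        if confidences.get(field_name, 0.0) >= REVIEW_THRESHOLD:
            auto[field_name] = value
        else:
            review[field_name] = value
    return RoutedExtraction(auto, review)
```

Missing confidence scores default to 0.0, so unknown fields fail safe into the review bucket.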
Visual QA: Reading Charts and Monitoring Dashboards
Here's an application that surprises people: VLMs can read charts. Send a screenshot of a bar chart and ask "What month had the highest revenue?" and you'll get the right answer 80-90% of the time. This opens up a new category of tools — natural language interfaces for data that only exists as images.
The key insight: don't just ask the question. Ask the model to describe the chart first, then answer. This forces it to ground its reasoning in what it actually sees, reducing hallucination:
import anthropic


def analyze_chart(image_path: str, question: str) -> dict:
    """Analyze a chart image with multi-turn visual reasoning."""
    image_bytes = Path(image_path).read_bytes()
    b64_image = base64.b64encode(image_bytes).decode("utf-8")
    suffix = Path(image_path).suffix.lstrip(".")
    media_type = f"image/{'jpeg' if suffix == 'jpg' else suffix}"

    client = anthropic.Anthropic()

    # Step 1: Describe the chart structure (grounding)
    description_response = client.messages.create(
        model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": media_type,
                    "data": b64_image
                }},
                {"type": "text", "text": (
                    "Describe this chart in detail: chart type, axes labels, "
                    "data series, approximate values for each data point, "
                    "and any notable trends."
                )}
            ]
        }]
    )
    chart_description = description_response.content[0].text

    # Step 2: Answer the question grounded in the description.
    # Replay the describe turn so the model sees its own description in context.
    answer_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[
            {"role": "user", "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": media_type,
                    "data": b64_image
                }},
                {"type": "text", "text": "Describe this chart in detail."}
            ]},
            {"role": "assistant", "content": chart_description},
            {"role": "user", "content": (
                f"Based on your description of the chart, answer this question: "
                f"{question}\n\nProvide the answer and your confidence (low/medium/high)."
            )}
        ]
    )

    return {
        "chart_description": chart_description,
        "answer": answer_response.content[0].text,
        "tokens_used": (description_response.usage.input_tokens +
                        description_response.usage.output_tokens +
                        answer_response.usage.input_tokens +
                        answer_response.usage.output_tokens)
    }
The two-step "describe then answer" pattern costs twice the API calls but dramatically improves accuracy — especially for numerical questions where the model needs to read specific values off axes.
This pattern extends naturally to dashboard monitoring. Imagine a lightweight alerting system that doesn't need any metrics integration — it just looks at your dashboard the way you would:
import time
import subprocess
from datetime import datetime


def capture_dashboard(url: str, output_path: str) -> str:
    """Capture a dashboard screenshot using a headless browser."""
    subprocess.run([
        "chromium", "--headless", "--disable-gpu",
        f"--screenshot={output_path}", "--window-size=1920,1080",
        url
    ], capture_output=True, timeout=30)
    return output_path


def monitor_dashboard(url: str, interval_minutes: int = 15):
    """Periodically screenshot a dashboard and check for anomalies."""
    previous_summary = None
    while True:
        screenshot_path = f"/tmp/dashboard_{int(time.time())}.png"
        capture_dashboard(url, screenshot_path)
        image_bytes = Path(screenshot_path).read_bytes()
        b64_image = base64.b64encode(image_bytes).decode("utf-8")

        context = f"Previous check summary: {previous_summary}" if previous_summary else ""
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-sonnet-4-5",  # alias for the latest Sonnet 4.5 snapshot
            max_tokens=800,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64", "media_type": "image/png",
                        "data": b64_image
                    }},
                    {"type": "text", "text": (
                        f"You are monitoring a dashboard. {context}\n\n"
                        f"1. Briefly summarize the current state of all visible metrics.\n"
                        f"2. Flag any anomalies, spikes, or concerning trends.\n"
                        f"3. Rate overall system health: HEALTHY / WARNING / CRITICAL.\n"
                        f"4. If WARNING or CRITICAL, explain what needs attention."
                    )}
                ]
            }]
        )
        result = response.content[0].text
        previous_summary = result[:500]

        if "CRITICAL" in result or "WARNING" in result:
            print(f"[{datetime.now():%H:%M}] ALERT:\n{result}")
            # send_slack_alert(result)  # Wire up your alerting

        Path(screenshot_path).unlink()  # Clean up
        time.sleep(interval_minutes * 60)
The VLM understands spatial layout — it can tell you "the top-right panel shows a spike in error rate" without any DOM parsing or metric API integration. It's not a replacement for proper observability (see our observability post), but it's a surprisingly effective complement — especially for dashboards that don't have API-accessible metrics.
Batch Processing and Image Comparison
Single-image extraction is useful, but production systems need to handle thousands of images. The patterns from our batch processing post apply directly: rate limiting, concurrent requests, retry logic, and progress tracking.
Here's a production batch processor that classifies and extracts metadata from a folder of images:
import asyncio
from openai import AsyncOpenAI

CATEGORIES = ["screenshot", "photograph", "diagram", "document", "chart", "other"]


async def process_single_image(
    client: AsyncOpenAI, image_path: str, semaphore: asyncio.Semaphore
) -> dict:
    """Process one image with rate limiting via semaphore."""
    async with semaphore:
        image_bytes = Path(image_path).read_bytes()
        b64_image = base64.b64encode(image_bytes).decode("utf-8")
        suffix = Path(image_path).suffix.lstrip(".")
        media_type = f"image/{'jpeg' if suffix == 'jpg' else suffix}"

        for attempt in range(3):
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o-mini",  # Cheaper model for classification
                    response_format={"type": "json_object"},
                    messages=[{
                        "role": "user",
                        "content": [
                            {"type": "text", "text": (
                                "Analyze this image and return JSON with:\n"
                                f"- category: one of {CATEGORIES}\n"
                                "- subject: brief description (10 words max)\n"
                                "- has_text: boolean\n"
                                "- text_content: extracted text if any (empty string if none)\n"
                                "- dominant_colors: list of 1-3 color names"
                            )},
                            {"type": "image_url", "image_url": {
                                "url": f"data:{media_type};base64,{b64_image}",
                                "detail": "low"  # Low detail for classification
                            }}
                        ]
                    }],
                    max_tokens=500
                )
                result = json.loads(response.choices[0].message.content)
                result["file"] = image_path
                result["status"] = "success"
                return result
            except Exception as e:
                if attempt == 2:
                    return {"file": image_path, "status": "error", "error": str(e)}
                await asyncio.sleep(2 ** attempt)  # Exponential backoff


async def batch_process_images(image_dir: str, max_concurrent: int = 5) -> list:
    """Process all images in a directory with controlled concurrency."""
    image_paths = [
        str(p) for p in Path(image_dir).iterdir()
        if p.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp", ".gif"}
    ]
    print(f"Processing {len(image_paths)} images (max {max_concurrent} concurrent)...")

    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [process_single_image(client, p, semaphore) for p in image_paths]

    results = []
    for i, coro in enumerate(asyncio.as_completed(tasks)):
        result = await coro
        results.append(result)
        status = "OK" if result["status"] == "success" else "FAIL"
        print(f"  [{i+1}/{len(tasks)}] {Path(result['file']).name}: {status}")

    successes = sum(1 for r in results if r["status"] == "success")
    print(f"\nDone: {successes}/{len(results)} succeeded")
    return results
Notice a few production details: gpt-4o-mini instead of gpt-4o for classification (~17× cheaper per input token, and nearly as accurate for simple categorization), detail: "low" to minimize token cost, and an asyncio.Semaphore to stay under rate limits without queuing logic.
One of the most powerful VLM patterns is comparing two images. This enables visual regression testing, document versioning, and change detection:
def compare_images(image_path_1: str, image_path_2: str) -> dict:
    """Compare two images and identify differences using a VLM."""
    images = []
    for path in [image_path_1, image_path_2]:
        img_bytes = Path(path).read_bytes()
        b64 = base64.b64encode(img_bytes).decode("utf-8")
        suffix = Path(path).suffix.lstrip(".")
        media_type = f"image/{'jpeg' if suffix == 'jpg' else suffix}"
        images.append({"type": "image_url", "image_url": {
            "url": f"data:{media_type};base64,{b64}", "detail": "high"
        }})

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Image 1 (before):"},
                images[0],
                {"type": "text", "text": "Image 2 (after):"},
                images[1],
                {"type": "text", "text": (
                    "Compare these two images carefully.\n"
                    "1. First, describe Image 1 in detail.\n"
                    "2. Then, describe Image 2 in detail.\n"
                    "3. List every difference you can find.\n\n"
                    "Return JSON with: description_1, description_2, "
                    "differences (list of strings), severity "
                    "(none/minor/major), summary (one sentence)."
                )}
            ]
        }],
        max_tokens=1500
    )
    return json.loads(response.choices[0].message.content)


# Visual regression testing example
diff = compare_images("screenshots/before_deploy.png", "screenshots/after_deploy.png")
if diff["severity"] != "none":
    print(f"Visual changes detected ({diff['severity']}):")
    for change in diff["differences"]:
        print(f"  - {change}")
Prompt engineering tip: The "describe each image independently, then compare" pattern is critical. If you simply ask "What's different?", the model tends to focus on the most obvious change and miss subtler ones. Forcing independent descriptions means it attends to each image fully before comparing.
Production Hardening: The Patterns That Matter
The code above works for demos. Production systems need five additional patterns: image preprocessing, prompt templating, confidence estimation, fallback chains, and cost tracking. Here's an orchestrator that combines them all:
from dataclasses import dataclass
from PIL import Image
import io


@dataclass
class VLMResult:
    data: dict
    model_used: str
    image_tokens: int
    cost_usd: float
    confidence: float


# Token costs per million (Feb 2026 pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
    "gemini-pro": {"input": 1.25, "output": 5.00},
}


def preprocess_image(image_path: str, max_dimension: int = 2048) -> bytes:
    """Resize image to optimal dimensions for VLM processing."""
    img = Image.open(image_path)
    width, height = img.size

    # Skip resize if already within bounds
    if max(width, height) <= max_dimension:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()

    # Scale down preserving aspect ratio
    scale = max_dimension / max(width, height)
    new_size = (int(width * scale), int(height * scale))
    img = img.resize(new_size, Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def estimate_image_tokens_gpt4o(width: int, height: int, detail: str) -> int:
    """Estimate token count for an image sent to GPT-4o."""
    if detail == "low":
        return 85
    # High detail: scale long side to 2048, then short side to 768, then tile 512x512
    scale = min(2048 / max(width, height), 1.0)
    scaled_w, scaled_h = int(width * scale), int(height * scale)
    short_scale = min(768 / min(scaled_w, scaled_h), 1.0)
    scaled_w, scaled_h = int(scaled_w * short_scale), int(scaled_h * short_scale)
    tiles_w = (scaled_w + 511) // 512
    tiles_h = (scaled_h + 511) // 512
    return 85 + (tiles_w * tiles_h * 170)


def extract_with_fallback(
    image_path: str, prompt: str,
    models: list[str] | None = None, confidence_threshold: float = 0.7
) -> VLMResult:
    """Extract data with image preprocessing, fallback chain, and cost tracking."""
    if models is None:
        models = ["gpt-4o", "claude-sonnet"]

    image_bytes = preprocess_image(image_path)
    b64_image = base64.b64encode(image_bytes).decode("utf-8")

    # Get dimensions for token estimation
    img = Image.open(io.BytesIO(image_bytes))
    img_tokens = estimate_image_tokens_gpt4o(img.width, img.height, "high")

    augmented_prompt = (
        f"{prompt}\n\nAlso include a 'confidence' field (0.0 to 1.0) rating "
        f"how confident you are in the extraction accuracy."
    )

    best: VLMResult | None = None
    for model_name in models:
        try:
            # call_openai / call_anthropic wrap the API patterns shown earlier
            if model_name.startswith("gpt"):
                result, output_tokens = call_openai(b64_image, augmented_prompt, model_name)
            elif model_name.startswith("claude"):
                result, output_tokens = call_anthropic(b64_image, augmented_prompt, model_name)
            else:
                continue

            confidence = result.pop("confidence", 0.5)
            pricing = PRICING.get(model_name, PRICING["gpt-4o"])
            cost = (img_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
            attempt = VLMResult(result, model_name, img_tokens, cost, confidence)

            if confidence >= confidence_threshold:
                return attempt
            if best is None or confidence > best.confidence:
                best = attempt  # keep the highest-confidence attempt so far
            print(f"  {model_name}: low confidence ({confidence:.0%}), trying next...")
        except Exception as e:
            print(f"  {model_name} failed: {e}, trying next...")

    if best is None:
        raise RuntimeError("All models in the fallback chain failed")
    # Every model was below the threshold; return the best attempt
    return best
The biggest cost lever is the detail parameter. At high detail, a 1920×1080 screenshot costs ~1,105 tokens (6 tiles × 170 + 85 base). At low detail, the same image costs just 85 tokens — a 13× reduction. For classification and simple categorization tasks, low detail is nearly as accurate. Reserve high detail for document text extraction and fine-grained reading. Smart preprocessing (resizing images to 1024–2048px before upload) prevents oversized images from generating unnecessary tiles.
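As a sanity check on those numbers, here is the same tile arithmetic as a standalone calculation (85 base tokens plus 170 per 512×512 tile, mirroring the estimator above):

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Token count for one image: 85 base, plus 170 per 512x512 tile at high detail."""
    if detail == "low":
        return 85
    scale = min(2048 / max(width, height), 1.0)  # fit long side into 2048
    w, h = int(width * scale), int(height * scale)
    short = min(768 / min(w, h), 1.0)            # fit short side into 768
    w, h = int(w * short), int(h * short)
    return 85 + math.ceil(w / 512) * math.ceil(h / 512) * 170

high = gpt4o_image_tokens(1920, 1080, "high")  # 3x2 tiles -> 1105 tokens
low = gpt4o_image_tokens(1920, 1080, "low")    # flat 85 tokens
print(high, low, f"${high * 2.50 / 1_000_000:.4f}/image at $2.50/M input")
```

A 1080p screenshot lands at 1105 tokens high detail versus 85 low detail, which is where the 13× figure comes from.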
| Model | Best For | Input Cost/M | Context Window |
|---|---|---|---|
| GPT-4o | Photos, general vision | $2.50 | 128K tokens |
| GPT-4o-mini | Classification, simple extraction | $0.15 | 128K tokens |
| Claude Sonnet 4.5 | Documents, structured extraction | $3.00 | 200K tokens |
| Gemini 2.5 Pro | Long documents, cost-sensitive | $1.25 | 1M+ tokens |
| Qwen2.5-VL-72B | Self-hosted, high-volume | Compute only | 32K tokens |
Limitations: When NOT to Use VLMs
VLMs are impressive, but they're not magic. Being honest about failure modes saves you from building systems that look great in demos and fail in production:
- Precise measurements: VLMs approximate. They'll tell you a bar chart value is "around 450" when it's actually 427. Don't use them for pixel-perfect comparisons or exact numerical extraction from charts.
- Small text: Text under ~10px in the original image is unreliable. Receipt fine print, image watermarks, and dense spreadsheet cells often get misread or hallucinated.
- Spatial reasoning in dense layouts: "Which item is third from the left?" in a grid of 20 similar items trips up even the best models. Counting and precise positioning are weak spots.
- Hallucination on ambiguous images: Blurry photos, low contrast, partially obscured text — VLMs will confidently describe things that aren't there. Always pair VLM extraction with confidence estimation and human review for ambiguous inputs.
- Latency: VLM calls with images take 2-10 seconds. That's fine for document processing but too slow for real-time video analysis or interactive applications that need sub-second responses.
The decision framework is simple: use VLMs for understanding and reasoning about images (what does this document say? what changed? what's anomalous?). Use traditional CV for detection, counting, and real-time tasks (YOLO at 60fps vs VLM at 0.1fps). The hybrid approach works best in practice: use YOLO to detect and crop objects of interest, then send the cropped regions to a VLM for detailed understanding.
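A sketch of that hybrid flow, with the detector call left as a comment (the ultralytics usage is an assumption about your detection stack; only the box-padding helper is concrete):

```python
def pad_and_clamp(boxes: list[tuple[int, int, int, int]],
                  width: int, height: int, pad: int = 10) -> list[tuple[int, int, int, int]]:
    """Expand each detector box (x1, y1, x2, y2) by `pad` pixels, clamped to the image."""
    return [(max(0, x1 - pad), max(0, y1 - pad),
             min(width, x2 + pad), min(height, y2 + pad))
            for x1, y1, x2, y2 in boxes]

# Hybrid flow (sketch; assumes the ultralytics package for YOLO detection):
# from ultralytics import YOLO
# from PIL import Image
# img = Image.open("frame.png")
# result = YOLO("yolov8n.pt")(img)[0]
# boxes = [tuple(map(int, b.xyxy[0])) for b in result.boxes]
# for box in pad_and_clamp(boxes, img.width, img.height):
#     crop = img.crop(box)
#     ...  # base64-encode the crop and send it to the VLM as shown earlier
```

Padding the boxes gives the VLM a little surrounding context, which helps when the detector's crop clips part of a label or digit.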
Try It: VLM Capability Explorer
Select a sample image and question to see real VLM responses from different models. (Pre-computed responses — no API calls needed.)
Try It: Image Token Cost Calculator
Configure image dimensions and model to see how VLMs tokenize images and what it costs at scale.
References & Further Reading
- OpenAI — GPT-4V System Card (2023) — capabilities and limitations of GPT-4 with vision
- Anthropic — Claude 3 Model Card (2024) — multimodal architecture and safety evaluation
- Google — Gemini 1.5 Technical Report (2024) — long-context multimodal capabilities
- Liu et al. — Visual Instruction Tuning (LLaVA, 2023) — the foundational open-source VLM approach
- Faysse et al. — ColPali: Efficient Document Retrieval with Vision Language Models (2024) — OCR-free document retrieval
- Zhang et al. — The Dawn of LMMs: A Survey (2024) — comprehensive survey of large multimodal models
DadOps posts referenced: CLIP from Scratch (contrastive pretraining), Vision Transformers from Scratch (architecture), LLM Receipt Parser (the original multimodal demo), Structured Output from LLMs (JSON extraction patterns), Guardrails for LLM Apps (safety for VLM outputs), Batch Processing with LLMs (rate limiting and concurrency).