Building AI Code Review Tools
Why AI Code Review?
Your senior engineers spend 20–30% of their time reading other people's code. Reviews take 4–24 hours of wall-clock latency. And after about 200 lines, reviewer fatigue kicks in — the bugs on line 350 sail through unnoticed.
LLMs can't replace human reviewers. They miss architectural concerns, business logic errors, and team-specific conventions that require deep context about why the code exists. But LLMs can handle the mechanical checklist: style consistency, common bug patterns, missing error handling, security anti-patterns, and documentation gaps. The goal is to let human reviewers focus on design and correctness while the LLM handles the tedious line-by-line sweep.
The market agrees. CodeRabbit has processed over 2 million repositories and 13 million pull requests. GitHub Copilot Code Review went GA in April 2025. Sourcery, Qodo PR-Agent, and others are all competing to be the default AI reviewer on your team.
But here's the thing — you can build your own in an afternoon. And by building it yourself, you'll understand exactly what these tools do (and don't do), how to tune them for your codebase, and where they break. In this post, we'll build four progressively sophisticated code review tools:
- A single-file analyzer that catches bugs and security issues
- A diff-aware PR reviewer that understands what changed and why it matters
- A multi-agent pipeline where specialized reviewers focus on security, performance, and style separately
- A GitHub integration that posts inline comments directly on your pull requests
If you've read Structured Output from LLMs, you already know how to get reliable JSON from language models. If you've explored Multi-Agent Orchestration, you've seen the fan-out/merge pattern we'll use for specialized reviewers. This post puts those tools to work on one of the highest-value developer workflows there is.
Single-File Code Analyzer
Let's start simple: send a single file to an LLM and ask it to find problems. The key is structured output — we want a list of findings with specific line numbers, severities, and categories, not a wall of prose.
Here's a finding that will shape our entire approach: Li et al. (2024) tested various prompt strategies for LLM code review and found that adding a persona like "You are a senior software engineer" actually hurt review accuracy by 1–54%. Skip the role-playing — just describe the task directly and provide an example of good output.
import json
from pydantic import BaseModel
from anthropic import Anthropic

class ReviewFinding(BaseModel):
    line: int
    severity: str   # "critical", "warning", "info"
    category: str   # "bug", "security", "performance", "style"
    description: str
    suggestion: str
    file: str = ""  # filled in by diff-based reviewers; empty for single-file reviews

class FileReview(BaseModel):
    findings: list[ReviewFinding]
    summary: str

REVIEW_PROMPT = """Review this code file for bugs, security issues,
performance problems, and style issues. Return only JSON matching this schema:
{{"findings": [{{"line": int, "severity": "critical|warning|info",
  "category": "bug|security|performance|style",
  "description": "what's wrong", "suggestion": "how to fix"}}],
 "summary": "one-line overall assessment"}}

Example finding:
{{"line": 15, "severity": "critical", "category": "security",
 "description": "SQL query built with string formatting is vulnerable to injection",
 "suggestion": "Use parameterized queries: cursor.execute('SELECT * FROM users WHERE id = ?', (user_id,))"}}

File: {filename}
```
{code}
```"""

def review_file(filepath: str) -> FileReview:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with open(filepath) as fh:
        code = fh.read()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": REVIEW_PROMPT.format(
                       filename=filepath, code=code)}]
    )
    # json.loads raises on malformed JSON; FileReview raises on a wrong shape
    data = json.loads(response.content[0].text)
    return FileReview(**data)

review = review_file("app/auth.py")
for f in review.findings:
    print(f"Line {f.line} [{f.severity}] {f.category}: {f.description}")
A few details matter here. We include the filename in the prompt — this helps the model apply language-specific conventions (Python files get PEP 8 feedback, JavaScript gets ESLint-style suggestions). We ask for line numbers to anchor findings to specific locations. And we provide a concrete example of a well-formatted finding, because few-shot prompting beats zero-shot by 46–659% on code review tasks according to the same Li et al. study.
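One caveat on line numbers: the prompt sends raw source, so the model has to count lines itself, and that count can drift on longer files. A minimal sketch of one mitigation, assuming you are free to change what goes into the {code} slot, is to prefix each line with its number before formatting the prompt. The helper name number_lines is hypothetical and not part of the code above.

```python
def number_lines(code: str, start: int = 1) -> str:
    """Prefix every line with a right-aligned line number and a separator."""
    lines = code.splitlines()
    # Width of the largest line number, so the separators line up
    width = len(str(start + len(lines) - 1)) if lines else 1
    return "\n".join(
        f"{n:>{width}} | {text}" for n, text in enumerate(lines, start=start)
    )


sample = "import os\n\ndef run(cmd):\n    os.system(cmd)"
print(number_lines(sample))
```

You would then pass number_lines(code) instead of code when formatting the prompt, and tell the model that the numbers before the separator are the ones to report.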
The Pydantic model FileReview validates the output structure. If the LLM returns malformed JSON, json.loads raises a JSONDecodeError; if the JSON is well-formed but has the wrong shape, Pydantic raises a ValidationError. Either way the failure is loud instead of silently passing garbage downstream. For production use, you'd wrap this in a retry with a reminder to the model about the expected schema; see our Guardrails for LLM Apps post for retry strategies.
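A retry loop of that kind might look like the sketch below. It is deliberately standard-library only so it runs anywhere: parse_review is a hypothetical stand-in for the json.loads plus FileReview validation step, and call_model is an assumed callable that takes a prompt string and returns the model's text (in the post's code it would wrap client.messages.create). Neither name comes from the original code.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"line", "severity", "category", "description", "suggestion"}


def parse_review(text: str) -> dict:
    """Parse model output into a review dict; raise ValueError on a bad shape."""
    text = text.strip()
    # Models sometimes wrap JSON in a Markdown fence; strip it if present.
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    data = json.loads(text)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        raise ValueError("expected an object with a 'findings' list")
    for finding in data["findings"]:
        if not isinstance(finding, dict):
            raise ValueError("each finding must be an object")
        missing = REQUIRED_KEYS - set(finding)
        if missing:
            raise ValueError(f"finding is missing keys: {sorted(missing)}")
    return data


def review_with_retry(call_model: Callable[[str], str], prompt: str,
                      max_attempts: int = 3) -> dict:
    """Call the model, re-prompting with the validation error on failure."""
    last_error = None
    for _ in range(max_attempts):
        attempt_prompt = prompt
        if last_error is not None:
            attempt_prompt += (
                f"\n\nYour previous reply was rejected: {last_error}. "
                "Return only valid JSON matching the schema."
            )
        try:
            return parse_review(call_model(attempt_prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid review after {max_attempts} attempts") from last_error
```

Feeding the error message back is a design choice rather than a requirement: it costs a few tokens and gives the model a concrete reason for the rejection.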
Diff-Aware PR Reviewer
Reviewing entire files wastes tokens and attention. When a developer opens a pull request, the relevant changes are in the diff — the added, modified, and deleted lines. But reviewing diffs is actually harder than reviewing full files: the model must understand both the old code and the new code, and reason about whether the change is correct in context.
The trick is to send the diff as the primary input but include surrounding context so the model understands what the change modifies:
import subprocess

def get_pr_diff(base_branch: str = "main") -> str:
    """Get the unified diff for the current branch vs base."""
    result = subprocess.run(
        ["git", "diff", f"{base_branch}...HEAD", "--unified=5"],
        capture_output=True, text=True, check=True
    )
    return result.stdout

# The schema is spelled out in full: each request is independent, so the
# model has no earlier prompt to refer back to.
DIFF_REVIEW_PROMPT = """Review this pull request diff. Focus on what CHANGED,
not on pre-existing code. For each issue found, reference the exact file
and the line number in the new version of that file.

Return only JSON: {{"findings": [{{"file": "path/to/file", "line": int,
  "severity": "critical|warning|info",
  "category": "bug|security|performance|style",
  "description": "what's wrong", "suggestion": "how to fix"}}],
 "summary": "one-line overall assessment"}}

Diff:
```
{diff}
```"""

def review_pr(base_branch: str = "main") -> FileReview:
    client = Anthropic()
    diff = get_pr_diff(base_branch)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": DIFF_REVIEW_PROMPT.format(diff=diff)}]
    )
    # Pydantic ignores unknown keys by default, so ReviewFinding needs a
    # "file" field for the file path to survive validation.
    return FileReview(**json.loads(response.content[0].text))
This works, but it has a blind spot: the model sees the changes but not the rest of the file. If a function signature changed, the model can't tell whether three callers in other files also need updating. Let's fix that with context enrichment.
Adding Repository Context
The biggest gap between a naive LLM wrapper and a production tool like CodeRabbit is context. CodeRabbit uses AST analysis, vector search over the codebase, and 40+ integrated linters alongside the LLM. We can capture the most important context with Python's built-in ast module:
import ast

def extract_file_context(filepath: str) -> dict:
    """Extract structural context: imports, function signatures, classes."""
    with open(filepath) as fh:
        source = fh.read()
    tree = ast.parse(source)
    context = {"imports": [], "functions": [], "classes": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            context["imports"].append(ast.get_source_segment(source, node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = [a.arg for a in node.args.args]
            context["functions"].append(
                f"def {node.name}({', '.join(args)}) -> line {node.lineno}"
            )
        elif isinstance(node, ast.ClassDef):
            context["classes"].append(
                f"class {node.name} -> line {node.lineno}"
            )
    return context

def enrich_diff_with_context(diff_text: str) -> str:
    """Add file structure context to each changed file in the diff."""
    enriched_parts = [diff_text, "\n--- Repository Context ---"]
    seen_files = set()
    for line in diff_text.splitlines():
        # "+++ b/<path>" names the new version of each changed file
        if line.startswith("+++ b/") and line[6:].endswith(".py"):
            filepath = line[6:]
            if filepath not in seen_files:
                seen_files.add(filepath)
                try:
                    ctx = extract_file_context(filepath)
                    enriched_parts.append(f"\n{filepath}:")
                    enriched_parts.append(f"  Imports: {ctx['imports'][:10]}")
                    enriched_parts.append(f"  Functions: {ctx['functions']}")
                    enriched_parts.append(f"  Classes: {ctx['classes']}")
                except (FileNotFoundError, SyntaxError):
                    pass  # deleted or unparsable file: skip its context
    return "\n".join(enriched_parts)
The improvement is dramatic. Without context, the reviewer generates generic feedback like "consider adding error handling." With context, it catches that a renamed parameter in authenticate() doesn't match the three call sites in routes.py, or that a new branch in process_order() has no corresponding test in test_orders.py. This is the same pattern discussed in our Context Window Strategies post — strategic context selection beats brute-force full-file inclusion every time.
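Note that enrich_diff_with_context only inspects files that appear in the diff, so catching stale callers in files the PR did not touch needs one more lookup. A minimal sketch of that lookup, again using the ast module, follows. The function name find_call_sites is hypothetical, and it matches by name only, with no import resolution, so unrelated functions that share a name will also be reported.

```python
import ast
from pathlib import Path


def find_call_sites(root: str, func_name: str) -> list[tuple[str, int]]:
    """Return sorted (path, line) pairs for every call to func_name under root."""
    hits = []
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse as Python
        for node in ast.walk(tree):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            # ast.Name covers authenticate(...); ast.Attribute covers obj.authenticate(...)
            name = getattr(target, "id", None) or getattr(target, "attr", None)
            if name == func_name:
                hits.append((str(path), node.lineno))
    return sorted(hits)
```

The resulting list, for example find_call_sites("app", "authenticate"), can be appended to the repository context section so the model sees where a changed function is used.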
Multi-Agent Review Pipeline
A single general-purpose reviewer is decent, but it spreads its attention across security, performance, style, and correctness all at once. The SWR-Bench benchmark (2025) reports that aggregating multiple independent reviews boosts F1 by up to 43.67%. That result is about merging several review passes rather than about specialist prompts specifically, but it motivates the same design: run several focused passes and merge the results instead of relying on a single one.
Here's the architecture: three specialist agents run in parallel on the same diff, then an aggregator merges and deduplicates their findings.
import asyncio

SPECIALIST_PROMPTS = {
    "security": """Review ONLY for security vulnerabilities. Check for:
- SQL injection, XSS, command injection
- Hardcoded secrets or credentials
- Authentication/authorization flaws
- Path traversal, SSRF, insecure deserialization
- Missing input validation at trust boundaries""",
    "performance": """Review ONLY for performance issues. Check for:
- O(n^2) or worse algorithms where O(n) is possible
- N+1 database queries
- Missing caching opportunities
- Unnecessary memory allocations in loops
- Blocking I/O in async code paths""",
    "style": """Review ONLY for code quality and readability. Check for:
- Unclear variable or function names
- Missing error handling for external calls
- Dead code or unreachable branches
- Functions doing too many things (SRP violations)
- Missing or misleading docstrings on public APIs""",
}

# Shared output format. These strings are concatenated, not passed through
# str.format(), so the braces are single rather than doubled.
FINDING_FORMAT = """
Return only JSON: {"findings": [{"file": "path/to/file", "line": int,
  "severity": "critical|warning|info",
  "category": "bug|security|performance|style",
  "description": "what's wrong", "suggestion": "how to fix"}],
 "summary": "one-line overall assessment"}"""

async def run_specialist(client, name: str, diff: str) -> list[dict]:
    """Run one specialist reviewer on the diff."""
    prompt = (SPECIALIST_PROMPTS[name] + FINDING_FORMAT
              + f"\n\nDiff:\n```\n{diff}\n```")
    # The client is synchronous, so run the call in a worker thread
    response = await asyncio.to_thread(
        client.messages.create,
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )
    data = json.loads(response.content[0].text)
    for finding in data["findings"]:
        finding["source_agent"] = name
    return data["findings"]

async def multi_agent_review(diff: str) -> list[dict]:
    """Run all specialists in parallel, then aggregate."""
    client = Anthropic()
    tasks = [run_specialist(client, name, diff)
             for name in SPECIALIST_PROMPTS]
    all_findings = await asyncio.gather(*tasks)

    severity_order = {"critical": 0, "warning": 1, "info": 2}

    def rank(f: dict) -> int:
        return severity_order.get(f.get("severity"), 3)

    # Flatten and deduplicate by (file, line, category), keeping the most severe
    merged = {}
    for findings in all_findings:
        for f in findings:
            key = (f.get("file", ""), f.get("line"), f.get("category"))
            if key not in merged or rank(f) < rank(merged[key]):
                merged[key] = f

    # Sort: critical first, then warning, then info
    results = sorted(merged.values(), key=rank)
    return results[:10]  # Cap at 10 comments per review
The [:10] cap at the end is deliberate. Nothing kills developer trust in a review bot faster than 47 comments on a 30-line PR, most of them nitpicks. Limit yourself to the 10 most impactful findings, prioritized by severity. The security agent catches vulnerabilities the generalist misses because its entire prompt budget and attention focuses on OWASP patterns. The aggregator's deduplication prevents the common failure mode where all three agents flag the same obvious issue, inflating the comment count.
This fan-out/merge pattern is the same architecture we built in Multi-Agent Orchestration, applied to a concrete problem where specialization genuinely improves results.
GitHub Integration — Posting Review Comments
Our review tools print to the console, but developers live in GitHub. Let's post review comments directly on pull requests, right on the specific lines where issues were found.
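The GitHub API for inline PR comments has a critical quirk: you can only comment on lines that are part of the diff, and the position field used here is a diff position rather than an absolute line number. The position counts lines down from the first @@ hunk header of each file: the line just below that header is position 1, and the count keeps increasing through any later hunks until the next file begins.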
import re
import requests

def map_line_to_diff_position(diff_text: str) -> dict:
    """Build a mapping from (file, new-file line number) to diff position."""
    position_map = {}
    current_file = None
    diff_position = -1  # -1 until the file's first hunk header is seen
    current_line = 0
    for raw_line in diff_text.splitlines():
        if raw_line.startswith("diff --git "):
            current_file = None  # header lines of the next file are not hunk lines
        elif raw_line.startswith("+++ b/"):
            current_file = raw_line[6:]
            diff_position = -1
        elif current_file and raw_line.startswith("@@ "):
            match = re.search(r"\+(\d+)", raw_line)
            current_line = int(match.group(1)) - 1 if match else 0
            # The first hunk header is position 0; later headers count as a line
            diff_position += 1
        elif current_file and diff_position >= 0:
            diff_position += 1
            # Removed lines and "\ No newline" markers are not lines of the new file
            if not raw_line.startswith(("-", "\\")):
                current_line += 1
            if raw_line.startswith("+"):
                position_map[(current_file, current_line)] = diff_position
    return position_map

def post_github_review(owner: str, repo: str, pr_number: int,
                       findings: list[dict], diff_text: str,
                       token: str, confidence_threshold: float = 0.7):
    """Post findings as inline GitHub PR review comments."""
    pos_map = map_line_to_diff_position(diff_text)
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github.v3+json"}
    severity_emoji = {"critical": "🔴", "warning": "🟡", "info": "🔵"}
    comments = []
    for f in findings:
        # Findings without a "confidence" field default to 0.8 and pass the filter
        confidence = f.get("confidence", 0.8)
        if confidence < confidence_threshold:
            continue  # Skip low-confidence findings to reduce noise
        key = (f.get("file", ""), f["line"])
        position = pos_map.get(key)
        if position is None:
            continue  # Line not in diff: can't comment here
        body = (f"{severity_emoji.get(f['severity'], '⚪')} "
                f"**{f['severity'].upper()}** ({f['category']})\n\n"
                f"{f['description']}\n\n"
                f"**Suggestion:** {f['suggestion']}")
        comments.append({"path": f["file"], "position": position,
                         "body": body})
    if not comments:
        return None
    comments = comments[:10]  # Cap first so the summary count matches what is posted
    review_body = {
        "body": f"AI Review: {len(comments)} issue(s) found. "
                f"Review generated by automated analysis.",
        "event": "COMMENT",
        "comments": comments
    }
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
    resp = requests.post(url, json=review_body, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()
Several design choices here reflect lessons from production tools:
- Confidence threshold (0.7): We skip findings the model is less than 70% confident about. False positives destroy developer trust faster than missed bugs.
- COMMENT, not REQUEST_CHANGES: The bot assists — it doesn't gate. GitHub Copilot's review bot also never posts APPROVE, only COMMENT.
- Batched comments: All inline comments go in a single /reviews API call. This is both rate-limit friendly and creates a clean review thread.
- Capped at 10: Focus on the most impactful findings. A helpful review bot is concise; a pedantic one gets disabled.
The developer experience principle: your bot should feel like a helpful colleague who catches things you missed, not a pedantic linter that nitpicks every line.
Evaluation and Calibration
How good is our reviewer, actually? Let's be honest about the numbers. The SWR-Bench benchmark (2025) tested the best LLM review tools on 1,000 real GitHub PRs and found the best single-pass F1 was 19.38%. Most tools had precision below 10%. That's sobering.
But benchmarks measure everything — subtle logic bugs, architectural issues, and nuanced edge cases. On the categories where LLMs actually shine (missing error handling, obvious security issues, style violations), practical precision is much higher: 60–80%. The key is knowing what your tool is good at and calibrating your expectations accordingly.
import os
import tempfile

def evaluate_reviewer(reviewer_fn, test_cases: list[dict]) -> dict:
    """Evaluate a code reviewer against known-buggy examples.

    reviewer_fn takes source code (str) and returns a list of finding dicts.
    Each test_case has: {"code": str, "known_bugs": [{"line": int, "category": str}]}
    """
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for case in test_cases:
        findings = reviewer_fn(case["code"])
        found_lines = {(f["line"], f["category"]) for f in findings}
        known_lines = {(b["line"], b["category"]) for b in case["known_bugs"]}
        true_positives += len(found_lines & known_lines)
        false_positives += len(found_lines - known_lines)
        false_negatives += len(known_lines - found_lines)
    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1,
            "true_positives": true_positives,
            "false_positives": false_positives,
            "false_negatives": false_negatives}

def review_snippet(code: str) -> list[dict]:
    """Adapter: review_file takes a path and returns a FileReview, while
    evaluate_reviewer expects code in and a list of dicts out."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
    try:
        review = review_file(tmp.name)
    finally:
        os.remove(tmp.name)
    return [{"line": f.line, "category": f.category} for f in review.findings]

# Example: evaluate on 5 known-buggy snippets
test_suite = [
    {"code": "query = f'SELECT * FROM users WHERE id = {uid}'",
     "known_bugs": [{"line": 1, "category": "security"}]},
    {"code": "data = json.loads(request.data)\nreturn data['key']",
     "known_bugs": [{"line": 2, "category": "bug"}]},
    # ... more test cases
]
results = evaluate_reviewer(review_snippet, test_suite)
print(f"Precision: {results['precision']:.1%}")
print(f"Recall: {results['recall']:.1%}")
print(f"F1: {results['f1']:.1%}")
One powerful technique for improving precision: multi-pass aggregation. Run the review 3 times independently with temperature > 0, then keep only findings that appear in 2 or more passes. This filters out hallucinated findings (which tend to be random across runs) while preserving real issues (which the model consistently identifies). The SWR-Bench study found this approach improved F1 by up to 43.67%, a significant gain from a simple technique.
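The voting step can be sketched in a few lines. This is a minimal illustration, not the SWR-Bench implementation: aggregate_passes is a hypothetical name, and it assumes two passes agree only when file, line, and category match exactly. In practice line numbers can shift by one or two between runs, so you may want a looser match.

```python
from collections import Counter


def aggregate_passes(passes: list[list[dict]], min_votes: int = 2) -> list[dict]:
    """Keep findings reported by at least min_votes independent passes."""
    votes = Counter()
    first_seen = {}
    for findings in passes:
        # Count each key once per pass so a single pass cannot vote twice
        keys_in_pass = set()
        for f in findings:
            key = (f.get("file", ""), f["line"], f["category"])
            keys_in_pass.add(key)
            first_seen.setdefault(key, f)
        votes.update(keys_in_pass)
    # Dicts preserve insertion order, so results follow first appearance
    return [first_seen[k] for k in first_seen if votes[k] >= min_votes]


# Made-up findings from three passes over the same diff
sql = {"file": "db.py", "line": 10, "category": "security"}
name = {"file": "db.py", "line": 22, "category": "style"}
loop = {"file": "db.py", "line": 40, "category": "bug"}
print(aggregate_passes([[sql, name], [sql], [sql, loop]]))
```

Only the finding reported by all three passes survives; the two that appeared once are dropped.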
There's also the calibration problem: LLMs are overconfident in their code review findings. A finding marked "high confidence" might only be correct 70% of the time. Track your tool's actual accuracy by category and severity, and adjust your confidence threshold until the false positive rate is tolerable for your team. For more on building evaluation systems, see our Evaluating LLM Systems post.
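One way to pick that threshold is to label a sample of past findings as correct or not and measure precision at each candidate cutoff. The sketch below assumes you have such labels as (confidence, is_correct) pairs; the function names and the sample data are made up for illustration.

```python
def precision_by_threshold(labeled, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Map each threshold to (precision, findings kept) for labeled findings.

    labeled is a list of (confidence, is_correct) pairs.
    """
    report = {}
    for t in thresholds:
        kept = [ok for conf, ok in labeled if conf >= t]
        precision = sum(kept) / len(kept) if kept else None
        report[t] = (precision, len(kept))
    return report


def pick_threshold(labeled, target_precision=0.8,
                   thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the lowest threshold whose measured precision meets the target."""
    report = precision_by_threshold(labeled, thresholds)
    for t in sorted(report):
        precision, _ = report[t]
        if precision is not None and precision >= target_precision:
            return t
    return None  # no candidate threshold reaches the target


# Hypothetical hand-labeled sample: (model confidence, was the finding correct?)
sample = [(0.95, True), (0.9, True), (0.85, True), (0.8, False),
          (0.75, True), (0.7, False), (0.6, False), (0.55, False)]
print(precision_by_threshold(sample))
print(pick_threshold(sample, target_precision=0.75))
```

Choosing the lowest threshold that meets the target keeps as many findings as possible, which matters because every step up in the cutoff also discards some correct findings.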
| Review Category | LLM Strength | Typical Precision |
|---|---|---|
| Security (OWASP Top 10) | High | 70–85% |
| Style / Conventions | High | 75–90% |
| Missing Error Handling | High | 65–80% |
| Performance Issues | Moderate | 50–70% |
| Logic Bugs | Low–Moderate | 30–50% |
| Architectural Concerns | Low | 20–40% |
Conclusion
We've built four code review tools, each more capable than the last: a single-file analyzer, a diff-aware PR reviewer with repository context, a multi-agent pipeline with specialized security/performance/style reviewers, and a GitHub integration that posts inline comments directly on pull requests.
The honest assessment: AI code review today is a net-positive assistant, not a replacement. It catches the mechanical stuff — missing error handling, obvious security patterns, style inconsistencies — reliably enough to be worth deploying. It misses the hard stuff — logic bugs, architectural concerns, business rule violations — frequently enough that you still need human reviewers for anything important.
The most impactful technique we covered isn't more sophisticated prompting — it's multi-agent specialization with aggressive filtering. Specialized agents with focused prompts outperform generalists. Multi-pass aggregation filters noise. And capping at 10 comments with a confidence threshold keeps the bot helpful instead of annoying.
Start with the single-file analyzer. Run it on your last five PRs and see what it catches. If the signal-to-noise ratio is good enough, wire up the GitHub integration and let it comment on your next PR. Your senior engineers will thank you for the help — and you'll have built something that understands exactly how (and how not) to trust AI with your code.
References & Further Reading
- Li et al. — "CodeReviewer: Pre-Training for Automating Code Review Activities" (2022) — Pre-trained model for code review with multi-task objectives
- Li et al. — "Fine-Tuning and Prompt Engineering for LLM-Based Code Review" (2024) — Found that personas hurt and few-shot helps for code review prompting
- "An Insight into Security Code Review with LLMs" (2024) — CWE-grounded prompts and chain-of-thought for security review
- SWR-Bench — "Benchmarking LLM-Based Code Review" (2025) — 1,000 real PRs, best F1 of 19.38%, multi-pass aggregation +43.67%
- "Rethinking Code Review Workflows with LLM Assistance" (2025) — GPT-assisted PRs reduced median resolution time by 60%+
- GitHub REST API — Pull Request Reviews — API reference for posting inline review comments
- CodeRabbit Architecture — Production AI code review with AST analysis and 40+ integrated linters
- Qodo PR-Agent (open-source) — Self-hostable AI PR reviewer with single-call-per-tool architecture