
Building AI Code Review Tools

Why AI Code Review?

Your senior engineers spend 20–30% of their time reading other people's code. Reviews take 4–24 hours of wall-clock latency. And after about 200 lines, reviewer fatigue kicks in — the bugs on line 350 sail through unnoticed.

LLMs can't replace human reviewers. They miss architectural concerns, business logic errors, and team-specific conventions that require deep context about why the code exists. But LLMs can handle the mechanical checklist: style consistency, common bug patterns, missing error handling, security anti-patterns, and documentation gaps. The goal is to let human reviewers focus on design and correctness while the LLM handles the tedious line-by-line sweep.

The market agrees. CodeRabbit has processed over 2 million repositories and 13 million pull requests. GitHub Copilot Code Review went GA in April 2025. Sourcery, Qodo PR-Agent, and others are all competing to be the default AI reviewer on your team.

But here's the thing — you can build your own in an afternoon. And by building it yourself, you'll understand exactly what these tools do (and don't do), how to tune them for your codebase, and where they break. In this post, we'll build four progressively sophisticated code review tools:

  1. A single-file analyzer that catches bugs and security issues
  2. A diff-aware PR reviewer that understands what changed and why it matters
  3. A multi-agent pipeline where specialized reviewers focus on security, performance, and style separately
  4. A GitHub integration that posts inline comments directly on your pull requests

If you've read Structured Output from LLMs, you already know how to get reliable JSON from language models. If you've explored Multi-Agent Orchestration, you've seen the fan-out/merge pattern we'll use for specialized reviewers. This post puts those tools to work on one of the highest-value developer workflows there is.

Single-File Code Analyzer

Let's start simple: send a single file to an LLM and ask it to find problems. The key is structured output — we want a list of findings with specific line numbers, severities, and categories, not a wall of prose.

Here's a finding that will shape our entire approach: Li et al. (2024) tested various prompt strategies for LLM code review and found that adding a persona like "You are a senior software engineer" actually hurt review accuracy by 1–54%. Skip the role-playing — just describe the task directly and provide an example of good output.

import json
from pydantic import BaseModel
from anthropic import Anthropic

class ReviewFinding(BaseModel):
    line: int
    severity: str   # "critical", "warning", "info"
    category: str   # "bug", "security", "performance", "style"
    description: str
    suggestion: str

class FileReview(BaseModel):
    findings: list[ReviewFinding]
    summary: str

REVIEW_PROMPT = """Review this code file for bugs, security issues,
performance problems, and style issues. Return JSON matching this schema:
{{"findings": [{{"line": int, "severity": "critical|warning|info",
"category": "bug|security|performance|style",
"description": "what's wrong", "suggestion": "how to fix"}}],
"summary": "one-line overall assessment"}}

Example finding:
{{"line": 15, "severity": "critical", "category": "security",
"description": "SQL query built with string formatting is vulnerable to injection",
"suggestion": "Use parameterized queries: cursor.execute('SELECT * FROM users WHERE id = ?', (user_id,))"}}

File: {filename}
```
{code}
```"""

def review_file(filepath: str) -> FileReview:
    client = Anthropic()
    with open(filepath) as fh:
        code = fh.read()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": REVIEW_PROMPT.format(
                       filename=filepath, code=code)}]
    )
    data = json.loads(response.content[0].text)
    return FileReview(**data)

review = review_file("app/auth.py")
for f in review.findings:
    print(f"Line {f.line} [{f.severity}] {f.category}: {f.description}")

A few details matter here. We include the filename in the prompt — this helps the model apply language-specific conventions (Python files get PEP 8 feedback, JavaScript gets ESLint-style suggestions). We ask for line numbers to anchor findings to specific locations. And we provide a concrete example of a well-formatted finding, because few-shot prompting beats zero-shot by 46–659% on code review tasks according to the same Li et al. study.

The Pydantic model FileReview validates the output structure. If the LLM returns malformed JSON, Pydantic raises a validation error instead of silently passing garbage downstream. For production use, you'd wrap this in a retry with a reminder to the model about the expected schema — see our Guardrails for LLM Apps post for retry strategies.
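Here is a minimal sketch of such a retry loop, decoupled from the API client so the logic is easy to test. The `generate` callable is an assumption of this sketch: it stands in for whatever function sends the prompt (plus an optional reminder suffix) and returns the raw model text.

```python
import json
from pydantic import BaseModel, ValidationError

class ReviewFinding(BaseModel):  # same models as above
    line: int
    severity: str
    category: str
    description: str
    suggestion: str

class FileReview(BaseModel):
    findings: list[ReviewFinding]
    summary: str

SCHEMA_REMINDER = ("Your previous reply was not valid JSON for the schema "
                   '{"findings": [...], "summary": "..."}. '
                   "Reply with ONLY the JSON object, no prose.")

def parse_with_retry(generate, max_attempts: int = 3) -> FileReview:
    """Validate LLM output; on failure, re-ask with a schema reminder.

    `generate(reminder)` is any callable that sends the prompt plus an
    optional reminder suffix and returns the raw model text.
    """
    reminder = ""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(reminder)
        try:
            return FileReview(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = e
            reminder = "\n\n" + SCHEMA_REMINDER
    raise ValueError(f"No valid review after {max_attempts} attempts: {last_error}")
```

In `review_file`, `generate` would wrap the `client.messages.create` call and append the reminder to the original prompt.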

Diff-Aware PR Reviewer

Reviewing entire files wastes tokens and attention. When a developer opens a pull request, the relevant changes are in the diff — the added, modified, and deleted lines. But reviewing diffs is actually harder than reviewing full files: the model must understand both the old code and the new code, and reason about whether the change is correct in context.

The trick is to send the diff as the primary input but include surrounding context so the model understands what the change modifies:

import subprocess

def get_pr_diff(base_branch: str = "main") -> str:
    """Get the unified diff for the current branch vs base."""
    result = subprocess.run(
        ["git", "diff", f"{base_branch}...HEAD", "--unified=5"],
        capture_output=True, text=True, check=True
    )
    return result.stdout

DIFF_REVIEW_PROMPT = """Review this pull request diff. Focus on what CHANGED,
not on pre-existing code. For each issue found, reference the exact file
and line number from the diff.

Return JSON: {{"findings": [...], "summary": "..."}}
Same schema as before but add "file": "path/to/file" to each finding.

Diff:
```
{diff}
```"""

def review_pr(base_branch: str = "main") -> FileReview:
    client = Anthropic()
    diff = get_pr_diff(base_branch)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": DIFF_REVIEW_PROMPT.format(diff=diff)}]
    )
    return FileReview(**json.loads(response.content[0].text))

This works, but it has a blind spot: the model sees the changes but not the rest of the file. If a function signature changed, the model can't tell whether three callers in other files also need updating. Let's fix that with context enrichment.

Adding Repository Context

The biggest gap between a naive LLM wrapper and a production tool like CodeRabbit is context. CodeRabbit uses AST analysis, vector search over the codebase, and 40+ integrated linters alongside the LLM. We can capture the most important context with Python's built-in ast module:

import ast

def extract_file_context(filepath: str) -> dict:
    """Extract structural context: imports, function signatures, classes."""
    with open(filepath) as fh:
        source = fh.read()
    tree = ast.parse(source)

    context = {"imports": [], "functions": [], "classes": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            context["imports"].append(ast.get_source_segment(source, node))
        elif isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            context["functions"].append(
                f"def {node.name}({', '.join(args)}) -> line {node.lineno}"
            )
        elif isinstance(node, ast.ClassDef):
            context["classes"].append(
                f"class {node.name} -> line {node.lineno}"
            )
    return context

def enrich_diff_with_context(diff_text: str) -> str:
    """Add file structure context to each changed file in the diff."""
    enriched_parts = [diff_text, "\n--- Repository Context ---"]
    seen_files = set()

    for line in diff_text.splitlines():
        if line.startswith("+++ b/") and line[6:].endswith(".py"):
            filepath = line[6:]
            if filepath not in seen_files:
                seen_files.add(filepath)
                try:
                    ctx = extract_file_context(filepath)
                    enriched_parts.append(f"\n{filepath}:")
                    enriched_parts.append(f"  Imports: {ctx['imports'][:10]}")
                    enriched_parts.append(f"  Functions: {ctx['functions']}")
                    enriched_parts.append(f"  Classes: {ctx['classes']}")
                except (FileNotFoundError, SyntaxError):
                    pass

    return "\n".join(enriched_parts)

The improvement is dramatic. Without context, the reviewer generates generic feedback like "consider adding error handling." With context, it catches that a renamed parameter in authenticate() doesn't match the three call sites in routes.py, or that a new branch in process_order() has no corresponding test in test_orders.py. This is the same pattern discussed in our Context Window Strategies post — strategic context selection beats brute-force full-file inclusion every time.
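A useful extension of this idea, sketched here as a hypothetical helper (`find_call_sites` is not part of any library), is to surface the call sites of a changed function so the reviewer can check every caller against a new signature. It does a simple name match with the same ast module, with no scope or import resolution:

```python
import ast

def find_call_sites(source: str, func_name: str) -> list[int]:
    """Line numbers where `func_name` is called. Name-based match only;
    a sketch, not a full resolver (aliases and re-exports are missed)."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            callee = node.func
            # Handle both plain calls (authenticate(...)) and
            # attribute calls (auth.authenticate(...)).
            name = (callee.id if isinstance(callee, ast.Name)
                    else getattr(callee, "attr", None))
            if name == func_name:
                lines.append(node.lineno)
    return sorted(lines)
```

Appending these lines to the Repository Context block gives the model concrete evidence for flagging mismatched call sites, at the cost of a few extra tokens per file.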

Multi-Agent Review Pipeline

A single general-purpose reviewer is decent, but it spreads its attention across security, performance, style, and correctness all at once. Research from the SWR-Bench benchmark (2025) found that multi-review aggregation boosts F1 by 43.67% — running multiple specialized passes and merging results dramatically outperforms a single pass.

Here's the architecture: three specialist agents run in parallel on the same diff, then an aggregator merges and deduplicates their findings.

[Architecture: the PR diff fans out to three parallel specialists, a Security Agent, a Performance Agent, and a Style Agent, whose findings feed into an Aggregator.]
import asyncio

SPECIALIST_PROMPTS = {
    "security": """Review ONLY for security vulnerabilities. Check for:
- SQL injection, XSS, command injection
- Hardcoded secrets or credentials
- Authentication/authorization flaws
- Path traversal, SSRF, insecure deserialization
- Missing input validation at trust boundaries
Return JSON: {"findings": [...], "summary": "..."}""",

    "performance": """Review ONLY for performance issues. Check for:
- O(n^2) or worse algorithms where O(n) is possible
- N+1 database queries
- Missing caching opportunities
- Unnecessary memory allocations in loops
- Blocking I/O in async code paths
Return JSON: {"findings": [...], "summary": "..."}""",

    "style": """Review ONLY for code quality and readability. Check for:
- Unclear variable or function names
- Missing error handling for external calls
- Dead code or unreachable branches
- Functions doing too many things (SRP violations)
- Missing or misleading docstrings on public APIs
Return JSON: {"findings": [...], "summary": "..."}"""
}

async def run_specialist(client, name: str, diff: str) -> list[dict]:
    """Run one specialist reviewer on the diff."""
    prompt = SPECIALIST_PROMPTS[name] + f"\n\nDiff:\n```\n{diff}\n```"
    response = await asyncio.to_thread(
        client.messages.create,
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )
    data = json.loads(response.content[0].text)
    for finding in data["findings"]:
        finding["source_agent"] = name
    return data["findings"]

async def multi_agent_review(diff: str) -> list[dict]:
    """Run all specialists in parallel, then aggregate."""
    client = Anthropic()
    tasks = [run_specialist(client, name, diff)
             for name in SPECIALIST_PROMPTS]
    all_findings = await asyncio.gather(*tasks)

    # Flatten and deduplicate by (file, line, category)
    merged = {}
    for findings in all_findings:
        for f in findings:
            key = (f.get("file", ""), f["line"], f["category"])
            if key not in merged or f["severity"] == "critical":
                merged[key] = f

    # Sort: critical first, then warning, then info
    severity_order = {"critical": 0, "warning": 1, "info": 2}
    results = sorted(merged.values(),
                     key=lambda f: severity_order.get(f["severity"], 3))
    return results[:10]  # Cap at 10 comments per review

The [:10] cap at the end is deliberate. Nothing kills developer trust in a review bot faster than 47 comments on a 30-line PR, most of them nitpicks. Limit yourself to the 10 most impactful findings, prioritized by severity. The security agent catches vulnerabilities the generalist misses because its entire prompt budget and attention focuses on OWASP patterns. The aggregator's deduplication prevents the common failure mode where all three agents flag the same obvious issue, inflating the comment count.
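One quirk of the merge above: it keeps whichever duplicate arrives first unless a later copy is critical, so a "warning" never displaces an "info" for the same key. If you want the highest-severity duplicate in every case, a small variant (a sketch, same key scheme) does it:

```python
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}

def merge_findings(all_findings: list[list[dict]], cap: int = 10) -> list[dict]:
    """Flatten specialist outputs, keeping the highest-severity duplicate,
    then sort by severity and cap the total comment count."""
    merged = {}
    for findings in all_findings:
        for f in findings:
            key = (f.get("file", ""), f["line"], f["category"])
            best = merged.get(key)
            # Lower rank means more severe; replace when strictly more severe.
            if best is None or (SEVERITY_RANK.get(f["severity"], 3)
                                < SEVERITY_RANK.get(best["severity"], 3)):
                merged[key] = f
    return sorted(merged.values(),
                  key=lambda f: SEVERITY_RANK.get(f["severity"], 3))[:cap]
```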

This fan-out/merge pattern is the same architecture we built in Multi-Agent Orchestration, applied to a concrete problem where specialization genuinely improves results.

GitHub Integration — Posting Review Comments

Our review tools print to the console, but developers live in GitHub. Let's post review comments directly on pull requests, right on the specific lines where issues were found.

The GitHub API for inline PR comments has a critical quirk: you can only comment on lines that are part of the diff, and you must specify a diff position rather than an absolute line number. The position counts lines from each @@ hunk header in the unified diff output.

import re
import requests

def map_line_to_diff_position(diff_text: str) -> dict:
    """Build a mapping from (file, new_line_number) to diff position.

    GitHub counts positions from the first @@ hunk header of each file:
    the line just below the first @@ is position 1, and the count keeps
    running through later hunk headers until the next file begins.
    """
    position_map = {}
    current_file = None
    diff_position = None  # None until the file's first @@ header
    current_line = 0

    for raw_line in diff_text.splitlines():
        if raw_line.startswith("diff --git"):
            current_file = None
            diff_position = None  # position counting resets per file
        elif raw_line.startswith("+++ b/"):
            current_file = raw_line[6:]
        elif raw_line.startswith("@@ "):
            match = re.search(r"\+(\d+)", raw_line)
            current_line = int(match.group(1)) - 1 if match else 0
            # The first @@ header sits at position 0; later headers in the
            # same file keep the running count going.
            diff_position = 0 if diff_position is None else diff_position + 1
        elif current_file and diff_position is not None:
            diff_position += 1
            if not raw_line.startswith("-"):
                current_line += 1
            if raw_line.startswith("+"):
                position_map[(current_file, current_line)] = diff_position

    return position_map

def post_github_review(owner: str, repo: str, pr_number: int,
                       findings: list[dict], diff_text: str,
                       token: str, confidence_threshold: float = 0.7):
    """Post findings as inline GitHub PR review comments."""
    pos_map = map_line_to_diff_position(diff_text)
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github.v3+json"}

    comments = []
    for f in findings:
        confidence = f.get("confidence", 0.8)
        if confidence < confidence_threshold:
            continue  # Skip low-confidence findings to reduce noise

        key = (f.get("file", ""), f["line"])
        position = pos_map.get(key)
        if position is None:
            continue  # Line not in diff — can't comment here

        severity_emoji = {"critical": "🔴", "warning": "🟡", "info": "🔵"}
        body = (f"{severity_emoji.get(f['severity'], '⚪')} "
                f"**{f['severity'].upper()}** ({f['category']})\n\n"
                f"{f['description']}\n\n"
                f"**Suggestion:** {f['suggestion']}")
        comments.append({"path": f["file"], "position": position,
                         "body": body})

    if not comments:
        return None

    review_body = {
        "body": f"AI Review: {len(comments)} issue(s) found. "
                f"Review generated by automated analysis.",
        "event": "COMMENT",
        "comments": comments[:10]
    }

    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
    resp = requests.post(url, json=review_body, headers=headers)
    resp.raise_for_status()
    return resp.json()

Several design choices here reflect lessons from production tools. The confidence threshold suppresses findings the model itself is unsure about, which is the cheapest way to cut false positives. Findings that map to lines outside the diff are skipped entirely, since GitHub would reject those comments anyway. And the 10-comment cap is enforced again at posting time, so a misbehaving upstream stage can't flood the PR.

The developer experience principle: your bot should feel like a helpful colleague who catches things you missed, not a pedantic linter that nitpicks every line.

Evaluation and Calibration

How good is our reviewer, actually? Let's be honest about the numbers. The SWR-Bench benchmark (2025) evaluated leading LLM review tools on 1,000 real GitHub PRs; the best single-pass F1 was 19.38%, and most tools had precision below 10%. That's sobering.

But benchmarks measure everything — subtle logic bugs, architectural issues, and nuanced edge cases. On the categories where LLMs actually shine (missing error handling, obvious security issues, style violations), practical precision is much higher: 60–80%. The key is knowing what your tool is good at and calibrating your expectations accordingly.

def evaluate_reviewer(reviewer_fn, test_cases: list[dict]) -> dict:
    """Evaluate a code reviewer against known-buggy examples.

    Each test_case has: {"code": str, "known_bugs": [{"line": int, "category": str}]}
    """
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for case in test_cases:
        findings = reviewer_fn(case["code"])
        found_lines = {(f["line"], f["category"]) for f in findings}
        known_lines = {(b["line"], b["category"]) for b in case["known_bugs"]}

        true_positives += len(found_lines & known_lines)
        false_positives += len(found_lines - known_lines)
        false_negatives += len(known_lines - found_lines)

    precision = true_positives / max(true_positives + false_positives, 1)
    recall = true_positives / max(true_positives + false_negatives, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)

    return {"precision": precision, "recall": recall, "f1": f1,
            "true_positives": true_positives,
            "false_positives": false_positives,
            "false_negatives": false_negatives}

# Example: evaluate on 5 known-buggy snippets
test_suite = [
    {"code": "query = f'SELECT * FROM users WHERE id = {uid}'",
     "known_bugs": [{"line": 1, "category": "security"}]},
    {"code": "data = json.loads(request.data)\nreturn data['key']",
     "known_bugs": [{"line": 2, "category": "bug"}]},
    # ... more test cases
]
results = evaluate_reviewer(review_file, test_suite)
print(f"Precision: {results['precision']:.1%}")
print(f"Recall:    {results['recall']:.1%}")
print(f"F1:        {results['f1']:.1%}")

One powerful technique for improving precision: multi-pass aggregation. Run the review 3 times independently with temperature > 0, then keep only findings that appear in 2 or more passes. This filters out hallucinated findings (which tend to be random across runs) while preserving real issues (which the model consistently identifies). The SWR-Bench study found this approach improved F1 by 43.67% — a significant gain from a simple technique.
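Here is a sketch of that consensus filter, assuming findings are keyed by (file, line, category) as in our aggregator:

```python
from collections import Counter

def consensus_filter(passes: list[list[dict]], min_votes: int = 2) -> list[dict]:
    """Keep only findings that appear in at least `min_votes` review passes."""
    votes = Counter()
    canonical = {}
    for findings in passes:
        seen = set()
        for f in findings:
            key = (f.get("file", ""), f["line"], f["category"])
            if key in seen:
                continue  # each pass votes at most once per finding
            seen.add(key)
            votes[key] += 1
            canonical.setdefault(key, f)  # keep the first phrasing we saw
    return [canonical[k] for k, n in votes.items() if n >= min_votes]
```

With three passes and `min_votes=2`, a hallucinated finding has to recur independently to survive, which random noise rarely does.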

There's also the calibration problem: LLMs are overconfident in their code review findings. A finding marked "high confidence" might only be correct 70% of the time. Track your tool's actual accuracy by category and severity, and adjust your confidence threshold until the false positive rate is tolerable for your team. For more on building evaluation systems, see our Evaluating LLM Systems post.
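One way to operationalize that tracking: log each finding's stated confidence next to a human verdict, then pick the lowest cutoff that meets a target precision. A sketch (the judged-finding record format here is our own assumption):

```python
def calibrate_threshold(judged: list[dict], target_precision: float = 0.7) -> float:
    """Lowest confidence cutoff whose surviving findings meet target precision.

    judged: [{"confidence": float, "correct": bool}, ...] accumulated
    from human review of past bot comments.
    """
    for cutoff in sorted({j["confidence"] for j in judged}):
        kept = [j for j in judged if j["confidence"] >= cutoff]
        precision = sum(j["correct"] for j in kept) / len(kept)
        if precision >= target_precision:
            return cutoff
    return 1.0  # no cutoff achieves the target; suppress everything
```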

Review Category            LLM Strength     Typical Precision
Security (OWASP Top 10)    High             70–85%
Style / Conventions        High             75–90%
Missing Error Handling     High             65–80%
Performance Issues         Moderate         50–70%
Logic Bugs                 Low–Moderate     30–50%
Architectural Concerns     Low              20–40%

Try It: Code Review Simulator

Select a code sample and watch AI review comments appear inline. Toggle between single-agent and multi-agent review to see how specialization catches more issues.

Try It: Review Precision Calibrator

Judge 10 AI review findings as "Correct," "False Positive," or "Style Only." See how your judgments compare to the ground truth and discover the optimal confidence threshold.


Conclusion

We've built four code review tools, each more capable than the last: a single-file analyzer, a diff-aware PR reviewer with repository context, a multi-agent pipeline with specialized security/performance/style reviewers, and a GitHub integration that posts inline comments directly on pull requests.

The honest assessment: AI code review today is a net-positive assistant, not a replacement. It catches the mechanical stuff — missing error handling, obvious security patterns, style inconsistencies — reliably enough to be worth deploying. It misses the hard stuff — logic bugs, architectural concerns, business rule violations — frequently enough that you still need human reviewers for anything important.

The most impactful technique we covered isn't more sophisticated prompting — it's multi-agent specialization with aggressive filtering. Specialized agents with focused prompts outperform generalists. Multi-pass aggregation filters noise. And capping at 10 comments with a confidence threshold keeps the bot helpful instead of annoying.

Start with the single-file analyzer. Run it on your last five PRs and see what it catches. If the signal-to-noise ratio is good enough, wire up the GitHub integration and let it comment on your next PR. Your senior engineers will thank you for the help — and you'll have built something that understands exactly how (and how not) to trust AI with your code.

References & Further Reading