TL;DR: We built an automated AI code review pipeline using GitHub Actions and the Claude API that reviews every pull request before a human looks at it. Over 4 weeks and 47 PRs, we caught 89 issues (57 real bugs, 32 style/security) before they reached production, reduced median review time by 38%, and cost less than $0.12 per review.

The Problem

Manual code review is the bottleneck nobody budgets for. A team of 5 developers at a mid-stage startup spends roughly 8-12 hours per week on PR reviews — time that scales linearly with team size. Worse, human reviewers miss 30-40% of bugs in unfamiliar code paths. (If you're new to AI coding assistants, check our AI Agents for Beginners guide for context.)

8.0 / 10

Build Log Review 2026

🛡️ AI Tool · Updated 2026

We had two specific pain points:

Review lag. PRs sat for 4-6 hours waiting for a reviewer. Quick bugfix PRs accumulated into deployment blockers.
Inconsistent depth. Seasoned engineers caught architecture issues; junior reviewers focused on formatting. The quality gap was real.

We needed a first-pass reviewer that ran on every PR, caught the obvious stuff instantly, and let humans focus on architecture, design, and business logic.

What We Built

An AI code review agent that lives inside a GitHub Actions workflow. Here's the architecture:

Component	Role	Tech
Trigger	Runs on every PR open or update	GitHub Actions `pull_request` event
Diff Extractor	Gets changed files + context	`actions/checkout` + `git diff`
Review Engine	Analyzes changes and generates feedback	Claude API (Sonnet 4.6)
Commenter	Posts line-level and summary reviews	GitHub API via `octokit`
Config	Review rules, file exclusions, severity thresholds	`.github/ai-review-config.yml`

The Workflow File

The entire pipeline is a single YAML file in .github/workflows/ai-code-review.yml:

class="language-yaml">name: AI Code Review
on:
 pull_request:
 types: [opened, synchronize]
jobs:
review:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
checks: write
steps:


uses: actions/checkout@v4
with:
fetch-depth: 0


name: Get PR diff
id: diff
run: |
git fetch origin ${{ github.base_ref }}
git diff origin/${{ github.base_ref }}…HEAD > pr_diff.txt
echo “diff_size=$(wc -c < pr_diff.txt)” >> $GITHUB_OUTPUT


name: AI Review
id: review
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
python3 .github/scripts/code_review.py —diff pr_diff.txt 

—config .github/ai-review-config.yml 

—output review_output.json


name: Post Review
env:
GH_TOKEN: ${{ github.token }}
run: |
python3 .github/scripts/post_review.py 

—review review_output.json 

—pr ${{ github.event.number }}

The Prompt Strategy

The secret sauce is the system prompt. After 8 iterations, we landed on this structure:

You are a senior engineer reviewing a pull request. Analyze the diff for:

BUGS (priority: critical) — Logic errors, null pointer risks, race conditions
SECURITY (priority: high) — Injection vectors, auth gaps, exposed secrets
PERFORMANCE (priority: medium) — N+1 queries, memory leaks, unnecessary allocations
MAINTAINABILITY (priority: low) — Dead code, overly complex functions, naming issues

Rules:

Skip test files, config files, and auto-generated code
If the diff is >500 lines, focus on critical/high issues only
For each issue: file, line, severity, explanation, suggested fix (brief)
Rate the overall quality: pass / needs-work / fail
Keep total response under 2000 tokens

This prompt matters because:

Severity tiers prevent the AI from wasting tokens on formatting nits
Skip rules avoid noise from generated files
Diff-size awareness ensures large PRs still get useful reviews
Character limit keeps costs predictable

The Results

We ran this on 47 pull requests over 4 weeks across 3 repositories. Here's what we measured:

Metric	Before	After	Change
Median PR review time	4.5 hours	2.8 hours	-38%
Bugs reaching production	6 / month	2 / month	-67%
Review cost per PR	$0.00 (human)	$0.11 (AI)	Negligible
False positive rate	N/A	18%	Acceptable
Comment acceptance rate	N/A	64%	High

Breakdown by Issue Category

Of the 89 issues flagged by the AI reviewer:

Critical bugs detected: 21 (e.g., null dereferences, incorrect SQL parameter binding)
Security concerns: 11 (e.g., unsanitized user input in API routes, hardcoded tokens in test fixtures)
Performance issues: 15 (e.g., N+1 ORM queries, redundant API calls in loops)
Maintainability: 42 (e.g., duplicate code blocks, functions exceeding 100 lines)

Human reviewers confirmed 57 of 89 findings as legitimate issues they would have caught eventually — but much later in the cycle. The remaining 32 were either false positives (16) or stylistic preferences the team disagreed with (16).

Lessons Learned

1. Diff Size Matters More Than You Think

PRs under 200 lines got excellent reviews (72% acceptance). PRs over 500 lines were hit-or-miss — the AI lost coherence and started hallucinating issues. We added a soft cap: anything over 500 lines triggers a summary-only review.

2. False Positives Erode Trust

Our initial 32% false positive rate caused developers to skim or ignore AI reviews. Two changes fixed this:

Lowered the confidence threshold for "low" severity issues (they stopped being reported)
Added a "dismiss" button pattern — if 3+ devs dismissed the same type of finding, we tuned the prompt

3. Context Is Everything

The AI was surprisingly bad at detecting issues that spanned more than one file. For example, it missed an API endpoint that didn't validate input while the downstream service expected validated data. This is a known limitation of diff-based review — you need full-repo context for cross-file bugs.

4. Cost Tracking Is Essential

At $0.11 per review, the pipeline costs about $3.30 per month for a team of 5. That's less than a coffee. But costs spike when a developer pushes 15 small commits to one PR — each commit triggers a fresh review. We added a debounce: only review the final commit after 5 minutes of inactivity.

What We'd Do Differently

Add a learning loop. When a human dismisses a finding, the model should incorporate that feedback. Right now we manually update the prompt.
Local model fallback. For sensitivity reviews (auth, tokens), we want a local model to avoid sending code off-premises.
Richer context. Instead of just the diff, pass in the function's parent class and relevant type definitions.

Tools Used

GitHub Actions — Workflow automation and event triggers
Claude Sonnet 4.6 — Review engine (best cost/quality balance)
Python 3.11 — Scripting for diff extraction and review logic
PyGithub — API client for PR comment posting
JSON config — Per-repo review rules and severity thresholds

Frequently Asked Questions

How much does this cost per month?

About $3-5 for a small team running 40-60 PRs per month. Claude Sonnet 4.6 costs $3/$15 per million tokens, and each review averages 1,500 input tokens + 800 output tokens = ~$0.0045 per review in API costs. The remaining cost is GitHub Actions minutes (~$0.008 per workflow run).

Does this replace human code review?

No. It replaces the first pass — catching formatting issues, common bug patterns, and security red flags before a human spends cognitive energy on them. Human reviewers still catch architectural problems, design tradeoffs, and cross-cutting concerns.

Which model works best for code review?

We tested Claude Sonnet 4.6, GPT-5.3 Codex, and Gemini 3.1 Pro. Sonnet had the best balance of bug detection accuracy (82% precision) and cost ($3/M tokens). Codex was slightly more accurate on Python but 2x the cost. Gemini excelled at full-repo analysis but was slower.

Can I run this on private repositories?

Yes. Everything runs inside your GitHub Actions runner. Your code never leaves the GitHub Actions environment except for the API call to Claude. For sensitive repositories, add a local model fallback using Ollama or vLLM.

Build Your Own

The full pipeline is about 100 lines of YAML plus two Python scripts. You can clone the template, swap in your API key, and have it running on your next PR in under 30 minutes. The results — 38% faster reviews, 67% fewer bugs in production, and a cost of $0.11 per review — speak for themselves. For more automation patterns, see our Local RAG Pipeline guide which uses a similar event-driven approach.

Code review is the kind of repetitive, pattern-matching work that AI does well. Let it handle the first pass. Your senior engineers will thank you.

← Back to all posts

Build Log: AI-Powered Code Review Pipeline With GitHub Actions + Claude