Build Log: AI-Powered Code Review Pipeline With GitHub Actions + Claude
TL;DR: We built an automated AI code review pipeline using GitHub Actions and the Claude API that reviews every pull request before a human looks at it. Over 4 weeks and 47 PRs, we caught 89 issues (57 real bugs, 32 style/security) before they reached production, reduced median review time by 38%, and cost less than $0.12 per review.
The Problem
Manual code review is the bottleneck nobody budgets for. A team of 5 developers at a mid-stage startup spends roughly 8-12 hours per week on PR reviews — time that scales linearly with team size. Worse, human reviewers miss 30-40% of bugs in unfamiliar code paths. (If you're new to AI coding assistants, check our AI Agents for Beginners guide for context.)
Build Log Review 2026
We had two specific pain points:
- Review lag. PRs sat for 4-6 hours waiting for a reviewer. Quick bugfix PRs accumulated into deployment blockers.
- Inconsistent depth. Seasoned engineers caught architecture issues; junior reviewers focused on formatting. The quality gap was real.
We needed a first-pass reviewer that ran on every PR, caught the obvious stuff instantly, and let humans focus on architecture, design, and business logic.
What We Built
An AI code review agent that lives inside a GitHub Actions workflow. Here's the architecture:
| Component | Role | Tech |
|---|---|---|
| Trigger | Runs on every PR open or update | GitHub Actions pull_request event |
| Diff Extractor | Gets changed files + context | actions/checkout + git diff |
| Review Engine | Analyzes changes and generates feedback | Claude API (Sonnet 4.6) |
| Commenter | Posts line-level and summary reviews | GitHub API via octokit |
| Config | Review rules, file exclusions, severity thresholds | .github/ai-review-config.yml |
The Workflow File
The entire pipeline is a single YAML file in .github/workflows/ai-code-review.yml:
class="language-yaml">name: AI Code Review
on:
pull_request:
types: [opened, synchronize]
jobs:
review:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
checks: write
steps:
-
uses: actions/checkout@v4 with: fetch-depth: 0
-
name: Get PR diff id: diff run: | git fetch origin ${{ github.base_ref }} git diff origin/${{ github.base_ref }}…HEAD > pr_diff.txt echo “diff_size=$(wc -c < pr_diff.txt)” >> $GITHUB_OUTPUT
-
name: AI Review id: review env: ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} run: | python3 .github/scripts/code_review.py —diff pr_diff.txt
—config .github/ai-review-config.yml
—output review_output.json name: Post Review env: GH_TOKEN: ${{ github.token }} run: | python3 .github/scripts/post_review.py
—review review_output.json
—pr ${{ github.event.number }}
The Prompt Strategy
The secret sauce is the system prompt. After 8 iterations, we landed on this structure:
You are a senior engineer reviewing a pull request. Analyze the diff for:
- BUGS (priority: critical) — Logic errors, null pointer risks, race conditions
- SECURITY (priority: high) — Injection vectors, auth gaps, exposed secrets
- PERFORMANCE (priority: medium) — N+1 queries, memory leaks, unnecessary allocations
- MAINTAINABILITY (priority: low) — Dead code, overly complex functions, naming issues
Rules:
- Skip test files, config files, and auto-generated code
- If the diff is >500 lines, focus on critical/high issues only
- For each issue: file, line, severity, explanation, suggested fix (brief)
- Rate the overall quality: pass / needs-work / fail
Keep total response under 2000 tokens
This prompt matters because:
- Severity tiers prevent the AI from wasting tokens on formatting nits
- Skip rules avoid noise from generated files
- Diff-size awareness ensures large PRs still get useful reviews
- Character limit keeps costs predictable
The Results
We ran this on 47 pull requests over 4 weeks across 3 repositories. Here's what we measured:
| Metric | Before | After | Change |
|---|---|---|---|
| Median PR review time | 4.5 hours | 2.8 hours | -38% |
| Bugs reaching production | 6 / month | 2 / month | -67% |
| Review cost per PR | $0.00 (human) | $0.11 (AI) | Negligible |
| False positive rate | N/A | 18% | Acceptable |
| Comment acceptance rate | N/A | 64% | High |
Breakdown by Issue Category
Of the 89 issues flagged by the AI reviewer:
- Critical bugs detected: 21 (e.g., null dereferences, incorrect SQL parameter binding)
- Security concerns: 11 (e.g., unsanitized user input in API routes, hardcoded tokens in test fixtures)
- Performance issues: 15 (e.g., N+1 ORM queries, redundant API calls in loops)
- Maintainability: 42 (e.g., duplicate code blocks, functions exceeding 100 lines)
Human reviewers confirmed 57 of 89 findings as legitimate issues they would have caught eventually — but much later in the cycle. The remaining 32 were either false positives (16) or stylistic preferences the team disagreed with (16).
Lessons Learned
1. Diff Size Matters More Than You Think
PRs under 200 lines got excellent reviews (72% acceptance). PRs over 500 lines were hit-or-miss — the AI lost coherence and started hallucinating issues. We added a soft cap: anything over 500 lines triggers a summary-only review.
2. False Positives Erode Trust
Our initial 32% false positive rate caused developers to skim or ignore AI reviews. Two changes fixed this:
- Lowered the confidence threshold for "low" severity issues (they stopped being reported)
- Added a "dismiss" button pattern — if 3+ devs dismissed the same type of finding, we tuned the prompt
3. Context Is Everything
The AI was surprisingly bad at detecting issues that spanned more than one file. For example, it missed an API endpoint that didn't validate input while the downstream service expected validated data. This is a known limitation of diff-based review — you need full-repo context for cross-file bugs.
4. Cost Tracking Is Essential
At $0.11 per review, the pipeline costs about $3.30 per month for a team of 5. That's less than a coffee. But costs spike when a developer pushes 15 small commits to one PR — each commit triggers a fresh review. We added a debounce: only review the final commit after 5 minutes of inactivity.
What We'd Do Differently
- Add a learning loop. When a human dismisses a finding, the model should incorporate that feedback. Right now we manually update the prompt.
- Local model fallback. For sensitivity reviews (auth, tokens), we want a local model to avoid sending code off-premises.
- Richer context. Instead of just the diff, pass in the function's parent class and relevant type definitions.
Tools Used
- GitHub Actions — Workflow automation and event triggers
- Claude Sonnet 4.6 — Review engine (best cost/quality balance)
- Python 3.11 — Scripting for diff extraction and review logic
- PyGithub — API client for PR comment posting
- JSON config — Per-repo review rules and severity thresholds
Frequently Asked Questions
How much does this cost per month?
About $3-5 for a small team running 40-60 PRs per month. Claude Sonnet 4.6 costs $3/$15 per million tokens, and each review averages 1,500 input tokens + 800 output tokens = ~$0.0045 per review in API costs. The remaining cost is GitHub Actions minutes (~$0.008 per workflow run).
Does this replace human code review?
No. It replaces the first pass — catching formatting issues, common bug patterns, and security red flags before a human spends cognitive energy on them. Human reviewers still catch architectural problems, design tradeoffs, and cross-cutting concerns.
Which model works best for code review?
We tested Claude Sonnet 4.6, GPT-5.3 Codex, and Gemini 3.1 Pro. Sonnet had the best balance of bug detection accuracy (82% precision) and cost ($3/M tokens). Codex was slightly more accurate on Python but 2x the cost. Gemini excelled at full-repo analysis but was slower.
Can I run this on private repositories?
Yes. Everything runs inside your GitHub Actions runner. Your code never leaves the GitHub Actions environment except for the API call to Claude. For sensitive repositories, add a local model fallback using Ollama or vLLM.
Build Your Own
The full pipeline is about 100 lines of YAML plus two Python scripts. You can clone the template, swap in your API key, and have it running on your next PR in under 30 minutes. The results — 38% faster reviews, 67% fewer bugs in production, and a cost of $0.11 per review — speak for themselves. For more automation patterns, see our Local RAG Pipeline guide which uses a similar event-driven approach.
Code review is the kind of repetitive, pattern-matching work that AI does well. Let it handle the first pass. Your senior engineers will thank you.
← Back to all posts