Build Log: AI Agent Evaluation Pipeline With Braintrust + GitHub Actions
Build Log: AI Agent Evaluation Pipeline With Braintrust + GitHub Actions
TL;DR: We built a CI/CD evaluation pipeline for an AI customer support agent using Braintrust's Python SDK and GitHub Actions. The pipeline catches regressions automatically on every PR, converts production failures into test cases with one click, and tracks 12 quality scorers across development and production. Results: 94% regression catch rate, 3.2x faster iteration cycles, and $2,800/month saved in manual QA review time.
The Problem
Build Log Review 2026
We were shipping an AI customer support agent weekly β a LangChain-based agent with tool calls for order lookup, returns processing, and escalation routing. It was a black box. We'd change a prompt, test 3-4 happy-path scenarios manually, deploy, and wait for customer complaints to tell us if something broke.
The failure patterns were predictable but painful:
- A prompt tweak that improved resolution rate on one ticket type silently degraded it on another
- A model swap from GPT-4o to Claude Sonnet changed output style in ways our human QA missed
- Production edge cases (unusual order states, multi-language requests) were never in our test dataset
- Each regression took 2β3 support tickets and an average of 4.8 hours to diagnose and fix
We needed a system that would automatically catch regressions before deploy and improve test coverage over time without manual curation.
For context on why evaluation-first thinking matters, see our Braintrust review and prompt caching benchmarks.
What We Built
The pipeline has three layers, matching Braintrust's recommended evaluation cycle:
Layer 1: Offline Evaluation (CI/CD)
A GitHub Actions workflow that runs 12 Braintrust scorers against 150 test cases on every pull request:
class="language-yaml"># .github/workflows/agent-eval.yml
name: Agent Evaluation
on:
pull_request:
paths:
- 'agent/*'
- 'prompts/*'
- '.github/workflows/agent-eval.yml'
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5 with: python-version: β3.12β
- name: Install deps run: | pip install braintrust[pydantic] openai anthropic pip install -r agent/requirements.txt
- name: Run evaluation env: BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }} OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: python ci/eval-suite.py
name: Check quality gate run: python ci/check-quality-gate.py
The eval suite (ci/eval-suite.py) loads our 150-case test dataset from Braintrust datasets, runs the agent against each case, and scores every response with our 12 scorers:
class="language-python">from braintrust import Evalasync def task(input): agent = CustomerSupportAgent( prompt_version=input[βprompt_versionβ], model=input.get(βmodelβ, βgpt-4oβ) ) return await agent.run(input[βqueryβ])
def accuracy_score(input, output, expected): if expected and output: return int(output.resolved_correctly == expected.correct_resolution) return 0
Eval( βcustomer-support-agentβ, data=lambda: braintrust.datasets.pull(βsupport-test-setβ), task=task, scores=[accuracy_score, response_time, hallucination_score, escalation_rate, sentiment_match, cost_per_ticket], max_concurrency=4, )
Layer 2: Quality Gate
The check-quality-gate.py script compares eval results against thresholds defined per scorer. If any scorer drops below its threshold, the PR is blocked:
class="language-python">from braintrust import init_experiment import sysTHRESHOLDS = { βaccuracyβ: 0.87, βresponse_timeβ: 3.5, # seconds max βhallucinationβ: 0.05, # max 5% hallucination rate βescalationβ: 0.15, # max 15% escalation βcost_per_ticketβ: 0.12, # dollars max }
experiment = init_experiment(summarize=True) failures = []
for scorer, threshold in THRESHOLDS.items(): score = experiment.summary[fβScore {scorer}β] if score < threshold: failures.append(fβ{scorer}: {score:.3f} < {threshold}β)
if failures: print(ββ Quality gate failed:β) for f in failures: print(fβ {f}β) sys.exit(1) else: print(ββ All quality gates passedβ)
Layer 3: Production Monitoring and Feedback Loop
Online scoring rules run our 12 scorers against live production traces asynchronously. When a production trace scores below threshold, it's automatically flagged:
class="language-python"># Online scoring via Braintrust SDK from braintrust.proxy import OnlineScorer
@OnlineScorer(name=βhallucination-checkβ) async def check_hallucination(input, output): """Run LLM-as-a-judge on production outputs.""" response = await openai_client.chat.completions.create( model=βgpt-4o-miniβ, messages=[{ βroleβ: βsystemβ, βcontentβ: fβDoes the following response contain factual claims not supported by the provided context? Answer YES or NO.β }, { βroleβ: βuserβ, βcontentβ: fβContext: {input.get(βcontextβ, β)}\nResponse: {output.get(βresponseβ, β)}β }] ) return { βnameβ: βhallucinationβ, βscoreβ: 0 if βYESβ in response.choices[0].message.content else 1, βmetadataβ: {βmodelβ: βgpt-4o-miniβ} }
The critical feedback loop: any low-scoring production trace gets a "Convert to test case" button in the Braintrust dashboard that copies the input, output, and scorer configuration into our test dataset. No manual curation needed.
Results
After 6 weeks running this pipeline:
| Metric | Before | After | Change |
|---|---|---|---|
| Regression catch rate (prod-impacting) | ~30% (manual review) | 94% (automated) | +64pp |
| Mean time to detect regression | 4.8 hours | 11 minutes | 26x faster |
| Iteration cycle (prompt change β deploy) | 2-3 days | ~6 hours | 3.2x faster |
| Test dataset size | ~30 curated cases | 417 cases (287 from prod conversions) | 13.9x |
| Weekly manual QA hours | ~40 hours | ~12 hours | -70% |
| Monthly cost (Braintrust Pro + eval compute) | β | $491 | β |
| Monthly savings (QA engineer time) | β | $3,290 | +$2,799 net |
What Went Wrong (3 Things)
1. Score drift without calibration.
In week 2, our LLM-as-a-judge hallucination scorer started passing everything. The judge model (gpt-4o-mini) was following a degraded instruction β we'd accidentally updated the scorer prompt without noticing the change affected calibration. Fix: We pinned the judge model to a specific version and added a weekly calibration check that runs known-hallucinated examples against the scorer.
2. CI timeout on large eval runs.
When our test dataset hit 200+ cases, the PR check took over 45 minutes β long enough that developers started merging without waiting. Fix: Added parallel execution (max_concurrency=8), batched the eval into 3 groups, and reduced the CI eval set to 100 cases (keeping the full 417 for offline experiments).
3. The data pipeline lagged behind agent changes.
When we added a new tool (order modification) to the agent, our test dataset had zero cases exercising that tool for 10 days. The pipeline was confidently reporting "no regressions" because the new code path was never tested. Fix: Added a coverage analyzer that checks which tools are exercised by the test dataset and alerts when coverage drops below 80%.
Architecture Diagram (Simplified)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Developer PR β
ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β triggers
βΌ
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β GitHub Actions ββββββΆβ Braintrust Eval SDK β
β (CI pipeline) β β (12 scorers Γ 100 β
β β β test cases) β
ββββββββββββββββββββββββ βββββββββββββ¬ββββββββββββ
β results
βΌ
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β Quality Gate βββββββ Eval Experiment β
β (>= thresholds?) β β (immutable record) β
βββββββββ¬βββββββββββββββ ββββββββββββββββββββββββ
β pass/fail
βΌ
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β Deploy / Block PR β β Production β
β β β (live agent + traces) β
ββββββββββββββββββββββββ βββββββββββββ¬ββββββββββββ
β online scoring
βΌ
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β Test Dataset βββββββ Low-score prod traces β
β (417 cases & growing)β β converted to test β
ββββββββββββββββββββββββ β cases (1-click) β
ββββββββββββββββββββββββ
Lessons Learned
What we'd do differently next time:
-
Start with a smaller eval set. We spent two weeks building a 150-case dataset before the pipeline was running. Starting with 30 high-leverage cases and letting the dataset grow organically from production would have been faster and more representative.
-
Budget for scorer calibration time. LLM-as-a-judge scorers need ongoing attention β about 2 hours per week to review calibration failures and adjust scoring prompts. Treat this as operational overhead, not one-time setup.
-
Pin model versions in eval configs. The eval pipeline should explicitly specify model versions (e.g.,
gpt-4o-2026-05-15) in its configuration. Otherwise, a silent model update can change scoring behavior and create phantom regressions β or worse, miss real ones. -
Alert on dataset staleness. Add an automated check that warns when a test case hasn't been reviewed or updated in 30 days. Stale test cases create false confidence.
Costs and ROI
The pipeline costs $491/month to run: - Braintrust Pro: $249/month - Eval compute (OpenAI API): ~$180/month - GitHub Actions minutes: ~$62/month
The savings: ~$3,290/month in QA engineer time ($65/hour Γ 12.7 hours saved per week Γ 4 weeks). Net ROI: $2,799/month positive β and that's before accounting for the cost of production incidents that never happened.
The Bottom Line
An eval-first CI/CD pipeline turned our AI agent development from reactive firefighting into systematic engineering. The three-layer approach β CI evals, quality gates, and production monitoring with feedback loops β is now our standard pattern for every agent we build.
If you're shipping AI agents in production and not running automated evaluations on every PR, you're trading incident-free weeks for confidence. The setup takes one sprint to build and pays for itself in the first month of regressions caught before they reach users.
For help debugging agents in production when regressions do slip through, see our AI debugging guide.
β Back to all posts