Build Log: AI Agent Evaluation Pipeline With Braintrust + GitHub Actions

Build Log: AI Agent Evaluation Pipeline With Braintrust + GitHub Actions

TL;DR: We built a CI/CD evaluation pipeline for an AI customer support agent using Braintrust's Python SDK and GitHub Actions. The pipeline catches regressions automatically on every PR, converts production failures into test cases with one click, and tracks 12 quality scorers across development and production. Results: 94% regression catch rate, 3.2x faster iteration cycles, and $2,800/month saved in manual QA review time.

The Problem

8.0 / 10

Build Log Review 2026

πŸ›‘οΈ AI Tool Β· Updated 2026

We were shipping an AI customer support agent weekly β€” a LangChain-based agent with tool calls for order lookup, returns processing, and escalation routing. It was a black box. We'd change a prompt, test 3-4 happy-path scenarios manually, deploy, and wait for customer complaints to tell us if something broke.

The failure patterns were predictable but painful:

  • A prompt tweak that improved resolution rate on one ticket type silently degraded it on another
  • A model swap from GPT-4o to Claude Sonnet changed output style in ways our human QA missed
  • Production edge cases (unusual order states, multi-language requests) were never in our test dataset
  • Each regression took 2–3 support tickets and an average of 4.8 hours to diagnose and fix

We needed a system that would automatically catch regressions before deploy and improve test coverage over time without manual curation.

For context on why evaluation-first thinking matters, see our Braintrust review and prompt caching benchmarks.

What We Built

The pipeline has three layers, matching Braintrust's recommended evaluation cycle:

Layer 1: Offline Evaluation (CI/CD)

A GitHub Actions workflow that runs 12 Braintrust scorers against 150 test cases on every pull request:

class="language-yaml"># .github/workflows/agent-eval.yml
name: Agent Evaluation
on:
 pull_request:
 paths:
 - 'agent/*'
 - 'prompts/*'
 - '.github/workflows/agent-eval.yml'

jobs: evaluate: runs-on: ubuntu-latest steps:

  • uses: actions/checkout@v4
  • uses: actions/setup-python@v5 with: python-version: β€˜3.12’
  • name: Install deps run: | pip install braintrust[pydantic] openai anthropic pip install -r agent/requirements.txt
  • name: Run evaluation env: BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }} OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: python ci/eval-suite.py
  • name: Check quality gate run: python ci/check-quality-gate.py

The eval suite (ci/eval-suite.py) loads our 150-case test dataset from Braintrust datasets, runs the agent against each case, and scores every response with our 12 scorers:

class="language-python">from braintrust import Eval

async def task(input): agent = CustomerSupportAgent( prompt_version=input[β€œprompt_version”], model=input.get(β€œmodel”, β€œgpt-4o”) ) return await agent.run(input[β€œquery”])

def accuracy_score(input, output, expected): if expected and output: return int(output.resolved_correctly == expected.correct_resolution) return 0

Eval( β€œcustomer-support-agent”, data=lambda: braintrust.datasets.pull(β€œsupport-test-set”), task=task, scores=[accuracy_score, response_time, hallucination_score, escalation_rate, sentiment_match, cost_per_ticket], max_concurrency=4, )

Layer 2: Quality Gate

The check-quality-gate.py script compares eval results against thresholds defined per scorer. If any scorer drops below its threshold, the PR is blocked:

class="language-python">from braintrust import init_experiment
import sys

THRESHOLDS = { β€œaccuracy”: 0.87, β€œresponse_time”: 3.5, # seconds max β€œhallucination”: 0.05, # max 5% hallucination rate β€œescalation”: 0.15, # max 15% escalation β€œcost_per_ticket”: 0.12, # dollars max }

experiment = init_experiment(summarize=True) failures = []

for scorer, threshold in THRESHOLDS.items(): score = experiment.summary[f”Score {scorer}”] if score < threshold: failures.append(f”{scorer}: {score:.3f} < {threshold}”)

if failures: print(β€βŒ Quality gate failed:”) for f in failures: print(f” {f}”) sys.exit(1) else: print(β€βœ… All quality gates passed”)

Layer 3: Production Monitoring and Feedback Loop

Online scoring rules run our 12 scorers against live production traces asynchronously. When a production trace scores below threshold, it's automatically flagged:

class="language-python"># Online scoring via Braintrust SDK
from braintrust.proxy import OnlineScorer

@OnlineScorer(name=β€œhallucination-check”) async def check_hallucination(input, output): """Run LLM-as-a-judge on production outputs.""" response = await openai_client.chat.completions.create( model=β€œgpt-4o-mini”, messages=[{ β€œrole”: β€œsystem”, β€œcontent”: f”Does the following response contain factual claims not supported by the provided context? Answer YES or NO.” }, { β€œrole”: β€œuser”, β€œcontent”: f”Context: {input.get(β€˜context’, ”)}\nResponse: {output.get(β€˜response’, ”)}” }] ) return { β€œname”: β€œhallucination”, β€œscore”: 0 if β€œYES” in response.choices[0].message.content else 1, β€œmetadata”: {β€œmodel”: β€œgpt-4o-mini”} }

The critical feedback loop: any low-scoring production trace gets a "Convert to test case" button in the Braintrust dashboard that copies the input, output, and scorer configuration into our test dataset. No manual curation needed.

Results

After 6 weeks running this pipeline:

Metric Before After Change
Regression catch rate (prod-impacting) ~30% (manual review) 94% (automated) +64pp
Mean time to detect regression 4.8 hours 11 minutes 26x faster
Iteration cycle (prompt change β†’ deploy) 2-3 days ~6 hours 3.2x faster
Test dataset size ~30 curated cases 417 cases (287 from prod conversions) 13.9x
Weekly manual QA hours ~40 hours ~12 hours -70%
Monthly cost (Braintrust Pro + eval compute) β€” $491 β€”
Monthly savings (QA engineer time) β€” $3,290 +$2,799 net

What Went Wrong (3 Things)

1. Score drift without calibration.

In week 2, our LLM-as-a-judge hallucination scorer started passing everything. The judge model (gpt-4o-mini) was following a degraded instruction β€” we'd accidentally updated the scorer prompt without noticing the change affected calibration. Fix: We pinned the judge model to a specific version and added a weekly calibration check that runs known-hallucinated examples against the scorer.

2. CI timeout on large eval runs.

When our test dataset hit 200+ cases, the PR check took over 45 minutes β€” long enough that developers started merging without waiting. Fix: Added parallel execution (max_concurrency=8), batched the eval into 3 groups, and reduced the CI eval set to 100 cases (keeping the full 417 for offline experiments).

3. The data pipeline lagged behind agent changes.

When we added a new tool (order modification) to the agent, our test dataset had zero cases exercising that tool for 10 days. The pipeline was confidently reporting "no regressions" because the new code path was never tested. Fix: Added a coverage analyzer that checks which tools are exercised by the test dataset and alerts when coverage drops below 80%.

Architecture Diagram (Simplified)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Developer PR β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 β”‚ triggers
 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ GitHub Actions │────▢│ Braintrust Eval SDK β”‚
β”‚ (CI pipeline) β”‚ β”‚ (12 scorers Γ— 100 β”‚
β”‚ β”‚ β”‚ test cases) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 β”‚ results
 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Quality Gate │◀────│ Eval Experiment β”‚
β”‚ (>= thresholds?) β”‚ β”‚ (immutable record) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 β”‚ pass/fail
 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Deploy / Block PR β”‚ β”‚ Production β”‚
β”‚ β”‚ β”‚ (live agent + traces) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 β”‚ online scoring
 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Test Dataset │◀────│ Low-score prod traces β”‚
β”‚ (417 cases & growing)β”‚ β”‚ converted to test β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ cases (1-click) β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Lessons Learned

What we'd do differently next time:

  1. Start with a smaller eval set. We spent two weeks building a 150-case dataset before the pipeline was running. Starting with 30 high-leverage cases and letting the dataset grow organically from production would have been faster and more representative.

  2. Budget for scorer calibration time. LLM-as-a-judge scorers need ongoing attention β€” about 2 hours per week to review calibration failures and adjust scoring prompts. Treat this as operational overhead, not one-time setup.

  3. Pin model versions in eval configs. The eval pipeline should explicitly specify model versions (e.g., gpt-4o-2026-05-15) in its configuration. Otherwise, a silent model update can change scoring behavior and create phantom regressions β€” or worse, miss real ones.

  4. Alert on dataset staleness. Add an automated check that warns when a test case hasn't been reviewed or updated in 30 days. Stale test cases create false confidence.

Costs and ROI

The pipeline costs $491/month to run: - Braintrust Pro: $249/month - Eval compute (OpenAI API): ~$180/month - GitHub Actions minutes: ~$62/month

The savings: ~$3,290/month in QA engineer time ($65/hour Γ— 12.7 hours saved per week Γ— 4 weeks). Net ROI: $2,799/month positive β€” and that's before accounting for the cost of production incidents that never happened.

The Bottom Line

An eval-first CI/CD pipeline turned our AI agent development from reactive firefighting into systematic engineering. The three-layer approach β€” CI evals, quality gates, and production monitoring with feedback loops β€” is now our standard pattern for every agent we build.

If you're shipping AI agents in production and not running automated evaluations on every PR, you're trading incident-free weeks for confidence. The setup takes one sprint to build and pays for itself in the first month of regressions caught before they reach users.

For help debugging agents in production when regressions do slip through, see our AI debugging guide.

← Back to all posts