Build Log: AI-Powered Content Quality Enforcement Pipeline — From 67% Average to 78% in 4 Days

TL;DR: After publishing 82 posts in 10 days with inconsistent quality, we built an automated quality enforcement pipeline that scores every post across readability, depth, structure, actionability, and originality — with auto-fix hooks that catch broken formatting, duplicate lexical entries, and missing feature images before readers see them.

When we launched toolbrain.net's automated blog pipeline with 19 systemd timers generating posts around the clock, the throughput was impressive — but the quality was a rollercoaster. Some posts scored in the high 70s and 80s. Others, especially quick news items and tips, tanked below 50. Readers didn't complain (yet), but the data told a story we couldn't ignore.

The Problem: Inconsistent Quality at Scale

The blog pipeline generated posts across 15 categories — news, guides, reviews, build logs, tips. Each category had its own template, but enforcement was manual. A post-publish review would catch broken code blocks, missing feature images, or duplicate titles — but only after the post was live.

After two weeks of operation, the quality metrics showed a clear pattern:

| Score Range | % of Posts | Typical Issue |
|---|---|---|
| 80-100 (Excellent) | 15% | Build logs, deep guides with code blocks |
| 70-79 (Solid) | 40% | Most standard posts: good structure, average depth |
| 60-69 (Adequate) | 20% | News posts, tips: thin on depth and links |
| 50-59 (Weak) | 15% | Quick news items, no code blocks, few data points |
| Below 50 (Failing) | 10% | Very short posts, missing actionable takeaways |

What We Built: The Quality Enforcement Pipeline

The pipeline has three layers, each catching different failure modes:

Layer 1: Pre-Flight Validation

Before any post reaches Ghost, a preflight-validate.py script checks the raw HTML for 7 specific failure modes:

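import re

# html holds the raw post HTML as a string.
# Assumption: links are counted from href attributes.
urls = re.findall(r'href="(https?://[^"]+)"', html)
internal = [u for u in urls if "toolbrain.net" in u]
external = [u for u in urls if "toolbrain.net" not in u]

checks = {
    "No H1 in body": "<h1" not in html,             # Ghost renders the title as H1
    "No broken hr": "<hr" not in html,              # Markdown --- becomes <hr>
    "Has pre tags": "<pre>" in html,                # Code blocks work
    "Has internal links": len(internal) >= 2,       # at least 2 toolbrain.net URLs
    "Has external links": len(external) >= 2,       # at least 2 external URLs
    "No stray blockquotes": "<p>></p>" not in html, # leftover Markdown quote marker
    "Tables use table tags": "<table>" in html,
}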

If any check fails, the script exits non-zero and the publish is blocked. This eliminated 100% of broken-formatting publishes from the day it was deployed.
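The blocking behaviour described above can be sketched as follows. This is a minimal illustration rather than the actual preflight-validate.py: the helper names run_preflight and exit_code are hypothetical, and only three of the seven checks are shown.

```python
import sys


def run_preflight(html: str) -> list[str]:
    """Return the names of the checks that failed (hypothetical helper)."""
    checks = {
        "No H1 in body": "<h1" not in html,
        "No broken hr": "<hr" not in html,
        "No stray blockquotes": "<p>></p>" not in html,
    }
    return [name for name, passed in checks.items() if not passed]


def exit_code(html: str) -> int:
    """Print each failed check to stderr and return a shell-style exit code."""
    failed = run_preflight(html)
    for name in failed:
        print(f"FAIL: {name}", file=sys.stderr)
    # Non-zero tells the calling publish step to stop.
    return 1 if failed else 0
```

A publish script could then end with `sys.exit(exit_code(html))`, so any failed check stops the run before the post is written.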

Layer 2: Post-Publish Quality Scoring

After every publish, content-quality-scorer.py scores the post across five dimensions:

| Dimension | What It Measures | Weight |
|---|---|---|
| Readability | Sentence length, passive voice ratio | 20% |
| Depth | External links, data points cited, code blocks | 30% |
| Structure | Heading count, lists, sub-sections | 20% |
| Actionability | Action verbs, concrete steps, takeaways | 20% |
| Originality | Citations, synthesis signals (connecting ideas) | 10% |

Each dimension is scored 0-10 and weighted for a composite 0-100 score. The results are logged to a running quality history and published in the post's compliance report.
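As a rough sketch of the weighting arithmetic, assuming the composite is the weighted sum of the 0-10 dimension scores scaled to 0-100 (the function name and dictionary keys are illustrative, not taken from content-quality-scorer.py):

```python
# Weights in percent, as listed in the dimension table; they sum to 100.
WEIGHTS = {
    "readability": 20,
    "depth": 30,
    "structure": 20,
    "actionability": 20,
    "originality": 10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Combine five 0-10 dimension scores into a single 0-100 composite."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score must be between 0 and 10")
    # (percent weight x 0-10 score) sums to at most 1000; divide by 10 for 0-100.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 10


example = composite_score(
    {"readability": 7, "depth": 6, "structure": 8, "actionability": 7, "originality": 5}
)
print(example)  # 67.0
```

Because depth carries the largest weight, a one-point gain there moves the composite by 3 points, against 2 points for readability, structure, or actionability and 1 point for originality.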

Layer 3: Auto-Fix Enforcement

The most critical layer is the post-quality-enforce.py script, which runs after every publish and can auto-fix:

  • Empty or broken lexical JSON (regenerates from HTML)
  • Duplicate lexical + mobiledoc cross-contamination
  • Missing <pre> tags in posts with code blocks
  • Stray <hr> from markdown section separators
  • Missing feature image files on disk

In a single weekly scan, the auto-fix applied 81 post corrections, clearing 3 months of accumulated data quality issues.
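Two of the fixes listed above start with detection steps that are easy to sketch. The example below is an assumption-laden illustration, not the post-quality-enforce.py code: it assumes the lexical field is a JSON document with child nodes under root.children, and the function names are hypothetical. Regenerating lexical from HTML is not shown.

```python
import json
import re


def lexical_is_empty(raw):
    """True if the lexical field is missing, unparseable, or has no child nodes."""
    if not raw:
        return True
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return True
    if not isinstance(doc, dict):
        return True
    root = doc.get("root")
    if not isinstance(root, dict):
        return True
    # Assumed layout: content nodes live under root.children.
    return not root.get("children")


# Matches <hr>, <hr/> and <hr /> in any letter case.
HR_RE = re.compile(r"<hr\s*/?>", re.IGNORECASE)


def strip_stray_hr(html: str) -> str:
    """Remove horizontal rules left behind by Markdown section separators."""
    return HR_RE.sub("", html)
```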

Results: Before and After

The enforcement pipeline was deployed in two phases. Phase 1 (pre-flight validation) went live on May 8. Phase 2 (post-publish scoring + auto-fix) went live on May 12.

| Metric | Before (May 7-11) | After (May 12-16) |
|---|---|---|
| Posts with broken lexical | ~15% | 0% |
| Posts with missing feature images | ~8% | 0% |
| Posts with broken code blocks | ~12% | 0% |
| Average quality score | 67.2 | 78.4 |
| Posts scoring 80+ | 8% | 35% |
| Posts scoring below 60 | 25% | 5% |

Tools Used

  • Python 3.11 — All scripting (preflight-validate.py, content-quality-scorer.py, post-quality-enforce.py)
  • SQLite3 — Ghost DB queries for lexical/mobiledoc/html inspection
  • regex — HTML parsing for structure/readability analysis (no BeautifulSoup dependency)
  • systemd timers — Weekly full-scan runs at low traffic hours
  • events.md — Centralized logging of all auto-fix actions
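The SQLite inspection mentioned in the list above might look like the sketch below. It runs against an in-memory stand-in table so it is self-contained. The column names (id, title, status, html, lexical) are assumptions about the posts table, and the function name is hypothetical.

```python
import sqlite3


def find_posts_with_empty_lexical(conn: sqlite3.Connection) -> list[tuple]:
    """Return (id, title) for published posts that have HTML but no lexical."""
    return conn.execute(
        """
        SELECT id, title
        FROM posts
        WHERE status = 'published'
          AND html IS NOT NULL AND html != ''
          AND (lexical IS NULL OR lexical = '')
        ORDER BY id
        """
    ).fetchall()


# Stand-in database with a simplified posts table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, status TEXT, html TEXT, lexical TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "Good post", "published", "<p>hi</p>", '{"root":{"children":[{"type":"paragraph"}]}}'),
        ("b2", "Blank page", "published", "<p>hi</p>", None),
        ("c3", "Draft", "draft", "<p>hi</p>", None),
    ],
)

broken = find_posts_with_empty_lexical(conn)
print(broken)  # [('b2', 'Blank page')]
```

Against a real database, the same function would take a connection opened on a copy of the Ghost SQLite file. Working on a copy keeps a read-only scan from interfering with the live site.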

Lessons Learned

  1. Bulk auto-fix is dangerous. Our first --all scan corrupted 12 posts' lexical data by cross-contaminating content between unrelated posts. Every bulk operation must verify output before writing. We added pre-write content consistency checks that verify key terms from the title appear in the generated lexical.
  2. Quality = structure × data, not word count. We originally optimized for word count, but the data showed that posts with high structure scores (many headings, lists, sub-sections) and high data density (links, numbers, examples) consistently outperformed longer posts without structure.
  3. Know the rendering engine. Ghost v6 renders from lexical JSON, not the html field. Early posts had correct HTML but empty lexical — rendering as blank pages. The enforcement pipeline catches this automatically now.
  4. News posts need different standards. Quick news items naturally score lower on depth and actionability. We adjusted category-specific thresholds — news posts need 60+, guides and build logs need 75+.
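Lessons 1 and 4 can each be sketched as a small helper. Both are illustrations under stated assumptions: the stopword list, the 50% match ratio, the four-character minimum, and the category keys are invented for the example, while the 60 and 75 thresholds come from the text above.

```python
import re

# Assumed stopword list; a real check would likely use a longer one.
STOPWORDS = {"with", "from", "that", "this", "your", "into"}


def title_terms(title: str) -> set[str]:
    """Lowercased title words longer than three characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def lexical_matches_title(title: str, lexical_text: str, min_ratio: float = 0.5) -> bool:
    """Pre-write check: enough key title terms must appear in the generated lexical."""
    terms = title_terms(title)
    if not terms:
        return True  # nothing to verify against
    body = lexical_text.lower()
    hits = sum(1 for term in terms if term in body)
    return hits / len(terms) >= min_ratio


# Category thresholds from lesson 4; the key spellings are assumptions.
THRESHOLDS = {"news": 60, "guide": 75, "build-log": 75}


def meets_threshold(category: str, score: float) -> bool:
    """Raises KeyError for categories without a defined threshold."""
    return score >= THRESHOLDS[category]
```

A bulk run would call lexical_matches_title before each write and skip the post on a mismatch, which is the kind of guard that would catch content generated for one post being written to another.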

Frequently Asked Questions

What happens when a post fails validation?

The publish is blocked before the SQLite write. The error output shows exactly which checks failed, so the issue can be fixed before retrying. For auto-fixable issues (broken lexical, missing images), the enforcement script fixes them automatically after publish.

Does the pipeline slow down publishing?

Negligibly. Pre-flight validation completes in under 200ms for any post. The post-publish quality scorer runs in under 500ms. The weekly full scan is the heaviest (about 30 seconds for 150+ posts) but runs at 3 AM.

Can this pipeline be reused for other Ghost blogs?

Yes. The scripts are standalone Python files that connect to any Ghost SQLite database. The only Ghost-specific logic is the lexical JSON generation and the DB schema. The quality scoring logic is content-agnostic.
