Build Log: AI-Powered Content Quality Enforcement Pipeline — From 67% Average to 78% in 4 Days
TL;DR: After publishing 82 posts in 10 days with inconsistent quality, we built an automated quality enforcement pipeline that scores every post across readability, depth, structure, actionability, and originality — with auto-fix hooks that catch broken formatting, duplicate lexical entries, and missing feature images before readers see them.
When we launched toolbrain.net's automated blog pipeline with 19 systemd timers generating posts around the clock, the throughput was impressive — but the quality was a rollercoaster. Some posts scored in the high 70s and 80s. Others, especially quick news items and tips, tanked below 50. Readers didn't complain (yet), but the data told a story we couldn't ignore.
The Problem: Inconsistent Quality at Scale
The blog pipeline generated posts across 15 categories, including news, guides, reviews, build logs, and tips. Each category had its own template, but enforcement was manual: a post-publish review would catch broken code blocks, missing feature images, or duplicate titles, but only after the post was live.
After two weeks of operation, the quality metrics showed a clear pattern:
| Score Range | % of Posts | Typical Issue |
|---|---|---|
| 80-100 (Excellent) | 15% | Build logs, deep guides with code blocks |
| 70-79 (Solid) | 40% | Most standard posts — good structure, average depth |
| 60-69 (Adequate) | 20% | News posts, tips — thin on depth and links |
| 50-59 (Weak) | 15% | Quick news items, no code blocks, few data points |
| Below 50 (Failing) | 10% | Very short posts, missing actionable takeaways |
What We Built: The Quality Enforcement Pipeline
The pipeline has three layers, each catching different failure modes:
Layer 1: Pre-Flight Validation
Before any post reaches Ghost, a preflight-validate.py script checks the raw HTML for 7 specific failure modes:
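    import re

    # Assumes `html` already holds the post's raw HTML body as a string.
    # The two link patterns are illustrative; the original only states the counts.
    internal = re.findall(r'href="https?://(?:www\.)?toolbrain\.net', html)
    external = re.findall(r'href="https?://(?!(?:www\.)?toolbrain\.net)', html)

    checks = {
        "No H1 in body": '<h1' not in html,          # Ghost renders title as H1
        "No broken hr": '<hr' not in html,           # Markdown --- becomes <hr>
        "Has pre tags": '<pre>' in html,             # Code blocks work
        "Has internal links": len(internal) >= 2,    # at least 2 toolbrain.net URLs
        "Has external links": len(external) >= 2,    # at least 2 external URLs
        "No stray blockquotes": '<p>></p>' not in html,
        "Tables use table tags": '<table>' in html,
    }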
If any check fails, the script exits non-zero and the publish is blocked. This eliminated 100% of broken-formatting publishes from the day it was deployed.
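The blocking behaviour can be sketched roughly as follows. This is a minimal illustration and not the actual preflight-validate.py: the names `failed_checks` and `main` are hypothetical, and only three of the seven checks are reproduced.

```python
import sys


def failed_checks(checks: dict[str, bool]) -> list[str]:
    """Return the names of all checks whose result is False."""
    return [name for name, passed in checks.items() if not passed]


def main(html: str) -> int:
    """Run a subset of the pre-flight checks and return an exit code."""
    # Hypothetical subset of the checks listed above.
    checks = {
        "No H1 in body": "<h1" not in html,
        "No broken hr": "<hr" not in html,
        "Has pre tags": "<pre>" in html,
    }
    failures = failed_checks(checks)
    for name in failures:
        # Report each failing check so the author knows what to fix.
        print(f"FAILED: {name}", file=sys.stderr)
    # A non-zero return value lets the calling publish step abort.
    return 1 if failures else 0
```

A wrapper script would read the post HTML, call `main(html)`, and pass the result to `sys.exit()` so that the publish step can stop on any non-zero code.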
Layer 2: Post-Publish Quality Scoring
After every publish, content-quality-scorer.py scores the post across five dimensions:
| Dimension | What It Measures | Weight |
|---|---|---|
| Readability | Sentence length, passive voice ratio | 20% |
| Depth | External links, data points cited, code blocks | 30% |
| Structure | Heading count, lists, sub-sections | 20% |
| Actionability | Action verbs, concrete steps, takeaways | 20% |
| Originality | Citations, synthesis signals — connecting ideas | 10% |
Each dimension is scored 0-10 and weighted for a composite 0-100 score. The results are logged to a running quality history and published in the post's compliance report.
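One plausible reading of the weighting is sketched below. The weights come from the table; the function name `composite_score`, the dictionary keys, and the scaling step (multiplying the weighted 0-10 average by 10) are assumptions about how the 0-100 composite is derived.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "readability": 0.20,
    "depth": 0.30,
    "structure": 0.20,
    "actionability": 0.20,
    "originality": 0.10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-10) into a 0-100 composite."""
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} score must be within 0-10")
    # Weighted average on the 0-10 scale.
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Assumed scaling: multiply by 10 to reach the 0-100 range.
    return round(weighted * 10, 1)


example = {
    "readability": 7,
    "depth": 6,
    "structure": 8,
    "actionability": 7,
    "originality": 5,
}
print(composite_score(example))
```

Because depth carries the largest weight, a post that is thin on links, data points, and code blocks loses more points than one with the same shortfall in any other dimension.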
Layer 3: Auto-Fix Enforcement
The most critical layer is the post-quality-enforce.py script, which runs after every publish and can auto-fix:
- Empty or broken lexical JSON (regenerates from HTML)
- Duplicate lexical + mobiledoc cross-contamination
- Missing <pre> tags in posts with code blocks
- Stray <hr> from markdown section separators
- Missing feature image files on disk
The auto-fix applied 81 post corrections in a single weekly scan, fixing 3 months of accumulated data quality issues.
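As an illustration of the detection step for the first item, the sketch below finds published posts that have HTML but no lexical content. It is not the actual post-quality-enforce.py: the function name `find_blank_lexical` is hypothetical, and the demo table only mimics the few columns the query needs, so the column names should be checked against the real Ghost schema before use.

```python
import sqlite3


def find_blank_lexical(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (id, title) for published posts with HTML but no lexical."""
    rows = conn.execute(
        """
        SELECT id, title
        FROM posts
        WHERE status = 'published'
          AND html IS NOT NULL AND TRIM(html) != ''
          AND (lexical IS NULL OR TRIM(lexical) = '')
        ORDER BY id
        """
    )
    return rows.fetchall()


# Demo against an in-memory table that mimics only the columns used here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id TEXT, title TEXT, status TEXT, html TEXT, lexical TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("1", "Healthy post", "published", "<p>hi</p>", '{"root": {}}'),
        ("2", "Blank render", "published", "<p>hi</p>", ""),
        ("3", "Draft", "draft", "<p>hi</p>", None),
    ],
)
print(find_blank_lexical(conn))  # [('2', 'Blank render')]
```

Keeping detection read-only and separate from the fix makes it possible to review the list of affected posts before anything is written back.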
Results: Before and After
The enforcement pipeline was deployed in two phases. Phase 1 (pre-flight validation) went live on May 8. Phase 2 (post-publish scoring + auto-fix) went live on May 12.
| Metric | Before (May 7-11) | After (May 12-16) |
|---|---|---|
| Posts with broken lexical | ~15% | 0% |
| Posts with missing feature images | ~8% | 0% |
| Posts with broken code blocks | ~12% | 0% |
| Average quality score | 67.2 | 78.4 |
| Posts scoring 80+ | 8% | 35% |
| Posts scoring below 60 | 25% | 5% |
Tools Used
- Python 3.11: All scripting (preflight-validate.py, content-quality-scorer.py, post-quality-enforce.py)
- SQLite3: Ghost DB queries for lexical/mobiledoc/html inspection
- regex: HTML parsing for structure/readability analysis (no BeautifulSoup dependency)
- systemd timers: Weekly full-scan runs at low traffic hours
- events.md: Centralized logging of all auto-fix actions
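To show what regex-only structure analysis might look like, here is a small sketch. The function name `structure_metrics` and the specific patterns are assumptions, not the scorer's actual rules; regex counting like this is approximate and works best on the predictable HTML a templated pipeline emits.

```python
import re


def structure_metrics(html: str) -> dict[str, int]:
    """Count structural elements in an HTML fragment using regex only."""
    flags = re.IGNORECASE
    return {
        # h2-h6 only: the H1 is the post title rendered by Ghost.
        "headings": len(re.findall(r"<h[2-6]\b", html, flags)),
        "list_items": len(re.findall(r"<li\b", html, flags)),
        "code_blocks": len(re.findall(r"<pre\b", html, flags)),
        "links": len(re.findall(r"<a\b[^>]*\bhref=", html, flags)),
    }


sample = (
    "<h2>Setup</h2><ul><li>one</li><li>two</li></ul>"
    "<pre><code>print(1)</code></pre><a href='https://example.com'>ref</a>"
)
print(structure_metrics(sample))
```

The word-boundary anchors (`\b`) keep `<li` from matching `<link` and `<a` from matching `<article`, which is the main pitfall when counting tags without a parser.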
Lessons Learned
- Bulk auto-fix is dangerous. Our first --all scan corrupted 12 posts' lexical data by cross-contaminating content between unrelated posts. Every bulk operation must verify output before writing. We added pre-write content consistency checks that verify key terms from the title appear in the generated lexical.
- Quality = structure × data, not word count. We originally optimized for word count, but the data showed that posts with high structure scores (many headings, lists, sub-sections) and high data density (links, numbers, examples) consistently outperformed longer posts without structure.
- Know the rendering engine. Ghost v6 renders from lexical JSON, not the html field. Early posts had correct HTML but empty lexical, so they rendered as blank pages. The enforcement pipeline catches this automatically now.
- News posts need different standards. Quick news items naturally score lower on depth and actionability. We adjusted category-specific thresholds: news posts need 60+, guides and build logs need 75+.
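Two of these lessons lend themselves to a short sketch: the pre-write consistency check and the category-specific thresholds. Everything named here (`key_terms`, `is_consistent`, `meets_threshold`, the stop-word list, the 0.5 match ratio, and the category keys) is hypothetical; only the 60+ and 75+ thresholds come from the text.

```python
import re

# Hypothetical stop-word list; tune it for the blog's own titles.
STOP_WORDS = {"the", "and", "for", "with", "from", "into", "how", "why"}

# Thresholds stated in the text; the category keys are assumed spellings.
MIN_SCORE = {"news": 60, "guide": 75, "build-log": 75}


def key_terms(title: str) -> set[str]:
    """Lower-cased title words of 3+ characters, minus stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if len(w) >= 3 and w not in STOP_WORDS}


def is_consistent(title: str, lexical_json: str, min_ratio: float = 0.5) -> bool:
    """True if enough title terms occur in the serialized lexical content."""
    terms = key_terms(title)
    if not terms:
        return True  # nothing to compare against
    haystack = lexical_json.lower()
    hits = sum(1 for term in terms if term in haystack)
    return hits / len(terms) >= min_ratio


def meets_threshold(category: str, score: float) -> bool:
    """Compare a composite score with its category's minimum."""
    if category not in MIN_SCORE:
        raise ValueError(f"no threshold defined for category {category!r}")
    return score >= MIN_SCORE[category]
```

A bulk fixer would call `is_consistent` on each regenerated lexical string and skip the write when it returns False, which guards against content from one post being written into another.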
Frequently Asked Questions
What happens when a post fails validation?
The publish is blocked before the SQLite write. The error output shows exactly which checks failed, so the issue can be fixed before retrying. For auto-fixable issues (broken lexical, missing images), the enforcement script fixes them automatically after publish.
Does the pipeline slow down publishing?
Negligibly. Pre-flight validation completes in under 200ms for any post. The post-publish quality scorer runs in under 500ms. The weekly full scan is the heaviest (about 30 seconds for 150+ posts) but runs at 3 AM.
Can this pipeline be reused for other Ghost blogs?
Yes. The scripts are standalone Python files that connect to any Ghost SQLite database. The only Ghost-specific logic is the lexical JSON generation and the DB schema. The quality scoring logic is content-agnostic.
Sources
- Build Log: Blog Automation Pipeline — the generation side of this system
- OpenClaw Automation Guide — systemd timer scheduling for periodic enforcement runs