The Deterministic Horizon (2026): Impossibility Results Are Design Rules for Trustworthy AI
The Bottom Line
Two papers reshape how we think about AI limits and evaluation. The Deterministic Horizon (Guo, 2026) proves that every transformer has a computable accuracy ceiling: past a critical depth, no amount of data, fine-tuning, or adapters can push it higher. Token Arena (Gao et al., 2026) shows that the same model on different serving endpoints can differ by up to 12.5 accuracy points and by 6.2x in energy per correct answer. Together: architecture sets hard bounds, and deployment infrastructure is nowhere near transparent enough for apples-to-apples comparison.
What Each Paper Proposes
The Deterministic Horizon (arXiv:2605.23024)
Dongxin Guo's PhD thesis from the University of Hong Kong (2026, 271 pages) flips a century of impossibility results — Turing undecidability, Arrow's theorem, No Free Lunch — from theoretical curiosities into actionable design specifications for trustworthy AI. The flagship result is the Deterministic Horizon: a provable accuracy ceiling for any transformer architecture, computable before deployment using only two numbers — layer count and embedding width.
The paper proves that past a critical reasoning depth, no amount of additional training data, no choice of adapter rank (LoRA, etc.), no sample size, and no loss function can push accuracy past this ceiling. Across twelve transformer architectures, the Deterministic Horizon falls between 19 and 31 layers. Fine-tuning on optimal-length traces recovers fewer than 4 percentage points. The mechanism is a capacity invariant of the residual stream — an information-theoretic bottleneck baked into the architecture itself.
Beyond the flagship result, Guo catalogs 16 impossibility-specification pairs spanning preference learning (sample complexity jumps discontinuously under misspecified models), multi-stage retrieval (requires at least as many independent metrics as stages), auction theory (truthful auctions fail when agents have prompt-dependent valuations), and zero-knowledge verification of neural inference (measured overhead of 110-190x per non-linear activation).
Token Arena (arXiv:2605.00300)
Token Arena by Gao, Wang, Yu et al. (May 2026) introduces a continuous benchmark for AI inference at endpoint granularity — the (provider, model, SKU) tuple that real deployment decisions actually use. Across 78 endpoints serving 12 model families (including GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro, Llama 4, and DeepSeek V3), they measure five axes: output speed, time-to-first-token, workload-blended price, effective context, and live-endpoint quality.
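To make "endpoint granularity" concrete, here is a minimal sketch of keying results by the (provider, model, SKU) tuple and measuring how far accuracy spreads across endpoints that serve the same model. The class name, function name, SKU labels, and scores are all hypothetical illustrations, not Token Arena's schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointKey:
    """Hypothetical key: the (provider, model, SKU) tuple treated as one unit."""
    provider: str
    model: str
    sku: str


def accuracy_spread_by_model(scores: dict[EndpointKey, float]) -> dict[str, float]:
    """Return max minus min accuracy (in points) across endpoints, per model."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for key, accuracy in scores.items():
        by_model[key.model].append(accuracy)
    return {model: max(vals) - min(vals) for model, vals in by_model.items()}


# Made-up scores for illustration only; these are not results from the paper.
example_scores = {
    EndpointKey("provider-a", "model-x", "fp16"): 80.0,
    EndpointKey("provider-b", "model-x", "int8"): 72.5,
    EndpointKey("provider-a", "model-y", "fp16"): 65.0,
}

print(accuracy_spread_by_model(example_scores))
```

A model-level leaderboard would collapse the two "model-x" rows into one number; keeping the full tuple as the key is what lets the spread show up at all.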
Key Findings
From the Deterministic Horizon
- Architecture sets absolute bounds. The residual stream has a finite information capacity. Once saturated, more layers or more training cannot improve accuracy — the ceiling is structural, not statistical.
- Horizon values are surprisingly narrow. Across twelve architectures — from small encoder-only models to large decoder-only LLMs — the deterministic horizon ranges from 19 to 31 layers. This means many deployed models are already at or near their architecture's fundamental limit.
- Super-exponential decay past the horizon. Adding layers beyond the horizon causes accuracy to collapse — not linearly, but super-exponentially. The paper derives an information-theoretic conversion that predicts this decay rate from the capacity invariant.
- General methodology. The same impossibility-to-specification pattern applies across subfields. Preference learning under model misspecification has a discontinuous jump in sample complexity — you cannot smoothly trade off misspecification for more data.
From Token Arena
- Same model, wildly different results. The same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity by up to 12 points, and in tail latency by an order of magnitude.
- Energy per correct answer varies 6.2x. Token Arena's modeled joules-per-correct-answer metric reveals massive efficiency differences between serving stacks that standard benchmarks ignore.
- Workload reshuffles the leaderboard. Of the top 10 endpoints under the chat preset (3:1 input:output), 7 drop out of the top 10 under the retrieval-augmented preset (20:1). The reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price.
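The workload effect can be sketched with a simple weighted average. The example below assumes that a workload-blended price is the per-token input and output prices weighted by the preset's input:output ratio; that formula, the endpoint names, and the prices are assumptions for illustration, and the paper's actual blending and ranking (which covers more than price) may differ.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: float, output_parts: float) -> float:
    """Weighted average price per token for a given input:output token ratio."""
    total_parts = input_parts + output_parts
    if total_parts <= 0:
        raise ValueError("input_parts + output_parts must be positive")
    return (input_price * input_parts + output_price * output_parts) / total_parts


# Presets use the input:output ratios quoted in the text above.
PRESETS = {"chat": (3, 1), "rag": (20, 1), "reasoning": (1, 5)}

# Hypothetical endpoints: (input price, output price) in dollars per million tokens.
ENDPOINTS = {
    "endpoint-a": (1.0, 10.0),  # cheap input, expensive output
    "endpoint-b": (2.5, 4.0),   # pricier input, cheaper output
}


def rank_by_price(preset: str) -> list[str]:
    """Return endpoint names ordered from cheapest to most expensive."""
    input_parts, output_parts = PRESETS[preset]
    return sorted(
        ENDPOINTS,
        key=lambda name: blended_price(*ENDPOINTS[name], input_parts, output_parts),
    )


for preset in PRESETS:
    print(preset, rank_by_price(preset))
```

With these made-up prices, endpoint-b is cheaper under the chat and reasoning presets while endpoint-a is cheaper under the input-heavy retrieval preset, which is the kind of reordering the finding describes.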
A Key Insight in Code and Math
The Deterministic Horizon's core result can be expressed as a capacity bound on the residual stream. If a transformer has embedding dimension d and L layers, the maximum Shannon information that can propagate through the residual stream is bounded by:
C(L, d) ≤ d · log₂(1 + P/N) · f(L)
where:
- C = residual stream capacity (bits)
- d = embedding dimension
- P/N = signal-to-noise ratio at each layer
- f(L) = a depth-dependent factor that decays toward 0 super-exponentially once L exceeds L₀ (the horizon)
The practical implication: if you are building a 40-layer transformer and your architecture's horizon is 24, layers 25-40 are structurally incapable of improving accuracy for any task. You are burning FLOPs on silent degradation.
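A rough numerical sketch of the bound follows. The source does not give the form of f(L), so `illustrative_decay` is a hypothetical stand-in (equal to 1 up to the horizon, then falling off as exp(-x²) in the number of excess layers) chosen only because it decays faster than any exponential. The signal-to-noise ratio and the horizon are taken as inputs rather than derived, and all function names are assumptions.

```python
import math


def illustrative_decay(num_layers: int, horizon: int) -> float:
    """Hypothetical stand-in for f(L); NOT the thesis's actual function.

    Returns 1.0 up to the horizon, then decays as exp(-excess**2),
    which falls off faster than any plain exponential.
    """
    excess = max(0, num_layers - horizon)
    return math.exp(-(excess ** 2))


def capacity_upper_bound(num_layers: int, d_model: int,
                         snr: float, horizon: int) -> float:
    """Evaluate d * log2(1 + P/N) * f(L) using the placeholder f above (bits)."""
    return d_model * math.log2(1.0 + snr) * illustrative_decay(num_layers, horizon)


def layers_past_horizon(num_layers: int, horizon: int) -> int:
    """Count the layers that sit beyond the horizon."""
    return max(0, num_layers - horizon)


# The 40-layer model with a horizon of 24 from the text above.
print(layers_past_horizon(40, 24))  # 16 layers, i.e. layers 25 through 40
for depth in (24, 25, 26, 40):
    print(depth, capacity_upper_bound(depth, d_model=512, snr=1.0, horizon=24))
```

The absolute numbers mean nothing because the decay function is invented; the point is only the shape, where the bound is flat up to the horizon and then drops sharply with each additional layer.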
Token Arena's composite metric:
J/CA = (modeled energy per token × tokens per output) / correct_answer_rate
where J/CA is joules per correct answer, combining efficiency and quality into a single deployable number.
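The joules-per-correct-answer formula above is simple enough to compute directly. The sketch below follows the formula as written in this post; the function name, argument names, and the two example endpoints are hypothetical, and the numbers are made up rather than taken from the paper.

```python
def joules_per_correct_answer(energy_per_token_j: float,
                              tokens_per_output: float,
                              correct_answer_rate: float) -> float:
    """Energy spent per output, divided by the fraction of outputs that are correct."""
    if not 0.0 < correct_answer_rate <= 1.0:
        raise ValueError("correct_answer_rate must be in (0, 1]")
    return (energy_per_token_j * tokens_per_output) / correct_answer_rate


# Two hypothetical endpoints serving the same model (illustrative numbers only).
efficient = joules_per_correct_answer(0.5, 200, 0.5)
wasteful = joules_per_correct_answer(1.0, 400, 0.25)

print(efficient, wasteful, wasteful / efficient)
```

Dividing by the correct-answer rate is what couples energy to quality: an endpoint that uses the same energy per output but answers correctly half as often pays twice as much per correct answer.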
Why It Matters
For AI Architecture Design
The Deterministic Horizon provides something the field has lacked: a theoretical ceiling for transformer performance that can be computed in advance. If you know your architecture's horizon is 24 layers and you are at 32, you know with mathematical certainty that no amount of data scaling will help. The paper's impossibility-to-specification methodology gives designers 16 concrete rules — for each bound, there is a quantified violation cost and a constructive alternative. This is the difference between sailing by the stars and sailing by a map with the reefs marked.
For AI Procurement and Deployment
Token Arena exposes a problem the industry has been ignoring: public leaderboards compare at the model level, but deployment happens at the endpoint level. Quantization, decoding strategy, serving stack, and region all change model behavior in ways that matter for accuracy, latency, cost, and energy. A company choosing between two providers based on a model-level leaderboard may misjudge delivered accuracy by as much as 12.5 points on math and code. Token Arena's endpoint-level methodology is the minimum viable transparency that procurement needs.
For Trustworthy AI Regulation
Guo's thesis directly addresses the EU AI Act, US Executive Order on AI, and similar frameworks. Current regulation focuses on training data, model cards, and red-teaming, none of which capture the architectural ceilings proved in the thesis. A regulator could, using the tools in the thesis, compute whether a model can meet a required accuracy threshold before it is even trained. Combined with Token Arena's endpoint-level evaluation, we move toward a regime where AI systems can be certified against provable bounds, not just tested against benchmarks that change every quarter.
Limitations
- Deterministic Horizon: The flagship bound applies to the residual stream of standard transformer architectures. It does not (yet) cover architectural variants with cross-attention, recurrence, or non-transformer alternatives (SSMs, liquid networks). The thesis is a methodology proposal — 16 specifications, with only 2 composed, 1 proven as an obstruction, and 4 left open. The horizon values (19-31) are measured across 12 architectures; larger or more exotic architectures may fall outside this range. As a thesis, it has not been peer-reviewed as a journal publication.
- Token Arena: The benchmark captures endpoint behavior at specific snapshot times. Model providers update endpoints continuously (quantization schemes, routing logic, batching strategies), so leaderboard positions are ephemeral. The energy model is an estimate, not a direct measurement. The quality component depends on specific eval sets (MATH, GPQA, LiveCodeBench, etc.) that may not generalize to all use cases.
- Combined: Neither paper addresses the interaction between architectural ceilings and endpoint degradation — if a model already at its deterministic horizon is also served on a suboptimal endpoint, the combined loss may compound. This is an open question worth investigating.
Links and References
- The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems — Guo, 2026. arXiv:2605.23024. PhD thesis, University of Hong Kong.
- Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference — Gao, Wang, Yu et al., 2026. arXiv:2605.00300.