Weekly LLM Review — May 15, 2026: SubQ's 12M Context, GPT-5.5 Instant, and a Pricing Earthquake

TL;DR: The frontier went quiet in May after April's sprint — but architecture innovation took center stage. SubQ launched the first commercial subquadratic LLM with a 12M-token context window. OpenAI rolled out GPT-5.5 Instant as the new ChatGPT default. DeepSeek V4 Pro dropped pricing by 75%. Grok 3 hiked 10x. ZAYA1-8B became the first AMD-trained MoE.

The Big Picture: A Breather After April's Sprint

April 2026 was the most competitive month in LLM history. Five labs — OpenAI, Anthropic, DeepSeek, Moonshot, and Xiaomi — all dropped models scoring above 50 on the Intelligence Index. GPT-5.5 broke 60. Opus 4.7 landed. The ceiling moved up three points.

8.0 / 10

Weekly LLM Review 2026

🛡️ AI Tool · Updated 2026

May is different. The frontier leaderboard hasn't changed. Instead, the action shifted sideways into architecture innovation, pricing wars, and product defaults.

This Week's Model Releases

Date Model Developer Type License
May 5 GPT-5.5 Instant OpenAI Text + Reasoning Proprietary
May 5 SubQ 1M-Preview Subquadratic Subquadratic LLM Proprietary (API)
May 6 ZAYA1-8B Zyphra Text + MoE Apache 2.0
May 6 Grok 4.3 xAI Text + Reasoning Proprietary
May 8 Gemini 3.1 Flash Lite Google Text + Vision Proprietary (API)
May 14 Qwen3 Coder Next Alibaba Code Apache 2.0
May 14 MiniMax M2.7 Highspeed MiniMax Text Proprietary (API)

SubQ 1M-Preview: Architecture Takes the Stage

The most significant release this week isn't from a name you know. Subquadratic (SubQ) launched with $29M in seed funding and a bold claim: their model isn't a transformer. Using sparse, subquadratic attention end-to-end, SubQ 1M-Preview ships with a native 12 million token context window — the largest of any commercially available LLM.

Standard transformer attention scales at O(n²). Double the context, quadruple the cost. SubQ's approach claims roughly 1/5 the cost of frontier models on long-context tasks and up to 52x faster attention at scale. If this holds in production, it's not just a new model — it's a new architecture category.

The model is available via API and priced competitively for long-context workloads like repo-wide code analysis, multi-document research, and legal document review. The 12M-token window is real — no quality degradation caveats at the upper end.

GPT-5.5 Instant: The New Default

OpenAI quietly made GPT-5.5 Instant the new default in ChatGPT on May 5. This lightweight, low-latency variant of GPT-5.5 aims for faster responses and reduced hallucinations compared to the standard GPT-5.5. It's not a frontier breakthrough — the Intelligence Index didn't move — but it's what millions of ChatGPT users will experience daily. A faster, cheaper default that still benefits from the GPT-5.5 architecture is a meaningful product improvement even if it doesn't make headlines.

ZAYA1-8B: AMD-Trained Open MoE

Zyphra released ZAYA1-8B on May 6 under Apache 2.0 — a Mixture of Experts model with 8 billion total parameters and approximately 760 million active parameters per token. What makes it notable: it was trained entirely on AMD hardware. This is one of the first major open-weight models that didn't touch NVIDIA GPUs during training, signaling growing maturity in the AMD AI ecosystem.

With Apache 2.0 licensing, ZAYA1-8B is free for any use including commercial deployment. It's designed for efficient self-hosting, with the MoE architecture providing strong per-token compute efficiency.

Pricing Earthquake: Winners and Losers

What to Watch

Google's Gemini 2.0 Flash — currently the cheapest model at $0.10/$0.40 — is slated for deprecation by June 2026. Migration will be needed soon. The SubQ architecture will face real production scrutiny in the coming weeks — if it delivers on its efficiency claims, it could reshape how we think about long-context inference. And the labs that didn't ship in April (Anthropic past Opus 4.7, Google past Flash Lite, Meta, Mistral) are likely preparing their next moves.

For more on cost optimization across LLM providers, see our Claude Code cost guide and multi-provider failover build log.

Frequently Asked Questions

What is the cheapest LLM API right now?

Google's Gemini 2.0 Flash at $0.10/$0.40 per 1M tokens is the cheapest from a major provider. However, it's slated for deprecation by June 2026 — migrate soon.

Is SubQ better than GPT-5.5?

Not on raw intelligence. SubQ claims roughly 1/5 frontier capability with its current preview. Its advantage is context length (12M tokens) and cost efficiency — it's a different product for different use cases.

Why did Grok 3 prices increase so much?

xAI has not published a detailed rationale. The 10x increase likely reflects repositioning toward enterprise pricing or capacity management. Grok 3 Mini remains at $3/$5 as a fallback option.

Which models offer the best value right now?

DeepSeek V4 Pro at $0.44/$0.87 offers the best premium-to-price ratio. For budget workloads, Gemini 2.0 Flash ($0.10/$0.40) and DeepSeek V4 Flash ($0.14/$0.28) are the clear winners.

← Back to all posts