Tip of the Day: Slash 90% Off Your AI API Bill — The Prompt Caching Playbook for 2026


TL;DR: Prompt caching can slash your AI API bill by 50–90% by reusing processed prompt prefixes across calls. Both OpenAI and Anthropic support it natively, with no infrastructure changes needed. Put stable system prompts and context at the start of every request: OpenAI then caches automatically, and Anthropic caches whatever you mark with cache_control. It's the single highest-ROI optimization you can make in 2026.

Why Prompt Caching Is Free Money

Your AI API calls are burning money on repeat processing. Every time you send a prompt, the model reprocesses the entire input (system instructions, conversation history, document context) even when the same prefix appeared in the previous call. With prompt caching, the provider stores the processed representation of common prefixes and reuses them. You pay a fraction of the cost for the cached portion: cache writes cost the normal input price or a small premium, depending on the provider, and cache reads cost roughly 10–50% of normal input pricing. Output quality is unchanged.

This isn't theoretical. In May 2026, both major API providers support it, and the savings compound across agents, chat apps, RAG pipelines, and batch processing. For more on structuring AI workflows, check out our guide on prompt chaining and our deep-dive into structured outputs.

How Much Can You Save?

The savings depend on your prompt structure and cache hit rate. Here are typical ranges by workload:

Stable system prompts — A 2K token system instruction that never changes. Every call with the same prefix hits cache. Savings: 85–90% on those prefix tokens.
Document-based context — Loading the same 50-page PDF across 100 queries. First query pays full price; queries 2–100 pay ~10% for the document portion. Savings: 50–70% overall.
Conversation history — Chat apps where the first 10K tokens of context are static. Cache hits accumulate per-user. Savings: 40–60% per session.
Agent loops — Multi-step agents that keep the same instructions between tool calls. Each step reuses the cached context prefix from the previous step. Savings: 30–50% per agent run.

Portkey's analysis of 10,000+ production deployments found the median prompt caching savings across all use cases was 67%, with top-quartile users exceeding 85%.
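
To make the arithmetic behind these ranges concrete, here is a minimal sketch that estimates input-token savings for a cached prefix. The function name input_savings, the workload sizes, and the default multipliers (a write billed at 1.25x and a read at 0.10x the normal input rate) are illustrative assumptions, not quoted prices; substitute your provider's current rates. It covers input tokens only. Output tokens are never discounted by caching, which is one reason overall savings come in lower than prefix savings.

```python
def input_savings(prefix_tokens, suffix_tokens, calls,
                  write_mult=1.25, read_mult=0.10):
    """Fraction of input-token cost saved by caching a shared prefix.

    Assumes the first call writes the cache and every later call hits it
    (the cache never expires between calls). Multipliers are relative to
    the normal input price and are illustrative assumptions.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    baseline = calls * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens * write_mult                  # first call writes the cache
        + prefix_tokens * read_mult * (calls - 1)   # later calls read it
        + suffix_tokens * calls                     # the dynamic part is never cached
    )
    return 1 - cached / baseline


# Hypothetical workload: a 25,000-token document, 200-token questions, 100 queries.
doc_case = input_savings(prefix_tokens=25_000, suffix_tokens=200, calls=100)
print(f"Input-token savings: {doc_case:.1%}")
```

With a write premium, a single call comes out negative: a prefix that is never reused costs more with caching than without.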

Provider Support: OpenAI vs Anthropic

Both major providers offer prompt caching, but the details differ. Here is how they compare.

OpenAI Prompt Caching

OpenAI renamed their offering to "Prompt Caching" in early 2026 and streamlined it significantly. Key details:

Automatic. No API changes needed. If your prompt structure is stable, caching happens transparently.
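Scope. The longest prefix of the request (system message, then the following messages, in order) that exactly matches a recent request. Only prompts of 1,024 tokens or more are eligible.
Pricing. Cache writes carry no fee: the first request with a new prefix is billed at the normal input rate. Cache reads are discounted by 50% or more against normal input pricing, depending on the model.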
TTL. 5–10 minutes of inactivity before cache eviction. High-traffic apps maintain cache momentum better.
Supported models. GPT-4o, GPT-4.1, GPT-4.1 mini, o4-mini.

Anthropic Prompt Caching

Anthropic's version requires opt-in via a cache_control breakpoint in the messages API. This gives you more control:

Manual opt-in. Add cache_control to system prompts, message content blocks, or tools to mark cache breakpoints.
Scope. Anything between the start of your request and a breakpoint you define, including tool definitions and document context.
Pricing. Cache writes (first hit): 125% of the normal input price, a 25% premium. Cache reads (subsequent hits): 10% of the normal input price.
TTL. 5 minutes by default, refreshed each time the cached prefix is read, so steady traffic keeps it alive. A longer 1-hour TTL is available at a higher write price.
Supported models. Claude Opus 4, Claude Sonnet 4, Claude Sonnet 4.5.
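
The pricing difference matters most when a prefix is reused only a few times. As a rough sketch of the break-even arithmetic, assume a cache write costs write_mult times the normal input price and a cache read costs read_mult times. The function name and the multiplier pairs in the usage lines are placeholders chosen for illustration, not official prices.

```python
import math


def break_even_calls(write_mult, read_mult):
    """Smallest number of calls at which caching a prefix is strictly cheaper.

    Costs are in units of one uncached pass over the prefix:
      with caching over n calls:  write_mult + (n - 1) * read_mult
      without caching:            n
    """
    if not 0 <= read_mult < 1:
        raise ValueError("read_mult must be in [0, 1)")
    # Solve write_mult + (n - 1) * read_mult < n for the smallest integer n.
    threshold = (write_mult - read_mult) / (1 - read_mult)
    return max(1, math.floor(threshold) + 1)


# Illustrative multiplier pairs (assumptions): a 25% write premium and a 100% premium.
print(break_even_calls(write_mult=1.25, read_mult=0.10))
print(break_even_calls(write_mult=2.0, read_mult=0.10))
```

Under these assumptions a modest write premium is recovered on the second call, so the premium only hurts for prefixes that are written and never read again.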

Real Code: Setting Up Prompt Caching

Here is how you enable prompt caching on each platform. The difference in approach — automatic vs. manual — is important to understand.

OpenAI (Automatic)

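from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder input: replace with the pull request you want reviewed.
prompt = "Paste the pull request diff here."

# No special params needed: caching is automatic.
# A prefix this short is below the 1,024-token minimum, so expect
# cached_tokens == 0 until the stable part of your prompt is long enough.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        # Stable content first, so it forms the cacheable prefix
        {"role": "system", "content": "You are a senior code reviewer. "
         "Analyze pull requests for security issues, bugs, and "
         "performance bottlenecks. Format as a structured report."},
        # Per-request content last
        {"role": "user", "content": prompt},
    ],
)
# Check cached token count (0 on a cache miss)
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")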

Anthropic (Manual, with Cache Control)

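import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs: replace with your own diff and question.
diff_content = "Paste the pull request diff here."
feedback_prompt = "List the three most serious issues in this diff."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...",
            # Breakpoint 1: cache everything up to the end of the system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": diff_content,
                    # Breakpoint 2: cache the system prompt plus the diff
                    "cache_control": {"type": "ephemeral"},
                },
                # Not cached: the question changes on every call
                {"type": "text", "text": feedback_prompt},
            ],
        }
    ],
)
# Check caching stats: tokens read from the cache vs. tokens written to it
usage = response.usage
print(f"Cached input tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")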

Advanced Cache Strategies

Getting to that 90% savings requires deliberate design. These strategies work across providers.

1. Segment Your Prompts by Function

Instead of building one monolithic system prompt, split it into stable and dynamic sections. Put everything that doesn't change (personality, rules, output format) at the beginning. Then the variable part (instructions, context, questions) comes after. This maximizes the cached prefix.
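
A minimal sketch of that ordering, using plain dictionaries in the chat-message shape from the examples above. The helper name build_messages and its parameters are hypothetical; the point is only that nothing request-specific appears before the stable text.

```python
STABLE_SYSTEM_PROMPT = (
    "You are a senior code reviewer. "
    "Analyze pull requests for security issues, bugs, and performance bottlenecks. "
    "Format as a structured report."
)


def build_messages(user_id, question, context=""):
    """Return chat messages with the stable part first and per-request data last."""
    dynamic = f"User: {user_id}\nContext: {context}\nQuestion: {question}"
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # identical on every call
        {"role": "user", "content": dynamic},                 # changes on every call
    ]


a = build_messages("user-1", "Is this query injectable?")
b = build_messages("user-2", "Does this loop leak memory?")
# The system message is shared between users, so it can form a cached prefix.
print(a[0] == b[0])
```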

2. Batch Similar Queries

When processing a document with multiple questions, run all questions against the same context back to back. The first query writes the cache; all subsequent queries read the shared context at the discounted cache-read rate (up to 90% off, depending on the provider). This is the single biggest optimization for RAG pipelines.
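
One way to get that ordering in a job queue is to group pending work by document before sending anything. The sketch below assumes each job is a (doc_id, question) pair; both helper names are made up for illustration.

```python
from collections import defaultdict


def order_for_cache_reuse(jobs):
    """Reorder (doc_id, question) pairs so questions on one document run back to back.

    Documents keep the order in which they first appear; questions keep their
    relative order within each document.
    """
    by_doc = defaultdict(list)
    for doc_id, question in jobs:
        by_doc[doc_id].append(question)
    return [(doc_id, q) for doc_id, questions in by_doc.items() for q in questions]


def context_switches(jobs):
    """Number of times consecutive jobs use a different document."""
    return sum(1 for prev, cur in zip(jobs, jobs[1:]) if prev[0] != cur[0])


queue = [
    ("contract", "Who are the parties?"),
    ("manual", "What is the warranty period?"),
    ("contract", "What is the termination clause?"),
    ("manual", "How do I reset the device?"),
]
ordered = order_for_cache_reuse(queue)
print(context_switches(queue), context_switches(ordered))
```

Every context switch is a potential cache miss, so fewer switches means more requests land on a warm prefix.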

3. Maintain Cache Temperature

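High-frequency usage keeps cache entries warm. For batch jobs with gaps between calls, a synthetic "ping" call with the same prefix can prevent eviction: with a TTL of about 5 minutes, a ping every 3–4 minutes keeps expensive prefixes alive. Each ping is still billed at the cache-read rate plus a few output tokens, so this pays off only for large prefixes you expect to reuse soon.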
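
The timing logic can be kept separate from the API call itself. Below is a small sketch of a tracker that answers one question, "is a refresh due?". The class name, the 300-second TTL, and the 90-second safety margin are all assumptions to adjust for your provider, and sending the actual ping request is left to the caller.

```python
import time


class KeepWarm:
    """Tracks when a cached prefix was last used and whether a refresh is due."""

    def __init__(self, ttl_seconds=300, safety_margin=90, clock=time.monotonic):
        # Refresh a little before the assumed TTL runs out.
        self.refresh_after = ttl_seconds - safety_margin
        self.clock = clock
        self.last_used = None

    def mark_used(self):
        """Call after every real request or ping that used the prefix."""
        self.last_used = self.clock()

    def ping_due(self):
        if self.last_used is None:
            return False  # nothing has been cached yet
        return self.clock() - self.last_used >= self.refresh_after


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
warm = KeepWarm(clock=lambda: now[0])
warm.mark_used()
now[0] = 120.0
early = warm.ping_due()   # 2 minutes after last use
now[0] = 215.0
late = warm.ping_due()    # just past the 210-second refresh point
print(early, late)
```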

4. Cache Large Fixed Assets

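For Anthropic, place a cache_control breakpoint at the end of each large fixed document rather than on every chunk: a breakpoint caches everything before it, and a request accepts only a few breakpoints (four at the time of writing). This lets you serve the same 50K token document to thousands of users while paying only 10% of the document cost per request after the first, as long as the cache stays warm.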
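
As a sketch of how the content blocks for a chunked document might be assembled, the helper below marks only a limited number of chunk boundaries and always includes the last one. The function name, the even spacing of breakpoints, and the limit of four are assumptions for illustration; the blocks follow the same dictionary shape as the Anthropic example above.

```python
MAX_BREAKPOINTS = 4  # assumed per-request limit on cache_control markers


def document_blocks(chunks, breakpoints=1):
    """Build text blocks for a chunked document with a few cache breakpoints.

    Breakpoints are spread evenly and the final chunk is always marked,
    so the whole document ends up inside the cached prefix.
    """
    if not 1 <= breakpoints <= MAX_BREAKPOINTS:
        raise ValueError(f"breakpoints must be between 1 and {MAX_BREAKPOINTS}")
    if not chunks:
        return []
    n = len(chunks)
    count = min(breakpoints, n)
    # Indices of the chunks that close a cached segment.
    marked = {((i + 1) * n) // count - 1 for i in range(count)}
    blocks = []
    for i, text in enumerate(chunks):
        block = {"type": "text", "text": text}
        if i in marked:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks


blocks = document_blocks([f"chunk {i}" for i in range(10)], breakpoints=4)
print([i for i, b in enumerate(blocks) if "cache_control" in b])
```

A single breakpoint on the last chunk is enough when the document never changes; extra breakpoints only help when later chunks change while earlier ones stay fixed.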

5. Monitor Cache Hit Rates

Both providers expose cache statistics in the API response. Log these metrics and alert when cache hit rate drops below 70%. A sudden drop usually signals a prompt structure change or a deployment gone wrong.
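
A minimal sketch of such a monitor is shown below. It takes the token counts as plain integers so it runs without either SDK. The helper and class names are hypothetical, and the way the two providers' counts are combined reflects my reading of their usage fields (OpenAI counts cached tokens inside prompt_tokens, Anthropic reports uncached, read, and written tokens separately), so verify it against real responses before alerting on it.

```python
from collections import deque


def openai_hit_rate(prompt_tokens, cached_tokens):
    """OpenAI-style usage: cached_tokens is counted inside prompt_tokens."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


def anthropic_hit_rate(input_tokens, cache_read_input_tokens, cache_creation_input_tokens):
    """Anthropic-style usage: the three counts are reported separately, so sum them."""
    total = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
    return cache_read_input_tokens / total if total else 0.0


class HitRateMonitor:
    """Rolling average of per-request hit rates with a simple alert threshold."""

    def __init__(self, threshold=0.70, window=100):
        self.threshold = threshold
        self.rates = deque(maxlen=window)

    def record(self, rate):
        self.rates.append(rate)

    def average(self):
        return sum(self.rates) / len(self.rates) if self.rates else 0.0

    def should_alert(self):
        # Stay quiet until at least one request has been recorded.
        return bool(self.rates) and self.average() < self.threshold


monitor = HitRateMonitor()
monitor.record(openai_hit_rate(prompt_tokens=2_400, cached_tokens=2_048))
monitor.record(anthropic_hit_rate(input_tokens=300, cache_read_input_tokens=0,
                                  cache_creation_input_tokens=2_100))
print(f"average={monitor.average():.2f} alert={monitor.should_alert()}")
```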

Common Pitfalls

Prompt caching is not set-and-forget. Watch for these gotchas:

Variable prefixes. If your system prompt includes a unique user ID or session token at the start, every user gets a different prefix and zero cache hits. Move dynamic content to the end.
Short TTL. Low-traffic apps may never see cache hits because the 5-minute TTL expires between calls. Batching related requests together or keeping the prefix warm helps.
Provider differences. OpenAI caches automatically, while Anthropic only caches what you mark with cache_control, so the same request code does not carry over. Keep one stable-first prompt template and add the provider-specific markers in a thin adapter layer.
Long-context caching. Very long documents (100K+ tokens) may cache less effectively because the total token budget varies. Keep cache segments under 50K tokens when possible.
Token counting. Cached tokens still count toward your context window. A 200K token prompt with 150K cached still consumes 200K during inference.
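
The variable-prefix pitfall is easy to check before deploying. The sketch below compares two rendered prompts and reports how much of their length is a shared prefix. It counts characters rather than tokens, so treat the ratio as a rough proxy; the function name and sample prompts are made up for illustration.

```python
import os.path


def shared_prefix_ratio(prompt_a, prompt_b):
    """Share of the longer prompt covered by the common leading text (by characters)."""
    shared = len(os.path.commonprefix([prompt_a, prompt_b]))
    longest = max(len(prompt_a), len(prompt_b))
    return shared / longest if longest else 1.0


RULES = "You are a support assistant. Answer concisely and cite the policy section. " * 20

# Session token first: the prompts diverge almost immediately.
bad = shared_prefix_ratio("session=abc123\n" + RULES, "session=xyz789\n" + RULES)
# Session token last: nearly the whole prompt is a shared prefix.
good = shared_prefix_ratio(RULES + "\nsession=abc123", RULES + "\nsession=xyz789")
print(f"token first: {bad:.3f}  token last: {good:.3f}")
```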

When NOT to Use Prompt Caching

Prompt caching isn't always the right move:

One-off prompts with no stable prefix get zero cache hits. Don't bother.
Frequently changing system prompts (prompt-tuning experiments, A/B tests) invalidate caches constantly.
Short prompts below the provider's minimum cacheable length (1,024 tokens on OpenAI; 1,024 or more on Claude, depending on the model) are not cached at all.
Compliance-regulated scenarios where prompt contents must not persist on provider infrastructure, even temporarily.

Putting It All Together

Prompt caching is the easiest high-ROI optimization available to AI developers today. It requires no new infrastructure, no third-party tools, and no complex orchestration. It just requires thoughtful prompt structure.

The formula is simple: identify your stable prompt prefixes, segment dynamic content, and let the provider's cache do the work. The result is 50–90% lower costs without sacrificing quality or latency.

If you are only going to make one optimization this year, start here.

Frequently Asked Questions

Does prompt caching affect output quality?

No. The cache stores the processed input representation, not the output. The model still generates output from scratch each time. Quality is identical to non-cached calls.

Is prompt caching compatible with streaming?

Yes. Cached tokens process faster, so streaming actually starts sooner. Both OpenAI and Anthropic support caching with streaming responses.

Do fine-tuned models support prompt caching?

OpenAI supports it with fine-tuned models. Anthropic does not currently support caching with fine-tuned versions.

Can I use prompt caching with Azure OpenAI?

Yes, Azure OpenAI supports prompt caching with GPT-4o and GPT-4.1 models. The TTL and pricing may differ slightly from OpenAI's direct offering.

How do I verify caching is working?

OpenAI returns response.usage.prompt_tokens_details.cached_tokens. Anthropic returns response.usage.cache_read_input_tokens and cache_creation_input_tokens. Log both and monitor your cache hit ratio.
