Prompt caching is roughly a 10x discount on the largest cost in AI products. It also fails without an error message. Here’s why that matters to anyone who signs off on the spend.
Most software costs scale with how many customers you have. AI features scale with something newer and harder to forecast: how much text your application sends a model, and how often. Every system prompt, every reference document, every turn of a conversation is tokens you buy by the million. The largest discount on that spend is prompt caching. When an application sends a model the same context repeatedly, the model reuses what it already read at about a tenth of the price.
The savings aren’t marginal. For a simple chatbot, caching is a nice-to-have. For an AI agent it’s the whole game, because an agent resends its growing context on every step.
Prompt caching failed for 36 hours
A customer came to us after a cost spike they couldn’t explain. Over a single 36-hour window, one of their API keys had run up roughly 4x the usual spend on a top-tier model, with no change in how many requests they were sending. The price per request had moved, not the volume.
The key belonged to one developer’s coding agent, an OpenAI-compatible client. That format has no way to express a cache breakpoint, so every turn re-sent the agent’s entire growing context to the model at full price. Agent transcripts run long, so each request averaged about 190,000 input tokens, almost none of it cached. The shape of it, by the time it reached the invoice:
– About 3,000 requests from the single key in 36 hours
– Roughly 560 million input tokens billed at full price
– A cache hit rate between 10 and 25 percent, where a stable agent workload reaches 70 to 80
– About 4x the expected cost for that workload
It ran for a day and a half before anyone noticed. There were three causes. The client format couldn’t request caching at all, so the prefix went up uncached every turn. The path that failed over to a second cloud provider used a different caching dialect and cached nothing. And a data protection step was rewriting part of each request with a fresh value every time, which changed the prompt just enough to defeat any cache that did apply.
Why this was an annoying problem to fix
Prompt caching fails silently. The application works, the answers are right, the users are happy. The only place the failure shows up is the invoice, and by then you’ve been overpaying for weeks. Most cost surprises in software announce themselves first, with a traffic spike, an outage, or a usage report climbing. This one doesn’t. It just quietly stops saving you money while everything looks fine.
It gets one notch worse than “no savings.” Turning the cache on costs a small premium the first time, to store the context for reuse. If the cache is switched on but the reuse never happens, because something keeps changing the request slightly between calls, you pay that premium on every request and never collect the discount. A broken cache is more expensive than no cache at all. Plenty of teams are running exactly that configuration right now and don’t know it, because, again, nothing errors.
Finding solutions can be a moving target
So why not just turn it on correctly and move on? Because “correctly” is a moving target sitting at the intersection of several things most teams would rather not maintain.
Every model vendor implements caching differently, in its own request format. If you use more than one (and most serious deployments do, for cost, redundancy, and negotiating room), the discount has to be built and tested separately for each. That includes the backup vendor you fail over to during an outage, which is precisely when your traffic is highest. Left unattended, the default outcome is that your most expensive hour is also your least cached.
Your own safeguards can cancel the savings. If you redact or mask sensitive data before it reaches the model, as any serious enterprise deployment should, that step rewrites the request on its way out. Build it carelessly and it changes the request just enough each time to defeat the cache, so your security posture and your cost posture quietly work against each other without anyone choosing that tradeoff on purpose.
And the discount only counts if you can see it. Because the failure is invisible by default, the only way to manage it is to measure, on every request, whether the reuse actually happened, and to raise a flag the day it slips instead of discovering it on next month’s statement.
What Barndoor does about it
That leaves a company two real options. Either your application engineers learn each vendor’s caching rules, keep them current as the vendors change them, make them cooperate with your security and routing, and build the monitoring that catches silent regressions, and then do that forever. Or someone owns the layer for you.
Barndoor LLM Gateway governs every AI request in your stack. The cache instruction is inserted automatically, in the correct format for whichever vendor handles the request, failover included. Barndoor data protection keeps the request stable so they don’t undo the discount. And every request is measured, so a workload like that one surfaces on a dashboard in minutes instead of on a statement, attributable to the team, app, or key that drove it. Cost stops being something that arrives at the end of the month and becomes something you can forecast and assign.
Your engineers don’t become caching experts. Your security team doesn’t have to choose between protecting data and controlling cost. And your finance team stops serving as the early-warning system for a problem that should have been caught three layers upstream.
The first time you hear about your AI spend shouldn’t be the invoice. Give Barndoor a try to see for yourself.
