Agentic Plan Caching, Explained — and How We Got 67.6% Token Reduction
A deep dive into Agentic Plan Caching (NeurIPS 2025) — Tier A direct execution, Tier B guided prompts, plan extraction, and the real numbers we measured in production.
The NeurIPS 2025 paper Agentic Plan Caching is the first result I have seen that cuts agent LLM costs in half without degrading quality. When we implemented it at MCPizy we measured 67.6% token reduction and 67.5% latency reduction on a 50-task benchmark, with quality held constant. This guide explains what APC does, why it works, and the two implementation tiers you need to know about.
The problem: agents re-plan every time
A typical ReAct agent call looks like this:
User: "Analyse HubSpot sentiment for the last 30 days"
Agent (planning, 2,400 tokens):
thought: I need to search HubSpot... then analyse...
action: search_hubspot(days=30)
Agent (planning, 2,600 tokens):
thought: now I should summarise...
action: sentiment_analysis(text=...)
Agent (final, 1,800 tokens):
answer: "Sentiment trended positive, +12% vs prior month"
Three LLM calls, each re-deriving a plan that is structurally identical to the last 200 times the agent answered a sentiment question. APC asks the obvious question: why re-derive?
The core idea: cache the plan, not the answer
Output caching (return the same answer for the same input) does not work for agents — the input is always slightly different and the answer is always data-dependent. APC caches one layer deeper: the plan template, the sequence of tool calls and their argument shapes.
When a new request arrives:
1. Extract intent (structured, normalised), e.g. {what: "sentiment", where: "hubspot", when: "30d"}
2. Look up a plan template matching that intent
3. Adapt the template's argument slots to the new request
4. Execute — bypassing or constraining the planner LLM
Step 4 is where the token savings come from. The paper proposes two execution tiers.
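The request flow above, including the tier split described next, can be sketched as follows. The thresholds (0.75 and 0.5) come from this post; the function and class names are illustrative, not from the paper:

```python
from dataclasses import dataclass

# Confidence thresholds from this post: >= 0.75 -> Tier A, 0.5-0.75 -> Tier B.
TIER_A_THRESHOLD = 0.75
TIER_B_THRESHOLD = 0.5

@dataclass
class CacheHit:
    template: dict       # cached tool-call sequence with {{slot}} placeholders
    confidence: float    # how well the intent matches the template's intent

def route(request: str, cache, extract_intent, adapt_slots) -> str:
    """Decide which execution tier handles a request."""
    intent = extract_intent(request)          # step 1: W5H2 extraction (cheap model)
    hit = cache.lookup(intent)                # step 2: template lookup by intent
    if hit is None or hit.confidence < TIER_B_THRESHOLD:
        return "react"                        # miss: fall through to normal ReAct
    adapt_slots(hit.template, intent)         # step 3: fill argument slots
    if hit.confidence >= TIER_A_THRESHOLD:
        return "tier_a"                       # step 4: direct execution, no planner
    return "tier_b"                           # step 4: constrained planner
```

On a miss the request runs through ReAct unchanged, which is what keeps this safe to deploy incrementally.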
Tier A: direct execution, no planner LLM
If the plan template is high confidence (we use 0.75 as the threshold; the paper uses 0.8), we bypass the ReAct loop entirely. The executor fills in the template's argument slots with a cheap tool_choice-forced call (~50 tokens/step vs ~2,500 for free-form reasoning) and runs the tools directly.
Plan template (cached):
step 1: search_hubspot(days={{duration}})
step 2: sentiment_analysis(text={{step_1.output}})
step 3: summarise(items={{step_2.output}})
New request: "sentiment on HubSpot last week"
→ Intent: {what: sentiment, where: hubspot, when: 7d}
→ Template match (confidence 0.91, cached hit)
→ Tier A execute:
search_hubspot(days=7) # no planner LLM
sentiment_analysis(text=...) # no planner LLM
summarise(items=...) # no planner LLM
→ Final synthesis (one LLM call)
Savings on this request: ~90% of the tokens a ReAct loop would have used, measured end to end.
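A minimal sketch of Tier A execution, assuming templates store steps with {{slot}} placeholders and {{step_N.output}} references as shown above (the executor itself is illustrative, not the paper's implementation):

```python
import re

def fill_slots(arg_template: str, slots: dict, step_outputs: dict) -> str:
    """Substitute {{duration}}-style slots and {{step_1.output}} references."""
    def sub(m):
        key = m.group(1)
        if key.endswith(".output"):                    # reference to a prior step
            return str(step_outputs[key.split(".")[0]])
        return str(slots[key])                         # intent-derived argument
    return re.sub(r"\{\{(\w+(?:\.\w+)?)\}\}", sub, arg_template)

def run_tier_a(template: list, slots: dict, tools: dict) -> dict:
    """Execute the cached plan directly -- no planner LLM in the loop."""
    outputs = {}
    for i, step in enumerate(template, start=1):
        args = {k: fill_slots(v, slots, outputs) for k, v in step["args"].items()}
        outputs[f"step_{i}"] = tools[step["tool"]](**args)
    return outputs
```

Note that step outputs are threaded forward through `step_outputs`, so later steps can consume earlier results without any LLM mediation.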
Tier B: guided execution, constrained planner
If the plan confidence is mid-range (0.5–0.75), we do not bypass the planner — we constrain it. The system prompt is rewritten to include the cached plan as a structured suggestion, and the planner is told: "use this sequence unless the request clearly breaks the pattern."
The structured plan replaces free-form chain-of-thought rather than being prepended to it. This is the key detail the paper gets right, and the one that naive "stuff the plan in the prompt" implementations miss.
Measured savings in Tier B: ~40–50% tokens, because the planner still runs but does not re-derive the tool sequence from scratch.
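One way to build the Tier B constrained prompt, with the cached plan embedded as a structured suggestion rather than appended to the usual reasoning instructions. The prompt wording below is ours, not the paper's:

```python
import json

def tier_b_system_prompt(base_prompt: str, cached_plan: list) -> str:
    """Embed the cached plan as a structured suggestion in the system prompt.

    The plan REPLACES the usual free-form reasoning instructions; the
    planner validates and adapts it instead of re-deriving from scratch.
    """
    plan_json = json.dumps(cached_plan, indent=2)
    return (
        f"{base_prompt}\n\n"
        "A previously successful plan for this kind of request is below. "
        "Follow this tool sequence unless the request clearly breaks the "
        "pattern; do not re-derive the plan from scratch.\n\n"
        f"Suggested plan:\n{plan_json}"
    )
```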
Miss: fall through to ReAct
If no template matches, or confidence is below 0.5, the agent runs normally. The trajectory is recorded, and if it succeeds with good quality, a template is extracted and added to the cache for next time. This is the self-warming loop — the cache fills itself as the agent works.
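The self-warming step might look like the following: a successful trajectory becomes a template by replacing concrete argument values with slots tied back to the intent. The slot-inference heuristic here is our own illustration, not the paper's extraction method:

```python
def extract_template(trajectory: list, intent: dict) -> list:
    """Turn a successful tool-call trajectory into a reusable template.

    Concrete argument values that match an intent field become {{slot}}
    placeholders; values that came from a prior step's output become
    {{step_N.output}} references. Everything else stays literal.
    """
    value_to_slot = {str(v): k for k, v in intent.items() if v is not None}
    template = []
    for call in trajectory:
        args = {}
        for name, value in call["args"].items():
            if str(value) in value_to_slot:
                args[name] = "{{" + value_to_slot[str(value)] + "}}"
            elif call.get("arg_sources", {}).get(name):     # e.g. "step_1"
                args[name] = "{{" + call["arg_sources"][name] + ".output}}"
            else:
                args[name] = value
        template.append({"tool": call["tool"], "args": args})
    return template
```

Only trajectories that pass the quality gate get extracted, so the cache converges on known-good plans rather than accumulating noise.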
Intent extraction: the W5H2 representation
The quality of the cache hinges entirely on the intent representation. A naive hash-the-prompt approach has a near-zero hit rate because prompts vary. APC uses a structured W5H2 (Who, What, Where, When, Why, How, How-much) extraction.
A small cheap model (we use qwen3-4b or Claude Haiku) emits JSON:
{
  "what": "sentiment_analysis",
  "where": "hubspot",
  "when": "30d",
  "who": null,
  "why": null,
  "how": "text",
  "how_much": null
}
Cache keys are computed from the primary fields (what, where) and fuzzy-matched on the secondaries. This maps the long tail of phrasings onto a small number of real intents.
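Keying per the description above: an exact key on the primary fields, and a fuzzy score over the secondaries. The scoring weights and the use of `difflib` are illustrative choices, not from the paper:

```python
from difflib import SequenceMatcher

PRIMARY = ("what", "where")
SECONDARY = ("when", "who", "how", "how_much")

def cache_key(intent: dict) -> tuple:
    """Exact-match key built from the primary W5H2 fields."""
    return tuple(intent.get(f) for f in PRIMARY)

def secondary_score(intent: dict, cached: dict) -> float:
    """Fuzzy similarity over secondary fields; a missing field matches anything."""
    scores = []
    for f in SECONDARY:
        a, b = intent.get(f), cached.get(f)
        if a is None or b is None:
            scores.append(1.0)
        else:
            scores.append(SequenceMatcher(None, str(a), str(b)).ratio())
    return sum(scores) / len(scores)
```

Because the key collapses to (what, where), "sentiment on HubSpot last week" and "analyse HubSpot sentiment for the last 30 days" land in the same bucket, and only the secondary fields (here, when) need fuzzy adjudication.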
The numbers we measured
Benchmark: 50 real Brandyze tasks, Claude Sonnet 4.6 planner, Claude Haiku 4.5 for intent extraction, 10 pre-warmed templates. Two arms: no-cache baseline, and APC enabled.
| Metric | Baseline | APC | Delta |
|---|---|---|---|
| Avg tokens / task | 14,200 | 4,601 | -67.6% |
| P50 latency | 8.4s | 2.7s | -67.5% |
| Quality score (0–1) | 0.62 | 0.88 | +41.9% (!) |
| Cache hit rate | — | 100% (warmed) | — |
The quality improvement was surprising — it comes from the executor following a known-good plan instead of re-deriving one and occasionally making a worse choice.
When APC does not help
- Cold start: the first few requests in any category have no template yet. Pre-warming from prior trajectories or hand-written examples fixes this.
- Truly one-shot requests: "what was Napoleon's favourite cheese" has no reusable plan. APC adds a small overhead (the intent extraction) for no gain.
- Requests where the plan shape changes with the input: some agentic search tasks have genuinely different tool sequences per call. Here APC falls through to Tier B and saves less.
Try it
MCPizy ships APC as an optional optimization layer on MCP tool calls — you keep your existing MCP servers and agents, and the proxy handles intent extraction, template storage, and tier classification. If you are paying >$500/mo in agent LLM bills today, the APC layer typically pays for itself in the first week.
Running MCP in production?
Centralised auth, cost analytics, and the APC optimization layer — free tier included.