13 min read
April 22, 2026

Agentic Plan Caching, Explained — and How We Got 67.6% Token Reduction

A deep dive into Agentic Plan Caching (NeurIPS 2025) — Tier A direct execution, Tier B guided prompts, plan extraction, and the real numbers we measured in production.

Tags: apc · optimization · llm · cost · neurips-2025

The NeurIPS 2025 paper Agentic Plan Caching is the first result I have seen that cuts agent LLM costs in half without degrading quality. When we implemented it at MCPizy, we measured a 67.6% token reduction and a 67.5% latency reduction on a 50-task benchmark, with quality held constant. This guide explains what APC does, why it works, and the two implementation tiers you need to know about.

The problem: agents re-plan every time

A typical ReAct agent call looks like this:

```
User: "Analyse HubSpot sentiment for the last 30 days"

Agent (planning, 2,400 tokens):
  thought: I need to search HubSpot... then analyse...
  action: search_hubspot(days=30)

Agent (planning, 2,600 tokens):
  thought: now I should summarise...
  action: sentiment_analysis(text=...)

Agent (final, 1,800 tokens):
  answer: "Sentiment trended positive, +12% vs prior month"
```

Three LLM calls, each re-deriving a plan that is structurally identical to the last 200 times the agent answered a sentiment question. APC asks the obvious question: why re-derive?

The core idea: cache the plan, not the answer

Output caching (return the same answer for the same input) does not work for agents — the input is always slightly different and the answer is always data-dependent. APC caches one layer deeper: the plan template, i.e. the sequence of tool calls and their argument shapes.

When a new request arrives:

  1. Extract intent (structured, normalised): e.g. {what: "sentiment", where: "hubspot", when: "30d"}
  2. Look up a plan template matching that intent
  3. Adapt the template's argument slots to the new request
  4. Execute — bypassing or constraining the planner LLM

Step 4 is where the token savings come from. The paper proposes two execution tiers.
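The four steps above can be sketched as a small dispatcher. This is our illustration, not the paper's reference code — the `route_request` name, the cache layout, and the 0.75/0.5 thresholds (ours, per the tier sections below) are all assumptions:

```python
# Hypothetical APC dispatcher sketch. Names, cache shape, and thresholds
# are illustrative; the paper's reference implementation may differ.
TIER_A_THRESHOLD = 0.75  # bypass the planner LLM entirely
TIER_B_THRESHOLD = 0.50  # constrain the planner with the cached plan

def route_request(intent: dict, cache: dict) -> str:
    """Decide which execution tier a request takes."""
    key = (intent.get("what"), intent.get("where"))  # primary intent fields
    entry = cache.get(key)
    if entry is None:
        return "miss"  # no template: fall through to plain ReAct
    confidence = entry["confidence"]
    if confidence >= TIER_A_THRESHOLD:
        return "tier_a"  # direct execution, no planner LLM
    if confidence >= TIER_B_THRESHOLD:
        return "tier_b"  # guided execution, constrained planner
    return "miss"

cache = {
    ("sentiment", "hubspot"): {
        "confidence": 0.91,
        "plan": ["search_hubspot", "sentiment_analysis", "summarise"],
    }
}
tier = route_request({"what": "sentiment", "where": "hubspot"}, cache)
```

A miss is deliberately the default: anything below the Tier B band runs as a normal ReAct loop, so a stale or low-confidence template can never force a bad plan.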

Tier A: direct execution, no planner LLM

If the plan template is high confidence (we use 0.75 as the threshold; the paper uses 0.8), we bypass the ReAct loop entirely. The executor fills in the template's argument slots with a cheap tool_choice-forced call (~50 tokens/step vs ~2,500 for free-form reasoning) and runs the tools directly.

```
Plan template (cached):
  step 1: search_hubspot(days={{duration}})
  step 2: sentiment_analysis(text={{step_1.output}})
  step 3: summarise(items={{step_2.output}})

New request: "sentiment on HubSpot last week"
→ Intent: {what: sentiment, where: hubspot, when: 7d}
→ Template match (confidence 0.91, cache hit)
→ Tier A execute:
    search_hubspot(days=7)        # no planner LLM
    sentiment_analysis(text=...)  # no planner LLM
    summarise(items=...)          # no planner LLM
→ Final synthesis (one LLM call)
```

Savings on this request: ~90% of the tokens a ReAct loop would have used, measured end to end.
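The slot-filling step itself is mechanical. A minimal sketch, assuming templates store `{{...}}` placeholders as plain strings (the `fill_slots` helper and its argument names are hypothetical):

```python
import re

def fill_slots(step: str, slots: dict, outputs: dict) -> str:
    """Substitute {{name}} placeholders from intent slots or prior step outputs."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name in slots:
            return str(slots[name])          # e.g. {{duration}} from the intent
        return str(outputs.get(name, match.group(0)))  # e.g. {{step_1.output}}
    return re.sub(r"\{\{([\w.]+)\}\}", lookup, step)

template = [
    "search_hubspot(days={{duration}})",
    "sentiment_analysis(text={{step_1.output}})",
]
slots = {"duration": 7}                     # adapted from intent {when: "7d"}
outputs = {"step_1.output": "raw_tickets"}  # filled in as steps execute
```

Unresolved placeholders are left intact rather than silently dropped, which makes a malformed template fail loudly at tool-call time instead of executing with empty arguments.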

Tier B: guided execution, constrained planner

If the plan confidence is mid-range (0.5–0.75), we do not bypass the planner — we constrain it. The system prompt is rewritten to include the cached plan as a structured suggestion, and the planner is told: "use this sequence unless the request clearly breaks the pattern."

The structured plan replaces free-form chain-of-thought rather than being prepended to it. This is the key detail the paper gets right that naive "stuff the plan in the prompt" implementations miss.

Measured savings in Tier B: ~40–50% tokens, because the planner still runs but does not re-derive the tool sequence from scratch.
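Building the Tier B prompt is a string-assembly job: the cached plan goes in as a numbered sequence that stands in for chain-of-thought. A sketch, with the instruction wording ours rather than the paper's:

```python
def build_guided_prompt(base_system: str, plan_steps: list[str]) -> str:
    """Embed a cached plan as a structured suggestion for the planner.

    Hypothetical helper: the exact phrasing is our invention, but the
    structure follows Tier B — the plan replaces free-form reasoning
    instead of being prepended to it.
    """
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(plan_steps, 1))
    return (
        f"{base_system}\n\n"
        "A previously successful plan for this kind of request:\n"
        f"{numbered}\n"
        "Use this sequence unless the request clearly breaks the pattern. "
        "Do not re-derive or restate reasoning for steps that follow the plan."
    )

prompt = build_guided_prompt(
    "You are a CRM analytics agent.",
    ["search_hubspot(days=...)", "sentiment_analysis(text=...)", "summarise(items=...)"],
)
```

The last instruction line is what delivers the token savings: the planner may still deviate, but it is explicitly told not to spend tokens rebuilding a sequence it was just handed.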

Miss: fall through to ReAct

If no template matches, or confidence is below 0.5, the agent runs normally. The trajectory is recorded, and if it succeeds with good quality, a template is extracted and added to the cache for next time. This is the self-warming loop — the cache fills itself as the agent works.
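The self-warming loop can be sketched as: take the tool calls from a successful trajectory, replace any argument whose value came from the intent with a slot, and store the result. The quality gate, starting confidence, and slot heuristic below are our simplifications, not the paper's:

```python
def extract_template(trajectory: list[dict], intent: dict, quality: float,
                     cache: dict, min_quality: float = 0.8) -> bool:
    """Turn a successful ReAct trajectory into a cached plan template.

    Hypothetical sketch: arguments whose values match an intent field are
    abstracted into {{field}} slots; everything else is kept literal.
    """
    if quality < min_quality:
        return False  # quality gate: only cache known-good plans
    value_to_slot = {v: k for k, v in intent.items() if v is not None}
    steps = []
    for call in trajectory:  # e.g. {"tool": "search_hubspot", "args": {...}}
        args = {}
        for name, value in call["args"].items():
            slot = value_to_slot.get(value)
            args[name] = "{{" + slot + "}}" if slot else value
        steps.append({"tool": call["tool"], "args": args})
    cache[(intent.get("what"), intent.get("where"))] = {
        "steps": steps,
        "confidence": 0.5,  # new templates start in the Tier B band
    }
    return True

cache = {}
intent = {"what": "sentiment", "where": "hubspot", "when": "30d"}
trajectory = [{"tool": "search_hubspot", "args": {"window": "30d"}}]
extract_template(trajectory, intent, quality=0.9, cache=cache)
```

Starting new templates at mid confidence means a freshly learned plan is only ever a suggestion (Tier B); it has to keep succeeding before it earns Tier A direct execution.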

Intent extraction: the W5H2 representation

The quality of the cache hinges entirely on the intent representation. A naive hash-the-prompt approach has a near-zero hit rate because prompts vary. APC uses a structured W5H2 (Who, What, Where, When, Why, How, How-much) extraction.

A small, cheap model (we use qwen3-4b or Claude Haiku) emits JSON:

```json
{
  "what": "sentiment_analysis",
  "where": "hubspot",
  "when": "30d",
  "who": null,
  "how": "text",
  "how_much": null
}
```

Cache keys are computed from the primary fields (what, where) and fuzzy-matched on the secondaries. This maps the long tail of phrasings onto a small number of real intents.
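That two-stage lookup — exact on the primaries, fuzzy on the secondaries — can be sketched as follows. The `difflib` similarity score and the 0.6 cutoff are our stand-ins for whatever fuzzy matcher you prefer:

```python
from difflib import SequenceMatcher

def lookup(intent: dict, cache: dict, min_secondary: float = 0.6):
    """Exact match on (what, where); fuzzy match on the remaining W5H2 fields.

    Hypothetical sketch: scoring and threshold are illustrative.
    """
    entry = cache.get((intent.get("what"), intent.get("where")))
    if entry is None:
        return None  # primary fields must match exactly
    scores = []
    for field in ("when", "who", "how", "how_much"):
        a, b = intent.get(field), entry["intent"].get(field)
        if a is None and b is None:
            continue  # both unset: no evidence either way
        scores.append(SequenceMatcher(None, str(a), str(b)).ratio())
    if scores and sum(scores) / len(scores) < min_secondary:
        return None  # secondaries too different: treat as a miss
    return entry

cache = {("sentiment_analysis", "hubspot"): {
    "intent": {"when": "30d", "who": None, "how": "text", "how_much": None},
    "plan": ["search_hubspot", "sentiment_analysis", "summarise"],
}}
hit = lookup({"what": "sentiment_analysis", "where": "hubspot",
              "when": "7d", "how": "text"}, cache)
```

Note that "7d" vs "30d" scores low on its own but the request still hits, because the secondaries are averaged — exactly the behaviour you want for time windows, which vary constantly while the plan shape stays the same.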

The numbers we measured

Benchmark: 50 real Brandyze tasks, Claude Sonnet 4.6 planner, Claude Haiku 4.5 for intent extraction, 10 pre-warmed templates. Two arms: no-cache baseline, and APC enabled.

| Metric | Baseline | APC | Delta |
|---|---|---|---|
| Avg tokens / task | 14,200 | 4,601 | -67.6% |
| P50 latency | 8.4s | 2.7s | -67.5% |
| Quality score (0–1) | 0.62 | 0.88 | +41.9% (!) |
| Cache hit rate | — | 100% (warmed) | — |

The quality improvement was surprising — it comes from the executor following a known-good plan instead of re-deriving one and occasionally making a worse choice.

When APC does not help

  • Cold start: the first few requests in any category have no template yet. Pre-warming from prior trajectories or hand-written examples fixes this.
  • Truly one-shot requests: "what was Napoleon's favourite cheese" has no reusable plan. APC adds a small overhead (the intent extraction) for no gain.
  • Requests where the plan shape changes with the input: some agentic search tasks have genuinely different tool sequences per call. Here APC falls through to Tier B and saves less.

Try it

MCPizy ships APC as an optional optimization layer on MCP tool calls — you keep your existing MCP servers and agents, and the proxy handles intent extraction, template storage, and tier classification. If you are paying >$500/mo in agent LLM bills today, the APC layer typically pays for itself in the first week.

Try MCPizy → /pricing
