Agentic Plan Caching, Explained — and How We Got 67.6% Token Reduction
A deep dive into Agentic Plan Caching (NeurIPS 2025) — Tier A direct execution, Tier B guided prompts, plan extraction, and the real numbers we measured in production.
The NeurIPS 2025 paper Agentic Plan Caching is the first result I have seen that cuts agent LLM costs in half without degrading quality. When we implemented it at MCPizy we measured 67.6% token reduction and 67.5% latency reduction on a 50-task benchmark, with quality held constant. This guide explains what APC does, why it works, and the two implementation tiers you need to know about.
The problem: agents re-plan every time
A typical ReAct agent call looks like this:
User: "Analyse HubSpot sentiment for the last 30 days"
Agent (planning, 2,400 tokens):
thought: I need to search HubSpot... then analyse...
action: search_hubspot(days=30)
Agent (planning, 2,600 tokens):
thought: now I should summarise...
action: sentiment_analysis(text=...)
Agent (final, 1,800 tokens):
answer: "Sentiment trended positive, +12% vs prior month"
Three LLM calls, each re-deriving a plan that is structurally identical to the last 200 times the agent answered a sentiment question. APC asks the obvious question: why re-derive?
The core idea: cache the plan, not the answer
Output caching (return the same answer for the same input) does not work for agents — the input is always slightly different and the answer is always data-dependent. APC caches one layer deeper: the plan template, the sequence of tool calls and their argument shapes.
When a new request arrives:
1. Extract intent (structured, normalised), e.g. {what: "sentiment", where: "hubspot", when: "30d"}
2. Look up a plan template matching that intent
3. Adapt the template's argument slots to the new request
4. Execute — bypassing or constraining the planner LLM
Step 4 is where the token savings come from. The paper proposes two execution tiers.
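The request flow above, including the tier split described next, can be sketched as follows. The thresholds (0.75 and 0.5) come from this post; the function and class names are illustrative, not from the paper:

```python
from dataclasses import dataclass

# Confidence thresholds from this post: >= 0.75 -> Tier A, 0.5-0.75 -> Tier B.
TIER_A_THRESHOLD = 0.75
TIER_B_THRESHOLD = 0.5

@dataclass
class CacheHit:
    template: dict       # cached tool-call sequence with {{slot}} placeholders
    confidence: float    # how well the intent matches the template's intent

def route(request: str, cache, extract_intent, adapt_slots) -> str:
    """Decide which execution tier handles a request."""
    intent = extract_intent(request)          # step 1: W5H2 extraction (cheap model)
    hit = cache.lookup(intent)                # step 2: template lookup by intent
    if hit is None or hit.confidence < TIER_B_THRESHOLD:
        return "react"                        # miss: fall through to normal ReAct
    adapt_slots(hit.template, intent)         # step 3: fill argument slots
    if hit.confidence >= TIER_A_THRESHOLD:
        return "tier_a"                       # step 4: direct execution, no planner
    return "tier_b"                           # step 4: constrained planner
```

On a miss the request runs through ReAct unchanged, which is what keeps this safe to deploy incrementally.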
Tier A: direct execution, no planner LLM
If the plan template is high confidence (we use 0.75 as the threshold; the paper uses 0.8), we bypass the ReAct loop entirely. The executor fills in the template's argument slots with a cheap tool_choice-forced call (~50 tokens/step vs ~2,500 for free-form reasoning) and runs the tools directly.
Plan template (cached):
step 1: search_hubspot(days={{duration}})
step 2: sentiment_analysis(text={{step_1.output}})
step 3: summarise(items={{step_2.output}})
New request: "sentiment on HubSpot last week"
→ Intent: {what: sentiment, where: hubspot, when: 7d}
→ Template match (confidence 0.91, cached hit)
→ Tier A execute:
search_hubspot(days=7) # no planner LLM
sentiment_analysis(text=...) # no planner LLM
summarise(items=...) # no planner LLM
→ Final synthesis (one LLM call)
Savings on this request: ~90% of the tokens a ReAct loop would have used, measured end to end.
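A minimal sketch of Tier A execution, assuming templates store steps with {{slot}} placeholders and {{step_N.output}} references as shown above (the executor itself is illustrative, not the paper's implementation):

```python
import re

def fill_slots(arg_template: str, slots: dict, step_outputs: dict) -> str:
    """Substitute {{duration}}-style slots and {{step_1.output}} references."""
    def sub(m):
        key = m.group(1)
        if key.endswith(".output"):                    # reference to a prior step
            return str(step_outputs[key.split(".")[0]])
        return str(slots[key])                         # intent-derived argument
    return re.sub(r"\{\{(\w+(?:\.\w+)?)\}\}", sub, arg_template)

def run_tier_a(template: list, slots: dict, tools: dict) -> dict:
    """Execute the cached plan directly -- no planner LLM in the loop."""
    outputs = {}
    for i, step in enumerate(template, start=1):
        args = {k: fill_slots(v, slots, outputs) for k, v in step["args"].items()}
        outputs[f"step_{i}"] = tools[step["tool"]](**args)
    return outputs
```

Note that step outputs are threaded forward through `step_outputs`, so later steps can consume earlier results without any LLM mediation.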
Tier B: guided execution, constrained planner
If the plan confidence is mid-range (0.5–0.75), we do not bypass the planner — we constrain it. The system prompt is rewritten to include the cached plan as a structured suggestion, and the planner is told: "use this sequence unless the request clearly breaks the pattern."
The structured plan replaces free-form chain-of-thought rather than being prepended to it. This is the key detail the paper gets right, and the one that naive "stuff the plan in the prompt" implementations miss.
Measured savings in Tier B: ~40–50% tokens, because the planner still runs but does not re-derive the tool sequence from scratch.
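One way to build the Tier B constrained prompt, with the cached plan embedded as a structured suggestion rather than appended to the usual reasoning instructions. The prompt wording below is ours, not the paper's:

```python
import json

def tier_b_system_prompt(base_prompt: str, cached_plan: list) -> str:
    """Embed the cached plan as a structured suggestion in the system prompt.

    The plan REPLACES the usual free-form reasoning instructions; the
    planner validates and adapts it instead of re-deriving from scratch.
    """
    plan_json = json.dumps(cached_plan, indent=2)
    return (
        f"{base_prompt}\n\n"
        "A previously successful plan for this kind of request is below. "
        "Follow this tool sequence unless the request clearly breaks the "
        "pattern; do not re-derive the plan from scratch.\n\n"
        f"Suggested plan:\n{plan_json}"
    )
```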
Miss: fall through to ReAct
If no template matches, or confidence is below 0.5, the agent runs normally. The trajectory is recorded, and if it succeeds with good quality, a template is extracted and added to the cache for next time. This is the self-warming loop — the cache fills itself as the agent works.
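The self-warming step might look like the following: a successful trajectory becomes a template by replacing concrete argument values with slots tied back to the intent. The slot-inference heuristic here is our own illustration, not the paper's extraction method:

```python
def extract_template(trajectory: list, intent: dict) -> list:
    """Turn a successful tool-call trajectory into a reusable template.

    Concrete argument values that match an intent field become {{slot}}
    placeholders; values that came from a prior step's output become
    {{step_N.output}} references. Everything else stays literal.
    """
    value_to_slot = {str(v): k for k, v in intent.items() if v is not None}
    template = []
    for call in trajectory:
        args = {}
        for name, value in call["args"].items():
            if str(value) in value_to_slot:
                args[name] = "{{" + value_to_slot[str(value)] + "}}"
            elif call.get("arg_sources", {}).get(name):     # e.g. "step_1"
                args[name] = "{{" + call["arg_sources"][name] + ".output}}"
            else:
                args[name] = value
        template.append({"tool": call["tool"], "args": args})
    return template
```

Only trajectories that pass the quality gate get extracted, so the cache converges on known-good plans rather than accumulating noise.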
Intent extraction: the W5H2 representation
The quality of the cache hinges entirely on the intent representation. A naive hash-the-prompt approach has a near-zero hit rate because prompts vary. APC uses a structured W5H2 (Who, What, Where, When, Why, How, How-much) extraction.
A small cheap model (we use qwen3-4b or Claude Haiku) emits JSON:
{
  "what": "sentiment_analysis",
  "where": "hubspot",
  "when": "30d",
  "who": null,
  "why": null,
  "how": "text",
  "how_much": null
}
Cache keys are computed from the primary fields (what, where) and fuzzy-matched on the secondaries. This maps the long tail of phrasings onto a small number of real intents.
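Keying per the description above: an exact key on the primary fields, and a fuzzy score over the secondaries. The scoring weights and the use of `difflib` are illustrative choices, not from the paper:

```python
from difflib import SequenceMatcher

PRIMARY = ("what", "where")
SECONDARY = ("when", "who", "how", "how_much")

def cache_key(intent: dict) -> tuple:
    """Exact-match key built from the primary W5H2 fields."""
    return tuple(intent.get(f) for f in PRIMARY)

def secondary_score(intent: dict, cached: dict) -> float:
    """Fuzzy similarity over secondary fields; a missing field matches anything."""
    scores = []
    for f in SECONDARY:
        a, b = intent.get(f), cached.get(f)
        if a is None or b is None:
            scores.append(1.0)
        else:
            scores.append(SequenceMatcher(None, str(a), str(b)).ratio())
    return sum(scores) / len(scores)
```

Because the key collapses to (what, where), "sentiment on HubSpot last week" and "analyse HubSpot sentiment for the last 30 days" land in the same bucket, and only the secondary fields (here, when) need fuzzy adjudication.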
The numbers we measured
Benchmark: 50 real Brandyze tasks, Claude Sonnet 4.6 planner, Claude Haiku 4.5 for intent extraction, 10 pre-warmed templates. Two arms: no-cache baseline, and APC enabled.
| Metric | Baseline | APC | Delta |
|---|---|---|---|
| Avg tokens / task | 14,200 | 4,601 | -67.6% |
| P50 latency | 8.4s | 2.7s | -67.5% |
| Quality score (0–1) | 0.62 | 0.88 | +41.9% (!) |
| Cache hit rate | — | 100% (warmed) | — |
The quality improvement was surprising — it comes from the executor following a known-good plan instead of re-deriving one and occasionally making a worse choice.
When APC does not help
- Cold start: the first few requests in any category have no template yet. Pre-warming from prior trajectories or hand-written examples fixes this.
- Truly one-shot requests: "what was Napoleon's favourite cheese" has no reusable plan. APC adds a small overhead (the intent extraction) for no gain.
- Requests where the plan shape changes with the input: some agentic search tasks have genuinely different tool sequences per call. Here APC falls through to Tier B and saves less.
Try it
MCPizy ships APC as an optional optimization layer on MCP tool calls — you keep your existing MCP servers and agents, and the proxy handles intent extraction, template storage, and tier classification. If you are paying >$500/mo in agent LLM bills today, the APC layer typically pays for itself in the first week.
Running MCP in production?
Centralised auth, cost analytics, and the APC optimization layer — free tier included.