MCP Observability: Tracing, Metrics, and Alerts That Actually Matter
A production-grade observability stack for MCP servers: OpenTelemetry traces, Prometheus metrics, structured logs with CID, and the 5 alerts every MCP operator needs.
Most MCP servers in the wild ship with console.log and a prayer. When they break in production, operators learn the hard way that grepping log lines is not an observability strategy. This guide is the exact stack we deploy at MCPizy and recommend for any MCP server handling more than 1k tool calls/day.
Three pillars, one purpose
Traces answer what happened. Metrics answer how often. Logs answer why. You need all three; one cannot replace the others.
- Traces — OpenTelemetry, one span per tool call, with inputs/outputs redacted to fields-of-interest only.
- Metrics — Prometheus histograms for latency, counters for errors by class, gauges for in-flight requests.
- Logs — structured JSON, one Correlation ID (CID) per client session, shipped to Loki or OpenSearch.
The 5 alerts every MCP server needs
- Error rate > 2% for 5 min — 5xx + tool-call failures combined. Page oncall.
- p95 tool latency > 2x baseline for 10 min — slow regression before users notice.
- Auth failure rate > 5% for 2 min — early signal of token rotation or brute-force.
- Upstream dependency down > 30s — if your MCP calls Tavily / Supabase / Stripe, their downtime is your downtime.
- Queue depth > 100 for 1 min — scale-out signal before 429s flow through.
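Under Prometheus + Alertmanager, the first two alerts might look like the sketch below. The metric names (`mcp_tool_calls_total`, `mcp_tool_duration_seconds`) and the `mcp_status` label values are assumptions matching the label scheme later in this guide — adapt them to your own instrumentation.

```yaml
groups:
  - name: mcp-alerts
    rules:
      # Alert 1: combined error rate above 2% for 5 minutes -> page.
      - alert: MCPErrorRateHigh
        expr: |
          sum(rate(mcp_tool_calls_total{mcp_status=~"error_.*|timeout"}[5m]))
            / sum(rate(mcp_tool_calls_total[5m])) > 0.02
        for: 5m
        labels:
          severity: page
      # Alert 2: p95 latency more than 2x the same window yesterday.
      - alert: MCPToolLatencyRegression
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(mcp_tool_duration_seconds_bucket[10m])))
            > 2 * histogram_quantile(0.95, sum by (le) (rate(mcp_tool_duration_seconds_bucket[10m] offset 1d)))
        for: 10m
        labels:
          severity: warn
```

Using `offset 1d` as the baseline is a simple choice; a recording rule over a longer window is more robust against yesterday also being bad.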
Tracing: span structure
```
tool_call (span)
├─ attributes: mcp.tool_name, mcp.client_id, mcp.session_id
├─ auth_check (child)
├─ argument_validation (child)
├─ upstream_call (child, per external dependency)
│  └─ attributes: http.method, http.status_code, http.duration
├─ post_processing (child)
└─ serialize_response (child)
```
Trace context propagates to downstream services via the W3C traceparent header; the CID rides alongside it in the baggage header, so every hop inherits both the trace and the correlation ID automatically.
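The OpenTelemetry SDK builds these headers for you; the sketch below (a hypothetical helper, not part of any SDK) only illustrates the wire format being propagated.

```typescript
// Sketch: the W3C trace-context headers an MCP server forwards to
// upstream dependencies. traceparent carries the trace context; the
// CID travels in baggage so every downstream hop can log it.
interface TraceContext {
  traceId: string; // 32 hex chars
  spanId: string;  // 16 hex chars
  cid: string;     // correlation ID for this client session
}

function propagationHeaders(ctx: TraceContext): Record<string, string> {
  return {
    // version 00, sampled flag 01
    traceparent: `00-${ctx.traceId}-${ctx.spanId}-01`,
    baggage: `mcp.cid=${encodeURIComponent(ctx.cid)}`,
  };
}
```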
Metrics: label discipline
The #1 rookie mistake in MCP observability is high-cardinality labels. Never label by user ID, session ID, or request ID — they explode metrics cardinality and your Prometheus bill.
Safe labels:
- mcp_tool_name (bounded, under 200 values)
- mcp_status (ok, error_client, error_server, timeout)
- upstream_dependency (bounded: the list of SaaS you call)
- http_method, http_status_code
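Keeping labels bounded means collapsing raw outcomes into a fixed taxonomy before they reach the metrics library. A minimal sketch of that mapping, using the mcp_status values above (the helper name is hypothetical):

```typescript
// Sketch: map a raw tool-call outcome onto the bounded mcp_status label
// set, so per-request values never leak into metric labels.
type McpStatus = "ok" | "error_client" | "error_server" | "timeout";

function toMcpStatus(httpStatus: number, timedOut: boolean): McpStatus {
  if (timedOut) return "timeout";          // deadline exceeded
  if (httpStatus >= 500) return "error_server"; // our fault
  if (httpStatus >= 400) return "error_client"; // caller's fault
  return "ok";
}
```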
Log format
```json
{
  "ts": "2026-04-24T12:00:00.000Z",
  "level": "info",
  "cid": "sess_abc123",
  "mcp_tool": "search_products",
  "mcp_client": "claude-desktop",
  "duration_ms": 142,
  "status": "ok",
  "user_id_hash": "h_xyz",
  "msg": "tool_call_complete"
}
```
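A log line like that can come from one small emitter function. The sketch below assumes a SHA-256 hash truncated to 8 hex chars for user_id_hash — the article only specifies that the ID is hashed, so the exact scheme is an assumption:

```typescript
// Sketch: emit one structured JSON log line per completed tool call.
// Field names mirror the example above; raw user IDs never appear.
import { createHash } from "node:crypto";

function logToolCall(f: {
  cid: string; tool: string; client: string;
  durationMs: number; status: string; userId: string;
}): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level: "info",
    cid: f.cid,
    mcp_tool: f.tool,
    mcp_client: f.client,
    duration_ms: f.durationMs,
    status: f.status,
    // hash, never log, the raw user ID (assumed scheme: sha256, 8 hex chars)
    user_id_hash: "h_" + createHash("sha256").update(f.userId).digest("hex").slice(0, 8),
    msg: "tool_call_complete",
  });
}
```

One line per call, written to stdout, is enough for Loki or OpenSearch to ingest.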
Redaction rules
- Arguments over 1000 chars: truncate and indicate truncation
- Fields matching secret patterns (apikey, token, password, cookie): replace with [REDACTED]
- URLs with query strings: log the host + path only, drop the query
- Emails: hash (first 3 chars + hash of the rest)
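The four rules above fit in one value-redaction function. This is a sketch: the thresholds and patterns are the article's, while the helper name, the exact secret-key regex, and the hash scheme for emails are assumptions.

```typescript
// Sketch: apply the redaction rules to a single key/value pair
// before it is logged or attached to a span.
import { createHash } from "node:crypto";

const SECRET_KEYS = /(apikey|api_key|token|password|cookie|secret)/i;
const MAX_ARG_LEN = 1000;

function redactValue(key: string, value: string): string {
  // Rule 2: secret-looking field names are dropped entirely.
  if (SECRET_KEYS.test(key)) return "[REDACTED]";
  // Rule 4: emails keep their first 3 chars; the rest is hashed.
  if (/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(value)) {
    const h = createHash("sha256").update(value.slice(3)).digest("hex").slice(0, 8);
    return value.slice(0, 3) + h;
  }
  // Rule 3: URLs with query strings lose the query.
  if (value.includes("?")) {
    try {
      const u = new URL(value);
      return u.origin + u.pathname;
    } catch { /* not a URL; fall through */ }
  }
  // Rule 1: oversized arguments are truncated, with a marker.
  if (value.length > MAX_ARG_LEN) {
    return value.slice(0, MAX_ARG_LEN) + "[truncated]";
  }
  return value;
}
```

The key-based check runs first on purpose: a secret field stays [REDACTED] even if its value happens to look like a URL or email.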
Reference stack
| Concern | Self-hosted | Managed |
|---|---|---|
| Traces | Grafana Tempo | Honeycomb, Datadog APM |
| Metrics | Prometheus + Grafana | Datadog, Chronosphere |
| Logs | Loki or OpenSearch | Datadog, New Relic |
| Alerting | Alertmanager + PagerDuty | Grafana OnCall, PagerDuty |
For MCPizy customers, all four concerns in the table above are built in from the Pro tier — the APC layer auto-instruments every tool call.
What "good" looks like
You should be able to answer these in under 30 seconds:
- Which tool is slowest right now, and by how much vs. yesterday?
- Who (which client) is hitting errors in the last hour?
- Is the recent spike caused by upstream (external) or us (internal)?
- What does a typical successful call look like vs a typical failing one?
If any of those takes more than 30 seconds, your stack is missing a piece.
Running MCP in production?
Centralised auth, cost analytics, and the APC optimization layer — free tier included.