In depth
The context window is the LLM's short-term memory. Everything the model needs for the current response — system prompt, chat history, tool definitions, tool call results, retrieved documents — must fit inside this window. Exceed it and you hit a hard error or silently lose the earliest messages.
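Since everything must fit, agent code typically does a pre-flight budget check before sending a request. A minimal sketch, assuming a crude ~4-characters-per-token heuristic (a real application should use the model's actual tokenizer, e.g. tiktoken for OpenAI models); the window and reserve sizes are illustrative:

```python
CONTEXT_WINDOW = 128_000      # e.g. GPT-4o's window
RESERVED_FOR_OUTPUT = 4_000   # leave room for the model's reply

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(messages: list[dict]) -> bool:
    """Check that the whole prompt leaves room for the response."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return used + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this report..."},
]
print(fits_in_window(messages))  # a short prompt fits easily: True
```

On overflow, the caller can then trim history or fail loudly instead of hitting the provider's hard error.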
Context window sizes have exploded in 2024-2025. GPT-3.5 shipped with 4K tokens; GPT-4o supports 128K; Claude Sonnet 4 supports 1M tokens in long-context beta; Gemini 1.5 Pro goes up to 2M, and Gemini 2.5 Pro supports 1M. Larger windows enable whole-codebase reasoning, long research reports, and multi-document synthesis.
But larger isn't free. Cost scales linearly with token count (input and output tokens are priced separately). Latency grows with context size. And attention is not uniform across the window: models show 'lost in the middle' behavior, recalling content at the start and end of a long context far better than content buried in the center.
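The linear cost scaling is easy to see with a little arithmetic. A sketch with hypothetical placeholder prices (not any provider's actual rates), showing why input and output tokens must be tracked separately:

```python
# Assumed, illustrative prices -- real rates vary by provider and model.
PRICE_PER_M_INPUT = 2.50    # $ per 1M input tokens
PRICE_PER_M_OUTPUT = 10.00  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: each token class billed at its own rate."""
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT \
         + (output_tokens / 1_000_000) * PRICE_PER_M_OUTPUT

# Stuffing 100K tokens of context into every call adds up quickly:
print(f"${request_cost(100_000, 1_000):.4f}")  # $0.2600 per call
```

At these assumed rates, an agent that re-sends a 100K-token context on every turn pays for it on every turn, which is the economic argument for the trimming techniques below.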
Managing context is a core skill for agent builders. Techniques include: summarize old turns, use RAG to inject only relevant docs, paginate tool results, and checkpoint state externally. MCP helps by letting servers expose large data incrementally rather than dumping it all into context.
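One of the techniques above, dropping or summarizing the oldest turns, can be sketched as follows. This is a minimal illustration, not a production pattern: `estimate_tokens` stands in for a real tokenizer, and a fuller version would summarize the evicted turns (via another model call) rather than discard them:

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the most recent turns under `budget`."""
    system, turns = messages[0], messages[1:]
    kept, used = [], estimate_tokens(system["content"])
    for msg in reversed(turns):          # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                        # everything older is evicted
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```

The same budget-driven loop generalizes to the other techniques: RAG decides what to inject, pagination decides how much of a tool result to keep, and an external checkpoint holds what the loop evicts.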