A correction about bloated context
A reader pointed out an error/imprecision in a recent post about the context window:
The underlying architecture, moreover, uses pairwise relationships of everything in the context (see this useful document from Anthropic). Adding to the context, therefore, has a quadratically dilutive effect on it.
There are many problems with this claim:
- Even if the assumptions are correct, the conclusion doesn't follow: each item participates in n of the n-squared relations, so the dilution would be linear. (Insofar as particular pairwise relationships are important, those are now competing with quadratically many others, but see below.)
- Every attention head uses softmax, which means that the dilution isn't linear in any straightforward sense: there are mechanisms to preserve the most important information.
- It's the computation that increases quadratically (or approximately so), not the amount of information being used in the model's representation of things.
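To make the first two points concrete, here is a toy sketch in Python. It is not how any real model is implemented: it looks at a single query in a single attention head, and the scores (0 for filler items, 10 for the one important item) are made-up numbers chosen only for illustration. The helper name `weight_on_first` is likewise hypothetical.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_on_first(n, first_score, other_score=0.0):
    """Attention weight one query puts on item 0 among n context items.

    Toy assumption: every other item gets the same score.
    """
    scores = [first_score] + [other_score] * (n - 1)
    return softmax(scores)[0]

for n in (10, 100, 1000):
    pairs = n * n      # pairwise relations overall: this is the compute cost
    per_item = n       # relations any single item takes part in
    flat = weight_on_first(n, 0.0)     # no item stands out: weight is 1/n
    peaked = weight_on_first(n, 10.0)  # one item scores much higher
    print(f"n={n:5d} pairs={pairs:8d} per_item={per_item:5d} "
          f"flat={flat:.4f} peaked={peaked:.4f}")
```

With uniform scores the weight on any one item is exactly 1/n, which is linear dilution, not quadratic. With one high-scoring item, softmax keeps most of the weight on it even at n = 1000, although that weight still drifts downward as filler is added.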
My main point, that adding to the context makes what's there less potent, is still basically correct. "Guard your context window" might be the most underappreciated basic principle of AI engineering.
It's very hard to measure these effects: we care about many different kinds of tasks, and the slowdown you get as the window grows matters a lot but in ways that are hard to quantify. I have no real doubt that the effects are real and important, however. It's about as clear from my experience as something like this could possibly be, and the best agentic programmers are emphasizing this over and over (or at least so it seems to me). So: guard your context window.