long-horizon reasoning – Scott Loftesness

Here is a small, possibly embarrassing confession: I have never, not once, gone looking for the best AI model.

I have a model. It lives in a browser tab — Safari, usually, on whichever device is nearest, occasionally Chrome if I happen to be at the desktop. It does what I need — drafts an email, untangles a sentence, tells me what a Norwegian emigration record from 1856 probably says — and then I close the tab and go on a walk.

Somewhere out there, presumably, a much smarter, much more expensive machine is doing something extraordinary with protein folding or hedge fund arbitrage or the outer edges of mathematics I will never visit. I have made my peace with never meeting it.

This did not use to feel like a confession. For a while there (a year, eighteen months) it felt like the central drama of the whole industry: which model was “best,” who had it, who had lost it, whether some lab’s quarterly earnings call would reveal that the frontier had quietly moved sixty miles down the road while everyone was looking the other way. Benchmarks were released like box scores. People argued about them the way people argue about batting averages, with the same weird intensity, the same conviction that a two-point difference in some abstract reasoning test settled something important about the future.

And then, at some point I can’t quite date — it crept up, the way these things do — I noticed I had stopped caring.

Not because the frontier stopped moving. It didn’t. It’s still moving, arguably faster than ever, in ways that occasionally show up in the news with all the drama of a soap opera (a delayed launch, a researcher poached, a stock down five percent in an afternoon, always something).

I stopped caring because none of it touched me. My model — whatever it was, this week — had long since crossed some invisible threshold past which more didn’t register as more. It was already better than I needed. It has been better than I needed for a while now. I suspect I am not unusual in this. I suspect most people, doing most things, most days, are operating comfortably inside a capability surplus so large they’ve stopped noticing it’s there, the way you stop noticing a room is warm.

If the top of the model range isn’t for people like me (and it increasingly isn’t), then who, or what, is it actually for? I went looking for one piece of the answer and found, instead, a metaphor.

It’s called “context rot.” I have to admit, before I go further, that I’m not sure I’ve ever felt it myself — which, on reflection, is its own small piece of evidence. My sessions close in minutes, not hours. I ask, it answers, I leave. Whatever happens to a model over the fourth or fifth hour of sustained, dependent work is a country I simply don’t visit.

But other people do, increasingly (entire teams do, for entire projects), and what they’re finding out there is worth understanding, even secondhand. Context rot describes something that happens to AI models when they’re asked to work for a long time on something complicated: not five minutes, but five hours; not one question, but a hundred small decisions stacked on top of each other, each one depending on the last.

You’d think the limiting factor would be room. Models have a “context window” — a stated capacity, like a gas tank, measured in tokens, and for a while the marketing numbers on these were the whole story: two million tokens! A library! And you’d think, as with a gas tank, that the thing runs fine until it’s empty and then it stops.

That is not, it turns out, what happens. What happens is closer to what happens to your desk.

You know the desk. Everyone has the desk. It starts the morning clean — an aspirational, almost insulting cleanliness — and by four in the afternoon it is a geological record of the day: three coffee cups, a stack of things you meant to file, a Post-it with a phone number you no longer need, the good pen buried under a printout of something you already dealt with an hour ago. The desk is not full. There is, technically, room. You could clear a space if you tried. But you don’t try, because functionally, cognitively, the desk has stopped being usable long before it ran out of surface area. You start looking for the stapler and forget what you were stapling. This — and I did not make this term up, I want to be clear, though I wish I had — is context rot. The window hasn’t run out. The signal has just drowned in its own debris.

Researchers watching this happen to long-running AI agents have found something almost cruelly elegant about how it fails: it doesn’t fail gradually, the way you’d expect a desk to get gradually messier. Errors compound. A task that takes twice as long doesn’t get twice as likely to go wrong — the failure rate roughly quadruples. Two mistakes early in a long chain of dependent steps don’t add up to a slightly worse outcome. They multiply into something close to total collapse, four hours in, for reasons that trace back to a single bad assumption made in the first twenty minutes and never revisited.

Here is where the frontier comes back in — not as the whole answer, but as a piece of one.

It is not that frontier models are smarter in the way a benchmark measures smart — better at a single hard math problem, a cleverer turn of reasoning. Plenty of models can do that now; the “good enough” tier has crept remarkably high.

It’s that frontier models are apparently, marginally, meaningfully better at not rotting. At keeping the desk usable at hour six. At knowing which of the forty things on the desk actually still matters and which is a coffee cup that should have been thrown out an hour ago. This is a genuinely different kind of intelligence than the one benchmarks were built to measure, and it is almost invisible from the outside — you don’t see it in a single exchange, you see it only in the difference between a project that holds together over three days and one that quietly, subtly, stops making sense somewhere around Tuesday afternoon and nobody notices until Thursday.

If that’s true — if the frontier’s real edge is durability rather than raw cleverness — you’d expect to see it show up in how the labs actually deploy their own models: saving the sharpest tools for the tasks that need to survive the longest.

I went looking for a real-world example and found one closer to home than I expected: Anthropic’s own Slack tool, the one where you tag the AI into a channel the way you’d tag a coworker, and it works alongside a whole team over days, learning the channel as it goes. It runs on a serious, capable, thoroughly frontier model — but not, it turns out, on the company’s very best one. That one is held back, reserved for a smaller and stranger set of problems nobody has solved before at all. I sat with that for a while. The tool built to survive a whole team’s whole week, in public, under the most sustained pressure any of their products face, wasn’t handed the sharpest blade in the drawer. It was handed the second-sharpest — which was apparently, entirely, enough. Which tells you something about where the two kinds of intelligence actually diverge: the merely-very-good model handles the desk staying clean for a week, in public, in front of a whole team, where one bad assumption made Monday and never revisited would be visible to everyone by Thursday. The truly new capability is being held in reserve for something else altogether.

I don’t have a tidy place to land this, and I’m suspicious of anyone who does. But here’s the closest I can get.

Imagine a three-Michelin-star chef — the kind of person who has spent thirty years learning to coax something transcendent out of a single scallop, who can tell you, by smell, that a stock has forty more minutes in it — standing at your stove on a Tuesday night making you a grilled cheese sandwich. It will, I promise you, be a very good grilled cheese sandwich. The bread will be evenly golden. The cheese will have reached some ideal, fully-considered state of melt. But almost none of what makes that chef extraordinary is actually being used to make it — none of the thirty years spent learning to hold forty things in mind at once without losing track of any of them, the exact skill, it occurs to me, that keeps a long, complicated project from quietly falling apart on day three. The technique is idling. The thirty years are in the room, present, available, and almost entirely beside the point, because a grilled cheese sandwich was never the place where thirty years shows up. It shows up somewhere else — in a dish you will never order, on a night you weren’t there.

What you got instead, on your ordinary Tuesday, was simply more than enough.
