RAG – Scott Loftesness

Two airlines can fly the same airplane. Not airplanes of the same type, but the same airplane, serial number and all, handed back at the end of a lease and reassigned, sometimes within weeks, to a competitor on another continent. AerCap owns more commercial aircraft than any airline on earth, and it leases them to airlines that spend their advertising budgets convincing passengers that flying them is a distinctive experience. The 737 MAX that wears Ryanair’s livery this year might wear Lion Air’s the next, repainted, recertified, its avionics untouched, its airframe indifferent to the change of operator. The lessor does not care who is flying its asset. It cares that the asset comes back in airworthy condition and that the lease payments clear.

What the airline owns, in the sense that matters, is never the aircraft. It is the route network built up over decades of slot negotiations at constrained airports. It is the maintenance log — every inspection, every part swapped, every anomaly a mechanic in Singapore flagged in 2019 that turned out to predict a fatigue crack nobody else had seen yet. None of that travels with the airplane when the lease ends. It stays behind, compounding, in systems the airline built and the lessor never touches.

Karl Mehta, who has spent a career inside enterprise software watching this kind of asymmetry repeat itself, put a version of it plainly: a model is a brain you rent, and you and your competitor rent the same one. The formulation has the compression of something that has been tested in a few dozen meetings before it found that sentence. It is also, structurally, the airplane story. Anthropic and OpenAI and Google are AerCap. They retain residual value on enormous capital assets — clusters of GPUs depreciating on a schedule, weights trained at a cost that only a handful of balance sheets in the world can absorb — and they lease access to those assets by the token, to anyone who can pay, including, in the same afternoon, two companies trying to put each other out of business. The model does not know whose prompt it is answering. It has no loyalty file. It has, in fact, no memory at all, in the ordinary sense of the word — each call begins exactly where the last one ended for everybody, which is nowhere.

The asymmetry that airlines exploit is the one available here too, and it sits one layer up from the engine. Call it the embedding store, the vector database, the fine-tuning corpus, the retrieval index — the terminology varies by vendor, but the function is constant. It is the accumulated, indexed residue of every customer interaction a company has had, structured so that the rented brain can be handed the relevant fragment of it at the moment of each new call. A bank’s fraud model and a competing bank’s fraud model can call the identical foundation model, route through the identical API, and arrive at entirely different verdicts on the identical transaction, because one of them is retrieving against eleven years of labeled chargebacks specific to its own card portfolio and the other is retrieving against four. The intelligence rented by the hour is, for practical purposes, a commodity, priced down toward marginal cost the way jet fuel is priced — everyone pays close to the same number per unit. The memory is not a commodity. It cannot be, because it is not for sale; it is the institutional record of what has already happened to you, and no amount of capital lets a competitor buy a copy of your chargeback history any more than it lets them buy your maintenance logs.

This produces a particular kind of corporate vertigo, which Mehta’s sentence is really addressing. For three or four years the industry conversation about artificial intelligence has been a conversation about models: which lab’s was larger, which benchmark moved, which release cycle a company should anchor its roadmap to. That conversation rewards being an early and aggressive lessee. But a lessee relationship, however aggressive, does not compound into anything a competitor cannot eventually also lease. The compounding, when it happens, happens on the company’s own side of the API call: in how cleanly a company has structured the record of its own customers, its own failures, its own edge cases, so that the rented brain, plugged in fresh every morning with no memory of yesterday, can be handed exactly the right fragment of yesterday and made to look, for a few hundred milliseconds, like it has been there all along.

A hospital chart has two kinds of entries. There is the vital-signs strip clipped to the bed rail — temperature, pulse, blood pressure, checked every four hours and replaced every four hours, because a reading from yesterday tells the night nurse nothing about the patient in front of her right now. And there is the permanent record in the file downstairs: the allergy that nearly killed him in 2019, the surgery, the medication history going back a decade, written once and never overwritten, because that record is exactly as valuable ten years from now as it is today. Nobody confuses the two charts. Nobody staples last Tuesday’s blood pressure into the permanent file. The hospital figured out, long before anyone digitized it, that memory is not one problem. It is two, and they fail in opposite directions if you run them through the same system.

Most teams building the layer Mehta is describing make exactly that mistake — they staple everything to the same chart. The shorthand for it is dumping everything into a vector database and praying, and it is worth asking why that particular error is so popular. The answer is that it feels like progress: embeddings go in, something resembling memory comes out, and the team moves on to the next sprint without confronting the harder question, which is what kind of memory it just built.

Short-term memory is the vital-signs strip: everything the model needs to finish the task in front of it and nothing it needs after. The customer-service exchange in progress, the order number already mentioned, and the fact that this is the second call today all belong here. So does the scratchpad of a multi-step agent: the search results just pulled, the file just opened, the partial answer being assembled before it commits. The test is not how important the information is but how long it stays true. A customer’s mood this minute is real and gone in twenty minutes; storing it permanently is like stapling yesterday’s temperature reading into the permanent file, undated, until the chart tells you nothing about fever and everything about clutter. Short-term memory should live in the context window itself, or a session-scoped cache, and it should be allowed to die when the session ends. The sin is not forgetting it. The sin is remembering it forever.
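One way to picture a session-scoped cache that is allowed to die is a small in-process store whose entries expire on their own. This is a minimal sketch, not a recommendation of any particular product: the class name `SessionMemory`, the `ttl_seconds` parameter, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SessionMemory:
    """Short-term memory scoped to one session; entries expire on their own."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def remember(self, key: str, value: Any) -> None:
        # Store the value together with the moment it stops being trusted.
        self._items[key] = (self._clock() + self._ttl, value)

    def recall(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale reading: forget it rather than hand it back.
            del self._items[key]
            return None
        return value

    def end_session(self) -> None:
        # Nothing here is meant to outlive the session.
        self._items.clear()
```

A caller might create `SessionMemory(ttl_seconds=20 * 60)` at the start of a conversation, call `remember("order_number", ...)` as details come up, and call `end_session()` when the conversation closes. Anything worth keeping longer would have to be promoted to a long-term store deliberately, not left here by default.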

Long-term memory is the file downstairs, and it does not come in one shape any more than that file does. The first shape is semantic memory — facts. A customer’s account tier. The chargeback history that decides, in fractions of a second, whether this morning’s transaction clears. Facts belong in a database with a schema, not a vector store, because a fact has a right answer and a vector store gives you an approximate neighbor. Ask a vector index what tier a customer is on and it hands you the five most semantically similar sentences in the corpus — one correct, four merely correct-sounding. Ask a schema the same question and it tells you, because that is what the schema is for.
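The difference is easy to see in a few lines. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `accounts` table; the table name, column names, and sample rows are invented for illustration and do not come from any real system.

```python
import sqlite3
from typing import Optional


def build_accounts_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " customer_id TEXT PRIMARY KEY,"
        " tier TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts (customer_id, tier) VALUES (?, ?)",
        [("c-001", "gold"), ("c-002", "standard")],  # made-up sample rows
    )
    conn.commit()
    return conn


def account_tier(conn: sqlite3.Connection, customer_id: str) -> Optional[str]:
    """Exact lookup: one right answer, or None if the customer is unknown."""
    row = conn.execute(
        "SELECT tier FROM accounts WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None
```

The useful property is the second branch: when the customer is not in the table, the lookup returns `None` instead of the nearest plausible-sounding answer.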

The more sophisticated shops are already building the seam between the two, rather than picking one and living with its blind spot. A knowledge graph keeps the relationships a schema is good at — this customer, that account, this chargeback, in fixed and queryable connection to one another — while still letting a retrieval layer search across it by meaning rather than by exact key. The approach has a name now, GraphRAG, and the name matters less than what it concedes: that facts and resemblance are different operations, and the honest fix is to run both and let each one answer the kind of question it’s actually suited for, not to force a single index to pretend it can do both jobs at once.
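The division of labor can be sketched in miniature: a resemblance step finds where to enter the graph, and exact traversal of typed edges answers the factual part. This is a toy under stated assumptions, not an implementation of any GraphRAG library. The `EDGES` and `NOTES` data are invented, and word overlap stands in for a real embedding model.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical graph: node id -> list of (relation, target node id).
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "customer:ana": [("owns", "account:77")],
    "account:77": [("has_chargeback", "chargeback:9")],
    "chargeback:9": [],
}

# Hypothetical free-text notes attached to nodes, searched by resemblance.
NOTES: Dict[str, str] = {
    "chargeback:9": "customer disputed a duplicate hotel charge",
    "account:77": "travel rewards card opened online",
}


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def nearest_node(query: str) -> str:
    """Resemblance step: Jaccard word overlap stands in for embeddings."""
    q = _words(query)

    def score(node: str) -> float:
        w = _words(NOTES[node])
        return len(q & w) / len(q | w)

    return max(NOTES, key=score)


def neighbors(node: str, relation: str) -> List[str]:
    """Fact step: follow a typed edge forward, exactly."""
    return [target for rel, target in EDGES.get(node, []) if rel == relation]


def sources(target: str, relation: str) -> List[str]:
    """Fact step: follow a typed edge backward, exactly."""
    return [src for src, edges in EDGES.items() if (relation, target) in edges]


def who_disputed(query: str) -> List[str]:
    """Find a chargeback by meaning, then walk exact edges to its customer."""
    chargeback = nearest_node(query)
    accounts = sources(chargeback, "has_chargeback")
    return [cust for acct in accounts for cust in sources(acct, "owns")]
```

Only the first hop is fuzzy. Once the entry node is chosen, every later step is a lookup with a definite answer, which is the concession described above.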

The second shape is episodic memory — what actually happened. The specific conversation last March in which the customer explained, at length, why the previous fix didn’t work. The exact sequence of an agent’s failed attempt at a task, preserved so the next attempt doesn’t repeat it. This is where the vector store finally earns its keep, because an episode isn’t an exact-match lookup, it’s a resemblance — has anything like this come up before — and a vector index, built to find the nearest thing to a fuzzy question, is the right tool for that question and almost no other. The error was never using a vector store. The error is using only a vector store, for facts as well as episodes, on the theory that one hammer with sufficient cosine similarity can stand in for the whole toolbox.
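The "has anything like this come up before" question reduces to nearest-neighbor search over vectors. The sketch below is deliberately small and makes one loud assumption: `embed` is a toy word-count vector, where a real system would call an embedding model. The `EpisodeStore` class and its method names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class EpisodeStore:
    """Episodic memory: past events, retrieved by resemblance, not by key."""

    def __init__(self) -> None:
        self._episodes: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._episodes.append((text, embed(text)))

    def similar(self, query: str, k: int = 1) -> List[Tuple[float, str]]:
        """Return the k most similar episodes as (score, text), best first."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self._episodes]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Note what the store cannot do: it always returns its nearest neighbor, however distant. A caller that needs to know whether anything relevant exists at all has to apply its own score threshold.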

The third shape is the rarest, and the one teams forget to build at all: procedural memory, which is not a fact and not an episode but a skill. It has two halves. Style is the visible half: the model’s learned sense of how this company writes a refund email, escalates a complaint, formats an invoice. That half lives in fine-tuning and in carefully maintained house-style examples. The other half is harder to see and matters more: the rails the model is forced to run on before it ever gets to choose a word. A refund above some threshold routes to a human, no exceptions, because the workflow says so, not because the model was persuaded to think so on this particular call. An agent that touches a production database does it through a reviewed function with a fixed set of permitted calls, not through whatever query it improvises in the moment. None of that lives in a prompt, and none of it lives in the model’s weights either. It lives in code (the orchestration layer, the permissioning, the state machine the agent is required to pass through), and it is procedural in the oldest sense of the word: not a memory of what to say but a memory of what is and isn’t allowed to happen, enforced whether or not the model that day feels like remembering it. Neither half lives in a database, and both change slower than the other two shapes, the way a surgeon’s hands carry both technique and caution years after the specific patients are forgotten. A company that has built rich semantic and episodic memory and tuned the voice but skipped the rails has a model that knows everything about its customers, writes in exactly the right voice, and is one well-crafted prompt away from doing something the company never agreed to.
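The rails can be very plain code. The sketch below shows the two rules the paragraph names: a refund threshold that routes to a human, and a fixed set of permitted actions. Every specific here is an assumption for illustration, including the threshold value, the action name `lookup_order`, and its canned return string.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical policy value; a real threshold would come from the business.
REFUND_HUMAN_REVIEW_THRESHOLD = 500.00


@dataclass(frozen=True)
class RefundDecision:
    route: str  # "auto" or "human"
    amount: float


def route_refund(amount: float) -> RefundDecision:
    """Decide the route in code, before any model output is consulted."""
    if amount <= 0:
        raise ValueError("refund amount must be positive")
    route = "human" if amount > REFUND_HUMAN_REVIEW_THRESHOLD else "auto"
    return RefundDecision(route=route, amount=amount)


def lookup_order(order_id: str) -> str:
    """Stand-in for a reviewed, read-only database function."""
    return f"order {order_id}: shipped"


# The agent may only call what is listed here, whatever it asks for.
PERMITTED_ACTIONS: Dict[str, Callable[[str], str]] = {
    "lookup_order": lookup_order,
}


def run_action(name: str, argument: str) -> str:
    """Dispatch a model-requested action through the allowlist."""
    action = PERMITTED_ACTIONS.get(name)
    if action is None:
        raise PermissionError(f"action not permitted: {name}")
    return action(argument)
```

The point of the design is that no prompt text reaches either decision. A request for an unlisted action fails with `PermissionError` regardless of how persuasively it was phrased.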

The real argument here is not which database serves which layer — that part is plumbing, and plumbing changes every eighteen months. The argument is that memory has to be triaged the way the hospital triages it, with something deciding on purpose what survives the session and what doesn’t, rather than writing every token of every interaction into the same undifferentiated store and trusting retrieval to sort it out later. A vector database with no triage in front of it is not a memory system. It is a landfill with a search function, and it will retrieve the wrong eleven-month-old conversation with the same confidence it retrieves the right one, because nobody wrote the part of the system whose only job is deciding what belongs on which chart.
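The part whose only job is deciding what belongs on which chart might start as a single routing function. This is a sketch under heavy assumptions: the `kind` labels and the `still_true_after_session` flag are hypothetical, and producing them reliably is the hard part that this code does not attempt.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    SESSION = "session"    # dies with the session
    SEMANTIC = "semantic"  # schema-backed facts
    EPISODIC = "episodic"  # vector-indexed events
    DISCARD = "discard"    # not worth keeping anywhere


@dataclass(frozen=True)
class MemoryItem:
    kind: str   # hypothetical labels: "fact", "episode", "scratch", "mood"
    text: str
    still_true_after_session: bool


def triage(item: MemoryItem) -> Store:
    """Route an item by how long it stays true, then by what shape it has."""
    if not item.still_true_after_session:
        return Store.SESSION
    if item.kind == "fact":
        return Store.SEMANTIC
    if item.kind == "episode":
        return Store.EPISODIC
    # Durable but unclassified: refuse to file it rather than guess.
    return Store.DISCARD
```

The first check mirrors the test given earlier: what matters is not how important the information is but how long it stays true. Only items that survive that check are sorted by shape.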

The lessor’s airplane, repainted, will fly for someone else next year. The route network will not. Neither will the schema that knows a customer’s tier on contact, nor the index that remembers the conversation from last March, nor the fine-tuned hand that knows, without being told twice, how this company writes a refund email. These are the things that do not come back at the end of the lease, because they were never on it.
