Tag: foundation models

What the Lessor Keeps

Two airlines can fly the same airplane. Not airplanes of the same type but the same airplane, serial number and all, handed back at the end of a lease and reassigned, sometimes within weeks, to a competitor on another continent. AerCap owns more commercial aircraft than any airline on earth, and it leases them to airlines that spend their advertising budgets convincing passengers that flying with them is a distinctive experience. The 737 MAX that wears Ryanair’s livery this year might wear Lion Air’s the next, repainted, recertified, its avionics untouched, its airframe indifferent to the change of operator. The lessor does not care who is flying its asset. It cares that the asset comes back in airworthy condition and that the lease payments clear.

What the airline owns, in the sense that matters, is never the aircraft. It is the route network built up over decades of slot negotiations at constrained airports. It is the maintenance log — every inspection, every part swapped, every anomaly a mechanic in Singapore flagged in 2019 that turned out to predict a fatigue crack nobody else had seen yet. None of that travels with the airplane when the lease ends. It stays behind, compounding, in systems the airline built and the lessor never touches.

Karl Mehta, who has spent a career inside enterprise software watching this kind of asymmetry repeat itself, put a version of it plainly: a model is a brain you rent, and you and your competitor rent the same one. The formulation has the compression of something that has been tested in a few dozen meetings before it found that sentence. It is also, structurally, the airplane story. Anthropic and OpenAI and Google are AerCap. They retain residual value on enormous capital assets — clusters of GPUs depreciating on a schedule, weights trained at a cost that only a handful of balance sheets in the world can absorb — and they lease access to those assets by the token, to anyone who can pay, including, in the same afternoon, two companies trying to put each other out of business. The model does not know whose prompt it is answering. It has no loyalty file. It has, in fact, no memory at all, in the ordinary sense of the word — each call begins exactly where the last one ended for everybody, which is nowhere.

The asymmetry that airlines exploit is the one available here too, and it sits one layer up from the engine. Call it the embedding store, the vector database, the fine-tuning corpus, the retrieval index — the terminology varies by vendor, but the function is constant. It is the accumulated, indexed residue of every customer interaction a company has had, structured so that the rented brain can be handed the relevant fragment of it at the moment of each new call. A bank’s fraud model and a competing bank’s fraud model can call the identical foundation model, route through the identical API, and arrive at entirely different verdicts on the identical transaction, because one of them is retrieving against eleven years of labeled chargebacks specific to its own card portfolio and the other is retrieving against four. The intelligence rented by the hour is, for practical purposes, a commodity, priced down toward marginal cost the way jet fuel is priced — everyone pays close to the same number per unit. The memory is not a commodity. It cannot be, because it is not for sale; it is the institutional record of what has already happened to you, and no amount of capital lets a competitor buy a copy of your chargeback history any more than it lets them buy your maintenance logs.

This produces a particular kind of corporate vertigo, which Mehta’s sentence is really addressing. For three or four years the industry conversation about artificial intelligence has been a conversation about models: which lab’s was larger, which benchmark moved, which release cycle a company should anchor its roadmap to. That conversation rewards being an early and aggressive lessee. But a lessee relationship, however aggressive, does not compound into anything a competitor cannot eventually also lease. The compounding, when it happens, happens in the layer that feeds the API call, one step up from the rented engine: in how cleanly a company has structured the record of its own customers, its own failures, its own edge cases, so that the rented brain, plugged in fresh every morning with no memory of yesterday, can be handed exactly the right fragment of yesterday and made to look, for a few hundred milliseconds, like it has been there all along.

A hospital chart has two kinds of entries. There is the vital-signs strip clipped to the bed rail — temperature, pulse, blood pressure, checked every four hours and replaced every four hours, because a reading from yesterday tells the night nurse nothing about the patient in front of her right now. And there is the permanent record in the file downstairs: the allergy that nearly killed him in 2019, the surgery, the medication history going back a decade, written once and never overwritten, because that record is exactly as valuable ten years from now as it is today. Nobody confuses the two charts. Nobody staples last Tuesday’s blood pressure into the permanent file. The hospital figured out, long before anyone digitized it, that memory is not one problem. It is two, and they fail in opposite directions if you run them through the same system.

Most teams building the layer Mehta is describing make exactly that mistake — they staple everything to the same chart. The shorthand for it is dumping everything into a vector database and praying, and it is worth asking why that particular error is so popular. The answer is that it feels like progress: embeddings go in, something resembling memory comes out, and the team moves on to the next sprint without confronting the harder question, which is what kind of memory it just built.

Short-term memory is the vital-signs strip — everything the model needs to finish the task in front of it and nothing it needs after. A customer-service exchange in progress, the order number already mentioned, the fact that this is the second call today, belongs here. So does the scratchpad of a multi-step agent: the search results just pulled, the file just opened, the partial answer being assembled before it commits. The test is not how important the information is but how long it stays true. A customer’s mood this minute is real and gone in twenty minutes; storing it permanently is like stapling yesterday’s temperature reading into the permanent file, undated, until the chart tells you nothing about fever and everything about clutter. Short-term memory should live in the context window itself, or a session-scoped cache, and it should be allowed to die when the session ends. The sin is not forgetting it. The sin is remembering it forever.
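One way to picture the vital-signs strip in code is a session-scoped store whose entries expire and are never written anywhere durable. This is a minimal sketch, not a recommendation of any particular product: the class name SessionMemory, the twenty-minute lifetime, and the keys are all assumptions made up for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Short-term memory for a single session; nothing here is persisted."""
    ttl_seconds: float = 1200.0  # assumed lifetime: twenty minutes
    _items: dict = field(default_factory=dict)

    def remember(self, key, value, now=None):
        # Store the value together with the moment it stops being trusted.
        now = time.monotonic() if now is None else now
        self._items[key] = (value, now + self.ttl_seconds)

    def recall(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._items[key]  # a stale reading is dropped, not returned
            return None
        return value

    def end_session(self):
        # The strip is thrown away when the session ends.
        self._items.clear()


session = SessionMemory(ttl_seconds=1200.0)
session.remember("order_number", "A-1042", now=0.0)
session.remember("customer_mood", "frustrated", now=0.0)

fresh = session.recall("customer_mood", now=60.0)    # inside the window
stale = session.recall("customer_mood", now=1500.0)  # past twenty minutes
session.end_session()
after_close = session.recall("order_number", now=61.0)
```

The explicit `now` argument is only there so the expiry can be shown without waiting; a real session cache would rely on its own clock.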

Long-term memory is the file downstairs, and it does not come in one shape any more than that file does. The first shape is semantic memory — facts. A customer’s account tier. The chargeback history that decides, in fractions of a second, whether this morning’s transaction clears. Facts belong in a database with a schema, not a vector store, because a fact has a right answer and a vector store gives you an approximate neighbor. Ask a vector index what tier a customer is on and it hands you the five most semantically similar sentences in the corpus — one correct, four merely correct-sounding. Ask a schema the same question and it tells you, because that is what the schema is for.
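The difference between a fact and a neighbor is easy to show with Python's built-in sqlite3 module. The sketch below assumes a hypothetical customers table with a tier column; the table, the IDs, and the function name are invented for illustration.

```python
import sqlite3

# An in-memory database stands in for the system of record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, tier TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c-001", "gold"), ("c-002", "standard")],
)


def account_tier(customer_id):
    """Exact lookup: one right answer, or None if the customer is unknown."""
    row = conn.execute(
        "SELECT tier FROM customers WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return None if row is None else row[0]


known = account_tier("c-001")
unknown = account_tier("c-999")
```

The lookup either returns the stored tier or admits it has no record. It never returns something that merely resembles an answer.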

The more sophisticated shops are already building the seam between the two, rather than picking one and living with its blind spot. A knowledge graph keeps the relationships a schema is good at — this customer, that account, this chargeback, in fixed and queryable connection to one another — while still letting a retrieval layer search across it by meaning rather than by exact key. The approach has a name now, GraphRAG, and the name matters less than what it concedes: that facts and resemblance are different operations, and the honest fix is to run both and let each one answer the kind of question it’s actually suited for, not to force a single index to pretend it can do both jobs at once.
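A toy version of that seam can be sketched in a few lines. This is not the GraphRAG implementation, only an illustration of the division of labor it concedes: fixed edges answer "which chargebacks belong to this customer," and a fuzzy match (here plain word overlap, standing in for a real embedding search) answers "which of those looks like this question." Every node ID, note, and function name below is hypothetical.

```python
# Exact relationships, the part a schema or graph is good at.
EDGES = {
    "customer:c-001": ["account:a-17"],
    "account:a-17": ["chargeback:cb-9", "chargeback:cb-12"],
}

# Free-text notes attached to nodes, the part searched by resemblance.
NOTES = {
    "chargeback:cb-9": "disputed hotel charge while travelling abroad",
    "chargeback:cb-12": "duplicate subscription billing reported by customer",
}


def connected(node, prefix):
    """Walk the edges from node and collect reachable IDs with the given prefix."""
    seen, stack, found = set(), [node], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current.startswith(prefix):
            found.append(current)
        stack.extend(EDGES.get(current, []))
    return sorted(found)


def overlap(query, text):
    """Crude similarity score: the number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def closest_chargeback(customer, query):
    # Step 1 is exact: only this customer's chargebacks are candidates.
    candidates = connected(customer, "chargeback:")
    # Step 2 is fuzzy: pick the candidate whose note best resembles the query.
    return max(candidates, key=lambda node: overlap(query, NOTES[node]))


owned = connected("customer:c-001", "chargeback:")
match = closest_chargeback("customer:c-001", "billing for a duplicate subscription")
```

The graph walk guarantees the result belongs to the right customer; the similarity step only chooses among records that already passed that test.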

The second shape is episodic memory — what actually happened. The specific conversation last March in which the customer explained, at length, why the previous fix didn’t work. The exact sequence of an agent’s failed attempt at a task, preserved so the next attempt doesn’t repeat it. This is where the vector store finally earns its keep, because an episode isn’t an exact-match lookup, it’s a resemblance — has anything like this come up before — and a vector index, built to find the nearest thing to a fuzzy question, is the right tool for that question and almost no other. The error was never using a vector store. The error is using only a vector store, for facts as well as episodes, on the theory that one hammer with sufficient cosine similarity can stand in for the whole toolbox.
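The "has anything like this come up before" question can be sketched with nothing but the standard library. The embed function below is a deliberately crude stand-in (a bag-of-words count, not a real embedding model), and the stored episodes are invented; the point is only the shape of a nearest-neighbor lookup.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical episodes preserved from earlier sessions.
episodes = [
    "customer explained the previous fix did not work after the router reset",
    "agent failed to export the invoice because the file was locked",
    "customer asked to change the delivery address before shipping",
]
index = [(embed(text), text) for text in episodes]


def most_similar(query, k=1):
    """Return the k stored episodes that most resemble the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]


best = most_similar("the fix did not work again")[0]
```

Note what this lookup cannot do: asked for an account tier, it would still return its nearest episode, with no signal that the question was the wrong kind.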

The third shape is the rarest, and the one teams forget to build at all: procedural memory, which is not a fact and not an episode but a skill, the model’s learned sense of how this company writes a refund email, escalates a complaint, formats an invoice. Style is the visible half of it. The other half is harder to see and matters more: the rails the model is forced to run on before it ever gets to choose a word. A refund above some threshold routes to a human, no exceptions, because the workflow says so, not because the model was persuaded to think so on this particular call. An agent that touches a production database does it through a reviewed function with a fixed set of permitted calls, not through whatever query it improvises in the moment. None of that lives in a prompt, and none of it lives in the model’s weights either. It lives in code (the orchestration layer, the permissioning, the state machine the agent is required to pass through), and it is procedural in the oldest sense of the word: not a memory of what to say but a memory of what is and isn’t allowed to happen, enforced whether or not the model that day feels like remembering it. Taken whole, procedural memory doesn’t live in a database at all. The style half lives in fine-tuning and in carefully maintained house-style examples; the rails half lives in the surrounding scaffolding of guardrails and permitted actions. It changes more slowly than the other two, the way a surgeon’s hands carry both technique and caution years after the specific patients are forgotten. A company that has built rich semantic and episodic memory and taught the model its house style, but skipped the rails, has a model that knows everything about its customers, writes in exactly the right voice, and is one well-crafted prompt away from doing something the company never agreed to.
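The rails half is the easiest of the three shapes to show, because it is ordinary code. The sketch below assumes a hypothetical refund limit of 200.00 and a hypothetical allow-list with a single reviewed function; neither figure nor name comes from any real system.

```python
from decimal import Decimal

REFUND_LIMIT = Decimal("200.00")  # hypothetical threshold set by the workflow


def route_refund(amount):
    """Decide who handles a refund; nothing the model says can change this."""
    if amount > REFUND_LIMIT:
        return "human_review"
    return "auto_approve"


def lookup_order(order_id):
    """A reviewed, read-only function the agent is allowed to call."""
    return {"order_id": order_id, "status": "shipped"}


# The fixed set of permitted calls. Anything absent is refused outright.
PERMITTED_ACTIONS = {"lookup_order": lookup_order}


def run_action(name, **kwargs):
    """Execute an agent-requested action only if it is on the allow-list."""
    if name not in PERMITTED_ACTIONS:
        raise PermissionError(f"action {name!r} is not permitted")
    return PERMITTED_ACTIONS[name](**kwargs)


large = route_refund(Decimal("500.00"))
small = route_refund(Decimal("50.00"))
order = run_action("lookup_order", order_id="A-1042")
```

A request such as `run_action("drop_table")` raises PermissionError before any improvised query can run, which is the whole design: the refusal sits outside the model.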

The real argument here is not which database serves which layer — that part is plumbing, and plumbing changes every eighteen months. The argument is that memory has to be triaged the way the hospital triages it, with something deciding on purpose what survives the session and what doesn’t, rather than writing every token of every interaction into the same undifferentiated store and trusting retrieval to sort it out later. A vector database with no triage in front of it is not a memory system. It is a landfill with a search function, and it will retrieve the wrong eleven-month-old conversation with the same confidence it retrieves the right one, because nobody wrote the part of the system whose only job is deciding what belongs on which chart.
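The missing part, the piece whose only job is deciding what belongs on which chart, can be as small as a routing table. This is a sketch under loud assumptions: the observation kinds, the store names, and the rule that unknown kinds are discarded are all invented here, and a real triage step would need richer rules than a dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class Store(Enum):
    SESSION = "session"    # dies with the session
    SEMANTIC = "semantic"  # schema-backed facts
    EPISODIC = "episodic"  # searchable record of what happened
    DISCARD = "discard"    # never written anywhere


@dataclass(frozen=True)
class Observation:
    kind: str
    text: str


# Hypothetical routing rules: which kind of observation goes to which store.
RULES = {
    "mood": Store.SESSION,
    "scratchpad": Store.SESSION,
    "account_fact": Store.SEMANTIC,
    "conversation_summary": Store.EPISODIC,
    "failed_attempt": Store.EPISODIC,
}


def triage(observation):
    """Decide on purpose where an observation lives; unknown kinds are dropped."""
    return RULES.get(observation.kind, Store.DISCARD)


mood = triage(Observation("mood", "customer sounds frustrated"))
fact = triage(Observation("account_fact", "tier changed to gold"))
noise = triage(Observation("keystroke_log", "raw token stream"))
```

Defaulting to discard rather than to the vector store is the design choice that matters: nothing reaches long-term memory unless a rule put it there.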

The lessor’s airplane, repainted, will fly for someone else next year. The route network will not. Neither will the schema that knows a customer’s tier on contact, nor the index that remembers the conversation from last March, nor the fine-tuned hand that knows, without being told twice, how this company writes a refund email. These are the things that do not come back at the end of the lease, because they were never on it.

AI Apple Bicycles History

The Best Lathe in the Shop

By Scott Loftesness
June 13, 2026

Part 3 of 3…

There is a version of this story where Apple is the Wright Brothers.

It is not an unreasonable version. Apple has done the safety bicycle move more times than almost any company in history — taken a technology the engineers built for engineers and brought it down to earth, made it a machine for everyone. The Mac. The iPod. The iPhone. Each one was a wheel coming down. Each one arrived after a period of apparent slowness, of critics saying Apple had lost its edge, of the industry having already moved on to the next thing. Each one was, in retrospect, obvious. Apple had been in the bicycle shop the whole time. You just couldn’t see what they were building.

So when Apple showed its hand at WWDC this week — a rebuilt Siri operating at the OS level, accessing your messages and mail and photos in real time, understanding context across apps, doing things the old Siri could only approximate — it is tempting to read it as Kitty Hawk. The long preparation made visible. The brothers finally leaving the shop.

It might be. It also might not be. That is the only honest thing to say.

What Apple showed was real. The new Siri, built on Apple’s own Foundation Models with help from Google’s Gemini, is not the Siri that became a punchline. It holds context. It moves across apps without being asked. It knows what you were doing five minutes ago and connects it to what you are doing now. It can surface a photo without opening Photos, build a navigation route from an image, draft a message in the tone of the conversation it is joining. These are not features. They are the beginning of an operating system that understands you, which is a different thing from an operating system that executes your commands.

The structure of the keynote said more than the words did. Apple led with fixes before features. iOS 27 is a Snow Leopard update — performance, reliability, the underlying machinery — and Siri AI was presented as one item on a long list rather than the main event. This is Apple’s tell. When they are doing something foundational they tend to understate it, the way a craftsman doesn’t announce the quality of his work but simply does it and lets you find it. The penny-farthing riders called their machine the ordinary. They didn’t think they needed to explain.

But here is the thing about the bicycle shop analogy that the optimistic version leaves out. The Wright Brothers knew what they were trying to build. They had been thinking about flight for years before Kitty Hawk. The bicycle shop gave them the craft knowledge, the physical intuition, the hands-on education in how machines move through space. What it did not give them was the destination. They brought the destination themselves.

The question Apple has not answered for me, the question this week’s keynote raised rather than resolved, is whether they know where they are going, or whether this has only been a partial reveal and there is much more behind the curtain.

The OS-level integration is the chain drive. Decoupling AI from the app, letting it run through the substrate the way a chain runs through a drivetrain, is exactly the kind of architectural insight that changes what a machine can do. It is not a feature you add. It is a rethinking of what the machine is for. Every previous AI assistant lived above the operating system, looking down at your data from a remove. Apple’s new architecture lives inside it, which is a different relationship entirely — the difference between a mechanic who reads about your car and one who has driven it for a year.

That is the Coventry precision. The tight tolerances. The discipline of making things that have to work at the level where failure is not an option.

What nobody knows, including Apple, is what you build with it.

There is also this: Tim Cook will not be driving this evolution. He announced that John Ternus takes over in September, which means this WWDC — this particular showing of the hand — is the last one Cook owns. Ternus is a hardware engineer, the man who built the Apple Silicon transition, the person most responsible for the Neural Engine that makes on-device inference possible. He is, in the bicycle shop metaphor, the craftsman who built the lathe. Whether he knows how to use it to make something that flies is the question the next several years will answer.

History is patient about these things. It lets the work speak.

In 1892, two brothers opened a shop on West Third Street in Dayton and started fixing bicycles. They were not trying to change the world. They were trying to make a living, to learn a machine, to understand in their hands what the books couldn’t teach them. The flying came later, and it came because of the shop, not despite it. The shop was the point. They just didn’t know it yet.

Apple has the best lathe in the bicycle shop. They have the chain drive architecture, the on-device precision, the installed base of two billion devices that will carry whatever they build into more hands than any other platform on earth. They have a new set of hands on the wheel starting in September, hands that know the metal intimately, that built the engine the whole thing runs on.

What they do not have yet — or if they have it, they are not showing it — is the image of what they are flying toward.

Maybe that’s the ordinary part. Maybe that’s always been the ordinary part. You don’t know what you’re building until you’ve built it, and by then the world has already changed, and everyone says it was obvious, and they are right, and they are also completely wrong about when the decision was made.

The shop is open. The lathe is running. Work is underway.

What happens when someone finally knows what to make?

Business History IBM Infrastructure Nvidia Programming Semiconductors

The Half-Life of Moats

Prompted by an article on X by @magicsilicon on the CUDA moat. Research and drafting assistance from my AI intern assistant Clark.

The NVIDIA H100 looks, in retrospect, like an inevitability. It wasn’t.

What Jensen Huang built is more accurately understood as a sixteen-year accumulation of optionality: a platform investment made in 2006 for a market that wouldn’t fully materialize until 2022. NVIDIA introduced the G80 architecture in November 2006, laying the groundwork for CUDA’s release a few months later. The stated ambition was to let scientists write C++ that ran on GPU cores without needing to understand 3D graphics pipelines. The unstated bet was that parallel computation would eventually matter for something bigger than rendering shadows in video games.

For sixteen years, it mostly didn’t. Not at scale. Not commercially. CUDA lived in research labs and HPC clusters. It attracted a small, devoted, and economically marginal user base — the kind that papers cite but investors ignore. NVIDIA kept investing in it anyway: cuDNN for deep learning operations, cuBLAS for linear algebra, a layered ecosystem of libraries that made CUDA not just accessible but nearly irreplaceable for anyone doing serious numerical computation. When TensorFlow and PyTorch emerged as the standard frameworks for neural network research, they didn’t adopt CUDA because it was the only option. They adopted it because CUDA was where the optimized kernels already lived.

AlexNet won the ImageNet competition in 2012 and did it on two NVIDIA GPUs. The deep learning community noticed immediately. The financial community largely did not.

Then ChatGPT launched in November 2022, and suddenly everyone needed H100s they couldn’t get.

The parallel to Intel is instructive and also undersells how strange this kind of story looks while you’re living through it. Intel was founded in 1968 as a memory company. DRAM. The founders, Noyce and Moore, and Grove, who joined them at the start as an early employee, were materials scientists and engineers who believed the future was in silicon memory chips. They were right, briefly: in the early 1970s Intel dominated the DRAM market. By 1984, that share had collapsed to 1.3%, ceded almost entirely to Japanese manufacturers who had commoditized the product.

What saved Intel wasn’t a pivot so much as a realization that a stopgap had become a foundation. The 8086, conceived in 1976 as an internal hedge and launched in 1978, was never supposed to matter. It was a 16-bit processor designed to hold off Zilog while Intel finished its ambitious 32-bit iAPX 432 architecture. The 8086 was assigned to a single engineer. “If management had any inkling that this architecture would live on through many generations,” its designer Stephen Morse later recalled, “they never would have trusted this task to a single person.”

IBM chose the 8088, a cost-reduced variant, for the original IBM PC in 1981. That decision wasn’t destiny; it was simply a procurement. And yet from that accident of selection, Intel’s x86 line became the backbone of personal computing for four decades. The Pentium in 1993 was Intel’s Wintel moment, the flag bearer the @magicsilicon article gestures at, but the flag had been quietly sewn since 1978.

What these histories share is not just a pattern of “slow build, explosive payoff.” The structural similarity is subtler: in both cases, the moat was a software abstraction layer built on top of hardware. Intel’s real lock-in wasn’t transistor count or clock speed. It was backward compatibility — the commitment, formalized with the 80386 in 1985, that every future Intel chip would run software written for older ones. That promise created a flywheel that trapped developers and buyers in a virtuous (for Intel) dependency loop for decades.

CUDA is the same architecture at a different layer. The lock-in isn’t the H100’s 80 gigabytes of HBM3. It’s that switching to an AMD MI300X or Google TPU means potentially rewriting training pipelines that have been optimized against CUDA kernels for years. AMD’s ROCm platform exists. It is, by most accounts, maturing. Engineers who have tried the migration report that it costs months and hundreds of thousands of dollars. The moat isn’t a wall. It’s accumulated friction — the switching cost of a decade of engineering decisions baked into codebases that no one wants to touch.

But to find the actual origin of this pattern, you have to go back further than Intel. To 1964, and to a decision IBM made that Fred Brooks — its project manager — called a bet-the-business move.

The IBM System/360 was announced on April 7, 1964, after five years of turbulent internal development. What it introduced wasn’t just a new computer. It was a new concept: the separation of architecture from implementation. Before the 360, IBM ran five incompatible product lines simultaneously. A customer who outgrew their machine had to scrap all existing software and start over. The 360 replaced all five lines with a single unified architecture — six models covering a fiftyfold performance range, all running the same operating system, all sharing the same instruction set. The name itself encoded the ambition: 360 degrees, all directions, all users.

Gene Amdahl, the 360’s chief architect, had a precise formulation for what this meant: the architecture was “an interface for which software is written, independent of any implementation.” The Principles of Operation manual described what the machine did; separate Functional Characteristics documents described how each model did it. This distinction — separating the contract from the execution — was genuinely new. It’s the conceptual root of everything that came after.
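Amdahl’s distinction maps directly onto something programmers now take for granted. The sketch below is an analogy, not a model of the System/360: an abstract class plays the part of the Principles of Operation, two subclasses play two machine models, and every name in it is invented for illustration.

```python
from abc import ABC, abstractmethod


class Architecture(ABC):
    """The contract: what any conforming machine does, not how it does it."""

    @abstractmethod
    def add(self, a: int, b: int) -> int:
        ...


class SmallModel(Architecture):
    """A cheap implementation that reaches the answer by repeated increment."""

    def add(self, a, b):
        total = a
        for _ in range(b):  # assumes b is a non-negative integer
            total += 1
        return total


class LargeModel(Architecture):
    """A faster implementation of the same contract."""

    def add(self, a, b):
        return a + b


def program(machine: Architecture) -> int:
    """Software written to the interface alone; it never asks which model runs it."""
    return machine.add(2, 3)


small_result = program(SmallModel())
large_result = program(LargeModel())
```

Both runs return the same value by different means, which is the property that let a customer move up the line without rewriting anything.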

The 360 generated over $100 billion in revenue for IBM and established the first platform business model in computing. Jim Collins would later rank it alongside the Model T and the Boeing 707 as one of the three greatest business achievements of the twentieth century. But its deepest legacy was architectural: the insight that if you make your abstraction layer the standard, the hardware underneath becomes fungible. Customers didn’t buy specific IBM machines. They bought into OS/360. The machines were an implementation detail.

Intel understood this by the 1980s, even if implicitly. The 80386’s backward compatibility commitment in 1985 was IBM’s 360 insight applied to microprocessors — the architecture is the product, the silicon is the vehicle. CUDA is the same insight applied to GPU compute. What NVIDIA sold researchers in 2006 wasn’t the G80 card. It was the abstraction: write parallel code in C++, run it on any NVIDIA hardware, trust that the next generation will be faster and compatible.

The pattern is now sixty years old. It has reproduced in every major platform transition. And it keeps working for the same reason it worked in 1964: when you own the layer that developers write to, your customers’ switching costs compound every year they stay.

There’s something worth sitting with here. Neither Jensen Huang in 2006 nor Gordon Moore in 1968 could have specified exactly what the payoff would look like. What they shared was a willingness to build infrastructure for a demand they could sense but not yet see — and the discipline to keep investing in it through the long years when it looked like a research project rather than a business.

The question that doesn’t resolve cleanly is whether that kind of patience is a strategy or a personality. And whether, in an industry that now moves faster than the cycles it’s lived through, sixteen-year moats are still the kind that get built.

Which raises the uncomfortable corollary: the same AI tools that CUDA enabled may be what ultimately erodes it.

The attack on CUDA’s moat is now structurally different from anything AMD or Intel could mount before. OpenAI’s Triton compiler lets developers write GPU kernels in Python without touching CUDA at all, and generates optimized machine code that often matches hand-tuned CUDA performance. MLIR — Multi-Level Intermediate Representation, originally from Google — provides a compiler infrastructure that can target any hardware backend from a single codebase. AMD’s ROCm has historically been dismissed as immature; ROCm 7, released this year, delivers meaningfully better inference performance than its predecessors. And perhaps most directly: Claude Code reportedly ported a CUDA codebase to AMD’s ROCm in thirty minutes — work that previously took months of engineering time.

The irony is almost too neat. CUDA’s moat was built on accumulated switching costs: the friction of rewriting code, the library dependencies, the tribal knowledge encoded in a decade of kernel optimizations. AI coding tools are specifically good at exactly that kind of mechanical, high-context translation. The weapon is attacking the wall it was built behind.

That said, it’s worth being careful about the speed of this. Abstraction layers that “should” erode moats often take far longer than expected, because the moat isn’t just the code — it’s the ecosystem of tooling, documentation, community knowledge, and hardware-software co-optimization that took eighteen years to compound. Triton and MLIR are real. They’re also early. The question isn’t whether the moat is vulnerable; it’s whether it erodes before NVIDIA’s next generation of chips makes it irrelevant to argue about.

As for what comes next — which company is building the IBM 360 of this decade — the honest answer is that it’s too early to call with confidence. But there’s a candidate worth watching.

Anthropic’s Model Context Protocol, launched in late 2024, has the structural fingerprint of a platform play. MCP is a standard for how AI agents connect to external tools and data sources — a common interface layer, hardware-agnostic (or rather, model-agnostic), that any system can implement. By late 2025 it had been donated to the Linux Foundation, adopted by OpenAI and Google, and was tracking 97 million monthly SDK downloads. There are now over 10,000 MCP servers. It is becoming the way agents talk to the world.

The parallel to OS/360 is imprecise but instructive. What IBM built in 1964 was a standard interface between software and hardware that decoupled what you wrote from what you ran it on. MCP is attempting something similar one abstraction layer higher: decoupling what an agent does from the specific models, tools, and data sources it does it with. If it becomes the standard — the layer that developers write to — then whoever owns or most deeply shapes that standard controls the integration tax of an industry whose applications we can’t fully specify yet.

The counterargument is that open standards, once donated to foundations and broadly adopted, don’t generate the same lock-in as proprietary platforms. OS/360 was IBM’s. CUDA is NVIDIA’s. MCP is now the Linux Foundation’s, with OpenAI and Google as co-stewards. The historical pattern suggests the moat accrues to whoever owns the layer, not whoever invented it.

Which may mean the next great platform play is still being assembled in a room we haven’t seen yet — the way IBM’s System/360 was being architected in a Connecticut motor lodge in 1961, three years before anyone else knew what was coming.

Tags abstraction layers, AI accelerators, AI boom, AI hardware, AI Infrastructure, AI investment, AlexNet, AMD ROCm, Blackwell, chatgpt, chip architecture, computing platforms, CUDA, cuDNN, Deep Learning, developer ecosystems, emerging technology, foundation models, Fred Brooks, generative ai, Gordon Moore, GPU computing, H100, hardware abstraction, Hopper architecture, IBM mainframe, IBM System/360, Inference, Intel, jensen huang, Linux Foundation, MCP, MLIR, Model Context Protocol, nvidia, open standards, OpenAI Triton, parallel computing, PC revolution, physical AI, platform strategy, PyTorch, Scott Loftesness, semiconductor history, software ecosystems, switching costs, Tech History, tech strategy, technology moats, TensorFlow, Venture Capital, x86
