Sarah Guo's mid-2026 investor thesis on AI value: why sustainable advantage lives in work whose correctness is private, expensive to establish, and walled off from benchmark-driven absorption.
The central argument of The Untrainable is that the AI industry's despair, the belief that only compute and frontier weights hold value, is half right but wrong about what remains. Intelligence keeps getting cheaper, and value keeps sliding toward the few places a model cannot reach.
Sarah Guo identifies a structural dynamic she calls the absorption frontier: the boundary where measurable work gets pulled into model weights. What is legible (benchmarkable, trainable) gets commoditized. What survives is illegible work: work whose correctness is private, grounded in history, and established inside organizations through trust and relationships.
"The most cited benchmark score of the year is a map of territory about to be worthless, and a notice of who is about to lose the right to say what counts as good."
Guo proposes evaluating any knowledge work domain along two axes:
Legible work: work that can be measured, benchmarked, and trained against. Anything on a leaderboard is already on its way to commodity. It is eaten from two directions: task saturation from below and model absorption from above.
Illegible work: work whose correctness is private and expensive to establish. This is the valuable work that remains: integration, domain translation, and organizational change, the slow work that runs as long as the relationship does.
Private correctness is the article's central concept. It means truth that exists only inside a specific firm's data and context — the kind that cannot be read off a leaderboard. You find out whether a complex system works by running it in the world long enough to learn. As Noam Brown of OpenAI noted, evaluating an agent over a one-year horizon may require running it for a year. The clock cannot be skipped.
Software: coding benchmarks such as SWE-Bench Verified provide free verifiers, and agents now score in the high 80s. But writing code is not the same as shipping it. Mert Demirer's MIT study found that coding agents increased code written by 180% but code shipped by only about 30%.
Law: a white-shoe firm runs roughly 1,000 deals per year. Confidentiality prevents general agents from processing client files. Deal-level signal (NDA, term sheet, diligence) is specific to a practice area and does not transfer.
Medicine: OpenEvidence is used by a majority of American doctors daily. No amount of compute buys that. Trust is built through relationships and user acquiescence, not gradient descent.
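The gap between code written and code shipped in the software example can be made concrete with a little arithmetic. The sketch below assumes that a 180% lift means output rose to 2.8 times its baseline and that a 30% lift means 1.3 times; the function name is illustrative and does not come from the study.

```python
def ship_ratio_change(written_lift: float, shipped_lift: float) -> float:
    """Return the new shipped-to-written ratio relative to the baseline ratio.

    Lifts are fractional increases over baseline, e.g. 1.80 for +180%.
    """
    written_multiplier = 1.0 + written_lift
    shipped_multiplier = 1.0 + shipped_lift
    return shipped_multiplier / written_multiplier


# Figures quoted above: +180% code written, roughly +30% code shipped.
relative_ratio = ship_ratio_change(written_lift=1.80, shipped_lift=0.30)
print(f"Shipped share of written code vs. baseline: {relative_ratio:.2f}x")
# prints "Shipped share of written code vs. baseline: 0.46x"
```

Under those assumptions, the fraction of written code that actually ships falls to a little under half of its earlier level, which is one way to read the claim that writing code is not the same as shipping it.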
The absorption frontier describes how the labs are trying to get models to swallow their own scaffolding. The retrieval, routing, tool use, and reasoning policy that used to wrap the model are being pulled into the weights, until the wrapper is the model. Margin pressure cuts both ways: a general agent must be ready for anything, which is expensive, while a focused application tunes one workflow until it runs on a fraction of the token spend and keeps the difference.
"What's legible is what's leaving. The valuable work is illegible by construction: anything you can put on a leaderboard, you can train against, so anything measurable is already on its way to commodity."
Thin wrappers — shallow applications over foundation models — are being absorbed. But absorption is not total. On a narrow task with private data and custom evals, a focused company can train to the frontier and beat the general models where it counts. That specialized model becomes part of the moat.
Guo argues that intelligence is not the bottleneck. Permission and accountability are. The door to private-correctness domains has two barriers:
The lock (environment): security review, integration, contract, and liability. A better model does not hold the license, sign off on liability, own the firm's files, or get sued when the answer is wrong. A bank's production systems are neither portable nor standardized. You do not get root by being clever on SWE-Bench.
The deadbolt (user): trust is built slowly on relationships. A majority of American doctors open OpenEvidence every day, and no amount of compute buys that. A lab can train a flawless medical model tomorrow and still have no way into the physician's habit or the decision flow of UCSF.
Matt MacInnis of Rippling provides the value framework: a token spent answering a generic question is worth almost nothing (anyone's model can answer it), while a token reasoning over a company's specific data is worth much more (it does the thing you actually want).
Outcome-based pricing is the natural mechanism for this: Sierra charges when its agent resolves a customer issue, nothing when it escalates. Cognition's Devin offers a performance guarantee in software. You can only price outcomes in a system you are trusted inside.
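As a rough illustration of outcome-based pricing, the sketch below bills only for conversations the agent resolves and charges nothing for escalations, in the manner described for Sierra. The `Conversation` type, the outcome labels, and the price per resolution are hypothetical; the source gives no actual price list for Sierra or Cognition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    outcome: str  # assumed labels: "resolved" or "escalated"


def outcome_based_invoice(
    conversations: Iterable[Conversation],
    price_per_resolution: float,
) -> float:
    """Charge a flat fee per resolved conversation; escalations cost nothing."""
    resolved_count = sum(1 for c in conversations if c.outcome == "resolved")
    return resolved_count * price_per_resolution


# Hypothetical log: two resolutions and one escalation at an assumed 1.50 each.
log = [
    Conversation("c-1", "resolved"),
    Conversation("c-2", "escalated"),
    Conversation("c-3", "resolved"),
]
invoice = outcome_based_invoice(log, price_per_resolution=1.50)
print(invoice)  # prints 3.0
```

The whole invoice depends on who assigns the `outcome` label, which is why this kind of pricing only works for a provider trusted inside the customer's system.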
"A company that brings the translation is tough to copy — and the translation never ends. Integration and maintenance run as long as the relationship does."
Guo directly addresses the objection that labs will run first-party products below cost to bleed out application-layer companies. Her rebuttal: this only works if the model layer is a single-player game. It is not — it looks like a three-and-a-half-way death match with international players six months behind and a development league 5x the size it was last year.
In consumer chat, the best model has never simply won. ChatGPT is losing share to Gemini on the strength of Android and Search, not a better model. Anthropic, rated as having the best model by prediction markets, is barely a factor in consumer chat. If the frontier remains crowded, the layer above will be valuable.
The Innovator's Dilemma applies: incumbents keep the ground they have, and the next thing comes from someone who finds a use before the rest. Competing on the general model is a capital war you lose to whoever owns the most compute.
The untrainable is value with history: work whose correctness cannot be measured by any public benchmark because it exists only inside a specific firm's data, relationships, and operational reality. A better model does not make private ground truth public, hold the license, sign off on liability, or become the party that is sued when the answer is wrong.
The absorption frontier is where measurable work gets pulled into model weights. The retrieval, routing, tool use, and reasoning policy that used to wrap a model are being absorbed into the weights, until the wrapper is the model. The frontier keeps rising as we learn to measure more work.
Legible work can be measured, benchmarked, and trained against — it is on its way to commodity. Illegible work has private correctness that is expensive to establish: integration that runs as long as the relationship does, the slow rebuild of an engineering organization, the judgment of what good means in a specific field.
Private correctness is truth that exists only inside a specific firm's data, context, and operations. It cannot be read off a leaderboard. As Noam Brown noted, evaluating an agent over a one-year horizon may require running it for a year. The clock cannot be skipped.
Tokens spent answering generic questions that any model can answer are worth almost nothing. Tokens reasoning over company-specific data are worth much more — they do the thing the company actually wants, not just the plausible thing.
Most companies built on a foundation model are shallow application layers waiting to be absorbed as the model incorporates scaffolding into its own weights. A lot of what looks like a company today is a thin wrapper.
The price is the evaluation: Sierra charges when its agent resolves an issue, nothing when it escalates. Devin offers a performance guarantee. This is only possible inside a system in which the provider is trusted. Whoever defines "resolved" controls the standard.
Two barriers: the lock (environment) and the deadbolt (user). A model smarter than any person still has to be let in the door, and someone must put their name on what it does. Intelligence is not the bottleneck; permission and accountability are.
The best model has never simply won consumer chat. ChatGPT is losing share to Gemini on Android and Search distribution, not model quality. Anthropic is barely a factor in consumer chat. The public chooses on more than coding today.
Incumbents keep the ground they have; the next thing comes from someone who finds a use before the rest. Competing on the general model is a capital war won by whoever owns the most compute. The trap is committing to out-train the frontier in general tasks.
Integration and maintenance run as long as the relationship does, won by teams that put domain-specialized engineers next to the customer. The translation never ends. It takes an operator to moneyball a firm, with ambiguous goals and incomplete feedback over very long horizons.
Someone on the inside must decide what good means. That decision is the whole game. Harvey publishes a legal benchmark. Sierra publishes one for voice. You earn the right to define good by being the one the field already uses. The senior lawyer writes the legal benchmark. Defining a safe clinical answer falls to a physician.
Seven steps, derived from Sarah Guo's thesis, for building AI companies that resist commoditization by the absorption frontier:
1. Ask two questions: (1) Is its correctness private and expensive to establish, existing only inside someone's data? (2) Is it walled off inside a system you cannot get into? If yes to both, the work is in the untrainable corner. If correctness is public and the task is saturated, you are competing on commodity tokens.
2. The lock is the environment (security review, integration, contract, liability). The deadbolt is the user (habit, trust, acquiescence). You cannot verify whether AI is useful until you are trusted inside. Trust is built slowly on relationships, not gradient descent.
3. An application earns its place by arranging a company's private reality so a model can act on it. The translation never ends. Integration and maintenance run as long as the relationship does.
4. Get in and price the outcome. Sierra charges when its agent resolves an issue and nothing when it escalates. Devin offers a performance guarantee. When the price is the evaluation, it works only because you own the definition of good from inside the trusted system.
5. If work cannot be scored from outside, someone inside must decide what good means. Enough decisions written down become a benchmark. The senior lawyer writes the legal benchmark. The physician defines a safe clinical answer.
6. The frontier keeps rising. You cannot find a defensible spot and rest. Keep stepping toward whatever cannot yet be scored. On a narrow task with private data and your own evals, you can beat the general models where it counts.
7. Even harder than defense is offense: choosing what to build. The model cannot tell you what is worth pointing it at. You cannot benchmark that, so you cannot train it. Intent is an even scarcer input than compute.
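The two-question screen from the first step can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the `WorkDomain` type, its field names, and the "unclear" fallback for cases the thesis does not address are all illustrative, and the two example domains are classified by hand from the law and coding examples in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkDomain:
    name: str
    correctness_is_private: bool  # question 1: private and expensive to establish?
    walled_off: bool              # question 2: inside a system you cannot get into?
    task_saturated: bool          # are public benchmarks for the task saturated?


def screen(domain: WorkDomain) -> str:
    """Apply the two-question screen; the fallback label is an assumption."""
    if domain.correctness_is_private and domain.walled_off:
        return "untrainable corner"
    if not domain.correctness_is_private and domain.task_saturated:
        return "commodity tokens"
    return "unclear: the screen does not decide this case"


# Hand-labeled, illustrative inputs.
deal_work = WorkDomain("law-firm deal work", True, True, False)
benchmark_coding = WorkDomain("public benchmark coding tasks", False, False, True)

print(screen(deal_work))         # prints "untrainable corner"
print(screen(benchmark_coding))  # prints "commodity tokens"
```

The judgment sits in filling in the booleans, not in the function: deciding whether correctness is really private is itself work done from inside the domain.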
The SPARQL queries below run against the thesis analysis graph. Each query targets the named graph <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>.

# List every concept in the graph with its definition.
PREFIX schema: <http://schema.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?definition
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?concept a skos:Concept ;
           skos:definition ?definition .
}
ORDER BY ?concept

# List the people and organizations mentioned, labeled by type.
PREFIX schema: <http://schema.org/>
SELECT ?entity ?type ?name
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  { ?entity a schema:Person ; schema:name ?name . }
  UNION
  { ?entity a schema:Organization ; schema:name ?name . }
  BIND (IF(EXISTS { ?entity a schema:Person . }, "Person", "Organization") AS ?type)
}
ORDER BY ?type ?name

# List each question with the text of its accepted answer.
PREFIX schema: <http://schema.org/>
SELECT ?question ?answer
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?q a schema:Question ;
     schema:name ?question ;
     schema:acceptedAnswer ?a .
  ?a schema:text ?answer .
}
ORDER BY ?question

# List work domains with their correctness type and, where recorded, saturation level.
PREFIX schema: <http://schema.org/>
PREFIX thesis: <https://saranormous.substack.com/p/the-untrainable%23>
SELECT ?domain ?name ?correctnessType ?saturation
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?domain a thesis:WorkDomain ;
          schema:name ?name ;
          thesis:hasCorrectnessType ?correctnessType .
  OPTIONAL { ?domain thesis:hasSaturationLevel ?saturation . }
}
ORDER BY ?name

# Count instances per entity type and show one sample entity for each type.
PREFIX schema: <http://schema.org/>
SELECT (SAMPLE(?s) AS ?EntityID) (COUNT(*) AS ?count) (?o AS ?EntityTypeID)
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?s a ?o .
}
GROUP BY ?o
ORDER BY DESC(?count)
LIMIT 50
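
For readers who want to run these queries outside the page, the following is a minimal sketch of sending one over the standard SPARQL Protocol (an HTTP GET with a `query` parameter) using only the Python standard library. The endpoint URL is an assumption based on the URIBurner host named above, and nothing here has been checked against the live service; `build_request` and `run_query` are illustrative helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

# Assumption: URIBurner exposes a SPARQL Protocol endpoint at this path.
ENDPOINT = "https://linkeddata.uriburner.com/sparql"

# The type-count query from above, collapsed onto fewer lines.
TYPE_COUNT_QUERY = """
SELECT (SAMPLE(?s) AS ?EntityID) (COUNT(*) AS ?count) (?o AS ?EntityTypeID)
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE { ?s a ?o . }
GROUP BY ?o
ORDER BY DESC(?count)
LIMIT 50
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query: str, endpoint: str = ENDPOINT) -> List[Dict[str, Any]]:
    """Send the query and return the result bindings (performs a network call)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(TYPE_COUNT_QUERY)` should return one binding per entity type if the endpoint is reachable and accepts anonymous queries; each binding maps a variable name such as `count` to a dictionary with `type` and `value` keys, as defined by the SPARQL JSON results format.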