The Untrainable: Thesis Analysis

Thesis Abstract

The central argument of The Untrainable is that the AI industry's despair (the belief that only compute and frontier weights hold value) is half right, but wrong about what is left over. Intelligence keeps getting cheaper, and value keeps sliding toward the few places a model cannot reach.

Sarah Guo identifies a structural dynamic she calls the absorption frontier: the boundary where measurable work gets pulled into model weights. What is legible (benchmarkable, trainable) gets commoditized. What survives is illegible work: work whose correctness is private, grounded in history, and established inside organizations through trust and relationships.

"The most cited benchmark score of the year is a map of territory about to be worthless, and a notice of who is about to lose the right to say what counts as good."

The Thesis Framework

Guo proposes evaluating any knowledge work domain along two axes:

Public Correctness: Legible Work

Work that can be measured, benchmarked, and trained against. Anything on a leaderboard is already on its way to commodity. It is being eaten from two directions: task saturation from below and model absorption from above.

Private Correctness: Illegible Work

Work whose correctness is private and expensive to establish. This is the valuable work that remains: integration, domain translation, and organizational change, the slow work that runs as long as the relationship does.

Key Insight: Private Correctness

Private correctness is the article's central concept. It means truth that exists only inside a specific firm's data and context — the kind that cannot be read off a leaderboard. You find out whether a complex system works by running it in the world long enough to learn. As Noam Brown of OpenAI noted, evaluating an agent over a one-year horizon may require running it for a year. The clock cannot be skipped.

High Saturation: Software Engineering

Coding benchmarks such as SWE-Bench Verified provide free verifiers, and agents now score in the high 80s. But writing code is not the same as shipping it. Mert Demirer's MIT study found that coding agents lifted code written by 180% but shipped code by only ~30%.

Medium Saturation: Legal Services (M&A)

A white-shoe firm runs ~1,000 deals/year. Confidentiality prevents general agents from processing client files. Deal-level signal (NDA, term sheet, diligence) is practice-area specific and does not transfer.

Low Saturation: Clinical Medicine

OpenEvidence is used by a majority of American doctors daily. No amount of compute buys that. Trust is built through relationships and user acquiescence, not gradient descent.
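The gap between written and shipped code in the software engineering example can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes both lifts are percentage increases over the same pre-agent baseline (the summary does not state how the study defines its baseline), and the function name is hypothetical.

```python
def shipped_per_written(written_lift_pct: float, shipped_lift_pct: float) -> float:
    """Return shipped code per unit of written code, relative to the baseline.

    Assumption: both lifts are percentage increases over the same baseline.
    """
    written_multiplier = 1 + written_lift_pct / 100  # a 180% lift means 2.8x
    shipped_multiplier = 1 + shipped_lift_pct / 100  # a 30% lift means 1.3x
    return shipped_multiplier / written_multiplier


ratio = shipped_per_written(180, 30)
print(f"{ratio:.2f}")  # prints 0.46
```

Under that assumption, each unit of written code yields a bit less than half as much shipped code as before, which is one way to read the claim that writing is not the same as shipping.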

The Absorption Frontier

The absorption frontier describes how the labs are trying to get models to swallow their own scaffolding — the retrieval, routing, tool use, and reasoning policy that used to wrap the model are being pulled into the weights, until the wrapper is the model. Margin pressure cuts both ways: a general agent must be ready for anything (expensive), while a focused application tunes one workflow until it runs on a fraction of the token spend and keeps the difference.

"What's legible is what's leaving. The valuable work is illegible by construction: anything you can put on a leaderboard, you can train against, so anything measurable is already on its way to commodity."

Thin wrappers — shallow applications over foundation models — are being absorbed. But absorption is not total. On a narrow task with private data and custom evals, a focused company can train to the frontier and beat the general models where it counts. That specialized model becomes part of the moat.

Permission and Accountability

Guo argues that intelligence is not the bottleneck. Permission and accountability are. The door to private-correctness domains has two barriers:

The Lock: Environment

Security review, integration, contract, liability. A better model does not hold the license, sign off on liability, own the firm's files, or get sued when the answer is wrong. A bank's production systems are neither portable nor standardized. You do not get root by being clever on SWE-Bench.

The Deadbolt: User

Trust is built slowly on relationships. A majority of American doctors open OpenEvidence every day, and no amount of compute buys that. A lab can train a flawless medical model tomorrow and still have no way into the physician's habit or the decision flow of UCSF.

The Economics of Value

Matt MacInnis of Rippling provides the value framework: a token spent answering a generic question is worth almost nothing (anyone's model can answer it), while a token reasoning over a company's specific data is worth much more (it does the thing you actually want).

Outcome-based pricing is the natural mechanism for this: Sierra charges when its agent resolves a customer issue, nothing when it escalates. Cognition's Devin offers a performance guarantee in software. You can only price outcomes in a system you are trusted inside.
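The pricing rule described above (charge on resolution, nothing on escalation) can be sketched in a few lines. This is a minimal illustration, not Sierra's actual billing logic: the `Conversation` type, the `outcome_invoice` function, and the price of 2.0 per resolution are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    resolved: bool  # True if the agent closed the issue, False if it escalated


def outcome_invoice(conversations: list[Conversation], price_per_resolution: float) -> float:
    """Bill only for resolved conversations; escalations are free to the customer."""
    resolved_count = sum(1 for c in conversations if c.resolved)
    return price_per_resolution * resolved_count


# Hypothetical month: two resolutions and one escalation at an assumed price of 2.0.
conversations = [Conversation(True), Conversation(False), Conversation(True)]
print(outcome_invoice(conversations, price_per_resolution=2.0))  # prints 4.0
```

The invoice depends entirely on the `resolved` flag, so whoever sets that flag effectively defines the standard. That is why this kind of pricing only works for a provider trusted inside the system.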

"A company that brings the translation is tough to copy — and the translation never ends. Integration and maintenance run as long as the relationship does."

Competitive Dynamics

Guo directly addresses the objection that labs will run first-party products below cost to bleed out application-layer companies. Her rebuttal: this only works if the model layer is a single-player game. It is not — it looks like a three-and-a-half-way death match with international players six months behind and a development league 5x the size it was last year.

In consumer chat, the best model has never simply won. ChatGPT is losing share to Gemini on the strength of Android and Search, not a better model. Anthropic, rated as having the best model by prediction markets, is barely a factor in consumer chat. If the frontier remains crowded, the layer above will be valuable.

The Innovator's Dilemma applies: incumbents keep the ground they have, and the next thing comes from someone who finds a use before the rest. Competing on the general model is a capital war you lose to whoever owns the most compute.

Frequently Asked Questions

What is "the untrainable" in Sarah Guo's thesis?

The untrainable is value with history: work whose correctness cannot be measured by any public benchmark because it exists only inside a specific firm's data, relationships, and operational reality. A better model does not make private ground truth public, hold the license, sign off on liability, or get sued when the answer is wrong.

What is the absorption frontier?

The absorption frontier is where measurable work gets pulled into model weights. The retrieval, routing, tool use, and reasoning policy that used to wrap a model are being absorbed into the weights, until the wrapper is the model. The frontier keeps rising as we learn to measure more work.

How does legible work differ from illegible work?

Legible work can be measured, benchmarked, and trained against — it is on its way to commodity. Illegible work has private correctness that is expensive to establish: integration that runs as long as the relationship does, the slow rebuild of an engineering organization, the judgment of what good means in a specific field.

What does "private correctness" mean?

Private correctness is truth that exists only inside a specific firm's data, context, and operations. It cannot be read off a leaderboard. As Noam Brown noted, evaluating an agent over a one-year horizon may require running it for a year. The clock cannot be skipped.

What are commodity tokens?

Tokens spent answering generic questions that any model can answer are worth almost nothing. Tokens reasoning over company-specific data are worth much more — they do the thing the company actually wants, not just the plausible thing.

What is the thin wrapper problem?

Most companies built on a foundation model are shallow application layers waiting to be absorbed as the model incorporates scaffolding into its own weights. A lot of what looks like a company today is a thin wrapper.

How does outcome-based pricing work and why is it defensible?

The price is the evaluation: Sierra charges when its agent resolves an issue and nothing when it escalates. Devin offers a performance guarantee. This is only possible inside a system in which the provider is trusted. Whoever defines "resolved" controls the standard.

What is the permission moat?

Two barriers: the lock (environment) and the deadbolt (user). A model smarter than any person still has to be let in the door, and someone must put their name on what it does. Intelligence is not the bottleneck; permission and accountability are.

Why can't better models alone win markets?

The best model has never simply won consumer chat. ChatGPT is losing share to Gemini because of Android and Search distribution, not model quality. Anthropic is barely a factor in consumer chat. Today the public chooses on more than coding.

How does the Innovator's Dilemma apply to AI companies?

Incumbents keep the ground they have; the next thing comes from someone who finds a use before the rest. Competing on the general model is a capital war won by whoever owns the most compute. The trap is committing to out-train the frontier on general tasks.

What makes integration and maintenance defensible?

Integration and maintenance run as long as the relationship does, won by teams that put domain-specialized engineers next to the customer. The translation never ends. It takes an operator to moneyball a firm, with ambiguous goals and incomplete feedback over very long horizons.

What is the role of judgment in defining "good" work?

Someone on the inside must decide what good means. That decision is the whole game. Harvey publishes a legal benchmark. Sierra publishes one for voice. You earn the right to define good by being the one the field already uses. The senior lawyer writes the legal benchmark. Defining a safe clinical answer falls to a physician.

Glossary of Key Terms

The Untrainable: Value that cannot be captured by AI benchmarks because its correctness is private, grounded in history, and established through relationships inside organizations.
Absorption Frontier: The boundary where measurable work gets pulled into model weights as labs incorporate retrieval, routing, tool use, and reasoning policy directly into the model.
Legible Work: Work that can be measured, benchmarked, and trained against. Being eaten from two directions: task saturation from below and model absorption from above.
Illegible Work: Work whose correctness is private and expensive to establish. The valuable work that survives after legible tasks are commoditized.
Private Correctness: Truth that exists only inside a specific firm's data and operational context. Cannot be verified by public benchmarks; requires time and trust to establish.
Permission Moat: The double barrier of environment (security, integration, liability) and user (habit, trust, acquiescence) that protects private-correctness domains from AI absorption.
Outcome-Based Pricing: A pricing model where the fee is the evaluation: charge only when the AI agent produces a verified good outcome. Only possible inside a trusted system.
Thin Wrapper: A shallow application over a foundation model that is easily absorbed as the model incorporates the wrapper's functionality into its own weights.
Commodity Token: A token spent answering a generic question that any model can answer. Worth near zero because the buyer can substitute any provider.
Vertical Mapping: The strategic process of mapping knowledge work domains by correctness type (public vs private) and saturation level to identify where defensible AI value can be built.

Strategic Framework: How to Build Defensible AI Value

The following seven steps, derived from Sarah Guo's thesis, describe how to build AI companies that resist commoditization by the absorption frontier.

Identify work whose correctness is private and expensive to establish

Ask two questions: (1) Is its correctness private and expensive to establish, existing only inside someone's data? (2) Is it walled off inside a system you cannot get into? If yes to both, the work is in the untrainable corner. If correctness is public and the task is saturated, you are competing on commodity tokens.

Get inside a domain with high permission barriers

The lock is the environment (security review, integration, contract, liability). The deadbolt is the user (habit, trust, acquiescence). You cannot verify whether AI is useful until you are trusted inside. Trust is built slowly on relationships, not gradient descent.

Build the translation layer between private reality and model action

An application earns its place by arranging a company's private reality so a model can act on it. The translation never ends. Integration and maintenance run as long as the relationship does.

Price outcomes, not tools or access

Get in and price the outcome. Sierra charges when its agent resolves an issue and nothing when it escalates. Devin offers a performance guarantee. When the price is the evaluation, it works only because you own the definition of good from inside the trusted system.

Earn the right to define what "good" means in your field

If work cannot be scored from outside, someone inside must decide what good means. Enough decisions written down become a benchmark. The senior lawyer writes the legal benchmark. The physician defines a safe clinical answer.

Continuously re-underwrite defensibility as the absorption frontier rises

The frontier keeps rising. You cannot find a defensible spot and rest. Keep stepping toward whatever cannot yet be scored. On a narrow task with private data and your own evals, you can beat the general models where it counts.

Choose what to build through judgment no model can replicate

Even harder than defense is offense: choosing what to build. The model cannot tell you what is worth pointing it at. You cannot benchmark that, so you cannot train it. Intent is an even scarcer input than compute.
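The two questions in the first step can be read as a small decision rule. The sketch below is one hedged interpretation: the function name `classify_work` and the `"unclear"` fallback are illustrative assumptions, since the text names only the two corner cases and says nothing about the combinations in between.

```python
def classify_work(correctness_is_private: bool, walled_off: bool, saturated: bool) -> str:
    """Place a piece of work using the two questions from the first step.

    Only the two corners named in the text are labeled; every other
    combination falls through to "unclear" (an assumption of this sketch).
    """
    if correctness_is_private and walled_off:
        return "untrainable corner"
    if not correctness_is_private and saturated:
        return "commodity tokens"
    return "unclear"


print(classify_work(correctness_is_private=True, walled_off=True, saturated=False))
print(classify_work(correctness_is_private=False, walled_off=False, saturated=True))
```

The first call prints `untrainable corner` and the second prints `commodity tokens`. Because the frontier keeps rising, the inputs would need to be re-evaluated over time rather than set once.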

Explore the Knowledge Graph Using SPARQL

Run SPARQL queries against the thesis analysis graph. Each query targets the named graph <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>.

All Thesis Concepts and Definitions

PREFIX schema: <http://schema.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?definition
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?concept a skos:Concept ;
           skos:definition ?definition .
}
ORDER BY ?concept

All People and Organizations Referenced

PREFIX schema: <http://schema.org/>

SELECT ?entity ?type ?name
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  { ?entity a schema:Person ; schema:name ?name . }
  UNION
  { ?entity a schema:Organization ; schema:name ?name . }
  BIND (IF(EXISTS { ?entity a schema:Person . }, "Person", "Organization") AS ?type)
}
ORDER BY ?type ?name

FAQ Questions and Answers

PREFIX schema: <http://schema.org/>

SELECT ?question ?answer
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?q a schema:Question ;
     schema:name ?question ;
     schema:acceptedAnswer ?a .
  ?a schema:text ?answer .
}
ORDER BY ?question

Work Domains with Correctness Types

PREFIX schema: <http://schema.org/>
PREFIX thesis: <https://saranormous.substack.com/p/the-untrainable%23>

SELECT ?domain ?name ?correctnessType ?saturation
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?domain a thesis:WorkDomain ;
          schema:name ?name ;
          thesis:hasCorrectnessType ?correctnessType .
  OPTIONAL { ?domain thesis:hasSaturationLevel ?saturation . }
}
ORDER BY ?name

Entity Type Distribution (Sample)

PREFIX schema: <http://schema.org/>

SELECT (SAMPLE(?s) AS ?EntityID) (COUNT(*) AS ?count) (?o AS ?EntityTypeID)
FROM <https://linkeddata.uriburner.com/dav/home/demo/docs/the-untrainable-thesis-analysis-big_pickle-1.ttl>
WHERE {
  ?s a ?o .
}
GROUP BY ?o
ORDER BY DESC(?count)
LIMIT 50
