Thesis Analysis · Knowledge Graph

The Untrainable

Sarah Guo's investment thesis: durable AI value concentrates where no benchmark can reach — in private ground truth, domain trust, and illegible correctness.

✍️ Sarah Guo 📅 June 10, 2026 🔗 Original Post 📊 193 RDF Triples

The Untrainable Thesis

Durable AI value concentrates in work whose correctness is private, expensive to verify externally, and locked inside systems requiring trust and permission — the "untrainable corner" no benchmark or general model can colonize. The despair that nothing is investable is half right: thin wrappers are being absorbed. It is wrong about what survives.

+180% code written
by AI agents
+30% code actually
shipped
13%→88% SWE-Bench
2024→2026
~1yr true agent
eval horizon

Core Claims

Five interlocking arguments that underpin the thesis

1Benchmarks are self-defeating

A thing you can measure is a thing you can train against. Once work is checkable cheaply, it commoditizes toward open models at lowest cost. The measurable is always on its way to commodity.

2Intelligence is not the bottleneck

Permission and accountability are. A model smarter than any person still needs to be let in the door. Capability does not confer license, liability, or ownership of private data. Trust is built on relationships, not gradient descent.

3The absorption frontier keeps rising

Labs pull scaffolding (retrieval, routing, tool use, reasoning policy) into weights, narrowing the app layer. The untrainable ground shrinks — companies must continuously step toward what can't yet be scored.

4Thin wrappers are being absorbed

Companies built on frontier models with no proprietary data or trust integration face severe absorption risk. The despair about AI investability is "half right" on this point.

5Choosing what to build is harder than defense

The model cannot tell you what is worth building. Identifying unaddressed use cases before the market is a uniquely human judgment — intent may be an even scarcer input than compute.

The Saturated × Public/Private 2×2

Where value lives in the AI stack — derived directly from the article's strategic framework

Public Correctness
Private Correctness
Saturated Tasks
🪙 Commodity Tokens

Open models at lowest cost. Buyers stop asking which model and start asking the price. Margins collapse.

🛡️ Partial Moat

Private data helps; task saturation limits upside. Some durability but not the prize.

Frontier Work
🏛️ Lab Territory

Free eval → owning it counts for nothing. Coding benchmarks live here. Labs win by grinding the check.

🏆 The Untrainable Corner

Frontier work + private correctness. Custom models on private data. The only quadrant with durable independent value.

Supporting Evidence

Data, observations, and cases cited in the article

🔬

MIT Study — 100K+ Developers

Mert Demirer et al. found coding agents lifted code written by ~180% but code actually shipped by only ~30%. Writing got cheap; shipping still runs through a person.

📊

SWE-Bench Trajectory

Devin launched in 2024 at 13%. By mid-2026, best agents hit the high 80s. The benchmark was conquered — but the "wrong lesson" was drawn. Engineering resists full measurement.

💬

Noam Brown on Evaluation Horizon

OpenAI's Noam Brown wrote that the only sure way to evaluate an agent over a one-year horizon may be to run it for a year. The clock cannot be skipped.

🏥

OpenEvidence Physician Adoption

A majority of American doctors now open OpenEvidence daily. No amount of compute buys physician habit or UCSF decision-flow integration. Trust is built slowly through relationships.

How to Build in the Untrainable Corner

Strategic playbook derived from the article's thesis

1

Get inside one

Do the unglamorous work: pass the security review, sign the integration contract, become the party with your name on outcomes. Permission precedes capability.

2

Arrange the customer's private reality

Connect the model to the firm's private data, internal tools, and workflows. This translation is hard to copy and never ends.

3

Write down what good means

Accumulate in-domain judgments. Companies owning the definition of "resolved" or authoring the legal benchmark earn standing no frontier lab can buy.

4

Price outcomes, not inputs

Charge for resolution, not tokens. Outcome pricing is only credible if you own the definition of success and are trusted inside the system.

5

Avoid the general-model capital war

Train specialized models on your private data and evals for narrow tasks. Out-training frontier labs on general tasks usually ends in a sale to someone compute-rich.

6

Keep stepping toward the not-yet-scored

The untrainable ground shrinks continuously. Don't find a defensible spot and rest. Re-underwrite constantly, moving toward what cannot yet be benchmarked.

Companies & People in the Argument

Entities cited as case studies or evidence

Sierra

Charges only when its agent resolves a customer issue. Outcome pricing works because Sierra owns the definition of "resolved" — a concrete example of untrainable-corner moat.

Harvey

Publishes a benchmark for law — earning the right to define what "good" means in legal work through real adoption, not lab authority.

Cognition (Devin)

Offers a "performance guarantee" on coding output — only possible when trusted inside the customer's system. The guarantee signals depth of integration.

OpenEvidence

Used daily by a majority of American doctors. Physician habit and hospital decision-flow integration cannot be purchased with compute.

Rippling / Matt MacInnis

Source of the token value gradient insight: generic-question tokens are near-worthless; proprietary-data-reasoning tokens carry real value.

Gabe Pereyra

Real automation requires product, model, workflow, and firm moving together. Three of those four move at the speed of an organization — no benchmark captures this.

FAQ

Key analytical questions from the thesis

The despair correctly identifies that thin wrappers over frontier models face absorption. It is wrong about what survives: private ground truth, domain trust, and illegible correctness create durable value the model layer cannot reach.
ChatGPT held its lead through years of real competition not on model quality alone. Gemini is gaining share via Android and Search distribution, not a better model. Trust, habit, and distribution are non-model advantages the public weighs heavily.
A top white-shoe M&A practice runs ~1,000 deals per year with practice-area-specific deal shapes (NDA → term sheet → diligence → purchase agreement) and confidentiality constraints. Transformation requires ambiguous intermediate goals, incomplete feedback, and a multi-year horizon — no single eval can capture it.
The absorption frontier is the expanding boundary at which model labs pull application scaffolding — retrieval, routing, tool use, reasoning policy — into base weights. It rises because more measurable work is continuously benchmarked and trained against, narrowing the app layer unless companies move into private, illegible territory.

Glossary of Key Concepts

Defined terms and their semantic cross-references

Frontier work whose correctness exists only in private data — the 2×2 quadrant where tasks are not saturated and answers are not publicly verifiable, making it resistant to general model commoditization. See also: Proprietary information (DBpedia)
The expanding boundary at which model labs pull application scaffolding (retrieval, routing, tool use, reasoning) into base model weights, eroding the margin of app-layer companies.
A form of truth that cannot be read off any leaderboard and is only discovered by running a complex system in the world over time. See also: Tacit knowledge (DBpedia)
Correctness that exists only inside a firm's data, workflows, and accumulated judgments. Neither trainable from public data nor verifiable by external benchmarks. See also: Ground truth (DBpedia)
A durable barrier protecting a business from rivals. In the AI context, moats come from private data and trust relationships, not model quality. See also: Economic moat (DBpedia)
A check that costs nothing to run — a compiler, a test suite — enabling rapid grinding against a benchmark. Its absence is what makes a task illegible and resistant to training.
A technique to produce a smaller, cheaper model from a larger one. Enables focused applications to run a workflow on a fraction of the token spend of a general agent. See also: Knowledge distillation (DBpedia)

Knowledge Graph Explorer

Interactive visualization of 193 RDF triples derived from the article. Click SVG to arm zoom, drag nodes to reposition, double-click to unpin.

Nodes: — · Links: —
Loading graph…

SPARQL Workbench

Entity Type Summary Query

▶ Run Live Query