The Untrainable Thesis
Durable AI value concentrates in work whose correctness is private, expensive to verify externally, and locked inside systems requiring trust and permission — the "untrainable corner" no benchmark or general model can colonize. The despair that nothing is investable is half right: thin wrappers are being absorbed. It is wrong about what survives.
Core Claims
Five interlocking arguments that underpin the thesis
1. Benchmarks are self-defeating
A thing you can measure is a thing you can train against. Once work is checkable cheaply, it commoditizes toward open models at lowest cost. The measurable is always on its way to commodity.
2. Intelligence is not the bottleneck
Permission and accountability are. A model smarter than any person still needs to be let in the door. Capability does not confer license, liability, or ownership of private data. Trust is built on relationships, not gradient descent.
3. The absorption frontier keeps rising
Labs pull scaffolding (retrieval, routing, tool use, reasoning policy) into weights, narrowing the app layer. The untrainable ground shrinks — companies must continuously step toward what can't yet be scored.
4. Thin wrappers are being absorbed
Companies built on frontier models with no proprietary data or trust integration face severe absorption risk. The despair about AI investability is "half right" on this point.
5. Choosing what to build is harder than defense
The model cannot tell you what is worth building. Identifying unaddressed use cases before the market is a uniquely human judgment — intent may be an even scarcer input than compute.
The Saturated × Public/Private 2×2
Where value lives in the AI stack — derived directly from the article's strategic framework
Saturated, public: Open models at lowest cost. Buyers stop asking which model and start asking the price. Margins collapse.
Saturated, private: Private data helps, but task saturation limits upside. Some durability, but not the prize.
Frontier, public: The eval is free, so owning it counts for nothing. Coding benchmarks live here. Labs win by grinding the check.
Frontier, private: Frontier work plus private correctness. Custom models on private data. The only quadrant with durable independent value.
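One way to read the 2×2 is as a lookup on two yes/no questions. The sketch below assumes the axes are saturated vs. frontier work and public vs. private correctness. The `Work` class, its field names, and the example entries are illustrative assumptions, not terms from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """A kind of work placed on the two axes (hypothetical structure)."""
    name: str
    saturated: bool  # True if models already do the task well enough
    private: bool    # True if correctness can only be judged inside the customer's systems


def quadrant(work: Work) -> str:
    """Map a piece of work to the quadrant description it falls under."""
    if work.saturated and not work.private:
        return "commodity: open models at lowest cost, margins collapse"
    if work.saturated and work.private:
        return "some durability: private data helps, saturation limits upside"
    if not work.saturated and not work.private:
        return "lab territory: the eval is free, so owning it counts for nothing"
    return "durable: frontier work with private correctness"


# Illustrative placements only; where a real product sits is a judgment call.
examples = [
    Work("generic Q&A", saturated=True, private=False),
    Work("public coding benchmark", saturated=False, private=False),
    Work("custom model on a firm's private data", saturated=False, private=True),
]
for w in examples:
    print(f"{w.name}: {quadrant(w)}")
```

Because the absorption frontier keeps rising, the `saturated` flag for any given task would be expected to flip from False to True over time, which moves the work out of the durable quadrant.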
Supporting Evidence
Data, observations, and cases cited in the article
MIT Study — 100K+ Developers
Mert Demirer et al. found coding agents lifted code written by ~180% but code actually shipped by only ~30%. Writing got cheap; shipping still runs through a person.
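The gap between the two figures can be made concrete with a little arithmetic. This is a rough sketch that takes the approximate percentages as cited here at face value and assumes both lifts are measured against the same baseline. The variable names are illustrative.

```python
# Approximate lifts as cited above, treated as exact for illustration.
WRITTEN_LIFT_PCT = 180  # increase in code written
SHIPPED_LIFT_PCT = 30   # increase in code actually shipped


def multiplier(pct_increase: float) -> float:
    """Convert a percentage increase into a multiple of the baseline."""
    return 1 + pct_increase / 100


written_x = multiplier(WRITTEN_LIFT_PCT)  # 2.8x the baseline volume written
shipped_x = multiplier(SHIPPED_LIFT_PCT)  # 1.3x the baseline volume shipped

# Share of written code that ships, relative to the share before agents.
ship_rate_ratio = shipped_x / written_x
print(round(ship_rate_ratio, 2))  # 0.46
```

Under these assumptions, the fraction of written code that reaches production falls to a bit under half of its earlier level, which is the sense in which writing got cheap while shipping did not.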
SWE-Bench Trajectory
Devin launched in 2024 at 13% on SWE-Bench. By mid-2026, the best agents scored in the high 80s. The benchmark was conquered, but the "wrong lesson" was drawn from it. Engineering resists full measurement.
Noam Brown on Evaluation Horizon
OpenAI's Noam Brown wrote that the only sure way to evaluate an agent over a one-year horizon may be to run it for a year. The clock cannot be skipped.
OpenEvidence Physician Adoption
A majority of American doctors now open OpenEvidence daily. No amount of compute buys physician habit or UCSF decision-flow integration. Trust is built slowly through relationships.
How to Build in the Untrainable Corner
Strategic playbook derived from the article's thesis
Get inside a system that requires trust
Do the unglamorous work: pass the security review, sign the integration contract, become the party with your name on outcomes. Permission precedes capability.
Arrange the customer's private reality
Connect the model to the firm's private data, internal tools, and workflows. This translation is hard to copy and never ends.
Write down what good means
Accumulate in-domain judgments. Companies owning the definition of "resolved" or authoring the legal benchmark earn standing no frontier lab can buy.
Price outcomes, not inputs
Charge for resolution, not tokens. Outcome pricing is only credible if you own the definition of success and are trusted inside the system.
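The difference between the two pricing models can be sketched in a few lines. Everything here is a hypothetical illustration: the `Ticket` class, the prices, and the sample data are assumptions, and the `resolved` flag stands in for whatever definition of success the vendor and customer have agreed on.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    tokens_used: int
    resolved: bool  # judged against the agreed definition of "resolved"


def token_bill(tickets: list[Ticket], price_per_1k_tokens: float) -> float:
    """Input pricing: charge for every token consumed, whatever the result."""
    return sum(t.tokens_used for t in tickets) / 1000 * price_per_1k_tokens


def outcome_bill(tickets: list[Ticket], price_per_resolution: float) -> float:
    """Outcome pricing: charge only for tickets that were resolved."""
    return sum(1 for t in tickets if t.resolved) * price_per_resolution


# Hypothetical month of work: the unresolved ticket used the most tokens.
tickets = [Ticket(4000, True), Ticket(12000, False), Ticket(2000, True)]

print(token_bill(tickets, price_per_1k_tokens=0.01))
print(outcome_bill(tickets, price_per_resolution=1.50))
```

Note that `outcome_bill` depends entirely on who sets the `resolved` flag, which is why outcome pricing is only credible for a vendor that owns the definition of success.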
Avoid the general-model capital war
Train specialized models on your private data and evals for narrow tasks. Out-training frontier labs on general tasks usually ends in a sale to someone compute-rich.
Keep stepping toward the not-yet-scored
The untrainable ground shrinks continuously. Don't find a defensible spot and rest. Re-underwrite constantly, moving toward what cannot yet be benchmarked.
Companies & People in the Argument
Entities cited as case studies or evidence
Sierra
Charges only when its agent resolves a customer issue. Outcome pricing works because Sierra owns the definition of "resolved" — a concrete example of untrainable-corner moat.
Harvey
Publishes a benchmark for law — earning the right to define what "good" means in legal work through real adoption, not lab authority.
Cognition (Devin)
Offers a "performance guarantee" on coding output — only possible when trusted inside the customer's system. The guarantee signals depth of integration.
OpenEvidence
Used daily by a majority of American doctors. Physician habit and hospital decision-flow integration cannot be purchased with compute.
Rippling / Matt MacInnis
Source of the token value gradient insight: generic-question tokens are near-worthless; proprietary-data-reasoning tokens carry real value.
Gabe Pereyra
Real automation requires product, model, workflow, and firm moving together. Three of those four move at the speed of an organization — no benchmark captures this.