# Interaction Models: A Scalable Approach to Human-AI Collaboration

**Author:** [Thinking Machines Lab](https://thinkingmachines.ai)
**Published:** May 11, 2026
**Source:** https://thinkingmachines.ai/blog/interaction-models/

---

## Overview

Thinking Machines Lab introduces **[interaction models](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23interactionModels)** — AI models that handle interactivity natively through [micro-turn design](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23microTurnDesign) (200ms chunks) rather than external scaffolding. The core thesis: **interactivity should scale alongside intelligence**. A 276B MoE model with 12B active parameters achieves state-of-the-art streaming performance and is the first to demonstrate meaningful results on novel interactivity dimensions — time awareness, visual proactivity, and simultaneous speech.

---

## Three Principles

### [Interactivity Should Scale with Intelligence](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23interactivityScales)
As models get smarter, interfaces must get more interactive. Turn-based chat pushes humans out not because their input isn't needed, but because the interface has no room for them.

### [Keep Humans in the Collaborative Loop](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23humansInTheLoop)
Autonomous agents are valuable but don't serve collaborative workflows. Humans need to stay in the loop — clarifying, course-correcting, and giving real-time feedback.

### [Native Interaction, Not External Scaffolding](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23nativeNotScaffolding)
Interaction must be handled by the model itself — multi-stream input/output processed simultaneously, natively — not bolted on externally.

---

## Architecture (5 Components)

| Component | Description |
|-----------|-------------|
| **[Micro-Turn Processing](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23microTurn)** | 200ms time-aligned chunks — simultaneous input/output for continuous exchange |
| **[Encoder-Free Early Fusion](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23encoderFree)** | Audio via dMel, images via hMLP, decoder via flow head — all co-trained |
| **[Interaction + Background Split](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23interactionBackground)** | Real-time interaction model + async background model for reasoning/tools/search |
| **[276B MoE (12B Active)](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23moEdesign)** | Gather+gemv kernels, bitwise deterministic training via NVLS |
| **[SGLang Streaming](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23sglangStreaming)** | Split-KV attention, upstreamed patch, under 5% overhead |
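The micro-turn idea in the table above can be sketched as a loop that consumes one time-aligned input chunk and emits one output chunk per step, instead of waiting for a full user turn to end. This is a minimal illustration, not the article's implementation; `micro_turn_loop` and `echo_step` are hypothetical names, and `echo_step` is a toy stand-in for a model forward pass:

```python
CHUNK_MS = 200  # micro-turn length cited in the article

def micro_turn_loop(input_chunks, model_step, state=None):
    """Interleave perception and generation: each 200ms input chunk
    is consumed and an output chunk is emitted in the same step."""
    outputs = []
    for chunk in input_chunks:
        # One time-aligned step: the model both reads and writes,
        # carrying recurrent/KV state between micro-turns.
        out, state = model_step(chunk, state)
        outputs.append(out)
    return outputs

def echo_step(chunk, state):
    # Toy stand-in for a model forward pass over one chunk.
    count = (state or 0) + 1
    return f"reply-{count}-to-{chunk}", count
```

The key property is that output latency is bounded by the chunk length (200ms), not by the length of the user's turn.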

---

## Benchmarks

| Benchmark | TML-Small | Best Competitor |
|-----------|-----------|-----------------|
| FD-bench V1 Turn-Taking | **0.40s** | 1.18s (GPT-2.0 minimal) |
| FD-bench V1.5 Average | **77.8** | 54.3 (Gemini minimal) |
| FD-bench V3 Quality | **82.8% / 68.0%** | 81.0% / 58.0% (GPT-2.0 xhigh) |
| TimeSpeak (novel) | **64.7** | 4.3 (GPT-2.0 minimal) |
| CueSpeak (novel) | **81.7** | 2.9 (GPT-2.0 minimal) |
| RepCount-A (novel) | **35.4** | 1.3 (GPT-2.0 minimal) |
| ProactiveVideoQA (novel) | **33.5** | 25.0 (no-response baseline) |
| Harmbench Refusal | 99.0% | 100.0% (GPT-2.0 xhigh) |

---

## How to Build Interaction Models (7 Steps)

1. **[Start with Micro-Turn Design](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step1)** — Process 200ms chunks for simultaneous input/output.
2. **[Use Encoder-Free Early Fusion](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step2)** — Minimize preprocessing with lightweight co-trained encoders.
3. **[Split Interaction from Background](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step3)** — Real-time presence vs. sustained reasoning and tool use.
4. **[Optimize with MoE + Streaming](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step4)** — Gather+gemv kernels, SGLang, bitwise deterministic training.
5. **[Build Novel Interactivity Benchmarks](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step5)** — Standard benchmarks don't measure interactivity.
6. **[Design Safety for Continuous Mode](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step6)** — Modality-appropriate refusals, long-horizon red-teaming.
7. **[Release Incrementally](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23step7)** — Research preview first, wider release later, fund evaluation frameworks.
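Step 3 above — splitting real-time interaction from background reasoning — can be sketched with two concurrent tasks sharing a context object. This is a schematic sketch under assumed names (`interaction_model`, `background_reasoner`, `session`), not the article's architecture: the fast path replies every turn while the slow path fills in results asynchronously:

```python
import asyncio

async def background_reasoner(query, shared):
    # Slow path: sustained reasoning / tool use / search (simulated).
    await asyncio.sleep(0.05)
    shared["facts"].append(f"result for {query}")

async def interaction_model(turns, shared):
    # Fast path: stays responsive every micro-turn, folding in
    # whatever the background model has produced so far.
    replies = []
    for turn in turns:
        replies.append(f"{turn} (known facts: {len(shared['facts'])})")
        await asyncio.sleep(0.01)
    return replies

async def session():
    shared = {"facts": []}
    bg = asyncio.create_task(background_reasoner("deep question", shared))
    replies = await interaction_model(["hi", "status?", "and now?"], shared)
    await bg  # background work lands in shared context when ready
    return replies, shared
```

The design point is that neither task blocks the other: the interaction model never waits on tools or search, and the background model never has to meet a 200ms deadline.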

---

## FAQ

**Q: [What are interaction models?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq1)**
A: AI models handling interaction natively via micro-turn design — 200ms chunks of simultaneous input/output across audio, video, and text.

**Q: [What problem do they solve?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq2)**
A: The collaboration bottleneck — turn-based interfaces push humans out because the interface has no room for them.

**Q: [What is micro-turn design?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq3)**
A: Time-aligned processing enabling continuous exchange — the model perceives and generates simultaneously.

**Q: [What is encoder-free early fusion?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq4)**
A: Minimal preprocessing — dMel for audio, hMLP for images, flow head for decoding. All co-trained from scratch.
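To make "minimal preprocessing" concrete: the spirit of an encoder-free audio front end is a cheap, fixed featurization fed straight into the transformer. The article does not publish dMel's details, so the sketch below is a generic log-spectrogram stand-in with illustrative frame/hop sizes, not the actual dMel transform:

```python
import numpy as np

def log_spectrogram(wave, sr=16000, frame_ms=25, hop_ms=10):
    """Frame a waveform and take log-magnitude FFT bins — a cheap,
    fixed featurization in the spirit of an encoder-free front end
    (dMel itself is not specified in the article)."""
    frame = int(sr * frame_ms / 1000)          # samples per frame
    hop = int(sr * hop_ms / 1000)              # samples per hop
    n = 1 + max(0, len(wave) - frame) // hop   # number of frames
    frames = np.stack([wave[i * hop : i * hop + frame] for i in range(n)])
    window = np.hanning(frame)                 # reduce spectral leakage
    spec = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(spec + 1e-6)                 # (n_frames, n_bins)
```

Because the transform is differentiable-friendly and has no learned encoder stack, the language model itself can be co-trained end to end on the raw features.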

**Q: [How does the split architecture work?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq5)**
A: Real-time interaction model + async background model for sustained reasoning, tool use, and search — sharing context.

**Q: [What model specs?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq6)**
A: 276B MoE with 12B active. Gather+gemv kernels. SGLang streaming. Bitwise deterministic training.
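The 276B-total / 12B-active ratio comes from sparse expert routing: each token activates only its top-k experts, so most parameters sit idle per pass. A minimal sketch of that routing (shapes, `k`, and all names are illustrative; the real system uses fused gather+gemv kernels rather than a Python loop):

```python
import numpy as np

def moe_forward(x, expert_weights, router, k=2):
    """Sparse MoE step: route each token to its top-k experts, so only
    a small fraction of total parameters is touched per forward pass."""
    logits = x @ router                            # (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]      # chosen expert ids
    sel = np.take_along_axis(logits, topk, axis=1)
    # Softmax gates over the selected experts only.
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = topk[t, j]
            # Gather: only the chosen expert's weights are read.
            out[t] += gates[t, j] * (x[t] @ expert_weights[e])
    return out
```

With 8 experts and k=2, each token touches a quarter of the expert parameters; the article's 276B/12B split implies a far sparser ratio at scale.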

**Q: [How does it compare?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq7)**
A: 0.40s turn latency (vs 1.18s). Leads on FD-bench V1.5 and all novel interactivity benchmarks.

**Q: [What novel benchmarks?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq8)**
A: TimeSpeak, CueSpeak, RepCount-A, ProactiveVideoQA, Charades. No existing model performs meaningfully on them.

**Q: [Current limitations?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq9)**
A: Long-session context limits, compute reliability, open safety research questions, and larger models that are still too slow to serve.

**Q: [How does safety work?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq10)**
A: Modality-appropriate refusals (TTS-generated) and long-horizon robustness via automated red-teaming.

**Q: [Philosophical grounding?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq11)**
A: Clark & Brennan grounding theory, Ong orality/literacy, Scott métis, Hayek knowledge in society.

**Q: [When available?](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23faq12)**
A: Limited research preview in coming months. Wider release later in 2026. Research grants offered.

---

## Glossary

- **[Interaction Models](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23interactionModels)** — AI models handling interactivity natively, not through external scaffolding.
- **[Human-AI Collaboration](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23humanAIcollaboration)** — Continuous real-time exchange keeping humans in the loop.
- **[Micro-Turn Design](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23microTurnDesign)** — 200ms time-aligned processing for simultaneous input/output.
- **[Streaming AI](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23streamingAI)** — Real-time AI across audio, video, and text with continuous perception.
- **[Mixture of Experts](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23moETerm)** — Architecture with sparse activation — 276B total, 12B active per pass.
- **[SGLang](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23sglangTerm)** — Inference framework with streaming sessions and efficient MoE kernels.
- **[dMel](https://linkeddata.uriburner.com/describe/?uri=https%3A%2F%2Fthinkingmachines.ai%2Fblog%2Finteraction-models%2F%23dMelTerm)** — Differentiable Mel-Spectrogram for lightweight end-to-end audio encoding.

---

## Related Resources

- [Original Article](https://thinkingmachines.ai/blog/interaction-models/)
- [Thinking Machines Lab](https://thinkingmachines.ai)
- [RDF Turtle](../rdf/interaction-models-deepseek_v4pro-1.ttl)
- [JSON-LD](../rdf/interaction-models-deepseek_v4pro-1.jsonld)
- [HTML Infographic](../webpages/interaction-models-deepseek_v4pro-1.html)

---

*Generated by kg-generator skill · Powered by DeepSeek V4 Pro · May 11, 2026*
