Voice chatting with today’s AI can feel as unnatural as talking over a crackling CB radio. You wait for your turn, speak, then wait again while the AI processes and responds. There’s no overlap, no interruptions, no subtle cues—just a stilted back-and-forth that strips conversation of its human rhythm. For many users, this friction makes voice AI feel like a gimmick rather than a genuine productivity tool.
But a new generation of AI models is poised to change that. Thinking Machines, an AI startup founded by Mira Murati, OpenAI's former chief technology officer, has developed what it calls 'interaction models' that break the single-threaded paradigm. Instead of forcing a strict turn-taking protocol, these models employ a 'multi-stream, micro-turn' architecture that allows them to listen and respond at the same time, react in real time, and even interrupt the user when appropriate. The result is a conversational flow that feels far closer to human interaction.
How current AI voice chat works—and why it's broken
To understand the breakthrough, it helps to examine the limitations of existing voice modes in systems like ChatGPT, Google Gemini, or Amazon Alexa. Today’s AI voice interfaces are essentially text-generation models with a speech-to-text and text-to-speech layer added on top. When the user speaks, the AI must first record the entire utterance, convert it to text, generate a response, and then synthesize speech. During this process, the model is entirely focused on generating its reply—it cannot listen to new input, perceive the passage of time, or notice changes in the user’s environment.
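To make the bottleneck concrete, here is a minimal Python sketch of that pipeline. The function names are placeholders invented for the example, not any vendor's actual API; what matters is the blocking structure, in which each stage must finish before the next one begins.

```python
# Illustrative sketch of the turn-based voice pipeline described above.
# All functions are stand-ins, not a real product's API.

def record_until_silence() -> bytes:
    # Placeholder: a real system waits for a long pause (or a wake phrase)
    # before deciding the user is done.
    return b"<entire user utterance>"

def transcribe(audio: bytes) -> str:
    return "what should I eat today?"        # speech-to-text stand-in

def generate_reply(prompt: str) -> str:
    return "How about some fruit?"           # language-model stand-in

def synthesize(text: str) -> bytes:
    return text.encode()                     # text-to-speech stand-in

def play(audio: bytes) -> None:
    print(audio.decode())

def one_turn() -> None:
    audio = record_until_silence()    # nothing happens until the user stops talking
    text = transcribe(audio)          # then the whole utterance is transcribed
    reply = generate_reply(text)      # then a single, blocking model call
    play(synthesize(reply))           # then speech; new input is ignored throughout

one_turn()
```

Everything after the recording step happens with the microphone effectively switched off, which is exactly the CB-radio feel described above.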
This creates a 'push-to-talk' or CB-radio experience where the user must indicate they are finished speaking (often by a long pause or a specific phrase), and then wait for the AI to complete its response. If the user tries to interrupt, the AI typically ignores the new input or gets confused. Moreover, the AI has no awareness of visual cues—it cannot see if the user is nodding, smiling, or holding up an object. These limitations make AI voice chat feel like a technical demo rather than a natural conversation partner.
The problem is rooted in the underlying architecture. Large language models (LLMs) are designed to process one input at a time. They have no built-in mechanism for parallel processing of audio and video streams, nor for managing overlapping dialogue—a feature that human conversation relies on heavily. Researchers have attempted to mitigate this with 'pause and response' heuristics, but these are brittle and often result in unnatural pauses or missed opportunities for feedback. The industry has recognized this as a major barrier to adoption, and several labs are racing to solve it.
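In practice, such a 'pause and response' heuristic is often little more than a silence threshold. The sketch below uses an invented 800 ms cutoff purely for illustration; its brittleness is the point, since a thoughtful pause and a finished sentence look identical to it.

```python
# Toy end-of-turn heuristic: treat any silence longer than a fixed
# threshold as the user yielding their turn. The threshold is an
# assumption for illustration, not a value from any real system.

END_OF_TURN_SILENCE_MS = 800

def user_is_done(silence_ms: int) -> bool:
    return silence_ms >= END_OF_TURN_SILENCE_MS

print(user_is_done(700))   # False: still the user's turn, per the rule
print(user_is_done(900))   # True: turn over, even if it was just a sip of coffee
```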
Thinking Machines' dual-model approach
Thinking Machines, founded by Mira Murati after her 2024 departure from OpenAI, has taken a fundamentally different approach. Instead of trying to modify a single monolithic model, the company uses two distinct AI models that work in tandem. The first is a lightweight 'interaction model' that is always 'present' with the user. It processes incoming audio and video in rapid, 200-millisecond chunks, allowing it to react almost instantly to changes in the conversation. It can detect when the user pauses, when a word is mispronounced, or when new visual information appears—for example, a product held up to the camera.
The second model is a more powerful 'background model' that handles complex reasoning tasks. When the interaction model encounters a question that requires deeper thought—like a math problem or a detailed fact-check—it offloads the request to the background model. The background model processes the task asynchronously and passes the result back to the interaction model, which then delivers it seamlessly within the ongoing conversation. This separation allows the system to maintain the illusion of uninterrupted, real-time dialogue while still accessing the full reasoning power of a large language model.
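A rough sketch of that split might look like the following: a fast loop that reacts every couple of hundred milliseconds, paired with an asynchronous worker that takes its time. The chunking rule, the timings, and every name here are assumptions made for the example, not Thinking Machines' actual design.

```python
import asyncio

async def background_model(question: str) -> str:
    # Stand-in for slow, heavy reasoning (a large model, a fact-check, etc.).
    await asyncio.sleep(1.0)
    return f"By the way, about {question!r}: the answer is 42."

async def interaction_loop(chunks: list[str]) -> None:
    pending = None                                  # in-flight background task, if any
    for chunk in chunks:
        # Toy heuristic: anything that looks like a question gets offloaded
        # to the background model while the conversation keeps moving.
        if chunk.endswith("?") and pending is None:
            pending = asyncio.create_task(background_model(chunk))
            print("AI: good question, give me a second...")
        else:
            print(f"user: {chunk!r}  /  AI: mm-hmm")   # tiny micro-response
        # If the background model has finished, weave its answer back in.
        if pending is not None and pending.done():
            print(f"AI: {pending.result()}")
            pending = None
        await asyncio.sleep(0.2)                    # one ~200 ms micro-turn
    if pending is not None:                         # deliver a late result
        print(f"AI: {await pending}")

asyncio.run(interaction_loop([
    "so I was at the market earlier",
    "and I started wondering",
    "what's the square root of 1764?",
    "anyway, they had acai bowls",
    "and fresh mangoes too",
    "I bought a couple",
    "then walked home",
    "it started raining on the way",
    "but I didn't mind",
]))
```

The conversation never stalls while the slow worker runs; its answer simply arrives a few micro-turns later, which is the illusion the two-model split is meant to preserve.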
The micro-turn mechanism is the key innovation. Instead of waiting for the user to finish a full sentence, the interaction model processes the audio stream in very brief segments—around 200ms each. For each segment, it decides whether to continue listening, nod, make a sound, or even interrupt. This is analogous to how humans process conversation: we don't wait until someone finishes speaking to formulate a response; we continuously process and make small decisions. The micro-turn model mimics this by generating tiny responses (like 'uh-huh,' a pause, or a short clarification) that keep the conversation flowing.
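The per-segment choice can be pictured as a tiny policy that runs on every chunk. The features and thresholds below are invented for illustration; a real interaction model would presumably learn this policy rather than hard-code it, but the sketch shows the shape of the decision.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    KEEP_LISTENING = auto()   # stay quiet and leave the floor with the user
    BACKCHANNEL = auto()      # a small "uh-huh" that doesn't take the floor
    INTERRUPT = auto()        # briefly take the floor, e.g. to fix an error
    RESPOND = auto()          # the user has yielded the turn; answer properly

@dataclass
class Chunk:
    """One ~200 ms slice of the incoming audio/video streams (hypothetical features)."""
    text: str                 # words recognized in this slice, if any
    silence_ms: int           # how long the user has currently been quiet
    factual_error: bool       # did this slice contradict something we know?

def decide(chunk: Chunk) -> Action:
    if chunk.factual_error:
        return Action.INTERRUPT          # worth cutting in, as in the demos
    if chunk.silence_ms > 2000:
        return Action.RESPOND            # long enough to be a real turn boundary
    if chunk.text.endswith("right?"):
        return Action.BACKCHANNEL        # an invitation to agree gets a quick "mm-hmm"
    return Action.KEEP_LISTENING         # default: a short pause is not a handover

# A 900 ms pause (say, a sip of coffee) is just silence, not a turn switch.
print(decide(Chunk(text="", silence_ms=900, factual_error=False)))  # KEEP_LISTENING
```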
In a series of demo videos, Thinking Machines showcased this capability. In one clip, a user holds up different fruits to the camera while discussing their diet. The AI names each fruit instantly, without the user having to stop and ask. In another demo, the AI keeps a running tally of how many times the user says a specific word (like 'deer') while the user continues speaking naturally. More strikingly, the AI can intervene mid-sentence to correct a mispronunciation: when a user mispronounces 'acai' and claims it originated in Argentina, the AI interrupts gently to correct both the pronunciation and the geographical inaccuracy. These interactions happen fluidly, without the 'over and out' feel of current systems.
The startup also demonstrated impressive restraint: in one video, a user pauses mid-sentence to take a sip of coffee. The AI waits silently, recognizing the pause as a natural break rather than a turn-switch. This ability to understand the rhythm of human conversation—including silences—is a major step forward.
Challenges and limitations
Nevertheless, Thinking Machines is the first to admit that its interaction models are still in a research preview. The company acknowledges that the models 'struggle with very long conversations' and require 'reliable connectivity' to function properly. The interaction model itself is relatively small, because larger models are too slow to run the 200ms micro-turn cycle. This means that for complex, open-ended discussions, the system may rely heavily on the background model, which introduces latency that can break the illusion of fluidity.
The current demos are carefully staged, and it remains to be seen how the system behaves in the wild—with background noise, poor lighting, multiple speakers, or erratic internet connections. The micro-turn approach also raises questions about privacy: a model that continuously processes audio and video streams captures far more of its surroundings than a simple push-to-talk system. Users may be uncomfortable with a constantly listening AI that can see and hear everything in its vicinity.
Moreover, the paradigm shift from turn-taking to full-duplex conversation introduces new design challenges. How does the AI decide when to interrupt? Too aggressive, and it becomes annoying; too passive, and it reverts to the old CB-radio feel. The balance between responsiveness and intrusiveness is subtle, and getting it right will require extensive user testing. The company's demos show careful scripting: the AI interrupts only when instructed or when correcting a clear factual error. In everyday use, interpreting when an interruption is desirable versus disrespectful will be a harder nut to crack.
The race for natural voice interaction
Thinking Machines is not alone in pursuing natural voice AI. Google has long worked on 'full-duplex' speech models, and OpenAI offers real-time, interruptible conversation through ChatGPT's Advanced Voice Mode. Amazon is rumored to be developing a more conversational Alexa. However, most of these efforts have focused on making speech generation more human-like—adding fillers, pauses, and emotional tone—rather than fundamentally changing the underlying conversation model. Thinking Machines' approach is distinctive because it addresses the core architecture of how the AI processes and responds, rather than just polishing the output.
The implications extend far beyond casual chat. In customer service, a natural voice AI could handle complex queries without forcing customers to repeat themselves after interruptions. In education, a tutoring AI could correct a student in real time, guiding them through a problem step by step. In accessibility, users with speech impairments could benefit from an AI that understands non-verbal cues and pauses. The potential is enormous, provided the technology can be refined for reliable, general-purpose use.
Mira Murati’s involvement adds credibility to the project. During her tenure as OpenAI's chief technology officer, she oversaw the development of ChatGPT, GPT-4, and DALL·E, and was a strong advocate for responsible AI deployment. Her departure from OpenAI in 2024 was widely covered as a sign of internal disagreements over the pace of commercialization. With Thinking Machines, she appears to be betting on a more fundamental research-first approach—with the interaction model representing a return to ambitious architectural innovation rather than incremental product improvements.
The company is still small, with fewer than 50 employees, but has attracted significant venture capital interest. If the interaction-model approach can scale to larger models and hold up under real-world conditions, it could become the backbone of a new generation of voice assistants that finally feel like talking to a person instead of a machine. For now, the demos are promising, but the proof will come when users can try it themselves—without a script, without perfect lighting, and with the kind of messy, unpredictable dialogue that defines real human conversation.
The era of the CB-radio AI may be coming to an end. The question is whether Thinking Machines can deliver on its audacious vision, or whether the challenges of full-duplex conversation will prove too complex for even the best AI to master. Given the pace of progress, a world where we naturally converse with our devices—interrupting, being interrupted, and sharing a laugh—might be closer than we think.
Source: PCWorld News