Voice: Part 2 - Milliseconds Matter
Strategic Machines
Deflection Points?
In Part 1 of our series on Voice, we commented on the recent Gartner Hype Cycle, noting that conversational platforms were positioned for a breakout year in 2026. Our experience is that breakaway performance with AI will only be realized by firms paying close attention to AI design principles, avoiding the sensational and embracing the essential. With that, we believe that 2026 will be the 'Year of Voice' for many essential use cases.
But as we move from hype to reality, we encounter the physics of real-time interaction. In text-based chat or 'first paint' in a web app, a 3-second delay is acceptable. In voice, a 500-millisecond delay feels like an eternity - it disrupts the natural interchange of a conversation. That 'deflects' the intended path of the conversation, derailing the expected outcome. To succeed, we must understand why Milliseconds Matter.
The Latency War
As noted by the partners at a16z in their update earlier this year, we are transitioning from the infrastructure layer to the application layer for Voice. The "stack" has been streamlined and the technologies have advanced, allowing for lower latency, but the bar for consumer expectations remains incredibly high.
When a customer speaks, the system must:
- Listen (Speech-to-Text)
- Think (LLM Processing: interpret the request)
- Retrieve relevant context
- Think (LLM Processing: compose the response)
- Speak (Text-to-Speech)
If this loop takes longer than the natural pause in a human conversation, the illusion begins to break. Companies like OpenAI, xAI, Google and others are pushing the boundaries with native real-time capabilities, and innovators like Lemon Slice are even introducing real-time video avatars that react at 20 frames per second. We are entering an era where the machine is delivering at ‘human speed’ – or better. Longer latencies require visual or audio cues to preserve the 'connection' - all matters that must be addressed with design.
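To make the loop concrete, here is a minimal latency-budget sketch in Python. The stage timings and the 500 ms budget are illustrative assumptions for this example, not measurements of any particular provider or stack:

```python
# A minimal latency-budget sketch for one conversational turn.
# All stage timings below are illustrative assumptions, not
# measurements of any specific provider or stack.
TURN_BUDGET_MS = 500  # beyond this, the pause starts to feel unnatural

stage_latency_ms = {
    "listen (speech-to-text)": 150,    # time to a usable transcript
    "think (interpret request)": 100,  # first LLM pass
    "retrieve context": 80,            # vector store / CRM lookup
    "think (compose response)": 120,   # second LLM pass, first token
    "speak (text-to-speech)": 100,     # time to first audio byte
}

total = sum(stage_latency_ms.values())
print(f"End-to-end turn: {total} ms (budget: {TURN_BUDGET_MS} ms)")
for stage, ms in stage_latency_ms.items():
    print(f"  {stage:<26} {ms:4d} ms ({ms / total:.0%})")

if total > TURN_BUDGET_MS:
    print("Over budget: stream stages concurrently or add audible cues.")
```

Even with plausible per-stage numbers, the sketch lands over budget, which is exactly why streaming stages concurrently (and designing in those audible cues) matters.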
The Physics of "Hearing"
Speed means nothing if the agent can't hear you. In our testing, we dive deep into the technical metrics that ensure reliability.
One critical metric is RMS (Root Mean Square). Without becoming mired in the math, think of this as the "energy" of the conversation. It measures the average power of the sound wave. If your agent can't distinguish the RMS of a customer's voice from the RMS of the background noise in a busy airport, the transaction fails.
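For illustration, here is a short Python sketch of the RMS calculation on a synthetic frame; the 16 kHz rate, tone, and noise levels are assumptions chosen purely to show the contrast in 'energy':

```python
import numpy as np

def rms(frame: np.ndarray) -> float:
    """Root Mean Square: square root of the mean of the squared samples."""
    return float(np.sqrt(np.mean(np.square(frame.astype(np.float64)))))

# One 20 ms frame at 16 kHz. The tone and noise levels are made up
# purely to illustrate the difference in signal 'energy'.
sr = 16_000
t = np.linspace(0.0, 0.02, int(sr * 0.02), endpoint=False)
voice = 0.3 * np.sin(2 * np.pi * 220.0 * t)  # stand-in for speech energy
noise = 0.02 * np.random.randn(t.size)       # stand-in for airport hum

print(f"voice frame RMS: {rms(voice + noise):.4f}")
print(f"noise frame RMS: {rms(noise):.4f}")
```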
This brings us to Noise Cancellation. Effective voice agents today use advanced signal processing (like WebRTC or RNNoise) to isolate the human voice from the chaos of the environment. The Signal-to-Noise ratio is a business metric: if the agent can't filter the noise, it can't capture the intent. The practical takeaway is that when selecting voice providers, features like the isolation mode recently introduced by xAI are a genuine necessity.
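Building on the RMS sketch above, here is one way to express that ratio in decibels and gate frames before they ever reach the Speech-to-Text stage; the threshold is a per-deployment tuning assumption, not a standard value:

```python
import math

def snr_db(rms_signal: float, rms_noise: float) -> float:
    """Signal-to-noise ratio in decibels, from two RMS measurements."""
    return 20.0 * math.log10(rms_signal / rms_noise)

# RMS values roughly matching the toy frames in the previous sketch.
ratio = snr_db(rms_signal=0.21, rms_noise=0.02)
print(f"SNR: {ratio:.1f} dB")  # ~20 dB

# A crude gate: only pass frames to Speech-to-Text when they clear the
# noise floor by a margin. The threshold is a tuning knob per
# deployment, not a universal constant.
GATE_THRESHOLD_DB = 10.0
print("pass to STT" if ratio >= GATE_THRESHOLD_DB else "suppress frame")
```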
The Landscape is Shifting
The market is flooded with providers competing on these very metrics: speed and quality.
- ElevenLabs and Deepgram are setting standards for voice realism.
- Resemble AI’s Chatterbox is proving that open-source solutions can outperform proprietary giants in blind tests.
- Grok Voice is entering the fray with powerful real-time capabilities.
We track these players not just for their tech, but for their pricing models. As costs drop (OpenAI recently cut API prices by nearly 90%), the economic viability of voice skyrockets. Of course, as with any major voice provider, accuracy is critical as well. The table below summarizes some of our research on a few of the major players; a rough per-minute cost model follows it:
| Provider | Estimated Market Share (2025) | Cost (Per Million Characters or Equivalent) | Accuracy (MOS Score Estimate) |
|---|---|---|---|
| Google Cloud TTS | 20% (Enterprise leader in multilingual TTS) | $16 per million characters (neural voices; ~$4 per million tokens) | 4.5/5 (Excellent prosody via WaveNet) |
| Amazon Polly | 17% (Strong AWS integration for e-commerce) | $16 per million characters (neural; ~$4 per million tokens) | 4.4/5 (High intelligibility with sync features) |
| Azure Speech | 15% (Custom voices and broad integrations) | ~$16 per million characters (neural, tiered; ~$4 per million tokens) | 4.6/5 (Superior emotional depth and customization) |
| ElevenLabs | 12% (Hyper-realistic cloning and premium TTS) | $75–300 per million characters (varies by model like Turbo v2 at ~$75/M; ~$18.75–75 per million tokens) | 4.8/5 (Best-in-class realism per benchmarks) |
| OpenAI Realtime | 14% (Rapid growth in real-time voice agents via ChatGPT and API) | ~$400 per million characters (based on ~$0.24/min output, ~600 chars/min; equivalent to ~$100 per million audio tokens; full agent use) | 4.7/5 (Near-human conversational quality, >80% accuracy in reasoning evals) |
| Grok Voice (xAI) | 3% (Emerging in real-time agents with X/Tesla ties) | ~$83 per million characters (based on $0.05/min flat rate, ~600 chars/min; ~$20.75 per million tokens) | 4.7/5 (Strong low-latency realism for agents) |
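To translate the table into per-minute economics, here is a back-of-envelope sketch that reuses the table's own ~600 characters-per-minute approximation. The prices are the table's estimates, not quoted vendor rates:

```python
# Back-of-envelope cost per spoken minute, reusing the table's own
# ~600 characters-per-minute approximation. Prices are the table's
# estimates, not quoted vendor rates.
CHARS_PER_MINUTE = 600

usd_per_million_chars = {
    "Google Cloud TTS": 16,
    "Amazon Polly": 16,
    "Azure Speech": 16,
    "ElevenLabs (Turbo v2)": 75,
    "OpenAI Realtime": 400,
    "Grok Voice (xAI)": 83,
}

for provider, price in usd_per_million_chars.items():
    per_minute = price / 1_000_000 * CHARS_PER_MINUTE
    print(f"{provider:<22} ~${per_minute:.4f} per spoken minute")
```

Run this and the spread is stark: roughly a penny per minute at the low end versus about $0.24 per minute for full real-time agents, which is why pricing models deserve as much scrutiny as benchmarks.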
We're big fans of these platforms, recognizing that we're still in a 'horse race', with new entrants arriving every quarter. But with the right design, robust applications can be built to engage and serve customers in astonishing new ways - and the infrastructure can be changed as easily as swapping APIs (one approach is sketched below).
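One way to keep that swap cheap is a thin adapter layer. The sketch below shows the pattern; the class and method names are illustrative placeholders, not actual vendor SDK signatures:

```python
from typing import Protocol

class TextToSpeech(Protocol):
    """Provider-agnostic contract. All names here are illustrative
    placeholders, not actual vendor SDK signatures."""
    def synthesize(self, text: str) -> bytes: ...

class ElevenLabsTTS:
    def synthesize(self, text: str) -> bytes:
        # A real implementation would call ElevenLabs' API here;
        # this sketch returns placeholder audio bytes.
        return b"\x00" * len(text)

class AzureSpeechTTS:
    def synthesize(self, text: str) -> bytes:
        # Likewise, a real implementation would call Azure Speech.
        return b"\x00" * len(text)

def speak(engine: TextToSpeech, text: str) -> bytes:
    # Application code depends only on the contract, so changing
    # providers is a one-line swap at construction time.
    return engine.synthesize(text)

audio = speak(ElevenLabsTTS(), "Your order has shipped.")  # swap freely
```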
We have the design (Part 1). We have the speed and technology (Part 2). Now, we must answer the final (and most important) question: Does this actually make money? Please join us for Part 3 of our series: Business Matters.
And give us a call. We'd welcome the chance to 'swap stories' on navigating this shift and getting ready for the 2026 revolution.