Voice Wars: Redux
By Strategic Machines
Revisiting the Benchmarks
We didn't expect to be back so quickly after our last post. But here we are, with an update after OpenAI's latest announcement. This just underscores how fast this corner of the technology market is moving.
You may recall that a week ago we published Voice Wars, which looked at the benchmarks for top voice technologies. There are a lot of benchmarks being published these days, but we have found the data from Artificial Analysis to be among the most consistent and reliable tests of voice technologies.
Speech Reasoning (Big Bench Audio)
Speech reasoning scores are from the Artificial Analysis Big Bench Audio dataset. Higher is better.
| Rank | Model | Company | Score Last Week | Score This Week |
|---|---|---|---|---|
| 1 | Step-Audio R1.1 (Realtime) | Stepfun | 97.0% | 97.6% |
| 2 | Grok Voice Think Fast 1.0 | xAI | — | 97.1% |
| 3 | GPT-Realtime-2 (High) | OpenAI | — | 96.6% |
| 4 | Gemini 3.1 Flash Live Preview - High | Google | — | 96.6% |
| 5 | Grok Voice Agent | xAI | 92.9% | 93.3% |
| 6 | Gemini 2.5 Flash Native Audio Dialog Thinking | Google | 90.7% | 90.7% |
| 7 | Nova 2.0 Sonic (Mar 2026) | Amazon (AWS) | 86.6% | 88.1% |
| 8 | GPT Realtime | OpenAI | 83.3% | 83.3% |
| 9 | GPT-Realtime-1.5 | OpenAI | 81.1% | 81.4% |
| 10 | Qwen3.5 Omni Plus Realtime | Alibaba | — | 73.0% |
| 11 | GPT-Realtime-2 (Minimal) | OpenAI | — | 71.8% |
| 12 | Gemini 3.1 Flash Live Preview - Minimal | Google | — | 71.3% |
| 13 | GPT-4o mini Realtime (Dec 2024) | OpenAI | 68.9% | 68.9% |
| 14 | Gemini 2.5 Flash Native Audio Dialog | Google | 68.6% | 68.6% |
| 15 | GPT Realtime Mini (Oct 2025) | OpenAI | 62.3% | 63.6% |
| 16 | Qwen3.5 Omni Flash Realtime | Alibaba | — | 59.0% |
| 17 | Qwen3 Omni Flash | Alibaba | 58.1% | 58.7% |
| 18 | Qwen3 Omni Realtime | Alibaba | 56.8% | 56.8% |
| 19 | GPT-4o audio chatcompletions | OpenAI | 54.3% | 54.3% |
| 20 | Nemotron Voicechat | NVIDIA | 38.8% | 38.8% |
| 21 | Freeze-Omni | VITA-MLLM (research) | 31.7% | 33.4% |
| 22 | PersonalPlex | NVIDIA | 19.1% | 19.1% |
| 23 | F.L.M-Audio | CofeAI | 16.0% | 16.0% |
| 24 | Moshi | Kyutai | 4.3% | 4.4% |
The OpenAI Move
The headline this week is GPT-Realtime-2, OpenAI's new flagship native Speech-to-Speech model. The numbers are significant: 96.6% on Big Bench Audio, roughly 13 points above the previous-generation GPT Realtime, and the #1 position on Artificial Analysis's Conversational Dynamics benchmark at 96.1%. These are not incremental gains. A 13-point improvement in a single release cycle signals that the underlying reasoning architecture has changed materially, not merely been tuned.
But the more consequential development is not the score. It is the architecture decision behind it.
The Insight the Benchmarks Don't Show You
GPT-Realtime-2 introduces adjustable reasoning effort — minimal, low, medium, high, and xHigh — applied at the voice layer. This matters more than the headline number. It means a developer can now deploy a single model that runs at 1.12 seconds Time to First Audio on transactional exchanges and scales up to deep reasoning on complex decisions, all within the same session. That has never been available in a voice model before.
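The per-turn effort switch can be sketched in code. This is a hypothetical illustration, not a documented API: the `session.update` event shape and the `reasoning_effort` field are assumptions, and the intent heuristic is deliberately naive.

```python
# Hypothetical sketch: routing reasoning effort per turn in a realtime
# voice session. The event shape and "reasoning_effort" field are
# assumptions, not a documented API.

TRANSACTIONAL_HINTS = {"balance", "hours", "status", "confirm", "cancel"}

def pick_effort(utterance: str) -> str:
    """Cheap heuristic: short transactional asks get minimal effort,
    everything else escalates to high."""
    words = utterance.lower().split()
    if len(words) <= 8 and TRANSACTIONAL_HINTS & set(words):
        return "minimal"
    return "high"

def session_update(utterance: str) -> dict:
    """Build a session.update-style event (shape is assumed)."""
    return {
        "type": "session.update",
        "session": {"reasoning_effort": pick_effort(utterance)},
    }
```

In a real deployment the routing decision would come from the model or a lightweight classifier, not a keyword list; the point is that the knob can now be turned mid-session.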
Pair that with a context window that jumped from 32K to 128K tokens — a 4x expansion — and the implications compound quickly. Previous voice models forgot the beginning of a conversation before it concluded. GPT-Realtime-2 can hold the context of an entire customer journey: account history, prior interactions, mid-call decisions, and real-time tool results. That is not a voice assistant. That is a voice reasoning engine.
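A back-of-envelope calculation shows why the expansion compounds. The tokens-per-minute figure below is an assumption, a rough rate for transcribed dialog plus tool results, not a measured value:

```python
# Back-of-envelope sketch: how much conversation fits in a context
# window. tokens_per_minute is an assumed rough rate for transcribed
# speech plus tool output, not a benchmark figure.

def minutes_of_context(window_tokens: int, tokens_per_minute: int = 400) -> float:
    """Approximate minutes of conversation a window can hold."""
    return window_tokens / tokens_per_minute

# Under the same assumed rate:
# minutes_of_context(32_000)  -> 80.0 minutes
# minutes_of_context(128_000) -> 320.0 minutes
```

Whatever the true rate, the 4x ratio holds: a window that once covered a single call can now span a multi-call customer journey.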
The model also introduces audible transparency during agentic tasks — phrases like "checking your calendar" or "let me verify that" — signaling active tool use rather than unexplained silence. This is a small UX detail with large trust implications in production environments.
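The pattern is simple enough to sketch: announce, then act. Everything here is illustrative; `speak` is a stand-in for a TTS stream, and `check_calendar` is a hypothetical tool.

```python
# Hypothetical sketch of "audible transparency": before a tool runs,
# the agent speaks a short status phrase instead of going silent.
# speak() just collects text here; in production it would feed TTS.

spoken = []

def speak(text: str) -> None:
    spoken.append(text)  # stand-in for a TTS output stream

def with_status(phrase: str):
    """Decorator: announce the phrase, then run the tool."""
    def wrap(tool):
        def inner(*args, **kwargs):
            speak(phrase)
            return tool(*args, **kwargs)
        return inner
    return wrap

@with_status("Checking your calendar...")
def check_calendar(day: str) -> list:
    # Placeholder tool body; a real agent would hit a calendar API.
    return [] if day == "Saturday" else ["9:00 standup"]
```

The caller hears "Checking your calendar..." the moment the tool is invoked, which is the difference between an agent that feels busy and one that feels broken.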
The Benchmark Landscape
So maybe this is less about Voice Wars and more about Benchmark Wars? Why so many? Each has its own theme and peculiar focus. We learn something from each. Here are the benchmarks we keep our eyes on:
| Benchmark Name | Organization / Runner | What It Focuses On (Voice Technologies) |
|---|---|---|
| Artificial Analysis Speech Arena & Leaderboards | Artificial Analysis | Blind human preference Elo ratings for TTS naturalness, prosody, and quality; AA-WER + speed/pricing for STT; Speech Reasoning (Big Bench Audio), conversational dynamics, agentic performance, and latency for native Speech-to-Speech models. |
| Arena.ai Leaderboards | Arena.ai | Crowdsourced human preference battles via interactive chats and votes; real-world user judgments across text, vision, and multimodal, with growing coverage of voice and speech capabilities in frontier models. |
| Scale AI Voice Showdown | Scale AI | First large-scale preference arena built on real human speech (29M+ prompts from 300K+ global users); blind comparisons of voice AI in extended real conversations across 60+ languages, accents, and noisy environments. |
| Hugging Face TTS Arena (V2) | Hugging Face / TTS-AGI community | Crowdsourced blind side-by-side listening tests; ranks TTS models purely on perceived naturalness, intelligibility, and overall speech quality. |
| aiewf-eval (Voice Agent Benchmark) | Daily.co (with Ultravox, Coval, others) | LLM and S2S performance for real voice agents: end-to-end latency, tool calling accuracy, instruction following, and multi-turn conversational robustness in realistic agent scenarios. |
| Coval.ai Voice Benchmarks | Coval.ai | Independent production-focused testing of TTS providers and full voice platforms; emphasizes real-world metrics like end-to-end latency, voice consistency, scalability, and conversational performance in live agents. |
What Winning Looks Like — For Now
The leaderboard this week has no single winner. Stepfun leads on raw speech reasoning. xAI leads on conversational dynamics with its Grok Voice Think Fast model, deployed at production scale on the Starlink support line. OpenAI leads on the combination of reasoning range, context depth, and conversational naturalness. Google and Alibaba are not far behind.
This is what a mature competitive market looks like before consolidation. Every major player has a credible model. The differentiation is shifting from capability to deployment — latency management, interruption handling, signal quality, and the infrastructure layer that sits between raw audio and the LLM.
Companies like Krisp (VIVA 2.0) and Deepgram are addressing exactly this gap. Turn prediction, interruption detection, voice isolation — these components operate before the model sees a single token, and their quality determines whether a production deployment sounds human or robotic.
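A toy endpointer makes the "before the model" stage concrete. Production systems like those named above use learned turn-prediction models; this energy-threshold sketch only shows where the decision sits in the pipeline.

```python
# Minimal sketch of pre-model signal processing: an energy-based
# endpointer that declares the caller's turn over after N consecutive
# low-energy frames. Real systems use learned models; this only
# illustrates the stage that runs before any token reaches the LLM.

def turn_ended(frames, threshold=0.02, silence_frames=5):
    """frames: per-frame RMS energies. Returns True once a run of
    silence_frames consecutive quiet frames is observed."""
    quiet = 0
    for energy in frames:
        quiet = quiet + 1 if energy < threshold else 0
        if quiet >= silence_frames:
            return True
    return False
```

Get this stage wrong and the model either interrupts the caller or leaves a dead-air gap, no matter how strong its reasoning scores are.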
What Comes Next
The convergence of near-real-time response (under 1.2 seconds), human-level reasoning at 96%+, 128K context, and increasingly intelligent signal processing removes the last architectural excuses. The applications that are now buildable — voice-first sales agents, healthcare intake systems, financial advisory conversations, complex multi-step customer journeys handled end-to-end — were not practical eighteen months ago. They are practical today.
The infrastructure is ready. The models are ready. The remaining variable is execution.
Stay tuned.
At Strategic Machines, our agents already operate at the intersection of voice, data, and action. We are extending that foundation with visual context layers — integrating image retrieval, real-time rendering, and multimodal response directly into the conversational stack.
We are deploying agents across complex, high-value use cases — scheduling, sales, and product selection — where context and execution matter. We invite you to try our test agents. Request a one-time password, select an agent from the interface, and experience the interaction firsthand.
We are combining design, infrastructure, and business logic to build the next generation of conversational commerce.
SOURCES AND REFERENCES
Voice Wars, Part 1 — Strategic Machines
Voice Wars, Part 2 — Strategic Machines
GPT-Realtime-2 — Artificial Analysis Benchmark Results
xAI Voice API — Grok Voice Think Fast 1.0
Krisp VIVA 2.0 — AI Infrastructure for Voice Agents
The Definitive Guide to Voice AI Agents — Deepgram
What Is the AI Agent Loop — Oracle Developers