Voice Wars: Redux
By Strategic Machines
Revisiting the Benchmarks
We didn't expect to be back so quickly after our last post. But here we are, with an update after OpenAI's latest announcement. This just underscores how fast this corner of the technology market is moving.
You may recall that a week ago we published Voice Wars, which looked at the benchmarks for top voice technologies. There are a lot of benchmarks being published these days, but we have found the data from Artificial Analysis to be among the most consistent and reliable tests of voice technologies.
Speech Reasoning (Big Bench Audio)
Speech reasoning scores are from the Artificial Analysis Big Bench Audio dataset. Higher is better.
| Rank | Model | Company | Score Last Week | Score This Week |
|---|---|---|---|---|
| 1 | Step-Audio R1.1 (Realtime) | Stepfun | 97.0% | 97.6% |
| 2 | Grok Voice Think Fast 1.0 | xAI | — | 97.1% |
| 3 | GPT-Realtime-2 (High) | OpenAI | — | 96.6% |
| 4 | Gemini 3.1 Flash Live Preview - High | Google | — | 96.6% |
| 5 | Grok Voice Agent | xAI | 92.9% | 93.3% |
| 6 | Gemini 2.5 Flash Native Audio Dialog Thinking | Google | 90.7% | 90.7% |
| 7 | Nova 2.0 Sonic (Mar 2026) | Amazon (AWS) | 86.6% | 88.1% |
| 8 | GPT Realtime | OpenAI | 83.3% | 83.3% |
| 9 | GPT-Realtime-1.5 | OpenAI | 81.1% | 81.4% |
| 10 | Qwen3.5 Omni Plus Realtime | Alibaba | — | 73.0% |
| 11 | GPT-Realtime-2 (Minimal) | OpenAI | — | 71.8% |
| 12 | Gemini 3.1 Flash Live Preview - Minimal | Google | — | 71.3% |
| 13 | GPT-4o mini Realtime (Dec 2024) | OpenAI | 68.9% | 68.9% |
| 14 | Gemini 2.5 Flash Native Audio Dialog | Google | 68.6% | 68.6% |
| 15 | GPT Realtime Mini (Oct 2025) | OpenAI | 62.3% | 63.6% |
| 16 | Qwen3.5 Omni Flash Realtime | Alibaba | — | 59.0% |
| 17 | Qwen3 Omni Flash | Alibaba | 58.1% | 58.7% |
| 18 | Qwen3 Omni Realtime | Alibaba | 56.8% | 56.8% |
| 19 | GPT-4o audio chatcompletions | OpenAI | 54.3% | 54.3% |
| 20 | Nemotron Voicechat | NVIDIA | 38.8% | 38.8% |
| 21 | Freeze-Omni | VITA-MLLM (research) | 31.7% | 33.4% |
| 22 | PersonalPlex | NVIDIA | 19.1% | 19.1% |
| 23 | F.L.M-Audio | CofeAI | 16.0% | 16.0% |
| 24 | Moshi | Kyutai | 4.3% | 4.4% |
The OpenAI Move
The headline this week is GPT-Realtime-2, OpenAI's new flagship native Speech-to-Speech model. The numbers are significant: 96.6% on Big Bench Audio, roughly 13 points above the previous-generation GPT Realtime, and the #1 position on Artificial Analysis's Conversational Dynamics benchmark at 96.1%. These are not incremental gains. A 13-point improvement in a single release cycle signals that the underlying reasoning architecture has changed materially, not merely been tuned.
But the more consequential development is not the score. It is the architecture decision behind it.
The Insight the Benchmarks Don't Show You
GPT-Realtime-2 introduces adjustable reasoning effort — minimal, low, medium, high, and xHigh — applied at the voice layer. This matters more than the headline number. It means a developer can now deploy a single model that runs at 1.12 seconds Time to First Audio on transactional exchanges and scales up to deep reasoning on complex decisions, all within the same session. That has never been available in a voice model before.
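The per-turn effort switch can be sketched in code. This is a hypothetical illustration, not a documented API: the `session.update` event shape and the `reasoning_effort` field are assumptions, and the intent heuristic is deliberately naive.

```python
# Hypothetical sketch: routing reasoning effort per turn in a realtime
# voice session. The event shape and "reasoning_effort" field are
# assumptions, not a documented API.

TRANSACTIONAL_HINTS = {"balance", "hours", "status", "confirm", "cancel"}

def pick_effort(utterance: str) -> str:
    """Cheap heuristic: short transactional asks get minimal effort,
    everything else escalates to high."""
    words = utterance.lower().split()
    if len(words) <= 8 and TRANSACTIONAL_HINTS & set(words):
        return "minimal"
    return "high"

def session_update(utterance: str) -> dict:
    """Build a session.update-style event (shape is assumed)."""
    return {
        "type": "session.update",
        "session": {"reasoning_effort": pick_effort(utterance)},
    }
```

In a real deployment the routing decision would come from the model or a lightweight classifier, not a keyword list; the point is that the knob can now be turned mid-session.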
Pair that with a context window that jumped from 32K to 128K tokens — a 4x expansion — and the implications compound quickly. Previous voice models forgot the beginning of a conversation before it concluded. GPT-Realtime-2 can hold the context of an entire customer journey: account history, prior interactions, mid-call decisions, and real-time tool results. That is not a voice assistant. That is a voice reasoning engine.
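A back-of-envelope calculation shows why the expansion compounds. The tokens-per-minute figure below is an assumption, a rough rate for transcribed dialog plus tool results, not a measured value:

```python
# Back-of-envelope sketch: how much conversation fits in a context
# window. tokens_per_minute is an assumed rough rate for transcribed
# speech plus tool output, not a benchmark figure.

def minutes_of_context(window_tokens: int, tokens_per_minute: int = 400) -> float:
    """Approximate minutes of conversation a window can hold."""
    return window_tokens / tokens_per_minute

# Under the same assumed rate:
# minutes_of_context(32_000)  -> 80.0 minutes
# minutes_of_context(128_000) -> 320.0 minutes
```

Whatever the true rate, the 4x ratio holds: a window that once covered a single call can now span a multi-call customer journey.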
The model also introduces audible transparency during agentic tasks — phrases like "checking your calendar" or "let me verify that" — signaling active tool use rather than unexplained silence. This is a small UX detail with large trust implications in production environments.
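The pattern is simple enough to sketch: announce, then act. Everything here is illustrative; `speak` is a stand-in for a TTS stream, and `check_calendar` is a hypothetical tool.

```python
# Hypothetical sketch of "audible transparency": before a tool runs,
# the agent speaks a short status phrase instead of going silent.
# speak() just collects text here; in production it would feed TTS.

spoken = []

def speak(text: str) -> None:
    spoken.append(text)  # stand-in for a TTS output stream

def with_status(phrase: str):
    """Decorator: announce the phrase, then run the tool."""
    def wrap(tool):
        def inner(*args, **kwargs):
            speak(phrase)
            return tool(*args, **kwargs)
        return inner
    return wrap

@with_status("Checking your calendar...")
def check_calendar(day: str) -> list:
    # Placeholder tool body; a real agent would hit a calendar API.
    return [] if day == "Saturday" else ["9:00 standup"]
```

The caller hears "Checking your calendar..." the moment the tool is invoked, which is the difference between an agent that feels busy and one that feels broken.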
The Benchmark Landscape
So maybe this is less about Voice Wars and more about Benchmark Wars? Why so many? Each has its own theme and peculiar focus. We learn something from each. Here are the benchmarks we keep our eyes on:
| Benchmark Name | Organization / Runner | What It Focuses On (Voice Technologies) |
|---|---|---|
| Artificial Analysis Speech Arena & Leaderboards | Artificial Analysis | Blind human preference Elo ratings for TTS naturalness, prosody, and quality; AA-WER + speed/pricing for STT; Speech Reasoning (Big Bench Audio), conversational dynamics, agentic performance, and latency for native Speech-to-Speech models. |
| Arena.ai Leaderboards | Arena.ai | Crowdsourced human preference battles via interactive chats and votes; real-world user judgments across text, vision, and multimodal, with growing coverage of voice and speech capabilities in frontier models. |
| Scale AI Voice Showdown | Scale AI | First large-scale preference arena built on real human speech (29M+ prompts from 300K+ global users); blind comparisons of voice AI in extended real conversations across 60+ languages, accents, and noisy environments. |
| Hugging Face TTS Arena (V2) | Hugging Face / TTS-AGI community | Crowdsourced blind side-by-side listening tests; ranks TTS models purely on perceived naturalness, intelligibility, and overall speech quality. |
| aiewf-eval (Voice Agent Benchmark) | Daily.co (with Ultravox, Coval, others) | LLM and S2S performance for real voice agents: end-to-end latency, tool calling accuracy, instruction following, and multi-turn conversational robustness in realistic agent scenarios. |
| Coval.ai Voice Benchmarks | Coval.ai | Independent production-focused testing of TTS providers and full voice platforms; emphasizes real-world metrics like end-to-end latency, voice consistency, scalability, and conversational performance in live agents. |
What Winning Looks Like — For Now
The leaderboard this week has no single winner. Stepfun leads on raw speech reasoning. xAI leads on conversational dynamics with its Grok Voice Think Fast model, deployed at production scale on the Starlink support line. OpenAI leads on the combination of reasoning range, context depth, and conversational naturalness. Google and Alibaba are not far behind.
This is what a mature competitive market looks like before consolidation. Every major player has a credible model. The differentiation is shifting from capability to deployment — latency management, interruption handling, signal quality, and the infrastructure layer that sits between raw audio and the LLM.
Companies like Krisp (VIVA 2.0) and Deepgram are addressing exactly this gap. Turn prediction, interruption detection, voice isolation — these components operate before the model sees a single token, and their quality determines whether a production deployment sounds human or robotic.
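A toy endpointer makes the "before the model" stage concrete. Production systems like those named above use learned turn-prediction models; this energy-threshold sketch only shows where the decision sits in the pipeline.

```python
# Minimal sketch of pre-model signal processing: an energy-based
# endpointer that declares the caller's turn over after N consecutive
# low-energy frames. Real systems use learned models; this only
# illustrates the stage that runs before any token reaches the LLM.

def turn_ended(frames, threshold=0.02, silence_frames=5):
    """frames: per-frame RMS energies. Returns True once a run of
    silence_frames consecutive quiet frames is observed."""
    quiet = 0
    for energy in frames:
        quiet = quiet + 1 if energy < threshold else 0
        if quiet >= silence_frames:
            return True
    return False
```

Get this stage wrong and the model either interrupts the caller or leaves a dead-air gap, no matter how strong its reasoning scores are.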
What Comes Next
The convergence of near-real-time response (under 1.2 seconds), human-level reasoning at 96%+, 128K context, and increasingly intelligent signal processing removes the last architectural excuses. The applications that are now buildable — voice-first sales agents, healthcare intake systems, financial advisory conversations, complex multi-step customer journeys handled end-to-end — were not practical eighteen months ago. They are practical today.
The infrastructure is ready. The models are ready. The remaining variable is execution.
Stay tuned.
At Strategic Machines, our agents already operate at the intersection of voice, data, and action. We are extending that foundation with visual context layers — integrating image retrieval, real-time rendering, and multimodal response directly into the conversational stack.
We are deploying agents across complex, high-value use cases — scheduling, sales, and product selection — where context and execution matter. We invite you to try our test agents. Request a one-time password, select an agent from the interface, and experience the interaction firsthand.
We are combining design, infrastructure, and business logic to build the next generation of conversational commerce.
SOURCES AND REFERENCES
Voice Wars, Part 1 — Strategic Machines
Voice Wars, Part 2 — Strategic Machines
GPT-Realtime-2 — Artificial Analysis Benchmark Results
xAI Voice API — Grok Voice Think Fast 1.0
Krisp VIVA 2.0 — AI Infrastructure for Voice Agents
The Definitive Guide to Voice AI Agents — Deepgram
What Is the AI Agent Loop — Oracle Developers