Voice Wars
By Strategic Machines

How may I help you?
We have written about this fast-paced technology in prior posts here and here. The market is advancing so quickly that we felt compelled to return to the topic with an update. The bottom line: companies investing in voice agents are seeing real business value, and now is the time to explore options, prototype apps, and prepare your organization for change. Customers expect it.
The Most Natural Interface on Earth
Voice is how humans prefer to communicate. It is faster than typing, more natural than clicking, and far more engaging than a text box. For businesses, that translates directly: faster resolution times, higher satisfaction, and agents that can be deployed across customer service, scheduling, sales, and IT help desk — without retraining an entire workforce on new software.
The question was never whether voice AI would matter, but when: when would it become reliable and cost-effective enough to deploy at scale?
That moment has arrived.
Measuring What Actually Matters: τ-voice
Most voice demos are polished under ideal conditions. Real customer conversations are not. Customers interrupt. They speak with accents, background noise bleeds in from offices and cars, and conversations spiral off-script into follow-ups, corrections, and tangents.
τ-voice is the benchmark designed to test exactly that. Run by Sierra Research, it is built to measure performance in the complex, nuanced settings where companies actually operate. It runs every major voice model through hundreds of realistic, full-duplex (simultaneous two-way) scenarios and measures the PASS¹ score — the percentage of interactions the AI completes successfully without derailing, misunderstanding, or failing the task.
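PASS¹ is the simplest member of the pass^k family used in agentic benchmarks: pass^k estimates the probability that k independent runs of the same scenario all succeed, so higher k penalizes inconsistency, and k=1 reduces to the plain success rate. The exact τ-voice harness is not described in this post; the sketch below is an illustrative estimator only, with made-up scenario counts.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased estimator of pass^k for one scenario: the probability
    that k i.i.d. runs all succeed, given num_successes out of
    num_trials observed runs."""
    if num_successes < k:
        return 0.0
    return comb(num_successes, k) / comb(num_trials, k)

def leaderboard_score(per_scenario: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass^k across scenarios; with k=1 this is simply the
    fraction of successful runs, i.e. a PASS^1-style score."""
    return sum(pass_hat_k(n, s, k) for n, s in per_scenario) / len(per_scenario)

# Hypothetical example: 4 scenarios, each run 4 times,
# with 4, 3, 2, and 2 successful runs respectively.
runs = [(4, 4), (4, 3), (4, 2), (4, 2)]
print(leaderboard_score(runs, k=1))  # 0.6875
print(leaderboard_score(runs, k=2))  # lower: consistency is penalized
```

Note how the k=2 score drops for any scenario the model does not ace every time, which is why pass^k-style metrics separate agents that are reliably right from agents that are occasionally right.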
The results are not close:
- #1 Grok Voice Think Fast 1.0 (xAI): 67.3%
- #2 Gemini 3.1 Flash Live (Google): 43.8%
- #3 Grok Voice Fast 1.0 (xAI): 38.3%
- #4 GPT Realtime 1.5 (OpenAI): 35.3%
- #5 GPT Realtime 1.0 (OpenAI): 30.4%
A 67% success rate against sub-45% for the nearest competitor is not a modest lead — it is a structural advantage. For executives evaluating voice deployment, this is the chart that matters.
Measuring Intelligence: Big Bench Audio
Handling a conversation smoothly is table stakes. The harder question is whether the voice agent is actually thinking — or just sounding confident while getting things wrong.
Big Bench Audio, from Artificial Analysis, tests AI reasoning in native speech-to-speech mode. One thousand challenging logic and multi-step problems from the gold-standard Big Bench Hard dataset — spoken aloud, requiring the model to listen, reason, and answer entirely in voice.
| Rank | Model | Company | Score |
|---|---|---|---|
| 1 | Step-Audio D1.1 (Realtime) | Stepfun | 97.0% |
| 2 | Grok Voice Agent | xAI | 92.9% |
| 3 | Gemini 2.5 Flash Native Audio Dialog Thinking | Google | 90.7% |
| 4 | Nova 2.0 Sonic | Amazon (AWS) | 86.6% |
| 5 | GPT Realtime | OpenAI | 83.3% |
| 6 | GPT-Realtime-1.5 | OpenAI | 81.1% |
| 7 | GPT-4o mini Realtime (Dec 2024) | OpenAI | 68.9% |
| 8 | Gemini 2.5 Flash Native Audio Dialog | Google | 68.6% |
| 9 | GPT Realtime Mini (Oct 2025) | OpenAI | 62.3% |
| 10 | Qwen3 Omni Flash | Alibaba | 58.1% |
| 11 | Qwen3 Omni Realtime | Alibaba | 56.8% |
| 12 | GPT-4o audio chatcompletions | OpenAI | 54.3% |
| 13 | Nemotron Voicechat | NVIDIA | 38.8% |
| 14 | Freeze-Omni | VITA-MLLM (research) | 31.7% |
| 15 | PersonalPlex | NVIDIA | 19.1% |
| 16 | FLM-Audio | CofeAI | 16.0% |
| 17 | Moshi | Kyutai | 4.3% |
Every major technology company is on this leaderboard — Google, Amazon, OpenAI, Alibaba, NVIDIA, with a Chinese firm, Stepfun, at the top. This is not a niche experiment. It is a full-scale industry arms race.
The Proof Point: Starlink
Benchmarks establish potential. Production deployments establish reality.
xAI built Grok Voice to staff Starlink's customer support line at +1 (888) GO-STARLINK — a full production deployment handling phone sales and support across dozens of languages, managing complex multi-step workflows with no script to fall back on.
The 73.7% PASS¹ score on the hardest business scenario category — plan changes, billing disputes, technical troubleshooting — is the number to hold onto. Those are exactly the conversations that break lesser systems. Grok handles them autonomously, using 28 distinct tools across hundreds of support and sales workflows. The result: 70% of customer support inquiries resolved with no human in the loop. A 20% conversion rate on inbound sales calls.
That is not a demo. That is a new operating model.
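The reported 70% containment rate translates directly into staffing math. A rough back-of-envelope sketch, using the containment figure from the deployment above but an assumed placeholder for per-call human handling cost (not a published number):

```python
def support_impact(monthly_calls: int, containment: float = 0.70,
                   human_cost_per_call: float = 6.0) -> dict:
    """Illustrative estimate: split a monthly call volume by the
    containment rate and price the avoided human handling.
    human_cost_per_call is a placeholder assumption."""
    autonomous = int(monthly_calls * containment)
    escalated = monthly_calls - autonomous
    return {
        "resolved_autonomously": autonomous,
        "escalated_to_humans": escalated,
        "human_handling_cost_avoided": autonomous * human_cost_per_call,
    }

print(support_impact(10_000))
# At 10,000 monthly calls: 7,000 resolved with no human in the loop,
# 3,000 escalated.
```

Swap in your own volumes and fully loaded per-call cost; the point is that containment, not raw accuracy, is the variable that drives the operating-model change.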
Two Benchmarks, One Conclusion
Intelligence without conversational reliability is a research project. Fluency without intelligence is a liability. The combination — an agent that thinks clearly and handles real-world noise — is what enterprise deployment demands. Together, τ-voice and Big Bench Audio confirm that the threshold has been crossed.
The race is live. The leaders are pulling ahead. In Part 2, we examine the emerging infrastructure layer — Retell AI, Vapi, and Microsoft VibeVoice — and the concrete steps executives should be taking right now.
SOURCES AND REFERENCES
τ-voice Benchmark Leaderboard — Artificial Analysis
Big Bench Audio Benchmark — Artificial Analysis
xAI Grok Voice Think Fast 1.0 Announcement
Starlink Customer Support via Grok Voice — xAI