Show and Tell

By Strategic Machines
When a Picture Says a Thousand Words

In our Voice series, we explored the design principles behind world-class voice agents — declarative orchestration, context engineering, and the idea that Things Speak. The response was tremendous. You asked great questions. And one kept surfacing: what about visuals?

Fair point. Voice is powerful. But the human brain processes images far faster than it reads text. When a customer asks "what sandals would go with my bridesmaid dress in Hawaii?" — a description, however eloquent, is not enough. Show her the sandals.

And who would ever buy a bike without seeing a picture?

That's the frontier we're crossing now: agents that speak and show.

The $5 Problem

Bret Taylor, co-founder and CEO of Sierra, put it plainly at the WSJ CIO Network:

"The average price of a phone call to a typical consumer brand will be anywhere between 5 and 20 USD, mainly the labor cost of answering the phone. AI being able to pick up the phone for one or even two orders of magnitude less money really changes the game."

The cost argument is compelling on its own. But the experience argument is where it gets interesting. Taylor described one of Sierra's first retail deployments — a shoe company handling post-purchase interactions. A customer asked which sandals would pair with her bridesmaid dress. The agent, designed purely for returns and exchanges, had no answer.

"It wasn't something that we contemplated in the design."

That moment captures the gap. Voice unlocks free-form conversation. Customers stop clicking menus and start talking. And when they talk, they go places you didn't anticipate. The solution isn't to constrain the agent — it's to arm it with the ability to show.

Show, Don't Just Tell

Imagine the same interaction — but this time, as the customer describes her dress, the agent surfaces three options. A photo pushed to her phone. Or rendered directly on the web interface she's already on. She picks one. Done. No return. No exchange. No $15 labor cost.

This is the power of multimodal agents — systems that combine voice fluency with visual rendering. The technology is here. OpenAI's GPT-4o processes audio, vision, and text in a single model pass. Platforms like Tavus and HeyGen are embedding real-time visual context into agent conversations. The stack is assembling fast.
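One minimal sketch of that pattern, with every name and URL invented for illustration: the agent's reply carries both a spoken string and a list of image URLs, so the client can render pictures on screen while the text-to-speech plays.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTurn:
    speech: str                                       # text to synthesize for the voice channel
    images: list[str] = field(default_factory=list)   # URLs for the visual channel

# Hypothetical catalog keyed by simple attribute tags.
CATALOG = {
    ("sandal", "coral"): [
        "https://example.com/img/sandal-coral-1.jpg",
        "https://example.com/img/sandal-coral-2.jpg",
    ],
}

def answer(query_attrs: tuple[str, str]) -> AgentTurn:
    """Pair a spoken reply with images retrieved for the same intent."""
    images = CATALOG.get(query_attrs, [])
    if images:
        speech = f"Here are {len(images)} options that match. Take a look."
    else:
        speech = "I couldn't find a match, but let me describe some options."
    return AgentTurn(speech=speech, images=images)

turn = answer(("sandal", "coral"))
print(turn.speech)   # spoken via TTS
print(turn.images)   # rendered on the customer's screen
```

The design choice worth noting: voice and visuals travel in the same turn object, so the two channels can never drift out of sync.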

And the use cases go well beyond sandals:

  • Hotel booking: Don't describe the ocean-view suite. Show it. Room photos, floor plans, and amenity shots rendered mid-conversation eliminate hesitation and accelerate bookings.
  • Financial advisory: A voice agent walking a client through a portfolio rebalance becomes dramatically more effective when it can surface a chart showing historical performance — in real time.
  • Healthcare: An agent scheduling a procedure that can show facility photos, prep instructions, and post-care guides turns an anxious call into a confident appointment.

The pattern is consistent: visual context accelerates decisions and reduces friction.

From RPA to Real Intelligence

a16z frames the broader shift well. Robotic Process Automation (RPA) promised the "fully automated enterprise" for over a decade — and largely failed. The bots broke whenever a process changed. They required expensive consultants. They couldn't reason.

"Instead of hard-coding each deterministic step in a process, AI agents will be prompted with an end goal and empowered with the right tooling and context to take those actions on behalf of the company."

That tooling increasingly includes vision. An agent that can read a product image, extract attributes, match against inventory, and present options to a customer mid-voice-call is not a bot. It's a salesperson.
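That read-extract-match-present loop can be sketched in a few lines. Everything below is a stub — in particular, `extract_attributes` stands in for a real vision-model call — not any specific product's API.

```python
# Illustrative pipeline: image -> attributes -> inventory match -> presented options.
INVENTORY = [
    {"sku": "S-101", "type": "sandal", "color": "coral", "img": "https://example.com/s101.jpg"},
    {"sku": "S-102", "type": "sandal", "color": "navy",  "img": "https://example.com/s102.jpg"},
    {"sku": "B-201", "type": "bag",    "color": "coral", "img": "https://example.com/b201.jpg"},
]

def extract_attributes(image_bytes: bytes) -> dict:
    # In production this would be a call to a multimodal model that
    # describes the photo; here it is a fixed stand-in.
    return {"type": "sandal", "color": "coral"}

def match_inventory(attrs: dict) -> list[dict]:
    """Filter inventory to items matching the extracted attributes."""
    return [item for item in INVENTORY
            if item["type"] == attrs["type"] and item["color"] == attrs["color"]]

def present(image_bytes: bytes) -> list[str]:
    """Full loop: read the image, extract attributes, return matching product images."""
    return [m["img"] for m in match_inventory(extract_attributes(image_bytes))]

print(present(b"...dress photo..."))  # -> ['https://example.com/s101.jpg']
```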

The a16z estimate: over 8 million operations and information clerk roles in the U.S. alone represent the addressable labor being converted to intelligent automation. Add the $250 billion BPO market, and the scale of what's being unlocked becomes clear.

The Next Design Imperative

In Part 1, we introduced the concept that Things Speak — the idea that every entity in your business should have a voice. We're now extending that principle: every entity should also have a face.

A hotel room isn't just a record with a check-in/check-out schema. It has photos, a view, a story. A product isn't just an SKU. It has texture, color, context. When your voice agent can pull those visual assets and surface them in the moment of conversation — you've crossed from automation into experience.

"Your website gives customers a multiple-choice question. An agent gives them a conversation." — Bret Taylor, Sierra

Now imagine that conversation includes images generated or retrieved on the fly — matched to what the customer just described. That's not the future. That's being deployed today.

What We're Building

At Strategic Machines, our agents already operate at the intersection of voice, data, and action. We are actively extending that with visual context layers — integrating image retrieval, real-time rendering, and multimodal response into the conversational stack.

Let's talk about what show-and-tell could do for your customer experience.

SOURCES AND REFERENCES

Voice: Part 1 - Design Matters - Strategic Machines

Bret Taylor, Sierra AI - WSJ CIO Network Summit

RIP to RPA: The Rise of Intelligent Automation - a16z

Bureau of Labor Statistics - Operations and Information Clerks

OpenAI GPT-4o Multimodal Capabilities