AI Applied

Authors
    Strategic Machines
redux

Dealing With the Heat

You've seen the hysteria. Maybe lived it, if you've been part of an AI Implementation Team for your company. The board slides, the consultants, the breathless headlines about productivity multipliers and workforce transformation. The executives who arrived at the last offsite convinced that the company was already behind.

And yet.

Late last year we reported on the low penetration of practical AI in the enterprise. Given the extraordinary press — and yes, hype — that AI has garnered, more than a few of us were surprised by what the MIT data revealed. The implementation gap between what the technology could do and what organizations were actually doing with it was vast. And mostly quiet.

We wanted to revisit this topic. Because while the penetration remains low, the activity is not.


A Better Yardstick

You might be familiar with GDPval from OpenAI. If not, we recommend you bookmark this benchmark.

GDPval has been in place less than a year. It is designed to measure how well AI models perform on real-world, economically valuable knowledge-work tasks drawn from 44 occupations across the top 9 U.S. GDP-contributing industries — finance, healthcare, professional services, and six others. Tasks are drawn from actual work products created by domain experts with an average of 14+ years of experience. The full benchmark covers 1,320 tasks. A gold open-sourced subset of 220 tasks — 5 per occupation — remains publicly available on Hugging Face.

The primary evaluation is refreshingly direct: blind pairwise human expert comparisons, model output versus real human expert deliverable, rated as better / as good as / worse.
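To make the headline metric concrete, here is a minimal sketch of how a "wins + ties" rate could be computed from pairwise verdicts. The ratings list is invented for illustration and the function name is hypothetical; this is not OpenAI's grading code.

```python
from collections import Counter

# Hypothetical verdicts for one model: how a blinded expert rated the model's
# deliverable against the human expert's. These values are made up.
ratings = ["better", "worse", "as good as", "worse", "better",
           "worse", "as good as", "worse", "better", "worse"]


def win_or_tie_rate(verdicts):
    """Share of comparisons where the model output was rated better or as good as."""
    counts = Counter(verdicts)
    favorable = counts["better"] + counts["as good as"]
    return favorable / len(verdicts)


rate = win_or_tie_rate(ratings)
print(f"wins + ties: {rate:.1%}")  # wins + ties: 50.0%
```

Counting ties as favorable matters for interpretation: a rate near 50% on this measure means the model is at or above expert quality on about half the tasks, not that it beats the expert half the time.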

This is the right kind of measure for executives thinking about applied AI. Not raw benchmarks. Not esoteric studies about emergent reasoning. How well does the model perform on tasks that happen every day in a company, judged by the people who do them for a living?

Results from late last year showed frontier models approaching — but not yet fully matching — expert-level quality. Claude Opus 4.1 performed best overall, with roughly 47.6% wins + ties versus human experts. GPT-5 was strong on accuracy and domain knowledge. And the trajectory was striking: performance more than tripled from GPT-4o in spring 2024 to GPT-5 in summer 2025, with roughly linear improvement over time.

While OpenAI has since closed the public leaderboard, the concept remains the right frame for thinking about applied AI. When you can see which occupations a model is winning at, and at what percentage, you are looking at a practical heat map. No fire. Not hype. Not benchmarks on benchmarks. A map of where the technology is ready to do real work.


Missing From the Leaderboard

The models are getting better. That part is confirmed. But knowing that Claude wins or ties roughly 47% of tasks against human experts tells you almost nothing about whether you should deploy it, where, or how.

Kimberly Tan, a partner at a16z, captures this precisely in her recent analysis of enterprise AI adoption:

"Jobs are ill-defined and long-tailed by nature, making them extremely difficult to fully automate. And today it's unclear how much value enterprises can get out of partial automation — if AI can do only 50 percent of a human's tasks, the importance of the non-automatable tasks likely goes up since they become the bottlenecks, increasing their relative value."

This is the trap that has killed more AI initiatives than budget constraints ever did: the pursuit of full automation in a domain where 50% automation creates no operational relief, only new bottlenecks. The bar is set too high, the scope too broad, and the initiative dies from its own ambition before it demonstrates any value at all.

The a16z bottom line: AI adoption is difficult, and finding the right natural fits requires creative thinking. Look around at what other companies are doing with applied AI, even cross-sector, before looking in at your own operations. And be surgical with objectives: too aggressive, and you kill off what could have been a quietly productive, long-running advantage; too modest, and you can't attract the sponsors or talent to make it work.


What AI Actually Is

Before we get to the heat map, it helps to understand what you're working with at a fundamental level.

John West, in his most recent article, described it as clearly as anyone has:

"If you peer into the mind of a model, what you find won't be recognizably human; it's really a thicket of statistics, producing words by splitting language into long sequences of vectors. You can think of a vector as a point on a graph... Instead of two dimensions, an LLM is turning words into vectors with many hundreds of dimensions — more than it's possible to visualize. Plotting words in a vector space makes it possible for an LLM to detect the connections among them... With enough data, an LLM can learn these relationships, so that given any word, it can predict what the next word should be, and the one after that."

Statistical prediction at enormous scale. That's the engine. Not intelligence in any human sense — but a pattern-completion machine of staggering breadth, trained on more text than any human could read in a thousand lifetimes.
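The idea of words as points in a vector space can be sketched with toy numbers. The three-dimensional vectors below are invented for illustration; real models learn vectors with hundreds of dimensions from data, and the helper names here are hypothetical.

```python
import math

# Toy 3-dimensional vectors, made up for illustration only.
vectors = {
    "hotel":   [0.9, 0.8, 0.1],
    "resort":  [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means closely related."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(nearest("hotel"))  # resort
```

Nothing in this sketch knows what a hotel is. "Resort" comes out closest only because its numbers point in nearly the same direction, which is the sense in which the engine is statistical rather than human.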

This matters for executive decision-making because it tells you where these systems fail: where context is thin, where data is proprietary, where the task requires judgment shaped by institutional memory that was never written down. An LLM has absorbed a vast share of what has been published. It knows nothing about your company, your customers, or your processes unless you build that in.

Which is, precisely, why the companies generating durable wins from AI are not the ones who plugged in a generic model. They are the ones who built institutional context into the system.


What We've Learned Building One

We have spent the past several months building a hospitality intelligence platform — a voice AI concierge for luxury properties. Not because voice is the interesting part. Because the guest is.

The insight that changed everything was this: the durable advantage in an AI agent is never the voice pipe. It is the memory.

Every major voice provider — OpenAI Realtime, Twilio, Vapi, ElevenLabs — is a commodity. Any of them can be swapped in a day. What cannot be swapped is six months of a hotel's best concierge conversations, extracted, structured, and loaded into the AI as living institutional memory. What cannot be replaced is a guest profile that knows a returning visitor prefers a high floor, travels with a dog, ordered the same bottle of Barolo on their last two stays, and had a billing issue that was resolved. The AI surfaces all of that in under 50 milliseconds — before it says a word.

That is not automation. That is intelligence. And the gap between the two is where AI initiatives succeed or fail. Given that memory, AI can synthesize the data better than even the best human concierge.

The lesson for any executive thinking about AI deployment is the same: figure out what your organization knows that no model could ever be trained on, then build the system that surfaces it. The model is the interface. Your data is the product.
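The split between a swappable voice layer and a durable memory layer might be sketched as follows. Every name here (`GuestProfile`, `VoiceProvider`, `EchoProvider`, `Concierge`) is hypothetical and chosen for illustration; this is an assumption about the general shape of such a design, not the platform's actual code.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class GuestProfile:
    """Institutional memory about one guest (illustrative fields only)."""
    guest_id: str
    preferences: list[str] = field(default_factory=list)
    past_orders: list[str] = field(default_factory=list)


class VoiceProvider(Protocol):
    """Anything that can speak text; the vendor behind it is interchangeable."""
    def speak(self, text: str) -> str: ...


class EchoProvider:
    """Stand-in for a real voice vendor; returns the text it would speak."""
    def speak(self, text: str) -> str:
        return text


class Concierge:
    def __init__(self, provider: VoiceProvider, memory: dict[str, GuestProfile]):
        self.provider = provider
        self.memory = memory

    def greet(self, guest_id: str) -> str:
        # Look up what is known about the guest before saying anything.
        profile = self.memory.get(guest_id)
        if profile is None:
            return self.provider.speak("Welcome. How may I help you?")
        context = "; ".join(profile.preferences + profile.past_orders)
        return self.provider.speak(f"Welcome back. On file: {context}.")


memory = {
    "g-001": GuestProfile(
        "g-001",
        preferences=["high floor", "travels with a dog"],
        past_orders=["Barolo"],
    )
}
concierge = Concierge(EchoProvider(), memory)
print(concierge.greet("g-001"))
```

Replacing `EchoProvider` with another class that has a `speak` method changes nothing else in the sketch. The `memory` dictionary is the part that cannot be swapped out.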


Look Outside First

The most practically useful thing you can do before your next internal AI workshop is spend an afternoon looking cross-sector at where AI is generating confirmed, quantified wins. Not press releases. Not case studies that a vendor wrote. Actual adoption patterns, by occupation, by industry.

The table below does exactly that. It draws from two GDPval snapshots — September 2025 and April 2026 — and shows where models are winning head-to-head against human experts, and by how much that win rate has moved. Read it as a prospecting map, not a report card.

What leaps out is not the highest absolute scores. It is the steepest improvements — the Real Estate Sales Agents (+24 points), the Industrial Engineers (+27 points), the Child, Family, and School Social Workers (+27 points), the Securities Sales Agents (+31 points). These are domains where the model's edge is accelerating fastest. If you operate in or adjacent to any of these sectors, you are watching a window close.

What the table cannot show you: the specific workflows, edge cases, institutional context, and process dependencies that determine whether a +24-point benchmark improvement translates into any operational value at all. That requires the context map — a static picture of your existing governance, workflows, rules, activities, tasks, people, process, and technology — before you decide where to apply AI selectively and strategically.

It might be as straightforward as capturing a product order. Scheduling an appointment. Or something more complex — like maintenance on a jet engine, or triaging an escalating guest complaint before it reaches the front desk.

With the map in hand, the benchmark table stops being abstract and starts being actionable.

A More Practical Map

The Models Are Improving Quickly But Unevenly

| Industry | Occupation | Wins in Sept 2025 (%) | Wins in April 2026 (%) | Improvement (points) |
|---|---|---|---|---|
| Real Estate and Rental and Leasing | Concierges | 29 | 31 | 2 |
| Real Estate and Rental and Leasing | Counter and Rental Clerks | 81 | 82 | 1 |
| Real Estate and Rental and Leasing | Property, Real Estate, and Community Association Managers | 34 | 44 | 10 |
| Real Estate and Rental and Leasing | Real Estate Brokers | 54 | 67 | 13 |
| Real Estate and Rental and Leasing | Real Estate Sales Agents | 36 | 60 | 24 |
| Manufacturing | Buyers and Purchasing Agents | 64 | 76 | 12 |
| Manufacturing | First-Line Supervisors of Production and Operating Workers | 58 | 58 | 0 |
| Manufacturing | Industrial Engineers | 17 | 44 | 27 |
| Manufacturing | Mechanical Engineers | 25 | 44 | 19 |
| Manufacturing | Shipping, Receiving, and Inventory Clerks | 76 | 78 | 2 |
| Professional, Scientific, and Technical Services | Accountants and Auditors | 24 | 42 | 18 |
| Professional, Scientific, and Technical Services | Computer and Information Systems Managers | 52 | 67 | 15 |
| Professional, Scientific, and Technical Services | Lawyers | 36 | 46 | -10 |
| Professional, Scientific, and Technical Services | Project Management Specialists | 42 | 64 | 22 |
| Professional, Scientific, and Technical Services | Software Developers | 70 | 82 | 12 |
| Government | Administrative Services Managers | 62 | 78 | 16 |
| Government | Child, Family, and School Social Workers | 42 | 69 | 27 |
| Government | Compliance Officers | 69 | 71 | 2 |
| Government | First-Line Supervisors of Police and Detectives | 49 | 76 | 27 |
| Government | Recreation Workers | 40 | 56 | 16 |
| Health Care and Social Assistance | First-Line Supervisors of Office and Administrative Support Workers | 41 | 38 | -3 |
| Health Care and Social Assistance | Medical Secretaries and Administrative Assistants | 44 | 62 | 18 |
| Health Care and Social Assistance | Medical and Health Services Managers | 65 | 38 | -27 |
| Health Care and Social Assistance | Nurse Practitioners | 56 | 67 | 11 |
| Health Care and Social Assistance | Registered Nurses | 37 | 51 | 14 |
| Finance and Insurance | Customer Service Representatives | 59 | 76 | 17 |
| Finance and Insurance | Financial Managers | 32 | 44 | 12 |
| Finance and Insurance | Financial and Investment Analysts | 41 | 44 | 3 |
| Finance and Insurance | Personal Financial Advisors | 64 | 62 | -2 |
| Finance and Insurance | Securities, Commodities, and Financial Services Sales Agents | 42 | 73 | 31 |
| Retail Trade | First-Line Supervisors of Retail Sales Workers | 59 | 69 | 10 |
| Retail Trade | General and Operations Managers | 67 | 62 | -5 |
| Retail Trade | Pharmacists | 26 | 38 | 12 |
| Retail Trade | Private Detectives and Investigators | 70 | 69 | -1 |
| Wholesale Trade | First-Line Supervisors of Non-Retail Sales Workers | 69 | 87 | 18 |
| Wholesale Trade | Order Clerks | 28 | 40 | 12 |
| Wholesale Trade | Sales Managers | 79 | 80 | 1 |
| Wholesale Trade | Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products | 66 | 56 | -10 |
| Wholesale Trade | Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products | 47 | 53 | 6 |
| Information | Audio and Video Technicians | 30 | 38 | 8 |
| Information | Editors | 75 | 93 | 18 |
| Information | Film and Video Editors | 17 | 33 | 16 |
| Information | News Analysts, Reporters, and Journalists | 53 | 60 | 7 |
| Information | Producers and Directors | 29 | 44 | 15 |

Source: GDPval snapshots, Sept 2025 (https://arxiv.org/abs/2510.04374) and April 2026 (https://evals.openai.com/gdpval/leaderboard). Assembled by a16z.
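Reading the table as a prospecting map mostly means sorting by movement rather than by absolute score. A minimal sketch, using a handful of rows copied from the table above (the function name is hypothetical):

```python
# (occupation, wins Sept 2025 %, wins April 2026 %), copied from the table above.
rows = [
    ("Real Estate Sales Agents", 36, 60),
    ("Industrial Engineers", 17, 44),
    ("Child, Family, and School Social Workers", 42, 69),
    ("Securities, Commodities, and Financial Services Sales Agents", 42, 73),
    ("Medical and Health Services Managers", 65, 38),
    ("Editors", 75, 93),
]


def rank_by_change(rows):
    """Compute the point change per occupation and sort from largest gain to largest drop."""
    changes = [(name, april - sept) for name, sept, april in rows]
    return sorted(changes, key=lambda item: item[1], reverse=True)


for name, change in rank_by_change(rows):
    print(f"{change:+4d}  {name}")
```

On these rows, Editors have the highest absolute score (93) but rank fifth of six by movement, while Securities Sales Agents lead at +31. The same sort also surfaces regressions such as Medical and Health Services Managers at -27, which are as relevant to a deployment decision as the gains.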


The Right Sequence

In our experience, most AI initiatives fail not because the technology is wrong, but because the sequence is wrong.

The pressure to produce practical economic benefits is real. Boards are asking. Investors are asking. The CFO has seen the productivity studies. And so teams move fast — often before they have done the one thing that matters most: understood the work deeply enough to know where AI actually fits.

We have learned that a sequence better tuned for practical change looks like this:

Look outside first. Spend real time with cross-sector adoption data or adoption teams. Understand where the technology is generating confirmed wins, and why. Find the pattern that maps to something in your own operations.

Map before you build. Create a context map of your existing workflows — every decision point, handoff, rule, and exception. This is where hidden AI opportunities live. Not in the obvious tasks. In the seams between them.

Scope surgically. Find the 30% of a process where AI can make a measurable difference, and build that. Do not try to automate the whole thing. We're not saying Pareto's law applies here, but it sure seems to have traction everywhere we've been.

Build the memory, not just the model. The institutional knowledge that lives in your people, your processes, and your historical data, once structured and surfaced, is what separates a durable AI advantage from a demo that never makes it to production. This is really the trickiest step in the entire sequence. It means collecting data that may have been unstructured historically, or never collected at all, and that may be the very data needed to create exceptional change in the enterprise.
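As a rough illustration of the "map before you build" step, a context map can start as nothing more than an ordered list of steps with owners and rules. The workflow below is invented, and treating a change of owner as a "seam" is an assumption made for this sketch, not a formal method.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    owner: str      # team or role responsible for the step
    rule: str = ""  # written or unwritten rule governing the step


# Hypothetical order-capture workflow, made up for illustration.
workflow = [
    Step("Receive order request", "Sales"),
    Step("Check credit terms", "Finance", rule="Orders above limit need approval"),
    Step("Confirm inventory", "Warehouse"),
    Step("Schedule delivery", "Warehouse"),
]


def find_seams(steps):
    """Return (from_step, to_step) pairs where ownership changes hands."""
    return [(a.name, b.name) for a, b in zip(steps, steps[1:]) if a.owner != b.owner]


for before, after in find_seams(workflow):
    print(f"handoff: {before} -> {after}")
```

Here the two handoffs, Sales to Finance and Finance to Warehouse, are the candidates worth a closer look. A real map would also carry exceptions and decision points, but even this much makes the seams visible.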

The models are ready. In some domains they are winning head-to-head against seasoned experts. The question is no longer whether AI can do the work. It is which work, in which context, built on which institutional foundation.

That answer lives inside your organization. And finding it requires going outside first.

Stay grounded. Stay creative. Stay moving.

No better way to keep the house from burning down.


At Strategic Machines, we build AI that carries institutional memory — not just model intelligence. Our hospitality platform treats guest history, property knowledge, and service context as the durable advantage, with voice and language model providers as interchangeable infrastructure underneath. The result is an AI concierge that knows a returning guest before it says a word.

We are deploying agents across high-value operational use cases — hospitality, scheduling, sales, and service — where context and execution are the product, not the model. We invite you to try our live agents. Request a one-time password, select an agent from the interface, and experience the difference that institutional memory makes.

Let's talk.

SOURCES AND REFERENCES

AI Value in the Enterprise — Strategic Machines (Sept 2025)

GDPval Leaderboard — OpenAI Evals

GDPval Research Paper — arXiv (Sept 2025)

GDPval Open Dataset — Hugging Face

Where Enterprises Are Actually Adopting AI — Kimberly Tan, a16z

I've Seen How AI Thinks. I Wish Everyone Could — John West, Wall Street Journal

Voice Wars: Redux — Strategic Machines