AI Applied

Dealing With the Heat

You've seen the hysteria. Maybe lived it, if you've been part of an AI Implementation Team for your company. The board slides, the consultants, the breathless headlines about productivity multipliers and workforce transformation. The executives who arrived at the last offsite convinced that the company was already behind.

And yet.

Late last year we reported on the low penetration of practical AI in the enterprise. Given the extraordinary press — and yes, hype — that AI has garnered, more than a few of us were surprised by what the MIT data revealed. The implementation gap between what the technology could do and what organizations were actually doing with it was vast. And mostly quiet.

We wanted to revisit this topic. Because while the penetration remains low, the activity is not.

A Better Yardstick

You might be familiar with GDPval from OpenAI. If not, we recommend you bookmark this benchmark.

GDPval has been in place less than a year. It is designed to measure how well AI models perform on real-world, economically valuable knowledge-work tasks drawn from 44 occupations across the top 9 U.S. GDP-contributing industries — finance, healthcare, professional services, and six others. Tasks are drawn from actual work products created by domain experts with an average of 14+ years of experience. The full benchmark covers 1,320 tasks. A gold open-sourced subset of 220 tasks — 5 per occupation — remains publicly available on Hugging Face.

The primary evaluation is refreshingly direct: blind pairwise human expert comparisons, model output versus real human expert deliverable, rated as better / as good as / worse.
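To make the scoring concrete, here is a minimal sketch of how a headline figure such as "wins + ties" could be tallied from blind pairwise ratings. The function name, the rating labels, and the sample ratings are hypothetical illustrations, not GDPval's actual grading code or data.

```python
from collections import Counter


def win_tie_rate(ratings):
    """Return (win_rate, win_or_tie_rate) for a list of expert ratings.

    Each rating is "better", "as_good", or "worse", describing the model's
    output relative to the human expert's deliverable.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    wins = counts["better"] / total
    wins_or_ties = (counts["better"] + counts["as_good"]) / total
    return wins, wins_or_ties


# Hypothetical ratings for ten tasks (illustrative only).
sample = ["better", "worse", "as_good", "better", "worse",
          "worse", "better", "as_good", "worse", "worse"]

wins, wins_or_ties = win_tie_rate(sample)
print(f"wins: {wins:.0%}, wins + ties: {wins_or_ties:.0%}")
```

The distinction matters when reading any leaderboard: a "wins + ties" figure counts every task where the model was judged at least as good as the expert, so it will always be at or above the pure win rate.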

This is the right kind of measure for executives thinking about applied AI. Not raw benchmarks. Not esoteric studies about emergent reasoning. How well does the model perform on tasks that happen every day in a company, judged by the people who do them for a living?

Results from late last year showed frontier models approaching — but not yet fully matching — expert-level quality. Claude Opus 4.1 performed best overall, with roughly 47.6% wins + ties versus human experts. GPT-5 was strong on accuracy and domain knowledge. And the trajectory was striking: performance more than tripled from GPT-4o in spring 2024 to GPT-5 in summer 2025, with roughly linear improvement over time.

While OpenAI has since closed the public leaderboard, the concept remains the right frame for thinking about applied AI. When you can see which occupations a model is winning at, and at what percentage, you are looking at a practical heat map. No fire. Not hype. Not benchmarks on benchmarks. A map of where the technology is ready to do real work.

Missing From the Leaderboard

The models are getting better. That part is confirmed. But knowing that Claude wins or ties roughly 47% of tasks against human experts tells you almost nothing about whether you should deploy it, where, or how.

Kimberly Tan, a partner at a16z, captures this precisely in her recent analysis of enterprise AI adoption:

"Jobs are ill-defined and long-tailed by nature, making them extremely difficult to fully automate. And today it's unclear how much value enterprises can get out of partial automation — if AI can do only 50 percent of a human's tasks, the importance of the non-automatable tasks likely goes up since they become the bottlenecks, increasing their relative value."

This is the trap that has killed more AI initiatives than budget constraints ever did: the pursuit of full automation in a domain where 50% automation creates no operational relief, only new bottlenecks. The bar is set too high, the scope too broad, and the initiative dies from its own ambition before it demonstrates any value at all.

The a16z bottom line: AI adoption is difficult, but finding the right natural fits requires creative thinking. Look around at what other companies are doing with applied AI — even cross-sector — before looking in at your own operations. And be surgical with objectives: too aggressive, and you kill off what could have been a quietly productive, long-running advantage; too modest, and you can't attract the sponsors or talent to make it work.

What AI Actually Is

Before we get to the heat map, it helps to understand what you're working with at a fundamental level.

John West, in his most recent article, described it as clearly as anyone has:

"If you peer into the mind of a model, what you find won't be recognizably human; it's really a thicket of statistics, producing words by splitting language into long sequences of vectors. You can think of a vector as a point on a graph... Instead of two dimensions, an LLM is turning words into vectors with many hundreds of dimensions — more than it's possible to visualize. Plotting words in a vector space makes it possible for an LLM to detect the connections among them... With enough data, an LLM can learn these relationships, so that given any word, it can predict what the next word should be, and the one after that."

Statistical prediction at enormous scale. That's the engine. Not intelligence in any human sense — but a pattern-completion machine of staggering breadth, trained on more text than any human could read in a thousand lifetimes.
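The "point on a graph" picture can be made concrete with a toy example. The sketch below uses made-up three-dimensional vectors and cosine similarity, one common way to measure how closely two vectors point in the same direction. The words and numbers are illustrative assumptions, not values from any actual model, which would learn vectors with hundreds of dimensions or more.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" chosen by hand for illustration.
toy_vectors = {
    "hotel": [0.9, 0.8, 0.1],
    "resort": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


print(most_similar("hotel", toy_vectors))  # resort
```

In this toy space "hotel" sits close to "resort" and far from "invoice". A real model learns such positions from data rather than having them assigned by hand.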

This matters for executive decision-making because it tells you where these systems fail: where context is thin, where data is proprietary, where the task requires judgment shaped by institutional memory that was never written down. An LLM has absorbed a vast share of what has been published. It knows nothing about your company, your customers, or your processes unless you build that in.

Which is, precisely, why the companies generating durable wins from AI are not the ones who plugged in a generic model. They are the ones who built institutional context into the system.

What We've Learned Building One

We have spent the past several months building a hospitality intelligence platform — a voice AI concierge for luxury properties. Not because voice is the interesting part. Because the guest is.

The insight that changed everything was this: the durable advantage in an AI agent is never the voice pipe. It is the memory.

Every major voice provider — OpenAI Realtime, Twilio, Vapi, ElevenLabs — is a commodity. Any of them can be swapped in a day. What cannot be swapped is six months of a hotel's best concierge conversations, extracted, structured, and loaded into the AI as living institutional memory. What cannot be replaced is a guest profile that knows a returning visitor prefers a high floor, travels with a dog, ordered the same bottle of Barolo on their last two stays, and had a billing issue that was resolved. The AI surfaces all of that in under 50 milliseconds — before it says a word.

That is not automation. That is intelligence. And the gap between the two is where AI initiatives succeed or fail. AI can synthesize data better than the best.

The lesson for any executive thinking about AI deployment is the same: figure out what your organization knows that no model could ever be trained on, then build the system that surfaces it. The model is the interface. Your data is the product.
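One way to picture "the model is the interface, your data is the product" is as a design in which the voice layer sits behind a small interface and the guest memory is a separate, durable store. The sketch below is a simplified illustration under that assumption. Every class, field, and identifier in it is hypothetical and does not describe our production system or any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VoiceProvider(Protocol):
    """Hypothetical interface: anything that can deliver text to the guest."""

    def speak(self, text: str) -> str: ...


class ConsoleVoice:
    """Stand-in provider; a real one would call a vendor service instead."""

    def speak(self, text: str) -> str:
        return f"[voice] {text}"


@dataclass
class GuestProfile:
    name: str
    preferences: list[str] = field(default_factory=list)


class Concierge:
    def __init__(self, voice: VoiceProvider, memory: dict[str, GuestProfile]):
        self.voice = voice    # interchangeable infrastructure
        self.memory = memory  # the durable asset

    def greet(self, guest_id: str) -> str:
        # Look up what is already known before saying anything.
        profile = self.memory.get(guest_id)
        if profile is None:
            return self.voice.speak("Welcome. How may I help you?")
        notes = "; ".join(profile.preferences) or "no preferences on file"
        return self.voice.speak(f"Welcome back, {profile.name}. Noted: {notes}.")


# Hypothetical guest record for illustration.
memory = {"g-001": GuestProfile("Ms. Rivera", ["high floor", "travels with a dog"])}
concierge = Concierge(ConsoleVoice(), memory)
print(concierge.greet("g-001"))
```

Swapping the voice vendor here means writing one new class with a `speak` method. Nothing about the guest memory changes, which is the point of keeping the two apart.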

Look Outside First

The most practically useful thing you can do before your next internal AI workshop is spend an afternoon looking cross-sector at where AI is generating confirmed, quantified wins. Not press releases. Not case studies that a vendor wrote. Actual adoption patterns, by occupation, by industry.

The table below does exactly that. It draws from two GDPval snapshots — September 2025 and April 2026 — and shows where models are winning head-to-head against human experts, and by how much that win rate has moved. Read it as a prospecting map, not a report card.

What leaps out is not the highest absolute scores but the steepest improvements: Real Estate Sales Agents (+24 points), Industrial Engineers (+27 points), Child, Family, and School Social Workers (+27 points), and Securities, Commodities, and Financial Services Sales Agents (+31 points). These are domains where the model's edge is accelerating fastest. If you operate in or adjacent to any of these sectors, you are watching a window close.

What the table cannot show you: the specific workflows, edge cases, institutional context, and process dependencies that determine whether a +24-point benchmark improvement translates into any operational value at all. That requires the context map — a static picture of your existing governance, workflows, rules, activities, tasks, people, process, and technology — before you decide where to apply AI selectively and strategically.

It might be as straightforward as capturing a product order. Scheduling an appointment. Or something more complex — like maintenance on a jet engine, or triaging an escalating guest complaint before it reaches the front desk.

With the map in hand, the benchmark table stops being abstract and starts being actionable.

A More Practical Map

The Models Are Improving Quickly But Unevenly

Industry	Occupation	Wins in Sept 2025 (%)	Wins in April 2026 (%)	Improvement (percentage points)
Real Estate and Rental and Leasing	Concierges	29	31	2
Real Estate and Rental and Leasing	Counter and Rental Clerks	81	82	1
Real Estate and Rental and Leasing	Property, Real Estate, and Community Association Managers	34	44	10
Real Estate and Rental and Leasing	Real Estate Brokers	54	67	13
Real Estate and Rental and Leasing	Real Estate Sales Agents	36	60	24
Manufacturing	Buyers and Purchasing Agents	64	76	12
Manufacturing	First-Line Supervisors of Production and Operating Workers	58	58	0
Manufacturing	Industrial Engineers	17	44	27
Manufacturing	Mechanical Engineers	25	44	19
Manufacturing	Shipping, Receiving, and Inventory Clerks	76	78	2
Professional, Scientific, and Technical Services	Accountants and Auditors	24	42	18
Professional, Scientific, and Technical Services	Computer and Information Systems Managers	52	67	15
Professional, Scientific, and Technical Services	Lawyers	36	46	10
Professional, Scientific, and Technical Services	Project Management Specialists	42	64	22
Professional, Scientific, and Technical Services	Software Developers	70	82	12
Government	Administrative Services Managers	62	78	16
Government	Child, Family, and School Social Workers	42	69	27
Government	Compliance Officers	69	71	2
Government	First-Line Supervisors of Police and Detectives	49	76	27
Government	Recreation Workers	40	56	16
Health Care and Social Assistance	First-Line Supervisors of Office and Administrative Support Workers	41	38	-3
Health Care and Social Assistance	Medical Secretaries and Administrative Assistants	44	62	18
Health Care and Social Assistance	Medical and Health Services Managers	65	38	-27
Health Care and Social Assistance	Nurse Practitioners	56	67	11
Health Care and Social Assistance	Registered Nurses	37	51	14
Finance and Insurance	Customer Service Representatives	59	76	17
Finance and Insurance	Financial Managers	32	44	12
Finance and Insurance	Financial and Investment Analysts	41	44	3
Finance and Insurance	Personal Financial Advisors	64	62	-2
Finance and Insurance	Securities, Commodities, and Financial Services Sales Agents	42	73	31
Retail Trade	First-Line Supervisors of Retail Sales Workers	59	69	10
Retail Trade	General and Operations Managers	67	62	-5
Retail Trade	Pharmacists	26	38	12
Retail Trade	Private Detectives and Investigators	70	69	-1
Wholesale Trade	First-Line Supervisors of Non-Retail Sales Workers	69	87	18
Wholesale Trade	Order Clerks	28	40	12
Wholesale Trade	Sales Managers	79	80	1
Wholesale Trade	Sales Representatives, Wholesale and Manufacturing, Except Technical and Scientific Products	66	56	-10
Wholesale Trade	Sales Representatives, Wholesale and Manufacturing, Technical and Scientific Products	47	53	6
Information	Audio and Video Technicians	30	38	8
Information	Editors	75	93	18
Information	Film and Video Editors	17	33	16
Information	News Analysts, Reporters, and Journalists	53	60	7
Information	Producers and Directors	29	44	15

Source: GDPval snapshots from Sept 2025 (https://arxiv.org/abs/2510.04374) and April 2026 (https://evals.openai.com/gdpval/leaderboard). Assembled by a16z.
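For readers who want to run their own cut of the table, the change column is simply the April 2026 win rate minus the September 2025 win rate, in percentage points. The sketch below recomputes and ranks that change for a handful of rows copied from the table above. The function name is our own, and the full table could be loaded the same way.

```python
# (occupation, wins in Sept 2025 %, wins in April 2026 %), copied from the table.
rows = [
    ("Real Estate Sales Agents", 36, 60),
    ("Industrial Engineers", 17, 44),
    ("Child, Family, and School Social Workers", 42, 69),
    ("Securities, Commodities, and Financial Services Sales Agents", 42, 73),
    ("Editors", 75, 93),
    ("Medical and Health Services Managers", 65, 38),
]


def rank_by_change(rows):
    """Return (occupation, change in percentage points), largest gain first."""
    changes = [(occupation, after - before) for occupation, before, after in rows]
    return sorted(changes, key=lambda item: item[1], reverse=True)


for occupation, change in rank_by_change(rows):
    print(f"{change:+4d}  {occupation}")
```

Sorting by change rather than by absolute win rate is what turns the table into a prospecting map: it puts the fastest-moving occupations at the top and the regressions at the bottom.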

The Right Sequence

In our experience, most AI initiatives fail not because the technology is wrong, but because the sequence is wrong.

The pressure to produce practical economic benefits is real. Boards are asking. Investors are asking. The CFO has seen the productivity studies. And so teams move fast — often before they have done the one thing that matters most: understood the work deeply enough to know where AI actually fits.

We have learned that a sequence better tuned for practical change looks like this:

Look outside first. Spend real time with cross-sector adoption data or adoption teams. Understand where the technology is generating confirmed wins, and why. Find the pattern that maps to something in your own operations.

Map before you build. Create a context map of your existing workflows — every decision point, handoff, rule, and exception. This is where hidden AI opportunities live. Not in the obvious tasks. In the seams between them.

Scope surgically. Find the 30% of a process where AI can make a measurable difference, and build that. Do not try to automate the whole thing. We're not saying Pareto's law applies here, but it sure seems to have traction everywhere we've been.

Build the memory, not just the model. The institutional knowledge that lives in your people, your processes, and your historical data, once structured and surfaced, is what separates a durable AI advantage from a demo that never makes it to production. This is the trickiest step in the entire sequence. It means collecting data that may have been unstructured historically, or never collected at all, and that may be the very data needed to create exceptional change in the enterprise.
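As a small illustration of the "map before you build" step, a context map can start as nothing more than an ordered list of workflow steps with an owner attached to each. The sketch below finds the seams, meaning the handoffs where ownership changes. The step names, owners, and the `find_seams` helper are hypothetical examples, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    owner: str                    # team or role that performs the step
    has_exceptions: bool = False  # True if the step has known edge cases


def find_seams(steps):
    """Return (from_step, to_step) pairs where work passes between owners."""
    return [
        (current.name, following.name)
        for current, following in zip(steps, steps[1:])
        if current.owner != following.owner
    ]


# Hypothetical order-handling workflow for illustration.
order_flow = [
    Step("capture order", "sales"),
    Step("check credit", "finance", has_exceptions=True),
    Step("approve terms", "finance"),
    Step("schedule delivery", "operations"),
]

for from_step, to_step in find_seams(order_flow):
    print(f"handoff: {from_step} -> {to_step}")
```

A real map would also carry the rules, decision points, and exceptions for each step, but even this much is enough to start asking where a handoff loses context.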

The models are ready. In some domains they are winning head-to-head against seasoned experts. The question is no longer whether AI can do the work. It is which work, in which context, built on which institutional foundation.

That answer lives inside your organization. And finding it requires going outside first.

Stay grounded. Stay creative. Stay moving.

No better way to keep the house from burning down.

At Strategic Machines, we build AI that carries institutional memory — not just model intelligence. Our hospitality platform treats guest history, property knowledge, and service context as the durable advantage, with voice and language model providers as interchangeable infrastructure underneath. The result is an AI concierge that knows a returning guest before it says a word.

We are deploying agents across high-value operational use cases — hospitality, scheduling, sales, and service — where context and execution are the product, not the model. We invite you to try our live agents. Request a one-time password, select an agent from the interface, and experience the difference that institutional memory makes.

Let's talk.

SOURCES AND REFERENCES

AI Value in the Enterprise — Strategic Machines (Sept 2025)

GDPval Leaderboard — OpenAI Evals

GDPval Research Paper — arXiv (Sept 2025)

GDPval Open Dataset — Hugging Face

Where Enterprises Are Actually Adopting AI — Kimberly Tan, a16z

I've Seen How AI Thinks. I Wish Everyone Could — John West, Wall Street Journal

Voice Wars: Redux — Strategic Machines