Is Google Dead?

Authors: Strategic Machines
Necessary Facts
We ran a small experiment this week, comparing results on simple searches between the Google engine and the ChatGPT engine. We asked both engines to tell us the year in which Mark Twain wrote Huckleberry Finn. The responses, which you can see below, are emblematic of a broader issue we see in knowledge management. One engine answers the question. The other opens the floodgates.
First, Google:
Then, ChatGPT (v4):
Who has time for a flood?
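If you want to reproduce the ChatGPT half of this experiment, here is a minimal sketch using the OpenAI Python client. The model name and prompt wording are our own choices for illustration, not the exact settings behind the screenshot above:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # any chat-capable model works for this comparison
    messages=[
        {
            "role": "user",
            "content": "In what year did Mark Twain write Huckleberry Finn?",
        }
    ],
)

# Print the (often lengthy) answer so it can be compared with Google's one-liner
print(response.choices[0].message.content)
```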
We recognize that reports of Google's death would be 'greatly exaggerated' at this point, but a few points are worth noting:
- Many commercial applications require precision search, and companies invest millions to deliver that capability through their production databases. LLMs are closing the gap at a fraction of the cost.
- For the first time, we have other platforms available for consumer search, which will disrupt advertising markets.
- When search and workflow management are combined through LLM engines, a whole new class of applications becomes feasible (and that includes advertising); see the sketch after this list.
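To make that last point concrete, here is a hedged sketch of combining search and workflow through an LLM engine via OpenAI-style tool calling. The `search_products` function and its schema are hypothetical placeholders for a production search backend, not anyone's published API:

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical search backend -- stands in for a production database query
def search_products(query: str) -> str:
    return json.dumps([{"sku": "A-100", "name": "Example widget", "price": 19.99}])

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Precision search over the product catalog",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Find me a widget under $25."}]
response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]

# Run the tool, then hand the result back so the model can finish the workflow
messages.append(response.choices[0].message)
messages.append(
    {
        "role": "tool",
        "tool_call_id": call.id,
        "content": search_products(**json.loads(call.function.arguments)),
    }
)
final = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
print(final.choices[0].message.content)
```

The design point is that the LLM decides when to invoke precision search, so the workflow and the search live in one loop rather than two separate systems.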
Of course, in this transformative computing period, it is important to exercise some caution and keep perspective. We track some of the more formidable testing labs for very large language models and noted these two results, published near the end of 2023:
Briefly, GSM8K is a dataset of over 8,000 linguistically diverse grade-school math word problems, which provides a nice benchmark for measuring the 'problem solving skills' of LLMs. GPT-4 leads the pack (so far). But if the LLMs are being trained on these same datasets, that's not helpful at all in measuring 'reasoning skills'. It's kind of like having the answers to the test before sitting for the exam.
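For reference, GSM8K is publicly available, and a minimal evaluation loop looks something like the sketch below. The Hugging Face dataset id is the published one; the `ask_model` function is a hypothetical stand-in for whatever model you are testing:

```python
# pip install datasets
from datasets import load_dataset

ds = load_dataset("gsm8k", "main", split="test")

def gold_answer(answer: str) -> str:
    # GSM8K reference answers end with a line like '#### 18'
    return answer.split("####")[-1].strip()

def ask_model(question: str) -> str:
    # Hypothetical stand-in: call the LLM under test, return its final number.
    # Returns an empty string here so the sketch runs as-is.
    return ""

correct = 0
sample = ds.select(range(100))  # small sample for a quick check
for row in sample:
    if ask_model(row["question"]) == gold_answer(row["answer"]):
        correct += 1
print(f"accuracy on sample: {correct}/{len(sample)}")
```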
The 'exam score' shows the result of an independently administered high-school math test, giving us a nice plot of results. The red area identifies those models which, more than likely, were trained on the benchmark datasets, given their inability to perform with the same precision on the manually administered test. From the plot, we might conclude that GPT-4 has greater 'reasoning ability'.
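The underlying check is simple arithmetic: a model that scores far better on the public benchmark than on a freshly written test is suspect. A toy sketch, with placeholder scores rather than the published numbers:

```python
# Placeholder scores for illustration only -- not the published results
scores = {
    "model_a": {"gsm8k": 0.85, "fresh_exam": 0.52},
    "model_b": {"gsm8k": 0.90, "fresh_exam": 0.86},
}

GAP_THRESHOLD = 0.15  # arbitrary cutoff for this sketch

for name, s in scores.items():
    gap = s["gsm8k"] - s["fresh_exam"]
    flag = "possible contamination" if gap > GAP_THRESHOLD else "consistent"
    print(f"{name}: benchmark-exam gap {gap:+.2f} -> {flag}")
```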
But before we take any of these models into production, consider this:
We kind of like this metric because it strikes right at the heart of the most important question senior executives ask about LLMs: 'Will it make stuff up?' After all the tests are administered, exams graded, scores assembled, and results evaluated, we still want to know if a GenAI model might go rogue and lie to our customers and business partners. You can read more about the leaderboard here. GPT-4 certainly leads the pack, but even a 3% hallucination rate is too much for many customer touch points.
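One common mitigation is to ground the model in retrieved source text and instruct it to refuse when the answer is not in that text. A minimal sketch, where the prompt wording and model choice are our own and `context` is a placeholder for real retrieved documents:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for text retrieved from a trusted source (e.g., your product docs)
context = "Adventures of Huckleberry Finn was first published in 1884 in the UK."

question = "In what year was Huckleberry Finn published in the United States?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using ONLY the provided context. "
                "If the context does not contain the answer, reply 'I don't know.'"
            ),
        },
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ],
    temperature=0,  # reduce variance for customer-facing answers
)
print(response.choices[0].message.content)
```

Because the context above covers only the UK publication date, a well-behaved model should decline to answer rather than guess, which is exactly the behavior customer touch points need.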
We promised in our last post that we would map out, in technical and functional detail, the techniques we are employing to harness GenAI for production. If you've been following our posts, you can probably already guess the direction we've adopted. But we thought it would be helpful to bring together a short series of posts with the most essential details and demonstrations of what we've concluded after prototyping apps for the past two years. In our view, it is critical for companies to get this right, because adopting an enduring and scalable architecture means a fast start in the market and a competitive position in delivering innovation outcomes for customers.
So, rather than googling for solutions to this somewhat intractable challenge, check back with us as we present a few of the dynamic dimensions of GPT.