Kolmogorov Complexity

Algorithmic Entropy

In our last post, we described the challenge every company faces in testing GenAI, with the aim of making these platforms 'simply understood'. Here, we go a little deeper into the problem of testing language models.

In information theory, there is a little-known but readily appreciated metric known as Kolmogorov complexity. We’ll save you the ‘brain freeze’ on the details and summarize it as the length of the shortest description (strictly speaking, the shortest program) that fully reproduces an object. This makes intuitive sense. A complex object takes many words or a longer algorithm to accurately describe its size, shape, and properties. A simple object, not so much.
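For readers who want to see the idea in action, here is a minimal sketch. True Kolmogorov complexity cannot be computed in general, so this example swaps in a common stand-in: the size of the data after compression with Python's standard zlib module, which only gives a rough upper bound. The function name and the two sample inputs are our own illustrative choices, not part of any formal definition.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed form of data.

    This is only a rough upper-bound proxy for Kolmogorov complexity,
    which itself is not computable.
    """
    return len(zlib.compress(data, 9))


# A "simple" object: 10,000 bytes that follow an obvious repeating pattern.
simple = b"ab" * 5000

# A "complex" object: 10,000 bytes with no pattern for the compressor to exploit.
random_like = os.urandom(10000)

print("simple:     ", len(simple), "bytes raw ->", compressed_size(simple), "bytes compressed")
print("random-like:", len(random_like), "bytes raw ->", compressed_size(random_like), "bytes compressed")
```

Both inputs are the same size, but the patterned one shrinks to a tiny fraction of its original length while the random one barely shrinks at all. In the spirit of Kolmogorov, the first has a short description ("repeat 'ab' 5,000 times") and the second does not.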

So it is with GenAI. AI models are not like application software. They require enormous computing resources to train because they attempt to emulate language patterns. These stochastic models are complex because language is nuanced, elliptical, imprecise, and sometimes unpredictable. Unlike application software, which is architected and built from a known set of requirements, language models are built for the unknown. By ingesting vast datasets of text (or images), these models build statistical maps of how words are related. With that structure, they astonish us by performing a wide range of tasks that were never anticipated, seeming to mimic general intelligence.

Which brings us back to Kolmogorov. How do we effectively test these models and demonstrate the integrity and threshold performance required for customer touch point applications? If we cannot begin to describe these objects accurately, due to their very high Kolmogorov complexity, can we build a reliable and repeatable set of tests?

We’ve been following the work of a new firm in the market that is doing a reasonable job of framing model tests and benchmarks, which you can find here. Rather than getting caught in the weeds, they have developed benchmark methods that first compare known facts across models (like price) and measurable facts (like speed) before introducing debatable measures (like quality). The trouble with any quality measure, which every company would like to understand, is that it depends. It depends on context, intent, assigned task, and training. So, while we know that some models handle certain queries better than others, and some are more consistent over time, it is unlikely that any model could ever be deemed 100% accurate. Even application software has bugs.

So where does that leave us? In our next post, we'll go deeper still into the methods we've developed to frame the risk and improve quality outcomes for AI apps. By making the complex simple, we hope to avoid the algorithmic entropy that Kolmogorov warned us about!