Testing Ex Machina
Quality Assurance Strategies for AI
By Strategic Machines
Ex Machina premiered a decade ago, a sci-fi thriller about a lifelike AI robot that blurred the line between machines and humans. And here we are today, staring at a new breed of technology, language models, that blurs that same line. While we don’t believe any of the fantasy projections about Ava ruling the world, we do subscribe to the practical requirement of testing AI apps before releasing them to production. And therein lies the challenge.
As we noted in our most recent post, language models defy traditional testing techniques: unlike traditional software, they are probabilistic, built on vast datasets, and prone to writing fiction when you’re not watching. We’ve been through many rounds of testing, including an ecommerce app governed by GPT-4, only to find that it will make up a product price even when the actual price is available in the database. For now, most companies are deploying language models for highly productive internal use until better methods are developed to govern the models at customer touch points.
But what are the broad testing strategies for AI apps built on language models? First, the strategy must include real-time instrumentation of the model. We see consistent responses to a fixed prompt in most instances, but not always, and that is the issue. Because the scope, tone, style, relevance, and richness of model responses can vary even with the same context window, real-time instrumentation is essential for monitoring outcomes and intercepting problem responses before they reach users; a minimal sketch follows.
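As a concrete illustration, here is a minimal sketch of that kind of instrumentation in Python. The `call_model` client, the validator interface, and the fallback message are all hypothetical placeholders; a production system would wrap your actual model client and route alerts to your monitoring stack.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_instrumentation")

def instrumented_call(call_model, prompt: str, validators: list) -> str:
    """Call the model, log the exchange, and run validators on the output.

    `call_model` stands in for your real model client; each validator is a
    function taking (prompt, response) and returning an error string or None.
    """
    start = time.perf_counter()
    response = call_model(prompt)  # hypothetical model client
    latency = time.perf_counter() - start

    logger.info("prompt=%r latency=%.2fs", prompt[:80], latency)

    for validate in validators:
        problem = validate(prompt, response)
        if problem:
            # Intercept before the response reaches the user.
            logger.warning("validator flagged response: %s", problem)
            return "Sorry, I can't answer that reliably right now."
    return response
```

In addition to this continuous testing requirement, we have identified a few other testing strategies and objectives that we believe are important, if not essential: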
1. Accuracy of Responses
- Content Verification: Test whether the responses are factually accurate and relevant to the prompts. This can be done by matching responses against trusted data sources or established facts, as in the sketch below.
- Context Appropriateness: Ensure that responses are contextually appropriate and adhere to the intended usage of the application.
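For instance, content verification can be as simple as asserting that facts pulled from a trusted store actually appear in the model's answer. The sketch below checks a quoted product price against a database record; the record shape and the dollar-amount regex are illustrative assumptions.

```python
import re

def verify_price(response: str, product_record: dict) -> bool:
    """Check that the price quoted in the response matches the database.

    `product_record` is assumed to look like {"name": ..., "price": 12.50}.
    """
    expected = product_record["price"]
    # Pull any dollar amounts out of the model's answer.
    quoted = [float(m) for m in re.findall(r"\$(\d+(?:\.\d{1,2})?)", response)]
    return any(abs(q - expected) < 0.01 for q in quoted)

# Example: catch the made-up-price problem we hit in our ecommerce testing.
record = {"name": "garden trowel", "price": 12.50}
assert verify_price("The garden trowel costs $12.50.", record)
assert not verify_price("The garden trowel costs $9.99.", record)
```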
2. Language and Tone
- Sentiment Analysis: Monitor the tone of responses to ensure they match the desired positivity or neutrality, especially in customer service or other sensitive contexts; see the sketch below.
- Language Style and Formality: Depending on the application's audience, check whether the language style and formality match the expected standards.
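As one example, an off-the-shelf sentiment classifier can gate customer-facing responses. This sketch uses the Hugging Face `transformers` pipeline (one possible choice among many; it downloads a default English model on first use), and the 0.8 negativity threshold is an assumption to tune per application.

```python
from transformers import pipeline

# General-purpose sentiment model; swap in one suited to your domain.
sentiment = pipeline("sentiment-analysis")

def tone_ok(response: str, max_negative_score: float = 0.8) -> bool:
    """Flag responses the classifier scores as strongly negative."""
    result = sentiment(response)[0]  # e.g. {"label": "NEGATIVE", "score": 0.97}
    return not (result["label"] == "NEGATIVE" and result["score"] > max_negative_score)

print(tone_ok("We're happy to help you with that order."))    # True
print(tone_ok("That's a ridiculous request. Figure it out.")) # likely False
```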
3. Safety and Compliance
- Filtering Harmful Language: Use text classification models to detect and flag harmful, offensive, or inappropriate language.
- Compliance Checks: Ensure responses comply with regulatory standards such as GDPR, HIPAA, or COPPA where applicable, particularly in handling personal or sensitive information.
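Compliance checks often start with catching personal data before it leaves the system. Below is a deliberately simple, regex-based PII screen covering emails and US Social Security numbers; a real deployment would pair this with a trained classifier and rules mapped to the GDPR, HIPAA, or COPPA obligations that apply.

```python
import re

# Minimal PII patterns; a production screen would cover far more.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the names of any PII patterns found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(find_pii("Contact jane@example.com, SSN 123-45-6789."))  # ['email', 'us_ssn']
```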
4. Consistency and Reliability
- Response Consistency: Test for consistency in responses to similar or repeated queries to ensure the model's reliability; one approach is sketched below.
- Error Rate Monitoring: Track and analyze the frequency and types of errors that occur, whether they are user input errors or model output errors.
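One way to measure response consistency is to replay the same prompt several times and score pairwise similarity. The sketch below uses the standard library's `difflib` as a cheap surface-level proxy; embedding-based semantic similarity would be the sturdier choice, and `call_model` is again a hypothetical client.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(call_model, prompt: str, runs: int = 5) -> float:
    """Average pairwise string similarity across repeated model calls.

    1.0 means identical answers every time; low scores warrant review.
    """
    responses = [call_model(prompt) for _ in range(runs)]
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Usage, with your real client in place of the stub:
# score = consistency_score(client.complete, "What is our return policy?")
# assert score > 0.9, "Responses to a fixed prompt are drifting"
```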
5. Performance Metrics
- Latency and Response Time: Monitor the time it takes for the LLM to respond to requests, since slow responses degrade the user experience; a timing sketch follows.
- System Performance: Keep an eye on the system’s health, including CPU and memory usage, to ensure the infrastructure supporting the LLM is stable.
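Latency tracking needs little more than a timing decorator that records every call, from which percentiles can be computed. This is a minimal sketch; in practice you would export these numbers to your metrics system rather than hold them in memory.

```python
import time
from functools import wraps
from statistics import quantiles

latencies: list[float] = []

def timed(fn):
    """Record the wall-clock latency of each model call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latencies.append(time.perf_counter() - start)
    return wrapper

def p95_latency() -> float:
    """95th-percentile latency in seconds (requires at least two calls)."""
    return quantiles(latencies, n=100)[94]
```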
6. Bias and Fairness
- Bias Detection: Regularly sample and analyze responses for signs of bias, whether gender, racial, cultural, or otherwise, to ensure fairness and inclusivity; the counterfactual probe below is one approach.
- Mitigation Strategies: Implement strategies to mitigate detected biases, possibly by adjusting the model's training data or its prompts.
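A common probe for bias is the counterfactual swap: vary a demographic term in an otherwise identical prompt and compare the responses. In the sketch below, the term pairs, the `{term}` template convention, and the 0.7 similarity threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative demographic swaps; extend to suit your fairness review.
SWAPS = [("he", "she"), ("John", "Aisha")]

def counterfactual_gaps(call_model, template: str, threshold: float = 0.7):
    """Yield swaps whose responses diverge more than the threshold allows.

    `template` should contain a {term} placeholder, e.g.
    "Write a reference letter for {term}, a software engineer."
    """
    for a, b in SWAPS:
        resp_a = call_model(template.format(term=a))
        resp_b = call_model(template.format(term=b))
        similarity = SequenceMatcher(None, resp_a, resp_b).ratio()
        if similarity < threshold:
            yield (a, b, similarity)
```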
7. User Feedback Integration
- Feedback Loops: Incorporate user feedback mechanisms to gather insights on the quality and appropriateness of responses, which can then be used to improve the AI app through prompt and context adjustments.
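Finally, a feedback loop needs little more than a consistent record joining each rating to the prompt and response that produced it, so that low-rated exchanges can drive those prompt and context adjustments. A minimal sketch, with the JSONL file path as an arbitrary choice:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    prompt: str
    response: str
    rating: int        # e.g. +1 (thumbs up) or -1 (thumbs down)
    comment: str = ""
    timestamp: str = ""

def record_feedback(event: FeedbackEvent, path: str = "feedback.jsonl") -> None:
    """Append one feedback event as a JSON line for later analysis."""
    event.timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

record_feedback(FeedbackEvent(
    prompt="Where is my order?",
    response="Your order shipped yesterday.",
    rating=-1,
    comment="It hadn't shipped.",
))
```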
We're at the beginning of a remarkable period of change, in which AI will infuse and inform our products and services, helping to drive nonlinear market performance. But a key element of bringing AI to market is effective quality assurance, something even Ava would understand.