Why Evals Are the New User Stories for AI

Build trust, curb hallucinations and promote your LLM apps with a rigorous, repeatable evaluation playbook.

What is an Eval?

During my stint as a tech product manager, we spent a long time crafting user stories. The head of technology called them code written in plain English. The developers translated them into computer code, and the testers used them to check that the finished product worked as expected. In short, user stories defined the outcome. Here’s an example:

“As a student, I want to be able to access foreign language courses online so that I can be prepared for studying or working abroad in the future.”

Coursera

When you are building with LLMs, user stories take the form of prompts. You might instruct a no-code app or ChatGPT to write code that solves a particular problem. The quality of the prompt matters, but the key link in the chain shifts to assessing outcomes. This is where evaluations, or evals, enter the frame, and they are fast becoming the measure by which product managers are judged.

Evals are surprisingly often all that you need.

Greg Brockman, OpenAI President

Without high-quality evals, it is difficult and time-consuming to understand how the rapid evolution of LLM versions affects your use case.

Why Evals Matter

LLMs are stochastic models: their outputs are sampled from a probability distribution, so they can be analysed statistically but cannot be predicted with precision. Most people are familiar with hallucinations, when LLMs get facts wrong or make them up. There are other common errors, such as getting different answers to the same question and multi-step answers veering off course. These are inherent issues with stochastic sampling.

An additional problem lies in the rise of no-code apps. These are often created to perform a specific task, without much thought about long-term consequences such as scaling and resource consumption. What works for a quick go-to-market does not sustain a lasting product.

Demonstrating reliable output is the step that moves an application from proof of concept into production. Evals are validation and testing for LLM applications. Strong evals result in more stable products that are resilient to code and model changes.

How Evals Work

There are two ways to evaluate outputs. For simpler tasks, developers can code the expected response. For example, the prompt “Where will the next summer Olympics be held?” should return “Los Angeles”. A model’s output can then be evaluated against a series of ideal answers.

This can also work for coding. If a prompt requires code in a particular language, then the test code checks that the output is in that language.
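To make this concrete, here is a minimal sketch of a coded-answer eval in Python. The ask_model() helper, the example cases and the simple substring check are assumptions for illustration, not a specific framework; in practice you would plug in the API call your app already makes and the ideal answers for your own use case.

from typing import Callable

# Each case pairs a prompt with the answers we are willing to accept.
EVAL_CASES = [
    {"prompt": "Where will the next summer Olympics be held?",
     "accepted": ["Los Angeles"]},
    {"prompt": "Write a function that reverses a string, in Python.",
     "accepted": ["def "]},  # crude signal that the output is Python code
]

def run_evals(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose output contains an accepted answer."""
    passed = 0
    for case in EVAL_CASES:
        output = ask_model(case["prompt"])
        if any(answer in output for answer in case["accepted"]):
            passed += 1
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    # Stand-in for a real LLM call, so the sketch runs as-is.
    fake_model = lambda prompt: "The next summer Olympics will be held in Los Angeles."
    print(f"Pass rate: {run_evals(fake_model):.0%}")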

When grading creative writing, or ranking research, there is no one right answer. In these cases the model itself, or better still a different model, is used to grade answers. This approach is better suited to more open-ended responses.
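A sketch of model grading might look like the following, again assuming a hypothetical ask_model() helper that sends a prompt to your grading model and returns its text reply. The rubric and the 1-to-5 scale are placeholders; a real grading prompt would carry the detailed instructions described under best practice below.

GRADING_PROMPT = """You are grading another model's answer.

Question: {question}
Answer to grade: {answer}

Score the answer from 1 (poor) to 5 (excellent) for accuracy and clarity.
Reply with the score only."""

def grade_answer(ask_model, question: str, answer: str) -> int:
    """Ask a (preferably different) model to grade an open-ended answer."""
    reply = ask_model(GRADING_PROMPT.format(question=question, answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0  # 0 means the grader gave no usable score

# Usage: grade_answer(judge_model, "Summarise this contract in plain English.", model_output)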

Developing a suite of evals for your objectives helps you understand how new models handle your use cases. Evals can become part of the release process, ensuring the requisite accuracy is achieved before adopting a new model.
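As a sketch of what such a release gate might look like, the snippet below assumes your eval suite reports a pass rate between 0 and 1, as run_evals() does in the earlier sketch; the threshold is an illustrative figure, not a recommendation.

MIN_PASS_RATE = 0.90  # example threshold, set by your own accuracy requirements

def approve_model(pass_rate: float) -> bool:
    """Block a new model version unless it clears the required pass rate."""
    return pass_rate >= MIN_PASS_RATE

# Usage in a release check: approve_model(run_evals(new_model))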

Best Practice for Evals

Companies such as OpenAI provide eval templates for testing common tasks. If you want to build your own, there are a few things to consider.

An eval should be consistent. This means tackling a subject, such as summarising legal contracts, from different angles using a series of prompts. This isolates a capability and builds confidence in the outcome.

The eval should be challenging, but possible. If a model does well on all the prompts, it may not have been tested hard enough. One starting point is to write evals that would require a human subject-matter expert to answer.

The eval must spell out what good looks like. The data should include a strong signal as to what the right behaviour is. This means high-quality reference answers when evaluating simple tasks and a detailed set of instructions for model grading.

Evals should be included as part of development and in production. During development they help understand how choices about design and architecture impact outcomes. In production, they are used to monitor performance and trigger fixes.

Why Perform Evals?

The highest hurdle for the adoption of AI is creating trust and confidence in outcomes. If users perceive a black box that spews out errors, they will not value the product. This is a particular issue for regulated industries, such as capital markets or insurance.

Evals help demonstrate accuracy and reliability. They help show how risks such as bias are mitigated and demonstrate a process of continuous learning. Evals are an important part of showing clients how AI will continue to get better.

That said, evals are time-consuming, data dependent and require careful crafting. There will be trade-offs between accuracy and explainability of answers. Regular re-evaluation is required as models evolve.

For this reason, you may consider engaging expert help to design, create and run your evals. This might be a condition of allowing non-technical staff to use no-code tools.

Conclusion

Generating effective evals is now a core part of the development cycle when building apps powered by LLMs. Evals are used today in both the development and maintenance of AI systems and form a critical step in running AI for internal and external use cases.

Questions to Ask and Answer

  1. How do I evaluate the answers I get from LLMs?

  2. How do I check the quality of my team’s no-code workflows?

  3. How do I train my colleagues to cross-check responses from ChatGPT?

Here are 3 ways I can help:

  1. Hit reply to ask about evals.

  2. Explore AI use cases.

  3. Book a discovery call with an AI expert.
