Working with Prompt Templates
Creating Your First Test Cases
Prompt evaluations are only truly meaningful if you have a robust set of test cases that cover the breadth of possible scenarios in your prompts. As is the case with most forms of testing, this can often be the most difficult part of any evaluation workflow.
Luckily, Libretto makes it as painless as possible to create and manage your test cases!
While navigating within a specific prompt template, you can access the Test Cases management page by clicking the Test Cases menu item in the left-hand navigation pane.
Test Case Requirements
Some key features, like Experiments, require a prompt template to have at least 6 test cases before they can be run at all. In general, though, the more test cases the better, and you should think about how many test cases you need as a function of where you are in the development cycle.
If you are exploring whether LLMs can feasibly solve your problem at all, you should have at least 5 test cases, and ideally more like 10-20. This is a number where you have enough inputs to get a sense of how the LLM succeeds and fails, but you can still fairly easily scan over the LLM responses to assess vibes.
If you are iterating on a prompt and running experiments to make it the best it can be, you should have at least 30 test cases, but ideally more like 70-200. Keep in mind that many prompt engineering techniques will only increase your prompt's performance by a few percentage points, and that means you need more test cases to separate out real performance gains from noise.
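As a rough, back-of-the-envelope illustration of why small gains demand larger test sets (this is standard binomial noise, not a Libretto feature), the uncertainty in a measured pass rate shrinks only with the square root of the number of test cases:

```python
import math

# Back-of-the-envelope: the standard error of a measured pass rate
# shrinks with the square root of the number of test cases.
def pass_rate_std_error(pass_rate: float, num_cases: int) -> float:
    return math.sqrt(pass_rate * (1 - pass_rate) / num_cases)

print(round(pass_rate_std_error(0.7, 30), 3))   # ~0.084, i.e. roughly +/- 8 points
print(round(pass_rate_std_error(0.7, 200), 3))  # ~0.032, i.e. roughly +/- 3 points
```

With only 30 test cases, a few-point improvement is easily lost in that noise; with 200, it starts to become visible.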
Bulk Upload
Many users have existing golden test sets or production data that they can pull from for this. If that is the case, you can use a .CSV file to quickly upload a large number of test cases.
The file should start with a header row that includes one column for each of the variables specified in the previous step, plus a column called targetOutput for the known-good output of the test case.
For instance, a sample file for a prompt template that tries to classify sentiment in hotel reviews may look like:
review,targetOutput
The room stank of cigarette smoke and wasn't clean,negative
All of the staff were exceptionally sweet,positive
The rate is very affordable but it's also a pretty barebones place,ambivalent
"The view from my room was honestly quite nice, and the breakfast was decent",positive
"I wish that this place was a little nicer. It was fine, though.",neutral
The perks and goodies you get as a guest here are out of this world.,positive
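If your golden set already lives in code or a database, a small script can produce this file. The sketch below is illustrative only: the filename and sample data are hypothetical, and the only real requirement is the header row with one column per prompt variable plus targetOutput.

```python
import csv

# Hypothetical existing data: one dict per test case, with a key for each
# prompt variable ("review" here) plus the known-good answer.
examples = [
    {"review": "The room stank of cigarette smoke and wasn't clean", "targetOutput": "negative"},
    {"review": "All of the staff were exceptionally sweet", "targetOutput": "positive"},
]

# Write a CSV in the bulk-upload format: a header row naming each variable
# and a targetOutput column, then one row per test case.
with open("test_cases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["review", "targetOutput"])
    writer.writeheader()
    writer.writerows(examples)
```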
You can perform a bulk upload of test cases during the creation of your prompt template, or at any point from the Test Cases management pane.
Manually Add Individual Test Cases
By selecting "Create Your First Case", you can, well... create your first test case! Manually!
Test Case Type
Each variable in your test case can be either Single Argument or Chat History. Single Argument variables consist of a single user input, whereas Chat History variables are a series of messages that can be plugged in where the variable is. Most of the time you will use Single Argument variables. Chat History variables are generally used for an ongoing chat, where your prompt template inserts all of the prior chat messages into the middle of the message array and follows them with a templated message, usually a User message.
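As an illustration (the exact payload shape here is a sketch, not Libretto's precise schema), a Single Argument value is a single string, while a Chat History value is an ordered list of role/content messages:

```python
# Illustrative only: the exact field names your setup expects may differ.
single_argument_value = "What time does the rooftop pool close?"

chat_history_value = [
    {"role": "user", "content": "Do you have rooms with a sea view?"},
    {"role": "assistant", "content": "Yes, our deluxe rooms face the water."},
    {"role": "user", "content": "Great, how much are they per night?"},
]
```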
For each variable within your prompt template, specify an example input you might expect from an end user. This may be the content of a blog post you'd like the AI to create a title for, a question for the AI to answer, or a selection of text you'd like the AI to summarize.
Specifying a Target Output
The target output is simply the response you'd ideally like to receive from the LLM. These target outputs are what the majority of metrics are evaluated against.
Be sure to keep in mind any instructions you provided via a System prompt. Many LLMs are slightly verbose and generically genial by default.
If the responses you're getting are drastically different in format and tone from your target outputs, it's a good sign that your prompt template needs adjustment. Don't fret - this is what prompt engineering is all about!
Suggested Target Output
Having a tough time coming up with an answer? Click Suggest and Libretto will call an LLM on your behalf to create a sample answer based on the given input.
Be sure to review the response, and if needed edit the content and format to match your desired output.
Generate Test Cases
Similar to suggesting target outputs, Libretto can also generate test case inputs for you. You can access this feature by clicking "Generate Test Cases" in the upper right-hand corner of the Test Cases management page.
Guidance
From the modal, verify the description of the parameters and give any additional guidance on how the content of each variable should be generated. If, for instance, you had a variable {cityName}, you may want to specify that the generated values should be U.S. cities with a population of less than 1 million people.
Review
After confirming the number of test cases you'd like generated, a sample of variable inputs will be presented. You can label each test case as either "Good" or "Bad". For every "Bad" input, Libretto will generate a replacement example for you to review again.
Once a sufficient number of "Good" test cases have been accepted, click "Generate" and these inputs will be added to your list of test cases.
Target Output
The generated test cases will not have an output attached to them. For each new test case, specify a target output like you would for a manually created test case (per above).
Next Steps
We've got test cases to evaluate against, but how should we compare the LLM's actual responses against these test cases' target outputs? Let's take a deep dive.