Introduction
Quickstart
Ready to get the best out of your prompts today? We're here to make that a reality in the quickest way possible.
If you'd like to take a deeper look at how you can customize and build your prompt template, check out our full guide.
Create a Prompt Template
Once you've created a Project, you can create a new prompt template by clicking on the "Create Prompt Template" button.
Choose a Prompt Type
Currently, you can choose between Chat and Assistant prompt templates.
Chat encapsulates most common use cases, where a series of System, Assistant, and User messages are specified. Use this type of prompt template even if you intend to use a completion-style model, like GPT-3.
Assistant lets you include files to reference within your prompt template, and leverages OpenAI's Assistants API. Note that currently, Assistant is only supported for OpenAI models, so if you use this type of prompt template, you won't be able to compare models from different providers, like Anthropic or Google.
Add your Prompt
- Give your prompt a unique name in its project.
- Build out your prompt template by adding System, User, and Assistant messages.
- By default, the System message will include "You are a helpful assistant". Modify this field to give the model guidance on how to respond to a User's message.
- We recommend only making one System message and then alternating User and Assistant messages. While OpenAI allows multiple System messages and having two User messages in a row, other model providers do not. Having one System message and alternating User and Assistant messages ensures that comparing models from different providers will be as easy as possible.
- Specify variables for your prompt by encapsulating a variable name in curly braces, like {variableName}, within a User message.
Every prompt template must have at least one variable. Variables are placeholders in your prompt template that take in information at runtime and change from one call to the next, whereas your prompt template is what stays the same from one LLM call to the next. You can think of variables as the inputs to your prompt. They are often user-entered data or information that has come out of your database or another data source. You can put as many variables as you want into a template.
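To make this concrete, here is a minimal sketch, in plain JavaScript rather than anything Libretto-specific, of how a chat template with a {review} variable gets filled in at call time. The fillTemplate helper and the message shape are purely illustrative.

// A minimal sketch (not Libretto's implementation) of filling a chat-style
// template whose User message contains a {review} variable.
const template = [
  { role: "system", content: "You are a helpful assistant that classifies hotel reviews." },
  { role: "user", content: "Classify the sentiment of this review: {review}" },
];

// Replace each {variableName} with the value supplied for this particular call.
function fillTemplate(messages, variables) {
  return messages.map((message) => ({
    ...message,
    content: message.content.replace(/\{(\w+)\}/g, (match, name) =>
      name in variables ? String(variables[name]) : match
    ),
  }));
}

console.log(fillTemplate(template, { review: "All of the staff were exceptionally sweet" }));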
Add or Create Test Cases
Regardless of which metrics you choose, having a sufficient number of well-formed test cases is crucial to making a test run robust enough to provide meaningful insights. A body of test cases, sometimes called a golden dataset or ground truth, is a set of inputs to your prompt template and, optionally, corresponding known-good outputs. A great place to find test cases is existing data from product usage, which you can easily bulk upload. If you don't have existing data, you can also create test cases manually.
Bulk Upload
Using a .CSV file, you can quickly upload a large number of test cases.
The file should start with a header row that includes one column for each of the variables specified in the previous step and a column called targetOutput for the known-good output of each test case.
For instance, a sample file for a prompt template that tries to classify sentiment in hotel reviews may look like:
review,targetOutput
The room stank of cigarette smoke and wasn't clean,negative
All of the staff were exceptionally sweet,positive
The rate is very affordable but it's also a pretty barebones place,ambivalent
"The view from my room was honestly quite nice, and the breakfast was decent",positive
"I wish that this place was a little nicer. It was fine, though.",neutral
The perks and goodies you get as a guest here are out of this world.,positive
A prompt template must have a minimum of 6 test cases to be evaluated or experimented on successfully.
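If your test cases live in a database or logs, you can assemble the upload file with a few lines of code. The sketch below is just an illustration of the CSV format above, with standard CSV quoting for values that contain commas; the helper and file name are hypothetical, not part of Libretto.

// Illustrative only: building the bulk-upload CSV from existing data.
const fs = require("fs");

const testCases = [
  { review: "The room stank of cigarette smoke and wasn't clean", targetOutput: "negative" },
  { review: "The view from my room was honestly quite nice, and the breakfast was decent", targetOutput: "positive" },
];

// Quote any field containing a comma, quote, or newline, per standard CSV rules.
function csvField(value) {
  return /[",\n]/.test(value) ? '"' + value.replace(/"/g, '""') + '"' : value;
}

const rows = [
  "review,targetOutput",
  ...testCases.map((t) => [t.review, t.targetOutput].map(csvField).join(",")),
];

fs.writeFileSync("hotel-review-test-cases.csv", rows.join("\n"));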
Manually Add Test Cases
Using the left-hand pane of the prompt template, navigate to Test Cases. Here you can add individual test cases with a Target Output and a value for each variable.
Generate Test Cases
Creating meaningful test cases can often be the most difficult part of the prompt engineering process. To alleviate that, Libretto provides the ability to generate test cases via an LLM itself. Simply provide a meaningful description and specify the number of test cases you'd like to generate.
How Many Test Cases Do You Need?
It's basically always better to have more high-quality test cases than fewer, and time invested in creating test cases will pay off as you iterate on your prompt. But here's what we recommend as a minimum number of test cases for various phases of development.
If you are exploring whether LLMs can feasibly solve your problem at all, you should have at least 5 test cases, and ideally more like 10-20. This is a number where you have enough inputs to get a sense of how the LLM succeeds and fails, but you can still fairly easily scan over the LLM responses to assess vibes.
If you are iterating on a prompt and running experiments to make it the best it can be, you should have at least 30 test cases, but ideally more like 70-200. Keep in mind that many prompt engineering techniques will only increase your prompt's performance by a few percentage points, and that means you need more test cases to separate out real performance gains from noise.
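As a rough back-of-the-envelope illustration (ours, not a Libretto feature): if each test case is scored pass/fail, the noise in a measured pass rate shrinks roughly with the square root of the number of test cases.

// Rough illustration of why small gains need many test cases.
// Assumes independent pass/fail scoring; real evaluations may behave differently.
function passRateMarginOfError(passRate, numTestCases) {
  const standardError = Math.sqrt((passRate * (1 - passRate)) / numTestCases);
  return 2 * standardError; // roughly a 95% margin of error
}

console.log(passRateMarginOfError(0.8, 30).toFixed(2));  // ~0.15, i.e. about +/- 15 points
console.log(passRateMarginOfError(0.8, 200).toFixed(2)); // ~0.06, i.e. about +/- 6 points

In other words, a technique that genuinely improves your prompt by a few points can be invisible at 30 test cases but clearly measurable at 200.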
Choose Evaluation Metrics
Once you have some good test cases loaded up, you need to let us know how you would like to evaluate the LLM outputs. For more straightforward prompts, like sentiment analysis or named entity extraction, you probably just want to compare the test case's target output and the LLM output exactly. For more complicated and generative prompts, though, you'll want to use more sophisticated evaluations.
Based on the latest research, Libretto supports a large number of methods to evaluate your prompts. Navigate to the Eval Settings tab of the left-hand panel to configure your prompt template to run against the metrics you find most meaningful. Newly initiated Test Runs will always use the most current configuration.
Format Response
Set up a regex or JSON to ensure the model's responses are evaluated accurately.
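For instance, a sentiment prompt might answer with a full sentence or a JSON object. The sketch below shows the kind of extraction a Format Response rule performs before the comparison; in Libretto you configure this in the UI rather than writing code, and the response text here is made up.

// Illustrative extraction only; Format Response rules are configured in the UI.
const rawResponse = "The sentiment of this review is: negative.";

// Regex approach: pull out just the label before comparing to targetOutput.
const match = rawResponse.match(/positive|negative|neutral|ambivalent/i);
console.log(match ? match[0].toLowerCase() : null); // "negative"

// JSON approach: if the prompt asks for JSON, parse it and read a single field.
const jsonResponse = '{"sentiment": "negative", "explanation": "The room was dirty."}';
console.log(JSON.parse(jsonResponse).sentiment); // "negative"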
Metric Selection
Select the metrics you'd like your test cases to be evaluated against from a wide variety of options.
For Subjective evaluations, you can specify a phrase or a list of outcomes that an LLM will then use to judge the model's initial response. You can even create your own Scorecard system or submit a custom JavaScript function to evaluate the response.
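As a sketch of the custom-function option: an evaluator generally takes the model's output (and usually the test case's target output) and returns a score. The signature below is an assumption for illustration only; check the Eval Settings UI for the exact shape Libretto expects.

// Hypothetical custom evaluator; Libretto's exact function signature may differ.
function evaluateResponse(modelOutput, targetOutput) {
  // Count the response as correct if it contains the expected label,
  // ignoring case and any surrounding prose.
  return modelOutput.trim().toLowerCase().includes(targetOutput.trim().toLowerCase()) ? 1 : 0;
}

console.log(evaluateResponse("Sentiment: Negative", "negative")); // 1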
Modifying Your Prompt Template
Now that you have test cases and an evaluation strategy set up, you can iterate on and play with your prompt. Using the text editors within the Playground tab, you can easily update and adjust the text, format, and model of your prompt template.
Run Tests and Experiments
You can run tests and initiate Experiments within the Playground or Leaderboard.
Model Selection
Choose from a wide list of available LLM models and providers from the drop-down or checklist.
Some model names, like "GPT-4 Turbo", are aliases for a more specific version of that provider's model; in that case, any experiments or tests will run against that specific version. If you'd rather be sure you're running against a particular model version, select one like "GPT-4 (preview version 0125)".
Run Tests
Any time you modify your prompt template, you can run a new test with current test cases, model selection, and evaluation settings by clicking "Save & Run Test". This will also save the current prompt template so you can load it up in the future. You can also re-run any of these tests for a given prompt-template / model combination via "Re-run" buttons on the Playground or Leaderboard.
When you run a test, it will show up in the list on the right, and you can click through to see an overview and deep-dive into the results. You can also check multiple test runs and click the "Compare" button to quickly compare two different test scenarios.
Experiments
Perhaps one of the best ways to discover better versions of your prompts is Libretto's Experiments feature. You can start an Experiment via the "Run Experiment" button in the Playground.
First, choose which type of Experiment you'd like to run: Few-Shot, Magic Words, or LLM Model Versions. Then simply select the primary evaluation metric and click "Run Experiment". The primary evaluation metric is the evaluation the Experiment will use to determine which prompts are performing better.
You can take a close look at the Variants created and their performance against your chosen metric by clicking on the Experiment in the right column of the Playground. From there, you can Promote individual Variants, which makes them available to view in the Playground and Leaderboard.
Test and Experiment Status
The Playground also provides a simple way to monitor the progress of any currently running tests or experiments. If you'd like to cancel one of these, simply click the X / Cancel button.
Prompt Template Comparison
The Leaderboard and Compare page provide a great way to evaluate the performance of different prompt versions and models.
Leaderboard
Found via the left-hand panel, the Leaderboard gives a meaningful overview of how the different prompt versions and models stack up according to the evaluation metrics you've chosen. From here, you can sort the results by a given metric, kick off new test runs and experiments, or navigate to other parts of the application with the desired context.
Compare
The Compare page gives a detailed view of how each version / model pairing performed against each individual test case. Assessing why a certain test case performed the way it did gives you meaningful feedback on how to optimize the next version of your prompt.