Working with Prompt Templates
Running Experiments
One of the best ways to discover better versions of your prompts is Libretto's Experiments feature. Based on the type of Experiment, Libretto will create new Variants - versions of your prompt template that are re-worded or differ in structure - to help you find the best possible prompt.
We are constantly adding new Experiment types based on the latest prompt engineering research.
Starting an Experiment
You can start an Experiment via the buttons in the Playground or Leaderboard. This will bring up a modal for you to choose the desired Experiment settings.
Experiment Type
First, choose which type of Experiment you'd like to run: Few-Shot, Magic Words, LLM Model Versions, or Prompt Rephrasing. Each type has its own method and number of steps for discovering new Variants.
Few Shots
A Few-Shot Experiment injects a varying mix of your test cases into the prompt as worked examples. This helps instruct the LLM how to structure its responses in your desired format.
Currently, this is a two-step Experiment. The first step inserts the test cases into the prompt; the second tries to combine the best-performing Variants from the first step to create new Variants.
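To make the idea concrete, here is a minimal sketch (in Python) of what injecting few-shot examples into a prompt can look like. This is purely illustrative and is not Libretto's implementation; `build_few_shot_prompt` and the sample test cases are hypothetical.

```python
# Illustrative sketch only -- not Libretto's implementation.
# Builds a prompt Variant by injecting test cases as worked examples.

def build_few_shot_prompt(base_instructions: str, test_cases: list[dict], n_examples: int) -> str:
    """Append n_examples input/output pairs to the base instructions."""
    examples = test_cases[:n_examples]
    example_text = "\n\n".join(
        f"Input: {case['input']}\nOutput: {case['expected_output']}"
        for case in examples
    )
    return f"{base_instructions}\n\nExamples:\n\n{example_text}\n\nInput: {{user_input}}\nOutput:"

# Hypothetical test cases drawn from the prompt template's test set.
test_cases = [
    {"input": "The movie was fantastic!", "expected_output": '{"sentiment": "positive"}'},
    {"input": "Terrible service, never again.", "expected_output": '{"sentiment": "negative"}'},
    {"input": "It was okay, nothing special.", "expected_output": '{"sentiment": "neutral"}'},
]

# First-step Variants might differ only in how many examples they inject.
variants = [
    build_few_shot_prompt("Classify the sentiment of the text as JSON.", test_cases, n)
    for n in range(1, len(test_cases) + 1)
]
```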
Magic Words
Magic Words variations strategically add specific words or phrases to your prompts. These act as cues that steer the LLM toward more relevant, higher-quality responses.
This is a longer-running Experiment that will continue to iterate, adjust, and generate new System instructions.
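As a rough illustration only (not Libretto's actual method), a Magic Words Variant can be thought of as the original System instructions plus a well-known cue phrase. The cue phrases and function below are hypothetical examples:

```python
# Illustrative sketch only -- not Libretto's implementation.
# Generates Variants by appending cue phrases to the System instructions.

MAGIC_PHRASES = [
    "Let's think step by step.",
    "Take a deep breath and work through this problem carefully.",
    "You are an expert in this domain.",
]

def magic_word_variants(system_instructions: str) -> list[str]:
    """Return one candidate Variant per cue phrase."""
    return [f"{system_instructions}\n\n{phrase}" for phrase in MAGIC_PHRASES]

variants = magic_word_variants("Summarize the customer ticket in two sentences.")
```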
Model Version
Select from the list of available models to evaluate your prompt template against. By default, your current model will be selected so that there is a baseline for other model results to compare against.
This makes testing a wide variety of the latest and greatest models incredibly simple, with none of the usual setup or context switching.
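For context on what a Model Version Experiment automates, here is a sketch of what comparing the same prompt across models looks like by hand with the OpenAI Python SDK. The model list and prompt are arbitrary examples; Libretto handles this setup and the evaluation for you.

```python
# Manual model comparison -- the kind of setup the Model Version Experiment removes.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CANDIDATE_MODELS = ["gpt-4o", "gpt-4o-mini"]  # baseline model plus an alternative

def run_prompt(model: str, system: str, user: str) -> str:
    """Send the same system/user prompt to a given model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

for model in CANDIDATE_MODELS:
    output = run_prompt(model, "Classify the sentiment of the text as JSON.",
                        "The movie was fantastic!")
    print(model, "->", output)
```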
Evals
Just as you chose eval metrics when first setting up your prompt template, select a primary eval metric for the Experiment Variants to be optimized against. Variants will also be run against the prompt template's currently configured eval metrics for reference.
Reviewing Experiment Results
Each of the Variants created by the Experiment is presented in a table, grouped by Experiment Step.
Performance Vs Baseline
The primary criterion we use to compare Variants is the "Performance Vs Baseline" metric: a statistical comparison of a given Variant's test case results against those of the initial prompt.
This metric is expressed as X.X% better or worse than the baseline. Naturally, we're looking for the Variants with the highest Performance Vs Baseline.
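Libretto's exact statistical test isn't detailed here, but conceptually the comparison resembles the sketch below, which computes the mean percent change over per-test-case eval scores and uses a paired t-test as a stand-in significance check. The scores shown are made up.

```python
# Illustrative sketch only -- Libretto's exact statistical test is not specified here.
# Compares per-test-case eval scores for a Variant against the baseline prompt.

from statistics import mean
from scipy.stats import ttest_rel  # paired t-test as a stand-in comparison

def performance_vs_baseline(baseline_scores: list[float], variant_scores: list[float]) -> tuple[float, float]:
    """Return (percent change vs baseline, p-value of a paired t-test)."""
    pct_change = (mean(variant_scores) - mean(baseline_scores)) / mean(baseline_scores) * 100
    _, p_value = ttest_rel(variant_scores, baseline_scores)
    return pct_change, p_value

# Hypothetical eval scores for the same test cases under both prompts.
baseline = [0.72, 0.64, 0.80, 0.58, 0.75]
variant = [0.78, 0.70, 0.81, 0.66, 0.79]

pct, p = performance_vs_baseline(baseline, variant)
print(f"Performance Vs Baseline: {pct:+.1f}% (p = {p:.3f})")
```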
Promoting Variants
If you find a Variant you'd like to use in the future, you can Promote that prompt version/model pairing via the "Add to Playground" button on the top-right of the screen, or via the drop-down menu for the given Variant within the table.
Once Promoted, the Variant can be found in the Leaderboard and edited in the Playground.
Starting a New Experiment
If you find a worthwhile Variant and would like to kick off the discovery cycle again, select that Variant from the drop-down at the top and click the 'Use this Variant to Start a New Experiment' button on the top-right.