Getting Started
Prompt Engineering Quickstart
Ready to get the best out of your prompts today? We're here to make that a reality in the quickest way possible.
If you'd like to take a deeper look at how you can customize and build your prompt template, check out our full guide.
Create a Prompt Template
Once you've created an account, you'll be directed to the Project Dashboard, where you can create a new Prompt Template by clicking the "Create Prompt Template" button.
If you've already integrated our SDK and sent calls to Libretto, we'll have created unique Prompt Templates for you automatically. Learn more about our automated-foundation flow here.
Choose a Prompt Type
Currently, you can choose between Chat and Assistant prompt templates.
Chat encapsulates most common use cases, where a series of System, Assistant, and User messages are specified. Use this type of prompt template even if you intend to use a completion-style model, like GPT-3.
Assistant lets you include files to reference within your prompt template, and leverages OpenAI's Assistants API. Note that currently, Assistant is only supported for OpenAI models, so if you use this type of prompt template, you won't be able to compare models from different providers, like Anthropic or Google.
Add your Prompt Template
- Give your prompt a unique name in its project.
- Build out your prompt template by adding System, User, and Assistant messages.
- By default, the System message will include "You are a helpful assistant". Modify this field to give the model guidance on how to respond to a User's message.
- We recommend only making one System message and then alternating User and Assistant messages. While OpenAI allows multiple System messages and having two User messages in a row, other model providers do not. Having one System message and alternating User and Assistant messages ensures that comparing models from different providers will be as easy as possible.
- Specify variables for your prompt by encapsulating a variable name in curly braces, like {variableName}, within a User message.
Every prompt template must have at least one variable. Variables are placeholders in your prompt template that take in information at runtime and change from one call to the next, whereas your prompt template is what stays the same from one LLM call to the next. You can think of variables as the inputs to your prompt. They are often user-entered data or information that has come out of your database or another data source. You can put as many variables as you want into a template.
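For illustration, below is a minimal sketch of how a chat-style template with a single {review} variable might be represented and filled in at runtime. The structure, names, and helper function are assumptions made for the example, not Libretto's internal format or SDK.

```typescript
// Illustrative only: a chat-style prompt template with one variable,
// and a helper that substitutes runtime values for {variable} placeholders.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const template: ChatMessage[] = [
  { role: "system", content: "You are a helpful assistant that classifies hotel reviews." },
  { role: "user", content: "Classify the sentiment of this review: {review}" },
];

function render(messages: ChatMessage[], variables: Record<string, string>): ChatMessage[] {
  return messages.map((message) => ({
    ...message,
    // Replace each {name} with the matching runtime value, if one was provided.
    content: message.content.replace(/\{(\w+)\}/g, (match, name) => variables[name] ?? match),
  }));
}

// The template stays the same from call to call; only the variables change.
console.log(render(template, { review: "The room was spotless and the staff were friendly." }));
```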
Add or Create Test Cases
Regardless of which metrics you choose, having a sufficient number of well-formed test cases is crucial to making a test run robust enough to provide meaningful insights. A body of test cases - sometimes called a golden dataset or ground truth - is a set of inputs to your Prompt Template and, optionally, corresponding desired outputs.
If you already know which test cases you'd like to use, you can bulk upload a .csv file here, create test cases manually, or have Libretto generate test cases.
However, a great source of test cases is your own production traffic! With a sufficient amount of traffic sent via our SDK, Libretto can also auto-generate test cases, as outlined here.
Bulk Upload
Using a .csv file, you can easily and quickly upload a large number of test cases.
This file should start with a header row that includes one column for each of the variables specified in the previous step, plus a column called targetOutput for the known-good output of the test case.
For instance, a sample file for a prompt template that tries to classify sentiment in hotel reviews may look like:
review,targetOutput
The room stank of cigarette smoke and wasn't clean,negative
All of the staff were exceptionally sweet,positive
The rate is very affordable but it's also a pretty barebones place,ambivalent
"The view from my room was honestly quite nice, and the breakfast was decent",positive
"I wish that this place was a little nicer. It was fine, though.",neutral
The perks and goodies you get as a guest here are out of this world.,positive
A prompt template must have a minimum of 6 test cases to be evaluated or experimented on successfully.
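If you assemble this file programmatically, a hedged sketch like the one below (not part of Libretto's SDK; the field names simply mirror the example above) can help keep the header row and comma handling well-formed:

```typescript
// Illustrative helper for building an upload-ready CSV from test cases.
// Only the header format (one column per variable plus targetOutput) comes
// from the requirements above; everything else is example data.
type TestCase = { review: string; targetOutput: string };

const cases: TestCase[] = [
  { review: "The room stank of cigarette smoke and wasn't clean", targetOutput: "negative" },
  { review: "The view from my room was honestly quite nice, and the breakfast was decent", targetOutput: "positive" },
];

// Quote any field that contains a comma, quote, or newline, per standard CSV rules.
function toCsvField(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

const rows = [
  ["review", "targetOutput"],
  ...cases.map((c) => [c.review, c.targetOutput]),
];

const csv = rows.map((row) => row.map(toCsvField).join(",")).join("\n");
console.log(csv);
```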
Manually Add Test Cases
After creating the Prompt Template and selecting Add Test Cases Later, use the left-hand navigation panel to go to Test Cases. Here you can add individual test cases with a Target Output and variable values.
Generate Test Cases
Coming up with meaningful test cases can often be the most difficult part of the prompt engineering process. To alleviate that, Libretto provides the ability to generate test cases via an LLM itself. Simply provide a meaningful description and specify the number of test cases you'd like to generate.
How Many Test Cases Do You Need?
It's basically always better to have as many high-quality test cases as you can. Time invested in creating test cases will continually pay off as you iterate on your prompt. Here's what we recommend as a minimum number of test cases for various phases of development.
If you are starting to explore whether LLMs can feasibly solve your problem, what some may call a "vibes test", you should have at least 5 test cases, and ideally 10-20. This is the point where you start to have just enough input to get a sense of how the LLM succeeds and fails, while still being able to easily scan through the LLM responses.
If you are optimizing a prompt and running Experiments, you should realistically have at least 30 test cases, and ideally 100-200. Keep in mind that many prompt engineering techniques will only increase your prompt's performance by a few percentage points, and more test cases make it easier to differentiate real performance gains from noise.
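To see why, here is a rough back-of-the-envelope sketch (not a Libretto feature, just the standard sampling error for a pass/fail metric) of how much a measured pass rate can wobble at different test set sizes:

```typescript
// Approximate 95% margin of error for a measured pass rate, assuming each
// test case is an independent pass/fail observation.
function marginOfError(passRate: number, numTestCases: number): number {
  const standardError = Math.sqrt((passRate * (1 - passRate)) / numTestCases);
  return 1.96 * standardError;
}

// With a true pass rate around 80%:
console.log((marginOfError(0.8, 30) * 100).toFixed(1));  // ~14.3 points of noise
console.log((marginOfError(0.8, 200) * 100).toFixed(1)); // ~5.5 points of noise
```

Under these assumptions, a real improvement of a few percentage points is hard to distinguish from roughly fourteen points of noise at 30 test cases, which is why larger test sets pay off.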
Choose Evaluation Metrics
Once you have some good test cases loaded, we need to decide how to evaluate the LLM outputs. For more straightforward prompts, like sentiment analysis or named entity extraction, you may simply want to do an exact comparison of the test case's target output and the LLM output. For more complicated and generative prompts, though, you'll want to use more sophisticated evaluations.
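For that exact-comparison case, the check can be as simple as the following sketch (illustrative, not Libretto's implementation); normalizing whitespace and case avoids penalizing trivially different answers:

```typescript
// Illustrative exact-match check between the LLM output and the target output.
function exactMatch(llmOutput: string, targetOutput: string): boolean {
  const normalize = (text: string) => text.trim().toLowerCase();
  return normalize(llmOutput) === normalize(targetOutput);
}

console.log(exactMatch("  Negative\n", "negative")); // true
```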
Based on the latest research, Libretto supports a large number of methods to evaluate your prompts. You can update these at any time by navigating to the Eval Settings tab in the left-hand panel of a given Prompt Template. Newly initiated Test Runs will always use the most current configuration.
Auto-generated Evals
When a Prompt Template is created - whether through the UI or by sending events through the SDK - we will analyze your prompt and example outputs to dynamically determine which Evals seem like the best fit. These can be manually edited, or entirely re-generated at any time, within Eval Settings.
A large portion of the time, these will include LLM-as-Judge evals, where we leverage an LLM to score your outputs according to a generated rubric.
Format Response
Set up a regex or JSON format so the model's responses can be parsed and evaluated accurately.
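As a hedged example of what such a format check might do, the sketch below pulls a structured answer out of a raw response before it is scored; the regex and JSON shape are illustrative, not Libretto's actual configuration:

```typescript
// Illustrative: extract a structured verdict from a raw model response,
// preferring a JSON payload and falling back to a regex over plain text.
function extractSentiment(rawResponse: string): string | null {
  // Case 1: the model returned JSON like {"sentiment": "positive"}.
  try {
    const parsed = JSON.parse(rawResponse);
    if (typeof parsed.sentiment === "string") return parsed.sentiment.toLowerCase();
  } catch {
    // Not JSON; fall through to the regex.
  }
  // Case 2: pick the verdict out of free-form text.
  const match = rawResponse.match(/\b(positive|negative|neutral|ambivalent)\b/i);
  return match ? match[1].toLowerCase() : null;
}

console.log(extractSentiment('{"sentiment": "Positive"}')); // "positive"
console.log(extractSentiment("Overall this review reads as negative.")); // "negative"
```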
Metric Selection
Select the metrics you'd like your test cases to be evaluated against from a wide variety of options.
For Subjective evaluations, you can specify a phrase or a list of outcomes that an LLM will then use to judge the model's initial response. You can even create your own Scorecard system or submit a custom JavaScript function to evaluate the response.
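As a sketch of the kind of custom function you might submit (the exact signature Libretto expects may differ; treat the names and scoring convention here as assumptions), an evaluator could award partial credit for several properties of a response:

```typescript
// Hypothetical custom evaluator: the (response, targetOutput) signature and the
// 0-1 score convention are assumptions for illustration, not Libretto's contract.
function evaluateResponse(response: string, targetOutput: string): number {
  let score = 0;
  // Award partial credit for staying concise...
  if (response.length <= 200) score += 0.25;
  // ...for avoiding hedging boilerplate...
  if (!/as an ai language model/i.test(response)) score += 0.25;
  // ...and the remainder for containing the expected verdict.
  if (response.toLowerCase().includes(targetOutput.toLowerCase())) score += 0.5;
  return score;
}

console.log(evaluateResponse("This review is clearly negative.", "negative")); // 1
```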
Modifying Your Prompt Template
Now that you have test cases and an evaluation strategy set up, you can iterate and play with your prompt. Using the text editors within the Playground tab, you can easily update and adjust the text, format, and model of your prompt template.
Run Tests and Experiments
You can run tests and initiate Experiments within the Playground or Leaderboard.
Model Selection
Choose from a wide range of available models and providers using the drop-down or checklist.
Some model names, like "GPT-4o", are aliases that reference a more specific version of that provider's model; in that case, any experiments or tests will run against whatever version the alias currently points to. If you'd rather be sure you're running against a particular model version, select a dated model like "GPT-4o (version 2024-08-06)".
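The same alias-versus-snapshot distinction exists when you call a provider directly. As a rough illustration using the OpenAI Node SDK (assuming an OPENAI_API_KEY in your environment), a dated identifier pins the snapshot:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// "gpt-4o" is an alias the provider may repoint at newer snapshots over time;
// "gpt-4o-2024-08-06" requests one specific dated snapshot.
const response = await client.chat.completions.create({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: "Classify the sentiment of: 'The staff were lovely.'" }],
});

console.log(response.choices[0].message.content);
```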
Run Tests
Any time you modify your Prompt Template, you can run a new test with current test cases, model selection, and evaluation settings by clicking "Save & Run Test". This will also save the current prompt template so you can load it up in the future. You can also re-run any of these tests for a given prompt-template / model combination via "Re-run" buttons on the Playground or Leaderboard.
When you run a test, it will show up in the list on the right, and you can click through to see an overview and deep-dive into the results. You can also check multiple test runs and click the "Compare" button to quickly compare two different test scenarios.
Experiments
Perhaps one of the best ways to discover better versions of your prompts is Libretto's Experiments feature. You can start an Experiment via the "Run Experiment" button in the Playground.
First, choose which type of Experiment you'd like to run: Few-Shot, Magic Words, or LLM Model Versions. Then select the primary evaluation metric and click "Run Experiment". The primary evaluation metric is the Eval the Experiment will use to determine which prompts are performing better.
You can take a close look at the Variants created and their performance against your chosen metric by clicking on the respective Experiment entry in the right column of the Playground. From there, you can Promote individual Variants, which makes them available to view in the Playground and Leaderboard.
Test and Experiment Status
The Playground also provides a simple way to monitor the progress of any currently running tests or experiments. If you'd like to cancel one of these, simply click the X / Cancel button.
Prompt Template Comparison
The Leaderboard and Compare page provide a great way to evaluate the performance of different prompt versions and models.
Leaderboard
Found via the left-hand panel, the Leaderboard gives a meaningful overview of how the different prompt versions and models stack up according to the evaluation metrics you've chosen. From here, you can sort the results by a given metric, kick off new test runs and experiments, or navigate to other parts of the application with the desired context.
Compare
The Compare page gives a detailed view of how each version / model pairing performed against each individual test case. Assessing why a certain test case performed the way it did allows you to gain meaningful feedback on how to optimize the next version of your prompt.