Working with Prompt Templates
Leaderboard and Template Comparison
Now that we've discovered some new versions of our prompts and have completed a few successful test runs, we can compare them to decide which is best to use in production. Libretto currently offers a few ways to do so, primarily the Leaderboard and the Compare page.
Leaderboard
Navigable via the left-hand menu, the Leaderboard displays the results of the latest test run for each of the version/model pairings you've created.
Here you can quickly sort and compare by any eval metric the templates have been run against. The top performer in a given metric is highlighted in green.
Capabilities
Similar to the Playground, you can:
- run Experiments
- re-run tests
- view a more detailed breakdown of the last test run
- copy the prompt
Compare
For a given set of test runs, the Compare page provides both an overview of how each included version/model pairing performed on the selected eval metrics, and a breakdown of how each test case was evaluated.
Studying individual test case performance can provide crucial insight into exactly why a certain version/model pairing didn't perform as well as expected.
Export
If you'd like to export the results of the included test runs, you can download a CSV file by clicking the button above the test case breakdown section.
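Once exported, the results can be analyzed with whatever tooling you prefer. The snippet below is a minimal sketch of aggregating scores by version/model pairing; the file name and column names (`version`, `model`, `score`) are assumptions, so adjust them to match the headers in your exported file.

```python
import csv
from collections import defaultdict

# Minimal sketch: average eval scores per version/model pairing from an
# exported test run. Column names are assumptions, not Libretto's actual
# export schema -- check the header row of your CSV.
scores = defaultdict(list)

with open("test_run_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["version"], row["model"])
        scores[key].append(float(row["score"]))

for (version, model), values in sorted(scores.items()):
    avg = sum(values) / len(values)
    print(f"{version} / {model}: mean score {avg:.3f} over {len(values)} test cases")
```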
LLM-as-Judge Grading
If your prompt has LLM-as-Judge evals, a button appears in the header that opens the grading and calibration interface, which you can use to tune judge alignment.
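Judge alignment is, at its core, a measure of how often the LLM judge's verdicts agree with your own grades. The sketch below illustrates the idea with illustrative labels; it is not part of Libretto's interface, just a way to see what calibration is optimizing for.

```python
# Illustrative sketch: judge alignment as simple agreement between human
# grades and LLM-judge verdicts on the same test cases. Labels are made up.
human_grades = ["pass", "fail", "pass", "pass", "fail"]
judge_grades = ["pass", "fail", "fail", "pass", "fail"]

agreement = sum(h == j for h, j in zip(human_grades, judge_grades)) / len(human_grades)
print(f"Judge agreement: {agreement:.0%}")  # 80% in this toy example
```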