Working with Prompt Templates
Leaderboard and Template Comparison
Now that we've discovered some new versions of our prompts and have completed a few successful test runs, we can compare them to decide which is best to use in production. Libretto currently offers a few ways to do so, primarily the Leaderboard and the Compare page.
Leaderboard
Navigable via the left-hand menu, the Leaderboard displays the results of the latest test run for each of the version/model pairings you've created.
Here you can quickly sort and compare by any eval metric the templates have been run against. The top performer in a given metric is highlighted in green.
Capabilities
Similar to the Playground, you can:
- run Experiments
- re-run tests
- view a more detailed breakdown of the last test run
- copy the prompt
Compare
For a given set of test runs, the Compare page provides both an overview of how each included version/model pairing performed on the selected eval metrics, and a breakdown of how each test case was evaluated.
Studying individual test case performance can provide crucial insight into exactly why a certain version/model pairing didn't perform as well as expected.
Export
If you'd like to export the results of the included test runs, you can download a CSV file by clicking the button above the test case breakdown section.
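Once exported, the results can be analyzed with whatever tooling you prefer. The snippet below is a minimal sketch of aggregating scores by version/model pairing; the file name and column names (`version`, `model`, `score`) are assumptions, so adjust them to match the headers in your exported file.

```python
import csv
from collections import defaultdict

# Minimal sketch: average eval scores per version/model pairing from an
# exported test run. Column names are assumptions, not Libretto's actual
# export schema -- check the header row of your CSV.
scores = defaultdict(list)

with open("test_run_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["version"], row["model"])
        scores[key].append(float(row["score"]))

for (version, model), values in sorted(scores.items()):
    avg = sum(values) / len(values)
    print(f"{version} / {model}: mean score {avg:.3f} over {len(values)} test cases")
```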
LLM-as-Judge Grading
If your prompt has LLM-as-Judge evals, a button appears in the header that opens the grading and calibration interface, which you can use to tune judge alignment.
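Judge alignment is, at its core, a measure of how often the LLM judge's verdicts agree with your own grades. The sketch below illustrates the idea with illustrative labels; it is not part of Libretto's interface, just a way to see what calibration is optimizing for.

```python
# Illustrative sketch: judge alignment as simple agreement between human
# grades and LLM-judge verdicts on the same test cases. Labels are made up.
human_grades = ["pass", "fail", "pass", "pass", "fail"]
judge_grades = ["pass", "fail", "fail", "pass", "fail"]

agreement = sum(h == j for h, j in zip(human_grades, judge_grades)) / len(human_grades)
print(f"Judge agreement: {agreement:.0%}")  # 80% in this toy example
```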