Key Features
Real-Time Monitoring and Insights
Libretto offers a familiar and flexible SDK that monitors and stores calls to your preferred LLM provider without slowing them down. These calls can then be viewed on Libretto's dashboards, where you can study LLM interactions in real time across a variety of views.
For an overview of how to get started, explore our Quickstart guide.
Libretto SDK
We offer TypeScript and Python SDKs that act as wrappers for OpenAI's API, and we are actively working on adding more providers. If you need to wrap another provider, let us know at hello@getlibretto.com.
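To give a sense of the shape of the integration, here is a minimal sketch in TypeScript. The package name and the `wrapOpenAI` helper are assumptions used only for illustration; see the Quickstart and SDK guide for the exact API.

```typescript
// Minimal sketch, not the exact Libretto API: the package name and the
// `wrapOpenAI` helper are assumptions used only to show the shape of the integration.
import OpenAI from "openai";
import { wrapOpenAI } from "@libretto/openai"; // hypothetical import

// Wrap a standard OpenAI client so that every call is reported to Libretto
// in the background, without adding latency to the request itself.
const client = wrapOpenAI(
  new OpenAI({ apiKey: process.env.OPENAI_API_KEY }), // your provider key
  { apiKey: process.env.LIBRETTO_API_KEY }            // your Libretto key
);

// The wrapped client is used exactly like the stock OpenAI client.
const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Summarize this article in one sentence." }],
});
console.log(completion.choices[0].message.content);
```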
Configuration
Our SDK uses your own provider API key together with a Libretto API key. In addition to the parameters offered by the OpenAI API, we provide options to associate your LLM calls with a given prompt template and to redact personally identifiable information (PII).
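Continuing the sketch above, the per-call configuration might look something like the following. The `libretto` option block and its field names are illustrative assumptions, not the documented parameter names.

```typescript
// Illustrative only: the `libretto` option block and its field names are
// assumptions about the wrapper, not documented parameters.
// `client` is the wrapped OpenAI client from the earlier sketch.
const articleText = "…full article body…";

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `Write a headline for: ${articleText}` }],
  // Hypothetical extra options understood by the Libretto wrapper:
  libretto: {
    promptTemplateName: "headline-generator", // associate this call with a prompt template
    templateParams: { article: articleText }, // the variables filled into that template
    redactPii: true,                          // strip personal information before storage
  },
} as any); // cast because these options extend the stock OpenAI request type
```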
Feedback
Of course, the answer provided by an LLM is not always correct or ideal, so it helps to know proactively when an LLM is misbehaving in production. Using the Feedback option, you can record user feedback from your site or app in Libretto and review it in the Libretto dashboard. This lets you leverage user responses to develop new tests and further improve your prompts.
There are two types of feedback: explicit and implicit. Explicit feedback comes from controls, such as thumbs up and thumbs down buttons, that let your users rate the LLM response. Implicit feedback comes from user actions that show whether they liked the LLM response. For example, if an LLM suggests the title of a news article and the user edits it, you can send that change back to Libretto as a signal that the answer was not ideal, along with the corrected title the user chose. For more information, check out the in-depth SDK guide.
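As a sketch of what recording both kinds of feedback could look like: the `sendFeedback` helper, the `feedbackKey` field, and the parameter names here are assumptions, so see the in-depth SDK guide for the real calls.

```typescript
// Sketch only: `sendFeedback` and its fields are assumed names, not the documented SDK surface.
import { sendFeedback } from "@libretto/openai"; // hypothetical import

const feedbackKey = "id-returned-with-the-original-call"; // ties feedback to a specific call
const editedTitle = "The title the user typed instead";

// Explicit feedback: the user clicked thumbs-down on the generated title.
await sendFeedback({
  apiKey: process.env.LIBRETTO_API_KEY,
  feedbackKey,
  rating: 0, // 0 = thumbs down, 1 = thumbs up
});

// Implicit feedback: the user edited the suggested title, so send the
// corrected text along with the negative signal.
await sendFeedback({
  apiKey: process.env.LIBRETTO_API_KEY,
  feedbackKey,
  rating: 0,
  betterResponse: editedTitle, // the response the user actually chose
});
```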
Calls
The Libretto Project and Prompt Template dashboards let you review both a summary of your LLM calls and the individual calls themselves. Depending on your SDK configuration, calls will be tied either to a specific Project or to a specific prompt template.
Prompt Template Detection
If you forgo assigning a Prompt Template to a call directly, we will analyze your prompt and automatically create a new Prompt Template or assign the call to an existing one. If we detect changes to your prompt, we will record those versions, which you can then review in the playground.
Flagged Calls
As your calls come in, Libretto runs a set of classifications to detect problematic content in either the user input or the LLM response.
These include jailbreak attempts, refusals by the LLM, toxicity, and LlamaGuard safety violations. You can see an aggregate count of these flagged calls on the Project Dashboard and a more detailed breakdown on the Prompt Template Dashboard.
Scored Calls
Once your evals are manually configured or auto-generated, Libretto will automatically run them against a sample of your traffic (roughly 1 in 20 calls). These calls and their evaluation results can be viewed in the Scored tab.
Add to Test Cases
Found a unique prompt that wasn't covered in your test set? When you click on an event, you can easily add the user's prompt template arguments and the LLM's output as a test case, making your test set more robust. If the LLM flubbed the response, you can click "Edit & Add to Tests" to supply a better response for the test case.
Rating & Grading
Provide your own feedback on responses by rating each response directly in the UI, or use a custom scorecard to automate the grading of your calls.
Drift
A real concern for any LLM application is Model Drift, where the responses of a given model provider change significantly without any clear indication of how or why. When you've reached the traffic threshold for auto-generating test cases outlined above, Libretto will create a separate Drift Dashboard for that Prompt Template.
We start by running all test cases a large number of times against the model you used in order to establish a baseline response. From there, we run those test cases again every day and measure the variance between the baseline response and the new ones. We aggregate and chart those variations to determine whether model drift is likely.
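As a conceptual sketch of the idea (not Libretto's actual implementation), drift can be quantified by embedding the baseline and daily responses and tracking how far each day's responses move away from the baseline:

```typescript
// Conceptual sketch, not Libretto internals: quantify drift by comparing
// embeddings of today's responses against embeddings of the baseline responses.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}

// For each of today's responses, find its closest baseline response; the drift
// score is how far, on average, today's responses sit from that baseline.
function driftScore(baselineEmbeddings: number[][], todayEmbeddings: number[][]): number {
  let totalSimilarity = 0;
  for (const today of todayEmbeddings) {
    totalSimilarity += Math.max(
      ...baselineEmbeddings.map((base) => cosineSimilarity(base, today))
    );
  }
  return 1 - totalSimilarity / todayEmbeddings.length; // 0 = identical, higher = more drift
}
```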
You can view our conclusion at a glance in the Project Dashboard, where each Prompt Template will be labeled as No Drift, Drift Possible, or Drift Likely. If you click on the label, you can reach the Drift Dashboard for a deeper dive into the results.