Events and Monitoring
The Libretto Events dashboard lets you review all of your LLM calls. Depending on your SDK configuration, calls are tied either to a specific Project or to a specific prompt template.
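For example, with the Python SDK this binding typically comes from how the client is configured. The snippet below is a minimal sketch only; the `libretto_openai` import path and the `LibrettoConfig` parameter names are assumptions, so check the SDK reference for the exact interface.

```python
# Minimal sketch: the libretto_openai wrapper and its parameter names
# (api_key, prompt_template_name) are assumptions for illustration;
# consult the Libretto SDK docs for the exact interface.
import os

from libretto_openai import OpenAI, LibrettoConfig  # assumed import path

client = OpenAI(
    libretto=LibrettoConfig(
        api_key=os.environ["LIBRETTO_API_KEY"],  # associates calls with your Libretto Project
        prompt_template_name="welcome-email",    # associates calls with a specific prompt template
    ),
)

# Calls made through this client go to OpenAI as usual and are
# logged to the Libretto Events dashboard in the background.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short welcome email for a new user."}],
)
print(response.choices[0].message.content)
```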
On any Project or Prompt Template page, you can navigate to the Events dashboard by selecting the Production Calls tab in the left-hand navigation.
Reviewing Events
As your code calls the Libretto SDK and LLM responses are returned, the dashboard table populates in real time. Here you'll see the templatized arguments and the response, along with the call latency, date and time, and template version.
Grouping
Using the toolbar at the top of the table, you can group calls by Call Type, a specific Group, environment, or date.
Rate Or Improve
By clicking the Actions menu in the right-most column, you can rate the LLM's response or enter an improved version. These are stored as Feedback, just as if they had been sent via the Libretto SDK or OpenAI API.
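If you prefer to send feedback from code rather than from the dashboard, it looks roughly like the sketch below. The `send_feedback` helper, its parameters, and the way the feedback key is obtained are all assumptions here rather than the confirmed SDK API; verify the real signature in the SDK reference.

```python
from libretto_openai import send_feedback  # assumed helper; verify against the SDK docs

# The feedback key identifies the logged call; how you obtain it
# (e.g. from the response object or event metadata) depends on the SDK.
feedback_key = "example-feedback-key"

send_feedback(
    api_key="YOUR_LIBRETTO_API_KEY",
    feedback_key=feedback_key,
    rating=1,  # assumed scale, e.g. thumbs up / thumbs down
    better_response="An improved answer written by a human reviewer.",  # optional improvement
)
```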
Add Test Cases
Find a production call that would make a good test case? Using the right-hand menu, select 'Add Test Case' or, if you'd like to make changes, select 'Edit and Add Test Case' to add the instance to your test set.
Sourcing test cases from real data not only speeds up building a sufficiently large test set, it is also an excellent way to ensure your test set resembles your actual usage.
Grading
It can be difficult to spot the best test case candidates in your production data. To help, Libretto can run automated grading on a sample of production calls from a chosen time period. Access this feature by clicking Grade Suggested Test Cases, selecting a past time period to sample calls from, and clicking Start Grader.
Grading Process
We leverage the classification capabilities of LLMs - much as some of your own prompts may - to assign a discrete Grade to each selected call. By default, we use a scorecard consisting of "Great", "Good", "Poor", and "Terrible".
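As a rough illustration of this classification pattern (not Libretto's actual grader), here is a sketch using the standard OpenAI Python SDK; the model choice, prompt wording, and fallback behavior are assumptions.

```python
# Illustration of the general LLM-as-grader pattern: ask a model to
# classify a logged call into one discrete grade from a fixed scorecard.
from openai import OpenAI

SCORECARD = ["Great", "Good", "Poor", "Terrible"]

client = OpenAI()

def grade_call(prompt: str, llm_response: str) -> str:
    """Classify a prompt/response pair into a single discrete grade."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are grading LLM outputs. Reply with exactly one of: "
                + ", ".join(SCORECARD),
            },
            {
                "role": "user",
                "content": f"Prompt:\n{prompt}\n\nResponse:\n{llm_response}\n\nGrade:",
            },
        ],
    )
    grade = result.choices[0].message.content.strip()
    # Fall back to a conservative grade if the model strays from the scorecard.
    return grade if grade in SCORECARD else "Poor"

print(grade_call("Summarize our refund policy.", "Refunds are available within 30 days."))
```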
Review Grading
By default, all grades submitted by the system are labeled Pending until user review. Once the grading system has produced pending grades, you or a team member can review the grades for a given batch of calls.
You can either suggest an improved response for the LLM or accept the response as is.
From here, you can select approved calls to become test cases.