Real-time Monitoring

Events and Monitoring

The Libretto Events dashboard lets you review all of your LLM calls. Depending on your SDK configuration, these will either be tied to a specific Project or a specific prompt template.
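For example, with the Python SDK the association is typically made when you configure the client: supplying only your Libretto API key ties calls to the Project, while also naming a prompt template ties them to that template. The snippet below is a minimal sketch; the package, class, and parameter names (libretto_openai, LibrettoConfig, prompt_template_name) are assumptions here, so check the Python SDK reference for the exact interface.

    # Minimal sketch of instrumenting an OpenAI client so calls show up in Libretto.
    # NOTE: the import, class, and parameter names below are assumptions for
    # illustration; consult the Python SDK reference for the exact interface.
    from libretto_openai import OpenAI, LibrettoConfig

    client = OpenAI(
        libretto=LibrettoConfig(
            api_key="YOUR_LIBRETTO_API_KEY",        # ties calls to your Project
            prompt_template_name="welcome-email",   # optional: ties calls to this prompt template
        ),
    )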

On any Project or Prompt Template page, you can navigate to the Events dashboard by selecting the Production Calls tab in the left-hand navigation.


Reviewing Events

As your code calls the Libretto SDK and LLM responses are returned, the dashboard table populates in real time. Here you'll see the templatized arguments and response, as well as the call latency, date & time, and template version.
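Continuing the sketch above, a call made through the instrumented client, with its template variables passed alongside the prompt, is what surfaces in the table as templatized arguments next to the response. As before, the per-call libretto options and template_params names are assumptions rather than a guaranteed API.

    # Sketch: a templatized call whose arguments and response would appear in Events.
    # The LibrettoCreateParams and template_params names are assumptions.
    from libretto_openai import LibrettoCreateParams

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Write a welcome email for Ada."}],
        libretto=LibrettoCreateParams(
            # the variables used to render the prompt, surfaced as
            # templatized arguments in the dashboard
            template_params={"name": "Ada"},
        ),
    )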

Grouping

Using the toolbar at the top of the table, you can group calls by Call Type, a specific Group, environment, or date.

Rate or Improve

By clicking on the Actions menu in the right-most column, you can rate the LLM's response or enter an improved version. These are stored as Feedback, just as if they had been sent via the Libretto SDK or OpenAI API.
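The same feedback can also be recorded from code. The sketch below assumes a send_feedback helper and a feedback_key associated with the original call; both names are assumptions for illustration, so check the SDK reference for the actual feedback interface.

    # Sketch: recording a rating and an improved response programmatically.
    # send_feedback and feedback_key are assumed names for illustration only.
    from libretto_openai import send_feedback

    send_feedback(
        api_key="YOUR_LIBRETTO_API_KEY",
        feedback_key="feedback-key-from-the-original-call",
        rating=0.25,                                    # low rating for a weak response
        better_response="A corrected answer goes here.",
    )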

Add Test Cases

Find a production call that would make a good test case? Using the right-hand menu, select 'Add Test Case' or, if you'd like to make changes, select 'Edit and Add Test Case' to add the instance to your test set.

Sourcing test cases from real data not only speeds up reaching a sufficient number of test cases, but is also an excellent way to ensure your test set resembles your actual usage.


Grading

Good test-case candidates can be hard to spot in production data by hand. Libretto can run automated grading on a sample of production calls from a chosen time period. To access this feature, click Grade Suggested Test Cases, select a past time period to sample calls from, and start it via Start Grader.

Grading Process

We leverage the classification capabilities of LLMs (much like some of your own prompts may) to output a discrete Grade for each selected call. By default, we use a scorecard that consists of "Great", "Good", "Poor", and "Terrible".
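As an illustration of the technique (not Libretto's internal grading pipeline), a grader like this can be a single classification prompt constrained to the scorecard labels. The model choice and prompt wording below are assumptions.

    # Illustration only: an LLM-based grader that maps a call's prompt and response
    # to one of the scorecard labels. This sketches the general technique, not
    # Libretto's internal implementation.
    from openai import OpenAI

    SCORECARD = ["Great", "Good", "Poor", "Terrible"]
    grader_client = OpenAI()

    def grade_call(prompt: str, response: str) -> str:
        result = grader_client.chat.completions.create(
            model="gpt-4o-mini",  # model choice is an assumption
            messages=[
                {
                    "role": "system",
                    "content": "Grade the assistant response to the user's prompt. "
                               "Reply with exactly one of: " + ", ".join(SCORECARD),
                },
                {"role": "user", "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}"},
            ],
        )
        grade = (result.choices[0].message.content or "").strip()
        # Anything outside the scorecard falls back to manual review.
        return grade if grade in SCORECARD else "Pending"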

Review Grading

By default, all grades submitted by the system are labeled Pending until a user reviews them. Once the grading system has produced pending grades, you or a team member can review the grades for a given batch of calls.

You can either suggest an improved response for the LLM or accept the response as is.

From here, you can select approved calls to become test cases.
