Changelog - Libretto

Libretto is continually releasing new features, improvements, and bug fixes. Here's a list of our most important updates:

Changes in October 2024

Rate limiting for API events and prompt template creation
Better handling of user authentication and authorization
More accurate exact text matching for evaluating responses
Boolean scoring options for generation criteria
Ability to persist evaluations without creating test cases
Better handling of database connections and timeouts
Improved caching for embeddings and other operations
Revamped sign-in page and authentication flow
New projects dashboard interface
Better parameter display and test case management
Improved organization management features

Changes in September 2024

Improved performance and loading times across the application, particularly for prompts pages with many prompts
Added ability to hide matching or passing results in comparison views for clearer analysis
Enhanced test case generation capabilities, including support for chat history variables
Added support for Perplexity 3.1 models and deprecated the 3.0 versions
Improved CSV exports with clearer column headers and more descriptive filenames
Added tooltips showing better/worse distributions in comparison views
Added compact view toggle to show more data in tables
Added copy buttons for easy copying of parameters and responses
Improved visualization of complex response chains with better formatting and tooltips
Added tracking of user access events and logging
Fixed various UI issues, including tooltip display and table scrolling behavior

Added new evaluation types including Exact Text Match, Toxicity Check, and LLM-as-Judge to help assess prompt outputs
Improved chain monitoring interface to better track and visualize sequences of prompt calls
Enhanced test case management with better error handling and the ability to edit test parameters
Updated model defaults to use GPT-4o mini for better performance
Added ability to view human grades alongside LLM grades in test results
Improved JSON output display and formatting across the application
Added floating support button for easier access to help
Enhanced date picker interface on the production calls page
Fixed various UI issues, including scrolling behavior and layout alignment
Added tooltips and better error messages throughout the interface
Improved handling of chat history in prompts
Added new Terms of Service page

Added support for viewing and grading test cases with multimodal content (images + text)
Improved grading workflow UI with better scrolling, navigation between test cases, and starting at the first ungraded item
Added ability to auto-generate evaluation criteria outside of prompt creation flow
Added pagination controls for browsing assistant threads, making it easier to navigate between conversations
Improved prompt editing experience with unsaved changes warning when navigating away
Added support for Google's Gemini 1.5 models with chat-style messaging
Enhanced JSON result viewing with better expansion controls and formatting
Fixed issues with test case generation and evaluation flows
Improved handling of tool calls and function responses in test cases
Added ability to share grading results between organization members
Updated criteria scoring to better handle both positive and negative rubrics
Fixed various UI bugs related to scrolling, navigation and data display

Added Terms of Service requirement for users
Added support for the Claude 3.5 Sonnet model
Added ability to use fetch in JavaScript evaluations
Improved test case generation:
- Now generates more test cases by default
- Better spacing and alignment of UI elements
- Added "Generate Test Cases" button for empty states
Enhanced evaluation features:
- New alignment score formula for auto-calibration
- Improved progress tracking during calibration
- Added ability to create new judge evaluations
- Added tooltips showing criteria and rubric details
Fixed several model-related issues:
- Fixed temperature display in test summaries
- Fixed Llama 3 instruction handling
- Fixed context window sizes for models
- Improved handling of Anthropic model details
UI improvements:
- Cleaner navigation styling
- Better error displays in test case tables
- Added support for pasting plain text in editors
Added support for Anthropic tool use in playground
Improved bulk upload functionality to support chat history variables

Added support for OpenAI's Assistants API Threads, enabling more complex conversational interactions
Added support for GPT-4o (gpt-4o), OpenAI's vision-capable model, for both regular prompts and assistants
Added Perplexity models as new model options
Improved test case management:
- Added ability to edit and review test cases in bulk
- Added columns showing individual variables in test case tables
- Made target outputs optional for test cases
- Prevented duplicate test case creation
Enhanced project organization:
- Added ability to archive projects and prompts
- Added 404 pages for invalid projects/prompts
- Improved project metrics and filtering
UI improvements:
- Added version dates in playground version selector
- Improved table layouts and pagination
- Added documentation link in header
- Fixed various UI glitches and animation issues
Added ability to export test cases with full argument details
Added safeguards to prevent concurrent test runs of the same scenario

Added a new Leaderboard page to compare performance across different prompt versions and models
Improved test case management:
- Added ability to suggest test case outputs automatically
- Added validation for function names and parameters
- Fixed issues with test case argument editing and generation
- Made test case list more organized with better sorting and filtering
Enhanced model support:
- Added GPT-4 Turbo and GPT-4 Turbo 2024-04-09 models
- Added Llama 3 support via Groq and Replicate
Improved experiment workflow:
- Fixed issues with experiment status updates and cancellation
- Added better handling of partial test run updates
- Improved performance of experiment calculations
UI improvements:
- Made playground layout more spacious and improved scrolling
- Improved formatting of numbers and classification results
- Added confirmation dialog for file deletion
- Fixed various layout and scrolling issues
Performance optimizations:
- Improved pagination and polling logic
- Optimized database queries and caching
- Reduced payload sizes for large test runs
- Added better handling of rate limits