Releases: letta-ai/letta-evals
Releases · letta-ai/letta-evals
letta-evals: v0.7.0
letta-evals: v0.6.1
letta-evals: v0.6.0
0.6.0 (2025-10-29)
Features
- add eval website (#104) (2daaf0c)
- Fix workflow (#105) (62207bd)
- Support letta code as builtin target (#101) (fe1ae2f)
Bug Fixes
- Fix kwargs in run function (#97) (c0f64b0)
- Remove duplicate gpt 4.1 results (#95) (2b9092c)
- Update Sonnet 4.5 cost (#96) (3b7a39d)
Refactors
- Add extra vars to Sample (#100) (9ce87cd)
- Make target spec a discriminated union (#103) (762b520)
- Refactor AgentTarget to LettaAgentTarget (#98) (952d6f8)
- Refactor Target to AbstractAgentTarget (#99) (bb632e0)
Documentation
- Add letta code example to READMEs (#102) (2fe0f68)
- adjust (2b16cd3)
- patch svg, update site (64c3491)
Chores
letta-evals: v0.5.0
0.5.0 (2025-10-23)
Features
- Add agent_id to visualization (#91) (2f77348)
- Add agent-as-judge support for rubric grading (#77) (ae4878e)
- Add summary tables on suite finish for all display types (rich, simple) (#92) (19e1e1c)
- Support anthropic models as grader (#83) (c38cf1f)
- Support default Letta judge agent with new
letta_judgegrader kind (#86) (b4bfd6c)
Bug Fixes
- Add defensive check for run_id from streaming chunk (#75) (7d34884)
- Add pre-fill trick for Anthropic json output (#84) (2a4fd4a)
- Fix retry logic for failing agent (#74) (ecd5d5a)
- Fix typo in chunks appending (#82) (d888474)
- OpenRouter for Kimi (#76) (2cd6192)
- Print out chunks on run_id error (#81) (8b71a64)
Performance Improvements
Refactors
- Flatten package imports for easier pip usage (#89) (24fd61a)
- Rename
rubrictomodel_judge(#87) (0047d3f) - Use Pydantic discriminated union for GraderSpec types (#88) (bb21f1b)
Documentation
letta-evals: v0.4.1
letta-evals: v0.4.0
letta-evals: v0.3.2
letta-evals: v0.3.1
letta-evals: v0.3.0
0.3.0 (2025-10-20)
Features
- Add filesystem benchmark generator (#48) (c2be72a)
- Add max samples to display (#47) (c821171)
- Add memory block built-in extractor (#51) (d171a10)
- Remove hardcoding metric to accuracy (#45) (62f41c4)
- Support passing in handles instead of just model configs (#41) (927e70a)
- Use
gpt-5-minias rubric grader model (#46) (4a4a0ff)
Bug Fixes
- Cannot access local variable 'stream' error (#33) (a989a16)
- Expunge send_message and disable tool rules (7775596)
- Fix streaming bug returns partial results (8d3e3d8)
- Model, status and metric columns after evaluation completes (#34) (b47c508)
- Update leaderboard task suites (#35) (436ce6f)
Refactors
Documentation
Chores
letta-evals: v0.2.0
0.2.0 (2025-10-15)
Features
- Add builtin tool output/arguments extractors (5133966)
- add core memory update benchmark (#21) (e0261de)
- add filesystem eval (#24) (55b902d)
- add letta leaderboard (#8) (ae68e22)
- Add model configs and multi-model runners (cf6a707)
- Add programmatic agent creation (fd820a2)
- Add support for re-grading cached evaluation trajectories (#19) (8feab64)
- Clean up results.json schema (#18) (ec72e2b)
- Flatten directories further (759638c)
- Implement decorator based custom functions (e53ee19)
- Refactor to use TaskGroups (9bc810b)
- Support custom extractors/Python tool evaluators (fc1b2f7)
- support multiple metrics (#27) (b0fa023)
- Support relative paths for custom graders (6f3cda3)
- Support streaming for stability (ef18ef6)
- update together configs (#20) (79d7890)