Releases: letta-ai/letta-evals

letta-evals: v0.7.0

04 Nov 19:37
b3df7d8

0.7.0 (2025-11-04)

Features

  • Evaluate multiple models with letta-code (#113) (f98933b)
  • Support multiple graders in gates with weighted average and logical combinations (#117) (d0d0add)
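The multi-grader gate feature above combines several grader scores into a single pass/fail decision via a weighted average or a logical combination. As an illustrative sketch in plain Python (not the letta-evals API; all names here are hypothetical), the two combination modes might look like:

```python
# Hypothetical sketch: gate a sample on the weighted average of several
# grader scores, or on a logical (all-must-pass) combination.
def weighted_average_gate(scores, weights, threshold):
    """Pass if the weight-normalized average score meets the threshold."""
    total_weight = sum(weights.values())
    avg = sum(scores[name] * w for name, w in weights.items()) / total_weight
    return avg >= threshold

def all_pass_gate(scores, per_grader_thresholds):
    """Logical AND: every grader must individually meet its threshold."""
    return all(scores[name] >= t for name, t in per_grader_thresholds.items())

scores = {"accuracy": 0.9, "style": 0.6}
print(weighted_average_gate(scores, {"accuracy": 0.7, "style": 0.3}, 0.75))  # True (avg = 0.81)
print(all_pass_gate(scores, {"accuracy": 0.8, "style": 0.7}))                # False (style < 0.7)
```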

Documentation

  • Add examples for multiple metric gates (#118) (739e4de)
  • Add reference to multi grader gate example in top level README (#119) (7ad732d)
  • Clean up README of Claude references (#114) (a41f9d0)

letta-evals: v0.6.1

30 Oct 21:12
54b1e96

0.6.1 (2025-10-30)

letta-evals: v0.6.0

30 Oct 01:40
c3800bb

0.6.0 (2025-10-29)

Refactors

  • Add extra vars to Sample (#100) (9ce87cd)
  • Make target spec a discriminated union (#103) (762b520)
  • Refactor AgentTarget to LettaAgentTarget (#98) (952d6f8)
  • Refactor Target to AbstractAgentTarget (#99) (bb632e0)

letta-evals: v0.5.0

23 Oct 01:16
cb1bc98

0.5.0 (2025-10-23)

Features

  • Add agent_id to visualization (#91) (2f77348)
  • Add agent-as-judge support for rubric grading (#77) (ae4878e)
  • Add summary tables on suite finish for all display types (rich, simple) (#92) (19e1e1c)
  • Support anthropic models as grader (#83) (c38cf1f)
  • Support default Letta judge agent with new letta_judge grader kind (#86) (b4bfd6c)

Bug Fixes

  • Add defensive check for run_id from streaming chunk (#75) (7d34884)
  • Add pre-fill trick for Anthropic json output (#84) (2a4fd4a)
  • Fix retry logic for failing agent (#74) (ecd5d5a)
  • Fix typo in chunk appending (#82) (d888474)
  • Use OpenRouter for Kimi (#76) (2cd6192)
  • Print out chunks on run_id error (#81) (8b71a64)

Refactors

  • Flatten package imports for easier pip usage (#89) (24fd61a)
  • Rename rubric to model_judge (#87) (0047d3f)
  • Use Pydantic discriminated union for GraderSpec types (#88) (bb21f1b)
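The discriminated-union refactor above lets Pydantic dispatch validation on a tag field instead of trying each variant in turn. A minimal sketch of the technique (the class and field names here are hypothetical, not the actual letta-evals `GraderSpec` types):

```python
from typing import Literal, Union

from pydantic import BaseModel, Field

# Hypothetical grader spec variants; the real GraderSpec types differ.
class ExactMatchSpec(BaseModel):
    kind: Literal["exact_match"] = "exact_match"

class ModelJudgeSpec(BaseModel):
    kind: Literal["model_judge"] = "model_judge"
    model: str = "gpt-5-mini"

GraderSpec = Union[ExactMatchSpec, ModelJudgeSpec]

class SuiteConfig(BaseModel):
    # The "kind" discriminator routes validation to exactly one variant
    # and produces precise errors for unknown kinds.
    grader: GraderSpec = Field(discriminator="kind")

cfg = SuiteConfig.model_validate({"grader": {"kind": "model_judge", "model": "gpt-4o"}})
print(type(cfg.grader).__name__)  # ModelJudgeSpec
```

The payoff is that config files only need a `kind` key, and validation errors point at the one variant that was selected rather than at every variant in the union.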

letta-evals: v0.4.1

21 Oct 20:45
499d616

0.4.1 (2025-10-21)

Bug Fixes

  • Use rubric grader for file system task (#63) (03ef537)

letta-evals: v0.4.0

21 Oct 17:35
0a76789

0.4.0 (2025-10-21)

letta-evals: v0.3.2

20 Oct 22:33
453a773

0.3.2 (2025-10-20)

Features

  • Add visualization library and simple visualization configurations (#55) (57d483d)

letta-evals: v0.3.1

20 Oct 21:10
cb2ce16

0.3.1 (2025-10-20)

letta-evals: v0.3.0

20 Oct 20:13
9c7dff9

0.3.0 (2025-10-20)

Features

  • Add filesystem benchmark generator (#48) (c2be72a)
  • Add max samples to display (#47) (c821171)
  • Add memory block built-in extractor (#51) (d171a10)
  • Remove hardcoding metric to accuracy (#45) (62f41c4)
  • Support passing in handles instead of just model configs (#41) (927e70a)
  • Use gpt-5-mini as rubric grader model (#46) (4a4a0ff)

Bug Fixes

  • Fix "cannot access local variable 'stream'" error (#33) (a989a16)
  • Expunge send_message and disable tool rules (7775596)
  • Fix streaming bug that returned partial results (8d3e3d8)
  • Fix model, status, and metric columns after evaluation completes (#34) (b47c508)
  • Update leaderboard task suites (#35) (436ce6f)

Refactors

  • Support passing in token, base_url, and project_id programmatically (#36) (1e3780a)

Documentation

  • Add README for memory block extraction (#52) (49ccf25)

Chores

  • Configurable retries and timeout (9ee97b6)
  • Report average metrics across attempted and total samples (#50) (f2b4f7a)
  • Separate files for headers and summary (#49) (9f3dbcc)
  • Update examples to use letta_v1_agent (#31) (62a6ab6)
  • Update model configs (91afcb8)

letta-evals: v0.2.0

15 Oct 22:23
353fe54

0.2.0 (2025-10-15)

Features

  • Add builtin tool output/arguments extractors (5133966)
  • Add core memory update benchmark (#21) (e0261de)
  • Add filesystem eval (#24) (55b902d)
  • Add letta leaderboard (#8) (ae68e22)
  • Add model configs and multi-model runners (cf6a707)
  • Add programmatic agent creation (fd820a2)
  • Add support for re-grading cached evaluation trajectories (#19) (8feab64)
  • Clean up results.json schema (#18) (ec72e2b)
  • Flatten directories further (759638c)
  • Implement decorator based custom functions (e53ee19)
  • Refactor to use TaskGroups (9bc810b)
  • Support custom extractors/Python tool evaluators (fc1b2f7)
  • Support multiple metrics (#27) (b0fa023)
  • Support relative paths for custom graders (6f3cda3)
  • Support streaming for stability (ef18ef6)
  • Update together configs (#20) (79d7890)

Chores

  • Prepare repo for PyPI publishing (#28) (6ad92b2)
  • Rename ideal to ground truth (9367c05)