
Pipe through evalspecs w/ git revision, packages, and solver/model/task args for leaderboard #24


Merged: 10 commits merged into main from jbragg/lb-view-source-url on Jul 24, 2025

Conversation

@jbragg (Collaborator) commented Jul 17, 2025

Prepare for https://github.com/allenai/astabench-issues/issues/277 and similar uses.

The agenteval.json file gets a new field that looks like the following (I haven't yet tested publishing / updating the HF schema):

  "eval_spec": {
      "solver": "path/to/solvers/perplexity_base.py@perplexity_solver",
      "solver_args": "{}",
      "model": "perplexity/sonar-deep-research",
      "model_args": "{}",
      "task_args": "{\"litqa_args\": [], \"litqa_kwargs\": {}}",
      "revision": {
        "type": "git",
        "origin": "[email protected]:org/repo.git",
        "commit": "abcd123"
      },
      "packages": "{\"inspect_ai\": \"0.3.106\"}"
    }

I would be open to pruning back the piped solver/model and their args, but figured those might be useful as part of a pointer to the agent source / run command. (For now, in the leaderboard viewer I have only explicitly constructed a source URL pointing to the repo at the git revision, which doesn't include that information.)

UPDATE: I also piped through task_args and packages to get better visibility into whether submissions have been run in a standardized/correct way.

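As a sketch, the serialized field could correspond to pydantic models along these lines (field names taken from the JSON above; the class names are hypothetical, not the actual agenteval classes):

```python
# Hypothetical pydantic sketch of the eval_spec payload shown above.
# The *_args and packages fields are JSON-encoded strings in serialized form.
from pydantic import BaseModel


class Revision(BaseModel):
    type: str    # e.g. "git"
    origin: str  # e.g. "[email protected]:org/repo.git"
    commit: str  # e.g. "abcd123"


class EvalSpec(BaseModel):
    solver: str       # "path/to/solvers/perplexity_base.py@perplexity_solver"
    solver_args: str  # JSON-encoded dict, e.g. "{}"
    model: str        # "perplexity/sonar-deep-research"
    model_args: str   # JSON-encoded dict
    task_args: str    # JSON-encoded dict
    revision: Revision
    packages: str     # JSON-encoded dict of pinned package versions
```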
@jbragg force-pushed the jbragg/lb-view-source-url branch from 691f426 to 5444efc (July 17, 2025 20:01)
@jbragg force-pushed the jbragg/lb-view-source-url branch from 5444efc to 33de052 (July 17, 2025 20:09)
@regan-huff (Contributor) commented:

Do you know if lb view still works if you mix files with and without these fields in the same dataset? Do we need to keep results from before and after the change separate?

@jbragg (Collaborator, Author) commented Jul 17, 2025

@regan-huff My understanding is that if we update the schema with only new fields, as in this PR, then we don't need to keep files with/without those fields separate, and the ones with missing fields should get None. We could test by updating the schema to include these new fields and confirming that nothing breaks.

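That matches how the Arrow-backed datasets library unions records with different keys; a quick sketch with hypothetical rows:

```python
# Hypothetical check: mix a pre-change and a post-change record and confirm
# the old one surfaces the new field as None.
from datasets import Dataset

rows = [
    {"agent_name": "old-agent"},  # pre-change record, no eval_spec
    {"agent_name": "new-agent",
     "eval_spec": {"solver": "s", "model": "m"}},
]
ds = Dataset.from_list(rows)  # Arrow unifies the schema across rows
print(ds[0]["eval_spec"])  # None for the record missing the field
```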
@regan-huff (Contributor) commented:

OK, I ran a test, and what I observed is that I can mix the data, but I had to delete the existing README to get the schema to update with upload_summary_to_hf. Not surprisingly, the new data would not load under the old schema.

(test-agent-eval) 16:01:10 ~/src/agent-eval lb-view-source-url $ agenteval lb view --repo-id allenai/asta-bench-test-results --config 1.0.0-dev1 --split validation
README.md: 100% 2.34k/2.34k [00:00<00:00, 2.52MB/s]
(…)aloney-sandwich_2025-07-17T23-01-09.json: 100% 38.5k/38.5k [00:00<00:00, 25.7MB/s]
Generating validation split: 100% 6/6 [00:00<00:00, 188.49 examples/s]
                              id                                             Agent Agent description User/organization Submission date Logs                                                     Source Openness Agent tooling                                                            LLM base  Overall Overall cost lit cost data cost code cost discovery cost
2025-07-17 23:01:09.064808+00:00                                  baloney-sandwich                               Regan      2025-07-17 None https://github.com/allenai/asta-bench-private/tree/2ac7433   Closed      Standard                                           [gemini/gemini-2.0-flash] 0.115244         None     None      None      None           None
2025-06-09 21:03:14.658669+00:00           Basic ReAct (task tools, report editor)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.012516         None     None      None      None           None
2025-06-09 21:05:49.252775+00:00 Basic ReAct (task tools, report editor w/ submit)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.010538         None     None      None      None           None
2025-06-09 21:04:02.400490+00:00              Basic ReAct (task tools, no editors)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.008756         None     None      None      None           None
2025-06-09 21:02:33.921472+00:00            Basic ReAct (task tools, table editor)                            miked-ai      2025-06-09 None                                                       None     None          None [gpt-4o-mini-2024-07-18, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.004971         None     None      None      None           None
2025-06-09 21:04:59.642327+00:00  Basic ReAct (task tools, table editor w/ submit)                            miked-ai      2025-06-09 None                                                       None     None          None [gpt-4o-mini-2024-07-18, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.001100         None     None      None      None           None

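For reference, deleting the dataset card can also be scripted with huggingface_hub rather than done by hand (a sketch; repo id as in the test above):

```python
# Workaround from the test above: drop the dataset card (README.md) so the
# stored schema is regenerated from the new data on the next upload.
from huggingface_hub import HfApi

HfApi().delete_file(
    path_in_repo="README.md",
    repo_id="allenai/asta-bench-test-results",
    repo_type="dataset",
    commit_message="Drop README so the schema can be re-inferred",
)
```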
@jbragg force-pushed the jbragg/lb-view-source-url branch from 987fe75 to 9d21b7b (July 17, 2025 23:22)
@jbragg force-pushed the jbragg/lb-view-source-url branch from 9d21b7b to b74f079 (July 17, 2025 23:25)
@regan-huff (Contributor) commented:

I'm seeing the init of LeaderboardViewer (which calls EvalResult.model_validate) fail when pointed at existing data once EvalSpec is moved under TaskResult, so I think if the leaderboard picked up this change right now, it would break.

[Screenshot, 2025-07-17: validation error from EvalResult.model_validate]

What's the rationale for changing the structure of EvalResult?

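A minimal repro of that failure mode (hypothetical simplified models, not the actual agenteval classes):

```python
# Old records have no eval_spec under each task result, so validating them
# against the new layout raises a ValidationError.
from pydantic import BaseModel, ValidationError


class EvalSpec(BaseModel):
    solver: str


class TaskResult(BaseModel):
    task_name: str
    eval_spec: EvalSpec  # newly required at this level


class EvalResult(BaseModel):
    results: list[TaskResult]


old_record = {"results": [{"task_name": "litqa"}]}  # pre-change shape
try:
    EvalResult.model_validate(old_record)
except ValidationError as err:
    print(err)  # results.0.eval_spec: Field required
```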
@jbragg (Collaborator, Author) commented Jul 17, 2025

I pushed a fix for that error (I should have tested; thanks).

> What's the rationale for changing the structure of EvalResult?

This PR serializes the eval_spec information somewhere in EvalResult so that the lb viewer can reconstruct details about how the agent was run, such as the source code. eval_specs used to be a field on the top-level EvalResult that was excluded from serialization. I have now moved it inside each TaskResult, since there is a 1:1 mapping (this seemed cleaner than the original implementation in this PR, which kept eval_specs as a field on EvalResult and would have required associating entries with task names and joining).

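One plausible shape for that fix (a sketch reusing the hypothetical simplified models from above, not the actual agenteval code): making the relocated field optional lets pre-change records keep validating.

```python
# With eval_spec optional, old records load with None while new records
# carry the spec on each task result.
from typing import Optional

from pydantic import BaseModel


class EvalSpec(BaseModel):
    solver: str


class TaskResult(BaseModel):
    task_name: str
    eval_spec: Optional[EvalSpec] = None  # None for pre-change records


class EvalResult(BaseModel):
    results: list[TaskResult]


old = EvalResult.model_validate({"results": [{"task_name": "litqa"}]})
assert old.results[0].eval_spec is None  # old data still validates
```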
@jbragg changed the title from "Pipe through evalspecs and git revision for leaderboard" to "Pipe through evalspecs w/ git revision, packages, and solver/model/task args for leaderboard" (Jul 21, 2025)
@jbragg mentioned this pull request (Jul 23, 2025)
@regan-huff merged commit bd4a897 into main (Jul 24, 2025)
3 checks passed
@regan-huff deleted the jbragg/lb-view-source-url branch (July 24, 2025 19:10)
@regan-huff (Contributor) commented:

View at:
https://pypi.org/project/agent-eval/0.1.18/
Successfully published version 0.1.18
