
Pipe through evalspecs w/ git revision, packages, and solver/model/task args for leaderboard #24


Merged: 10 commits merged into main from jbragg/lb-view-source-url on Jul 24, 2025

Conversation

@jbragg (Collaborator) commented Jul 17, 2025

Prepare for https://github.com/allenai/astabench-issues/issues/277 and similar uses.

The agenteval.json file gets a new field that looks like the following (I haven't yet tested publishing / updating the HF schema):

  "eval_spec": {
      "solver": "path/to/solvers/perplexity_base.py@perplexity_solver",
      "solver_args": "{}",
      "model": "perplexity/sonar-deep-research",
      "model_args": "{}",
      "task_args": "{\"litqa_args\": [], \"litqa_kwargs\": {}}",
      "revision": {
        "type": "git",
        "origin": "[email protected]:org/repo.git",
        "commit": "abcd123"
      },
      "packages": "{\"inspect_ai\": \"0.3.106\"}"
    }

I would be open to pruning back the piped solver/model and their args, but figured those might be useful as part of a pointer to the agent source / run command. (For now, in the leaderboard viewer I have only explicitly constructed a source URL pointing to the repo at the git revision, which doesn't include that information.)

UPDATE: I also piped through task_args and packages to get better visibility into whether submissions have been run in a standardized/correct way.

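As a sketch, the serialized field could correspond to pydantic models along these lines (field names taken from the JSON above; the class names are hypothetical, not the actual agenteval classes):

```python
# Hypothetical pydantic sketch of the eval_spec payload shown above.
# The *_args and packages fields are JSON-encoded strings in serialized form.
from pydantic import BaseModel


class Revision(BaseModel):
    type: str    # e.g. "git"
    origin: str  # e.g. "[email protected]:org/repo.git"
    commit: str  # e.g. "abcd123"


class EvalSpec(BaseModel):
    solver: str       # "path/to/solvers/perplexity_base.py@perplexity_solver"
    solver_args: str  # JSON-encoded dict, e.g. "{}"
    model: str        # "perplexity/sonar-deep-research"
    model_args: str   # JSON-encoded dict
    task_args: str    # JSON-encoded dict
    revision: Revision
    packages: str     # JSON-encoded dict of pinned package versions
```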
@jbragg force-pushed the jbragg/lb-view-source-url branch from 691f426 to 5444efc (July 17, 2025 20:01)
@jbragg force-pushed the jbragg/lb-view-source-url branch from 5444efc to 33de052 (July 17, 2025 20:09)
@regan-huff (Contributor) commented:

Do you know if lb view still works if you mix files with and without these fields in the same dataset? Do we need to keep results from before and after the change separate?

@jbragg (Collaborator, Author) commented Jul 17, 2025

@regan-huff My understanding is that if we update the schema with only new fields, as in this PR, then we don't need to keep files with/without those fields separate, and the ones with missing fields should get None. We could test by updating the schema to include these new fields and confirming that nothing breaks.

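That matches how the Arrow-backed datasets library unions records with different keys; a quick sketch with hypothetical rows:

```python
# Hypothetical check: mix a pre-change and a post-change record and confirm
# the old one surfaces the new field as None.
from datasets import Dataset

rows = [
    {"agent_name": "old-agent"},  # pre-change record, no eval_spec
    {"agent_name": "new-agent",
     "eval_spec": {"solver": "s", "model": "m"}},
]
ds = Dataset.from_list(rows)  # Arrow unifies the schema across rows
print(ds[0]["eval_spec"])  # None for the record missing the field
```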
@regan-huff (Contributor) commented:

OK, I ran a test, and what I observed is that I can mix the data, but I had to delete the existing README to get the schema to update with upload_summary_to_hf. Not surprisingly, the new data would not load under the old schema.

(test-agent-eval) 16:01:10 ~/src/agent-eval lb-view-source-url $ agenteval lb view --repo-id allenai/asta-bench-test-results --config 1.0.0-dev1 --split validation
README.md: 100% 2.34k/2.34k [00:00<00:00, 2.52MB/s]
(…)aloney-sandwich_2025-07-17T23-01-09.json: 100% 38.5k/38.5k [00:00<00:00, 25.7MB/s]
Generating validation split: 100% 6/6 [00:00<00:00, 188.49 examples/s]
                              id                                             Agent Agent description User/organization Submission date Logs                                                     Source Openness Agent tooling                                                            LLM base  Overall Overall cost lit cost data cost code cost discovery cost
2025-07-17 23:01:09.064808+00:00                                  baloney-sandwich                               Regan      2025-07-17 None https://github.com/allenai/asta-bench-private/tree/2ac7433   Closed      Standard                                           [gemini/gemini-2.0-flash] 0.115244         None     None      None      None           None
2025-06-09 21:03:14.658669+00:00           Basic ReAct (task tools, report editor)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.012516         None     None      None      None           None
2025-06-09 21:05:49.252775+00:00 Basic ReAct (task tools, report editor w/ submit)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.010538         None     None      None      None           None
2025-06-09 21:04:02.400490+00:00              Basic ReAct (task tools, no editors)                            miked-ai      2025-06-09 None                                                       None     None          None      [gpt-4o-2024-08-06, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.008756         None     None      None      None           None
2025-06-09 21:02:33.921472+00:00            Basic ReAct (task tools, table editor)                            miked-ai      2025-06-09 None                                                       None     None          None [gpt-4o-mini-2024-07-18, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.004971         None     None      None      None           None
2025-06-09 21:04:59.642327+00:00  Basic ReAct (task tools, table editor w/ submit)                            miked-ai      2025-06-09 None                                                       None     None          None [gpt-4o-mini-2024-07-18, meta-llama/Llama-4-Scout-17B-16E-Instruct] 0.001100         None     None      None      None           None

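For reference, deleting the dataset card can also be scripted with huggingface_hub rather than done by hand (a sketch; repo id as in the test above):

```python
# Workaround from the test above: drop the dataset card (README.md) so the
# stored schema is regenerated from the new data on the next upload.
from huggingface_hub import HfApi

HfApi().delete_file(
    path_in_repo="README.md",
    repo_id="allenai/asta-bench-test-results",
    repo_type="dataset",
    commit_message="Drop README so the schema can be re-inferred",
)
```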
@jbragg force-pushed the jbragg/lb-view-source-url branch from 987fe75 to 9d21b7b (July 17, 2025 23:22)
@jbragg force-pushed the jbragg/lb-view-source-url branch from 9d21b7b to b74f079 (July 17, 2025 23:25)
@regan-huff (Contributor) commented:

I'm seeing the init of LeaderboardViewer (which calls EvalResult.model_validate) fail when pointed at existing data once EvalSpec is moved under TaskResult, so I think if the leaderboard picked up this change right now, it would break.

[Screenshot, 2025-07-17: validation error from EvalResult.model_validate]

What's the rationale for changing the structure of EvalResult?

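A minimal repro of that failure mode (hypothetical simplified models, not the actual agenteval classes):

```python
# Old records have no eval_spec under each task result, so validating them
# against the new layout raises a ValidationError.
from pydantic import BaseModel, ValidationError


class EvalSpec(BaseModel):
    solver: str


class TaskResult(BaseModel):
    task_name: str
    eval_spec: EvalSpec  # newly required at this level


class EvalResult(BaseModel):
    results: list[TaskResult]


old_record = {"results": [{"task_name": "litqa"}]}  # pre-change shape
try:
    EvalResult.model_validate(old_record)
except ValidationError as err:
    print(err)  # results.0.eval_spec: Field required
```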
@jbragg (Collaborator, Author) commented Jul 17, 2025

I pushed a fix for that error (I should have tested; thanks).

> What's the rationale for changing the structure of EvalResult?

This PR serializes the eval_spec information somewhere in EvalResult so that the lb viewer can reconstruct details about how the agent was run, such as the source code. eval_specs used to be a field on the top-level EvalResult that was excluded from serialization. I have now moved it inside each TaskResult, since there is a 1:1 mapping (this seemed cleaner than the original implementation in this PR, which kept eval_specs as a field on EvalResult and would have required associating entries with task names and joining).

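One plausible shape for that fix (a sketch reusing the hypothetical simplified models from above, not the actual agenteval code): making the relocated field optional lets pre-change records keep validating.

```python
# With eval_spec optional, old records load with None while new records
# carry the spec on each task result.
from typing import Optional

from pydantic import BaseModel


class EvalSpec(BaseModel):
    solver: str


class TaskResult(BaseModel):
    task_name: str
    eval_spec: Optional[EvalSpec] = None  # None for pre-change records


class EvalResult(BaseModel):
    results: list[TaskResult]


old = EvalResult.model_validate({"results": [{"task_name": "litqa"}]})
assert old.results[0].eval_spec is None  # old data still validates
```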
@jbragg changed the title from "Pipe through evalspecs and git revision for leaderboard" to "Pipe through evalspecs w/ git revision, packages, and solver/model/task args for leaderboard" (Jul 21, 2025)
@jbragg mentioned this pull request (Jul 23, 2025)
@regan-huff merged commit bd4a897 into main (Jul 24, 2025)
3 checks passed
@regan-huff deleted the jbragg/lb-view-source-url branch (July 24, 2025 19:10)
@regan-huff (Contributor) commented:

View at:
https://pypi.org/project/agent-eval/0.1.18/
Successfully published version 0.1.18
