No longer maintained
Toolkit for collecting datasets for Agents and Planning models and running evaluation pipelines.
Install the dependencies:

```shell
pip install -r requirements.txt
```

We use the Hydra library for the evaluation pipeline. The pipeline configuration is specified in `eval.yaml`:
```yaml
# @package _global_
hydra:
  job:
    name: ${agent.name}_${agent.model_name}_[YOUR_ADDITIONAL_TOKEN_OR_NOTHING]
  run:
    dir: [YOUR_PATH_TO_OUTPUT_DIR]/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]

defaults:
  - _self_
  - data_source: hf
  - env: code_engine
  - agent: planning
```

Here you define the data source, environment, and agent you want to evaluate. We provide several implementations for each, defined in sub-configs:
| Field | Options |
|---|---|
| `data_source` | `hf.yaml` |
| `env` | `code_engine.yaml`, `http.yaml`, `few_shot.yaml` |
| `agent` | `few_shot.yaml`, `planning.yaml`, `vanilla.yaml`, `reflexion.yaml`, `tree_of_thoughts.yaml`, `adapt.yaml` |
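As a rough illustration, an agent sub-config such as `planning.yaml` might look like the sketch below. The `name` and `model_name` fields are interpolated into the Hydra job name in `eval.yaml` above; the remaining fields are assumptions for illustration, so consult the actual config files for the real schema:

```yaml
# Hypothetical sketch of an agent sub-config. Only `name` and `model_name`
# are grounded in eval.yaml's interpolations; the other fields are assumptions.
name: planning
model_name: gpt-4-1106-preview
temperature: 0.0
max_iterations: 5
```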
The challenge is to generate a project template: a small compilable project, describable in 1-5 sentences, that contains small examples of all the mentioned libraries, technologies, and functionality.
The dataset of template-related repositories collected from GitHub is published on HuggingFace 🤗. Details about the dataset collection and the source code are located in the `template_generation` directory.
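If you want to inspect the data directly, a minimal sketch using the HuggingFace `datasets` library is shown below; the dataset identifier and split are placeholders, so substitute the actual repository name from the dataset card on the Hub:

```python
# Minimal sketch: loading the published dataset with the HuggingFace
# `datasets` library. The dataset ID and split are placeholders --
# replace them with the actual values from the dataset card.
from datasets import load_dataset

dataset = load_dataset("YOUR_ORG/YOUR_TEMPLATE_DATASET", split="train")
print(dataset[0])  # one template-related repository record
```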
To run the evaluation pipeline, execute the following command in your console:

```shell
python3 -m src.template_generation.run_eval --multirun agent=planning agent.model_name=gpt-3.5-turbo-1106,gpt-4-1106-preview
```
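Hydra's `--multirun` flag sweeps over the comma-separated override values, launching one evaluation job per model. A single run with different overrides might look like the sketch below, assuming any of the sub-configs listed in the table above can be swapped in on the command line:

```shell
# Hypothetical single-run invocation: pick one option per config group.
python3 -m src.template_generation.run_eval agent=vanilla env=http data_source=hf
```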
| Model | Metrics |
|---|---|