- `Job.description`. Optional human-readable description field on `Job`.
- `Job.imagePrefix` / `Job.image_prefix`. Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.
- Tag-based filtering. New `TagFilter` model with `include_tags` and `exclude_tags`, used at two levels:
  - `Job.taskFilters` / `Job.task_filters`: select tasks by metadata tags
  - `Job.sampleFilters` / `Job.sample_filters`: select samples by metadata tags
- `JobTask.args`. Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.
- `Task.systemMessage` / `Task.system_message`. System prompt override at the task level.
- `Task.sandboxParameters` / `Task.sandbox_parameters`. Pass-through dictionary for sandbox plugin configuration.
- `Task.files` / `Task.setup`. Task-level file and setup declarations. Task-level `files` stack with sample-level `files` (sample wins on key conflict). Sample-level `setup` overrides task-level `setup`.
- Variant `task_parameters`. Variants can now declare `task_parameters`, an arbitrary dict merged into the task config at runtime.
- `module:task` syntax. Task function references can now use the `module.path:function_name` format for Python tasks.
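
The `files` / `setup` precedence described above can be sketched in a few lines of Python. This is an illustrative sketch, not the package's actual resolver: the helper names `merge_files` and `resolve_setup` and the example paths are hypothetical, and treating only a missing (`None`) sample-level `setup` as "not set" is an assumption.

```python
from typing import Optional


def merge_files(task_files: Optional[dict], sample_files: Optional[dict]) -> dict:
    """Stack task-level and sample-level {destination: source} maps.

    Task entries are applied first; a sample entry with the same
    destination key replaces the task entry.
    """
    merged = dict(task_files or {})
    merged.update(sample_files or {})
    return merged


def resolve_setup(task_setup: Optional[str], sample_setup: Optional[str]) -> Optional[str]:
    """Return the sample-level setup command if one is given, else the task-level one."""
    return sample_setup if sample_setup is not None else task_setup


# Hypothetical task-level and sample-level declarations.
task_files = {
    "/workspace/pubspec.yaml": "./base/pubspec.yaml",
    "/workspace/lib/main.dart": "./base/main.dart",
}
sample_files = {"/workspace/lib/main.dart": "./samples/bug_01/main.dart"}

effective_files = merge_files(task_files, sample_files)
effective_setup = resolve_setup("dart pub get", None)
```

Here `effective_files` keeps the task-level `pubspec.yaml` entry but takes `main.dart` from the sample, and `effective_setup` falls back to the task-level command because the sample declares none.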
- `Task.taskFunc` → `Task.func`. Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.
- Sandbox registry is now configurable. The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.
- Removed `workspace` and `tests` from task and sample YAML. Replaced by `files` (a `{destination: source}` map) and `setup` (a shell command string). These are Inspect AI's native `Sample` fields. The old `workspace:` / `tests:` keys and their path/git/template sub-formats are no longer supported.
- Consolidated sandbox config. `Job.sandboxEnvironment`, `Job.sandboxParameters`, and `Job.imagePrefix` collapsed into a single `Job.sandbox` map (keys: `environment`, `parameters`, `image_prefix`).
- Consolidated Inspect AI eval arguments. Individual top-level Job fields (`retryAttempts`, `failOnError`, `logLevel`, `maxTasks`, etc.) collapsed into a single `Job.inspectEvalArguments` / `Job.inspect_eval_arguments` pass-through dict.
- `inspect_task_args` is now a pass-through dict. Individual sub-fields (`model`, `epochs`, `time_limit`, etc.) are no longer typed on the `Task` model. The entire `inspect_task_args` section is passed through as-is to Inspect AI's `Task()` constructor.
- Removed `JobTask.systemMessage`. System message is now set at the task level via `Task.systemMessage`.
- Variant field renames. `context_files` → `files`, `skill_paths` → `skills`. Variant-level task restriction uses `include-variants` / `exclude-variants` on the job's `tasks.<id>` object instead of task-level `allowed_variants`.
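
For anyone migrating job files by hand, the sandbox consolidation above amounts to folding three flat fields into one map. The sketch below shows that reshaping on a plain dict. It is not part of either config package: the function name `consolidate_sandbox` is hypothetical, and the legacy snake_case key names (`sandbox_environment`, `sandbox_parameters`) are assumed from the camelCase field names listed above.

```python
def consolidate_sandbox(job: dict) -> dict:
    """Return a copy of `job` with legacy flat sandbox keys folded into `sandbox`.

    Keys already present under `sandbox` take precedence over legacy keys.
    """
    # Assumed legacy YAML keys mapped to the keys of the new `sandbox` map.
    legacy_to_new = {
        "sandbox_environment": "environment",
        "sandbox_parameters": "parameters",
        "image_prefix": "image_prefix",
    }
    migrated = {k: v for k, v in job.items() if k not in legacy_to_new}
    sandbox = dict(job.get("sandbox") or {})
    for old_key, new_key in legacy_to_new.items():
        if old_key in job and new_key not in sandbox:
            sandbox[new_key] = job[old_key]
    if sandbox:
        migrated["sandbox"] = sandbox
    return migrated


# Hypothetical job using the old flat layout.
old_job = {
    "description": "nightly run",
    "sandbox_environment": "docker",
    "image_prefix": "registry.example.com/evals/",
}
new_job = consolidate_sandbox(old_job)
```

After the call, `new_job` carries a single `sandbox` map with `environment` and `image_prefix` keys, and the flat keys are gone.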
- Added `docs/reference/yaml_config.md` with complete field-by-field reference tables.
- Updated `docs/reference/configuration_reference.md` with new examples and directory structure.
- Updated `docs/guides/config.md`.
- `dataset_config_python` package. Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency is required; it can be installed standalone by any team that needs to parse eval config YAML.
- Renamed `dataset_config` → `dataset_config_dart`. The Dart config package was renamed for clarity alongside the new Python package.
- Renamed `dash_evals_config` → `dataset_config_python`. The Python config package was renamed from its original name for consistency with the Dart package.
- `eval_config` Dart package. New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides a `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.
- Dual-mode eval runner. The Python runner now supports two invocation modes:
  - `run-evals --json ./eval_set.json`: consume a JSON manifest produced by the Dart CLI
  - `run-evals --task <name> --model <model>`: run a single task directly from CLI arguments
- Generalized task functions. Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.
- New Dart domain models. `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.
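
The two invocation modes of the runner can be illustrated with a small `argparse` sketch. Only the `--json`, `--task`, and `--model` flags come from the notes above. The function names, the mode labels, and the rule that the two modes cannot be mixed are assumptions made for illustration, not the runner's actual behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags of both invocation modes."""
    parser = argparse.ArgumentParser(prog="run-evals")
    parser.add_argument("--json", dest="json_path", help="path to an EvalSet JSON manifest")
    parser.add_argument("--task", help="name of a single task to run")
    parser.add_argument("--model", help="model to run the task against")
    return parser


def select_mode(argv: list) -> tuple:
    """Return (mode, options) for the given argument list.

    Exits with a usage error when the arguments match neither mode.
    """
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: manifest mode and direct mode are mutually exclusive.
    if args.json_path and (args.task or args.model):
        parser.error("--json cannot be combined with --task/--model")
    if args.json_path:
        return "manifest", {"path": args.json_path}
    if args.task and args.model:
        return "direct", {"task": args.task, "model": args.model}
    parser.error("provide either --json, or both --task and --model")


manifest_mode = select_mode(["--json", "./eval_set.json"])
direct_mode = select_mode(["--task", "code_gen", "--model", "example-model"])
```

`manifest_mode` selects the JSON manifest path, while `direct_mode` carries the task and model names straight from the command line. `example-model` is a placeholder, not a real model identifier.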
- Removed Python `registries.py`. Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).
- Removed `TaskConfig` and `SampleConfig`. Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model).
- Removed legacy Python config parsing. The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.
- Models package reorganized. Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.
- Dataset utilities moved. `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.
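
Dynamic discovery via `importlib` generally follows the pattern below. This is a generic sketch of resolving a `module.path:function_name` reference, not the runner's own lookup code: `resolve_task_func` is a hypothetical name, the standard-library reference `json:dumps` stands in for a real task function, and the way short names such as `"flutter_code_gen"` are mapped to modules is not shown because these notes do not describe it.

```python
import importlib
from typing import Callable


def resolve_task_func(ref: str) -> Callable:
    """Import the module named before ':' and return the attribute named after it."""
    module_path, sep, func_name = ref.partition(":")
    if not sep or not module_path or not func_name:
        raise ValueError(f"expected 'module.path:function_name', got {ref!r}")
    module = importlib.import_module(module_path)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{ref!r} does not refer to a callable")
    return func


# A standard-library function used as a stand-in for a task function.
dumps = resolve_task_func("json:dumps")
encoded = dumps([1, 2])
```

Because the module is imported only when the reference is resolved, no central registry has to list every task up front.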
- Variant format changed from list to named map. Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.

  Before (list format):

  ```yaml
  variants:
    - baseline
    - { mcp_servers: [dart] }
  ```

  After (named map format):

  ```yaml
  # job.yaml
  variants:
    baseline: {}
    mcp_only: { mcp_servers: [dart] }
    context_only: { context_files: [./context_files/flutter.md] }
    full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
  ```

  ```yaml
  # task.yaml (optional; omit to accept all job variants)
  allowed_variants: [baseline, mcp_only]
  ```
- Removed `DEFAULT_VARIANTS` registry. Variants are no longer defined globally in `registries.py`. Each job file defines its own variants.
- Removed `variants` from `JobTask`. Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists.
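
The interaction between a job's named variants and a task's `allowed_variants` whitelist can be sketched as a simple filter. The function name `applicable_variants` is hypothetical, and silently ignoring whitelist names that the job does not define is an assumption; the real resolver may report them as errors.

```python
from typing import Optional


def applicable_variants(job_variants: dict, allowed: Optional[list]) -> dict:
    """Return the job variants a task accepts, preserving the job's order.

    A task without `allowed_variants` (None) accepts every job variant.
    """
    if allowed is None:
        return dict(job_variants)
    allowed_names = set(allowed)
    return {name: cfg for name, cfg in job_variants.items() if name in allowed_names}


# Variants as defined in the named-map job.yaml example.
job_variants = {
    "baseline": {},
    "mcp_only": {"mcp_servers": ["dart"]},
    "context_only": {"context_files": ["./context_files/flutter.md"]},
    "full": {"context_files": ["./context_files/flutter.md"], "mcp_servers": ["dart"]},
}

selected = applicable_variants(job_variants, ["baseline", "mcp_only"])
everything = applicable_variants(job_variants, None)
```

With the `task.yaml` whitelist from the example, `selected` contains only `baseline` and `mcp_only`; a task that omits the key runs under all four variants.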