Integration between Inspect and Weights & Biases, including support for both the Models API for experiment tracking, and Weave for evaluation analysis and transcripts.
Check out this brief demo video for an overview of Inspect WandB.
If you prefer to read, there is a tutorial on the Inspect WandB docs site.
Inspect WandB can be installed with:
```
pip install inspect-wandb
```

To install the optional Weave extra:

```
pip install inspect-wandb[weave]
```

Once Inspect WandB is installed in an environment authenticated with Weights & Biases (either by running `wandb login` or by setting `WANDB_API_KEY`), the integration is enabled by default for subsequent Inspect runs. The Inspect logger output will link to the Models dashboard, where you can track runs, and, if you have enabled the `weave` extra, to the Weave dashboard, where you can visualise eval results.
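As a concrete sketch of the authentication step, either of the following works (the API key value is a placeholder):

```shell
# Interactive login (prompts for and stores your W&B credentials):
wandb login

# Or non-interactively, e.g. in CI environments:
export WANDB_API_KEY="<your-api-key>"
```
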
Some configuration options are available, including adjusting the wandb config, setting tags, and adjusting Weave trace naming. To dive deeper into Inspect WandB, please see the documentation at https://inspect-wandb.readthedocs.io/en/latest/
The following are some examples of the types of data that can be automatically logged to W&B when Inspect WandB is enabled:
The Models integration tracks each Inspect eval or eval-set run as a WandB run. This is useful both as a shared source of truth for which evals have been run and for storing the exact configuration of each run so it can be faithfully reproduced later.
Inspect evals tracked in W&B Runs table
Reproduction information tracked in a W&B Run, including Inspect metadata
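No code changes are needed to enable tracking: once Inspect WandB is installed and authenticated, an ordinary Inspect invocation is logged as a W&B run. A minimal sketch, assuming a hypothetical task file `my_task.py`:

```shell
# A standard Inspect eval; with inspect-wandb installed, this run appears
# in the W&B Runs table automatically (task file name is illustrative):
inspect eval my_task.py --model openai/gpt-4o-mini
```
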
The Weave integration traces Inspect evaluations, allowing you to track and analyse the performance of different models across multiple tasks, visualise and compare result sets, and dig into individual transcripts.
Table of Inspect evaluations with score summaries in Weave
Trace tree of an Inspect task, with the main solver transcript selected for a given sample
Comparison of performance on AgentHarm between Claude Sonnet 4 and GPT-4o mini
Please see our contributing guidelines if you'd like to contribute to Inspect WandB.
We welcome all feedback; the best way to get in touch to discuss the project is the #inspect_wandb channel in the Inspect Community Slack.
This project was primarily developed by DanielPolatajko, Qi Guo, and Matan Shtepel, and supervised by Justin Olive. It was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub.