- Install [uv](https://github.com/astral-sh/uv)
- Clone this repo:

  ```bash
  git clone https://github.com/Giskard-AI/phare
  ```

- Install the requirements:

  ```bash
  uv sync
  source .venv/bin/activate
  ```
- Set up secrets: running the benchmark requires API tokens for calling the different models. The following environment variables are expected:

  ```
  OPENAI_API_KEY
  GEMINI_API_KEY
  ANTHROPIC_API_KEY
  OPENROUTER_API_KEY
  ```
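One way to provide these tokens is to export them in your shell before running the scripts (the values below are placeholders):

```bash
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export OPENROUTER_API_KEY="..."
```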
To set up the benchmark, simply run:

```bash
python 01_setup_benchmark.py --config_path <path_to_config>.yaml --save_path <path_to_save_benchmark>.db
```
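For example, using the config file described below and an arbitrary output path (the `.db` file name is just an illustration):

```bash
python 01_setup_benchmark.py --config_path benchmark_config.yaml --save_path phare_benchmark.db
```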
The Hugging Face repository and the path to the files for each submodule should be set in `benchmark_config.yaml`, under the `hf_dataset` and `data_path` keys.
Each category should have the following structure:

```yaml
name: <category_name>
hf_dataset: giskardai/phare
data_path: <path_to_data>
tasks:
  - name: <task_name>
    scorer: <scorer_name>
    type: <task_type>
    description: <task_description>
```
Each task should provide a name, type, description, and its associated scorer. `data_path` should point to the folder in the Hugging Face repository containing the JSONL files for each task.
For example, in the `giskardai/phare` repository, using `hallucination/debunking` as the `<path_to_data>` with `misconceptions` as `<task_name>` points to the `hallucination/debunking/misconceptions.jsonl` file.
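Put together, that example corresponds to a category entry along these lines (the category name is inferred from the path, and the scorer, type, and description values are placeholders, not the actual ones used in the repository):

```yaml
name: hallucination
hf_dataset: giskardai/phare
data_path: hallucination/debunking
tasks:
  - name: misconceptions
    scorer: <scorer_name>
    type: <task_type>
    description: <task_description>
```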
Inside the JSONL files, each line should have the following format:

```json
{
    "id": "question_uuid",
    "messages": [{"role": "user", "content": "..."}, ...],
    "metadata": {
        "task": "category_name/task_name",
        "language": "en"
    },
    "evaluation_data": {
        ...
    }
}
```
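As a quick sanity check, a record can be inspected with a few lines of Python; the file path follows the example above and the field names come from the format just described:

```python
import json

# Each line of the JSONL file is one self-contained record.
with open("hallucination/debunking/misconceptions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["id"], record["metadata"]["task"], record["metadata"]["language"])
```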
To add a new task, follow these steps:

- Add it in the `benchmark_config.yaml` file, with the correct `data_path` and a list of tasks.
- Implement the required scorers used in the tasks of the category in the `scorers` folder and add them to `SCORERS` inside `scorers/get_scorer.py` (see the sketch after this list).
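The sketch below illustrates the idea; the scorer signature, the `"expected_answer"` key, and the shape of `SCORERS` are assumptions for illustration, so mirror one of the existing scorers in the `scorers` folder for the actual contract:

```python
# scorers/exact_match.py -- hypothetical scorer module.

def exact_match_scorer(model_output: str, evaluation_data: dict) -> float:
    """Score 1.0 when the expected answer appears in the model output.

    The signature and the "expected_answer" key are assumptions; check
    the existing scorers for the real interface.
    """
    expected = evaluation_data.get("expected_answer", "")
    return float(bool(expected) and expected in model_output)


# In scorers/get_scorer.py, register the scorer under the name used in
# benchmark_config.yaml (SCORERS is assumed to map names to callables):
SCORERS = {
    "exact_match": exact_match_scorer,
}
```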
To add a new model, simply add it in the `benchmark_config.yaml` file, under the `models` key. You can also change the evaluation models in the `evaluation_models` key.
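The relevant section of `benchmark_config.yaml` might then look like the sketch below; the exact entry format is an assumption, so copy the format of the entries already present in the file:

```yaml
models:
  - <existing_model>
  - <your_new_model>
evaluation_models:
  - <evaluation_model>
```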
To run the benchmark, simply run:

```bash
python 02_run_benchmark.py <path_to_benchmark.db> --max_evaluations_per_task <int>
```
The `max_evaluations_per_task` argument is optional; it sets the maximum number of evaluations per task.
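For example, to cap each task at 50 evaluations (both the database name and the cap are arbitrary illustrations):

```bash
python 02_run_benchmark.py phare_benchmark.db --max_evaluations_per_task 50
```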