This repository provides a lightweight, training-free framework for evaluating the smooth control of attribute intensity in text generated by Large Language Models (LLMs). It implements the core evaluation pipeline described in the paper: "Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs".
The toolkit allows you to quantify the intensity of a chosen attribute (e.g., formality, anger, conciseness) in a set of generated text responses by using an LLM as a judge and calculating Elo ratings based on pairwise comparisons.
The process is broken down into four main steps:
- Prepare Data: Prepare your queries and the corresponding model-generated responses. You can use the benchmark queries we provide or bring your own.
- Generate Pairs: Run a script to perform pairwise comparisons of your responses using an LLM judge (e.g., GPT-4o).
- Calculate Ratings: Process the comparison results to compute an Elo rating for each response.
- Analyze: Use the final ratings to measure the performance of your control method.
First, prepare the files in the `data/` directory:
- `data/queries.json`: Provided in the repository. It contains a list of benchmark queries, each with a unique `id` and `text`.
- `data/responses.json`: A JSON file containing the LLM-generated responses you want to evaluate. Each entry should have a unique `id`, the `query_id` it corresponds to, and the `text` of the response (an example layout is sketched below).
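For reference, here is a minimal sketch of how these files can be laid out. The field names follow the description above; the example queries, responses, and `id` values are purely illustrative.

```python
# Minimal sketch of the expected data layout; the example content is illustrative.
import json

queries = [
    {"id": "q0", "text": "Write an email asking a colleague for feedback on a draft."},
]

# One entry per generated response you want to evaluate.
responses = [
    {"id": "r0", "query_id": "q0", "text": "Hey, got a sec to look at my draft?"},
    {"id": "r1", "query_id": "q0",
     "text": "Dear colleague, I would greatly appreciate your feedback on the attached draft."},
]

with open("data/queries.json", "w") as f:
    json.dump(queries, f, indent=2)
with open("data/responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```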
The `scripts/generate_pairs.ipynb` notebook handles the comparison task.
- Configure: Before running, set your `OPENAI_API_KEY` as an environment variable. Customize the attribute you want to evaluate by editing the `ATTRIBUTE` variable (e.g., `"Formality"`, `"Happiness"`) and modify the prompt in `templates/comparison_prompt.txt` if needed.
- Run: The script sends pairs of responses to an LLM, which judges which response has a higher intensity of the specified attribute (a single judge call is sketched below). It uses an efficient Swiss-style pairing system to minimize API calls.
- Output: The results are saved to `data/pairwise.jsonl`.
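For illustration, here is a minimal sketch of what one judge call can look like, assuming the OpenAI Python client and GPT-4o as the judge. The inline prompt is only a stand-in for `templates/comparison_prompt.txt`, and the notebook's actual pairing and answer-parsing logic may differ.

```python
# Minimal sketch of one pairwise judgment; prompt text and parsing are stand-ins.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
ATTRIBUTE = "Formality"

def judge_pair(response_a: str, response_b: str) -> str:
    """Ask the LLM judge which response shows a higher intensity of ATTRIBUTE.

    Returns "A" or "B". The prompt below stands in for templates/comparison_prompt.txt.
    """
    prompt = (
        f"Which response expresses a higher level of {ATTRIBUTE}?\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with a single letter, A or B."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()[:1].upper()

# Append one comparison record to data/pairwise.jsonl.
record = {"id_a": "r0", "id_b": "r1",
          "winner": judge_pair("Hey, got a sec?", "Dear colleague, I would appreciate your feedback.")}
with open("data/pairwise.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```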
Next, run the `scripts/calculate_elo.ipynb` notebook.
- Run: This script reads the pairwise comparison data from `data/pairwise.jsonl`.
- Output: It calculates the final Elo rating for each response (a simplified version of the update is sketched below) and saves the results in two formats:
  - `data/elo_ratings.json`: a JSON file with detailed ratings.
  - `data/elo_ratings.csv`: a CSV file sorted by rating for easy analysis.
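For intuition, here is a minimal sketch of a sequential Elo update over the pairwise records. The record field names follow the earlier sketch; the notebook's actual constants, initialization, and fitting procedure may differ.

```python
# Minimal sketch of a sequential Elo update over data/pairwise.jsonl.
import json
from collections import defaultdict

K = 32                                  # update step size (illustrative value)
ratings = defaultdict(lambda: 1000.0)   # every response starts at the same rating

with open("data/pairwise.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        a, b = rec["id_a"], rec["id_b"]
        score_a = 1.0 if rec["winner"] == "A" else 0.0
        # Expected score of A under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        ratings[a] += K * (score_a - expected_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - expected_a))

# Save ratings sorted from highest to lowest.
with open("data/elo_ratings.json", "w") as f:
    json.dump(dict(sorted(ratings.items(), key=lambda kv: -kv[1])), f, indent=2)
```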
The final Elo rating for each response represents its measured attribute intensity. You can now use these scores to evaluate your control method. For instance, you can plot the Elo ratings against your intended control values (e.g., discrete levels from 0 to 9) to assess calibration, consistency, and control range.
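As a starting point for that analysis, here is a small sketch that correlates intended control levels with the measured Elo ratings. The mapping from response `id` to control level used here is hypothetical; adapt it to however your responses encode their intended level.

```python
# Minimal sketch of an analysis step: correlate intended control levels
# with measured Elo ratings and plot the relationship.
import json
from scipy.stats import spearmanr
import matplotlib.pyplot as plt

with open("data/elo_ratings.json") as f:
    ratings = json.load(f)

# Hypothetical mapping: response "r<level>" was generated with control level <level>.
levels = [int(rid[1:]) for rid in ratings]
scores = [ratings[rid] for rid in ratings]

rho, p = spearmanr(levels, scores)
print(f"Spearman correlation between intended level and Elo rating: {rho:.3f} (p={p:.3g})")

plt.scatter(levels, scores)
plt.xlabel("Intended control level")
plt.ylabel("Measured Elo rating")
plt.title("Calibration of attribute intensity control")
plt.savefig("calibration.png")
```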
If you use this toolkit in your research, please cite the original paper:
@article{zhou2024evaluating,
title={Evaluating the smooth control of attribute intensity in text generation with LLMs},
author={Zhou, Shang and Yao, Feng and Dong, Chengyu and Wang, Zihan and Shang, Jingbo},
journal={arXiv preprint arXiv:2406.04460},
year={2024}
}