This repository provides a lightweight, training-free framework for evaluating the smooth control of attribute intensity in text generated by Large Language Models (LLMs). It implements the core evaluation pipeline described in the paper: "Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs".
The toolkit allows you to quantify the intensity of a chosen attribute (e.g., formality, anger, conciseness) in a set of generated text responses by using an LLM as a judge and calculating Elo ratings based on pairwise comparisons.
The process is broken down into four main steps:
- Prepare Data: Prepare your queries and the corresponding model-generated responses. You can use the benchmark queries we provide or bring your own.
- Generate Pairs: Run a script to perform pairwise comparisons of your responses using an LLM judge (e.g., GPT-4o).
- Calculate Ratings: Process the comparison results to compute an Elo rating for each response.
- Analyze: Use the final ratings to measure the performance of your control method.
First, prepare the files in the `data/` directory:
- `data/queries.json`: Provided in the repository. It contains a list of benchmark queries, each with a unique `id` and `text`.
- `data/responses.json`: A JSON file containing the LLM-generated responses you want to evaluate. Each entry should have a unique `id`, the `query_id` it corresponds to, and the `text` of the response (an example layout is sketched below).
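For reference, here is a minimal sketch of how these files can be laid out. The field names follow the description above; the example queries, responses, and `id` values are purely illustrative.

```python
# Minimal sketch of the expected data layout; the example content is illustrative.
import json

queries = [
    {"id": "q0", "text": "Write an email asking a colleague for feedback on a draft."},
]

# One entry per generated response you want to evaluate.
responses = [
    {"id": "r0", "query_id": "q0", "text": "Hey, got a sec to look at my draft?"},
    {"id": "r1", "query_id": "q0",
     "text": "Dear colleague, I would greatly appreciate your feedback on the attached draft."},
]

with open("data/queries.json", "w") as f:
    json.dump(queries, f, indent=2)
with open("data/responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```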
The `scripts/generate_pairs.ipynb` notebook handles the comparison task.
- Configure: Before running, set your `OPENAI_API_KEY` as an environment variable. Customize the attribute you want to evaluate by editing the `ATTRIBUTE` variable (e.g., `"Formality"`, `"Happiness"`) and modify the prompt in `templates/comparison_prompt.txt` if needed.
- Run: The script sends pairs of responses to an LLM, which judges which response has a higher intensity of the specified attribute (a single judge call is sketched below). It uses an efficient Swiss-style pairing system to minimize API calls.
- Output: The results are saved to `data/pairwise.jsonl`.
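For illustration, here is a minimal sketch of what one judge call can look like, assuming the OpenAI Python client and GPT-4o as the judge. The inline prompt is only a stand-in for `templates/comparison_prompt.txt`, and the notebook's actual pairing and answer-parsing logic may differ.

```python
# Minimal sketch of one pairwise judgment; prompt text and parsing are stand-ins.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
ATTRIBUTE = "Formality"

def judge_pair(response_a: str, response_b: str) -> str:
    """Ask the LLM judge which response shows a higher intensity of ATTRIBUTE.

    Returns "A" or "B". The prompt below stands in for templates/comparison_prompt.txt.
    """
    prompt = (
        f"Which response expresses a higher level of {ATTRIBUTE}?\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with a single letter, A or B."
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()[:1].upper()

# Append one comparison record to data/pairwise.jsonl.
record = {"id_a": "r0", "id_b": "r1",
          "winner": judge_pair("Hey, got a sec?", "Dear colleague, I would appreciate your feedback.")}
with open("data/pairwise.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```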
Next, run the `scripts/calculate_elo.ipynb` notebook.
- Run: This script reads the pairwise comparison data from `data/pairwise.jsonl`.
- Output: It calculates the final Elo rating for each response (a simplified version of the update is sketched below) and saves the results in two formats:
  - `data/elo_ratings.json`: a JSON file with detailed ratings.
  - `data/elo_ratings.csv`: a CSV file sorted by rating for easy analysis.
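For intuition, here is a minimal sketch of a sequential Elo update over the pairwise records. The record field names follow the earlier sketch; the notebook's actual constants, initialization, and fitting procedure may differ.

```python
# Minimal sketch of a sequential Elo update over data/pairwise.jsonl.
import json
from collections import defaultdict

K = 32                                  # update step size (illustrative value)
ratings = defaultdict(lambda: 1000.0)   # every response starts at the same rating

with open("data/pairwise.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        a, b = rec["id_a"], rec["id_b"]
        score_a = 1.0 if rec["winner"] == "A" else 0.0
        # Expected score of A under the Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        ratings[a] += K * (score_a - expected_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - expected_a))

# Save ratings sorted from highest to lowest.
with open("data/elo_ratings.json", "w") as f:
    json.dump(dict(sorted(ratings.items(), key=lambda kv: -kv[1])), f, indent=2)
```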
The final Elo rating for each response represents its measured attribute intensity. You can now use these scores to evaluate your control method. For instance, you can plot the Elo ratings against your intended control values (e.g., discrete levels from 0 to 9) to assess calibration, consistency, and control range.
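As a starting point for that analysis, here is a small sketch that correlates intended control levels with the measured Elo ratings. The mapping from response `id` to control level used here is hypothetical; adapt it to however your responses encode their intended level.

```python
# Minimal sketch of an analysis step: correlate intended control levels
# with measured Elo ratings and plot the relationship.
import json
from scipy.stats import spearmanr
import matplotlib.pyplot as plt

with open("data/elo_ratings.json") as f:
    ratings = json.load(f)

# Hypothetical mapping: response "r<level>" was generated with control level <level>.
levels = [int(rid[1:]) for rid in ratings]
scores = [ratings[rid] for rid in ratings]

rho, p = spearmanr(levels, scores)
print(f"Spearman correlation between intended level and Elo rating: {rho:.3f} (p={p:.3g})")

plt.scatter(levels, scores)
plt.xlabel("Intended control level")
plt.ylabel("Measured Elo rating")
plt.title("Calibration of attribute intensity control")
plt.savefig("calibration.png")
```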
If you use this toolkit in your research, please cite the original paper:
@article{zhou2024evaluating,
title={Evaluating the smooth control of attribute intensity in text generation with LLMs},
author={Zhou, Shang and Yao, Feng and Dong, Chengyu and Wang, Zihan and Shang, Jingbo},
journal={arXiv preprint arXiv:2406.04460},
year={2024}
}