Smooth Control Evaluation Toolkit

This repository provides a lightweight, training-free framework for evaluating the smooth control of attribute intensity in text generated by Large Language Models (LLMs). It implements the core evaluation pipeline described in the paper: "Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs".

The toolkit allows you to quantify the intensity of a chosen attribute (e.g., formality, anger, conciseness) in a set of generated text responses by using an LLM as a judge and calculating Elo ratings based on pairwise comparisons.

Workflow

The process is straightforward and breaks down into four main steps:

  1. Prepare Data: Prepare your queries and the corresponding model-generated responses. You can use the benchmark queries we provide or bring your own.
  2. Generate Pairs: Run a script to perform pairwise comparisons of your responses using an LLM judge (e.g., GPT-4o).
  3. Calculate Ratings: Process the comparison results to compute an Elo rating for each response.
  4. Analyze: Use the final ratings to measure the performance of your control method.

How to Use

1. Prepare Your Data

First, make sure the data/ directory contains the following files:

  • data/queries.json: Provided in the repository. It contains a list of benchmark queries, each with a unique id and text. You can use these queries or substitute your own.
  • data/responses.json: A JSON file containing the LLM-generated responses you want to evaluate. Each entry should have a unique id, the query_id it corresponds to, and the text of the response (see the sketch after this list).
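
For reference, the snippet below writes a minimal data/responses.json in the format described above. The field names (id, query_id, text) follow the description; the example ids and response texts are purely illustrative, and it assumes the data/ directory already exists.

# Illustrative only: builds a tiny data/responses.json with the fields described above.
import json

responses = [
    {"id": "r1", "query_id": "q1", "text": "Hey, can you send that file over?"},
    {"id": "r2", "query_id": "q1", "text": "Would you kindly forward the document at your earliest convenience?"},
]

with open("data/responses.json", "w") as f:
    json.dump(responses, f, indent=2)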

2. Generate Pairwise Comparisons

The scripts/generate_pairs.ipynb notebook handles the comparison task.

  • Configure: Before running, set your OPENAI_API_KEY environment variable. Choose the attribute to evaluate by editing the ATTRIBUTE variable (e.g., "Formality", "Happiness"), and adjust the prompt in templates/comparison_prompt.txt if needed.
  • Run: The notebook sends pairs of responses to an LLM judge, which decides which response expresses the specified attribute more intensely (a minimal sketch of such a judging call is shown after this list). It uses an efficient Swiss-style pairing system to minimize API calls.
  • Output: The results are saved to data/pairwise.jsonl.
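
As a rough illustration of a single judging call, the sketch below asks an OpenAI model which of two responses is more intense on one attribute. The prompt wording, model name, and answer parsing here are assumptions for illustration; the notebook itself uses the template in templates/comparison_prompt.txt and schedules comparisons with Swiss-style pairing.

# Minimal sketch of one pairwise judgment; not the notebook's exact prompt or parsing.
import os
from openai import OpenAI  # official openai Python package (v1+)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

ATTRIBUTE = "Formality"  # e.g., "Formality", "Happiness"

def judge_pair(response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' for whichever response shows higher attribute intensity."""
    prompt = (
        f"Which response expresses a higher degree of {ATTRIBUTE}?\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with exactly one letter: A or B."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()[:1].upper()

print(judge_pair("Hey, send the file.", "Would you kindly forward the document?"))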

3. Calculate Elo Ratings

Next, run the scripts/calculate_elo.ipynb notebook.

  • Run: The notebook reads the pairwise comparison data from data/pairwise.jsonl.
  • Output: It computes the final Elo rating for each response (see the sketch after this list for the underlying update rule) and saves the results in two formats:
    • data/elo_ratings.json: A JSON file with detailed ratings.
    • data/elo_ratings.csv: A CSV file sorted by rating for easy analysis.
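
For intuition, the standard Elo update applied to a single comparison looks like the sketch below. The K-factor, initial rating, and iteration scheme used by the notebook may differ; this only shows the general rule.

# Standard Elo update for one pairwise outcome; a sketch of the idea,
# not necessarily the notebook's exact parameters.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: both responses start at 1000 and A is judged more intense.
print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)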

4. Analyze Results

The final Elo rating for each response represents its measured attribute intensity. You can now use these scores to evaluate your control method. For instance, you can plot the Elo ratings against your intended control values (e.g., discrete levels from 0 to 9) to assess calibration, consistency, and control range.
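
One quick way to do this is to average the Elo ratings per intended control level and plot the result. The sketch below assumes elo_ratings.csv has id and rating columns and that you keep a separate (hypothetical) intended_levels.csv mapping each response id to its intended level; adapt the column and file names to your own setup.

# Illustrative analysis sketch; column names and the intended_levels.csv file are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv("data/elo_ratings.csv")      # assumed columns: id, rating
levels = pd.read_csv("data/intended_levels.csv")   # hypothetical file: id, level (0-9)

merged = ratings.merge(levels, on="id")
per_level = merged.groupby("level")["rating"].mean()

per_level.plot(marker="o")
plt.xlabel("Intended control level")
plt.ylabel("Mean Elo rating")
plt.title("Measured attribute intensity vs. intended control level")
plt.show()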


Citation

If you use this toolkit in your research, please cite the original paper:

@article{zhou2024evaluating,
  title={Evaluating the smooth control of attribute intensity in text generation with LLMs},
  author={Zhou, Shang and Yao, Feng and Dong, Chengyu and Wang, Zihan and Shang, Jingbo},
  journal={arXiv preprint arXiv:2406.04460},
  year={2024}
}
