Jiaxu-Zhao/bias-attack

LLM Bias Attacks

This is the official repository for "Understanding Large Language Model Vulnerabilities to Social Bias Attacks" by Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, Mykola Pechenizkiy.

Installation

This code requires FastChat at version fschat==0.2.23; please make sure to install exactly this version. The llm-attacks package can then be installed by running the following command at the root of this repository:

pip install -e .
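
Because the pinned FastChat version matters, it can help to verify it before launching anything. The following is a minimal sketch using only the Python standard library; the helper name check_fschat is hypothetical and not part of this repository.

```python
from importlib import metadata

# Version pinned in the installation instructions above.
REQUIRED_FSCHAT = "0.2.23"


def check_fschat(required: str = REQUIRED_FSCHAT) -> str:
    """Return a short status message about the installed fschat version.

    Hypothetical helper for illustration; it only inspects package metadata.
    """
    try:
        installed = metadata.version("fschat")
    except metadata.PackageNotFoundError:
        return f"fschat is not installed; run: pip install fschat=={required}"
    if installed != required:
        return f"fschat {installed} found, but {required} is expected"
    return f"fschat {installed} OK"


if __name__ == "__main__":
    print(check_fschat())
```

Running the file prints one status line, so a mismatch is visible before any experiment starts.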

Models

Please follow the instructions to download Vicuna-7B and/or LLaMA-2-7B-Chat first (we use the weights converted by HuggingFace). By default, our scripts assume that models are stored in a root directory named /DIR. To change the paths to your models and tokenizers, add the following lines to experiments/configs/individual_xxx.py (for individual experiments) and experiments/configs/transfer_xxx.py (for multiple-behavior or transfer experiments). An example is given below.

    config.model_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more models
    ]
    config.tokenizer_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more tokenizers
    ]

pip install livelossplot
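
A wrong model or tokenizer path usually only surfaces once a run starts loading weights, so a quick pre-flight check can save time. The sketch below is a hypothetical helper (check_paths is not part of this repository). It assumes that each model path is paired with exactly one tokenizer path and that both point to local directories.

```python
from pathlib import Path


def check_paths(model_paths, tokenizer_paths):
    """Return a list of problems; an empty list means the paths look usable.

    Assumption (not stated in the README): one tokenizer path per model path.
    """
    problems = []
    if len(model_paths) != len(tokenizer_paths):
        problems.append(
            f"{len(model_paths)} model path(s) but "
            f"{len(tokenizer_paths)} tokenizer path(s)"
        )
    # dict.fromkeys removes duplicates while keeping the original order,
    # so a path shared by model and tokenizer is reported only once.
    for path in dict.fromkeys([*model_paths, *tokenizer_paths]):
        if not Path(path).is_dir():
            problems.append(f"missing directory: {path}")
    return problems


if __name__ == "__main__":
    paths = ["/DIR/vicuna/vicuna-7b-v1.3"]  # example path from the config above
    for problem in check_paths(paths, paths):
        print(problem)
```

The same lists that are assigned to config.model_paths and config.tokenizer_paths can be passed in directly.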

Experiments

The experiments folder contains code to reproduce GCG experiments on AdvBench.

  • To run individual experiments with harmful behaviors or harmful strings (i.e., 1 behavior, 1 model or 1 string, 1 model), run the following commands inside experiments (replacing vicuna with llama2, or behaviors with strings, switches to a different experiment setup):
      cd launch_scripts
      bash run_gcg_individual.sh vicuna behaviors
  • To run multiple-behavior experiments (i.e., 25 behaviors, 1 model), run the following commands inside experiments:
      cd launch_scripts
      bash run_gcg_multiple.sh vicuna # or llama2
  • To run transfer experiments (i.e., 25 behaviors, 2 models), run the following commands inside experiments:
      cd launch_scripts
      bash run_gcg_transfer.sh vicuna 2 # or vicuna_guanaco 4
  • To evaluate the results, follow the directions in experiments/parse_results.ipynb.
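
When several of the runs above need to be queued from one place, the command lines can be assembled programmatically. This is a minimal sketch: the labels in SCRIPTS and the function build_command are hypothetical, and only the script names and arguments are taken from the list above.

```python
# Launch scripts named in the list above, keyed by a hypothetical short label.
SCRIPTS = {
    "individual": "run_gcg_individual.sh",
    "multiple": "run_gcg_multiple.sh",
    "transfer": "run_gcg_transfer.sh",
}


def build_command(kind, *args):
    """Return the argv list for one launch script.

    Example: build_command("individual", "vicuna", "behaviors").
    """
    if kind not in SCRIPTS:
        raise ValueError(f"unknown experiment kind: {kind!r}")
    # Arguments are converted to strings so that counts such as 2 can be passed as ints.
    return ["bash", SCRIPTS[kind], *map(str, args)]


if __name__ == "__main__":
    runs = [
        ("individual", ("vicuna", "behaviors")),
        ("multiple", ("vicuna",)),
        ("transfer", ("vicuna", 2)),
    ]
    for kind, args in runs:
        # Only prints the command; nothing is executed here.
        print(" ".join(build_command(kind, *args)))
```

Each returned list can be handed to subprocess.run(cmd, cwd="experiments/launch_scripts", check=True), which mirrors the cd launch_scripts step.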

Acknowledgements

This repository makes use of code from the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models".

Citation

If you find this useful in your research, please consider citing:

@inproceedings{zhao2025understanding,
  title={Understanding Large Language Model Vulnerabilities to Social Bias Attacks},
  author={Zhao, Jiaxu and Fang, Meng and Ye, Fanghua and Xu, Ke and Zhang, Qin and Zhou, Joey Tianyi and Pechenizkiy, Mykola},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={17620--17636},
  year={2025}
}
