This is the official repository for "Understanding Large Language Model Vulnerabilities to Social Bias Attacks" by Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, Mykola Pechenizkiy.
This code requires FastChat pinned at `fschat==0.2.23`; please make sure to install exactly this version. The `llm-attacks` package can be installed by running the following command at the root of this repository:

    pip install -e .

Please follow the instructions to download Vicuna-7B and/or LLaMA-2-7B-Chat first (we use the weights converted by HuggingFace). By default, our scripts assume models are stored in a root directory named `/DIR`. To change the paths to your models and tokenizers, add the following lines to `experiments/configs/individual_xxx.py` (for individual experiments) and `experiments/configs/transfer_xxx.py` (for multiple-behavior or transfer experiments). An example is given as follows.

    config.model_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more models
    ]
    config.tokenizer_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more tokenizers
    ]

Install `livelossplot` as well:

    pip install livelossplot

The `experiments` folder contains code to reproduce GCG experiments on AdvBench.
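
For the model and tokenizer path settings described earlier, the following is a minimal sketch of filling both lists from a single root directory. The helper `with_model_paths` is hypothetical and not part of this repository, and `SimpleNamespace` is only a stand-in for whatever config object the files in `experiments/configs` actually use. It also assumes each tokenizer lives in the same directory as its model, as in the example above.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace


def with_model_paths(config, root, relative_dirs):
    """Set config.model_paths and config.tokenizer_paths, one entry per model.

    Hypothetical helper: `config` is any object that accepts attribute
    assignment; `root` plays the role of /DIR in the example above.
    """
    paths = [str(PurePosixPath(root) / rel) for rel in relative_dirs]
    config.model_paths = list(paths)
    # Assumption: tokenizers are stored alongside the converted weights,
    # so both lists point at the same directories.
    config.tokenizer_paths = list(paths)
    return config


# SimpleNamespace stands in for the real config object here.
config = with_model_paths(SimpleNamespace(), "/DIR", ["vicuna/vicuna-7b-v1.3"])
print(config.model_paths)
print(config.tokenizer_paths)
```

Keeping the two lists derived from one source avoids the easy mistake of adding a model without its matching tokenizer.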
- To run individual experiments with harmful behaviors and harmful strings (i.e. 1 behavior, 1 model or 1 string, 1 model), run the following code inside `experiments` (changing `vicuna` to `llama2` and changing `behaviors` to `strings` will switch to different experiment setups):

      cd launch_scripts
      bash run_gcg_individual.sh vicuna behaviors

- To perform multiple behaviors experiments (i.e. 25 behaviors, 1 model), run the following code inside `experiments`:

      cd launch_scripts
      bash run_gcg_multiple.sh vicuna # or llama2

- To perform transfer experiments (i.e. 25 behaviors, 2 models), run the following code inside `experiments`:

      cd launch_scripts
      bash run_gcg_transfer.sh vicuna 2 # or vicuna_guanaco 4

- To perform evaluation experiments, please follow the directions in `experiments/parse_results.ipynb`.
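
When running several of the setups above in sequence, it can help to assemble the launch commands programmatically. The sketch below only builds and prints the argument lists; the helper `launch_command` and the experiment labels `individual`, `multiple`, and `transfer` are hypothetical names introduced for illustration, while the script names and arguments are the ones listed above.

```python
import shlex


def launch_command(experiment, *args):
    """Return the argv list for one of the launch scripts listed above.

    Hypothetical helper: it does not validate the arguments, it only maps
    an experiment label to its script and appends the arguments as strings.
    """
    scripts = {
        "individual": "run_gcg_individual.sh",  # args: model, behaviors|strings
        "multiple": "run_gcg_multiple.sh",      # args: model
        "transfer": "run_gcg_transfer.sh",      # args: model name(s), count
    }
    if experiment not in scripts:
        raise ValueError(f"unknown experiment type: {experiment!r}")
    return ["bash", scripts[experiment], *map(str, args)]


commands = [
    launch_command("individual", "vicuna", "behaviors"),
    launch_command("multiple", "llama2"),
    launch_command("transfer", "vicuna", 2),
]
for cmd in commands:
    # Print the shell-quoted form; nothing is executed here.
    print(shlex.join(cmd))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, cwd="experiments/launch_scripts", check=True)` from the repository root, which mirrors the `cd launch_scripts` step.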
## Acknowledgements

This repository makes use of code from the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models".

If you find this useful in your research, please consider citing:

    @inproceedings{zhao2025understanding,
      title={Understanding Large Language Model Vulnerabilities to Social Bias Attacks},
      author={Zhao, Jiaxu and Fang, Meng and Ye, Fanghua and Xu, Ke and Zhang, Qin and Zhou, Joey Tianyi and Pechenizkiy, Mykola},
      booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
      pages={17620--17636},
      year={2025}
    }