LLM Bias Attacks

This is the official repository for "Understanding Large Language Model Vulnerabilities to Social Bias Attacks" by Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, Mykola Pechenizkiy.

Installation

This code requires FastChat version 0.2.23 (fschat==0.2.23); please make sure to install exactly this version. The llm-attacks package can be installed by running the following command at the root of this repository:

pip install -e .
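
Because the pinned FastChat version matters, it can help to check what is installed before launching anything. The snippet below is a minimal sketch and not part of this repository: the helper names `installed_version` and `version_matches` are hypothetical, and it assumes the package is installed under the distribution name `fschat`, as in the pin above.

```python
from importlib import metadata
from typing import Optional

REQUIRED_FSCHAT = "0.2.23"  # version pinned in the installation note above


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed: Optional[str], required: str = REQUIRED_FSCHAT) -> bool:
    """True only when the installed version string equals the pinned one."""
    return installed is not None and installed.strip() == required


if __name__ == "__main__":
    # "fschat" as the distribution name is an assumption taken from the pin.
    found = installed_version("fschat")
    if version_matches(found):
        print(f"fschat {found} is installed")
    else:
        print(f"expected fschat=={REQUIRED_FSCHAT}, found {found}")
```

If the check reports a different version, reinstalling with `pip install fschat==0.2.23` should bring the environment back in line with the pin.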

Models

Please follow the instructions to download Vicuna-7B and/or LLaMA-2-7B-Chat first (we use the weights converted by HuggingFace). By default, our scripts assume that models are stored in a root directory named /DIR. To change the paths to your models and tokenizers, add the following lines to experiments/configs/individual_xxx.py (for individual experiments) and experiments/configs/transfer_xxx.py (for multiple-behavior or transfer experiments). For example:

    config.model_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more models
    ]
    config.tokenizer_paths = [
        "/DIR/vicuna/vicuna-7b-v1.3",
        ... # more tokenizers
    ]
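
A wrong path in the config usually only surfaces once model loading starts, so a quick check beforehand may save time. The following is a rough sketch rather than code from this repository: `check_model_paths` is a hypothetical helper, and it assumes that each entry in `config.model_paths` is paired with one entry in `config.tokenizer_paths` and that both point to local directories.

```python
from pathlib import Path
from typing import List, Sequence


def check_model_paths(
    model_paths: Sequence[str], tokenizer_paths: Sequence[str]
) -> List[str]:
    """Collect human-readable problems with the configured paths (empty list = OK)."""
    problems = []
    # Assumption: models and tokenizers are listed one-to-one.
    if len(model_paths) != len(tokenizer_paths):
        problems.append(
            f"{len(model_paths)} model path(s) but {len(tokenizer_paths)} tokenizer path(s)"
        )
    for kind, paths in (("model", model_paths), ("tokenizer", tokenizer_paths)):
        for p in paths:
            if not Path(p).is_dir():
                problems.append(f"{kind} path is not a directory: {p}")
    return problems


if __name__ == "__main__":
    # Same placeholder path as in the config example above.
    paths = ["/DIR/vicuna/vicuna-7b-v1.3"]
    for line in check_model_paths(paths, paths) or ["all paths look fine"]:
        print(line)
```

The helper takes plain lists, so it can be called with `config.model_paths` and `config.tokenizer_paths` after the config has been edited.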

pip install livelossplot

Experiments

The experiments folder contains code to reproduce GCG experiments on AdvBench.

To run individual experiments with harmful behaviors and harmful strings (i.e. 1 behavior, 1 model or 1 string, 1 model), run the following code inside experiments (changing vicuna to llama2 and changing behaviors to strings will switch to different experiment setups):

cd launch_scripts
bash run_gcg_individual.sh vicuna behaviors
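
The four individual setups described above (vicuna or llama2, behaviors or strings) can also be launched from Python. This is only an illustrative sketch: `individual_command` and `run_individual` are hypothetical helpers, not part of the repository, and the sketch assumes it is run from inside experiments so that launch_scripts is a subdirectory.

```python
import subprocess
from typing import List

MODELS = ("vicuna", "llama2")      # model names mentioned in the text above
SETUPS = ("behaviors", "strings")  # experiment setups mentioned in the text above


def individual_command(model: str, setup: str) -> List[str]:
    """Build the argv for one individual GCG run, rejecting unknown names."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    if setup not in SETUPS:
        raise ValueError(f"unknown setup {setup!r}; expected one of {SETUPS}")
    return ["bash", "run_gcg_individual.sh", model, setup]


def run_individual(model: str, setup: str, launch_dir: str = "launch_scripts") -> int:
    """Run the launch script from launch_dir and return its exit code."""
    return subprocess.run(individual_command(model, setup), cwd=launch_dir).returncode


if __name__ == "__main__":
    # Print the four combinations without launching anything.
    for m in MODELS:
        for s in SETUPS:
            print(" ".join(individual_command(m, s)))
```

Validating the two arguments up front catches typos such as `llama-2` before a long-running job is started.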

To perform multiple behaviors experiments (i.e. 25 behaviors, 1 model), run the following code inside experiments:

cd launch_scripts
bash run_gcg_multiple.sh vicuna # or llama2

To perform transfer experiments (i.e. 25 behaviors, 2 models), run the following code inside experiments:

cd launch_scripts
bash run_gcg_transfer.sh vicuna 2 # or vicuna_guanaco 4

To perform evaluation experiments, please follow the directions in experiments/parse_results.ipynb.

Acknowledgements

This repository makes use of code from the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models".

Citation

If you find this useful in your research, please consider citing:

@inproceedings{zhao2025understanding,
  title={Understanding Large Language Model Vulnerabilities to Social Bias Attacks},
  author={Zhao, Jiaxu and Fang, Meng and Ye, Fanghua and Xu, Ke and Zhang, Qin and Zhou, Joey Tianyi and Pechenizkiy, Mykola},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={17620--17636},
  year={2025}
}
