Repository for the paper:
Detecting Backdoored LoRAs from Weights Alone
Anonymous authors
Under review as a conference paper at COLM 2026 (double-blind)
This project studies static backdoor detection for LoRA adapters without model execution, trigger search, or training-data access.
Given a LoRA adapter, the method reconstructs projection-wise updates in attention (q, k, v, o), extracts spectral/geometric features, and applies a calibrated logistic detector to produce a poison score.
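As a rough illustration of the reconstruction step, the sketch below forms the dense update for each attention projection from its low-rank factors. It assumes the common LoRA convention deltaW = (alpha / r) * B A. The function name `reconstruct_update`, the dictionary layout, and the random factors are hypothetical and are not taken from the repository code.

```python
import numpy as np


def reconstruct_update(lora_A, lora_B, alpha=None):
    """Return the dense update scale * (B @ A) for one projection.

    lora_A has shape (r, d_in) and lora_B has shape (d_out, r).
    The scale is alpha / r when alpha is given, otherwise 1 (assumed convention).
    """
    lora_A = np.asarray(lora_A, dtype=np.float64)
    lora_B = np.asarray(lora_B, dtype=np.float64)
    r = lora_A.shape[0]
    if lora_B.shape[1] != r:
        raise ValueError("rank mismatch between lora_A and lora_B")
    scale = 1.0 if alpha is None else alpha / r
    return scale * (lora_B @ lora_A)


# Toy adapter: random rank-8 factors for the four attention projections.
rng = np.random.default_rng(0)
adapter = {
    proj: (rng.normal(size=(8, 64)), rng.normal(size=(64, 8)))
    for proj in ("q", "k", "v", "o")
}
updates = {
    proj: reconstruct_update(A, B, alpha=16)
    for proj, (A, B) in adapter.items()
}
```

No model forward pass is involved: only the adapter factors are read, which matches the weight-only setting described above.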
The repository includes:
- Data generation scripts for benign, poisoned, and test adapter banks.
- A weight-only detector implementation.
- Calibration and held-out evaluation scripts.
- Analysis and plotting scripts for layer/rank sensitivity and cross-model geometry.
For each selected layer and projection, the pipeline computes a compact descriptor from LoRA updates using five statistics:
- Leading singular value (sigma_1)
- Frobenius norm (||deltaW||_F)
- Spectral energy concentration
- Spectral entropy
- Kurtosis of flattened update entries
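The five statistics listed above could be computed for a single projection roughly as follows. This is a minimal sketch, not the repository's implementation: it assumes energy concentration is the share of squared singular-value mass held by sigma_1, spectral entropy is the Shannon entropy of the normalized squared singular values, and kurtosis is the non-excess (Pearson) form. The name `projection_descriptor` is hypothetical; the manuscript's definitions take precedence.

```python
import numpy as np


def projection_descriptor(delta_w, eps=1e-12):
    """Return [sigma_1, Frobenius norm, energy concentration, entropy, kurtosis]."""
    delta_w = np.asarray(delta_w, dtype=np.float64)

    # Singular values are returned in descending order.
    s = np.linalg.svd(delta_w, compute_uv=False)
    energy = s ** 2
    p = energy / (energy.sum() + eps)  # normalized spectral energy

    sigma_1 = s[0]
    fro_norm = np.sqrt(energy.sum())   # equals ||deltaW||_F
    concentration = p[0]               # assumed: sigma_1^2 / sum_i sigma_i^2
    entropy = -np.sum(p * np.log(p + eps))

    # Pearson kurtosis of the flattened entries (assumed, not excess kurtosis).
    x = delta_w.ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    kurtosis = np.mean(centered ** 4) / (variance ** 2 + eps)

    return np.array([sigma_1, fro_norm, concentration, entropy, kurtosis])


# A rank-1 update puts all spectral energy in sigma_1.
descriptor = projection_descriptor(np.outer([1.0, 2.0], [3.0, 4.0]))
```

For a rank-1 matrix the leading singular value equals the Frobenius norm, the concentration is close to 1, and the entropy is close to 0, which gives a quick sanity check on the definitions chosen here.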
The four projection-wise descriptors (five statistics each) are concatenated into a 20-dimensional representation and standardized. A logistic regression model maps this representation to a poison score, and a decision threshold is selected on validation data: gap-based when the two classes are strictly separable, otherwise by a Youden-style criterion.
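The threshold rule can be sketched as below. This is an illustration under stated assumptions rather than the repository's code: "gap-based" is taken to mean the midpoint between the highest benign and lowest poisoned validation score, and "Youden-style" is taken to mean maximizing J = TPR - FPR over the observed scores. The function name `select_threshold` is hypothetical.

```python
import numpy as np


def select_threshold(scores, labels):
    """Pick a decision threshold from validation scores.

    labels: 1 = poisoned, 0 = benign. Returns (threshold, rule_name).
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=int)
    benign = scores[labels == 0]
    poisoned = scores[labels == 1]
    if benign.size == 0 or poisoned.size == 0:
        raise ValueError("both classes are required")

    # Strictly separable: every poisoned score exceeds every benign score.
    if benign.max() < poisoned.min():
        midpoint = 0.5 * (benign.max() + poisoned.min())
        return float(midpoint), "gap"

    # Otherwise maximize Youden's J = TPR - FPR over candidate thresholds.
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        predicted = scores >= t
        tpr = np.mean(predicted[labels == 1])
        fpr = np.mean(predicted[labels == 0])
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return float(best_t), "youden"


# Separable validation scores use the gap rule; overlapping scores fall back to Youden.
gap_threshold, gap_rule = select_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
youden_threshold, youden_rule = select_threshold([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```

Ties in J are broken in favor of the lowest candidate threshold here, which favors recall; a different tie-break is equally plausible.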
.
├── bankCreation/ # Adapter-bank construction scripts
├── core/ # Detector and feature extraction logic
├── evaluation/ # Calibration, evaluation, and analysis scripts
├── plotScripts/ # Figure generation scripts
├── resultsFinal/ # Generated figures/reports (organized by subfolder)
├── config.py # Global experiment configuration
└── requirements.txt # Python dependencies
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
If required by your workflow, create a .env file with authentication tokens (for example HF_TOKEN).
python bankCreation/benignBank.py
python bankCreation/poisonBank.py
python bankCreation/testSet.py
python bankCreation/build_reference_bank.py
python evaluation/calibrate_detector.py
This step fits the detector and writes calibration artifacts (threshold, reports, distributions).
python evaluation/evaluate_test_set.py
This step writes held-out metrics and error analysis artifacts.
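For convenience, the bank-creation, calibration, and evaluation commands above could be chained with a small driver like the one below. This is a hypothetical helper, not a script shipped in the repository; it assumes the scripts are meant to run in the order listed and from the project root. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Script paths as listed above, in the assumed execution order.
PIPELINE = [
    "bankCreation/benignBank.py",
    "bankCreation/poisonBank.py",
    "bankCreation/testSet.py",
    "bankCreation/build_reference_bank.py",
    "evaluation/calibrate_detector.py",
    "evaluation/evaluate_test_set.py",
]


def build_commands(python=sys.executable):
    """Return one argv list per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps sequentially and stop at the first failure.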
Current plotting scripts read inputs from plotScripts/... and write outputs to resultsFinal/... by subfolder:
- plotScripts/layerRankPlots/plot_layer_rank_heatmap.py
- plotScripts/layerRankPlots/plot_backdoor_detection_correlation.py
- plotScripts/hackTokenPlots/plot_hack_tokens.py
- plotScripts/crossModelPlots/cross_model_similarity.py --plots-only
Example:
python plotScripts/layerRankPlots/plot_layer_rank_heatmap.py
python plotScripts/layerRankPlots/plot_backdoor_detection_correlation.py
python plotScripts/hackTokenPlots/plot_hack_tokens.py
python plotScripts/crossModelPlots/cross_model_similarity.py --plots-only
- Backbones: Llama-3.2-3B-Instruct, Qwen2.5-3B, Gemma-2-2B
- Detector input: projection-wise spectral/geometric signature from LoRA weights
- Calibration/test protocol: separate calibration bank and held-out test bank
- Attack families: rare-token and contextual trigger poisoning
- Poisoning rates: 1%, 3%, and 5%
For full definitions, equations, and ablations, refer to the submitted manuscript.
- Main script behavior is controlled by config.py.
- Most evaluation scripts assume expected folder layouts under the project root.
- Generated artifacts (plots, reports, caches) can be large; keep runs isolated if needed.
This repository is anonymized for double-blind review. Please cite the final camera-ready paper once available.