Implementation for the paper Evaluating Tuning Strategies for Sequence Generation with Protein Language Models.
Adapted (forked and extended) from https://github.com/corolla-johnson/mkultra, a prompt tuning implementation for NLP that supports GPT-2.
The prompt tuning method was developed by Lester et al. (The Power of Scale for Parameter-Efficient Prompt Tuning).
For setup, you can reconstruct our conda environment for prompt tuning from the environment.yaml file.
This implementation currently supports ProtGPT2 (Ferruz et al. ProtGPT2 is a deep unsupervised language model for protein design.)
and RITA (Hesslow et al. RITA: a Study on Scaling Up Generative Protein Sequence Models.). See mkultra/tuning.py for implementation details.
Training scripts for RITA and ProtGPT2 are RITA_prompt_tuning.py and ProtGPT2_prompt_tuning.py, respectively.
They can be configured with parameters specified in a JSON config; see the training_configs/ folder and the
Trainer documentation in mkultra/trainers.py. You can enable memory tracking during training, but note that this slows training down.
The training configurations in training_configs/ are the ones we used for our paper. This includes the configs for training and evaluation, as well as those for runtime measurement and memory tracking.
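For illustration, the snippet below simply loads one of the provided configs and prints its contents; it only assumes that the configs are plain JSON files under training_configs/, and the parameter names they contain are defined by the Trainer in mkultra/trainers.py:

import glob
import json

# Pick one of the provided configs (assumes JSON files somewhere under training_configs/).
config_path = sorted(glob.glob("training_configs/**/*.json", recursive=True))[0]

with open(config_path) as f:
    config = json.load(f)

# The available keys correspond to the Trainer parameters documented in mkultra/trainers.py.
print(config_path)
print(json.dumps(config, indent=2))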
An example notebook for training a prompt for RITA is RITA_prompt_tuning_example.ipynb, open in Colab here.
You can train a model on FASTA datasets by using the FastaDataset class (mkultra/sequence_loader.py)
as the dataset input for a PyTorch DataLoader.
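A minimal sketch, assuming an illustrative FastaDataset signature (FASTA path plus tokenizer) and the small RITA checkpoint from the Hugging Face Hub; see mkultra/sequence_loader.py for the actual constructor arguments:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from mkultra.sequence_loader import FastaDataset

# Example model and dataset; the FastaDataset arguments below are assumptions,
# not the verified signature.
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
dataset = FastaDataset("datasets/InterProUniprotPF03272prepared_train.fasta", tokenizer)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

for batch in loader:
    ...  # hand the batches to the training loop / Trainer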
Dataset preprocessing as in RITA can be done with the prepare_dataset.ipynb notebook in the utils folder.
Also, have a look at the script utils/clustering.sh to see the configuration we used to cluster our datasets with MMseqs2. The current setup of the dataset notebook and clustering script is for clustering with a 100% sequence similarity threshold, but you can adjust that.
We also provide the datasets that we used for our experiments in the datasets/ folder.
They contain sequences from the Pfam family PF03272.
The dataset (datasets/InterProUniprotPF03272.fasta) was downloaded from InterPro
on January 5, 2023 (Paysan-Lafosse et al. InterPro in 2022.). Then, we removed all sequences containing an X, which created the dataset InterProUniprotPF03272_Xremoved.fasta.
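The filtering step itself is straightforward; the sketch below shows how it could be reproduced with Biopython (an illustration, not necessarily how the provided file was generated):

from Bio import SeqIO  # Biopython

# Keep only records whose sequence contains no X.
records = [
    record
    for record in SeqIO.parse("datasets/InterProUniprotPF03272.fasta", "fasta")
    if "X" not in str(record.seq).upper()
]
SeqIO.write(records, "datasets/InterProUniprotPF03272_Xremoved.fasta", "fasta")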
We clustered our data with MMseqs2 (Steinegger et al. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets; Steinegger et al. Clustering huge protein sequence sets in linear time.). The datasets clustered with a sequence similarity threshold of 100%,
which were our main datasets, are InterProUniprotPF03272prepared_<train|validation|test>.fasta. Datasets clustered with other
thresholds can be found as InterProUniprotPF03272_<threshold>_<train|validation|test>.fasta.
This implementation currently supports evaluation in terms of perplexity per token.
See RITA_prompt_comparison_basemodel.py and RITA_prompt_comparison_trainvaltest.py, the evaluation scripts
we used for our experiments. If you want to write your own evaluation script, have a look at mkultra/evaluator.py.
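For reference, the per-token perplexity of a sequence under a causal language model can be computed roughly as below (an illustrative sketch using ProtGPT2 from the Hugging Face Hub; the actual evaluation logic lives in mkultra/evaluator.py and the scripts above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nferruz/ProtGPT2"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tokenizer("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE", return_tensors="pt")  # toy example sequence
with torch.no_grad():
    # The returned loss is the mean cross-entropy per token,
    # so exp(loss) gives the per-token perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss).item()
print(perplexity)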
We further evaluated the number of generated sequences that ProtCNN (Bileschi et al. Using deep learning to annotate the protein universe.) classified as belonging to our target family PF03272.
The script we used for this is protcnn.py, taken and adapted from the official ProtCNN notebook. Since ProtCNN uses TensorFlow and we wanted to keep it separate from our other scripts, we created a dedicated conda environment for it, specified in environment-protcnn.yml. To use the script, you have to
download the ProtCNN model and vocabulary as described in the official notebook:
wget -qN https://storage.googleapis.com/brain-genomics-public/research/proteins/pfam/models/single_domain_per_sequence_zipped_models/seed_random_32.0/5356760.tar.gz
tar xzf 5356760.tar.gz
wget https://storage.googleapis.com/brain-genomics-public/research/proteins/pfam/models/single_domain_per_sequence_zipped_models/trained_model_pfam_32.0_vocab.json
You can set a fixed sliding window size and/or stride, or let the prediction run on all possible windows of minimum size 50 (or the full sequence if it is shorter). Our code also supports running an ensemble of multiple ProtCNN models (as described in the ProtCNN paper); for this, you have to modify the list of saved models in the script and download the respective additional models. Furthermore, in single-model mode, you can set a probability threshold to discard predictions with a lower probability. In our final experiments, we used no fixed window size or stride, a single model, and a probability threshold of 0.5.
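The window enumeration described above could look roughly like the following sketch (protcnn.py may implement it differently):

def candidate_windows(sequence, min_size=50):
    """Yield every contiguous window of at least min_size residues,
    or the full sequence if it is shorter than min_size."""
    n = len(sequence)
    if n <= min_size:
        yield sequence
        return
    for size in range(min_size, n + 1):
        for start in range(n - size + 1):
            yield sequence[start:start + size]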
In addition to ProtCNN, we also evaluated protein family prediction with HMMER with profile-specific gathering thresholds, which you can run as follows:
hmmsearch --cut_ga --tblout <path/to/output/file> <path/to/HMM/file> <path/to/sequences/file>
The HMMER runs for all our sets of generated sequences are bundled in the script hmmer_search.sh. We use the family's Pfam profile HMM, downloaded from InterPro.
To reproduce our evaluations of protein activity (see Johnson et al. Computational Scoring and Experimental Evaluation of Enzymes Generated by Neural Networks.), run activity_prediction_metrics/metrics.py for all generated datasets. See also activity_prediction_metrics/metrics.ipynb for an interactive version.
Because these evaluations need specific dependencies, use a separate conda environment created with the packages specified in the environment-activity.yml file. Furthermore, the ESM likelihood computation requires the GitHub repo https://github.com/seanrjohnson/protein_gibbs_sampler. We advise running the notebook activity_prediction_metrics/metrics.ipynb once initially, because it installs all required dependencies in addition to those from environment-activity.yml.
Then, use activity_prediction_metrics/aggregate.ipynb to aggregate the results. This counts the activity predictions for each generated dataset and writes everything into activity_prediction_metrics/activity.csv for further usage, e.g. plotting.
For generating sequences, instantiate a prompt tuning model (see mkultra/tuning.py) and then load and add a prompt that was trained for that type of model (see mkultra/checkpoint_loader.py), as done for example in RITA_prompt_sequence_generation.py. In our experiments, we generated 193 sequences (the size of our test set) in batches of 10.
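A rough sketch of this workflow is shown below; the class, function, and checkpoint names are hypothetical placeholders, so refer to mkultra/tuning.py, mkultra/checkpoint_loader.py, and RITA_prompt_sequence_generation.py for the actual API:

from transformers import AutoTokenizer
from mkultra.tuning import RITAPromptTuningLM            # hypothetical class name
from mkultra.checkpoint_loader import load_soft_prompt   # hypothetical function name

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")                          # example model
model = RITAPromptTuningLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
model.set_soft_prompt(load_soft_prompt("path/to/prompt_checkpoint"))                   # hypothetical method and path

# Generate a batch of 10 sequences seeded with a start residue
# (we generated 193 sequences in batches of 10 in our experiments).
input_ids = tokenizer("M", return_tensors="pt").input_ids
generated = model.generate(input_ids, do_sample=True, max_length=200, num_return_sequences=10)
sequences = tokenizer.batch_decode(generated, skip_special_tokens=True)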
In our paper, we compare the performance of prompt-tuned models to that of finetuned models. For finetuning, we use the run_clm.py script from Hugging Face with the same batch sizes as for prompt tuning. You have to add trust_remote_code=True to the model loading call in line 376.
To generate the datasets as txt files (as needed for finetuning with run_clm.py), you can use the prepare_dataset.ipynb notebook in the utils folder.
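A minimal sketch of such a conversion with Biopython (one sequence per line; the exact txt format expected for finetuning is defined in the notebook and may differ):

from Bio import SeqIO  # Biopython

# Example file names; write one sequence per line.
with open("datasets/InterProUniprotPF03272prepared_train.txt", "w") as out:
    for record in SeqIO.parse("datasets/InterProUniprotPF03272prepared_train.fasta", "fasta"):
        out.write(str(record.seq) + "\n")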