This repository contains data and scripts demonstrating how Sentence-Transformers can be used with protein language models (pLMs), in particular ESM models, as described in the paper *Optimizing protein language models with Sentence Transformers*, NeurIPS Workshop on Machine Learning in Structural Biology (2023).
Please note that this implementation requires a GPU.
```bash
git clone https://github.com/PeptoneLtd/contrastive-finetuning-plms.git
cd contrastive-finetuning-plms
pip install -r full_env.txt
```
Two minimal examples are provided, showing how to train a solubility predictor and a disorder predictor:
```
scripts/solubility_search_seeds.py
scripts/disorder_st_avg.py
```
Note that the scripts read their input from the data folder; the paths may need to be adjusted depending on your environment.
For the disorder task, the frozen residue-level representations from ESM are currently downloaded from Hugging Face on the fly; for a large-scale search, consider caching them instead.
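A minimal sketch of such a cache is shown below. The helper name, cache directory, and `embed_fn` callable are illustrative assumptions, not part of this repository; `embed_fn` stands in for any frozen ESM forward pass that maps a sequence to an `(L, d)` array of per-residue embeddings.

```python
# Hypothetical disk cache for frozen per-residue representations.
# `embed_fn` is assumed to be a callable mapping a sequence string
# to an (L, d) array, e.g. a frozen ESM forward pass.
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def cached_residue_embeddings(seq, embed_fn, cache_dir="esm_cache"):
    """Return per-residue embeddings for `seq`, computing them at most once.

    Embeddings are keyed by a hash of the sequence and stored as .npy
    files, so repeated sweeps over the same data skip the model call.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(seq.encode()).hexdigest()
    path = cache / f"{key}.npy"
    if path.exists():
        return np.load(path)  # cache hit: no model call needed
    emb = np.asarray(embed_fn(seq))
    np.save(path, emb)  # cache miss: compute once, persist to disk
    return emb
```

The same pattern works with any embedding backend; only `embed_fn` changes.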
If you use this work in your research, please cite the relevant software:
```bibtex
@inproceedings{adopt2,
  title     = {Optimizing protein language models with Sentence Transformers},
  author    = {Istvan Redl and Fabio Airoldi and Sandro Bottaro and Albert Chung and Oliver Dutton and Carlo Fisicaro and Patrik Foerch and Louie Henderson and Falk Hoffmann and Michele Invernizzi and Benjamin M J Owens and Stefano Ruschetta and Kamil Tamiola},
  booktitle = {Proceedings of the NeurIPS Workshop on Machine Learning in Structural Biology},
  year      = {2023},
  note      = {Workshop Paper},
  url       = {https://www.mlsb.io/papers_2023/Optimizing_protein_language_models_with_Sentence_Transformers.pdf}
}
```
This source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory of this source tree.