UHC: Uncovering the Hidden Code: A Study of Protein Latent Encoding
Supervisor: Jacopo Staiano
Applied NLP Project, Spring 2023
The figures show how proteins with different functions separate in the latent space. Once the network has been trained on a downstream task, the learned latent space captures the characteristics of what each protein does. Different colours mark groups of proteins for which sequence similarity gives statistical evidence of similar function. The figures are shown in this order: img-clusters-token.png; img-clusters-nn.EMBEDDING.png; img-clusters-ESM2.png. I discuss these figures in more detail in the report.
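As an illustration, here is a minimal sketch of how such cluster plots can be produced from the released data. The file paths, the variable contents, and the choice of t-SNE are assumptions for illustration, not necessarily the exact procedure used in the report.

```python
# Minimal sketch: project per-protein embeddings to 2D and colour points by protein family.
# Paths, file contents, and the use of t-SNE are assumptions for illustration only.
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = torch.load("data/esm2/experiment/pred_emb.pt")            # assumed (N, D) tensor
labels = np.array(torch.load("data/esm2/experiment/pred_label.pt"))    # assumed N family labels

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings.numpy())

for family in np.unique(labels):
    pts = coords[labels == family]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=str(family))
plt.legend(fontsize=6)
plt.title("Protein latent space coloured by family")
plt.savefig("img-clusters-sketch.png", dpi=200)
```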
- Installation: how to install and set up the project locally, including the required dependencies.
- Usage: how to use the project.
- Contributing: how to contribute to the project.
- License: information about the project's license.
To create a conda environment with all the requirements, run:
conda create --name <env-name> --file ./requirements.txt
Then, you can activate the environment with:
conda activate <env-name>
I release the data used in the experiments in a Google Drive folder. The folder is structured as follows:
data
├───proteins
│   ├───albumin
│   │   ├───sequences_bpe
│   │   │   ├───BPE_model_albumin.model
│   │   │   ├───BPE_model_albumin.vocab
│   │   │   ├───sequences_BPE_str.pkl
│   │   │   ├───sequences_BPE.pkl
│   │   │   ├───sequences_pck.pkl
│   │   │   ├───sequences.fasta
│   │   │   └───sequences.txt
│   │   └───GenBank-id.fasta
│   ├───aldehyde
│   │   └───...
│   ├───catalase
│   │   └───...
│   ├───collagen
│   │   └───...
│   ├───elastin
│   │   └───...
│   ├───erythropoietin
│   │   └───...
│   ├───fibrinogen
│   │   └───...
│   ├───hemoglobin
│   │   └───...
│   ├───immunoglobulin
│   │   └───...
│   ├───insulin
│   │   └───...
│   ├───keratin
│   │   └───...
│   ├───myosin
│   │   └───...
│   ├───protein_kinase
│   │   └───...
│   ├───trypsin
│   │   └───...
│   └───tubulin
│       └───...
├───token_axa
│   ├───proteins_tokenized_padded.pt
│   └───lightning_logs
├───token_axa_nn_embedding
│   ├───proteins_tokenized_padded.pt
│   └───lightning_logs
├───esm2
│   ├───esm_embedding_0_3.pkl
│   ├───esm_embedding_4_7.pkl
│   ├───esm_embedding_8_10.pkl
│   ├───esm_embedding_11_14.pkl
│   ├───esm_embedding_label.pkl
│   ├───esm_sequences_dict.pkl
│   └───experiment
│       ├───esm_X_embedding_dataset_AVG.pt
│       ├───pred_emb.pt
│       ├───pred_label.pt
│       └───lightning_logs
└───human_gene_go_term
    ├───uniprot-annotations.fasta
    ├───uniprot-annotations.tsv
    ├───with_structure_uniprot-annotations.fasta
    ├───with_structure_uniprot-annotations.tsv
    ├───dictionary_all_go_tensor_padded.pkl
    ├───dictionary_all_go_tensor.pkl
    └───dictionary_all_go.pkl
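As a quick orientation, here is a minimal sketch of how the serialized files above might be loaded. The exact contents of each file (shapes, dictionary layouts) are assumptions; inspect them before relying on them.

```python
# Minimal sketch: load a few of the serialized artifacts listed above.
# The expected contents of each file are assumptions for illustration.
import pickle
import torch

# BPE-tokenized albumin sequences (pickled Python object, layout assumed)
with open("data/proteins/albumin/sequences_bpe/sequences_BPE.pkl", "rb") as f:
    albumin_bpe = pickle.load(f)

# Padded token tensor used by the fixed-vocabulary model
tokens = torch.load("data/token_axa/proteins_tokenized_padded.pt")

# Pre-computed ESM2 embeddings (pickled, layout assumed)
with open("data/esm2/esm_embedding_0_3.pkl", "rb") as f:
    esm_chunk = pickle.load(f)

print(type(albumin_bpe), type(tokens), type(esm_chunk))
```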
__________________________________________________
Concerning the human_gene_go_term folder: I retrieved the GO terms of almost 20,000 human genes. Additionally, I provide their pre-computed embeddings obtained with ProtBERT; the latent representations are available both per-residue and per-protein. To facilitate further investigations, I also located the pre-computed per-protein embeddings of the entire Swiss-Prot database. In the corresponding notebook, in the experiments folder, you can find the code to extract the embeddings.
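As an illustration, here is a minimal sketch of how the GO annotations could be read from the TSV file. The column names are assumptions based on standard UniProt exports; adjust them to whatever uniprot-annotations.tsv actually contains.

```python
# Minimal sketch: read GO annotations from the UniProt TSV export.
# Column names ("Entry", "Gene Ontology (GO)") are assumptions based on
# standard UniProt downloads; check the real columns first and adjust.
import pandas as pd

df = pd.read_csv("data/human_gene_go_term/uniprot-annotations.tsv", sep="\t")
print(df.columns.tolist())                 # inspect the real column names

go_col = "Gene Ontology (GO)"              # assumed column name
go_terms = (
    df.set_index("Entry")[go_col]          # "Entry" is the usual UniProt accession column
      .dropna()
      .str.split("; ")                     # GO terms are typically semicolon-separated
)
print(go_terms.head())
```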
Additionally, in these folders you can also find Jupyter notebooks with the architectures and the model checkpoints. The model checkpoints are also available here under experiments > model_ckpt.
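As a sketch of how those checkpoints can be inspected and restored, assuming standard PyTorch Lightning serialization (the checkpoint path and the class name in the comment are hypothetical stand-ins for the actual files and LightningModule in the experiments folder):

```python
# Minimal sketch: inspect one of the released Lightning checkpoints.
# The checkpoint path is illustrative; pick any .ckpt under experiments/model_ckpt.
import torch

ckpt = torch.load("experiments/model_ckpt/last.ckpt", map_location="cpu")
print(list(ckpt.keys()))              # typically: state_dict, hyper_parameters, ...
print(list(ckpt["state_dict"])[:5])   # first few parameter names

# To restore the full model, use the LightningModule defined in the matching
# architecture file (the class name below is a hypothetical stand-in):
#   model = ProteinAutoEncoder.load_from_checkpoint("experiments/model_ckpt/last.ckpt")
```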
In the experiments folder, you can find scripts and Jupyter notebooks to run the experiments. The Jupyter notebooks are self-explanatory. In the file module.py inside the experiments folder, you can find the functions implemented to preprocess the data. In separate files you can find the architecture implemented in each version. Inside the experiments folder you will also find a short description of each file.
The img folder contains the images of the cross-weighting block and of the architecture proposed in the discussion of the report, as well as the images of the experimental results. Specifically, the images are the following:
- img-crossweighting.png: the cross-weighting block;
- img-architecture.png: the architecture proposed in the discussion of the report;
- img-table-metrics.png: the table with the metrics trend;
- img-clusters-token.png: the clusters obtained by the model using a fixed vocabulary;
- img-clusters-nn.EMBEDDING.png: the clusters obtained by the model using a learnable vocabulary;
- img-clusters-ESM2.png: the clusters obtained by the model using the ESM2 embedding.
🚧🚧 Other investigations discussed in the report are under development 🚧🚧.
We have made our implementation publicly available, and we welcome anyone who would like to join and contribute to the project.
If you have suggestions or ideas for further improvements or research, please feel free to contact me.
- Riccardo Tedoldi: @riccardotedoldi
The code is licensed under the MIT license, which you can find in the LICENSE file.
@misc{Tedoldi2023,
title = {Uncovering the Hidden Code: A Study of Protein Latent Encoding},
author = {Riccardo Tedoldi},
year = {2023},
url = {https://github.com/r1cc4r2o/UHC}
}