UHC: Uncovering the Hidden Code: A Study of Protein Latent Encoding
Supervisor: Jacopo Staiano
Applied NLP Project, Spring 2023
The figures show how proteins with different functions separate in the latent space. Once the network has been trained on a downstream task, the learned latent space captures the characteristics of what each protein does. Different colours mark groups of proteins for which sequence similarity gives statistical evidence of similar function. The figures are shown in this order: img-clusters-token.png; img-clusters-nn.EMBEDDING.png; img-clusters-ESM2.png. I discuss these figures in more detail in the report.
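As an illustration, here is a minimal sketch of how such cluster plots can be produced from the released data. The file paths, the variable contents, and the choice of t-SNE are assumptions for illustration, not necessarily the exact procedure used in the report.

```python
# Minimal sketch: project per-protein embeddings to 2D and colour points by protein family.
# Paths, file contents, and the use of t-SNE are assumptions for illustration only.
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = torch.load("data/esm2/experiment/pred_emb.pt")            # assumed (N, D) tensor
labels = np.array(torch.load("data/esm2/experiment/pred_label.pt"))    # assumed N family labels

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings.numpy())

for family in np.unique(labels):
    pts = coords[labels == family]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=str(family))
plt.legend(fontsize=6)
plt.title("Protein latent space coloured by family")
plt.savefig("img-clusters-sketch.png", dpi=200)
```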
- Installation: how to install and set up the project locally, including the required dependencies.
- Usage: how to use the project.
- Contributing: how to contribute to the project.
- License: information about the project's license.
To create a conda environment with all the requirements, run:
conda create --name <env-name> --file ./requirements.txt
Then, you can activate the environment with:
conda activate <env-name>
I release the data used in the experiments in a Google Drive folder. The folder is structured as follows:
data
├───proteins
│   ├───albumin
│   │   ├───sequences_bpe
│   │   │   ├───BPE_model_albumin.model
│   │   │   ├───BPE_model_albumin.vocab
│   │   │   ├───sequences_BPE_str.pkl
│   │   │   ├───sequences_BPE.pkl
│   │   │   ├───sequences_pck.pkl
│   │   │   ├───sequences.fasta
│   │   │   └───sequences.txt
│   │   └───GenBank-id.fasta
│   ├───aldehyde
│   │   └───...
│   ├───catalase
│   │   └───...
│   ├───collagen
│   │   └───...
│   ├───elastin
│   │   └───...
│   ├───erythropoietin
│   │   └───...
│   ├───fibrinogen
│   │   └───...
│   ├───hemoglobin
│   │   └───...
│   ├───immunoglobulin
│   │   └───...
│   ├───insulin
│   │   └───...
│   ├───keratin
│   │   └───...
│   ├───myosin
│   │   └───...
│   ├───protein_kinase
│   │   └───...
│   ├───trypsin
│   │   └───...
│   └───tubulin
│       └───...
├───token_axa
│   ├───proteins_tokenized_padded.pt
│   └───lightning_logs
├───token_axa_nn_embedding
│   ├───proteins_tokenized_padded.pt
│   └───lightning_logs
├───esm2
│   ├───esm_embedding_0_3.pkl
│   ├───esm_embedding_4_7.pkl
│   ├───esm_embedding_8_10.pkl
│   ├───esm_embedding_11_14.pkl
│   ├───esm_embedding_label.pkl
│   ├───esm_sequences_dict.pkl
│   └───experiment
│       ├───esm_X_embedding_dataset_AVG.pt
│       ├───pred_emb.pt
│       ├───pred_label.pt
│       └───lightning_logs
└───human_gene_go_term
    ├───uniprot-annotations.fasta
    ├───uniprot-annotations.tsv
    ├───with_structure_uniprot-annotations.fasta
    ├───with_structure_uniprot-annotations.tsv
    ├───dictionary_all_go_tensor_padded.pkl
    ├───dictionary_all_go_tensor.pkl
    └───dictionary_all_go.pkl
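As a quick orientation, here is a minimal sketch of how the serialized files above might be loaded. The exact contents of each file (shapes, dictionary layouts) are assumptions; inspect them before relying on them.

```python
# Minimal sketch: load a few of the serialized artifacts listed above.
# The expected contents of each file are assumptions for illustration.
import pickle
import torch

# BPE-tokenized albumin sequences (pickled Python object, layout assumed)
with open("data/proteins/albumin/sequences_bpe/sequences_BPE.pkl", "rb") as f:
    albumin_bpe = pickle.load(f)

# Padded token tensor used by the fixed-vocabulary model
tokens = torch.load("data/token_axa/proteins_tokenized_padded.pt")

# Pre-computed ESM2 embeddings (pickled, layout assumed)
with open("data/esm2/esm_embedding_0_3.pkl", "rb") as f:
    esm_chunk = pickle.load(f)

print(type(albumin_bpe), type(tokens), type(esm_chunk))
```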
__________________________________________________
Concerning the human_gene_go_term folder: I retrieved the GO terms of almost 20,000 human genes. Additionally, I provide their pre-computed embeddings obtained with ProtBERT; the latent representations are available both per-residue and per-protein. To facilitate further investigations, I also located the pre-computed per-protein embeddings of the entire Swiss-Prot database. In the corresponding notebook, in the experiments folder, you can find the code to extract the embeddings.
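As an illustration, here is a minimal sketch of how the GO annotations could be read from the TSV file. The column names are assumptions based on standard UniProt exports; adjust them to whatever uniprot-annotations.tsv actually contains.

```python
# Minimal sketch: read GO annotations from the UniProt TSV export.
# Column names ("Entry", "Gene Ontology (GO)") are assumptions based on
# standard UniProt downloads; check the real columns first and adjust.
import pandas as pd

df = pd.read_csv("data/human_gene_go_term/uniprot-annotations.tsv", sep="\t")
print(df.columns.tolist())                 # inspect the real column names

go_col = "Gene Ontology (GO)"              # assumed column name
go_terms = (
    df.set_index("Entry")[go_col]          # "Entry" is the usual UniProt accession column
      .dropna()
      .str.split("; ")                     # GO terms are typically semicolon-separated
)
print(go_terms.head())
```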
Additionally, in these folders you can also find Jupyter notebooks with the architectures and the model checkpoints. The model checkpoints are also available here under experiments > model_ckpt.
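As a sketch of how those checkpoints can be inspected and restored, assuming standard PyTorch Lightning serialization (the checkpoint path and the class name in the comment are hypothetical stand-ins for the actual files and LightningModule in the experiments folder):

```python
# Minimal sketch: inspect one of the released Lightning checkpoints.
# The checkpoint path is illustrative; pick any .ckpt under experiments/model_ckpt.
import torch

ckpt = torch.load("experiments/model_ckpt/last.ckpt", map_location="cpu")
print(list(ckpt.keys()))              # typically: state_dict, hyper_parameters, ...
print(list(ckpt["state_dict"])[:5])   # first few parameter names

# To restore the full model, use the LightningModule defined in the matching
# architecture file (the class name below is a hypothetical stand-in):
#   model = ProteinAutoEncoder.load_from_checkpoint("experiments/model_ckpt/last.ckpt")
```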
In the experiments folder, you can find scripts and Jupyter notebooks to run the experiments. The Jupyter notebooks are self-explanatory. In the file module.py inside the experiments folder, you can find the functions implemented to preprocess the data. In separate files you can find the architecture implemented in each version. Inside the experiments folder you will also find a short description of each file.
The img folder contains the images of the cross-weighting block and of the architecture proposed in the discussion of the report, as well as the images of the experimental results. Specifically, the images are the following:
- img-crossweighting.png: the cross-weighting block;
- img-architecture.png: the architecture proposed in the discussion of the report;
- img-table-metrics.png: the table with the metrics trend;
- img-clusters-token.png: the clusters obtained by the model using a fixed vocabulary;
- img-clusters-nn.EMBEDDING.png: the clusters obtained by the model using a learnable vocabulary;
- img-clusters-ESM2.png: the clusters obtained by the model using the ESM2 embedding.
🚧🚧 Other investigations discussed in the report are under development 🚧🚧.
We have made our implementation publicly available, and we welcome anyone who would like to join and contribute to the project.
If you have suggestions or ideas for further improvements or research, please feel free to contact me.
- Riccardo Tedoldi: @riccardotedoldi
The code is licensed under the MIT license, which you can find in the LICENSE file.
@misc{Tedoldi2023,
title = {Uncovering the Hidden Code: A Study of Protein Latent Encoding},
author = {Riccardo Tedoldi},
year = {2023},
url = {https://github.com/r1cc4r2o/UHC}
}