In this project I implemented in Python the HP model for protein folding.
The HP model is a simple model that can help to understand basic folding behaviours of proteins using Monte Carlo simulation over the free energy of the bounds. It approximate the protein sequence with only two category of aminoacids, H (hydrophobic) and P (polar). For more info look at this paper or at the paragraph below.
I created a command line application that can take the sequence of a protein (even with the 20 distinct aminoacids) and can run a HP model simulation of the folding process of the protein at given temperatures using, optionally, annealing algorithms and giving as outputs the energy evolution of the system, the structure of the protein the native structure energy (or the minimum energy configuration founded) and the compactness.
This project can be used to have a first impact of the behaviour of a protein and can be used to study the transitions to native states of proteins as function of temperature and bounding energy. Can be also done different test and comparison, for example seeing the different behaviours of similar proteins when two random adjacent aminoacids are switched.
From terminal move into the desired folder and clone this repository using the following command:
git clone https://github.com/TommyGiak/HP_model.git
Once the github repository is cloned use the following command to run the script with the default configuration file as input:
python main.py
the default configuration file is config.txt. Of course all the parameters can be setted, for further instruction look at the following paragraph.
To use a specific configuration file, instead, use:
python main.py <filename>
To create a personalized configuration file look here
The requirement to run this application are:
To change the protein sequence, the structure and the other parameters you can modify the config.txt file or create a configuration file exnovo as described below.
The protein sequence can be written in the config.txt file at the sequence variable. The sequence is caps sensitive so the letters must be upper case only. The sequence doesn't need the quotation marks " or ' .
This sequence can be already given as H/P only monomers (hydrophobic or polar) or with the 20 different amino-acids coded upper case, is this case the sequence will be converted to H/P.
The number of folding steps can be setted in the config.txt file at the folding_step variable.
The other options are:
- choosing if use annealing algorithm or not, setting annealing TRUE or FALSE.
- using a specific initial structure for the protein, setting use_structure TRUE and inserting a correct structure in structure
- set the initial temperature, using the variable T. If the variable annealing is TRUE the temperature decreases linearly to zero during the evolution of the protein, in the other case the temperature remains constant.
- create or not the gif of the process at the and of the evolution, using the variable create_gif TRUE or FALSE.
- random seed selection: you can specify the random seed to use or insert seed = None to generate a random one that will be printed when the starting of the simulation.
To create a personalized configuration file you can just copy the syntax of the config.txt changing the parameters as you prefer. To use it in the simulation follow the instruction here.
The new configuration file can have any extension.
The repository contain these pyhton files:
- main.py: runs the evolution and save the results in the /data folder
- protein_class.py: contains a class named
Protein
which contains and save the protein information and implement all the main function for the evolution of the system - utils.py: contains different functions to validate the structures of the proteins and to support the evolution of the protein, including reordering of the input file configurations
- plots.py: contains funtion to plot the results
- test.py: contains the test functions to test the code
- config.txt: contains the information needed by the program as input
- config_test.txt: contains the information to start the test of the application; just to be sure to don't have problems, I suggest you to don't modify it. To run the test is sufficient to write from terminal, inside the cloned repository, "pytest test.py" (pytest is required).
As already mentioned before, the HP model for protein folding is a model for protein folding that simplifies the complex process of predicting protein structures by focusing on two types of amino acids: hydrophobic (H) and polar (P). It assumes that hydrophobic amino-acids cluster together inside the protein to avoid water, while polar amino acids are on the protein's surface. This model is a useful educational tool to teach the basics of protein folding but is highly simplified compared to real protein folding, which involves many other factors and interactions. Researchers often use more sophisticated models for accurate predictions.
The algorithm for the protein folding is implemented in the Protein
class in the protein_class.py file, with the support of some more generic function in the utils.py file.
Each folding step involve the following steps:
- choose a random monomer in the protein: a random integer from
$0$ to$l-1$ where$l$ is the lenght of the protein sequence. The sampled monomer will be the starting point for the movement of the protein. - sample a random integer in
$[1,8]$ to randomly select a type of move, the movement are implemented in the functiontail_fold
in the utils.py file. The movement are: 1 = 90° clockwise rotation, 2 = 90° anticlockwise rotation, 3 = 180° rotaion, 4 = x-axis refletion, 5 = y-axis reflection, 6 = 1 and 3 quadrant bisector symmetry, 7 = 2 and 4 quadrant bisector symmetry and 8 = movement on a digaonal of a random monomer. The movement can also be choosen a priori, if it is not specified a random one is selected. - the new protein is validated: if the protein sequence overlap (or the distance between neighbours is different from one) the process restart from the step 1.
- if the protein structure is valid the new structure (the folded protein) is passed.
Once a new structure is generated by a folding step, the energy of this new structure is computed and the acception follows the Metropolis algorithm.
If the energy of the new protein structure is less than the previous one, the new structure is always accepted, if instead the energy is grater the new structure is accepted with probability:
where
N.B. I approximate
As example I used a simulation for a the protein sequence of Myoglobin (Camelus dromedarius) taken from here.
I simulated 100000 folding steps as indicated in the config.txt file.
I started the simulation with
The whole process took
The results are also present in the data folder, anyway, starting with a linear structure, I obtained the following results:
- energy minimum:
- compactness maximum:
- evolution process:
- energy evolution profile:
- compactness evolution profile:
It is easy to see from the plots how the energy and the compactness stabilize when the temperature