Accelerate the generation of personalized proteomes from a Variant calling format (VCF) file and a reference proteome using graphical processing units (GPUs).

VCF2Prot

Project Aim

Generate sample-specific FASTA files from a consequence-called VCF file and a reference proteome.

Execution Logic and Requirements

Input Requirements

  1. A reference FASTA file containing transcript ids as sequence identifiers and the protein sequence of each transcript, for example:
>TRANS_ID
TRANS_SEQ_LINE1
TRANS_SEQ_LINE2 
>TRANS_ID
TRANS_SEQ_LINE1
.
.
.

That is, the parser treats every character between > and '\n' as the transcript name. Also, make sure that the ids used in this file are the same as those in the VCF file; otherwise, the program will not be able to function properly.
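The header rule described above can be illustrated with a minimal Python sketch. It is not the parser used by vcf2prot; the function name parse_reference_fasta is hypothetical, and the sketch only shows that the whole header line after > becomes the transcript id and that multi-line sequences are joined.

```python
def parse_reference_fasta(text):
    """Map each transcript id to its protein sequence (illustrative only)."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Everything between '>' and the end of the line is the id.
            current_id = line[1:]
            sequences[current_id] = []
        elif current_id is not None and line.strip():
            # Sequence lines belonging to the current transcript.
            sequences[current_id].append(line.strip())
    return {tid: "".join(parts) for tid, parts in sequences.items()}
```

Note that under this rule a header such as ">T2 some description" yields the id "T2 some description", which would not match a VCF record that refers to "T2".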

  2. A VCF file containing the variants observed in the study population. The VCF file should be generated by BCFtools/csq, as vcf2prot has been optimized to decode its bit-mask and to parse its consequence field. The file should also be phased and in flat VCF format, not BCF.

Notes

  1. The only exception is when the Python wrapper is used, which works directly with tabix-indexed BCF files.

  2. You can decode a BCF file into a VCF using the following command:

bcftools view PATH_TO_BCF -O v -o PATH_TO_VCF

Hardware Requirements

GPU version

The GPU version of vcf2prot expects an NVIDIA GPU to be accessible on the system. During development we used a Tesla V100 SXM2 32GB.

CPU version

The CPU version expects a modern multi-core CPU with enough RAM to hold the whole file in memory. During development, a compute node with 512 GB of RAM and two Intel Xeon CPUs was used.

Software Requirements

The GPU version of the code can be compiled on a Linux system with an available NVCC compiler and an NVIDIA GPU.

The CPU version of the code can be compiled on Linux and macOS systems with Cargo.

To ensure a correct code compilation, make sure you are using Rust version 1.65.0 or higher.

Execution Logic

vcf2prot's execution logic can be separated into the following main steps:

  1. Reading and parsing the input VCF file: the file is read as a UTF-8 encoded string, patients' names are extracted, and records are filtered so that only records with a supported protein-coding effect are passed to the next step. The list of alterations supported by the current version can be found in the file list_supported_alterations.txt.

  2. Once the VCF records have been filtered, bit-masks are decoded and combined with the consequence annotations to generate a hash table linking each patient to the collection of mutations observed in both of the patient's haplotypes.

  3. For each patient, mutations are grouped by transcript id, i.e. all mutations occurring on a specific transcript are combined together.

  4. For each collection of mutations, mutations are translated into instructions. At this stage mutations are checked for logical errors, e.g. mutational engulfment, where one mutation is a subset of another mutation, or multiple annotations, where the same position is annotated with more than one mutation. Semantic equivalence, where two mutations differ at the genetic level but are equivalent at the protein level, is also resolved, leading to a much smaller and more consistent definition of alterations at the protein level. If any logical error is encountered, a warning message is printed to the standard output and the transcript is filtered out. Finally, instructions are interpreted and a simple representation of the transcript sequence is generated; internally, this is represented as a vector of Tasks.

  5. After encoding each transcript into tasks, all transcripts are concatenated end-to-end to generate a vector of tasks describing the generation of all sequences in the haplotype.

  6. Next, a backend engine is used to execute the tasks and generate the sequences. This engine can be a collection of CPU threads or an execution stream on the GPU.

  7. Finally, the generated personalized proteomes are written to disk either as flat FASTA files or in a compressed format.
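The bit-mask decoding in step 2 can be sketched in Python as follows. This is an illustration, not vcf2prot's implementation: it assumes the interleaved layout described in the BCFtools/csq documentation (even bits refer to the first haplotype, odd bits to the second, and bit pair i points to the i-th consequence listed for the record) and a mask that fits in a single integer. The function name decode_bcsq_mask is hypothetical.

```python
def decode_bcsq_mask(mask):
    """Return (hap1_indices, hap2_indices) of consequences set in the mask.

    Assumed layout: bit 2*i -> consequence i on haplotype 1,
                    bit 2*i + 1 -> consequence i on haplotype 2.
    """
    hap1, hap2 = [], []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            target = hap1 if bit % 2 == 0 else hap2
            target.append(bit // 2)
        bit += 1
    return hap1, hap2
```

For example, a mask of 6 (binary 110) would mean that the first listed consequence is present on the second haplotype and the second listed consequence on the first haplotype.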

Usage

The tool needs two mandatory inputs: the first is the VCF file containing the consequence calls, and the second is a FASTA file containing the reference sequences.

Example

Clone the project

git clone https://github.com/ikmb/vcf2prot

Please note that git usually comes pre-installed on most macOS and Linux systems. If git is not available on your system, you can install it from the official Git website.

Change directory to vcf2prot

cd vcf2prot

Note that git clone creates a directory named vcf2prot inside the directory from which it was called.

Is vcf2prot installed?

To follow along, make sure the vcf2prot executable has been installed on your system and is available on your PATH. In case it is not installed, check the installation guidelines below.

Export Env variables

Let's inspect the SIR on the CPU before execution, the instruction generation, and the task arrays:

export DEBUG_CPU_EXEC=TRUE
export INSPECT_TXP=TRUE
export INSPECT_INS_GEN=TRUE

For more details about the meaning of the exported variables, check the Environment Variables section below.

Unzip the example file and reference sequences
gunzip examples/*
Create a new directory to store the results
mkdir results 
Call vcf2prot with some example data
Note

Pre-compiled versions of VCF2Prot for macOS and Linux can be found in the bins directory. Choose the correct version for your operating system, i.e. Linux or macOS, and then call VCF2Prot accordingly.

Pre-compiled Linux version
./bins/Linux/vcf2prot -f examples/example.vcf -r examples/reference_sequences.fasta -v -g st -o results
Pre-compiled MacOS version
./bins/MacOS/vcf2prot -f examples/example.vcf -r examples/reference_sequences.fasta -v -g st -o results
Locally-built version
./target/release/vcf2prot -f examples/example.vcf -r examples/reference_sequences.fasta -v -g st -o results

Here, the -f and -r flags give the input VCF file and the reference FASTA file, the -o flag sets the path to which the FASTA files are written, and the -v flag enables printing of log statements.
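When vcf2prot is called from a script, the invocation above can be assembled programmatically. The following Python sketch is only an illustration: the helper names build_vcf2prot_command and debug_environment are hypothetical, the default binary name assumes vcf2prot is on your PATH, and only the flags and variables shown in this README are used.

```python
import os


def build_vcf2prot_command(vcf_path, reference_path, output_dir,
                           binary="vcf2prot", engine="st", verbose=True):
    """Assemble the argument list for the example call shown above."""
    cmd = [binary, "-f", vcf_path, "-r", reference_path,
           "-g", engine, "-o", output_dir]
    if verbose:
        cmd.append("-v")  # print log statements
    return cmd


def debug_environment(base=None):
    """Copy an environment and switch on the inspection variables."""
    env = dict(os.environ if base is None else base)
    env.update({
        "DEBUG_CPU_EXEC": "TRUE",
        "INSPECT_TXP": "TRUE",
        "INSPECT_INS_GEN": "TRUE",
    })
    return env
```

The resulting list and environment could then be passed to subprocess.run(cmd, env=env, check=True) from the Python standard library.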

Environment Variables

vcf2prot also relies heavily on environment variables to customize its behavior. The environment variables used by vcf2prot are listed below:

  1. DEBUG_GPU => The input arrays to the GPU are inspected for indexing errors. In case of an indexing error, the full input table and the index of the row with the first indexing error are printed to the standard output descriptor.

  2. DEBUG_CPU_EXEC => Inspects the vector of tasks provided to the CPU execution engine. In case of an indexing error, the full input table and the index of the row with the first indexing error are printed to the standard output descriptor.

  3. DEBUG_TXP="Transcript_ID" => Exports a transcript id that will be used for debugging. While the IR for this transcript is being created, additional information is logged to the output descriptor.

  4. INSPECT_TXP => If set, after each transcript is translated into instructions an inspection function is called to check the correctness of the translation. If the translation failed, the code panics and the error is printed to the output descriptor.

  5. INSPECT_INS_GEN => Inspects the translation process from mutations to instructions. As of version 0.1.3, two logical errors are inspected: first, multiple annotations, where more than one mutation is observed at the same position in the protein backbone; second, mutational overlap and engulfment, where two mutations overlap in length, for example, an insertion of 7 amino acids at position 60 followed by a missense mutation at position 64.

  6. PANIC_INSPECT_ERR => If set, the code panics if inspecting the translation from mutations to instructions fails. This overrides the default behavior, where an error message is generated and printed to the output stream.
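The two logical errors named under INSPECT_INS_GEN can be illustrated with a short Python sketch. It is a simplification and not the check implemented in vcf2prot: each mutation is modelled as a hypothetical (position, length) pair describing the span it occupies on the protein, and the function name find_logical_errors is an assumption.

```python
def find_logical_errors(mutations):
    """Report multiple annotations and overlaps among (position, length) pairs."""
    errors = []
    prev_pos = None       # start of the previously visited mutation
    furthest_end = None   # first position after the longest span seen so far
    owner = None          # start of the mutation that reaches furthest_end
    for pos, length in sorted(mutations):
        if prev_pos is not None:
            if pos == prev_pos:
                errors.append(("multiple_annotations", prev_pos, pos))
            elif pos < furthest_end:
                errors.append(("overlap", owner, pos))
        end = pos + max(length, 1)
        if furthest_end is None or end > furthest_end:
            furthest_end, owner = end, pos
        prev_pos = pos
    return errors
```

With the example from the list above, find_logical_errors([(60, 7), (64, 1)]) reports an overlap between the mutations starting at positions 60 and 64.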

Using BCFtools/csq

As stated above, VCF2Prot can only be used with VCF files called by BCFtools/csq. Files can be called with csq as follows:

bcftools csq -g Homo_sapiens.GRCh38.106.chromosome.1.gff3 -f Homo_sapiens.GRCh38.dna.chromosome.1.fa input_phased_file.vcf \
 -O v -o input_phased_called_file.vcf -n 64

Here, -n sets the maximum number of per-haplotype consequences to consider for each site. Please check the BCFtools/csq documentation for more details.

Compilation from source

CPU Version

Note

Compiling the following code produces a CPU-only version, which means that the code will panic if the GPU is specified as the engine, i.e. if the parameter -g is set to gpu.

  1. Install Rust from the official website

  2. Clone the current repository

git clone https://github.com/ikmb/vcf2prot
  3. Change the directory to vcf2prot
cd vcf2prot
  4. Change to the cpu_only branch
git checkout cpu_only
  5. Build the project
cargo build --release
  6. Access the binary executable from the target directory
cd target/release
./vcf2prot -h # This prints the help statement
  7. Add the binary to your PATH

GPU Version (Experimental)

Note

The GPU version is an experimental version and shall only be used for software development purposes

The following GPU code is only compatible with CUDA and NVIDIA GPUs

  1. Install Rust from the official website

  2. Clone the current repository or download the source code from the project's GitHub page

git clone https://github.com/ikmb/vcf2prot
  3. Change the directory to vcf2prot
cd vcf2prot
  4. Make sure the environment variables CUDA_HOME and LD_LIBRARY_PATH are set; please set their values according to your system.

  5. Use any text editor to update the following line in the build script, build.rs, which is located at the root directory:

    println!("cargo:rustc-link-search=native=/opt/cuda/11.0/lib64/"); // 8th line in the current version
    println!("cargo:rustc-link-search=native=/path/to/your/cuda/lib64/"); // 8th line in the updated version
  6. Build the project
cargo build --release
  7. Access the binary executable from the target directory
cd target/release
./vcf2prot -h # This prints the help statement

Troubleshooting

Problem

error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory

Solution

This problem is encountered when either of the two environment variables, CUDA_HOME and LD_LIBRARY_PATH, is not set. For a permanent solution, export these two variables in your .bashrc.

Problem

Calling cargo build produces: error: Permission denied (os error 13)

Solution

This problem usually happens when there is a problem with access permissions and can be solved by:

cargo clean && cargo build --release

or using

sudo chown -R $(whoami) PATH_TO_PROJECT

where PATH_TO_PROJECT points to the project directory, i.e. the directory into which the repository has been cloned.

Docker Image

Using DockerHub

VCF2Prot is currently available on DockerHub. The image can be pulled using Docker as follows:

docker pull ikmb/vcf2prot

This automatically downloads the latest version. To use a specific version, use the following command:

docker pull ikmb/vcf2prot:0.1.4

Building locally

The container for VCF2Prot can be built as follows:

  1. Make sure Docker is installed and running (installation details are available in the Docker documentation) using:
docker run --rm hello-world

If this worked correctly and printed the message "Hello from Docker!", feel free to jump to step 2. Otherwise, depending on the error message, two things can be done. First, if the error message was permission denied, try running the same command as a sudo user, i.e.

sudo docker run --rm hello-world

Otherwise, if the message was unable to connect to the daemon and you are working on macOS, first start the Docker Desktop application from the Launchpad and then try again using

docker run --rm hello-world
  2. Clone the repository
git clone https://github.com/ikmb/vcf2prot
  3. Change the directory to vcf2prot
cd vcf2prot
  4. Build the container
docker build -t vcf2prot .
  5. Run the container
docker run vcf2prot -h

Output format

The FASTA files generated by VCF2Prot have the following format:

  1. Header: made up of the transcript name followed by either '_1', for a transcript containing alterations arising from the first haplotype, or '_2', for alterations arising from the second haplotype.

  2. Body: contains the generated personalized protein sequence.
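When post-processing the output, the haplotype suffix can be separated from the transcript name. The Python sketch below is a hypothetical helper (split_header is not part of vcf2prot) and assumes headers follow exactly the format described above.

```python
def split_header(header):
    """Split an output FASTA header into (transcript_name, haplotype)."""
    # Split on the last underscore, since transcript names may contain '_'.
    name, sep, hap = header.lstrip(">").rpartition("_")
    if not sep or hap not in ("1", "2"):
        raise ValueError(f"unexpected header format: {header!r}")
    return name, int(hap)
```

For instance, split_header(">TRANS_ID_2") returns the transcript name TRANS_ID and haplotype 2.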

Contact

For further questions, please feel free to open an issue on the project's GitHub page, send an email to the developers, or reach out through Twitter @HeshamElAbd16.

Funding

The project was funded by the German Research Foundation (DFG) (Research Training Group 1743, ‘Genes, Environment and Inflammation’)
