This repository serves two purposes:
- Running cross-species analyses on the data collected by the Cross-Species project of the Systems Biology of Aging Group
- Reproducing the analysis of the "Machine learning analysis of longevity-associated gene expression landscapes in mammals" paper
If you use the code or data from this project, please cite our paper. If you have any questions regarding the data, the code, or the paper, feel free to contact the Systems Biology of Aging Group or open an issue on GitHub.
This figure illustrates the core elements of the Cross-Species ML pipeline:
To download and prepare the indexes of reference genomes and transcriptomes, use the species-notebooks repository.
For RNA-Seq processing of samples, use the quantification pipeline.
To upload Compara orthology data, as well as the quantified data of our samples, to the GraphDB database, use the species-notebooks repository.
To reproduce the stage I and II models, use the current yspecies repository (see documentation below). There are dedicated notebooks for those stages:
- stage_one_shap_selection notebook contains stage one shap_selection code
- stage_two_shap_selection notebook contains stage two shap_selection code
Linear models are implemented in the cross-species-linear-models repository. Bayesian network analysis and multilevel Bayesian linear modelling are available in the bayesian_networks_and_bayesian_linear_modeling repository.
The results of both of these models can also be pulled with DVC in the current yspecies repository.
To generate a ranked table, use the current yspecies repository (see documentation below). There is a dedicated results_intersections notebook for generating ranked tables.
To reproduce this stage, use the stage_three_shap_selection notebook in the notebooks folder.
The data folder holds input, interim, and output data.
Before you start running anything, do not forget to dvc pull the data, and after committing do not forget to dvc push it!
The pipeline is run by running DVC stages (see the dvc.yaml file).
Most of the analysis is written in Jupyter notebooks in the notebooks folder.
Each stage runs the corresponding notebooks with papermill, which also stores the executed notebooks in data/notebooks, while DVC tracks the inputs and outputs of the stage.
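As a rough illustration of what such a stage does, the sketch below builds a papermill command line of the form `papermill <input> <output> -p <name> <value>`. The notebook file name, the `.ipynb` extension, and the `threshold` parameter are assumptions made for the example; the real commands are the ones listed in dvc.yaml.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def papermill_command(notebook: PurePosixPath,
                      output_dir: PurePosixPath,
                      parameters: Dict[str, str]) -> List[str]:
    """Build the argv for `papermill <input> <output> -p name value ...`."""
    # The executed copy keeps the notebook's file name but lives in output_dir.
    executed = output_dir / notebook.name
    command = ["papermill", str(notebook), str(executed)]
    for name, value in parameters.items():
        command += ["-p", name, str(value)]
    return command


# Hypothetical paths and parameter, used only for illustration.
command = papermill_command(
    PurePosixPath("notebooks/select_samples.ipynb"),
    PurePosixPath("data/notebooks"),
    {"threshold": "0.5"},
)
print(" ".join(command))
```

The sketch only assembles the command and does not execute it, so it can be inspected without papermill being installed.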
You can use either micromamba/conda/anaconda or a Docker container to set up the project.
First, create a Conda or Micromamba environment for the project. Micromamba is a superior alternative to Conda with a very similar API.
To create the environment, run:
micromamba create --file environment.yaml
micromamba activate yspecies
If any errors occur during setup, please read the known issues at the bottom of README.md. If the problem is not mentioned there, feel free to open a GitHub issue.
Then pull the data with DVC. To do this, activate the yspecies environment and run:
dvc pull
NOTE: we keep the data on Google Drive, so on the first run of dvc pull
it may give you a link asking you to allow access to your Google Drive in order to download the project data.
We are grateful to @shcheklein and @dmpetrov for their help with DVC configuration.
After authentication, you can run any of the pipelines with:
dvc repro
or you can run Jupyter notebooks to explore them on your own (see the running notebooks section).
Alternatively, you can use a Docker container that already contains the micromamba environment with everything pre-installed. Get inside the container with:
docker run -i -t --network host quay.io/comp-bio-aging/yspecies:latest
Micromamba environment will be automatically activated inside the container. To reproduce the pipelines you can run:
dvc repro
You can also pull the data and start JupyterLab to work with the notebooks:
dvc pull
jupyter lab notebooks --allow-root
DVC stages are defined in the dvc.yaml file. To run a single stage, use dvc repro <stage_name>:
dvc repro <stage_name>
Most of the stages also produce notebooks together with files in the output.
There are several key notebooks in the project. All notebooks can be run either from Jupyter (with jupyter lab notebooks) or from the command line with dvc repro.
- select_samples notebook does preprocessing to select the right combination of samples, genes and species. Most of the other notebooks depend on it
- stage_one_shap_selection notebook contains stage one shap_selection code
- stage_two_shap_selection notebook contains stage two shap_selection code
- stage_three_shap_selection notebook contains stage three shap_selection code
- results_intersections notebook is used to compute intersection tables from the results of several analysis methods (linear, causal and SHAP)
- For each of the stages there are also stage__optimize notebooks which contain hyper-parameter optimization code
You can run notebooks manually by activating the yspecies environment and running:
jupyter lab notebooks
and then running the notebook of your choice. However, keep in mind that notebooks depend on each other. In particular, the select_samples notebook generates the data for all others.
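To give an idea of what an intersection table over several methods can look like, here is a minimal sketch that counts how many methods selected each gene and ranks genes by that count. This README does not describe how results_intersections actually ranks genes, so the scoring rule, the `rank_by_support` helper, and the example gene assignments are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_by_support(selections: Dict[str, Set[str]]) -> List[Tuple[str, int, List[str]]]:
    """Return (gene, number_of_methods, methods) rows, best-supported genes first."""
    counts = Counter(gene for genes in selections.values() for gene in genes)
    table = [
        (gene, n, sorted(m for m, genes in selections.items() if gene in genes))
        for gene, n in counts.items()
    ]
    # Sort by descending support, then by gene id to keep the order deterministic.
    return sorted(table, key=lambda row: (-row[1], row[0]))


# Made-up selections, for illustration only.
selections = {
    "linear": {"ENSG00000073921", "ENSG00000139687"},
    "causal": {"ENSG00000073921"},
    "shap": {"ENSG00000073921", "ENSG00000139687"},
}
table = rank_by_support(selections)
for gene, support, methods in table:
    print(gene, support, ",".join(methods))
```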
Most of the code is packed into classes. The workflow is built on top of scikit-learn Pipelines. For an in-depth description of the pipeline, read the Cross-Species paper.
Yspecies package has the following modules:
- dataset - ExpressionDataset class to handle cross-species samples, genes, species metadata and expressions
- partition - classes required for the scikit-learn pipeline, starting from ExpressionDataset and going to SortedStratification
- helpers - auxiliary methods
- preprocess - classes for preprocessing steps of the cross-species pipeline
- config - project-specific config values (for example, folder locations)
- tuning - classes for hyperparameter optimization
- workflow - general classes with advanced scikit-learn workflow building blocks
- models - cross-validation models and metrics
- selection - LightGBM and SHAP-based feature selection
- explanations - FeatureSelection results, plots and auxiliary methods to explore them
- utils - various utility functions and classes
- workflow - helper classes required to reproduce pipelines in the paper
The code in the yspecies folder is a conda package that is used inside the notebooks. There is also an option to use a conda version of the package.
One of the key classes is the ExpressionDataset class:
e = ExpressionDataset("5_tissues", expressions, genes, samples)
e
It allows indexing by genes:
e[["ENSG00000073921", "ENSG00000139687"]]
#or
e.by_genes[["ENSG00000073921", "ENSG00000139687"]]
By samples:
e.by_samples[["SRR2308103","SRR1981979"]]
Both:
e[["ENSG00000073921", "ENSG00000139687"],["SRR2308103","SRR1981979"]]
The ExpressionDataset class has by_genes and by_samples properties, which allow indexing and filtering. For instance, to keep only blood tissue samples:
e.by_samples.filter(lambda s: s["tissue"]=="Blood")
The class is also Jupyter-friendly, with the _repr_html_() method implemented.
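The indexing and filtering behaviour described above can be mimicked with a small toy stand-in. This is not the ExpressionDataset implementation: `ToyDataset`, its method names, the tissue labels, and the expression values are all made up to show the idea of selecting genes and filtering samples by metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyDataset:
    """Toy stand-in: expressions[sample][gene] plus per-sample metadata."""
    expressions: Dict[str, Dict[str, float]]
    samples: Dict[str, Dict[str, str]]

    def filter_samples(self, predicate: Callable[[Dict[str, str]], bool]) -> "ToyDataset":
        # Keep only samples whose metadata satisfies the predicate.
        kept = [run for run, meta in self.samples.items() if predicate(meta)]
        return ToyDataset(
            {run: self.expressions[run] for run in kept},
            {run: self.samples[run] for run in kept},
        )

    def select_genes(self, genes: List[str]) -> "ToyDataset":
        # Keep only the requested gene columns for every sample.
        return ToyDataset(
            {run: {g: row[g] for g in genes} for run, row in self.expressions.items()},
            dict(self.samples),
        )


# Made-up values, for illustration only.
toy = ToyDataset(
    expressions={
        "SRR2308103": {"ENSG00000073921": 1.5, "ENSG00000139687": 0.2},
        "SRR1981979": {"ENSG00000073921": 3.0, "ENSG00000139687": 0.7},
    },
    samples={
        "SRR2308103": {"tissue": "Blood"},
        "SRR1981979": {"tissue": "Liver"},
    },
)
blood = toy.filter_samples(lambda s: s["tissue"] == "Blood")
print(blood.select_genes(["ENSG00000073921"]).expressions)
```

Both operations return a new dataset, so selections can be chained without modifying the original object.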
The partition module contains the key logic from the start of the pipeline until the data is partitioned according to sorted stratification.
Classes with data:
- FeatureSelection - specifies which fields we want to select from ExpressionDataset's species, samples, genes
- EncodedFeatures - class responsible for encoding of categorical features
- ExpressionPartitions - data class with results of partitioning
Transformers:
- DataExtractor - transformer that takes an ExpressionDataset and extracts data from it according to the FeatureSelection instructions
- DataPartitioner - transformer that does sorted stratification
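One common reading of sorted stratification for a continuous target is sketched below: items are sorted by target value, cut into consecutive blocks of `n_folds` items, and each block is spread randomly over the folds so that every fold covers the whole target range. This is an assumption about the general technique, not a description of what DataPartitioner does internally; the function name and the example target values are hypothetical.

```python
import random
from typing import Dict, List


def sorted_stratification(targets: Dict[str, float], n_folds: int, seed: int = 0) -> List[List[str]]:
    """Split ids into n_folds folds with similar target distributions."""
    rng = random.Random(seed)
    ordered = sorted(targets, key=lambda item: targets[item])
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    # Walk through consecutive blocks of n_folds items with neighbouring targets.
    for start in range(0, len(ordered), n_folds):
        block = ordered[start:start + n_folds]
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)  # random assignment inside the block
        for item, fold in zip(block, fold_ids):
            folds[fold].append(item)
    return folds


# Made-up target values (for example, a lifespan-like trait per species).
lifespans = {f"species_{i}": float(i) for i in range(1, 10)}
folds = sorted_stratification(lifespans, n_folds=3)
for index, fold in enumerate(folds):
    print(index, fold)
```

With nine items and three folds, each fold receives exactly one item from the low, middle, and high third of the sorted targets.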
The selection module is responsible for SHAP-based selection.
Classes with data:
- Fold - results of one Fold
Auxiliary classes:
- ModelFactory - used by ShapSelector to initialize the model
- Metrics - helper methods to deal with metrics
Transformers:
- ShapSelector - key transformer that does the learning
Module that contains final results
- FeatureResults is a key class that contains selected features, folds as well as auxiliary methods to plot and investigate results
Here we list workarounds for some typical problems connected with running the repository:
- error trying to exec 'cc1plus': exe: No such file or directory
This error occurs when g++ is not installed. The workaround is simple:
sudo apt install g++
- Failures to download the files: if one or more files were not downloaded, re-run dvc pull.
- Windows- and Mac-specific errors: even though yspecies seems to work on Mac and Windows, we used Linux as our main operating system and did not test it thoroughly on Windows and Mac, so feel free to report any issues with them.