Authors | Project | Documentation | Build Status | License | Code Quality | Coverage |
---|---|---|---|---|---|---|
N. Curti | DNetPRO |
|
|
|
Official implementation of the DNetPRO algorithm published on Scientific Reports by Curti et al.
- Overview
- Theory
- Prerequisites
- Installation
- Efficiency
- Usage
- Testing
- Table of contents
- Contribution
- References
- FAQ
- Authors
- License
- Acknowledgment
- Citation
Methods that select variables for multi-dimensional signatures based on single-variable performance can have limits in predicting higher-dimensional signature performance. As shown in Fig.1(a), in which both variables taken singularly perform poorly, but their performance becomes optimal in a 2-dimensional combination, in terms of linear separation of the two classes.
It is known that complex separation surfaces characterize classification tasks associated to image and speech recognition, for which Deep Networks are used successfully in recent times, but in many cases biological data, such as gene or protein expression, are more likely characterized by a up/down-regulation behavior (as shown in Fig.1(b) top), while more complex behaviors (e.g. a optimal range of activity, Fig.1(b) bottom) are much less likely. Thus, discriminant-based methods (and logistic regression methods alike) can very likely provide good classification performances in these cases (as demonstrated by our results with DNetPRO) if applied in at least two-dimensional spaces. Moreover, the of these methods (that generate very simple class separation surfaces, i.e. linear or quadratic) guarantee that a of a signature based on lower-dimensional signatures is feasible.
This consideration are relevant in particular for microarray data where we face on a small number of samples compared to a huge amount of variables (gene probes).
This kind of problem, often called problem (where N
is the number of features, i.e variables, and S
is the number of samples), tend to be prone to overfitting and they are classified to ill-posed.
The difficulty on the feature extraction can also increase due to noisy variables that can drastically affect the machine learning algorithms.
Often is difficult to discriminate between noise and significant variables and even more as the number of variables rises.
In this project we propose a new method of features selection - DNetPRO, Discriminant Analysis with Network PROcessing - developed to outperform the mentioned above problems. The method is particularly designed to gene-expression data analysis and it was tested against the most common feature selection techniques. The method was already applied on gene-expression datasets but my work focused on the benchmark of it and on its optimization for Big Data applications. The pipeline algorithm is made by many different steps and only a part of it was designed to biological application: this allow me to apply (part of) the same techniques also in different kind of problems with good results (see [10.1140/epjds/s13688-018-0168-2]).
The DNetPRO
algorithm produces multivariate signatures starting from all the couples of variables analyzed by a Discriminant Analysis.
For this reason, it can be classified as a combinatorial method and the computational time for the exploration of variable space is proportional to the square of the number of underlying variables (ranging from 10^3
to 10^5
in a typical high-throughput omics study).
This behavior allows to overcome some of the limits of single-feature selection methods and it provides a hard-thresholding approach compared to projection-based variable selection methods.
The combinatorial evaluation is the most time-expensive step of the algorithm and it needs an accurate algorithmic implementation for Big Data applications.
A summary of the algorithm is shown in the following pseudo-code.
Data: Data Matrix (N, S)
Result: List of putative signaturesDivide the data into training and test by a Hold-Out method;
FOR
couple
← (feature_1, feature_2) ∈Couples
DOLeave-One-Out cross validation;
Score estimation using the Classifier;END
Sorting of the couples in ascending order according to their score;
Threshold over the couples score (K-best couples);FOR
component
∈connected_components
DOIF
reduction
Iteratively pendant node removal;
ENDSignature evaluation using the Classifier;
END
So, given an initial dataset, with S
samples (e.g. cells or patients) each one described by
- Separation of available data into a
training
and atest
set (typically 66/33, or 80/20). - Estimation of the classification performance according to the desired metric on the training set of all
$S(S−1)/2$ feature pairs
through a computationally fast and reproducible cross-validation procedure (leave-one-out cross validation was chosen). In the applications provided in this work we used the Matthew Coefficient as a metric for performance estimation of a discriminant analysis. The results are mapped into a fully connected symmetric weighted network, with nodes corresponding to features and link weights corresponding to performance of the node couples. - Selection of top-performing pairs through a hard-thresholding procedure, that removes links (and nodes) from the initial fully connected network: every connected component obtained is considered as a putative classification signature. The threshold value can be tuned according to a desired minimum-performance value or considering a minimum number of nodes/features in the signature. The threshold value can be determined also by testing each of the obtained performances as a possible cut-off via a cross validation of the entire signature extraction procedure.
- [Optional] In the hypothesis that node degree is associated to the global feature performance in combination with the other features, to reduce the size of an identified signature, the pendant nodes of the signature network, i.e., nodes with degree equal to one, can be removed. This procedure can be applied once, or recursively until the core network, i.e., a network with all nodes with at least two links, is reached. We have tested the efficacy of this empirical approach in some real cases [10.3233/JAD-190480, 10.1007/BF02951333], obtaining a smaller-dimensional signature with comparable performance, even if there is not a solid theoretical basis supporting this procedure.
- [Optional] The classifier used in the feature selection and the final classification does not need to be a Discriminant Analysis classifier, but can in principle be any classifier. Moreover, the classifier used in the feature selection does not need to be the same one used for the final evaluation of the obtained signature.
- (a) All signatures are applied onto the test set to estimate their performance, producing more than one final signature.
OR
- (b) To identify a unique best performing signature, a further cross validation step can be applied, with a further
dataset
splitting into training (to identify the multiple signatures), test (to identify the best signature) and validation set (to evaluate the best signature performance).
We would stress that this method is completely independent to the choose of the classification algorithm, but, from a biological point-of-view, a simple one is preferable to keep an easy interpretability of the results. The geometrical simplicity of the resulting class-separation surfaces, in fact, allows an easier interpretation of the results, as compared to very powerful but black-box methods like nonlinear-kernel SVM or Neural Networks. These are the reasons which lead us to use very simple classifier methods in our biological applications as diag-quadratic Discriminant Analysis or Quadratic Discriminant Analysis. Both these methods allow fast computation and an easy interpretation of the results. A linear separation might not be common in some classification problems (e.g. image classification), but it is very likely in biological systems, where many responses to perturbation consist in an increase or decrease of variable values (e.g. expression of genes or proteins, see Fig.1(b)).
A second direct gain by the couples evaluation is related to the network structure: the DNetPRO
network signatures allow a hierarchical ranking of features according to their centrality compared to other methods.
The underlying network structure of the signature could suggests further methods to improve its dimensionality reduction based on network topological properties to fit real application needs, and it could help to evaluate the cooperation of variables for the class identification.
In the end, we remark that our signatures have a purely statistical relevance by being generated with a purpose of maximal classification performance, but sometimes the selected features (e.g. genes, DNA loci, metabolites) can be of clinical and biological interest, helping to improve knowledge on the mechanism associated to the studied phenomenon [10.1101/gr.155192.113, 10.1200/JCO.2008.19.2542, 10.1101/gr.193342.115, 10.18632/oncotarget.5718].
C++ supported compilers:
The DNetPRO
project is written in C++
and it supports also older standard versions (std=c++1+).
The package installation can be performed via CMake
.
The CMake
installer provides also a DNetPRO.pc
, useful if you want link to the DNetPRO
using pkg-config
.
The only dependency of the C++
version of the project is given by the parse_args
library.
A complete list of instruction for its installation is available here.
If you prefer, you can also use the submodule
version of the library and leave to the current CMake
the responsibility of its installation.
In this case you need to provide to the cmake
command line the extra-flag of FORCE_USE_SUBMODULES:BOOL=ON
(see next for further details about it).
You can also use the DNetPRO
package in Python
using the Cython
wrap provided inside this project.
The only requirements are the following:
- numpy >= 1.15
- networkx >= 2.2
- cython >= 0.29
- scikit-learn>=1.3.2
- pandas >= 0.24.2
The Cython
version can be built and installed via CMake
enabling the -DPYWRAP
variable.
You can use also the DNetPRO
package in Python
using the Cython
wrap provided inside this project.
The Python
wrap guarantees also a good integration with the other common Machine Learning tools provided by scikit-learn
Python
package; in this way you can use the DNetPRO
algorithm as an equivalent alternative also in other pipelines.
Like other Machine Learning algorithm also the DNetPRO
one depends on many parameters, i.e its hyper-parameters, which has to be tuned according to the given problem.
The Python
wrap of the library was written according to scikit-optimize
Python
package to allow an easy hyper-parameters optimization using the already implemented classical methods.
-
Follow your system prerequisites (below)
-
Clone the
DNetPRO
package from this repository, or download a stable release
git clone https://github.com/Nico-Curti/DNetPRO.git
cd DNetPRO
git submodule update --init --recursive
DNetPRO
could be built with CMake and Make or with the build scripts in the project. Example:
Linux | MacOS | Windows | |
---|---|---|---|
Script | ./build.sh |
./build.sh |
./build.ps1 |
Ubuntu
- Define a work folder, which we will call WORKSPACE in this tutorial: this could be a "Code" folder in our home, a "c++" folder on our desktop, whatever you want. Create it if you don"t already have, using your favourite method (mkdir in bash, or from the graphical interface of your distribution). We will now define an environment variable to tell the system where our folder is. Please note down the full path of this folder, which will look like
/home/$(whoami)/code/
echo -e "\n export WORKSPACE=/full/path/to/my/folder \n" >> ~/.bashrc
source ~/.bashrc
- Open a Bash terminal and type the following commands to install all the prerequisites.
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-8 g++-8
wget --no-check-certificate https://cmake.org/files/v3.13/cmake-3.13.1-Linux-x86_64.tar.gz
tar -xzf cmake-3.13.1-Linux-x86_64.tar.gz
export PATH=$PWD/cmake-3.13.1-Linux-x86_64/bin:$PATH
sudo apt-get install -y make git dos2unix ninja-build
git config --global core.autocrlf input
git clone https://github.com/physycom/sysconfig
- Build the project with CMake (enable or disable OMP with the define -DOMP:
cd $WORKSPACE
git clone https://github.com/Nico-Curti/DNetPRO
cd DNetPRO
git submodule update --init --recursive
mkdir -p build
cd build
cmake ..
make -j
cmake --build . --target install
cd ..
macOS
- If not already installed, install the XCode Command Line Tools, typing this command in a terminal:
xcode-select --install
-
If not already installed, install Homebrew following the official guide.
-
Open the terminal and type these commands
brew update
brew upgrade
brew install gcc@8
brew install cmake make git ninja
-
Define a work folder, which we will call WORKSPACE in this tutorial: this could be a "Code" folder in our home, a "c++" folder on our desktop, whatever you want. Create it if you don"t already have, using your favourite method (mkdir in bash, or from the graphical interface in Finder). We will now define an environment variable to tell the system where our folder is. Please note down the full path of this folder, which will look like /home/$(whoami)/code/
-
Open a Terminal and type the following command (replace /full/path/to/my/folder with the previous path noted down)
echo -e "\n export WORKSPACE=/full/path/to/my/folder \n" >> ~/.bash_profile
source ~/.bash_profile
- Build the project with CMake (enable or disable OMP with the define -DOMP:
cd $WORKSPACE
git clone https://github.com/Nico-Curti/DNetPRO
cd DNetPRO
git submodule update --init --recursive
mkdir -p build
cd build
cmake ..
make -j
cmake --build . --target install
cd ..
Windows (7+)
-
Install Visual Studio 2017 from the official website
-
Open your Powershell with Administrator privileges, type the following command and confirm it:
PS \> Set-ExecutionPolicy unrestricted
-
If not already installed, please install chocolatey using the official guide
-
If you are not sure about having them updated, or even installed, please install
git
,cmake
and an updatedPowershell
. To do so, open your Powershell with Administrator privileges and type
PS \> cinst -y git cmake powershell
-
Restart the PC if required by chocolatey after the latest step
-
Install PGI 18.10 from the official website (the community edition is enough and is free; NOTE: install included MS-MPI, but avoid JRE and Cygwin)
-
Activate license for PGI 18.10 Community Edition (rename the file
%PROGRAMFILES%\PGI\license.dat-COMMUNITY-18.10
to%PROGRAMFILES%\PGI\license.dat
) if necessary, otherwise enable a Professional License if available -
Define a work folder, which we will call
WORKSPACE
in this tutorial: this could be a "Code" folder in our home, a "cpp" folder on our desktop, whatever you want. Create it if you don"t already have, using your favourite method (mkdir in Powershell, or from the graphical interface in explorer). We will now define an environment variable to tell the system where our folder is. Please note down its full path. Open a Powershell (as a standard user) and type
PS \> rundll32 sysdm.cpl,EditEnvironmentVariables
- In the upper part of the window that pops-up, we have to create a new environment variable, with name
WORKSPACE
and value the full path noted down before. If it not already in thePATH
(this is possible only if you did it before), we also need to modify the "Path" variable adding the following string (on Windows 10 you need to add a new line to insert it, on Windows 7/8 it is necessary to append it using a;
as a separator between other records):
%PROGRAMFILES%\CMake\bin
- If
vcpkg
is not installed, please follow the next procedure, otherwise please jump to #12
PS \> cd $env:WORKSPACE
PS Code> git clone https://github.com/Microsoft/vcpkg.git
PS Code> cd vcpkg
PS Code\vcpkg> .\bootstrap-vcpkg.bat
- Open a Powershell with Administrator privileges and type
PS \> cd $env:WORKSPACE
PS Code> cd vcpkg
PS Code\vcpkg> .\vcpkg integrate install
- Open a Powershell and build
DNetPRO
using thebuild.ps1
script
PS \> cd $env:WORKSPACE
PS Code> git clone https://github.com/Nico-Curti/DNetPRO
PS Code> cd DNetPRO
PS Code\DNetPRO> .\build.ps1
We recommend the use of CMake
for the installation since it is the most automated way to reach your needs.
First of all make sure you have a sufficient version of CMake
installed (3.9 minimum version required).
If you are working on a machine without root privileges and you need to upgrade your CMake
version a valid solution to overcome your problems is provided here.
With a valid CMake
version installed first of all clone the project as:
git clone https://github.com/Nico-Curti/DNetPRO
cd DNetPRO
git submodule update --init --recursive
The you can build the DNetPRO
package with
mkdir -p build
cd build && cmake .. && cmake --build . --target install
or more easily
./build.sh
if you are working on a Windows machine the right script to call is the build.ps1
.
The CMake
command line can be customized according to the following parameters:
-DOMP:BOOL
: Enable/Disable the OpenMP support for multi-threading computation-DBUILD_DOCS:BOOL
: Enable/Disable the build of docs using Doxygen and Sphinx-DPYWRAP:BOOL
: Enable/Disable the build of Python wrap of the library via Cython (see next section for Python requirements)-DFORCE_USE_SUBMODULES:BOOL
: Force the use of submodules or the already installed versions of the required libraries.
🚩 Note |
---|
All the variables above are set to OFF by default! |
🚩 NOTE |
---|
If you are working under Windows OS and you are familiar with vcpkg package, you can find the series of ports script for vcpkg integration of parseargs in the relative directory.Copying those scripts in the vcpkg project folder you can install both the libraries via vcpkg! Then you should use them directly via CMAKE_TOOLCHAIN_FILE |
If the ParseArgs library is not found during the building process, the submodules of the current project directory are used.You can manually force the use of submodules using the -DFORCE_USE_SUBMODULE:BOOL=ON flag during the configuration.If you have manually installed the libraries you can pass to the CMake configuration the variable -DParseArgs_DIR:PATH=/path/to/parseargs/share/parseargs . |
The Python
installation can be performed with or without the C++
installation.
The Python
installation is always executed using setup.py
script.
If you have already build the DNetPRO
C++
library the installation is performed faster and the Cython
wrap directly links to the last library installed.
Otherwise the full list of dependencies is build.
In both cases the installation steps are:
graph LR;
A(Install<br>Requirements) -->|python -m pip install -r requirements.txt| B(Install<br>DNetPRO)
B -->|python setup.py install| C(Package<br>Install)
B -->|python setup.py develop --user| D(Development<br>Mode)
The installation of the Python modules requires the CMake support and all the listed above libraries.If you are working under Window OS we require the usage of VCPKG for the installation of the libraries and a precise configuration of the environment variables.In particular you need to set the variables VCPKG_ROOT=/path/to/vcpkg/rootdir/ and VCPKG_DEFAULT_TRIPLET=x64-windows .A full working example of OS configuration can be found in the CI actions of the project, available here |
All the CMake flags are set internally in the setup.py script with default values.You can manually turn on/off the multi-threading support passing the flag --omp at the setup command line, i.e. python setup.py develop --user --omp |
As described in the above sections, the DNetPRO
is a combinatorial algorithm and thus it requires a particular accuracy in the code implementation to optimize as much as possible the computational performances.
The theoretical optimization strategies, described up to now, have to be proved by quantitative measures.
The time evaluation was performed using the Cython
(C++
wrap) implementation against the pure Python
(naive) implementation showed in the previous snippet.
The time evaluation was performed using the timing
Python
package in which we can easily simulate multiple runs of a given algorithm.
In our simulations, we monitored the three main parameters related to the algorithm efficiency: the number of samples, the number of variables and (as we provided a parallel multi-threading implementation) the number of threads used.
For each combination of parameters, we performed 30 runs of both algorithms and we extracted the minimum execution time.
The tests were performed on a classical bioinformatics server (128 GB RAM memory and 2 CPU E5-2620, with 8 cores each).
The obtained results are shown in following Figure.
In each plot, we fixed two variables and we evaluated the remaining one.
In all our simulations, the efficiency of the (optimized) Cython
version is easily visible and the gap between the two implementations reached more than 10^4
seconds.
On the other hand, it is important to highlight the scalability of the codes against the various parameters.
While the code performances scale quite well with the number of features (Fig. 1) in both the implementations, we have a different trend varying the number of samples (Fig. 2): the Cython
rend starts to saturate almost immediately, while the computational time of the Python
implementation continues to grow.
This behavior highlights the results of the optimizations performed on the Cython
version which allows the application of the DNetPRO
algorithm also to larger datasets without loosing performances.
An opposite behavior is found monitoring the number of threads (ref Fig. 3): the Python
version scales quite well with the number of threads, while Cython
trend is more unstable.
This behavior is probably due to a non-optimal schedule in the parallel section: the work is not equally distributed along the available threads and it penalizes the code efficiency, creating a bottleneck related to the slowest thread.
The above results are computed considering a number of features equal to 90 and, thus, the parallel section distributes the 8100 (N x N
) iterations along the available threads: when the number of iterations is proportional to the number of threads used (12, 20 and 30 in our case), we have a maximization of the time performances.
Despite of this, the computational efficiency of the Cython
implementation is so much better than the Python
one that its usage is indisputable.
You can use the DNetPRO
algorithm into pure-Python modules or inside your C++ application.
The easiest usage of DNetPRO
algorithm is given by the example provided in the example folder.
This script includes an easy-to-use command line to run the DNetPRO
algorithm on a dataset stored into a file.
./bin/DNetPRO_couples
Usage: ./DNetPRO_couples -f <std :: string> -o <std :: string> [-frac <float> ] [-bin <bool> ] [-verbose <bool> ] [-probeID <bool> ] [-nth <int> ]
DNetPRO couples evaluation 2.0
optional arguments:
-f, --input Input filename
-o, --output Output filename
-s, --frac Fraction of results to save
-b, --bin Enable Binary output
-q, --verbose Enable stream output
-p, --probeID ProbeID name to skip
-n, --nth Number of threads to use
If you are interested in using DNetPRO
algorithm inside your code you can simply import the dnetpro_couples.h
and call the dnetpro_couples
function.
Then all the results will be stored into a easy-to-manage score
object.
The DNetPRO
object is totally equivalent to a scikit-learn
feature-selection method and thus it provides the member functions fit
(to train your model) and predict
(to test a trained model on new samples).
First of all you need to import the DNetPRO
modules and then simply call the training/testing functions.
import pandas as pd
from DNetPRO import DNetPRO
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X = pd.read_csv("./example/data.txt", sep="\t", index_col=0, header=0)
y = np.asarray(X.columns.astype(float).astype(int))
X = X.transpose()
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.33, random_state=42)
dnet = DNetPRO(estimator=GaussianNB(), n_jobs=4, verbose=True)
Xnew = dnet.fit_transform(X_train, y_train)
print("Best Signature: {}".format(dnet.get_signature()[0]))
print("Score: {:.3f}".format(dnet.score(X_test, y_test)))
The Python version of the package is tested using pytest
.
To install the package in development mode you need to add also this requirement:
- pytest == 3.0.7
The full list of python test scripts can be found here.
Description of the folders related to the C++
version.
Directory | Description |
---|---|
example | Example script for the C++ version of the code. Use the command line helper for a full description of the required parameters. |
hpp | Implementation of the C++ template functions and objects used in the DNetPRO algorithm. |
include | Definition of the C++ function and objects used in the DNetPRO algorithm. |
src | Implementation of the C++ functions and objects used in the DNetPRO algorithm. |
Description of the folders related to the Python
version (base directory DNetPRO
).
Directory | Description |
---|---|
example | Python version of the C++ example. |
lib | List of Cython definition files. |
source | List of Cython implementation objects. |
test | List of test scripts for the Python wraps. |
Description of the folders containing the scripts used for the paper simulations.
Directory | Description |
---|---|
toy | Implementation of the Python scripts used for the simulations on synthetic datasets. |
pipeline/TCGA | Implementation of the Python scripts used for the simulations on the TCGA datasets. |
timing | Implementation of the Python scripts for the performances evaluation. |
Any contribution is more than welcome ❤️. Just fill an issue or a pull request and we will check ASAP!
See here for further informations about how to contribute with this project.
Boccardi, Virginia et al. Cognitive Decline and Alzheimer"s Disease in Old Age: A Sex-Specific Cytokinome Signature. 1 Jan. 2019 : 911 – 918. https://doi.org/10.3233/JAD-190480
Chen, B., Hong, J. & Wang, Y. The minimum feature subset selection problem. J. of Comput. Sci. & Technol. 12, 145–153 (1997). https://doi.org/10.1007/BF02951333
Mizzi, C., Fabbri, A., Rambaldi, S. et al. Unraveling pedestrian mobility on a road network using ICTs data during great tourist events. EPJ Data Sci. 7, 44 (2018). https://doi.org/10.1140/epjds/s13688-018-0168-2
Belkin, M., Niyogi, P. Semi-Supervised Learning on Riemannian Manifolds. Machine Learning 56, 209–239 (2004). https://doi.org/10.1023/B:MACH.0000033120.25363.1e
Miao, Z., Balzer, M.S., Ma, Z. et al. Single cell regulatory landscape of the mouse kidney highlights cellular differentiation programs and disease targets. Nat Commun 12, 2277 (2021). https://doi.org/10.1038/s41467-021-22266-1
Levine JH, Simonds EF, Bendall SC, Davis KL, Amir el-AD, Tadmor MD, Litvin O, Fienberg HG, Jager A, Zunder ER, Finck R, Gedman AL, Radtke I, Downing JR, Pe"er D, Nolan GP. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell. 2015 Jul 2;162(1):184-97. doi: 10.1016/j.cell.2015.05.047. Epub 2015 Jun 18. PMID: 26095251; PMCID: PMC4508757.
Katia Scotlandi, Daniel Remondini, Gastone Castellani, Maria Cristina Manara, Filippo Nardi, Lara Cantiani, Mirko Francesconi, Mario Mercuri, Anna Maria Caccuri, Massimo Serra, Sakari Knuutila, and Piero Picci, Journal of Clinical Oncology 2009 27:13, 2209-2216
Cenik C, Cenik ES, Byeon GW, Grubert F, Candille SI, Spacek D, Alsallakh B, Tilgner H, Araya CL, Tang H, Ricci E, Snyder MP. Integrative analysis of RNA, translation, and protein levels reveals distinct regulatory variation across humans. Genome Res. 2015 Nov;25(11):1610-21. doi: 10.1101/gr.193342.115. Epub 2015 Aug 21. PMID: 26297486; PMCID: PMC4617958.
Terragna C., Remondini D., Martello M., Zamagni E., Pantani L., Patriarca F., Pezzi A., Levi G., Offidani M., Proserpio I., Sabbata G., Tacchetti P., Cangialosi C., et al The genetic and genomic background of multiple myeloma patients achieving complete response after induction therapy with bortezomib, thalidomide and dexamethasone (VTD). Oncotarget. 2016; 7: 9666-9679. Retrieved from https://www.oncotarget.com/article/5718/text/
- How can I properly set the C++ compiler for the Python installation?
If you are working on a Ubuntu machine pay attention to properly set the environment variables related to the C++
compiler.
First of all take care to put the compiler executable into your environmental path:
ls -ltA /usr/bin | grep g++
Then you can simply use the command to properly set the right aliases/variables
export CXX=/usr/bin/g++
export CC=/usr/bin/gcc
but I suggest you to put those lines into your .bashrc
file (one for all):
echo "export CC=/usr/bin/gcc" >> ~/.bashrc
echo "export CXX=/usr/bin/g++" >> ~/.bashrc
I suggest you to not use the default Python
compiler (aka x86_64-linux-gnu-g++
) since it can suffer of many issues during the compilation if it is not manually customized.
🚩 Note |
---|
If you are working under Windows OS a complete guide on how to properly configure your MSVC compiler can be found here |
- I installed the
DNetPRO
Python package following the instructions but I have anImportError
when I try to import the package as in the examples
This error is due a missing environment variable (which is not automatically set by the installation script).
All the C++
libraries are searched into the OS directory tree starting from the information/paths hinted by the LD_LIBRARY_PATH
environment variable.
When you install the DNetPRO
library the produced .so
, .dll
, .dylib
files are saved into the lib
directory created into the project root directory.
After the installation you must add this directory into the searching path.
You can add this information editing the configuration file of your Unix
-like system, i.e
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/DNetPRO/project/directory/lib/" >> ~/.bashrc
echo "export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/path/to/DNetPRO/project/directory/lib/" >> ~/.bashrc
or adding the LD_LIBRARY_PATH
to your set of environment variables (especially for Windows
users).
- I installed the
DNetPRO
C++ library with theFORCE_USE_SUBMODULES:BOOL=ON
flag and now I don"t find the installed libraries and header files
The DNetPRO
was developed as standalone library, also if it depends by other external libraries.
All the dependencies are stored as submodule packages, and they could be installed before the DNetPRO
installation (according to the documentation of the libraries) or alongside it, forcing the usage of submodules.
If the FORCE_USE_SUBMODULES
is enabled, the installation uses the local versions of submodules libraries and, therefore, the outputs are sent to the submodule share
and lib
folders.
This behavior is not a real issue of the package, since following a "classical" installation, i.e. with all the requirements pre-installed, the outputs are correctly set.
A correct output destination is achieved also installing the library via vcpkg
(see next FAQ for more details about it).
- How can I install the library via
VCPKG
dependency manager?
The DNetPRO
library is not yet supported via vcpkg
(I have not submitted any PR yet).
However, in the cmake
folder you can find a complete directory-tree named vcpkg
.
You can simply copy&paste the entire vcpkg
folder over the original (cloned here) project to manage the entire installation of the library also via vcpkg.
🚩 Note |
---|
Since no releases have been published yet, the portfile is not complete and you need to manually set the REF and SHA512 variables! |
All the submodule dependencies provide the same vcpkg "support" via copy&paste. For submodules not released yet, the editing of the related variables is mandatory! |
- Nico Curti git, unibo
- Enrico Giampieri git, unibo
- Gastone Castellani unibo
- Daniel Remondini git, unibo
See also the list of contributors who participated in this project.
The DNetPRO
package is licensed under the MIT "Expat" License.
Thanks goes to all contributors of this project.
If you have found DNetPRO
helpful in your research, please consider citing the original paper
@article{Curti2022,
author={Curti, Nico and Levi, Giuseppe and Giampieri, Enrico and Castellani, Gastone and Remondini, Daniel},
title={A network approach for low dimensional signatures from high throughput data},
journal={Scientific Reports},
year={2022},
month={Dec},
day={23},
volume={12},
number={1},
pages={22253},
abstract={One of the main objectives of high-throughput genomics studies is to obtain a low-dimensional set of observables---a signature---for sample classification purposes (diagnosis, prognosis, stratification). Biological data, such as gene or protein expression, are commonly characterized by an up/down regulation behavior, for which discriminant-based methods could perform with high accuracy and easy interpretability. To obtain the most out of these methods features selection is even more critical, but it is known to be a NP-hard problem, and thus most feature selection approaches focuses on one feature at the time (k-best, Sequential Feature Selection, recursive feature elimination). We propose DNetPRO, Discriminant Analysis with Network PROcessing, a supervised network-based signature identification method. This method implements a network-based heuristic to generate one or more signatures out of the best performing feature pairs. The algorithm is easily scalable, allowing efficient computing for high number of observables ({\$}{\$}10^3{\$}{\$}--{\$}{\$}10^5{\$}{\$}). We show applications on real high-throughput genomic datasets in which our method outperforms existing results, or is compatible with them but with a smaller number of selected features. Moreover, the geometrical simplicity of the resulting class-separation surfaces allows a clearer interpretation of the obtained signatures in comparison to nonlinear classification models.},
issn={2045-2322},
doi={10.1038/s41598-022-25549-9},
url={https://doi.org/10.1038/s41598-022-25549-9}
}
or just this repository
@misc{DNetPRO,
author = {Curti, Nico},
title = {{DNetPRO pipeline}: Implementation of the DNetPRO pipeline for TCGA datasets},
year = {2019},
url = {https://github.com/Nico-Curti/DNetPRO},
publisher = {GitHub},
howpublished = {\url{https://github.com/Nico-Curti/DNetPRO}}
}