Overview
SocialGene is a complex ETL workflow for analyzing protein and genomic-context similarity. It was built for natural product drug discovery but has broader applications.
The general flow starts with creating a SocialGene Neo4j database. This is done with the Nextflow workflow, which handles:
- Downloading proteins and/or genomes from NCBI, MIBiG, etc., or using local genomes (GenBank format)
- Creating a set of non-redundant proteins from the input proteins/genomes
- Downloading HMM models from multiple sources (or using local HMMs), creating a non-redundant set of models, and upconverting to the latest HMMER format (optional)
- HMM-annotating the non-redundant proteins (optional)
- Comparing the non-redundant proteins via all-vs-all DIAMOND BLASTp (optional)
- Clustering the non-redundant proteins with MMseqs2 cascaded clustering (optional)
- Annotating input genomes with antiSMASH (optional)
- Downloading and linking all of NCBI Taxonomy (optional)
- Creating/Cleaning/Transforming all input and produced data for import into a Neo4j graph database
- Importing/Creating the Neo4j graph database (optional)
- And more...
Most steps are optional, which allows you to customize the output database to your needs.
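To make the "non-redundant proteins" step above more concrete, here is a minimal Python sketch of collapsing identical sequences under a content-derived identifier. It is an illustration only: the use of SHA-256, the case normalization, and the names `sequence_id` and `deduplicate` are assumptions made for this example and are not taken from the SocialGene code.

```python
import hashlib


def sequence_id(sequence: str) -> str:
    """Return a stable identifier for a protein sequence.

    SHA-256 over the upper-cased sequence is an assumption for this sketch.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def deduplicate(proteins):
    """Collapse identical sequences.

    proteins: iterable of (accession, sequence) pairs.
    Returns (unique, sources): unique maps id -> sequence, and
    sources maps id -> list of accessions that share that sequence.
    """
    unique = {}
    sources = {}
    for accession, sequence in proteins:
        sid = sequence_id(sequence)
        unique.setdefault(sid, sequence.strip().upper())
        sources.setdefault(sid, []).append(accession)
    return unique, sources


# Hypothetical input: two genomes that share one identical protein.
proteins = [
    ("genome_A|prot_1", "MKTAYIAKQR"),
    ("genome_B|prot_7", "mktayiakqr"),
    ("genome_B|prot_8", "MSLNEQWERT"),
]
unique, sources = deduplicate(proteins)
print(len(unique), "unique sequences from", len(proteins), "input proteins")
```

Keeping the `sources` mapping alongside the unique set means each expensive downstream step (annotation, comparison, clustering) only needs to run once per distinct sequence, while the link back to every originating genome is preserved.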
What is the output?
- Output files for the processing steps. e.g. BLASTp database, MMseqs2 index, non-redundant HMM model file, etc.
- All output files are tab-separated, gzipped files, which are then imported into Neo4j
- A Neo4j database containing all the data and relationships
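If you want to inspect the tab-separated, gzipped output files before they are imported, they can be read with Python's standard library alone. The sketch below writes and then reads back a small file; the file name and the column contents are hypothetical and do not reflect the actual SocialGene output schema.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, rows):
    """Write rows (iterables of strings) as a gzipped, tab-separated file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzipped, tab-separated file into a list of rows."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle, delimiter="\t")]


# Hypothetical rows: (protein id, model id, start, end).
rows = [
    ("protein_1", "model_A", "12", "98"),
    ("protein_2", "model_B", "5", "140"),
]
path = os.path.join(tempfile.mkdtemp(), "example_annotations.tsv.gz")
write_tsv_gz(path, rows)
loaded = read_tsv_gz(path)
print(loaded)
```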
The Neo4j database can be interacted with in multiple ways including:
- Through the SocialGene Python package (or Neo4j's Python package or other drivers)
- Through SocialGene's Django application (web-based GUI; not yet released)
- Using "Neo4j Browser" or directly from a number of other tools that use or have Neo4j plugins (Cytoscape, Gephi, yFiles/yWorks, etc.)
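As a rough illustration of the driver route, the sketch below counts nodes per label, which is a quick way to see what a freshly built database contains. The Cypher query is generic and makes no assumption about the SocialGene schema. The connection URI and credentials are placeholders, and the function names are invented for this example; `connect_and_count` assumes the official `neo4j` Python package is installed and a database is running.

```python
# Generic Cypher: group all nodes by their label set and count them.
LABEL_COUNT_QUERY = "MATCH (n) RETURN labels(n) AS labels, count(*) AS total"


def count_nodes_by_label(session):
    """Return {label: node count} using any object that exposes .run(query)."""
    counts = {}
    for record in session.run(LABEL_COUNT_QUERY):
        # Nodes can carry several labels; join them into one readable key.
        key = ":".join(sorted(record["labels"])) or "(unlabeled)"
        counts[key] = counts.get(key, 0) + record["total"]
    return counts


def connect_and_count(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a session with the official Neo4j driver and run the count.

    The URI and credentials are placeholders; adjust them to your database.
    """
    # Imported here so count_nodes_by_label can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return count_nodes_by_label(session)
```

Calling `connect_and_count()` against a running database returns a dictionary of counts keyed by label. The same query can be pasted directly into Neo4j Browser.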
Components
- A Nextflow/nf-core pipeline (github.com/socialgene/sgnf)
  - If you are trying to create your own SocialGene database, this is where you want to start
  - It coordinates data download, manipulation, and the creation of a SocialGene database
- A Python library (github.com/socialgene/sgpy)
  - Contains the code for most of the data transformations used to build the database
  - Contains functions for manipulating sequence, domain, and genomic context data
  - Contains functions for manipulating SocialGene/Neo4j databases
    - e.g. adding antiSMASH results and MIBiG metadata, which require modifying the graph and so can't be imported directly within the Nextflow pipeline
Citations
Citing software is important because it helps developers show granting bodies (or bosses) that their software is useful and should be funded.
SocialGene should be cited as:
TODO: