Home
- 0. overview of major result files
- 1. more info on prepTG
- 2. more info on fai
- 3. more info on zol
- 4. basic usage examples
- 4.1 selecting parameters for fai and zol
- 5. tutorial ‐ a detailed walkthrough
- 5.1 tutorial for using zol with output from fast.genomics and CAGECAT
- 5.2 tips and recs for gene cluster neighborhood visualization
- 5.3 visualization of 1000s of gene clusters using cgc or cgcg
- 5.4 horizontal or lateral transfer assessment of gene clusters using salt
- 6. dependencies
- 7. premade prepTG dbs
- 8. overview of prior updates
- 9. more info on assessing BGC-ome novelty for a strain using abon
- 9.2 more info on assessing phage-ome novelty for a strain using atpoc
- 9.3 more info on assessing plasmid-ome novelty for a strain using apos
prepTG processes an input set of genomes and performs gene calling or gene mapping on them to ease and optimize downstream searches using fai. It accepts either FASTA files or GenBank files with CDS features. If FASTA files are provided, it uses pyrodigal by default to make gene calls. However, if the genomes are eukaryotic and the user provides a reference proteome, miniprot is used to perform gene mapping instead, which allows eukaryotic genomes to be processed when gene calls are not available.
fai is a program to search for additional instances of a gene cluster or genomic locus in a set of target genomes. It is inspired by cblaster, CORASON, ClusterFinder, MultiGeneBlast, and similar tools. Like cblaster, it leverages DIAMOND alignment and runs fairly rapidly, which allows it to scale to thousands of genomes and even to work on metagenomic assemblies. fai features some key differentiating options relative to other software: (i) it can assess the syntenic similarity of candidate homologous gene clusters to the query gene cluster, (ii) it can apply looser thresholds for gene cluster detection in target genomes when multiple neighborhoods are identified as homologous and lie on scaffold edges, which improves identification of gene clusters fragmented by assembly issues (similar to lsaBGC-Expansion), and (iii) it can filter secondary neighborhoods, i.e. gene neighborhoods that are homologous to the query and meet the criteria but are not the best match.
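To make the idea of syntenic similarity in option (i) concrete, here is a minimal sketch that scores how well gene order is preserved between a query gene cluster and a candidate neighborhood. It is illustrative only: the function names, the input layout, and the choice of an absolute Pearson correlation over gene positions are assumptions made for this example, and this section does not state which metric fai actually computes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no positional spread, so order cannot be compared
    return cov / sqrt(var_x * var_y)


def synteny_score(query_order, candidate_hits):
    """Hypothetical helper: score gene-order conservation in a candidate.

    query_order    -- query gene IDs in their order along the query cluster
    candidate_hits -- dict mapping a query gene ID to the position index of
                      its best hit in the candidate neighborhood
    The absolute value is taken so that an inverted (reverse-strand)
    neighborhood still counts as syntenic.
    """
    shared = [gene for gene in query_order if gene in candidate_hits]
    if len(shared) < 2:
        return 0.0  # too few shared genes to judge order
    query_positions = [query_order.index(gene) for gene in shared]
    candidate_positions = [candidate_hits[gene] for gene in shared]
    return abs(pearson(query_positions, candidate_positions))
```

A candidate whose hits appear in the same (or fully reversed) order as the query scores close to 1, while a candidate with scrambled gene order scores close to 0.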
zol is a program to create table reports showing ortholog group conservation, annotation, and evolutionary stats for any gene cluster or locus of interest. At its core, it performs ortholog group inference de novo across gene cluster instances, similar to CORASON, but uses an InParanoid-like algorithm. The tables are similar to lsaBGC-PopGene reports but are currently more in-depth and feature some different statistics. zol produces a basic heatmap; for more publication-ready figures, please see cgc or cgcg, which can produce summarized visualizations of large numbers of homologous gene clusters. If fai is used to find the gene cluster instances that are input into zol, then zol can also filter genes or gene cluster instances that are potentially incomplete because they are located near a scaffold edge. zol also has an option to use skani to select representative gene clusters and ease computational requirements, together with an option for re-inflation to incorporate proteins from non-representative gene cluster instances into ortholog groups using CD-HIT clustering. Skipping InParanoid-type ortholog group determination is also an option for users who are limited in memory and aiming to process thousands of gene clusters, in which case CD-HIT clustering is used to determine protein clusters.
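InParanoid-style ortholog inference starts from reciprocal best hits between two sets of proteins and then expands those seed pairs with in-paralogs. The sketch below shows only the reciprocal-best-hit seed step for two gene cluster instances. It is not zol's code: the function names and the score dictionaries (keyed by hypothetical protein IDs, with alignment bitscores as values) are assumptions made for illustration.

```python
def best_hits(scores):
    """Return each query protein's highest-scoring target.

    scores -- dict mapping (query_protein, target_protein) to a bitscore
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy alignment scores between proteins of two gene cluster instances.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 150, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a2"): 149, ("b2", "a1"): 40}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy example, `a3` hits `b1`, but `b1` prefers `a1`, so `a3` is left out of the seed pairs; a full InParanoid-like procedure would then decide whether such proteins join an existing ortholog group as in-paralogs.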
Important
Critically, thanks to the development of some key options, fai and zol together enable high-throughput detection of orthologs across multi-species datasets comprising thousands of genomes.
cgc and cgcg are command-line tools to generate publication-quality figures from zol results. cgc produces a collapsed gene cluster with bar plots for select quantitative statistics computed in zol, shown atop a consensus gene cluster representation. cgcg produces a network visual of ortholog groups (nodes), where edges represent syntenic ordering information.
abon, atpoc, and apos allow users to check if BGCs, temperate phages, and plasmids from their favorite strain/genome are conserved or novel relative to other genomes available from the isolate's respective genus.
salt is a program to assess support for lateral/horizontal transfer per gene cluster instance detected by fai. It reports stats such as codon usage dissimilarity between the gene cluster's genes and genes in the background genome, whether plasmid- or virus-associated proteins are found on the same scaffold as the gene cluster instance, and how distant the gene cluster instance is from transposons.
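As a rough illustration of what a codon usage dissimilarity statistic can look like, the sketch below compares codon frequencies of a gene cluster's coding sequences against those of the background genome. The function names are hypothetical, and the use of cosine distance is an assumption made for this example; this section does not specify which dissimilarity measure salt reports.

```python
from collections import Counter
from math import sqrt


def codon_frequencies(cds_sequences):
    """Relative codon frequencies pooled over in-frame CDS nucleotide strings."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def cosine_distance(freqs_a, freqs_b):
    """1 - cosine similarity of two codon frequency profiles (assumed metric)."""
    codons = set(freqs_a) | set(freqs_b)
    dot = sum(freqs_a.get(c, 0.0) * freqs_b.get(c, 0.0) for c in codons)
    norm_a = sqrt(sum(v * v for v in freqs_a.values()))
    norm_b = sqrt(sum(v * v for v in freqs_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both profiles need at least one counted codon")
    return 1.0 - dot / (norm_a * norm_b)


# Toy sequences: a value near 0 means similar usage, near 1 very different.
cluster_profile = codon_frequencies(["ATGAAAAAA"])
background_profile = codon_frequencies(["ATGAAAAAG", "ATGAAGAAG"])
dissimilarity = cosine_distance(cluster_profile, background_profile)
```

A gene cluster whose codon usage differs strongly from the rest of the genome is one piece of evidence consistent with recent acquisition, which is why it is useful alongside the other statistics listed above.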
BiG-SCAPE is a great, lightweight program for systematically inferring gene cluster families (GCFs) based on domain sequences. zol-scape is a simple wrapper that runs zol for all GCFs determined by BiG-SCAPE and complements CORASON visual results. We also have zol integrated into lsaBGC-Pan, which, like BiG-SCAPE, determines GCFs but has some differences in functionality and approach.