05. Clustering BGCs into GCFs

lsaBGC-Cluster.py

In recent times (mid 2021), BiG-SCAPE (Munoz, Selem-Mjoica, Mullowney, et al. 2020) has revolutionized clustering of individual biosynthetic gene clusters (BGCs) into gene cluster families (GCFs) for lineage specific exploration and mapping of biosynthetic potential; however, it's design is better suited to cluster extremely diverse BGCs from across the entire domain of bacteria and works at finding homology on the level of protein domains.

If only a single lineage is of interest however, the precision of clustering can be improved by using whole protein sequences with a standard well established homolog identification tool, such as OrthoFinder. In lsaBGC-Cluster.py, and subsequent programs, we utilize homology relationships between proteins from antiSMASH (Blin et al. 2019) identified by OrthoFinder (Emms and Kelly 2019), with the goal to group together orthologous BGCs across isolate genomes into a single GCF. The simplified scope of the clustering (within a single lineage) in combination with the ability to utilize full protein sequences enables us to avoid under- and over-clustering BGCs to form GCFs, which we aim to ideally represent a set of orthologous BGCs where paralogous instances are split into separate GCFs where appropriate. Importantly, this definition of GCFs differs from how the BiG-SCAPE authors define it as sets of BGCs from "across a range of organisms that are linked to a highly similar natural product chemotype".

BiG-SCAPE introduced three distinct metrics to assess grouping BGCs into the same GCF with all three metrics based on protein domains. Thus, sequences between homologous BGCs can be left un-utilized to help guide clustering. Working in the context of across the domain of bacteria, using protein domains makes a lot of sense and can help avoid incorporating noise from the sequences in between the domains; however, if working in the confines of a single lineage (e.g. species or genus), these stretches of sequence between domains can be extremely informative. By identifying homology across full protein sequences, we are further able to employ a very simple clustering approach whereby we only assess how many homolog groups BGCs share in common to determine orthologous GCFs. The clustering is primarily controlled by two relatively simple to understand parameters, the Jaccard similarity index (how many homolog groups are shared by a pair of BGCs normalized by the number of homolog groups present in either BGC; default value = 50; range of accepted values = 0 to 100) and an inflation parameter (is a parameter for the Markov graph-based clustering algorithm - basically the larger this parameter the more granular the ; default value = 1.4; range of accepted values = 0.8 to 5). Our parameter criteria (using protein overlap, protein sequence similarity, & protein order similarity) were greatly influenced by BiG-SCAPE!!!

Selection of Parameter Values for Jaccard Similarity Cutoff and MCL Inflation

Are you someone who likes to use a single cutoff for different taxa in an effort to standardize comparing results across them? Or are you someone who likes to tailor their parameters to each specific taxonomy of interest? Whichever one you are, we have you covered with sensible default parameters that should generally work well as well as an option to automatically test different combinations of parameters. We strongly advocate for this latter option however, as the scope of the lineage might influence which clustering parameters are best suited for the analyses of interest for you (e.g. if working in the context of just a single species or sub-species, perhaps a more course clustering configuration could lead to reduced singleton BGCs left unassigend to a GCF while still preserving the assumption of orthology between BGC instances within such clusters).

Users can simply run lsaBGC-Cluster.py with the --run_parameter_tests flag the first time around to generate a PDF report (8x11 for easy printing, for those in the US at least) to examine which combination of the Jaccard Similarity index and the MCL Inflation parameter best control the various tradeoffs between performing granular vs. coarse clustering.

Here is an example report:

Example lsaBGC-Cluster Parameters Report.pdf

Here is a description of the figures the report will feature:

Title Page:

The plot on the title page shows the number of GCFs which resulted (generally want to maximize as this will lead to more clean delineation of orthologous BGCs from paralogous BGCs) vs. the number of singleton BGCs which remain un-clustered(want to minimize to be inclusive of orthologous variants of BGCs which have significantly diverged/altered) when using different values of the Jaccard Index and MCL Inflation parameters. Each point corresponds to results from clustering the BGCs into GCFs at a specific Jaccard Index and MCL Inflation combination. A jitter is added to prevent overlapping points, so interpret the plot carefully.

2nd Page:

The following 2nd page in the report contains four heatmaps, colored in different shades of the primary colors. Each cell in the heatmap corresponds to statistic assessing the quality of the GCF clustering performed using some specific MCL Inflation value (x-axis) and a specific Jaccard Similarity cutoff (y-axis).

Blue Heatmap (upper left) - The shading represents the ratio of the number of BGCs which are singletons relative to the number of BGCs which were grouped into GCFs. The larger this ratio, the darker the shading.
Green Heatmap (upper right) - The shading represents the number of GCFs formed from clustering with specific parameters. The larger this count, the darker the shading.
Yellow Heatmap (lower left) - The shading represents the proportion of GCFs formed from clustering with specific parameters which contain BGCs with distinct annotations from AntiSMASH (e.g. an ectoine BGC and a terpene BGC being grouped together). BGCs which are not annotated are accounted for and considered as belonging to a distinct annotation category "unknown". The larger this proportion, the darker the shading.
Red Heatmap (lower right) - The shading represents the proportion of GCFs formed from clustering with specific parameters which contain BGCs with distinct annotations from AntiSMASH (e.g. an ectoine BGC and a terpene BGC being grouped together). BGCs which are not annotated are ignored and not treated as conflicting evidence for multiple annotation categories being present in a single GCF. The larger this proportion, the darker the shading.

3rd Page:

The 3rd page of the report features another set of overview plots, this time in the form of adjacent grids of boxplots. Each grid corresponds to a specific Jaccard Similarity cutoff values, whereas the individual boxplots correspond to different MCL Inflation parameter values. The boxplots depict a range of statistics across each GCF resulting from the respective clustering.

First Row - Number of core homolog groups which are found in all samples with the GCF.
Second Row - Number of samples which have each GCF.
Third Row - Average number of genes observed in each GCF.
Fourth Row - Standard deviation for number of genes observed in each GCF.

As will often be apparent from these plots, the Jaccard Similarity cutoff parameter is by far more critical in impacting GCF clustering results than the MCL Inflation parameter which can be used more for fine tuning clustering results.

4th Page:

The 4th page of the report features yet another set of overview plots, this time they are alluvian plots showcasing how GCFs get collapsed or split up between two "parameter-adjacent" clustering attempts. By "parameter-adjacent", we mean that as one changes the MCL Inflation parameter you can see the influence this will have on the individual GCFs. GCFs are filled in as black and singleton BGCs are filled in as red. You will see that as the clustering becomes more strict, the number of singletons which will result increases. For lineages with tons of GCFs, this visualization will not be very informative and I might incorporate a flag to filter its production if the case later on.

Clustering Parameter Combination Specific Views 5th Page

The 5th page and onwards, the bulk of the report, shows the individual GCF level details for each clustering attempt with a parameter combination. These pages comprise of 4 subplots.

The top row features two distinct plots. The left plot is a histogram of sample counts for each GCF (how many samples each GCF has), colored according to whether the GCF has a core gene set. BGCs which are singletons are colored in grey as the notion of possessing a core gene set doesn't apply to them. The right plot is a simple histogram of the standard deviation of gene counts per GCF. If GCFs have many small and large BGCs, this might imply the clustering had some issues and might be under-clustering (clustering too coarsely).

The middle row plot is a scatter plot of pie charts. Each pie chart represents a GCF and the coloring of the pie chart represents the annotation class assigned to members of that GCF, BGCs which are unannotated are colored in grey, the rest of the annotation categories are arbitrarily assigned a color. The point of this figure is not to draw biological insight, but simply to assess whether the GCFs are comprising of single annotation categories as would be expected.

The bottom row plot is a bar chart which shows a subset of the largest GCFs, ranked in order. Sometimes, for lineages with lots of BGCs and GCFs, the pie charts overlap in the middle plot and there are many annotations observed, so it becomes difficult to parse the scatter pie plot. This figure aims to alleviate such struggles.

An example report page for the species of M. luteus

Final Remarks for Assessing Parameters

Once again, the purpose of these plots is to help guide which values to select for the two parameters controlling clustering for your specific lineage of interest. However, it might be that during exploration of the individual GCFs, one of them really fascinates you, in such case an iterative process of using lsaBGC-See.py to visualize how individual BGC instances are added to a GCF or removed when using different clustering criterion could be potentially insightful.

Additionally, manual curation of GCF listing files (simple text files listing which BGCs should be included in a GCF) and the tool lsaBGC-Refine.py (which allows for refining BGCs and trimming auxiliary flanking genes which are not of interest) could be applied to improve clustering results for the GCF of interest.

Optional Additional Parameters Controlling Clustering

Annotation Category

Similar to BiG-SCAPE, lsaBGC also supports guiding clustering based on general annotation classes; however, this is implemented at the level of assessing whether pairs of BGCs can be orthologous. Additionally, BGCs which contain multiple classifications, such as those with multiple protoclusters, can still result in GCFs which contain multiple annotation categories.

Global Syntenic Correlation

Additionally, a flag can be specified to assess the global syntenic similarity between BGCs. Unlike BiG-SCAPE, where the adjacency index is based on a measure of pairs of domains which are commonly found adjacent to one another, here we employ a global measure of syntenic similarity of BGCs. This measure is based on ranking shared homolog groups based on their order of occurrence along the BGC and assessing the correlation in ranked order. The reverse order of the secondary BGC is also tested against the order of the primary BGC and the maximum correlation value is retained. Thus, syntenic correlation is the third (optional) parameter which is available if users want to experiment with further clustering BGCs.

Only homolog groups which occur in a single copy in the context of each BGC are used to compute the correlation in ranked order and the correlation is only computed for pairs of BGCs which share at least three such homolog groups.

Usage

usage: lsaBGC-Cluster.py [-h] -b BGC_LISTINGS -m ORTHOFINDER_MATRIX -o OUTPUT_DIRECTORY [-p BGC_PREDICTION_SOFTWARE] [-c CORES] [-i MCL_INFLATION] [-j JACCARD_CUTOFF] [-r SYNTENIC_CORRELATION_CUTOFF]
                         [-s] [-t]

        Program: lsaBGC-Cluster.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        This program will cluster BGC Genbanks using MCL based on similarity exhibited in ortholog group presence/
        absence data. Clustering uses MCL.

optional arguments:
  -h, --help            show this help message and exit
  -b BGC_LISTINGS, --bgc_listings BGC_LISTINGS
                        BGC listing file. Tab delimited 2-column file: (1) sample name (2) path to predicted BGC in Genbank format.
  -m ORTHOFINDER_MATRIX, --orthofinder_matrix ORTHOFINDER_MATRIX
                        OrthoFinder matrix.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Output directory.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -c CORES, --cores CORES
                        Number of cores to use for MCL step.
  -i MCL_INFLATION, --mcl_inflation MCL_INFLATION
                        Inflation parameter to be used for MCL.
  -j JACCARD_CUTOFF, --jaccard_cutoff JACCARD_CUTOFF
                        Cutoff for Jaccard similarity of homolog groups shared between two BGCs.
  -r SYNTENIC_CORRELATION_CUTOFF, --syntenic_correlation_cutoff SYNTENIC_CORRELATION_CUTOFF
                        Minimum absolute correlation coefficient between two BGCs.
  -s, --split_by_annotation
                        Partition BGCs into groups based on annotation first.
  -t, --run_parameter_tests
                        Run tests for selecting the best inflation parameter and jaccard for MCL analysis and exit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly