Update index.rst
vlilanl authored Nov 15, 2024
1 parent 4676bf3 commit b2cf317
Showing 1 changed file with 56 additions and 81 deletions.
137 changes: 56 additions & 81 deletions docs/index.rst
@@ -8,14 +8,30 @@ Metagenome Assembly Workflow (v1.0.7)
Workflow Overview
-----------------

This workflow takes in paired-end Illumina short reads or paired-end PacBio long reads in interleaved format and performs error correction, then reformats the interleaved file into two FASTQ files for downstream tasks using bbcms (BBTools). The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to contigs by bbmap (BBTools) for coverage information. The `.wdl` (Workflow Description Language) file includes five tasks: *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.
This workflow takes in paired-end Illumina short reads or PacBio long reads.

**Short Reads**:

For short reads, the workflow uses bbcms (BBTools) to perform error correction and to reformat the interleaved file into two FASTQ files for downstream tasks. The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to the contigs by bbmap (BBTools) for coverage information. The ``.wdl`` (Workflow Description Language) file includes five tasks: *bbcms*, *assy*, *create_agp*, *read_mapping_pairs*, and *make_output*.

1. The *bbcms* task takes in interleaved FASTQ inputs, performs error correction, and reformats the interleaved FASTQ into two output FASTQ files for paired-end reads for the next tasks.
2. The *assy* task performs metaSPAdes assembly.
3. Contigs and scaffolds (output of metaSPAdes) are processed by the *create_agp* task to rename the FASTA headers and generate an `AGP format <https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/>`_ file, which describes the assembly.
4. The *read_mapping_pairs* task maps reads back to the final assembly to generate coverage information.
5. The final *make_output* task collects all output files into the specified directory.
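
The reformatting part of step 1 can be pictured with a minimal Python sketch. It is not how bbcms works internally and it performs no error correction; the function name ``split_interleaved`` and the in-memory handling are assumptions made for illustration only.

```python
def split_interleaved(lines):
    """Split interleaved FASTQ lines into forward and reverse records.

    Assumes 4-line FASTQ records that strictly alternate
    forward, reverse, forward, reverse (hypothetical helper).
    """
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 8 != 0:
        raise ValueError("expected complete read pairs (8 lines per pair)")
    forward, reverse = [], []
    for start in range(0, len(lines), 8):
        forward.append(lines[start:start + 4])      # first mate of the pair
        reverse.append(lines[start + 4:start + 8])  # second mate of the pair
    return forward, reverse
```

Writing each list to its own file would give the two FASTQ files that the later tasks expect.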

**Long Reads**:

For long reads, the workflow uses Flye for assembly, pbmm2 for alignment, Racon for polishing, and minimap2 for read mapping and coverage analysis. The :literal:`.wdl` (Workflow Description Language) file includes six tasks: *combine_fastq*, *assy*, *racon*, *format_assembly*, *map*, and *make_info_file*.

1. The *combine_fastq* task combines the input FASTQ files into a single FASTQ file, which is used as input for polishing and mapping tasks.
2. The *assy* task takes in the input FASTQ files and performs assembly using Flye.
3. The *racon* task cleans up the assembled contigs through two rounds of error correction using :literal:`pbmm2` and :literal:`Racon`.
4. The *format_assembly* task formats the polished assembly using BBTools' :literal:`fungalrelease.sh`, creating release-ready scaffolds and contigs, along with an `AGP format <https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/>`_ file and a legend file that describes the assembly.
5. The *map* task maps the input reads back to the final assembly using minimap2 to generate coverage data.
6. The final *make_info_file* task produces a summary file documenting tool versions, parameters, memory usage, and Docker containers used throughout the workflow.
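
As a rough illustration of what step 1 accomplishes, the sketch below concatenates several FASTQ files into one gzip-compressed file. The workflow's actual *combine_fastq* implementation is not shown in this document; the function signature and the rule of detecting compression from a ``.gz`` suffix are assumptions.

```python
import gzip
import shutil


def combine_fastq(paths, out_path):
    """Concatenate FASTQ files (plain or .gz) into one gzip file.

    Hypothetical helper: assumes every input ends with a newline so
    that records from consecutive files do not run together.
    """
    with gzip.open(out_path, "wb") as out:
        for path in paths:
            # Pick a reader based on the file suffix (an assumption).
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path
```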


Workflow Availability
---------------------

@@ -31,7 +47,7 @@ The corresponding Docker images are available on DockerHub:
Requirements for Execution
--------------------------

(recommendations are in **bold**)
(Recommendations are in **bold**)

- WDL-capable Workflow Execution Tool (**Cromwell**)
- Container Runtime that can load Docker images (**Docker v2.1.0.3 or higher**)
@@ -67,14 +83,16 @@ Third-party software: (This is included in the Docker image.)
Sample dataset(s)
-----------------

**Short Reads:**

- Small dataset: `Ecoli 10x (287M) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_small_test_data.tgz>`_ (Input/output included in tar.gz file)
- Large dataset: `Zymobiomics mock-community DNA control (22G) <https://portal.nersc.gov/cfs/m3408/test_data/metaAssembly_large_test_data.tgz>`_ (Input/output included in tar.gz file)
- Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_). The original dataset is ~4 GB. For testing, a 10% subsample of the dataset is used: (`SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_). This dataset is already interleaved.

Long reads dataset: `PacBio <https://portal.nersc.gov/project/m3408//test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz>`_
**Long Reads:**

Zymobiomics mock-community DNA control (`SRR7877884 <https://www.ebi.ac.uk/ena/browser/view/SRR7877884>`_). The original dataset is ~4 GB.
Zymobiomics synthetic metagenome (`SRR13128014 <https://portal.nersc.gov/project/m3408//test_data/SRR13128014.pacbio.subsample.ccs.fastq.gz>`_). For testing, the dataset has been subsampled; the original dataset is ~18 GB.

For testing, a 10% subsample of the dataset is used: (`SRR7877884-int-0.1.fastq.gz <https://portal.nersc.gov/cfs/m3408/test_data/SRR7877884-int-0.1.fastq.gz>`_). This dataset is already interleaved.
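
How the 10% test subsample was produced is not stated here. The sketch below shows one way to subsample an interleaved FASTQ while keeping both mates of each pair together; the function name, the fixed seed, and the per-pair random draw are all assumptions.

```python
import random


def subsample_pairs(lines, fraction=0.1, seed=0):
    """Keep roughly `fraction` of the read pairs from interleaved FASTQ lines.

    Each pair occupies 8 consecutive lines (4 per mate), so whole
    8-line blocks are kept or dropped together (hypothetical helper).
    """
    if len(lines) % 8 != 0:
        raise ValueError("expected complete read pairs (8 lines per pair)")
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    kept = []
    for start in range(0, len(lines), 8):
        if rng.random() < fraction:
            kept.extend(lines[start:start + 8])
    return kept
```

Because pairs are selected independently, the kept fraction is approximate rather than exact.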

Input
-----
@@ -115,6 +133,7 @@ The output directory will contain the following files for short reads::
output/
├── nmdc_XXXXXX_metaAsm.info
├── nmdc_XXXXXX_covstats.txt
├── nmdc_XXXXXX_contigs.fna
├── nmdc_XXXXXX_bbcms.fastq.gz
├── nmdc_XXXXXX_scaffolds.fna
├── nmdc_XXXXXX_assembly.agp
@@ -172,86 +191,42 @@ Example output stats JSON file::

The table below lists the output directories, files, and their descriptions.

=================================================== ================================= ===============================================================
Directory File Name Description
=================================================== ================================= ===============================================================
**bbcms** Error correction result directory
bbcms/berkeleylab-jgi-meta-60ade422cd4e directory containing checking resource script
bbcms/ counts.metadata.json bbcms commands and summary statistics in JSON format
bbcms/ input.corr.fastq.gz error corrected reads in interleaved format.
bbcms/ input.corr.left.fastq.gz error corrected forward reads
bbcms/ input.corr.right.fastq.gz error corrected reverse reads
bbcms/ rc cromwell script submit return code
bbcms/ readlen.txt error corrected reads statistics
bbcms/ resources.log resource checking log
bbcms/ script Task run commands
bbcms/ script.background Bash script to run script.submit
bbcms/ script.submit cromwell submit commands
bbcms/ stderr standard error where task writes error message to
bbcms/ stderr.background standard error where bash script writes error message to
bbcms/ stderr.log standard error from bbcms command
bbcms/ stdout standard output where task writes error message to
bbcms/ stdout.background standard output where bash script writes error message(s)
bbcms/ stdout.log standard output from bbcms command
bbcms/ unique31mer.txt the count of unique kmer, K=31
**spades3** metaSPAdes assembly result directory
spades3/K33 directory containing intermediate files from the run with K=33
spades3/K55 directory containing intermediate files from the run with K=55
spades3/K77 directory containing intermediate files from the run with K=77
spades3/K99 directory containing intermediate files from the run with K=99
spades3/K127 directory containing intermediate files from the run with K=127
spades3/misc directory containing miscellaneous files
spades3/tmp directory for temp files
spades3/ assembly_graph.fastg metaSPAdes assembly graph in FASTG format
spades3/ assembly_graph_with_scaffolds.gfa metaSPAdes assembly graph and scaffolds paths in GFA 1.0 format
spades3/ before_rr.fasta contigs before repeat resolution
spades3/ contigs.fasta metaSPAdes resulting contigs
spades3/ contigs.paths paths in the assembly graph corresponding to contigs.fasta
spades3/ dataset.info internal configuration file
spades3/ first_pe_contigs.fasta preliminary contigs of iterative kmers assembly
spades3/ input_dataset.yaml internal YAML data set file
spades3/ params.txt information about SPAdes parameters in this run
spades3/ scaffolds.fasta metaSPAdes resulting scaffolds
spades3/ scaffolds.paths paths in the assembly graph corresponding to scaffolds.fasta
spades3/ spades.log metaSPAdes log
**final_assembly** create_agp task result directory
final_assembly/berkeleylab-jgi-meta-60ade422cd4e directory containing checking resource script
final_assembly/ assembly.agp an AGP format file describes the assembly
final_assembly/ assembly_contigs.fna Final assembly contig fasta
final_assembly/ assembly_scaffolds.fna Final assembly scaffolds fasta
final_assembly/ assembly_scaffolds.legend name mapping file from spades node name to new name
final_assembly/ rc cromwell script submit return code
final_assembly/ resources.log resource checking log
final_assembly/ script Task run commands
final_assembly/ script.background Bash script to run script.submit
final_assembly/ script.submit cromwell submit commands
final_assembly/ stats.json assembly statistics in json format
final_assembly/ stderr standard error where task writes error message to
final_assembly/ stderr.background standard error where bash script writes error message to
final_assembly/ stdout standard output where task writes error message to
final_assembly/ stdout.background standard output where bash script writes error message to
**mapping** maps reads back to the final assembly result directory
mapping/ covstats.txt contigs coverage information
mapping/ mapping_stats.txt contigs coverage information (same as covstats.txt)
mapping/ pairedMapped.bam reads mapping back to the final assembly bam file
mapping/ pairedMapped.sam.gz reads mapping back to the final assembly sam.gz file
mapping/ pairedMapped_sorted.bam reads mapping back to the final assembly sorted bam file
mapping/ pairedMapped_sorted.bam.bai reads mapping back to the final assembly sorted bam index file
mapping/ rc cromwell script submit return code
mapping/ resources.log resource checking log
mapping/ script Task run commands
mapping/ script.background Bash script to run script.submit
mapping/ script.submit cromwell submit commands
mapping/ stderr standard error where task writes error message to
mapping/ stderr.background standard error where bash script writes error message to
mapping/ stdout standard output where task writes error message to
mapping/ stdout.background standard output where bash script writes error message to
=================================================== ================================= ===============================================================

================ ================================================ ================================================================
Directory        File Name                                        Description
================ ================================================ ================================================================
**Short Reads**                                                   Short reads assembly output directory
/make_info_file  nmdc_XXXXXX_metaAsm.info                         Summary information about the short reads assembly process
/finish_asm      nmdc_XXXXXX_covstats.txt                         Coverage statistics for assembled contigs
/finish_asm      nmdc_XXXXXX_contigs.fna                          Final contig sequences in FASTA format
/finish_asm      nmdc_XXXXXX_bbcms.fastq.gz                       Error-corrected FASTQ file from bbcms
/finish_asm      nmdc_XXXXXX_scaffolds.fna                        Final scaffold sequences in FASTA format
/finish_asm      nmdc_XXXXXX_assembly.agp                         Assembly information in AGP format
/finish_asm      stats.json                                       Assembly statistics in JSON format
/finish_asm      nmdc_XXXXXX_pairedMapped.sam.gz                  SAM file with reads mapped back to assembly
/finish_asm      nmdc_XXXXXX_pairedMapped_sorted.bam              Sorted BAM file with reads mapped back to assembly

**Long Reads**                                                    Long reads assembly output directory
/finish_lrasm    nmdc_XXXXXX_assembly.legend                      Mapping file from contig to scaffold names
/finish_lrasm    nmdc_XXXXXX_contigs.fna                          Final contig sequences in FASTA format
/finish_lrasm    nmdc_XXXXXX_pairedMapped_sorted.bam              Sorted BAM file with reads mapped back to assembly
/finish_lrasm    nmdc_XXXXXX_read_count_report.txt                Read count report for validation
/make_info_file  nmdc_XXXXXX_metaAsm.info                         Summary information about the long reads assembly process
/finish_lrasm    nmdc_XXXXXX_summary.stats                        Summary statistics for assembly
/finish_lrasm    nmdc_XXXXXX_scaffolds.fna                        Final scaffold sequences in FASTA format
/finish_lrasm    nmdc_XXXXXX_pairedMapped.sam.gz                  SAM file with reads mapped back to assembly
/finish_lrasm    stats.json                                       Assembly statistics in JSON format
/finish_lrasm    nmdc_XXXXXX_contigs.sam.stats                    SAM file statistics for contigs
/finish_lrasm    nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov    Base coverage information for contigs
/finish_lrasm    nmdc_XXXXXX_assembly.agp                         Assembly information in AGP format
/finish_lrasm    nmdc_XXXXXX_contigs.sorted.bam.pileup.out        BAM file pileup output for contigs
================ ================================================ ================================================================
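
A quick way to sanity-check a finished short-reads run is to compare the output directory against the file names in the table. The sketch below is illustrative only: the helper names are hypothetical, and the prefix argument stands in for the ``nmdc_XXXXXX`` placeholder.

```python
from pathlib import Path

# File name suffixes taken from the short-reads rows of the table.
SHORT_READ_SUFFIXES = [
    "metaAsm.info",
    "covstats.txt",
    "contigs.fna",
    "bbcms.fastq.gz",
    "scaffolds.fna",
    "assembly.agp",
    "pairedMapped.sam.gz",
    "pairedMapped_sorted.bam",
]


def expected_short_read_outputs(prefix):
    """Return the expected short-reads output file names for a prefix."""
    names = [f"{prefix}_{suffix}" for suffix in SHORT_READ_SUFFIXES]
    names.append("stats.json")  # listed without a prefix in the table
    return names


def missing_outputs(out_dir, prefix):
    """List expected files that are absent from out_dir."""
    out_dir = Path(out_dir)
    return [
        name
        for name in expected_short_read_outputs(prefix)
        if not (out_dir / name).exists()
    ]
```

An empty list from ``missing_outputs`` means every file named in the table is present; it says nothing about whether the contents are valid.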


Version History
---------------

- 1.0.7 (release date **11/12/24**; previous versions: 1.0.6)
- 1.0.7 (release date **11/14/24**; previous versions: 1.0.6)

Point of contact
----------------