Skip to content
William Rowell edited this page Nov 21, 2024 · 17 revisions

PacBio WGS Variant Pipeline

PacBio WGS Variant Pipeline

Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL).

Workflow

Starting in v2, this repo contains two related workflows. The singleton workflow is designed to analyze a single sample, while the family workflow is designed to analyze a family of related samples. With the exception of the joint calling tasks in the family workflow, both workflows make use of the same tasks, although the input and output structure differ.

The family workflow will be best for most use cases. The singleton workflow inputs and output structures are relatively flat, which should improve compatibility with platforms like Terra.

Both workflows are designed to analyze human PacBio whole genome sequencing (WGS) data. The workflows are designed to be run on Azure, AWS HealthOmics, GCP, or HPC backends.

Workflow entrypoint:

Setup

This is an actively developed workflow with multiple versioned releases, and we make use of git submodules for common tasks that are shared by multiple workflows. There are two ways to ensure you are using a supported release of the workflow and ensure that the submodules are correctly initialized:

  1. Download the release zips directly from a supported release:
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.7/hifi-human-wgs-singleton.zip
wget https://github.com/PacificBiosciences/HiFi-human-WGS-WDL/releases/download/v2.0.7/hifi-human-wgs-family.zip
  1. Clone the repository and initialize the submodules:
git clone \
  --depth 1 --branch v2.0.7 \
  --recursive \
  https://github.com/PacificBiosciences/HiFi-human-WGS-WDL.git

Resource requirements

The most resource-heavy step in the workflow requires 64 cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.

On some backends, you may be able to make use of a GPU to accelerate the DeepVariant step. The GPU is not required, but it can significantly speed up the workflow. If you have access to a GPU, you can set the gpu parameter to true in the inputs JSON file.

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Setting up and executing the workflow

  1. Select a backend environment
  2. Configure a workflow execution engine in the chosen environment
  3. Fill out the inputs JSON file for your cohort
  4. Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the relevant documentation:

Configuring a workflow engine and container runtime

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.

See the backend-specific documentation for details on setting up an engine.

Engine Azure AWS GCP HPC
miniwdl Unsupported Supported via AWS HealthOmics Unsupported (SLURM only) Supported via the miniwdl-slurm plugin
Cromwell Supported via Cromwell on Azure Unsupported Supported via Google's Pipelines API Supported - Configuration varies depending on HPC infrastructure

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:

Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).

If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC. If using Amazon HealthOmics, you will need to download the reference bundle, upload it to your S3 bucket, and adjust paths accordingly.

Running the workflow

Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).

Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).

Run directly using miniwdl

miniwdl run workflows/singleton.wdl -i <input_file_path.json>

Run directly using Cromwell

java -jar <cromwell_jar_path> run workflows/singleton.wdl -i <input_file_path.json>

If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:

Workflow inputs

This section describes the inputs required for a run of the workflow. Typically, only the sample-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.

Workflow inputs for each entrypoint are described in singleton and family documentation.

At a high level, we have two types of inputs files:

  • maps are TSV files describing inputs that will be used for every execution of the workflow, like reference genome FASTA files and genome interval BED files.
  • inputs.json files are JSON files that describe the samples to be analyzed and the paths to the input files for each sample.

The resource bundle containing the GRCh38 reference and other files used in this workflow can be downloaded from Zenodo:

10.5281/zenodo.14027047

Tool versions and Docker images

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io repo. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:.... For example, the image referred to here is "align_hifiasm":

~{runtime_attributes.container_registry}/pb_wdl_base@sha256:4b889a1f ... b70a8e87

Tool versions and Docker images used in these workflows can be found in the tools and containers documentation.


DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.