Portal Pipeline Utilities

Documentation for portal-pipeline-utils, a collection of utilities for deploying pipelines and interfacing with portal infrastructure.

Pipeline Utils

Install

PyPI

The package is available on PyPI:

pip install portal-pipeline-utils
Source

The version on PyPI may be outdated or may not be the required version. To install the latest version from source:

git clone https://github.com/dbmi-bgm/portal-pipeline-utils.git
cd portal-pipeline-utils
make configure
make update
make build

Please refer to pyproject.toml for the supported Python version.
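
To verify the installation, try running the entry point; if installed from source with poetry, it may need to be prefixed with poetry run (see the deployment section below).

pipeline_utils
# or, when installed from source with poetry
poetry run pipeline_utils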

pipeline_utils

This is the entry point for a collection of utilities available as commands:

Usage:

pipeline_utils [COMMAND] [ARGS]
pipeline_deploy

Utility to automatically deploy a pipeline’s components from a target repository. Multiple target repositories can be specified to deploy several pipelines at the same time.

Usage:

pipeline_utils pipeline_deploy --ff-env FF_ENV --repos REPO [REPO ...] [OPTIONAL ARGS]

Arguments:

  --ff-env
      Environment to use for deployment

  --repos
      List of directories for the repositories to deploy, each repository must follow the expected structure (see docs)

Optional Arguments:

  --builder
      Builder to use to deploy Docker containers to AWS ECR through AWS CodeBuild [<ff-env>-pipeline-builder]

  --branch
      Branch to use to deploy Docker containers to AWS ECR through AWS CodeBuild [main]

  --local-build
      Trigger a local build for Docker containers instead of using AWS CodeBuild

  --keydicts-json
      Path to file with keys for portal auth in JSON format [~/.cgap-keys.json]

  --wfl-bucket
      Bucket to use for upload of Workflow Description files (CWL or WDL)

  --account
      AWS account to use for deployment

  --region
      AWS account region to use for deployment

  --project
      Project to use for deployment [cgap-core]

  --institution
      Institution to use for deployment [hms-dbmi]

  --post-software
      DEPLOY | UPDATE Software objects (.yaml or .yml)

  --post-file-format
      DEPLOY | UPDATE File Format objects (.yaml or .yml)

  --post-file-reference
      DEPLOY | UPDATE File Reference objects (.yaml or .yml)

  --post-workflow
      DEPLOY | UPDATE Workflow objects (.yaml or .yml)

  --post-metaworkflow
      DEPLOY | UPDATE Pipeline objects (.yaml or .yml)

  --post-wfl
      Upload Workflow Description files (.cwl or .wdl)

  --post-ecr
      Build Docker container images and push to AWS ECR. By default will use AWS CodeBuild unless --local-build flag is set

  --debug
      Turn off DEPLOY | UPDATE action

  --verbose
      Print the JSON structure created for the objects

  --validate
      Validate YAML objects against schemas. Turn off DEPLOY | UPDATE action

  --sentieon-server
      Address for Sentieon license server
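
As a sketch, a validation-only run that checks the YAML objects for a single repository without deploying anything could look like the following; the environment name and repository path are placeholders.

pipeline_utils pipeline_deploy \
  --ff-env cgap-wolf \
  --repos ./pipeline-foo_bar \
  --post-software --post-file-format --post-file-reference \
  --post-workflow --post-metaworkflow \
  --validate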

API

In development.

Contribute Pipelines

Contribute a Pipeline

Welcome to the documentation on how to contribute pipelines.

We’re glad that you’re interested in contributing a pipeline, and we appreciate your help in expanding and improving our offering. This document will guide you through the process of building and deploying a new pipeline in the portal infrastructure.

Building a Pipeline

A pipeline requires several components to be compatible and run within our infrastructure:

  • Workflow description files

  • Docker containers

  • Portal objects

  • A name and a version for the pipeline

These components need to be organized following a validated structure to enable automated deployment. More information on this specific structure is available here.

Although it’s not strictly necessary, it is highly recommended to set up a GitHub repository to store and organize all the components.

Workflow Description Files

Workflow description languages are standards for describing data analysis pipelines that are portable across different platforms.

Each step of the pipeline that needs to execute in a single computing environment must be defined in a corresponding workflow description file using one of the supported languages. At the moment we are supporting two standards, Common Workflow Language (CWL) and Workflow Description Language (WDL), and we are working to add support for more standards (e.g., Snakemake).

Each step codified as a workflow description file will execute on a single EC2 machine through our workflow executor, Tibanna.

Note: the workflow description file must have a .wdl or .cwl extension to be recognized during the automated deployment.

The following example implements the steps foo and bar for the foo_bar pipeline. Each step will execute independently on a single EC2 machine.

pipeline-foo_bar
│
├── descriptions
│   ├── foo.cwl
│   └── bar.wdl
..

Typically, when creating a workflow description file, the code will reference a Docker container. To store these containers, we use private ECR repositories that are specific to each AWS account. To ensure that the description file points to the appropriate image, we use two placeholders, VERSION and ACCOUNT, which will be automatically substituted in the file with the relevant account information during deployment. If the code runs Sentieon software and requires the SENTIEON_LICENSE environment variable to be set, the LICENSEID placeholder will be substituted by the code with the server address provided to the deploy command.

Example of a CWL code with the placeholders

#!/usr/bin/env cwl-runner

cwlVersion: v1.0

class: CommandLineTool

requirements:
  - class: EnvVarRequirement
    envDef:
      -
        envName: SENTIEON_LICENSE
        envValue: LICENSEID

hints:
  - class: DockerRequirement
    dockerPull: ACCOUNT/upstream_sentieon:VERSION

baseCommand: [sentieon, driver]

...
Docker Containers

As we are using temporary EC2 machines, all code to be executed must be packaged and distributed in Docker containers.

Each pipeline can have multiple containers, and each container requires its own directory with all the related components and the corresponding Dockerfile.

During the automated deployment, each image will be automatically built, tagged based on the name of the directory, and pushed to the corresponding ECR repository within AWS. More information on the deployment process here.

The following example will build the images image_foo and image_bar, and push them to ECR during the deployment.

pipeline-foo_bar
│
├── dockerfiles
│   │
│   ├── image_foo
│   │   ├── foo.sh
│   │   └── Dockerfile
│   │
│   └── image_bar
│       ├── bar.py
│       └── Dockerfile
..
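
For reference, a local build of one of these images (as triggered by the --local-build flag) is conceptually similar to the following Docker commands; the repository URI, tag, and region are placeholders, since the deployment code derives the actual values from the account information and the VERSION file.

# Authenticate Docker to the private ECR registry (placeholders for account and region)
aws ecr get-login-password --region <region> | \
    docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com

# Build the image from its subfolder and push it, tagged with the pipeline version
docker build -t <account_id>.dkr.ecr.<region>.amazonaws.com/image_foo:<version> dockerfiles/image_foo
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/image_foo:<version>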
Portal Objects

Workflow description files and Docker containers are necessary to execute the code and run each step of the pipeline in isolation. However, a pipeline is a complex object that consists of multiple steps chained together.

To create these dependencies and specify the necessary details for the execution of each individual workflow and the end-to-end processing of the pipeline, we need additional supporting metadata in the form of YAML objects. The objects currently available are:

  • Pipeline, this object defines dependencies between workflows, scatter and gather parameters to parallelize execution, reference files and constant input parameters, and EC2 configurations for each of the workflows.

  • Workflow, this object represents a pipeline step and stores metadata to track its inputs, outputs, software, and description files (e.g., WDL or CWL).

  • Software, this object stores information to track and version a specific software used by the pipeline.

  • File Reference, this object stores information to track and version a specific reference file used by the pipeline.

  • File Format, this object stores information to represent a file format used by the pipeline.

Please refer to each of the linked pages for details on the schema definitions specific to the object and the available code templates.

Note: the files defining portal objects must have a .yaml or .yml extension to be recognized during the automated deployment.

The following example implements workflow objects for the steps foo and bar and a pipeline object for the foo_bar pipeline. Additional metadata to track reference files, file formats, and software used by the pipeline are also implemented as corresponding YAML objects.

pipeline-foo_bar
│
├── portal_objects
│   │
│   ├── workflows
│   │   ├── foo.yaml
│   │   └── bar.yaml
│   │
│   ├── metaworkflows
│   │   └── foo_bar.yaml
│   │
│   ├── file_format.yaml
│   ├── file_reference.yaml
│   └── software.yaml
..
PIPELINE and VERSION Files

Finally, automated deployment requires a pipeline version and name. These will also be used to tag some of the components deployed with the pipeline (i.e., Docker containers, workflow description files, Pipeline and Workflow objects).

This information must be provided in separate VERSION and PIPELINE one-line files.

Example

pipeline-foo_bar
│
..
├── PIPELINE
└── VERSION
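
For example, for the foo_bar pipeline these two files could be created as follows (the version string is illustrative):

echo "foo_bar" > PIPELINE
echo "v1.0.0" > VERSION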
Examples

Real examples of implemented pipeline modules can be found linked as submodules in our main pipeline repository for the CGAP project here: https://github.com/dbmi-bgm/cgap-pipeline-main.

Deploy Pipelines to AWS Environment

This document describes how to deploy pipelines to a target AWS environment. Although it’s possible to run the deployment from a local machine, we highly recommend using an AWS EC2 machine.

Setup an EC2 Machine

This step may be skipped if you already have an EC2 instance set up.

We recommend using the following configuration:

  • AMI: Use a Linux distribution (64-bit, x86)

  • Instance Type: t3.large or higher

  • Storage: 50+ GB in the main volume

Install Docker

The deployment code will try to trigger remote AWS CodeBuild jobs to build and push the Docker containers implemented for the pipelines directly in AWS. However, if no builder has been set up, it is possible to run a local build using Docker by passing the --local-build flag to the deployment command.

Running a local build requires a Docker application running on the machine. To install Docker on an EC2 machine, refer to the following instructions, based on an Amazon Linux AMI:

Update packages:

sudo yum update -y

Install the Docker Engine package:

sudo yum install docker

Start the docker service:

sudo service docker start

Ensure Docker is installed correctly and has the proper permissions by running a test command:

docker run hello-world
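
If this test fails with a permission error, the current user may need to be added to the docker group; the user name below assumes the default Amazon Linux ec2-user, and you will need to log out and back in for the group change to take effect.

# Allow the current user to run Docker without sudo (assumed user: ec2-user)
sudo usermod -aG docker ec2-user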

More information on how to setup Docker can be found in the AWS Documentation.

We now need to install the pipeline_utils software to deploy the pipeline components.

Install pipeline_utils

The software is Python-based. To install the software and the required packages, we recommend using a fresh virtual environment. Please refer to pyproject.toml for the supported Python version.

We recommend using pyenv to manage virtual environments. Instructions for installing and using pyenv can be found here.
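
A minimal sketch of setting up an environment with pyenv, assuming pyenv and the pyenv-virtualenv plugin are already installed; the Python version shown is illustrative, check pyproject.toml for the supported version.

# Install a supported Python version and create a dedicated virtual environment
pyenv install 3.11.4
pyenv virtualenv 3.11.4 pipeline-utils
pyenv activate pipeline-utils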

Once the virtual environment is set up and activated, we can proceed to install the portal-pipeline-utils software.

# Install from source
git clone https://github.com/dbmi-bgm/portal-pipeline-utils.git
cd portal-pipeline-utils
make configure
make update
make build
cd ..

# Install from PyPI
pip install portal-pipeline-utils

To check that the software is correctly installed, try running pipeline_utils. If installed from source, this command may fail with a bash “command not found” error; in that case, try poetry run pipeline_utils instead.

Set Up Credentials and Environmental Variables
AWS Auth Credentials

To deploy pipeline components to a specific AWS account, we need to set up the following environment variables to authenticate to the account.

export AWS_ACCOUNT_NUMBER=
export TIBANNA_AWS_REGION=
export GLOBAL_ENV_BUCKET=
export S3_ENCRYPT_KEY=

export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=

# Optional, depending on the account
export S3_ENCRYPT_KEY_ID=
export AWS_SESSION_TOKEN=

Tips:

  • GLOBAL_ENV_BUCKET can be found in the AWS Secrets Manager.

  • S3_ENCRYPT_KEY and S3_ENCRYPT_KEY_ID can be found in the AWS Secrets Manager.

  • AWS_SESSION_TOKEN is used by some single sign-on platforms for managing credentials but may not be required otherwise.

  • TIBANNA_AWS_REGION is the main region for the AWS account.
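
With these variables exported, a quick way to confirm that the AWS credentials are valid is to query the caller identity (this assumes the AWS CLI is installed on the machine):

# Should return the account and user/role the credentials resolve to
aws sts get-caller-identity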

Portal Credentials

We also need to set up credentials to authenticate to the portal database to push some of the portal components. These credentials need to be stored in a key-pair file as described here.

The default path used by the code to locate this file is ~/.cgap-keys.json. However, it is possible to specify a different key-pair file through a command line argument, if desired.

Example of a key-pair entry:

{
  "<namespace>": {
      "key": "XXXXXXXX",
      "secret": "xxxxxxxxxxxxxxxx",
      "server": "<URL>"
  }
}

<namespace> is the namespace for the environment and can be found in the portal health page (e.g., cgap-wolf).

Target Account Information

Finally, we need to set up the information that identifies the target environment to use for the deployment.

# Set the namespace of the target environment
#   e.g., cgap-wolf
export ENV_NAME=

# Set the bucket used to store the workflow description files
#   e.g., cgap-biotest-main-application-tibanna-cwls
export WFL_BUCKET=

# Set the path to the key-pair file with the portal credentials
export KEYDICTS_JSON=~/.cgap-keys.json

# Set up project and institution
#   Project and institution need to correspond to metadata present on the portal
#   e.g., cgap-core and hms-dbmi
export PROJECT=
export INSTITUTION=

# If running Sentieon code,
#   specify the address for the server that validates the software license
export SENTIEON_LICENSE=0.0.0.0

Tips:

  • ENV_NAME is the namespace for the environment and can be found in the portal health page under Namespace.

  • WFL_BUCKET can be found in the portal health page under Tibanna CWLs Bucket. This bucket will be used to store the workflow description files.

Running the Deployment

The following code will use the pipeline_deploy command to deploy all the components from the repositories specified by the --repos argument.

pipeline_utils pipeline_deploy \
  --ff-env ${ENV_NAME} \
  --keydicts-json ${KEYDICTS_JSON} \
  --wfl-bucket ${WFL_BUCKET} \
  --account ${AWS_ACCOUNT_NUMBER} \
  --region ${TIBANNA_AWS_REGION} \
  --project ${PROJECT} \
  --institution ${INSTITUTION} \
  --sentieon-server ${SENTIEON_LICENSE} \
  --post-software \
  --post-file-format \
  --post-file-reference \
  --post-workflow \
  --post-metaworkflow \
  --post-wfl \
  --post-ecr \
  --repos REPO [REPO ...]

It is possible to add flags to run the command in various debug modes, to validate the objects and test the pipeline implementation without running a real deployment. For more details on the command line arguments refer to the documentation for the pipeline_deploy command.

An important argument is --branch, which specifies the branch to check out in the target GitHub repository when building images for ECR through AWS CodeBuild. The default is the main branch. The --local-build flag will prevent the code from using AWS CodeBuild and force a local build with Docker instead.
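
As a sketch, a deployment that only rebuilds and pushes the Docker images locally might look like the following; depending on your setup, additional arguments from the full command above may be required.

pipeline_utils pipeline_deploy \
  --ff-env ${ENV_NAME} \
  --account ${AWS_ACCOUNT_NUMBER} \
  --region ${TIBANNA_AWS_REGION} \
  --post-ecr \
  --local-build \
  --repos REPO [REPO ...]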

Note: we are working to support additional builders, selectable through a command line argument, to deploy modules from different repositories through AWS CodeBuild.

Deploying CGAP Pipelines

CGAP pipelines are released as a complete package with a customized setup for automated deployment to the desired environment. To deploy the pipelines, run the following steps:

1. Clone the main pipeline repository. The submodules will be empty and set to the current commits saved for the main branch.

git clone https://github.com/dbmi-bgm/cgap-pipeline-main.git

2. Check out the desired version. This will set the submodules to the commits saved for that pipeline release.

git checkout <version>

3. Download the content for each submodule. The submodules will be set in a detached state on their current commits.

make pull

4. Build pipeline_utils (optional). This will build from source the latest version linked for the current release.

make configure
make update
make build

5. Set up the auth credentials as described above.

6. Set the target account information in the .env file (see above).

7. Test the deployment using the base module only.

make deploy-base

8. Deploy all the other modules.

make deploy-all
Uploading the Reference Files

After a successful deployment, all required metadata and components for the pipelines are available within the infrastructure. However, we are still missing the reference files necessary to run the pipelines. We need to copy these files to the correct locations in AWS S3 buckets.

This can be done using the AWS Command Line Interface (CLI) (see above for how to set up the auth credentials):

# Copy the reference file to the right S3 bucket
aws s3 cp <file> s3://<file_upload_bucket>/<file_location>

More details on how to set up the AWS CLI are available here, and documentation for the cp command can be found here.

Tips:

  • <file_upload_bucket> can be found in the portal health page under File Upload Bucket.

  • <file_location> can be found in the metadata page created for the reference file under Upload Key. It follows the structure <uuid>/<accession>.<extension>.

Note: if a reference file has secondary files, these all need to be uploaded as well to the correct S3 location.
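
As an illustration, uploading a FASTA reference file and its index as a secondary file could look like the following; the bucket name and upload keys are placeholders and should be taken from the portal metadata as described above.

# Upload the reference file and its secondary file to their upload keys
aws s3 cp reference.fa s3://<file_upload_bucket>/<uuid>/<accession>.fa
aws s3 cp reference.fa.fai s3://<file_upload_bucket>/<uuid>/<accession>.fa.fai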

Troubleshooting

Some possible errors are described below.

Auth Errors
botocore.exceptions.ClientError: An error occurred (400) when calling
the HeadBucket operation: Bad Request

This may indicate your credentials are out of date. Make sure your AWS credentials are up to date and source them if necessary.

No Space Left on Device Errors

When running a local build, the EC2 instance may run out of space. You can try one of the following:

  1. Clean up Docker data that is no longer needed, for example with a command such as docker rm -v $(docker ps -aq -f 'status=exited') to remove exited containers and their volumes. More details at https://vsupalov.com/cleaning-up-after-docker/.

  2. Increase the size of your primary EBS volume: details here.

  3. Mount another EBS volume to /var/lib/docker. Instructions to format and mount a volume are described here, but note that you would skip the mkdir step and mount the volume to /var/lib/docker.
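
Whichever option you choose, it can be useful to first check how much space is available and to clear unused Docker data; note that prune permanently removes stopped containers, unused images, and the build cache.

df -h                    # check available disk space on the mounted volumes
docker system prune -a   # remove stopped containers, unused images, and build cache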

Pipeline’s Components

Pipeline’s Repository Structure

To be picked up correctly by some of the commands, a repository needs to be set up as follows:

  • A descriptions folder to store workflow description files (CWL and WDL).

  • A dockerfiles folder to store Docker images. Each image should have its own subfolder with all the required components and the Dockerfile. Subfolder names will be used to tag the corresponding images together with the version from the VERSION file.

  • A portal_objects folder to store the objects representing metadata for the pipeline. This folder should include several subfolders:

    • A workflows folder to store metadata for Workflow objects as YAML files.

    • A metaworkflows folder to store metadata for Pipeline objects as YAML files.

    • A file_format.yaml file to store metadata for File Format objects.

    • A file_reference.yaml file to store metadata for File Reference objects.

    • A software.yaml file to store metadata for Software objects.

  • A PIPELINE one-line file with the pipeline name.

  • A VERSION one-line file with the pipeline version.

Example foo_bar pipeline:

pipeline-foo_bar
│
├── descriptions
│   ├── foo.cwl
│   └── bar.wdl
│
├── dockerfiles
│   │
│   ├── image_foo
│   │   ├── foo.sh
│   │   └── Dockerfile
│   │
│   └── image_bar
│       ├── bar.py
│       └── Dockerfile
│
├── portal_objects
│   │
│   ├── workflows
│   │   ├── foo.yaml
│   │   └── bar.yaml
│   │
│   ├── metaworkflows
│   │   └── foo_bar.yaml
│   │
│   ├── file_format.yaml
│   ├── file_reference.yaml
│   └── software.yaml
│
├── PIPELINE
└── VERSION

Real examples can be found linked as submodules in our pipelines repository for the CGAP project here: https://github.com/dbmi-bgm/cgap-pipeline-main.

Portal Objects

File Format

This documentation provides a comprehensive guide to the template structure necessary for implementing File Format objects. These objects enable users to codify file formats used by the pipeline.

Template
## File Format information ##################################
#     Information for file format
#############################################################
# All the following fields are required
name: <string>
extension: <extension>    # fa, fa.fai, dict, ...
description: <string>

# All the following fields are optional and provided as example,
#   can be expanded to anything accepted by the schema
#   https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
secondary_formats:
  - <format>              # bam, fastq, bwt, ...
file_types:
  - <filetype>            # FileReference, FileProcessed, FileSubmitted
status: <status>          # shared
Fields Definition
Required

All the following fields are required.

name

Name of the file format, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).

extension

Extension used by the file format.

description

Description of the file format.

Optional

All the following fields are optional and provided as example. Can be expanded to anything accepted by the schema, see schemas.

secondary_formats

List of secondary <format> available for the file format. Each <format> needs to match a file format that has been previously defined.

file_types

File types that can use the file format, as a list of <filetype>. The possible values are FileReference, FileProcessed, and FileSubmitted. If not specified, the default is FileReference and FileProcessed.

File Reference

This documentation provides a comprehensive guide to the template structure necessary for implementing File Reference objects. These objects enable users to codify information to track and version the reference files used by the pipeline.

Template
## File Reference information ###############################
#     Information for reference file
#############################################################
# All the following fields are required
name: <string>
description: <string>
format: <format>              # bam, fastq, bwt, ...
version: <string>

# All the following fields are optional and provided as example,
#   can be expanded to anything accepted by the schema
#   https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
secondary_files:
  - <format>                  # bam, fastq, bwt, ...
status: <status>              # uploading, uploaded
license: <string>             # MIT, GPLv3, ...

# Required to enable sync with a reference bucket
uuid: <uuid4>
accession: <accession>
Fields Definition
Required

All the following fields are required.

name

Name of the reference file, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).

description

Description of the reference file.

format

File format used by the reference file. <format> needs to match a file format that has been previously defined, see File Format.

version

Version of the reference file.

Optional

All the following fields are optional and provided as example. Can be expanded to anything accepted by the schema, see schemas.

secondary_files

List of <format> for secondary files associated with the reference file. Each <format> needs to match a file format that has been previously defined, see File Format.

status

Status of the upload. The possible values are uploading and uploaded. If no value is specified, the status is set to uploading when the object is first posted and is not updated during patching.

Most likely you don’t want to set this field and just use the default logic automatically applied during deployment.

license

License information.

Software

This documentation provides a comprehensive guide to the template structure necessary for implementing Software objects. These objects enable users to codify information to track and version specific software used by the pipeline.

Template
## Software information #####################################
#     Information for software
#############################################################
# All the following fields are required
name: <string>

# Either version or commit is required to identify the software
version: <string>
commit: <string>

# All the following fields are optional and provided as example,
#   can be expanded to anything accepted by the schema
#   https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
title: <string>
source_url: <string>
description: <string>
license: <string>                 # MIT, GPLv3, ...
Fields Definition
Required

All the following fields are required. Either version or commit is required to identify the software.

name

Name of the software, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).

version

Version of the software.

commit

Commit of the software.

Optional

All the following fields are optional and provided as example. Can be expanded to anything accepted by the schema, see schemas.

title

Title for the software.

source_url

URL for the software (e.g., source files, binaries, repository, etc.).

description

Description for the software.

license

License information.

Workflow

This documentation provides a comprehensive guide to the template structure necessary for implementing Workflow objects. These objects enable users to codify pipeline steps and store metadata to track inputs, outputs, software, and description files (e.g., WDL or CWL) for each workflow.

Template
## Workflow information #####################################
#     General information for the workflow
#############################################################
# All the following fields are required
name: <string>
description: <string>

runner:
  language: <language>                # cwl, wdl
  main: <file>                        # .cwl or .wdl file
  child:
    - <file>                          # .cwl or .wdl file

# All the following fields are optional and provided as example,
#   can be expanded to anything accepted by the schema
#   https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
title: <string>

software:
  - <software>@<version|commit>

## Input information ########################################
#     Input files and parameters
#############################################################
input:

  # File argument
  <file_argument_name>:
    argument_type: file.<format>      # bam, fastq, bwt, ...

  # Parameter argument
  <parameter_argument_name>:
    argument_type: parameter.<type>   # string, integer, float, json, boolean

## Output information #######################################
#     Output files and quality controls
#############################################################
output:

  # File output
  <file_output_name>:
    argument_type: file.<format>
    secondary_files:
      - <format>                      # bam, fastq, bwt, ...

  # QC output
  <qc_output_name>:
    argument_type: qc.<type>          # qc_type, e.g. quality_metric_vcfcheck
                                      # none can be used as <type>
                                      #   if a qc_type is not defined
    argument_to_be_attached_to: <file_output_name>
    # All the following fields are optional and provided as example,
    #   can be expanded to anything accepted by the schema
    html: <boolean>
    json: <boolean>
    table: <boolean>
    zipped: <boolean>
    # If the output is a zipped folder with multiple QC files,
    #   fields to define the target files inside the folder
    html_in_zipped: <file>
    tables_in_zipped:
      - <file>
    # Fields still used by tibanna that needs refactoring
    #   listing them as they are
    qc_acl: <string>                  # e.g. private
    qc_unzip_from_ec2: <boolean>

  # Report output
  <report_output_name>:
    argument_type: report.<type>      # report_type, e.g. file
General Fields Definition
Required

All the following fields are required.

name

Name of the workflow, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).

description

Description of the workflow.

runner

Definition of the data processing flow for the workflow. This field is used to specify the standard language and description files used to define the workflow. Several subfields need to be specified:

  • language [required]: Language standard used for workflow description

  • main [required]: Main description file

  • child [optional]: List of supplementary description files used by main

At the moment we support two standards, Common Workflow Language (CWL) and Workflow Description Language (WDL).

input

Description of input files and parameters for the workflow. See Input Definition.

output

Description of expected outputs for the workflow. See Output Definition.

Optional

All the following fields are optional and provided as example. Can be expanded to anything accepted by the schema, see schemas.

title

Title of the workflow.

software

List of software used by the workflow. Each software is specified using the name of the software and the version (either version or commit) in the format <software>@<version|commit>. Each software needs to match a software that has been previously defined, see Software.

Input Definition

Each argument is defined by its name. Additional subfields need to be specified depending on the argument type.

argument_type

Definition of the type of the argument.

For a file argument, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.

For a parameter argument, the argument type is defined as parameter.<type>, where <type> is the type of the value expected for the argument [string, integer, float, json, boolean].

Output Definition

Each output is defined by its name. Additional subfields need to be specified depending on the output type.

argument_type

Definition of the type of the output.

For a file output, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.

For a QC (Quality Control) output, the argument type is defined as qc.<type>, where <type> is a qc_type defined in the schema, see schemas.

For a report output, the argument type is defined as report.<type>, where <type> is the type of the report (e.g., file).

Note: We are currently rethinking how QC and report outputs work; the current definitions are temporary solutions that may change soon.

secondary_files

This field can be used for output files.

List of <format> for secondary files associated with the output file. Each <format> needs to match a file format that has been previously defined, see File Format.

argument_to_be_attached_to

This field can be used for output QCs.

Name of the output file the QC is calculated for.

Pipeline

This documentation provides a comprehensive guide to the template structure necessary for implementing Pipeline objects. These objects enable users to define workflow dependencies, parallelize execution by defining scattering and gathering parameters, specify reference files and constant input parameters, and configure AWS EC2 instances for executing each workflow within the pipeline.

Template
## Pipeline information #####################################
#     General information for the pipeline
#############################################################
# All the following fields are required
name: <string>
description: <string>

# All the following fields are optional and provided as example,
#   can be expanded to anything accepted by the schema
#   https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
proband_only: <boolean>

## General arguments ########################################
#     Pipeline input, reference files, and general arguments
#       define all arguments for the pipeline here
#############################################################
input:

  # File argument
  <file_argument_name>:
    argument_type: file.<format>        # bam, fastq, bwt, ...
    # All the following fields are optional and provided as example,
    #   can be expanded to anything accepted by the schema
    dimensionality: <integer>
    files:
      - <file_name>@<version>

  # Parameter argument
  <parameter_argument_name>:
    argument_type: parameter.<type>     # string, integer, float, json, boolean
    # All the following fields are optional and provided as example,
    #   can be expanded to anything accepted by the schema
    value: <...>

## Workflows and dependencies ###############################
#     Information for the workflows and their dependencies
#############################################################
workflows:

  ## Workflow definition #####################
  ############################################
  <workflow_name>[@<tag>]:

    ## Hard dependencies ###############
    #   Dependencies that must complete
    ####################################
    dependencies:
      - <workflow_name>[@<tag>]

    ## Lock version ####################
    #   Specific version to use
    #     for the workflow
    ####################################
    version: <string>

    ## Specific arguments ##############
    #   General arguments that need to be referenced and
    #     specific arguments for the workflow:
    #       - file arguments that need to source the output from a previous step
    #       - file arguments that need to scatter or gather
    #       - parameter arguments that need to have a modified value / dimensions
    #       - all arguments that need to source from a general argument with a different name
    ####################################
    input:

      # File argument
      <file_argument_name>:
        argument_type: file.<format>      # bam, fastq, bwt ...
        # Linking fields
        #   These are optional fields
        #   Check https://magma-suite.readthedocs.io/en/latest/meta-workflow.html
        #     for more details
        source: <workflow_name>[@<tag>]
        source_argument_name: <file_argument_name>
        # Input dimension
        #   These are optional fields to specify input argument dimensions to use
        #     when creating the pipeline structure or step specific inputs
        #   See https://magma-suite.readthedocs.io/en/latest/meta-workflow.html
        #     for more details
        scatter: <integer>
        gather: <integer>
        input_dimension: <integer>
        extra_dimension: <integer>
        # All the following fields are optional and provided as example,
        #   can be expanded to anything accepted by the schema
        mount: <boolean>
        rename: formula:<parameter_argument_name>
              #  can be used to specify a name for parameter argument
              #    to use to set a rename field for the file
        unzip: <string>

      # Parameter argument
      <parameter_argument_name>:
        argument_type: parameter.<type>
        # All the following fields are optional and provided as example,
        #   can be expanded to anything accepted by the schema
        value: <...>
        source_argument_name: <parameter_argument_name>

    ## Output ##########################
    #     Output files for the workflow
    ####################################
    output:

      # File output
      <file_output_name>:
        file_type: <file_type>
        # All the following fields are optional and provided as example,
        #   can be expanded to anything accepted by the schema
        description: <string>
        linkto_location:
          - <location>                    # Sample, SampleProcessing
        higlass_file: <boolean>
        variant_type: <variant_type>      # SNV, SV, CNV
        vcf_to_ingest: <boolean>
        s3_lifecycle_category: <string>   # short_term_access_long_term_archive,
                                          # short_term_access, short_term_archive,
                                          # long_term_access_long_term_archive,
                                          # long_term_access, long_term_archive,
                                          # no_storage, ignore

    ## EC2 Configuration to use ########
    ####################################
    config:
      <config_parameter>: <...>
General Fields Definition
Required

All the following fields are required.

name

Name of the pipeline, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).

description

Description of the pipeline.

input

Description of general input files and parameters for the pipeline. See Input Definition.

workflows

Description of workflows that are steps of the pipeline. See Workflows Definition.

Optional

All the following fields are optional and provided as example. Can be expanded to anything accepted by the schema, see schemas.

title

Title of the pipeline.

Workflows Definition

Each workflow is defined by its name and represents a step of the pipeline. Additional subfields need to be specified.

The workflow name must follow the format <workflow_name>[@<tag>]. <workflow_name> needs to match a workflow that has been previously defined, see Workflow. If the same workflow is used for multiple steps in the pipeline, a tag can be added to the name of the workflow after ‘@’ to make it unique (e.g., a QC step that runs twice at different points in the pipeline). If a <tag> is used while defining a workflow, <workflow_name>@<tag> must be used to reference the correct step as a dependency.

dependencies

Workflows that must complete before the current step can start. List of workflows in the format <workflow_name>[@<tag>].

version

Version to use for the corresponding workflow instead of the default specified for the repository. This allows locking the workflow to a specific version.

input

Description of general arguments that need to be referenced and specific arguments for the step. See Input Definition.

output

Description of expected output files for the workflow.

Each output is defined by its name. Additional subfields can be specified. See schemas.

Each output name needs to match an output name that has been previously defined in the corresponding workflow, see Workflow.

config

Description of configuration parameters to run the workflow. Any parameters can be defined here and will be used to configure the run in AWS (e.g., EC2 type, EBS size, …).

Input Definition

Each argument is defined by its name. Additional subfields need to be specified depending on the argument type. Each argument name needs to match an argument name that has been previously defined in the corresponding workflow, see Workflow.

argument_type

Definition of the type of the argument.

For a file argument, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.

For a parameter argument, the argument type is defined as parameter.<type>, where <type> is the type of the value expected for the argument [string, integer, float, json, boolean].

files

This field can be used to assign specific files to a file argument. For example, specific reference files that are constant for the pipeline can be specified for the corresponding argument using this field.

Each file is specified using the name of the file and the version in the format <file_name>@<version>. For reference files, each file needs to match a file reference that has been previously defined, see File Reference.

value

This field can be used to assign a specific value to a parameter argument.

Note: As of now, the value needs to be always encoded as <string>. We are working to improve this and enable usage of real types.

Example

a_float:
  argument_type: parameter.float
  value: "0.8"

an_integer:
  argument_type: parameter.integer
  value: "1"

a_string_array:
  argument_type: parameter.json
  value: "[\"DEL\", \"DUP\"]"
Linking Fields

These are optional fields that can be used when defining workflow specific arguments to describe dependencies and map to arguments with different names. See the magma documentation for more details.

source

This field can be used to assign a dependency for a file argument to a previous workflow. It must follow the format <workflow_name>[@<tag>] to reference the correct step as source.

source_argument_name

This field can be used to source a specific argument by name. It can be used to:

  • Specify the name of an output of a source step to use.

  • Specify the name of a general argument defined in the input section to use when it differs from the argument name.

Input Dimension Fields

These are optional fields that can be used when defining workflow specific arguments to specify the input dimensions to use when creating the pipeline structure or step specific inputs. See magma documentation for more details.

scatter

Input dimension to use to scatter the workflow. This will create multiple shards in the pipeline for the step. The same dimension will be used to subset the input when creating the specific input for each shard.

gather

Increment for input dimension when gathering from previous shards. This will collate multiple shards into a single step. The same increment in dimension will be used when creating the specific input for the step.

input_dimension

Additional dimension used to subset the input when creating the specific input for the step. This will be applied on top of scatter, if any, and will only affect the input. This will not affect the scatter dimension used to create the shards for the step.

extra_dimension

Additional increment to the dimension used when creating the specific input for the step. This will be applied on top of gather, if any, and will only affect the input. This will not affect the gather dimension used in building the pipeline structure.