Portal Pipeline Utilities
Documentation for portal-pipeline-utils, a collection of utilities for deploying pipelines and interfacing with portal infrastructure.
Contents
Pipeline Utils
Install
PyPI
The package is available on PyPI:
pip install portal-pipeline-utils
Source
The version on PyPI may be outdated or may not be the required version. To install the latest version from source:
git clone https://github.com/dbmi-bgm/portal-pipeline-utils.git
cd portal-pipeline-utils
make configure
make update
make build
Please refer to pyproject.toml for the supported Python version.
pipeline_utils
This is the entry point for a collection of utilities available as commands:
Usage:
pipeline_utils [COMMAND] [ARGS]
pipeline_deploy
Utility to automatically deploy a pipeline's components from a target repository. Multiple target repositories can be specified to deploy several pipelines at the same time.
Usage:
pipeline_utils pipeline_deploy --ff-env FF_ENV --repos REPO [REPO ...] [OPTIONAL ARGS]
Arguments:
| Argument | Definition |
|---|---|
| --ff-env | Environment to use for deployment |
| --repos | List of directories for the repositories to deploy, each repository must follow the expected structure (see docs) |
Optional Arguments:
| Argument | Definition |
|---|---|
| --builder | Builder to use to deploy Docker containers to AWS ECR through AWS CodeBuild [<ff-env>-pipeline-builder] |
| --branch | Branch to use to deploy Docker containers to AWS ECR through AWS CodeBuild [main] |
| --local-build | Trigger a local build for Docker containers instead of using AWS CodeBuild |
| --keydicts-json | Path to file with keys for portal auth in JSON format [~/.cgap-keys.json] |
| --wfl-bucket | Bucket to use for upload of Workflow Description files (CWL or WDL) |
| --account | AWS account to use for deployment |
| --region | AWS account region to use for deployment |
| --project | Project to use for deployment [cgap-core] |
| --institution | Institution to use for deployment [hms-dbmi] |
| --post-software | DEPLOY \| UPDATE Software objects (.yaml or .yml) |
| --post-file-format | DEPLOY \| UPDATE File Format objects (.yaml or .yml) |
| --post-file-reference | DEPLOY \| UPDATE File Reference objects (.yaml or .yml) |
| --post-workflow | DEPLOY \| UPDATE Workflow objects (.yaml or .yml) |
| --post-metaworkflow | DEPLOY \| UPDATE Pipeline objects (.yaml or .yml) |
| --post-wfl | Upload Workflow Description files (.cwl or .wdl) |
| --post-ecr | Build Docker container images and push them to AWS ECR; uses AWS CodeBuild by default unless the --local-build flag is set |
| --debug | Turn off the DEPLOY \| UPDATE action |
| --verbose | Print the JSON structure created for the objects |
| --validate | Validate YAML objects against schemas; turns off the DEPLOY \| UPDATE action |
| --sentieon-server | Address for the Sentieon license server |
API
In development.
Contribute Pipelines
Contribute a Pipeline
Welcome to the documentation on how to contribute pipelines.
We’re glad that you’re interested in contributing a pipeline, and we appreciate your help in expanding and improving our offering. This document will guide you through the process of building and deploying a new pipeline in the portal infrastructure.
Building a Pipeline
A pipeline requires several components to be compatible and run within our infrastructure:
Workflow description files
Docker containers
Portal objects
A name and a version for the pipeline
These components need to be organized following a validated structure to enable automated deployment. More information on this specific structure is available here.
Although it’s not strictly necessary, it is highly recommended to set up a GitHub repository to store and organize all the components.
Workflow Description Files
Workflow description languages are standards for describing data analysis pipelines that are portable across different platforms.
Each step of the pipeline that needs to execute in a single computing environment must be defined in a corresponding workflow description file using one of the supported languages. At the moment we are supporting two standards, Common Workflow Language (CWL) and Workflow Description Language (WDL), and we are working to add support for more standards (e.g., Snakemake).
Each step codified as a workflow description file will execute on a single EC2 machine through our executioner software, Tibanna.
Note: the workflow description file must have a .wdl or .cwl extension to be recognized during the automated deployment.
The following example implements the steps foo and bar for the foo_bar pipeline.
Each step will execute independently on a single EC2 machine.
pipeline-foo_bar
│
├── descriptions
│ ├── foo.cwl
│ └── bar.wdl
..
Typically, when creating a workflow description file, the code will make reference to a Docker container. To store these containers, we use private ECR repositories that are specific to each AWS account. To ensure that the description file points to the appropriate image, we utilize two placeholders, VERSION and ACCOUNT, which will be automatically substituted in the file with the relevant account information during deployment. If the code runs Sentieon software and requires the SENTIEON_LICENSE environment variable to be set, the LICENSEID placeholder will be substituted with the server address provided to the deploy command.
Example of a CWL code with the placeholders
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
requirements:
- class: EnvVarRequirement
envDef:
-
envName: SENTIEON_LICENSE
envValue: LICENSEID
hints:
- class: DockerRequirement
dockerPull: ACCOUNT/upstream_sentieon:VERSION
baseCommand: [sentieon, driver]
...
Docker Containers
As we are using temporary EC2 machines, all code to be executed must be packaged and distributed in Docker containers.
Each pipeline can have multiple containers, and each container requires its own directory with all the related components and the corresponding Dockerfile.
During the automated deployment, each image will be automatically built, tagged based on the name of the directory, and pushed to the corresponding ECR repository within AWS. More information on the deployment process here.
The following example will build the images image_foo and image_bar, and push them to ECR during deployment.
pipeline-foo_bar
│
├── dockerfiles
│ │
│ ├── image_foo
│ │ ├── foo.sh
│ │ └── Dockerfile
│ │
│ └── image_bar
│ ├── bar.py
│ └── Dockerfile
..
Portal Objects
Workflow description files and Docker containers are necessary to execute the code and run each step of the pipeline in isolation. However, a pipeline is a complex object that consists of multiple steps chained together.
To create these dependencies and specify the necessary details for the execution of each individual workflow and the end-to-end processing of the pipeline, we need additional supporting metadata in the form of YAML objects. The objects currently available are:
Pipeline: defines dependencies between workflows, scatter and gather parameters to parallelize execution, reference files and constant input parameters, and EC2 configurations for each of the workflows.
Workflow: represents a pipeline step and stores metadata to track its inputs, outputs, software, and description files (e.g., WDL or CWL).
Software: stores information to track and version a specific software used by the pipeline.
File Reference: stores information to track and version a specific reference file used by the pipeline.
File Format: stores information to represent a file format used by the pipeline.
Please refer to each of the linked pages for details on the schema definitions specific to the object and the available code templates.
Note: the files defining portal objects must have a .yaml or .yml extension to be recognized during the automated deployment.
The following example implements workflow objects for the steps foo and bar and a pipeline object for the foo_bar pipeline.
Additional metadata to track reference files, file formats, and software used by the pipeline are also implemented as corresponding YAML objects.
pipeline-foo_bar
│
├── portal_objects
│ │
│ ├── workflows
│ │ ├── foo.yaml
│ │ └── bar.yaml
│ │
│ ├── metaworkflows
│ │ └── foo_bar.yaml
│ │
│ ├── file_format.yaml
│ ├── file_reference.yaml
│ └── software.yaml
..
PIPELINE and VERSION Files
Finally, automated deployment requires a pipeline version and name. These will also be used to tag some of the components deployed with the pipeline (i.e., Docker containers, workflow description files, Pipeline and Workflow objects).
This information must be provided in separate VERSION and PIPELINE one-line files.
Example
pipeline-foo_bar
│
..
├── PIPELINE
└── VERSION
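For the foo_bar example, each file holds a single line: PIPELINE would contain the pipeline name, foo_bar, and VERSION a version string chosen by the pipeline author (for example, v1.0.0; the exact value shown here is only illustrative).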
Examples
Real examples of implemented pipeline modules can be found linked as submodules in our main pipeline repository for the CGAP project here: https://github.com/dbmi-bgm/cgap-pipeline-main.
Deploy Pipelines to AWS Environment
This document describes how to deploy pipelines to a target AWS environment. Although it’s possible to run the deployment from a local machine, we highly recommend using an AWS EC2 machine.
Setup an EC2 Machine
This step may be skipped if you have an EC2 already set up.
We recommend using the following configuration:
AMI: Use a linux distribution (64-bit, x86)
Instance Type: t3.large or higher
Storage: 50+GB in main volume
Install Docker
The deployment code will try to trigger remote AWS CodeBuild jobs to build and push the Docker containers implemented for the pipelines directly in AWS.
However, if no builder has been set up, it is possible to run a local build using Docker by passing the --local-build flag to the deployment command.
Running a local build requires having a Docker application running on the machine. To install Docker on an EC2 machine, refer to the following instructions, based on an Amazon Linux AMI:
Update packages:
sudo yum update -y
Install the Docker Engine package:
sudo yum install docker
Start the docker service:
sudo service docker start
Ensure Docker is installed correctly and has the proper permissions by running a test command:
docker run hello-world
More information on how to set up Docker can be found in the AWS Documentation.
We now need to install the pipeline_utils software to deploy the pipeline components.
Install pipeline_utils
The software is Python-based. To install the software and the required packages, we recommend using a fresh virtual environment. Please refer to pyproject.toml for the supported Python version.
We recommend using pyenv to manage virtual environments. Instructions for installing and using pyenv can be found here.
Once the virtual environment is set up and activated, we can proceed to install the portal-pipeline-utils software.
# Install from source
git clone https://github.com/dbmi-bgm/portal-pipeline-utils.git
cd portal-pipeline-utils
make configure
make update
make build
cd ..
# Install from pypi
pip install portal-pipeline-utils
To check that the software is correctly installed, try running pipeline_utils. If installed from source, this command may fail with a bash "command not found" error; in that case, try poetry run pipeline_utils instead.
Set Up Credentials and Environmental Variables
AWS Auth Credentials
To deploy pipeline components to a specific AWS account, we need to set up the following environment variables to authenticate to the account.
export AWS_ACCOUNT_NUMBER=
export TIBANNA_AWS_REGION=
export GLOBAL_ENV_BUCKET=
export S3_ENCRYPT_KEY=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
# Optional, depending on the account
export S3_ENCRYPT_KEY_ID=
export AWS_SESSION_TOKEN=
Tips:
GLOBAL_ENV_BUCKET can be found in the AWS Secrets Manager.
S3_ENCRYPT_KEY and S3_ENCRYPT_KEY_ID can be found in the AWS Secrets Manager.
AWS_SESSION_TOKEN is used by some single sign-on platforms for managing credentials but may not be required otherwise.
TIBANNA_AWS_REGION is the main region for the AWS account.
Portal Credentials
We also need to setup credentials to authenticate to the portal database to push some of the portal components. These credentials need to be stored as a key-pair file as described here.
The default path used by the code to locate this file is ~/.cgap-keys.json
.
However, it is possible to specify a different key-pair file through a command line argument, if desired.
Example of a key-pair entry:
{
"<namespace>": {
"key": "XXXXXXXX",
"secret": "xxxxxxxxxxxxxxxx",
"server": "<URL>"
}
}
<namespace>
is the namespace for the environment and can be found in the portal health page (e.g., cgap-wolf).
Target Account Information
Finally we need to setup the information to identify the target environment to use for the deployment.
# Set the namespace of the target environment
# e.g., cgap-wolf
export ENV_NAME=
# Set the bucket used to store the workflow description files
# e.g., cgap-biotest-main-application-tibanna-cwls
export WFL_BUCKET=
# Set the path to the key-pair file with the portal credentials
export KEYDICTS_JSON=~/.cgap-keys.json
# Set up project and institution
# Project and institution need to correspond to metadata present on the portal
# e.g., cgap-core and hms-dbmi
export PROJECT=
export INSTITUTION=
# If running Sentieon code,
# specify the address of the server that validates the software license
export SENTIEON_LICENSE=0.0.0.0
Tips:
ENV_NAME is the namespace for the environment and can be found in the portal health page under Namespace.
WFL_BUCKET can be found in the portal health page under Tibanna CWLs Bucket. This bucket will be used to store the workflow description files.
Running the Deployment
The following code will use the pipeline_deploy command to deploy all the components from the repositories specified by the --repos argument.
pipeline_utils pipeline_deploy \
--ff-env ${ENV_NAME} \
--keydicts-json ${KEYDICTS_JSON} \
--wfl-bucket ${WFL_BUCKET} \
--account ${AWS_ACCOUNT_NUMBER} \
--region ${TIBANNA_AWS_REGION} \
--project ${PROJECT} \
--institution ${INSTITUTION} \
--sentieon-server ${SENTIEON_LICENSE} \
--post-software \
--post-file-format \
--post-file-reference \
--post-workflow \
--post-metaworkflow \
--post-wfl \
--post-ecr \
--repos REPO [REPO ...]
It is possible to add flags to run the command in various debug modes, to validate the objects and test the pipeline implementation without running a real deployment. For more details on the command line arguments refer to the documentation for the pipeline_deploy command.
An important argument is --branch, which specifies the branch to check out in the target GitHub repository when building Docker containers for ECR through AWS CodeBuild. The default is the main branch. The --local-build flag will prevent the code from using AWS CodeBuild and force a local build with Docker instead.
Note: we are working to enable additional builders, together with a command line argument to select which builder to use to deploy modules from different repositories through AWS CodeBuild.
Deploying CGAP Pipelines
CGAP pipelines are released as a complete package with a customized setup for automated deployment to the desired environment. To deploy the pipelines, run the following steps:
1. Clone the main pipeline repository. The submodules will be empty and set to the current commits saved for the main branch.
git clone https://github.com/dbmi-bgm/cgap-pipeline-main.git
2. Check out the desired version. This will set the submodules to the commits saved for that pipeline release.
git checkout <version>
3. Download the content for each submodule. The submodules will be set in detached state on their current commit.
make pull
4. Build pipeline_utils (optional). This will build from source the latest version linked for the current release.
make configure
make update
make build
5. Set up the auth credentials as described above.
6. Set the target account information in the .env file (see above).
7. Test the deployment using the base module only.
make deploy-base
8. Deploy all the other modules.
make deploy-all
Uploading the Reference Files
After a successful deployment, all required metadata and components for the pipelines are available within the infrastructure. However, we are still missing the reference files necessary to run the pipelines. We need to copy these files to the correct locations in AWS S3 buckets.
This can be done using the AWS Command Line Interface (CLI) (see above how to set the auth credentials):
# Copy the reference file to the right S3 bucket
aws s3 cp <file> s3://<file_upload_bucket>/<file_location>
More details on how to set up the AWS CLI are available here, and documentation for the cp command can be found here.
Tips:
<file_upload_bucket> can be found in the portal health page under File Upload Bucket.
<file_location> can be found in the metadata page created for the reference file under Upload Key. It follows the structure <uuid>/<accession>.<extension>.
Note: if a reference file has secondary files, these all need to be uploaded as well to the correct S3 location.
Troubleshooting
Some possible errors are described below.
Auth Errors
botocore.exceptions.ClientError: An error occurred (400) when calling
the HeadBucket operation: Bad Request
This may indicate your credentials are out of date. Make sure your AWS credentials are up to date and source them if necessary.
No Space Left on Device Errors
When running a local build, the EC2 may run out of space. You can try one of the following:
Clean up Docker resources that are no longer needed, for example by removing exited containers with docker rm -v $(docker ps -aq -f 'status=exited'). More details at https://vsupalov.com/cleaning-up-after-docker/.
Increase the size of your primary EBS volume: details here.
Mount another EBS volume to /var/lib/docker. Instructions to format and mount a volume are described here, but note that you would skip the mkdir step and mount the volume to /var/lib/docker.
Pipeline’s Components
Pipeline’s Repository Structure
To be picked up correctly by some of the commands, a repository needs to be set up as follows:
A descriptions folder to store workflow description files (CWL and WDL).
A dockerfiles folder to store the Docker image sources. Each image should have its own subfolder with all the required components and the Dockerfile. Subfolder names will be used to tag the corresponding images, together with the version from the VERSION file.
A portal_objects folder to store the objects representing metadata for the pipeline. This folder should include the following:
A workflows folder to store metadata for Workflow objects as YAML files.
A metaworkflows folder to store metadata for Pipeline objects as YAML files.
A file_format.yaml file to store metadata for File Format objects.
A file_reference.yaml file to store metadata for File Reference objects.
A software.yaml file to store metadata for Software objects.
A PIPELINE one-line file with the pipeline name.
A VERSION one-line file with the pipeline version.
Example foo_bar pipeline:
pipeline-foo_bar
│
├── descriptions
│ ├── foo.cwl
│ └── bar.wdl
│
├── dockerfiles
│ │
│ ├── image_foo
│ │ ├── foo.sh
│ │ └── Dockerfile
│ │
│ └── image_bar
│ ├── bar.py
│ └── Dockerfile
│
├── portal_objects
│ │
│ ├── workflows
│ │ ├── foo.yaml
│ │ └── bar.yaml
│ │
│ ├── metaworkflows
│ │ └── foo_bar.yaml
│ │
│ ├── file_format.yaml
│ ├── file_reference.yaml
│ └── software.yaml
│
├── PIPELINE
└── VERSION
Real examples can be found linked as submodules in our pipelines repository for the CGAP project here: https://github.com/dbmi-bgm/cgap-pipeline-main.
Portal Objects
File Format
This documentation provides a comprehensive guide to the template structure necessary for implementing File Format objects. These objects enable users to codify file formats used by the pipeline.
Template
## File Format information ##################################
# Information for file format
#############################################################
# All the following fields are required
name: <string>
extension: <extension> # fa, fa.fai, dict, ...
description: <string>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
# https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
secondary_formats:
- <format> # bam, fastq, bwt, ...
file_types:
- <filetype> # FileReference, FileProcessed, FileSubmitted
status: <status> # shared
Fields Definition
Required
All the following fields are required.
name
Name of the file format, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).
extension
Extension used by the file format.
description
Description of the file format.
Optional
All the following fields are optional and provided as examples. They can be expanded to anything accepted by the schema, see schemas.
secondary_formats
List of secondary <format> available for the file format. Each <format> needs to match a file format that has been previously defined.
file_types
File types that can use the file format. List of <filetype>; the possible values are FileReference, FileProcessed, and FileSubmitted. The default, if not specified, is FileReference and FileProcessed.
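As an illustration, a File Format object for a FASTA file could be written as follows. The values are hypothetical and only need to satisfy the schema; fa_fai is assumed to be another File Format defined elsewhere.
# Example File Format object (illustrative values)
name: fa
extension: fa
description: FASTA format for nucleotide sequences
secondary_formats:
  - fa_fai
status: shared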
File Reference
This documentation provides a comprehensive guide to the template structure necessary for implementing File Reference objects. These objects enable users to codify information to track and version the reference files used by the pipeline.
Template
## File Reference information ###############################
# Information for reference file
#############################################################
# All the following fields are required
name: <string>
description: <string>
format: <format> # bam, fastq, bwt, ...
version: <string>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
# https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
secondary_files:
- <format> # bam, fastq, bwt, ...
status: <status> # uploading, uploaded
license: <string> # MIT, GPLv3, ...
# Required to enable sync with a reference bucket
uuid: <uuid4>
accession: <accession>
Fields Definition
Required
All the following fields are required.
name
Name of the reference file, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).
description
Description of the reference file.
format
File format used by the reference file.
<format> needs to match a file format that has been previously defined, see File Format.
version
Version of the reference file.
Optional
All the following fields are optional and provided as examples. They can be expanded to anything accepted by the schema, see schemas.
secondary_files
List of <format> for secondary files associated with the reference file. Each <format> needs to match a file format that has been previously defined, see File Format.
status
Status of the upload. The possible values are uploading and uploaded.
If no value is specified, the status will not be updated during patching and will be set to uploading when the object is posted for the first time. Most likely you don't want to set this field; just rely on the default logic automatically applied during deployment.
license
License information.
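Building on the fa format above, a hypothetical File Reference object could look like the following. All values are illustrative, and the uuid and accession fields needed to sync with a reference bucket are omitted here.
# Example File Reference object (illustrative values)
name: reference_genome
description: Reference genome sequence in FASTA format
format: fa
version: "1.0"
secondary_files:
  - fa_fai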
Software
This documentation provides a comprehensive guide to the template structure necessary for implementing Software objects. These objects enable users to codify information to track and version specific software used by the pipeline.
Template
## Software information #####################################
# Information for software
#############################################################
# All the following fields are required
name: <string>
# Either version or commit is required to identify the software
version: <string>
commit: <string>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
# https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
title: <string>
source_url: <string>
description: <string>
license: <string> # MIT, GPLv3, ...
Fields Definition
Required
All the following fields are required. Either version or commit is required to identify the software.
name
Name of the software, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).
version
Version of the software.
commit
Commit of the software.
Optional
All the following fields are optional and provided as examples. They can be expanded to anything accepted by the schema, see schemas.
title
Title for the software.
source_url
URL for the software (e.g., source files, binaries, repository, etc.).
description
Description for the software.
license
License information.
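As an illustration, a Software object for an aligner used by the pipeline might look like this (all values are examples only):
# Example Software object (illustrative values)
name: bwa
version: 0.7.17
title: BWA
source_url: https://github.com/lh3/bwa
description: Burrows-Wheeler Aligner for short reads
license: GPLv3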
Workflow
This documentation provides a comprehensive guide to the template structure necessary for implementing Workflow objects. These objects enable users to codify pipeline steps and store metadata to track inputs, outputs, software, and description files (e.g., WDL or CWL) for each workflow.
Template
## Workflow information #####################################
# General information for the workflow
#############################################################
# All the following fields are required
name: <string>
description: <string>
runner:
language: <language> # cwl, wdl
main: <file> # .cwl or .wdl file
child:
- <file> # .cwl or .wdl file
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
# https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
title: <string>
software:
- <software>@<version|commit>
## Input information ########################################
# Input files and parameters
#############################################################
input:
# File argument
<file_argument_name>:
argument_type: file.<format> # bam, fastq, bwt, ...
# Parameter argument
<parameter_argument_name>:
argument_type: parameter.<type> # string, integer, float, json, boolean
## Output information #######################################
# Output files and quality controls
#############################################################
output:
# File output
<file_output_name>:
argument_type: file.<format>
secondary_files:
- <format> # bam, fastq, bwt, ...
# QC output
<qc_output_name>:
argument_type: qc.<type> # qc_type, e.g. quality_metric_vcfcheck
# none can be used as <type>
# if a qc_type is not defined
argument_to_be_attached_to: <file_output_name>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
html: <boolean>
json: <boolean>
table: <boolean>
zipped: <boolean>
# If the output is a zipped folder with multiple QC files,
# fields to define the target files inside the folder
html_in_zipped: <file>
tables_in_zipped:
- <file>
# Fields still used by Tibanna that need refactoring
# listing them as they are
qc_acl: <string> # e.g. private
qc_unzip_from_ec2: <boolean>
# Report output
<report_output_name>:
argument_type: report.<type> # report_type, e.g. file
General Fields Definition
Required
All the following fields are required.
name
Name of the workflow, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).
description
Description of the workflow.
runner
Definition of the data processing flow for the workflow. This field is used to specify the standard language and description files used to define the workflow. Several subfields need to be specified:
language [required]: Language standard used for workflow description
main [required]: Main description file
child [optional]: List of supplementary description files used by main
At the moment we support two standards, Common Workflow Language (CWL) and Workflow Description Language (WDL).
input
Description of input files and parameters for the workflow. See Input Definition.
output
Description of expected outputs for the workflow. See Output Definition.
Optional
All the following fields are optional and provided as examples. They can be expanded to anything accepted by the schema, see schemas.
title
Title of the workflow.
software
List of software used by the workflow.
Each software is specified using the name of the software and the version (either version or commit) in the format <software>@<version|commit>.
Each software needs to match a software that has been previously defined, see Software.
Input Definition
Each argument is defined by its name. Additional subfields need to be specified depending on the argument type.
argument_type
Definition of the type of the argument.
For a file argument, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.
For a parameter argument, the argument type is defined as parameter.<type>, where <type> is the type of the value expected for the argument [string, integer, float, json, boolean].
Output Definition
Each output is defined by its name. Additional subfields need to be specified depending on the output type.
argument_type
Definition of the type of the output.
For a file output, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.
For a QC (Quality Control) output, the argument type is defined as qc.<type>, where <type> is a qc_type defined in the schema, see schemas.
For a report output, the argument type is defined as report.<type>, where <type> is the type of the report (e.g., file).
Note: we are currently re-thinking how QC and report outputs work; the current definitions are temporary solutions that may change soon.
secondary_files
This field can be used for output files.
List of <format> for secondary files associated with the output file. Each <format> needs to match a file format that has been previously defined, see File Format.
argument_to_be_attached_to
This field can be used for output QCs.
Name of the output file the QC is calculated for.
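Putting these fields together, a hypothetical Workflow object for the foo step of the foo_bar example might look like the sketch below. The bam, fastq, and bai formats, the bwa software, and the use of none as the qc type are assumptions for illustration and must match objects defined elsewhere in the repository.
# Example Workflow object for the foo step (illustrative)
name: foo
description: Run the foo step of the foo_bar pipeline
runner:
  language: cwl
  main: foo.cwl
software:
  - bwa@0.7.17
input:
  input_file_fastq:
    argument_type: file.fastq
  nthreads:
    argument_type: parameter.integer
output:
  output_file_bam:
    argument_type: file.bam
    secondary_files:
      - bai
  output_check_qc:
    argument_type: qc.none
    argument_to_be_attached_to: output_file_bam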
Pipeline
This documentation provides a comprehensive guide to the template structure necessary for implementing Pipeline objects. These objects enable users to define workflow dependencies, parallelize execution by defining scattering and gathering parameters, specify reference files and constant input parameters, and configure AWS EC2 instances for executing each workflow within the pipeline.
Template
## Pipeline information #####################################
# General information for the pipeline
#############################################################
# All the following fields are required
name: <string>
description: <string>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
# https://github.com/dbmi-bgm/cgap-portal/tree/master/src/encoded/schemas
proband_only: <boolean>
## General arguments ########################################
# Pipeline input, reference files, and general arguments
# define all arguments for the pipeline here
#############################################################
input:
# File argument
<file_argument_name>:
argument_type: file.<format> # bam, fastq, bwt, ...
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
dimensionality: <integer>
files:
- <file_name>@<version>
# Parameter argument
<parameter_argument_name>:
argument_type: parameter.<type> # string, integer, float, json, boolean
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
value: <...>
## Workflows and dependencies ###############################
# Information for the workflows and their dependencies
#############################################################
workflows:
## Workflow definition #####################
############################################
<workflow_name>[@<tag>]:
## Hard dependencies ###############
# Dependencies that must complete
####################################
dependencies:
- <workflow_name>[@<tag>]
## Lock version ####################
# Specific version to use
# for the workflow
####################################
version: <string>
## Specific arguments ##############
# General arguments that need to be referenced and
# specific arguments for the workflow:
# - file arguments that need to source the output from a previous step
# - file arguments that need to scatter or gather
# - parameter arguments that need to have a modified value / dimensions
# - all arguments that need to source from a general argument with a different name
####################################
input:
# File argument
<file_argument_name>:
argument_type: file.<format> # bam, fastq, bwt ...
# Linking fields
# These are optional fields
# Check https://magma-suite.readthedocs.io/en/latest/meta-workflow.html
# for more details
source: <workflow_name>[@<tag>]
source_argument_name: <file_argument_name>
# Input dimension
# These are optional fields to specify input argument dimensions to use
# when creating the pipeline structure or step specific inputs
# See https://magma-suite.readthedocs.io/en/latest/meta-workflow.html
# for more details
scatter: <integer>
gather: <integer>
input_dimension: <integer>
extra_dimension: <integer>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
mount: <boolean>
rename: formula:<parameter_argument_name>
# can be used to specify a name for parameter argument
# to use to set a rename field for the file
unzip: <string>
# Parameter argument
<parameter_argument_name>:
argument_type: parameter.<type>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
value: <...>
source_argument_name: <parameter_argument_name>
## Output ##########################
# Output files for the workflow
####################################
output:
# File output
<file_output_name>:
file_type: <file_type>
# All the following fields are optional and provided as example,
# can be expanded to anything accepted by the schema
description: <string>
linkto_location:
- <location> # Sample, SampleProcessing
higlass_file: <boolean>
variant_type: <variant_type> # SNV, SV, CNV
vcf_to_ingest: <boolean>
s3_lifecycle_category: <string> # short_term_access_long_term_archive,
# short_term_access, short_term_archive,
# long_term_access_long_term_archive,
# long_term_access, long_term_archive,
# no_storage, ignore
## EC2 Configuration to use ########
####################################
config:
<config_parameter>: <...>
General Fields Definition
Required
All the following fields are required.
name
Name of the pipeline, MUST BE GLOBALLY UNIQUE (ACROSS THE PORTAL OBJECTS).
description
Description of the pipeline.
input
Description of general input files and parameters for the pipeline. See Input Definition.
workflows
Description of workflows that are steps of the pipeline. See Workflows Definition.
Optional
All the following fields are optional and provided as examples. They can be expanded to anything accepted by the schema, see schemas.
title
Title of the pipeline.
Workflows Definition
Each workflow is defined by its name and represents a step of the pipeline. Additional subfields need to be specified.
The workflow name must follow the format <workflow_name>[@<tag>].
<workflow_name> needs to match a workflow that has been previously defined, see Workflow.
If the same workflow is used for multiple steps in the pipeline, a tag can be added to the name of the workflow after '@' to make it unique (e.g., a QC step that runs twice at different points in the pipeline).
If a <tag> is used when defining a workflow, <workflow_name>@<tag> must be used to reference the correct step as a dependency.
dependencies
Workflows that must complete before the current step can start.
List of workflows in the format <workflow_name>[@<tag>].
version
Version to use for the corresponding workflow instead of the default specified for the repository. Allows locking the workflow to a specific version.
input
Description of general arguments that need to be referenced and specific arguments for the step. See Input Definition.
output
Description of expected output files for the workflow.
Each output is defined by its name. Additional subfields can be specified. See schemas.
Each output name needs to match an output name that has been previously defined in the corresponding workflow, see Workflow.
config
Description of configuration parameters to run the workflow. Any parameters can be defined here and will be used to configure the run in AWS (e.g., EC2 type, EBS size, …).
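For example, a tagged entry for a hypothetical qc workflow that runs after the foo step, locked to a specific version and with an illustrative EC2 configuration, could be sketched as below. The workflow name, version, and config parameter names (Tibanna-style settings) are placeholders, not an authoritative list.
# Example workflows entry (illustrative)
workflows:
  qc@after_foo:
    dependencies:
      - foo
    version: 1.1.0
    config:
      instance_type: t3.medium
      ebs_size: 20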
Input Definition
Each argument is defined by its name. Additional subfields need to be specified depending on the argument type. Each argument name needs to match an argument name that has been previously defined in the corresponding workflow, see Workflow.
argument_type
Definition of the type of the argument.
For a file argument, the argument type is defined as file.<format>, where <format> is the format used by the file. <format> needs to match a file format that has been previously defined, see File Format.
For a parameter argument, the argument type is defined as parameter.<type>, where <type> is the type of the value expected for the argument [string, integer, float, json, boolean].
files
This field can be used to assign specific files to a file argument. For example, specific reference files that are constant for the pipeline can be specified for the corresponding argument using this field.
Each file is specified using the name of the file and the version in the format <file_name>@<version>.
For reference files, each file needs to match a file reference that has been previously defined, see File Reference.
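For instance, a general file argument pinned to a specific reference file could be written as below, assuming a reference_genome File Reference at version 1.0 has been defined (names and version are illustrative):
# Example general input pinned to a reference file (illustrative)
input:
  reference_fasta:
    argument_type: file.fa
    files:
      - reference_genome@1.0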
value
This field can be used to assign a specific value to a parameter argument.
Note: as of now, the value always needs to be encoded as a <string>. We are working to improve this and enable the use of real types.
Example
a_float:
argument_type: parameter.float
value: "0.8"
an_integer:
argument_type: parameter.integer
value: "1"
a_string_array:
argument_type: parameter.json
value: "[\"DEL\", \"DUP\"]"
Linking Fields
These are optional fields that can be used when defining workflow specific arguments to describe dependencies and map to arguments with different names. See the magma documentation for more details.
source
This field can be used to assign a dependency for a file argument to a previous workflow.
It must follow the format <workflow_name>[@<tag>] to reference the correct step as source.
source_argument_name
This field can be used to source a specific argument by name. It can be used to:
Specify the name of an output of a source step to use.
Specify the name of a general argument defined in the input section to use when it differs from the argument name.
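A small sketch of these linking fields, where a downstream bar step consumes the output of the foo step under a different argument name (names reuse the foo_bar example and are illustrative):
# Example workflow input using the linking fields (illustrative)
input:
  input_file_bam:
    argument_type: file.bam
    source: foo
    source_argument_name: output_file_bam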
Input Dimension Fields
These are optional fields that can be used when defining workflow specific arguments to specify the input dimensions to use when creating the pipeline structure or step specific inputs. See magma documentation for more details.
scatter
Input dimension to use to scatter the workflow. This will create multiple shards in the pipeline for the step. The same dimension will be used to subset the input when creating the specific input for each shard.
gather
Increment for input dimension when gathering from previous shards. This will collate multiple shards into a single step. The same increment in dimension will be used when creating the specific input for the step.
input_dimension
Additional dimension used to subset the input when creating the specific input for the step.
This will be applied on top of scatter, if any, and will only affect the input.
This will not affect the scatter dimension used to create the shards for the step.
extra_dimension
Additional increment to dimension used when creating the specific input for the step.
This will be applied on top of gather, if any, and will only affect the input.
This will not affect the gather dimension in building the pipeline structure.
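As an illustrative sketch of these fields, the foo step below scatters over the first dimension of a general input_fastqs file argument (assumed to be defined in the pipeline input section), and the bar step gathers the resulting shards back into a single run. Argument names follow the foo_bar example; the exact dimensions to use depend on the input structure, see the magma documentation.
# Example scatter and gather across two steps (illustrative)
workflows:
  foo:
    input:
      input_file_fastq:
        argument_type: file.fastq
        source_argument_name: input_fastqs
        scatter: 1
  bar:
    dependencies:
      - foo
    input:
      input_file_bam:
        argument_type: file.bam
        source: foo
        source_argument_name: output_file_bam
        gather: 1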