HIGH PERFORMANCE PIPELINE ORCHESTRATION WITH WDL AND CROMWELL ON AWS

Published on 26/04/2022 - ISBN: 978-65-5941-645-5

Paper Title
HIGH PERFORMANCE PIPELINE ORCHESTRATION WITH WDL AND CROMWELL ON AWS
Authors
  • Welliton de Souza
  • Gabriel H. I. Moraes
  • Rodrigo S. Reis
  • Murilo C. Cervato
Modality
Xpress presentation
Subject area
Omics
Publishing Date
26/04/2022
Country of Publishing
Brazil
Language of Publishing
English
Paper Page
https://www.even3.com.br/anais/xmeetingxp2021/414208-high-performance-pipeline-orchestration-with-wdl-and-cromwell-on-aws
ISBN
978-65-5941-645-5
Keywords
next-generation sequencing, high-performance computing, cloud computing
Summary
High-throughput sequencing (HTS) technologies have become cheaper over the years, allowing research and medical institutions to adopt these methods in daily use. This has generated a data deluge and a high demand to process, analyze and interpret these data. With advances in cloud computing, workflow-specific and portable languages, and systems for orchestrating many processing tasks, on-demand services have become an appropriate and cost-effective option for genomics research. In this work we evaluated the performance of cloud-native solutions using a common scenario: DNA variant calling. We chose the Workflow Description Language (WDL) to define a pipeline composed of three tasks: 1) split genomic intervals into a given number of balanced parts, 2) call variants using GATK 4 HaplotypeCaller with emit reference confidence (GVCF) enabled, and 3) merge the GVCF files into a single file. WDL requires an orchestration system to parse and execute these processing tasks, taking advantage of the highly parallel nature of our pipeline. We opted for the Cromwell system, which supports job execution through the Amazon Web Services (AWS) Batch service. The AWS Batch compute environment was set up following the recommended configuration, using Elastic Compute Cloud (EC2) spot instances to reduce costs. Each task is executed inside a Docker container, and each EC2 instance runs multiple containers to use all the computing resources available in the virtual machine. To evaluate the performance of this use case, we selected a whole genome sequencing (WGS) sample previously aligned according to GATK Best Practices (NA12878, subsampled to about 17M reads). We executed our pipeline while varying the number of genomic parts: 10, 25, 50 and 100. We ran each setup three times and calculated the average overall elapsed time (from workflow submission to completion). Starting from a mean of 1.47 hours with 10 parts, we gradually reduced the execution time, reaching a 54.62% reduction with 100 parts.
We noted that increasing the number of processing parts also increases the time spent in the localization stage of the HaplotypeCaller task. File localization is the download of input files (sample BAM and genome reference FASTA) from cloud storage (Amazon Simple Storage Service, S3) to the container's working directory before command execution. There is also a delocalization stage that uploads output files to S3, but we did not observe a performance impact from it. To overcome the localization overhead, we implemented a caching system that downloads input files only once per EC2 instance, reducing the file localization stage from 180 seconds (median) to less than two seconds. It also reduced the costs of data traffic within the virtual private network and of disk space. We have shown that cloud-native services combined with an orchestration system and parallelization-enabled workflows provide a highly scalable computing environment for genomics research, reducing data analysis time and saving costs. Using scalable cloud computing resources will enable us to perform studies to better understand outcomes and treatments for our patients.
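The first task of the pipeline, splitting genomic intervals into a given number of balanced parts, can be illustrated with a minimal Python sketch. The abstract does not say which tool or algorithm the authors used, so this is only one plausible approach: a greedy heuristic that assigns the longest remaining interval to the part with the fewest bases so far. The function names and the interval representation are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple

# Hypothetical representation: (contig, start, end), 1-based inclusive.
Interval = Tuple[str, int, int]


def interval_length(iv: Interval) -> int:
    """Number of bases covered by an interval."""
    _, start, end = iv
    return end - start + 1


def split_balanced(intervals: List[Interval], n_parts: int) -> List[List[Interval]]:
    """Greedily distribute intervals over n_parts so that the total
    number of bases per part is roughly equal (longest interval first)."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[Interval]] = [[] for _ in range(n_parts)]
    # Min-heap of (bases assigned so far, part index).
    heap = [(0, i) for i in range(n_parts)]
    heapq.heapify(heap)
    for iv in sorted(intervals, key=interval_length, reverse=True):
        total, idx = heapq.heappop(heap)  # currently lightest part
        parts[idx].append(iv)
        heapq.heappush(heap, (total + interval_length(iv), idx))
    return parts


if __name__ == "__main__":
    toy = [("chr1", 1, 100), ("chr2", 1, 60), ("chr3", 1, 40)]
    for i, part in enumerate(split_balanced(toy, 2)):
        print(i, sum(interval_length(iv) for iv in part), part)
```

Each resulting part would then be passed to one parallel HaplotypeCaller job. This sketch never cuts an interval in two, so the balance it can reach is limited by the longest interval in the input.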
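The per-instance caching idea described above can be sketched as follows. The abstract gives no implementation details, so everything here is an assumption: the function name `localize_cached`, the shared cache directory, the injected `download` callable (standing in for an actual S3 download), and the use of a file lock plus symbolic links. The sketch assumes a Linux host, such as a typical EC2 instance, where the cache directory is visible to every container at the same path.

```python
import fcntl
import hashlib
import os
from pathlib import Path
from typing import Callable


def localize_cached(
    uri: str,
    cache_dir: Path,
    workdir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Make the file at `uri` available in `workdir`, downloading it at
    most once per cache directory (i.e. once per instance)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    workdir.mkdir(parents=True, exist_ok=True)

    key = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    cached = cache_dir / key
    lock_path = cache_dir / (key + ".lock")

    # An exclusive lock makes concurrent jobs on the same instance wait
    # for the first download instead of fetching the same file again.
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file closes
        if not cached.exists():
            tmp = cache_dir / (key + ".tmp")
            download(uri, tmp)        # hypothetical S3 fetch
            os.replace(tmp, cached)   # atomic publish into the cache

    # Link instead of copying, so the working directory uses no extra disk.
    target = workdir / os.path.basename(uri)
    if not os.path.lexists(target):
        os.symlink(cached, target)
    return target
```

With this layout, the first job on an instance pays the full download cost, while later jobs only create a link. That is consistent with the drop in localization time reported above, although the authors' actual mechanism may differ.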
Title of the Event
X-Meeting XPerience 2021
Title of the Proceedings of the event
X-Meeting presentations
Name of the Publisher
Even3
Means of Dissemination
Digital media

How to cite

SOUZA, Welliton de et al. HIGH PERFORMANCE PIPELINE ORCHESTRATION WITH WDL AND CROMWELL ON AWS. In: X-Meeting presentations. Anais... São Paulo (SP): AB3C, 2021. Available at: https://www.even3.com.br/anais/xmeetingxp2021/414208-HIGH-PERFORMANCE-PIPELINE-ORCHESTRATION-WITH-WDL-AND-CROMWELL-ON-AWS. Accessed on: 06/05/2025
