PIPEMB-WDL: A WDL-BASED INSTANTIATION OF A VARIANT CALLING SCIENTIFIC WORKFLOW MANAGEMENT SYSTEM

Published on 08/11/2023 - ISBN: 978-65-272-0061-1

Paper Title
PIPEMB-WDL: A WDL-BASED INSTANTIATION OF A VARIANT CALLING SCIENTIFIC WORKFLOW MANAGEMENT SYSTEM
Authors
  • Elvismary Molina de Armas
  • Nicole de Miranda Scherer
  • Sergio Lifschitz
  • Mariana Boroni
Modality
Late poster
Subject area
Database and Software Development
Publishing Date
08/11/2023
Country of Publishing
Brazil
Language of Publishing
English
Paper Page
https://www.even3.com.br/anais/xmeeting2023/648688-pipemb-wdl--a-wdl-based-instantiation-of-a-variant-calling-scientific-workflow-management-system
ISBN
978-65-272-0061-1
Keywords
Genomic variant, Germline short variant discovery, Somatic short variant discovery, SNP, Indels, model implementation, DNA, RNA
Summary
An effective workflow implementation for discovering short variants is essential for associating gene mutations with specific illnesses. In oncology especially, identifying acquired somatic variants can have important diagnostic and prognostic implications and can provide crucial information for selecting therapeutic agents for cancer patients. In addition, identifying germline mutations can provide prognostic information about the potential development of future cancers. This work presents PIPEMB-WDL, an implementation of the PIPEMB abstract model. The model allows the workflow to be implemented with different scientific workflow management systems and execution infrastructures, independently of its definition. The PIPEMB workflow integrates short variant discovery for germline and somatic calling, following the GATK Best Practices. It includes three main steps: data pre-processing, which produces sequences mapped against a reference; short variant calling for germline and/or somatic variants; and refinement and evaluation, where specific filters are applied and the resulting variants can be annotated. We have implemented the proposed model using software from GATK, the Picard toolkit, samtools, vcftools, Linux grep utilities, and Python scripts. We also use a workflow execution engine (Cromwell) in conjunction with a workflow specification language (WDL). The complete list of tools used includes BWA, MergeBamAlignment, MarkDuplicates, SortSam, BaseRecalibrator/BaseRecalibratorSpark, ApplyBaseRecalibrator, HaplotypeCaller, GenomicsDBImport, GenotypeGVCFs, Mutect2, LearnReadOrientationModel, GetPileupSummaries, CalculateContamination, CreateSomaticPanelOfNormals, MergeVCFs, VariantRecalibrator, ApplyVQSR, FilterMutectCalls, CNNScoreVariants, FilterVariantTranches, SelectVariants, VariantFiltration, bcftools, Funcotator and VEP.
Compared with other variant calling pipelines, our implementation includes some additional steps: the use of additional software such as VEP; the selection of some tools at run time, such as BaseRecalibrator or its optimized version BaseRecalibratorSpark; and parallelization by sample and by read-group file, which reduces the execution time of many tasks and makes better use of cluster infrastructure resources. This first implementation demonstrates the model's viability and tests the flows designed on it. It is currently installed in the INCA infrastructure and has been used in several of the institute's research projects. Essential characteristics of an abstract workflow model are ease of understanding and a level of modularization that allows parts to be coupled and decoupled, and that allows the model to be extended by reusing existing flows without becoming inconsistent. Moreover, this work presents an extension of the workflow model that adds functionalities such as quality analysis and manipulation, identification of short variants in RNA data, and calculation of variant evaluation metrics. The model allows the new functionalities to be represented as blocks that are easily incorporated and connected. For RNA variant calling in particular, it was possible to reuse all the variant calling, filtration, and annotation parts already implemented. To that end, the workflow input was forked into DNA pre-processing and RNA pre-processing, and the new RNA pre-processing module was easily integrated with the germline variant calling functions. We have extended the model, and this new implementation is in the testing phase. This case of model extension and implementation validates the adequacy of the model with respect to the characteristics mentioned above.
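The run-time tool selection and the parallelization by sample and by read-group file described above can be illustrated with a minimal Python sketch. It is not the PIPEMB-WDL code, which is written in WDL and executed by Cromwell. The function names, the sample dictionary layout, and the file names are hypothetical, and the task body is a placeholder that does not invoke GATK.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Task = Tuple[str, str, str]  # (sample, read-group file, tool name)


def pick_recalibrator(use_spark: bool) -> str:
    """Choose the base recalibration tool at run time (hypothetical switch)."""
    return "BaseRecalibratorSpark" if use_spark else "BaseRecalibrator"


def plan_tasks(samples: Dict[str, List[str]], use_spark: bool) -> List[Task]:
    """Create one task per read-group file of every sample."""
    tool = pick_recalibrator(use_spark)
    return [
        (sample, read_group, tool)
        for sample, read_groups in sorted(samples.items())
        for read_group in read_groups
    ]


def run_task(task: Task) -> str:
    """Placeholder task body; a real workflow would run the tool here."""
    sample, read_group, tool = task
    return f"{sample}:{read_group}:{tool}"


def run_all(samples: Dict[str, List[str]], use_spark: bool = False,
            workers: int = 4) -> List[str]:
    """Run all tasks concurrently; results keep the order of plan_tasks."""
    tasks = plan_tasks(samples, use_spark)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, tasks))


if __name__ == "__main__":
    # Hypothetical input: two samples with their read-group files.
    example = {"S1": ["S1_rg1.bam", "S1_rg2.bam"], "S2": ["S2_rg1.bam"]}
    for line in run_all(example, use_spark=True):
        print(line)
```

In a workflow language such as WDL, the same idea would typically be expressed as nested scatter blocks over samples and read groups, with a conditional choosing between the two recalibration tasks.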
Title of the Event
X-Meeting / BSB 2023
City of the Event
Curitiba
Title of the Proceedings of the event
X-Meeting presentations
Name of the Publisher
Even3
Means of Dissemination
Digital media

How to cite

ARMAS, Elvismary Molina de et al. PIPEMB-WDL: A WDL-BASED INSTANTIATION OF A VARIANT CALLING SCIENTIFIC WORKFLOW MANAGEMENT SYSTEM. In: X-Meeting presentations. Proceedings... Curitiba (PR): Campus da Indústria, 2023. Available at: https://www.even3.com.br/anais/xmeeting2023/648688-PIPEMB-WDL--A-WDL-BASED-INSTANTIATION-OF-A-VARIANT-CALLING-SCIENTIFIC-WORKFLOW-MANAGEMENT-SYSTEM. Accessed on: 16/07/2025.
