AN AUTOMATED AND INTEGRATIVE PIPELINE FOR SURVEILLANCE OF VIRUSES IN POLLINATORS USING PUBLIC DATA

Vinícius Castro Santos; Aristóteles Góes Neto; Eric Roberto Guimarães Rocha Aguiar

Submission

DOI

10.29327/1331270.2-4

Paper Title

AN AUTOMATED AND INTEGRATIVE PIPELINE FOR SURVEILLANCE OF VIRUSES IN POLLINATORS USING PUBLIC DATA

Authors

Vinícius Castro Santos
Aristóteles Góes Neto
Eric Roberto Guimarães Rocha Aguiar

Modality

Poster

Subject area

RNA and transcriptomics

Publishing Date

08/11/2023

Country of Publishing

Brazil

Language of Publishing

English

Paper Page

https://www.even3.com.br/anais/xmeeting2023/635111-an-automated-and-integrative-pipeline-for-surveillance-of-viruses-in-pollinators-using-public-data

ISBN

978-65-272-0061-1

Keywords

assembly, bees, mapping, metatranscriptomics, phylogenetics, RNA-seq

Summary

Insects, particularly bees, play a critical role in pollination: they pollinate 90% of angiosperms and support around 9.5% of global agricultural production. To better understand the interactions between pollinators and viruses or other microbiome organisms, several research projects have set out to identify the microbiome and RNA viruses of pollinators through bioinformatics analysis. However, although numerous pipelines exist for analyzing metatranscriptomics-derived data, none of them provides tools to characterize novel viruses or to perform automated phylogenetic analyses in organisms other than humans. This creates difficulties for researchers who lack bioinformatics expertise. To address this issue, our project proposes the development of integrated, modular, and automated pipelines that can analyze large Illumina RNA-seq datasets from bees and other pollinators. Our strategy supports multiple software tools and parameter sets, which allows customized execution of the steps to which virus identification is most sensitive, such as mapping and assembly. The pipeline is currently being developed in R, Python, and Bash and runs on GNU/Linux. The workflow consists of seven phases: preprocessing, mapping, de novo assembly, taxonomic annotation, functional annotation, quantification, and data visualization. The pipeline can automatically download public datasets from the NCBI Sequence Read Archive (SRA) or take the user's own sequencing data as input. The quality of the FASTQ files is checked using FastQC, and the reads are filtered using fastp, Cutadapt, or Trimmomatic to remove adapters and low-quality reads. Mapping and removal of host transcripts can be performed using Bowtie2, STAR, or HISAT2, and the unaligned reads are then assembled using a single assembler (e.g. Trinity, SPAdes, Oases) or an integrative assembly strategy, which combines different tools to improve the quality and completeness of the assembled transcripts. Users who are interested solely in viral sequences can map the reads against the HoloBee-Mop database of non-viral honeybee sequences and discard the matches, which enhances the assembly of viral sequences from bee RNA-seq data. Alternatively, the complete microbiome can be identified. The assembled transcripts are then clustered using CD-HIT-EST, which removes chimeric and redundant sequences. Taxonomic annotation is performed using DIAMOND blastx against the NCBI non-redundant protein database (NR), and the full taxonomy of each species is retrieved using TaxonKit. Candidate viral transcripts are also annotated using BLASTn against the NCBI non-redundant nucleotide database (NT), ORFs are predicted using orfipy, and conserved domains are identified using HMMER with the Pfam database. The abundance of viral sequences is quantified using Salmon. The output includes TSV files with metagenomics and viral data, a FASTA file with viral sequences, and donut, bar, radar, and heatmap graphs. The next steps of the project are to add phylogenetic analyses for RNA viruses and functional annotation, and to implement a scalable and programmable active search module that scans public NCBI next-generation sequencing data (SRA) from bee species for surveillance purposes. The pipeline will also be tested on public data, and different software and strategies will be compared for the mapping and assembly steps.
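
The modular design described in the summary, in which each stage can be served by one of several interchangeable tools, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `TEMPLATES`, `STAGE_ORDER`, and `build_plan`, all file names, the restriction to single-end reads, and the choice of one tool per stage are assumptions made for illustration. Only a small subset of each tool's command-line options is shown, and these should be checked against the documentation of the installed versions. The sketch only builds the command lines and does not execute any tool.

```python
from typing import Dict, List

# Hypothetical registry: stage -> tool -> argument template.
# Tool names come from the summary; everything else is illustrative.
TEMPLATES: Dict[str, Dict[str, List[str]]] = {
    "preprocess": {
        "fastp": ["fastp", "-i", "{reads}", "-o", "{clean}"],
    },
    "host_removal": {
        # --un writes the reads that did not align to the host index
        "bowtie2": ["bowtie2", "-x", "{host_index}", "-U", "{clean}",
                    "--un", "{unaligned}", "-S", "{sam}"],
    },
    "assembly": {
        "trinity": ["Trinity", "--seqType", "fq", "--single", "{unaligned}",
                    "--max_memory", "{memory}", "--output", "{assembly_dir}"],
    },
    "clustering": {
        "cd-hit-est": ["cd-hit-est", "-i", "{contigs}", "-o", "{clustered}",
                       "-c", "{identity}"],
    },
    "taxonomy": {
        "diamond": ["diamond", "blastx", "-d", "{protein_db}",
                    "-q", "{clustered}", "-o", "{hits}", "--outfmt", "6"],
    },
}

# Order in which the stages run; each stage consumes the previous output.
STAGE_ORDER = ["preprocess", "host_removal", "assembly", "clustering", "taxonomy"]


def build_plan(tools: Dict[str, str], values: Dict[str, str]) -> List[List[str]]:
    """Return one argument list per stage for the tools chosen by the user."""
    plan = []
    for stage in STAGE_ORDER:
        tool = tools.get(stage)
        template = TEMPLATES[stage].get(tool)
        if template is None:
            raise ValueError(f"no template for tool {tool!r} in stage {stage!r}")
        # Fill the placeholders; arguments without braces pass through unchanged.
        plan.append([part.format(**values) for part in template])
    return plan


# Example configuration (all file names are placeholders).
example_tools = {
    "preprocess": "fastp",
    "host_removal": "bowtie2",
    "assembly": "trinity",
    "clustering": "cd-hit-est",
    "taxonomy": "diamond",
}
example_values = {
    "reads": "sample.fastq",
    "clean": "sample.clean.fastq",
    "host_index": "host_index",
    "unaligned": "sample.unaligned.fastq",
    "sam": "sample.host.sam",
    "memory": "8G",
    "assembly_dir": "trinity_out",
    # The location of the Trinity contigs depends on the Trinity version.
    "contigs": "trinity_out.Trinity.fasta",
    "clustered": "contigs.clustered.fasta",
    "identity": "0.95",
    "protein_db": "nr",
    "hits": "diamond_hits.tsv",
}

example_plan = build_plan(example_tools, example_values)
for argv in example_plan:
    print(" ".join(argv))
```

Keeping the plan as plain argument lists means it can be inspected or logged before anything runs, and an alternative tool for a stage (for instance STAR or HISAT2 for host removal) would be supported by adding one more template entry rather than changing the control flow.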

Title of the Event: X-Meeting / BSB 2023
City of the Event: Curitiba
Title of the Proceedings of the event: X-Meeting presentations
Name of the Publisher: Even3
Means of Dissemination: Digital media
How to cite

SANTOS, Vinícius Castro; NETO, Aristóteles Góes; AGUIAR, Eric Roberto Guimarães Rocha. AN AUTOMATED AND INTEGRATIVE PIPELINE FOR SURVEILLANCE OF VIRUSES IN POLLINATORS USING PUBLIC DATA. In: X-Meeting presentations. Proceedings... Curitiba (PR): Campus da Indústria, 2023. Available at: https://www.even3.com.br/anais/xmeeting2023/635111-AN-AUTOMATED-AND-INTEGRATIVE-PIPELINE-FOR-SURVEILLANCE-OF-VIRUSES-IN-POLLINATORS-USING-PUBLIC-DATA. Accessed on: 12/09/2025