DATABASE SOLUTIONS FOR EFFICIENT COHORT CHARACTERIZATION IN ONCOLOGICAL RESEARCH

Published on 21/11/2024 - ISBN: 978-65-272-0843-3

Paper Title
DATABASE SOLUTIONS FOR EFFICIENT COHORT CHARACTERIZATION IN ONCOLOGICAL RESEARCH
Authors
  • Alessandra Serain
  • Luis Henrique Muniz de Carvalho
  • Marcelo Santos Leite
  • Nicole Scherer
  • Carlos Henrique Fernandes Martins
  • Antônio Augusto Gonçalves
  • Mariana Boroni
Modality
Poster
Subject area
Database and Software Development
Publishing Date
21/11/2024
Country of Publishing
Brazil
Language of Publishing
English
Paper Page
https://www.even3.com.br/anais/xmeeting-2024/837313-database-solutions-for-efficient-cohort-characterization-in-oncological-research
ISBN
978-65-272-0843-3
Keywords
Database solutions, Cohort characterization, ETL process, Pentaho Data Integration, Metabase
Summary
The incidence of several cancer types has risen progressively and disproportionately over the years, both in Brazil and worldwide. Molecular studies therefore remain necessary, for example to analyze the expression of different genes and mutations, identify more precise biomarkers, and personalize treatments using transcriptomics, genomics, and single-cell analysis, all of which require well-characterized samples. Clinical information from patients registered at the Brazilian National Cancer Institute (INCA) and the tumor samples stored at the National Tumor Bank (BNT) constitute a valuable resource for discoveries that improve public health. To characterize the INCA cohorts and to support the selection of samples for research, we need a comprehensive repository of data and metadata. We therefore listed the information most frequently required to select and identify patients when forming specific cohorts, as well as the variables collected in standardized databases, such as details about treatment, recurrence, and general patient characteristics. Integrating databases from public health systems to support effective searches poses significant challenges. The first is the number of variables, ranging from 30 to 40 fields, which complicates data management and querying. In addition, the lack of standardized patterns for data entry, or in many cases the absence of electronic records, hampers the efficiency and accuracy of data retrieval. Finally, several databases need to be interlinked but currently are not, which adds another layer of complexity to the integration process.
For example, internal databases developed within INCA, such as Absolute (hospital administration), Pathological Anatomy, Chemotherapy, RHC (Hospital Cancer Registry), and BNT (National Tumor Bank), are essential resources that help identify suitable candidates and, consequently, patient samples for the studies. By leveraging multidimensional data from these sources, we seek a better understanding of cancer biology and an easier way to identify patients for study cohorts. This work also employs a suite of robust tools for data management and analysis. INCA's systems use relational databases, such as Oracle and PostgreSQL, for data storage. The ETL (extract, transform, and load) process is carried out with Pentaho Data Integration, which extracts, filters, transforms, sorts, and merges data through SQL (Structured Query Language). Processed data are stored in MongoDB, a document-oriented NoSQL database that serves as the foundation of our data repository. For data querying and analysis, we employ Metabase, an open-source application that simplifies data exploration through intuitive dashboards and visualizations. The application is still under development and will initially be tested by researchers with the assistance of the developers. Our next goal is to prepare anonymized datasets with relevant information to be published as open-access data on the institutional platform, in compliance with the General Data Protection Law (LGPD).
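The transform step described above can be illustrated with a minimal Python sketch. It is not the authors' Pentaho pipeline: the source rows, field names (`patient_id`, `tumor_site`, `sample_id`), the salt, and the helper functions are all hypothetical, chosen only to show how rows from separate relational sources might be merged into one nested, pseudonymized document per patient, the shape a document store such as MongoDB would hold, and then filtered into a cohort.

```python
import hashlib
from collections import defaultdict

# Hypothetical rows standing in for extracts from two separate relational sources.
REGISTRY_ROWS = [
    {"patient_id": "001", "tumor_site": " Breast ", "diagnosis_year": 2019},
    {"patient_id": "002", "tumor_site": "lung", "diagnosis_year": 2021},
]
BIOBANK_ROWS = [
    {"patient_id": "001", "sample_id": "S-10", "sample_type": "frozen tissue"},
    {"patient_id": "001", "sample_id": "S-11", "sample_type": "blood"},
]


def pseudonymize(patient_id: str, salt: str) -> str:
    """Replace an identifier with a truncated salted SHA-256 digest (illustrative only)."""
    return hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()[:16]


def build_documents(registry_rows, biobank_rows, salt):
    """Merge per-source rows into one nested document per registry patient."""
    samples_by_patient = defaultdict(list)
    for row in biobank_rows:
        samples_by_patient[row["patient_id"]].append(
            {"sample_id": row["sample_id"], "sample_type": row["sample_type"]}
        )
    documents = []
    for row in registry_rows:
        documents.append(
            {
                "pseudo_id": pseudonymize(row["patient_id"], salt),
                # Normalize free-text entry so that queries match consistently.
                "tumor_site": row["tumor_site"].strip().lower(),
                "diagnosis_year": row["diagnosis_year"],
                # Patients without stored samples get an empty list.
                "samples": samples_by_patient.get(row["patient_id"], []),
            }
        )
    return documents


def select_cohort(documents, tumor_site):
    """Keep patients with the given tumor site and at least one stored sample."""
    return [d for d in documents if d["tumor_site"] == tumor_site and d["samples"]]


documents = build_documents(REGISTRY_ROWS, BIOBANK_ROWS, salt="example-salt")
cohort = select_cohort(documents, "breast")
print(len(documents), len(cohort), len(cohort[0]["samples"]))  # prints: 2 1 2
```

Nesting the samples inside each patient document lets a single query answer "which patients with tumor site X have stored material", which is the kind of cohort search the abstract describes. The salted hash is only a placeholder for pseudonymization and should not be read as sufficient, on its own, for LGPD-compliant anonymization.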
Title of the Event
20º Congresso Brasileiro de Bioinformática: X-Meeting 2024
City of the Event
Salvador
Title of the Proceedings of the event
X-Meeting presentations
Name of the Publisher
Even3
Means of Dissemination
Digital media

How to cite

SERAIN, Alessandra et al. DATABASE SOLUTIONS FOR EFFICIENT COHORT CHARACTERIZATION IN ONCOLOGICAL RESEARCH. In: X-Meeting presentations. Anais... Salvador (BA): Hotel Deville Prime, 2024. Available at: https://www.even3.com.br/anais/xmeeting-2024/837313-DATABASE-SOLUTIONS-FOR-EFFICIENT-COHORT-CHARACTERIZATION-IN-ONCOLOGICAL-RESEARCH. Accessed on: 25/05/2025.
