Work Package 4
Lead Beneficiary CING
​
Data integration and Trustworthy AI model for MM
Objectives
- To organize the data information flow
- To integrate omics and other available data, generating a comprehensive holistic profile of each biological state
- To computationally highlight the underlying molecular mechanisms by using the generated holistic profiles
- To propose candidate repurposed drugs
- To discover complex patterns of biomarkers using the omics profiles generated in WP3
- To develop predictive AI models for MM onset
Description
Description of work:
We intend to use modern bioinformatics and systems bioinformatics techniques as described in the following tasks:
Task 4.1: Implementation of a federated database system to support the integration of all available data and to feed an AI system for risk prediction (CING, ATOS, NKUA, BRFAA, UNITO, MOS, LUH, M1-18)
We will develop a data management plan in line with the FAIR principles. All available data (demographic, clinical, omics, etc.) obtained from samples representing the stages along the axis of MM onset and progression will be compiled, entered and linked in a federated database system (FDBS), which will transparently map the existing autonomous database/repository systems into a single federated database. Partners NKUA, BRFAA, UNITO and MOS will provide clinical and omics data. LUH will oversee the process and ensure that all ethics issues are addressed. The constituent databases will be interconnected via a computer network and may be geographically decentralized.
Calculations on the unprocessed data, further analyses and correlations will populate an adjacent Data Warehouse, which will provide users with pre-compiled data, calculations, partial data integrations on demand, risk prediction models and other information related to the investigated transition from pre-disease to disease. The FDBS and the Data Warehouse will support the AI system developed for transition risk prediction. A web interface to the FDBS and the Data Warehouse will be developed with role-based access, offering different levels of access to data and tools.
A report describing the updated Data Management Plan (DMP) will be provided, including all amendments and optimizations to the initial DMP. The preparation of the DMP will include a common chapter on the ‘Understanding’ cluster. The aim of this chapter is to address commonalities in data standards, data validation and data protection, and to foster data exchange, in particular when it comes to sharing project data with a future federated UNCAN.eu data platform. Interactions are therefore envisaged with the 4.UNCAN.eu (http://uncaninitiative.eu/) coordination and support action, which is preparing a blueprint for UNCAN.eu.
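The federated arrangement described in this task can be illustrated with a minimal sketch. It assumes two hypothetical partner sites, each holding its own autonomous database (here in-memory SQLite), and a mediator that runs the same query at every site and merges only the aggregated results. The table layout, site names and function names are illustrative assumptions, not the project's actual schema or FDBS design.

```python
import sqlite3


def make_site(rows):
    """Create one autonomous in-memory database holding a site's samples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (sample_id TEXT, stage TEXT)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    return conn


def federated_count_by_stage(sites):
    """Run the same query at every site and merge the per-stage counts."""
    query = "SELECT stage, COUNT(*) FROM samples GROUP BY stage"
    totals = {}
    for conn in sites.values():
        # Only aggregated counts leave a site; the raw rows stay where they are.
        for stage, n in conn.execute(query):
            totals[stage] = totals.get(stage, 0) + n
    return totals


# Hypothetical sites and samples, for illustration only.
sites = {
    "site_a": make_site([("a1", "MGUS"), ("a2", "MM")]),
    "site_b": make_site([("b1", "MGUS"), ("b2", "sMM"), ("b3", "MM")]),
}
counts = federated_count_by_stage(sites)
print(counts)
```

In this sketch the mediator never copies sample-level records, which is one possible way to keep the constituent databases autonomous while still presenting a single view to users.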
Associated deliverables: D4.1 Data Management Plan. M3, D4.2 Federated Database System and Data Warehouse. M18, D4.7 Updated Data Management Plan. M48
​
Task 4.2: Omics data integration for creating a molecular model for MM onset and progression (CING, ATOS, M19-48)
We will use several well-established bioinformatics methods/tools such as Multi-Omics Factor Analysis (MOFA) and mixOmics as well as our in-house developed methodology for multi-source data integration. Initially, we will integrate transcriptomics and proteomics data and then we will enrich the molecular profiling with data and information from the peptidomics and epitranscriptomics experiments.
The findings will be further used for the discovery of the underlying molecular mechanisms at each stage, for drug repurposing and for feeding the Trustworthy AI models for early diagnosis and monitoring. To address the objective on molecular mechanisms, we will use several classical pathway analysis tools such as Enrichr and GeneTrail3, as well as our in-house tools for network analysis in the pathway space, namely PathwayConnector, PathExNet and PathWalks.
Special computational approaches will be applied to the analysis of the progression from MGUS/sMM to symptomatic MM. We intend to apply two different approaches to capture the biomarkers of this progression: (a) we will check for monotonicity in gene and protein expression across the three stages in the transcriptomics and proteomics data, as shown in our recent work; and (b) we will analyze network rewiring in gene and protein co-expression networks across the three stages, using tools such as DyNet and in-house developed scripts.
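The two approaches above can be sketched as follows. This is a simplified illustration under stated assumptions: monotonicity is judged on per-stage mean expression, and rewiring is scored as one minus the Jaccard similarity of a node's neighbours in two networks. The latter is a stand-in chosen for brevity and is not DyNet's own rewiring metric; the function names, gene values and edges are hypothetical.

```python
from statistics import mean

STAGES = ("MGUS", "sMM", "MM")  # order of progression assumed for this sketch


def monotonic_trend(expr_by_stage, min_step=0.0):
    """Return 'up', 'down' or None from one gene's per-stage sample values."""
    means = [mean(expr_by_stage[s]) for s in STAGES]
    steps = [b - a for a, b in zip(means, means[1:])]
    if all(d > min_step for d in steps):
        return "up"
    if all(d < -min_step for d in steps):
        return "down"
    return None


def rewiring_score(edges_a, edges_b, node):
    """1 - Jaccard similarity of a node's neighbours in two networks."""
    def neighbours(edges):
        return {v if u == node else u for u, v in edges if node in (u, v)}

    na, nb = neighbours(edges_a), neighbours(edges_b)
    union = na | nb
    return 0.0 if not union else 1.0 - len(na & nb) / len(union)


# Hypothetical expression values and co-expression edges.
gene_x = {"MGUS": [1.0, 1.2], "sMM": [2.0, 2.2], "MM": [3.1, 2.9]}
gene_y = {"MGUS": [2.0, 2.0], "sMM": [3.0, 3.0], "MM": [1.0, 1.0]}
trend_x = monotonic_trend(gene_x)
trend_y = monotonic_trend(gene_y)

mgus_edges = [("A", "B"), ("A", "C")]
mm_edges = [("A", "C"), ("A", "D")]
score_a = rewiring_score(mgus_edges, mm_edges, "A")
print(trend_x, trend_y, round(score_a, 3))
```

A real analysis would also need a statistical test for each step between stages and a principled threshold for calling a co-expression edge, both of which are left out here.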
Associated deliverables: D4.3 Integrated molecular profiles of each biological state linked to the health-to-disease transition in MM onset and progression. M40. D4.4 Computationally highlighted underlying molecular mechanisms related to the health-to-disease transition in MM onset and progression. M48
​
Task 4.3: Computational Drug Repurposing (CING, M19-48)
For the drug repurposing objective, we will follow a transcriptomics-based strategy as well as multiplex strategies to exploit the generated integrated molecular profiles, using publicly available platforms such as the Connectivity Map from the CLUE cloud-based software platform, for the analysis of perturbational datasets generated using gene expression (L1000), and the L1000CDS2 engine for ultra-fast LINCS L1000 Characteristic Direction signature search. The results of the initial drug repurposing will be re-ranked and shortlisted taking into account the drug targets’ functional proximity to the disease-related mechanisms and the chemical structure diversity of the highlighted compounds, as implemented in our tool CODRES.
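The re-ranking idea can be illustrated with a small sketch. It is not the CODRES algorithm, whose internals are not described here. It assumes each candidate carries a hypothetical signature-reversal score and a functional-proximity score, both scaled to [0, 1], plus a structural fingerprint given as a set of 'on' bits; candidates are ranked by a weighted sum and near-duplicate structures are skipped using Tanimoto similarity. All names, weights and thresholds are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(candidates, weight=0.5, max_similarity=0.7, top_n=3):
    """Rank by a weighted score, then greedily skip near-duplicate structures."""
    def score(c):
        # Higher 'reversal' means stronger reversal of the disease signature.
        return weight * c["reversal"] + (1 - weight) * c["proximity"]

    shortlist = []
    for cand in sorted(candidates, key=score, reverse=True):
        if all(tanimoto(cand["fp"], kept["fp"]) <= max_similarity
               for kept in shortlist):
            shortlist.append(cand)
        if len(shortlist) == top_n:
            break
    return [c["name"] for c in shortlist]


# Hypothetical candidates; scores and fingerprints are made up.
candidates = [
    {"name": "drug_a", "reversal": 0.9, "proximity": 0.8, "fp": {1, 2, 3, 4}},
    {"name": "drug_b", "reversal": 0.85, "proximity": 0.8, "fp": {1, 2, 3, 4, 5}},
    {"name": "drug_c", "reversal": 0.6, "proximity": 0.9, "fp": {7, 8, 9}},
    {"name": "drug_d", "reversal": 0.4, "proximity": 0.3, "fp": {10, 11}},
]
shortlist = rerank(candidates)
print(shortlist)
```

Here drug_b is dropped despite its high score because its fingerprint is too similar to drug_a, which is the kind of trade-off between score and chemical diversity that the text refers to.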
Associated deliverables: D4.5 List of candidate drugs for treating MM based on omics profiles (M48)
​
Task 4.4: Trustworthy AI model for MM prediction (ATOS, CING, M31-48)
Trustworthy AI models will be generated with the aim of extracting reliable and reproducible knowledge from clinical data enriched with multi-omics-based molecular profiles and environmental data (WP2), in order to evaluate empirical evidence in the context of clinical decisions on MGUS/sMM progression to MM. Within the project, multiple “interpretable” AI-based models (decision trees, logistic regression, NNs, random forest, gradient boosting) for prediction of MM progression will be generated. The best predictive model will be sought by comparing different performance metrics across cohorts, reconciling potentially different conclusions and assessing the stability of model predictions under data perturbation, so that the research outcomes are trustworthy and interpretable.
In ELMUMY, the AI models will therefore be trained using the retrospective and prospective clinical data from the cohorts as well as the results of the experimental and bioinformatic analyses performed in Task 4.2 of this WP and in WP1, WP2 and WP3. Prediction calibration methods will be used to adjust predictions and improve the error distribution of the predictive models. Once the AI models are generated, they will be evaluated prospectively, and those that provide the best predictions will be selected to be distilled into simpler ones. Both the original complex models and the distilled ones will be assessed for accuracy, precision, sensitivity and specificity. ROC and PR curves will be presented graphically to clinicians.
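The evaluation metrics named above can be computed from a confusion matrix, as in the following minimal sketch. The Brier score is added here as an assumption, as one common way to check how well predicted probabilities are calibrated; the labels, probabilities and the 0.5 decision threshold are hypothetical.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a decision threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = prob >= threshold
        if pred and truth:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_prob, threshold=0.5):
    """Return accuracy, precision, sensitivity, specificity and Brier score."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)

    def ratio(num, den):
        return num / den if den else float("nan")

    squared_error = sum((p - t) ** 2 for t, p in zip(y_true, y_prob))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # recall on progressors
        "specificity": ratio(tn, tn + fp),
        "brier": ratio(squared_error, len(y_true)),
    }


# Hypothetical labels (1 = progressed to MM) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = evaluate(y_true, y_prob)
print(metrics)
```

Running the same function on an original model and on its distilled counterpart, over the same held-out cohort, would give a like-for-like comparison of the two.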
Associated deliverables: D4.6 Trustworthy AI model for MM prediction. M48