Skin Cancer Unit, German Cancer Research Center (DKFZ), Heidelberg, Germany; Department of Dermatology, Venereology and Allergology, University Medical Center Mannheim, Ruprecht-Karl University of Heidelberg, Mannheim, Germany; DKFZ Hector Cancer Institute at the University Medical Center Mannheim, Mannheim, Germany
Corresponding author: Digital Biomarkers for Oncology Group, National Centre for Tumour Diseases, German Cancer Research Centre (DKFZ), Im Neuenheimer Feld 280, Heidelberg, 69120, Germany.
Highlights
• A combined DL classifier predicts BRAF-V600 mutations in malignant melanoma.
• Combining data modalities improves performance and generalisation capability.
• Multimodal classifiers have the potential to be used as predictive biomarkers.
Abstract
Background
In machine learning, multimodal classifiers can provide more generalised performance than unimodal classifiers. In clinical practice, physicians usually also rely on a range of information from different examinations for diagnosis. In this study, we used BRAF mutation status prediction in melanoma as a model system to analyse the contribution of different data types in a combined classifier because BRAF status can be determined accurately by sequencing as the current gold standard, thus nearly eliminating label noise.
Methods
We trained a deep learning-based classifier by combining individually trained random forests of image, clinical and methylation data to predict BRAF-V600 mutation status in primary and metastatic melanomas of The Cancer Genome Atlas cohort.
Results
With our multimodal approach, we achieved an area under the receiver operating characteristic curve of 0.80, whereas the individual classifiers yielded areas under the receiver operating characteristic curve of 0.63 (histopathologic image data), 0.66 (clinical data) and 0.66 (methylation data) on an independent data set.
Conclusions
Our combined approach can predict BRAF status to some extent by identifying BRAF-V600 specific patterns at the histologic, clinical and epigenetic levels. The multimodal classifiers have improved generalisability in predicting BRAF mutation status.
]. Pilot studies have shown that the addition of other data modalities may improve performance, but these very early studies do not contain external validation [
]. In our study, different data modalities were fused to predict BRAF-V600 mutations in malignant melanoma using deep learning (DL). Malignant melanoma causes over 90% of skin cancer deaths. Therefore, early detection and optimal therapy selection are of crucial importance [
]. While localised melanomas can often be cured by surgical excision alone, local therapy is insufficient if the melanoma has already spread to the local lymph nodes or even to distant organs [
]. At these advanced tumour stages, new systemic therapies are available. Approximately 40–60% of cutaneous melanomas harbour BRAF-V600 mutations that can be targeted with proto-oncogene B-Raf (BRAF) and mitogen-activated protein kinase kinase (MEK) inhibitors [
Population-based analysis of the prevalence of BRAF mutation in patients diagnosed with cutaneous melanoma and its significance as a prognostic factor.
]. The use of the antibody VE1, which is specific for the most common V600E mutation, leads to high accuracy, with advantages in samples with small and scattered tumour cells, low cost and short turnaround times [
]. However, other BRAF protein and pathway alterations are not detected with this antibody. A promising new possibility is the determination of BRAF mutations by liquid biopsies and circulating tumour DNA analysis. This approach is not yet used in the clinic due to its still limited sensitivity, especially in patients with low tumour burden and/or brain metastases [
Due to the low label noise, we chose BRAF status as a very clean use case to evaluate the potential benefits of a DL-based multimodal classifier. Because of their high clinical significance, attempts are already being made to predict BRAF mutations on histopathological slides. Features that have been described to correlate with BRAF status include cell scatter, nesting, pigmentation, size and shape of cells and of nuclei [
]. These factors are assessed by pathologists in routine clinical practice as a standard procedure using histologic specimens stained with haematoxylin and eosin (H&E). However, pathologists are not able to predict BRAF status based on this routine analysis. Moreover, the correlation of these morphologic factors with BRAF status varies widely across studies. A first study has also demonstrated that deep learning-assisted prediction of BRAF mutation on dermoscopic images may be possible [
Predictive biomarkers in melanoma: detection of BRAF mutation using dermoscopy.
in: Artificial intelligence over infrared images for medical applications and medical image assisted biomarker discovery. Cham,
2022: 176-186. https://doi.org/10.1007/978-3-031-19660-7_17
]. Several studies have also shown that BRAF-V600 mutation status correlates with certain patient and melanoma characteristics: patient age, ulceration, location, stage, tumour thickness in mm and gender [
]. In recent years, epigenetic changes have also gained importance as prognostic markers in several cancer types. In particular, DNA methylation data have shown promise in this context [
Thus, the above data modalities (image, epigenetic and clinical features) have the potential to be used for automated prediction of BRAF mutation status using deep learning methods [
]. Convolutional neural networks are advanced artificial intelligence methods suitable for the analysis and classification of medical image data such as digitised histological slides [
]. In this study, we therefore used an individually trained ResNet18 to extract features from the histological slides. These features, as well as the clinical data and the methylation data, were fed into individually trained random forests (RFs). The aim was to investigate how the fusion of image, epigenetic and clinical data improves the prediction of BRAF status as a representative biomarker on internal as well as external datasets.
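As an illustration of this feature-extraction step, the following sketch maps image tiles to 512-dimensional feature vectors with a ResNet18 whose classification head has been removed. The ImageNet weights and the tile tensor are stand-ins; the actual, individually trained extractor is described in the Supplementary Methods.

```python
import torch
import torchvision

# ResNet18 backbone used as a tile-level feature extractor; ImageNet weights
# here are only a placeholder for the individually trained network.
backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()       # drop the classification layer
backbone.eval()

with torch.no_grad():
    tiles = torch.rand(8, 3, 224, 224)  # a batch of 8 RGB tiles (illustrative)
    features = backbone(tiles)          # shape: (8, 512), one feature vector per tile
print(features.shape)
```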
2. Material and methods
2.1 Datasets
For training, validation and internal/in-distribution (InD) testing, we used the publicly available Skin Cutaneous Melanoma (SKCM) dataset of The Cancer Genome Atlas (TCGA). The following data modalities were retrieved: H&E whole slide images (WSI), beta-values from 450K methylation microarrays and clinical data. The external/out-of-distribution (OOD) test dataset was provided by the University Hospital Mannheim and prepared according to the methods described below. Table 1 describes the population of these two datasets overall as well as grouped by the patients' BRAF status.
Table 1. Description of the population included in our datasets. For continuous features, we report the mean, minimum and maximum values. For categorical characteristics, we report the total number of observations in our population.
The TCGA-SKCM dataset was additionally split by hospital site into training, validation and internal test datasets. We chose 14 sources for our test dataset and three sources for the training set. The validation datasets differ between the individual models, since we used observations with missing data modalities as part of the validation datasets for the other modalities. For the validation set that requires all three data modalities, we excluded 20 observations with complete data from the training dataset. Further details of the split are shown in Supplementary Table 1 and Supplementary Table 2.
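As an aside on how a site-wise split can be implemented: TCGA barcodes encode the tissue source site in their second field, so patients can be grouped by submitting site before being assigned to a split. The barcodes and site list in the following sketch are invented for illustration and do not reproduce the split used in the study.

```python
import pandas as pd

# Toy patient table; the tissue source site is the second field of the barcode.
patients = pd.DataFrame({"barcode": ["TCGA-EE-A2GC", "TCGA-D3-A8GB", "TCGA-EE-A3AG"]})
patients["site"] = patients["barcode"].str.split("-").str[1]

test_sites = {"EE"}                          # invented site assignment
is_test = patients["site"].isin(test_sites)
train_val_patients = patients[~is_test]      # candidates for training/validation
test_patients = patients[is_test]            # internal (InD) test patients
print(test_patients)
```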
The ground truth label (BRAF-V600 mutation) was extracted from the genome sequencing files for the TCGA cohort and was extracted from the patient files for the Mannheim cohort, where it had been determined by Sanger sequencing.
2.2 Prediction models
To predict the BRAF status based on the different data modalities, the random forest (RF), introduced by Breiman [
], was selected as the prediction model for all three data entities, yielding a patient score for each data modality individually. The hyperparameters of all RFs were optimised with Optuna using Bayesian optimisation [
]. An RF is an ensemble of decision trees, each of which predicts the class to which an observation belongs. The score for an observation was calculated as the fraction of trees within the RF that predicted a particular class.
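A minimal sketch of this setup, assuming scikit-learn's RandomForestClassifier and Optuna with its default TPE (Bayesian) sampler; the synthetic data, variable names and search ranges are illustrative and not the configuration used in the study (the tuned values are given in the supplement).

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one data modality: features X and BRAF-V600 labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 2, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 20),
        "max_features": trial.suggest_float("max_features", 0.1, 1.0),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    rf = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    # predict_proba approximates the per-observation score described above
    # (scikit-learn averages each tree's leaf class frequencies rather than
    # counting hard votes, but the idea is the same).
    scores = rf.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, scores)

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)
```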
2.3 Image classifier
For the image-based classification, we used a pre-trained feature extractor model to aggregate tiles extracted from the WSIs into feature vectors, as described in the Supplementary Methods. Five-fold cross-validation was used to optimise the hyperparameters of our RF; the optimal hyperparameters can be found in Supplementary Table 3. A class score was predicted for each tile. Since an observation, in the case of WSIs, comprises a large number of tiles, the tile scores were averaged into one slide score per observation. This slide score was used to evaluate the performance of the image-based classifier and served as the image-based input to the combined classifiers.
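The tile-to-slide aggregation can be sketched as follows; the feature matrix, slide assignment and labels are synthetic stand-ins for the extracted WSI tiles, and each tile is simply assumed to inherit its slide's BRAF label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: tile_features would come from the pre-trained extractor,
# slide_ids maps each tile to its WSI, and tiles inherit the slide-level label.
rng = np.random.default_rng(0)
tile_features = rng.normal(size=(1000, 512))   # e.g. one 512-d vector per tile
slide_ids = rng.integers(0, 40, size=1000)     # 40 hypothetical slides
tile_labels = rng.integers(0, 2, size=1000)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(tile_features, tile_labels)

# One BRAF score per tile ...
tile_scores = rf.predict_proba(tile_features)[:, 1]

# ... averaged into a single slide score per observation.
slide_scores = (
    pd.DataFrame({"slide": slide_ids, "score": tile_scores})
    .groupby("slide")["score"]
    .mean()
)
print(slide_scores.head())
```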
2.4 Clinical classifier
The classifier based on clinical data was trained with an RF in a hierarchical way. Since clinical data are readily interpretable and causal dependencies can easily be implied or excluded, we decided to combine the optimisation with variable selection. The RF was therefore first tuned on the following clinical data: patient age at diagnosis, tumour stage at diagnosis, Breslow depth of the primary lesion, the location of the primary lesion (extremities, trunk, and head and neck) and the location of the extracted lesion (primary tumour, skin metastasis and lymph node metastasis). Afterwards, the permutation variable importance (VIMP) introduced by Breiman [
] was calculated for the prediction. All variables that did not positively affect predictive performance were excluded. The retained features were the patient's age at diagnosis, the location of the primary lesion and the location of the extracted lesion. A new, independent RF was developed on these clinical data alone.
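A sketch of this two-step procedure is given below; scikit-learn's permutation_importance stands in for Breiman's permutation VIMP, and the clinical table is invented (with categorical variables already encoded numerically).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Invented clinical table; the real variables are listed in the text above.
rng = np.random.default_rng(0)
clin = pd.DataFrame({
    "age_at_diagnosis": rng.normal(60, 15, 200),
    "stage": rng.integers(0, 4, 200),
    "breslow_depth_mm": rng.gamma(2.0, 1.5, 200),
    "primary_site": rng.integers(0, 3, 200),    # extremities / trunk / head & neck (encoded)
    "extracted_site": rng.integers(0, 3, 200),  # primary / skin met / lymph-node met (encoded)
})
y = rng.integers(0, 2, 200)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(clin, y)

# Permutation variable importance; variables with non-positive importance are dropped.
vimp = permutation_importance(rf, clin, y, n_repeats=20, random_state=0)
keep = clin.columns[vimp.importances_mean > 0]

# A new, independent RF is then fitted on the retained clinical variables only.
rf_selected = RandomForestClassifier(n_estimators=500, random_state=0).fit(clin[keep], y)
print(list(keep))
```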
2.5 Methylation classifier
For the methylation classifier, different RF-based approaches were compared. RF performance was evaluated on the full methylation data, as well as on the methylation data after dimension reduction using principal component analysis (PCA) [
K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” Lond Edinb Dublin Philos Mag J Sci, vol. 2, no. 11, pp. 559–572, doi: 10.1080/14786440109462720.
]. In addition, we evaluated an approach in which we first trained an RF with default hyperparameters and 90% of the features available at each split, in order to calculate the VIMP for all features individually and to ensure that each tree could access the important features. Afterwards, we optimised the hyperparameters of an RF based only on the methylation positions that were important for the BRAF prediction. On the validation set, the RF based on the full methylation data performed best, so our final methylation model is based on the full methylation arrays.
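The comparison of the three methylation variants can be sketched as follows; the beta-value matrix is synthetic and far smaller than a 450K array, the component count and probe cutoff are arbitrary, and impurity-based importances stand in for the permutation VIMP.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic beta-value matrix (observations x CpG probes).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(250, 2000))
y = rng.integers(0, 2, size=250)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

def val_auc(model, X_va):
    return roc_auc_score(y_val, model.predict_proba(X_va)[:, 1])

# Variant 1: RF on the full methylation data (the variant chosen as final model).
rf_full = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Variant 2: PCA dimension reduction before the RF.
rf_pca = make_pipeline(
    PCA(n_components=50),
    RandomForestClassifier(n_estimators=500, random_state=0),
).fit(X_tr, y_tr)

# Variant 3: importance-based pre-selection. A first RF with 90% of the features
# available per split ranks the probes; a second RF uses only the top probes.
pre = RandomForestClassifier(max_features=0.9, random_state=0).fit(X_tr, y_tr)
top = np.argsort(pre.feature_importances_)[-200:]   # cutoff is illustrative
rf_sel = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr[:, top], y_tr)

print("full:", round(val_auc(rf_full, X_val), 3))
print("pca :", round(val_auc(rf_pca, X_val), 3))
print("vimp:", round(val_auc(rf_sel, X_val[:, top]), 3))
```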
2.6 Multimodal fusion
For the multimodal approaches, we needed a strategy to combine the individual BRAF scores into one combined score based on more than one data modality (see Fig. 1). We decided to use two different aggregation methods. On the one hand, we investigated a simple fusion method and calculated a weighted average of the scores, with the weights optimised on a disjunct validation dataset. Since this fusion strategy constitutes a convex combination, we call it convex combination fusion (CCF). On the other hand, we tried a more complex approach in which the different scores served as input for a logistic regression that outputs a combined score. This approach is called logistic regression fusion (LRF). The weights of the different combined models are shown in Supplementary Table 5.
Fig. 1. Schematic diagram of the multimodal classifier. A separate random forest is trained for each modality. All three individual scores were aggregated into one combined score. Convex combination fusion (CCF) and logistic regression fusion (LRF) were used to construct different multimodal classifiers.
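A compact sketch of the two fusion strategies; the validation scores are synthetic, and the coarse grid search over the weight simplex only stands in for the (unspecified) weight optimisation used in the study.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic per-patient scores from the three unimodal RFs on the disjunct
# validation set (columns: image, clinical, methylation) plus BRAF labels.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=100)
scores_val = np.clip(
    0.5 + 0.2 * (y_val[:, None] - 0.5) + rng.normal(0, 0.15, size=(100, 3)), 0, 1
)

def fit_ccf(S, y, step=0.05):
    """CCF: grid-search non-negative weights summing to 1 that maximise AUROC."""
    best_auc, best_w = -np.inf, None
    grid = np.arange(0.0, 1.0 + step, step)
    for w1, w2 in product(grid, grid):
        if w1 + w2 > 1.0 + 1e-9:
            continue
        w = np.array([w1, w2, 1.0 - w1 - w2])
        auc = roc_auc_score(y, S @ w)
        if auc > best_auc:
            best_auc, best_w = auc, w
    return best_w

w_ccf = fit_ccf(scores_val, y_val)                   # CCF: convex weights
lrf = LogisticRegression().fit(scores_val, y_val)    # LRF: logistic regression on scores

# At test time, the fused scores would be
#   scores_test @ w_ccf                     (CCF)
#   lrf.predict_proba(scores_test)[:, 1]    (LRF)
print(w_ccf)
```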
3. Results
The image, clinical and methylation data models and the model combinations were tested on the InD test set and on the OOD test set. Areas under the receiver operating characteristic curve (AUROCs) and bootstrap confidence intervals (CIs) for all models are shown in Table 2. Note that the 95% CIs can be wide and, for some of the single-modality models, include the AUROC value for random guessing (0.5). ROC plots for the models based on a single data modality, as well as for both fusion methods combining all three data modalities, are shown in Fig. 2 for both datasets separately.
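For reference, a percentile-bootstrap CI for the AUROC can be computed as in the following sketch; the exact bootstrap variant and number of resamples are assumptions rather than the study's settings, and the labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUROC (resampling patients with replacement)."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUROC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, scores), (lo, hi)

# Usage with synthetic labels and scores:
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 120)
s = np.clip(0.5 + 0.2 * (y - 0.5) + rng.normal(0, 0.2, 120), 0, 1)
print(bootstrap_auroc_ci(y, s))
```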
Table 2. Results of all evaluated classifiers. The table contains the AUROC values, with 95% confidence intervals estimated with bootstrap, on the internal and external test sets. For the fused models we used two fusion strategies: the convex combination of the scores (CCF) and the logistic regression with the individual scores as input (LRF).
Fig. 2. ROC plots for both datasets for the individual models based on one data modality as well as for both fusion models based on all three data modalities. The dotted line is the main diagonal, which indicates whether or not a classifier is better than random guessing. ROC, receiver operating characteristic curve.
3.1 Classifiers based on a single data modality
The classifiers based on a single data modality mostly reached AUROC values of about 0.65 for both datasets and all modalities (see Table 2). The exception is the methylation classifier on the TCGA test set, which reached an AUROC value of 0.82. On OOD, the methylation-data-based classifier also reached an AUROC value of around 0.65. On OOD, all AUROCs differed significantly from random guessing, since 0.5 is not contained in any CI. However, the image- and clinical-data-based classifiers did not differ significantly from 0.5 on InD. All ROCs for the unimodal approaches are shown in Fig. 2 together with the ROCs for the models based on all data modalities.
3.2 Fusion models with two data modalities
Besides the models that use a single data modality, we developed models with multiple data modalities for all data modality combinations. We fused the different data modalities either by calculating a convex combination of the individual scores or by using a logistic regression with the scores from the individual unimodal models as input. These fusion models show that the performance of the bimodal models improves strongly on OOD and moderately on InD compared with the unimodal models. Using only two data modalities has different effects on OOD and InD. Combining clinical and image data leads to an increase in performance on OOD, but InD benefits very little from this integration. Combining methylation and image data has the opposite effect: this approach leads to the best performance on InD, but OOD benefits very little. Combining clinical and methylation data leads to an improvement on OOD, but to a drop in performance on InD compared with the methylation classifier.
3.3 Multimodal fusion based on all three data modalities
Combining all three data modalities leads to results similar to the combination of clinical and methylation data only, but the absolute numbers are slightly better and the CIs are slightly narrower than for the model without the image data, so including image data still improves performance and robustness. The ROCs of this integration, for CCF as well as LRF, are shown in Fig. 2 together with the ROCs of the unimodal models. This plot shows that, at least on OOD, the multimodal approach using all three data entities surpasses all unimodal approaches. On InD, the multimodal fusion approach shows performance similar to the methylation classifier alone. However, as opposed to all three unimodal ROCs, the ROC of the fusion model never drops below the diagonal dotted line. Thus, independently of the chosen threshold, the balanced accuracy of the multimodal classifier never drops below 0.5, in contrast to the methylation classifier. This suggests higher robustness in prediction as well.
4. Discussion
In this work, we were able to predict BRAF-V600 mutations in malignant melanoma more accurately and more stably across several datasets with a multimodal approach than with unimodal models.
Whereas automated tumour tissue recognition based on H&E slides alone has been established very successfully in many studies, it is still not possible to predict all biomarkers at the H&E level alone with high accuracy [
]; other biomarkers, such as BRAF status, pose much more of a challenge. Indeed, previous studies have only reported moderate performance for this task [
], suggesting that image data may contain very little information about specific mutations. As the InD training set contained images from different clinics than the InD test set, the images of the InD test set can show unseen stainings and a distribution shift relative to the training set. This makes it plausible that the image classifier performs similarly on InD and OOD.
After feature selection, the clinical classifier uses only three features, which suggests that the information about BRAF mutation status is only partially represented in the available clinical data. Since clinical data are not subject to technical domain shifts, clinical data classifiers are usually robust and generalise well to data from other sources, as also observed in our study. The clinical features we identified as important for predicting BRAF mutation status are in line with results from previous work [
]. The finding that our classifier performs better on the unselected methylation data than on data aggregated by PCA or PLS fits with related work that showed no clear association of BRAF status with methylation features after dimension reduction [
“Development and validation of a novel DNA methylation-driven gene based molecular classification and predictive model for overall survival and immunotherapy response in patients with glioblastoma: a multiomic analysis,” front.
]. Some of the methylation positions that were important for our prediction (Supplementary Table 4) are known to be affected by BRAF-V600 mutations. Interestingly, some BRAF mutations not located at the V600 position (Supplementary Table 6) were also classified as BRAF-V600-positive (13/55, versus 24/203 classified as negative) by the methylation classifier. Presumably, the methylation classifier makes a BRAF-V600-positive decision if the mutation causes a similar phenotype and methylation profile. Here, the classifier offers the possibility to identify genetic alterations in the BRAF gene that mimic a BRAF-V600 phenotype, which could be further improved by multimodal integration. The methylation classifier performs well on InD, but its performance drops on OOD.
The results differ between cohorts with respect to the different data modalities and combinations. The reasons for this phenomenon are probably multifaceted. First, both cohorts are relatively small, so we cannot exclude bias related to sample size. In addition, differences between data provided by participating clinics and laboratories may exist due to differences in tissue collection and H&E staining. Finally, the limited generalisability of the methylation classifier can be explained by differences in sample age, DNA extraction method, data normalisation and chip arrays. Slight differences between these chip arrays were also found in other studies [
]. Because differences between cohorts or data collection can never be completely ruled out, the finding that a combination of multiple data entities yields the most robust results may be of great clinical importance.
We used CCF and LRF for the fusion of all data modalities. Overall, both methods perform similarly, although each shows advantages for different modality combinations. In our study, the CCF tends to be more precise than the LRF, as indicated by its slightly narrower CIs. The higher complexity of the LRF compared with the CCF opens the possibility of modelling more complex relationships, but this possibility seems to be untapped, as the LRF does not outperform the CCF. Since the LRF contains more parameters than the CCF, this likely leads to higher variance in the predictions and thus wider CIs.
For all fusion approaches using the convex combination, the weight of the methylation data (if included) is the highest (Supplementary Table 5). Thus, the methylation score contributes the most to the fused score. This observation is plausible, since the methylation classifier predicts based on ∼380k input values, of which a high proportion of probes has no causal relationship to the BRAF mutation. Thus, many of the trees in the RF are forced to predict based on probes that are unrelated to BRAF mutations, which leads to scores close to the decision boundary of 0.5 with less variance than the other two scores. However, a weight of 0.97 is not equivalent to the statement that the methylation score contributes 97% to the final decision, since the methylation score only takes values close to the decision boundary. Thus, the methylation score shifts the final score less than scores close to 0.0 or 1.0, such as those of the other two data modalities.
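The following toy calculation illustrates this point; the individual scores are invented, and only the roughly 0.97 methylation weight follows Supplementary Table 5, with the remaining weight split arbitrarily between the image and clinical scores.

```python
import numpy as np

# Hypothetical fused scores for two cases: the methylation score stays close to
# 0.5, while the image and clinical scores are more extreme.
w = np.array([0.015, 0.015, 0.97])      # weights: image, clinical, methylation
pos = np.array([0.90, 0.90, 0.555])     # a BRAF-positive-looking case
neg = np.array([0.10, 0.10, 0.545])     # a BRAF-negative-looking case

print(pos @ w, neg @ w)                 # approx. 0.565 vs 0.532
# Of the ~0.034 gap between the fused scores, the heavily weighted methylation
# term contributes only 0.97 * 0.01 ≈ 0.010, whereas the lightly weighted but
# extreme image and clinical scores contribute 2 * 0.015 * 0.80 = 0.024.
```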
However, our work shows that multimodal data integration may be a useful strategy to generate biomarkers and improve routine cancer diagnostics. Using a multimodal classifier based on several high accuracy methods, such as Sanger sequencing and immunohistochemistry, could lead to an improvement in therapy selection in clinical practice.
Overall, the major limitation of our study is the limited amount of available data. Nevertheless, our results indicate that the integration of multiple data modalities may lead to better performance and, in particular, better generalisability than unimodal DL models. The results also show that none of the investigated data modalities can be substituted by another, since every fusion improved the external performance compared with the corresponding unimodal classifiers.
5. Conclusions
Using a combination of different data modalities, we were able to predict BRAF-V600 mutation status better than with single data modalities on both cohorts. The integration of different data modalities appears to lead to better performance as well as better generalisability. The more data modalities are combined, the more performance and generalisability improve, since different data modalities introduce different weaknesses and errors. This is not unexpected, since ambiguous results for one data modality can be compensated by the results of the other data modalities. As the main limitation of our study was the small number of available cohorts and the limited amount of data, our findings will have to be confirmed in larger studies.
Ethics approval
Ethics Committee II of Heidelberg University, approvals 2010-318M-MA and 2014-835R-MA.
Consent for publication
Not applicable.
Data availability
Mannheim cohort data can be provided upon request.
Conflict of interest statement
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: TJB would like to disclose that he is the owner of Smart Health Heidelberg GmbH (Handschuhsheimer Landstr. 9/1, 69120 Heidelberg, Germany, https://smarthealth.de), which develops teledermatology mobile apps such as AppDoc (https://online-hautarzt.net) and Intimarzt (https://intimarzt.de), operated by dermatologists, outside of the submitted work. No other potential conflicts of interest are reported by any of the authors.
Funding
This study was funded by the Federal Ministry of Health, Berlin, Germany (grant: Tumorverhalten-Prädiktions-Initiative: Smarte Daten für die patientenzentrierte Präzisionsonkologie bei Melanom, Brust- & Prostatakrebs; funding code: 2519DAT712) and by the Ministry of Social Affairs, Health and Integration of the Federal State Baden-Württemberg, Germany (grant: AI-Translation-Initiative (“KI-Translations-Initiative”); grants-holder: Titus J. Brinker, German Cancer Research Center, Heidelberg, Germany).
Author contribution statement
Lucas Schneider and Christoph Wies: Conceptualisation; Data curation; Investigation; Methodology; Formal analysis; Software; Project administration; Visualisation; Writing - Original Draft. Eva Krieghoff-Henning: Conceptualisation; Funding acquisition; Methodology; Supervision; Writing - Review & Editing. Tabea-Clara Bucher and Dirk Schadendorf: Writing - Review & Editing. Jochen Sven Utikal and Titus Josef Brinker: Conceptualisation; Funding acquisition; Methodology; Resources; Writing - Review & Editing.
Appendix A. Supplementary data
The following are the Supplementary data to this article.